How long until AI automates all cognitive labor?
How the forecasts of Daniel Kokotajlo, Dario Amodei, Ajeya Cotra, Peter Wildeford, Demis Hassabis, and other top AI researchers have changed on when AI will automate all cognitive labor
Technical posts from the FutureSearch team.
One year after AI 2027 launched, we revisit the forecasts and compare the scenario's predictions to what actually happened with Anthropic, the Pentagon, and Claude Mythos.
A tutorial on using AI to forecast Polymarket prediction market questions, and then sorting markets by expected return on investment given the forecasts.
The full Claude Code transcript from discovering and responding to the litellm 1.82.8 PyPI supply chain attack on March 24, 2026 — from mysterious process explosions to malware identification to public disclosure.
The litellm 1.82.7 and 1.82.8 supply chain attack on PyPI hit 47,000 downloads in 46 minutes. We analyzed all 2,337 dependent packages - 88% had version specs that allowed the compromised versions.
litellm version 1.82.8 on PyPI contains a malicious .pth file that harvests SSH keys, cloud credentials, and secrets on every Python startup, then attempts lateral movement across Kubernetes clusters. First reported to PyPI by FutureSearch, whose report led to the package being quarantined.
A malicious litellm release on PyPI compromised our machine through an MCP server's unpinned dependency. No prompt injection, no LLM trickery, just a poisoned package auto-downloaded by uvx.
JavaScript's Date.parse() will turn almost any string into a date. We tested V8, SpiderMonkey, and JavaScriptCore with surprising inputs and documented every quirk, from ISO 8601 UTC gotchas to why date-fns parseISO rejects what new Date() accepts.
We run an SEO pipeline that reads Google Search Console data, spawns an Opus agent per page, and proposes title and description changes. Each agent reads the history of every experiment we've run on that page. Over time, the suggestions get better.
We built a pipeline that scans 18 community sources every morning, classifies opportunities with a 13-question rubric, and drafts responses. 2-3% signal rate. The hard part isn't running it - it's knowing what to look for.
Docker-based MCP servers leave behind zombie containers because the Docker daemon keeps them alive after Claude Code exits. Switching from docker run to uvx eliminates the problem entirely.
We built a Claude Code skill that reviews our AI agent traces and catches issues we'd miss ourselves. Here's how it works, and why it only became possible now.
Here's what Claude's effort parameter actually controls. For Opus and Sonnet 4.6, high effort primarily increases reasoning depth, but also...
We run Claude Code in Kubernetes for long-running marketing CronJobs. This originally sounded like a terrible idea, but after running it for a few months, we think it's a genuinely valid engineering approach - for the right kind of work.
How we replaced Airflow and CI pipelines with Claude Code skills and subagents. Markdown files define multi-step workflows, agents execute each phase, and outputs land as plain files in GitHub.
A case study using FutureSearch to run hundreds of parallel research agents and generate forecasts with full rationales across 100 Kalshi prediction markets.
We take our AI forecaster's probability estimates, compare them to live Kalshi order books, and build a simulated portfolio to benchmark whether the forecasts are actually accurate.
Instead of dumping thousands of rows into the MCP tool response, split the audience: content for the model (text summary), structuredContent for the user (interactive widget at zero token cost), and a download URL for the sandbox.
OpenAI's Responses and Chat Completions APIs have inexplicable inconsistencies. A real-world example of Conway's Law, where org structure dictates software design.
Inlining data in MCP tool calls eats the LLM's context window. We show how to use presigned URLs so Claude can upload files directly to your server, keeping the context clean with a 36-character artifact ID.
LLM APIs look interchangeable on paper. In practice, they diverge in subtle ways that break your code. We document the provider-specific quirks we've hit while running thousands of LLM calls per day across Anthropic, Google, and OpenAI.
Before merging datasets, LLM agents should classify whether the join is one-to-one, one-to-many, or many-to-many. Getting cardinality wrong leads to duplicated rows, missing matches, and broken pipelines. Here's how to classify merge problems automatically.
We benchmarked the cost and speed of deep research across ChatGPT, Gemini, Perplexity, and Grok. See which model gives the best answers per dollar on Deep Research Bench.
Learn how to deduplicate tens of thousands of rows using LLMs at minimum cost and high accuracy.
Semantic deduplication uses AI to catch duplicates that exact matching misses. Learn how fuzzy matching detects entries like "IBM" and "International Business Machines" as the same entity across thousands of rows.
Learn how to merge tables without a common key using AI. This tutorial walks through fuzzy matching, entity resolution, and joining datasets where VLOOKUP and exact-match joins fail.
Learn how to evaluate companies by criteria like founder alignment, moat strength, and capital allocation. Score and compare stocks by any criteria to evaluate and test a custom investment thesis.
Which companies have had the most C-suite churn over the last 10 years? I researched all S&P 500 companies to find out.
Who leads the AI race in 2026? We rank OpenAI, Anthropic, Google DeepMind, Meta AI, and xAI across model quality, data, compute, talent, and R&D. Predictions of Anthropic's rise ahead of its March 2026 surge in revenue.
We calculate intrinsic value the way Finance 101 teaches: forecasting revenue, margins, and shareholder payouts over the life of every company. By projecting actual cash flows 10+ years out with probabilistic forecasts, we can sort all stocks by discount to fair value without anchoring on market prices.
Six months after the AI 2027 report predicted a fast AGI timeline, we revisit the forecasts alongside Karpathy's critiques. How have the original predictions held up, and why are timelines shifting?
Stockfisher enables value investors to screen the entire market for the highest long-term returns based on detailed 10-year cash flow forecasts. For the first time, compare 3,000+ companies apples-to-apples with rigorous fundamental analysis at quantitative scale.
Practical guide to building LLM-powered web research agents — covering search strategies, source evaluation, and synthesis.
A critical look at the AI 2027 report's claims about superhuman coding — why the timeline is likely too aggressive.
Even American-made products rely on imported components, steel, and aluminum. Our analysis breaks down how tariffs increase the real cost of a Ford F-150 by $2,600-$3,800 and a Tesla Model 3 by $1,900-$2,400, with data on every major input.
OpenAI declined Apple's offer to power Siri with ChatGPT. Learn why the partnership failed and what it means for Apple Intelligence and AI integration in iOS.
Developers of agents must reckon with two types of failures: giving up too early (lack of persistence) and repeating failed approaches (lack of adaptation). Analysis of OpenAI and Perplexity's Deep Research products helps those building or working with agents understand how to balance these tradeoffs.
OpenAI's Deep Research tool was initially impressive when released in February 2025, but it actually underperformed the later release of ChatGPT-o3+search. Careful analysis of 6 strange failures shows the subtle unreliability of "Deep Research" style products.
The inside story of how prediction markets were built, killed, and revived at Google — published in Asterisk Magazine.
Video presentation on integrating AI into forecasting workflows — covering practical approaches and lessons learned.
A recap of the forecasting tournament pitting human forecasters against AI bots on Manifold Markets.