Blog

Technical posts from the FutureSearch team.

LiteLLM made us accidentally make our product free for a week

June 10, 2026·
Jack Wildman

In LiteLLM v1.83.10, the allow_client_tags security feature started silently stripping caller-supplied request_tags unless an admin opts the key or team in. The request still returns 200, with no error and no caller-side signal, so the tags our billing pipeline depends on vanished from LiteLLM_SpendLogs. For six and a half days every paid task and conversation deducted $0. Here is how the tag stripping happened, why our spend alerts missed it, and what we changed.

DeepSeek V4 Pro vs GPT-5.5: What the Benchmarks and Forecasts Say

June 9, 2026·
Dan Schwarz

DeepSeek V4 Pro beats GPT-5.5 Pro on benchmarks at a fifth of the cost, but three forecasts put its enterprise API share near 4% by end of 2026, with OpenAI holding $180 pricing and the coding lead.

GitHub Copilot Per-Token Pricing: The Median Seat Still Costs $19

June 9, 2026·
Dan Schwarz

GitHub Copilot's move to per-token billing sparked 'Tokenpocalypse' warnings of a 14x cost spike. A forecast puts the effective cost per seat at ~$19/month, because typical usage fits inside the included credit allowance.

Intel Crescent Island: Why No Major Cloud Will Deploy It by 2027

June 3, 2026·
Dan Schwarz

A forecast of whether Intel's Crescent Island AI GPU, pitched as an HBM-free Nvidia and AMD rival, lands a major cloud. The answer: a 13% chance any top-four hyperscaler deploys it by end of 2027.

Microsoft's GPQA Diamond Score in 2026: Forecasting a Top Four Lab

June 3, 2026·
Dan Schwarz

Microsoft declared independence from OpenAI at Build 2026 and is chasing top-four-lab status. A forecast puts its best in-house GPQA Diamond score at a median 86.5% by year-end, just below the ~88-92% frontier band where DeepMind, OpenAI, and Anthropic sit.

Some rare examples of AIs being underconfident

May 6, 2026·
Dan Schwarz

AI overconfidence gets most of the attention, but expert auditors found cases where an Opus 4.6 forecasting agent derived the right answer, wrote out the math, cited the correct precedent, and then assigned a probability inconsistent with its own analysis. Four case studies from BTF-2.

History doesn't repeat itself as often as LLMs think

May 5, 2026·
Dan Schwarz

Frontier AI forecasting agents are good at finding historical patterns. They are noticeably worse at noticing when the conditions that produced those patterns have changed. Three case studies from the BTF-2 benchmark show the same failure: agents extrapolate base rates without checking whether the generating mechanism is still active.

Agents sometimes catastrophize

May 4, 2026·
Dan Schwarz

AI forecasting agents tend to model the most extreme version of an outcome, correctly explain why it's unlikely, and then assign that low probability to the entire question. We identify this catastrophizing pattern across three geopolitical case studies from the BTF-2 benchmark.

Run agents twice for fun and profit

May 1, 2026·
Dan Schwarz

Running the same forecasting agent more than once and averaging beats any single run. Ensembling across two Opus 4.6 runs and other frontier models cuts Brier score on 1,367 BTF-2 benchmark questions, and a worked example shows how a second run finds context the first missed.

AGI Timeline Predictions: How Top Forecasters Updated, 2023 to 2026

April 12, 2026·
Dan Schwarz

How the forecasts of Daniel Kokotajlo, Dario Amodei, Ajeya Cotra, Peter Wildeford, Demis Hassabis, and other top AI researchers have changed on when AI will automate all cognitive labor.

AI 2027 Update: A One Year Timeline Check

April 8, 2026·
Dan Schwarz

One year after AI 2027 launched, we revisit the forecasts and compare the scenario's predictions to what actually happened with Anthropic, the Pentagon, and Claude Mythos.

Forecasting Polymarket Questions with AI

April 1, 2026·
Dan Schwarz

A hands-on workflow for using AI to research and forecast Polymarket prediction markets. Pick 100 questions, forecast them with research agents, pull live order book data, and rank markets by annualized ROI to find an edge.

Catching the LiteLLM PyPI Attack: The Full Claude Code Transcript

March 25, 2026·
Callum McMahon

The full Claude Code transcript from discovering and responding to the litellm 1.82.8 PyPI supply chain attack on March 24, 2026 — from mysterious process explosions to malware identification to public disclosure.

LiteLLM Hack: Were You One of the 47,000?

March 25, 2026·
Daniel Hnyk

The litellm 1.82.7 and 1.82.8 supply chain attack on PyPI hit 47,000 downloads in 46 minutes. We analyzed all 2,337 dependent packages - 88% had version specs that allowed the compromised versions.

litellm 1.82.8 Supply Chain Attack on PyPI (March 2026)

March 24, 2026·
Callum McMahon

litellm 1.82.7 and 1.82.8 on PyPI were compromised in March 2026 with a malicious .pth file that steals SSH keys, cloud credentials, and secrets, then spreads across Kubernetes clusters. FutureSearch first reported it to PyPI. Learn which versions to avoid, how to check if you are affected, and what to do next.

How a Poisoned litellm Package Compromised an MCP Server in Cursor

March 24, 2026·
Callum McMahon

A malicious litellm release on PyPI compromised our machine through an MCP server's unpinned dependency. No prompt injection, no LLM trickery, just a poisoned package auto-downloaded by uvx.

Why new Date() Parses Almost Any String: V8 and the Implementation Defined Trap

March 18, 2026·
Robert Gambee

JavaScript's Date.parse() will turn almost any string into a date. We tested V8, SpiderMonkey, and JavaScriptCore with surprising inputs and documented every quirk, from ISO 8601 UTC gotchas to why date-fns parseISO rejects what new Date() accepts.

The Self-Optimizing SEO Pipeline: Claude Code Agents on Google Search Console Data

March 18, 2026·
Daniel Hnyk

We run an SEO pipeline that reads Google Search Console data, spawns an Opus agent per page, and proposes title and description changes. Each agent reads the history of every experiment we've run on that page. Over time, the suggestions get better.

How We Built a Marketing Pipeline with Claude Code

March 11, 2026·
Daniel Hnyk

A Claude Code pipeline that scans 18 communities every morning (Reddit, HubSpot, Salesforce, StackOverflow), classifies opportunities with a 13-question rubric, and drafts responses. 2-3% signal rate. The hard part isn't running it. It's knowing what to look for.

How to Stop MCP Servers From Leaving Orphaned Docker Containers

March 6, 2026·
Robert Gambee

Docker-based MCP servers leave behind zombie containers because the Docker daemon keeps them alive after Claude Code exits. Switching from docker run to uvx eliminates the problem entirely.

How to Debug AI Agents by Analyzing Their Own Traces with LLMs

March 3, 2026·
Peter Mühlbacher

A practical guide to LLM trace analysis: how we use a Claude Code skill to debug AI agents by reviewing their own traces, catching scaffolding bugs, tool failures, and reasoning errors that manual skimming misses.

Caution: Read the Docs for Claude 4.6's Effort Parameter

March 2, 2026·
Peter Mühlbacher

What Claude 4.6's effort parameter actually controls. For Opus and Sonnet 4.6, effort sets reasoning depth and also how thoroughly the model acts: how many tool calls it makes, how much it cross references, and whether it follows your system prompt. Why effort=low can quietly ignore instructions, and when to bump it to medium.

How to Run Claude Code as a Kubernetes CronJob

February 26, 2026·
Daniel Hnyk

A practical guide to running Claude Code as a Kubernetes CronJob. The Dockerfile (Python plus Node), the entrypoint and claude command flags, jq log filtering, timeout safety nets, and the CronJob manifest, with the gotchas we hit running it in production for months.

What Is a Claude Code Workflow? Running Pipelines as Markdown

February 26, 2026·
Daniel Hnyk

How we replaced Airflow and CI pipelines with Claude Code skills and subagents. Markdown files define multi-step workflows, agents execute each phase, and outputs land as plain files in GitHub.

Unleashing AI forecasters on Kalshi prediction markets

February 26, 2026·
Tom Liptay

How we used AI to forecast Kalshi prediction markets, scanning all open events, screening for adverse selection, then running six research agents and three forecasting models per market with full rationales for every position.

Can AI Beat Kalshi? Simulating a Prediction Market Portfolio

February 26, 2026·
Tom Liptay

We take our AI forecaster's probability estimates, compare them to live Kalshi order books, and build a simulated portfolio to benchmark whether the forecasts are actually accurate.

MCP structuredContent: How to Return Large Results Without Flooding the Context Window

February 26, 2026·
Rafael Poyiadzi

Instead of dumping thousands of rows into the MCP tool response, split the audience: content for the model (text summary), structuredContent for the user (interactive widget at zero token cost), and a download URL for the sandbox.

OpenAI Responses vs Chat Completions API: Why Structured Outputs Differ

February 26, 2026·
Robert Gambee

OpenAI's Responses and Chat Completions APIs have inexplicable inconsistencies. A real-world example of Conway's Law, where org structure dictates software design.

How to Upload Large Files to an MCP Server Without Filling the Context Window

February 25, 2026·
Rafael Poyiadzi

How to let Claude upload large CSVs and files to your MCP server without eating the context window. A step by step guide to presigned URLs, the code execution sandbox, and the artifact ID pattern, covering Claude Code, Claude.ai, and Claude Desktop.

LLM API Differences That Break Your Code: Anthropic vs OpenAI vs Google

February 24, 2026·
Robert Gambee

Anthropic, OpenAI, and Google LLM APIs look interchangeable on paper, but the differences break your code: JSON Schema rejections, structured output mismatches, forced tool calls, and prompt caching gotchas. The concrete quirks we hit running tens of thousands of LLM calls per day across all three providers.

Ask LLM Agents to Classify Problems Before Starting

February 17, 2026·
Christoph Sträter

Before merging datasets, LLM agents should classify whether the join is one-to-one, one-to-many, or many-to-many. Getting cardinality wrong leads to duplicated rows, missing matches, and broken pipelines. Here's how to classify merge problems automatically.

How Much Does Deep Research Cost? A Model-by-Model Breakdown

February 12, 2026·
Peter Mühlbacher

We benchmarked the cost and speed of deep research agents on Deep Research Bench. Claude 4.6 Opus and Gemini 3 Flash lead, with frontier models from $0.05 to $0.55 per task. See the best accuracy per dollar.

Using LLMs for Data Cleaning At Scale

February 6, 2026·
Rafael Poyiadzi

How we deduplicated 20,000 rows of companies, contacts, and products with LLMs at 0.996 F1, about $1.12 per 1,000 rows, and conservative matching that never wrongly merges distinct records.

How AI Finds Fuzzy Duplicates in Large Datasets

January 19, 2026·
Nikos Bosse

Semantic deduplication uses AI to catch duplicates that exact matching misses. Learn how fuzzy matching detects entries like "IBM" and "International Business Machines" as the same entity across thousands of rows.

How LLM Agents Solve the Table Merging Problem

January 16, 2026·
Christoph Sträter

Learn how to merge tables without a common key using AI. This tutorial walks through fuzzy matching, entity resolution, and joining datasets where VLOOKUP and exact-match joins fail.

Do Founder-Led Companies Outperform? The S&P 500 Returned 118% vs 59%

January 15, 2026·
Tom Liptay

Learn how to evaluate companies by criteria like founder alignment, moat strength, and capital allocation. Score and compare stocks on any criteria to test a custom investment thesis.

How to Rank S&P 500 Companies by Risk of Management Turnover

January 14, 2026·
Robert Gambee

Which companies have had the most C-suite churn over the last 10 years? I researched all S&P 500 companies to find out.

Top Frontier AI Labs and Models in 2026: Who Is Leading the AI Race

January 8, 2026·
Dan Schwarz

Who leads the AI race in 2026? We rank OpenAI, Anthropic, Google DeepMind, Meta AI, and xAI across model quality, data, compute, talent, and R&D. Includes predictions of Anthropic's rise, made ahead of its March 2026 surge in revenue.

AI 2027 Six Months Later: Karpathy, Kokotajlo, and Shifting AGI Timelines

October 19, 2025·
Dan Schwarz

Six months after the AI 2027 report predicted a fast AGI timeline, we revisit the forecasts alongside Karpathy's critiques. How have the original predictions held up, and why are timelines shifting?

A Guide for LLM Assisted Web Research

June 26, 2025·
Dan Schwarz

A practical guide to building LLM-powered web research agents, covering search strategies, source evaluation, and synthesis of findings.

Superhuman Coders in AI 2027 - Not So Fast

May 1, 2025·
Dan Schwarz, Tom Liptay

A critical look at the AI 2027 report's claims about superhuman coding — why the timeline is likely too aggressive.

How Tariffs Will Increase Prices on American-Made Products: Cost Impact Analysis

April 18, 2025·
Tom Liptay, Dan Schwarz

Even American-made products rely on imported components, steel, and aluminum. Our analysis breaks down how tariffs increase the real cost of a Ford F-150 by $2,600-$3,800 and a Tesla Model 3 by $1,900-$2,400, with data on every major input.

Apple's Plan to Power Siri with ChatGPT was a Predictable Failure

March 10, 2025·
Dan Schwarz, Lawrence Phillips

The Apple and OpenAI Siri partnership has frayed, with OpenAI weighing legal action and Apple now eyeing Google Gemini. See how FutureSearch predicted this breakdown back in June 2024, before the WWDC keynote.

How Deep Research Agents Fail: Lessons from OpenAI, Gemini, and Perplexity

February 28, 2025·
Dan Schwarz

Worked examples of how OpenAI, Gemini, and Perplexity Deep Research handle hard web research tasks, drawn from FutureSearch's Deep Research Bench evals. See where agents give up too early (lack of persistence) and where they repeat the same failed search instead of adapting, with concrete CyberSeek and UK excess deaths cases.

OpenAI Deep Research: Honest Analysis and Real Limitations in 2025

February 19, 2025·
Dan Schwarz

OpenAI launched Deep Research on February 3, 2025. FutureSearch tested the new tool on six research tasks with known answers and found a jagged frontier of performance: confident wrong answers, odd source choices, and incomplete research that looks comprehensive. Here is what it gets right and where it fails.

The Death and Life of Prediction Markets at Google

November 11, 2024·
Dan Schwarz

The inside story of how prediction markets were built, killed, and revived at Google — published in Asterisk Magazine.

How to Integrate AI Into Forecasting

June 9, 2024·
Dan Schwarz, Lawrence Phillips

A video presentation from Dan Schwarz and Lawrence Phillips on integrating AI into forecasting workflows, with practical approaches and lessons learned.

The Human v Bots Forecasting Tournament

January 8, 2024·
Dan Schwarz

A recap of the forecasting tournament pitting human forecasters against AI bots on Manifold Markets.