Unleashing AI forecasters on Kalshi prediction markets

The Opportunity

Prediction markets let you trade on the outcome of real-world events — elections, economic data, policy decisions. Prices reflect the crowd's probability estimate: a contract trading at 40 cents implies the market believes there's a 40% chance of YES.

The math is simple: if you can estimate probabilities more accurately than the market, you make money in expectation. Buy YES when you think the true probability is higher than the price. Buy NO when you think it's lower. Over enough trades, better accuracy wins.

The hard part isn't the math — it's the research and the judgment. Kalshi has thousands of open markets at any given time. Finding the ones where the market is wrong requires doing deep research on each question: What's the current state of the world? What are the base rates? What do experts think? What are the strongest arguments on both sides? And then you have to synthesize all of that into an accurate probability estimate — weighing conflicting evidence, calibrating your confidence, and resisting the pull of cognitive biases. No human can do this systematically across hundreds of markets. It's tedious, time-consuming, and the kind of work where human biases — confirmation bias, anchoring, availability bias — creep in and degrade both the research and the judgment.

This is a job ideally suited for AI.

We built a notebook that scans the highest-volume Kalshi markets, screens out markets where we don't expect to add value and worry about being at an information disadvantage (insider information, sports, crypto), uses AI to do in-depth research and best-practice forecasting for each remaining market, and highlights the markets where the AI's probability estimate disagrees most with the market price. The full research — both sides of every argument — is included for every question.

We think this is useful for anyone trying to forecast underlying probabilities on prediction markets, whether you use the AI's estimates directly or just as a starting point for your own analysis. (This is less relevant for momentum traders, technical traders, or market makers — it's a fundamentals-first tool.)

Our broader aspiration is to build a forecaster that's useful beyond prediction markets — one that helps people reason about complex questions in politics, geopolitics, economics, and policy. Prediction markets give us a hard benchmark to measure against: if the AI can't outperform the crowd's probability estimates, we need to question whether it's adding accuracy at all. If it can, that's a strong signal the same research and reasoning methodology is worth applying to questions that don't have a market price attached.

Disclaimer: This is not investment advice. The forecasts and research presented here are for informational and educational purposes only.

A key advantage of an AI researcher is that the research is 100% transparent and totally neutral. Unlike human analysts who might omit information — intentionally or through bias — the AI researches both sides with equal rigor. You can read every piece of research and every rationale, and decide for yourself whether you agree.

Metric	Value
Markets scanned	~3,500 (all open Kalshi events)
Markets forecasted	Top events by volume, after screening
Screening steps	2 (insider info/moral hazard, then methodology fit)
Research agents per question	6 (current state, base rates, key factors, expert opinions, YES thesis, NO thesis)
Forecasting models	3 (Gemini, Claude Opus, Claude Opus — second independent run)
Final forecast	Median of all 3 models
Cost per question	~$0.60 for research + forecasting

How It Works

The system runs as a Google Colab notebook in five stages:

1. Find markets

The notebook fetches all open events from Kalshi's API (~3,500 events), ranks them by trading volume, and selects the top events. For events with multiple subquestions (e.g., "Who will be the next Fed Chair?" has one subquestion per candidate), it picks the top 2 by volume. High-volume markets are more likely to have meaningful price signals and enough liquidity to actually trade.

Markets are filtered by price range (3%–97%) to exclude near-certain outcomes, by minimum volume to focus on liquid markets, and by resolution date to exclude markets resolving within 10 days — near-term markets tend to be more information-sensitive, where insiders or close observers have an edge over AI research.

2. Screen for adverse selection and methodology fit

Not all high-volume markets are good candidates for AI forecasting. Before spending money on research, two AI-powered filters remove unsuitable markets using FutureSearch's classify operation:

Insider information filter — Rejects markets where insiders likely know the outcome: pre-taped reality TV shows (Survivor, Bachelor), celebrity personal decisions ("Will X attend the Super Bowl?"), outcomes already determined but not yet public, and markets with moral hazard where a bettor could influence the outcome (stunts, self-fulfilling bets).

Methodology fit filter — Rejects markets where specialized traders have a structural edge over AI research: sports (dedicated bettors with statistical models and injury intel) and cryptocurrency (driven by technical analysis and on-chain data our research agents don't capture). We focus on elections, geopolitics, policy, economics, and similar domains where deep web research and reasoning provide a forecasting edge.

Both filters use FutureSearch's classify operation, which runs an LLM evaluation on each row and keeps only those that pass. This costs ~$0.01 per market — far cheaper than running full research on a market we shouldn't be trading.

3. Research each market

This is where FutureSearch does the heavy lifting. For each market that passes classification, six research agents run in parallel, each performing live web searches:

Current state — What's happening right now that's relevant to this question?
Base rates — How often have similar events occurred historically?
Key factors — What will most influence the outcome?
Expert and market opinions — What do other forecasters and prediction markets think?
YES investor thesis — The strongest case that the market is underpricing YES
NO investor thesis — The strongest case that the market is overpricing YES

The opposing YES/NO theses are critical. They force the system to steelman both sides before making a judgment, which helps counteract the confirmation bias that plagues human forecasters. This research structure is borrowed from the superforecasting literature — the Good Judgment Project found that systematically considering multiple angles produces more accurate predictions than intuitive judgment alone.

Each agent performs web searches to ground its analysis in current information rather than relying on training data.

4. Generate forecasts

Three independent forecasting models receive the full research dossier and each produce a probability estimate with a detailed written rationale:

Gemini (Google's model)
Claude Opus (Anthropic's model)
Claude Opus (2) (a second independent Opus run)

The final forecast is the median of all three models. Using three forecasters with a median reduces the impact of any single model having a bad take on a particular question — if two out of three agree, the outlier gets overruled. Running two independent Opus calls also lets us see how much variance there is in the model's reasoning for the same question.

Every forecast is fully transparent — a multi-paragraph rationale explaining the reasoning, the key uncertainties, and what could change the probability. You can read the research and the rationale for every single position — nothing is a black box.

5. Compare to market prices

For each market, the system compares the median AI forecast to the current Kalshi price, computes the edge (forecast minus market), and sorts by the largest disagreement — the markets where the models think the crowd is most wrong. A summary table shows all questions with each forecaster's estimate, the median, and the edge at a glance.

Recent Run: February 26, 2026

Here are the 30 markets where the AI's median forecast disagrees most with the Kalshi market price, sorted by the size of the disagreement:

Question	Kalshi %	Forecast %	Edge
How many launches will SpaceX have in February 2026? [Above 12]	6	45	+39
Will the US take control of any part of Greenland? [Before January 2027]	41	10	-31
Which companies will have a top-ranked AI model this year? [xAI]	50	78	+28
Which companies will have a top-ranked AI model this year? [OpenAI]	59	85	+26
Will Trump buy at least part of Greenland? [Before January 20, 2029]	29	8	-21
Who will run for the Democratic presidential nomination in 2028? [Gretchen Whitmer]	60	78	+18
Texas Democratic Senate nominee? [Jasmine Crockett]	30	47	+17
2026 Texas Senate matchup? [Talarico vs. Paxton]	62	45	-17
Will Trump take back the Panama Canal?	31	14	-17
Will the U.S. confirm that aliens exist before 2027?	23	8	-15
World leaders out in 2026? [Ali Khamenei]	56	42	-14
Who will leave the Trump administration in 2026? [Kristi Noem]	50	62	+12
Will marijuana be rescheduled? [Before 2027]	56	45	-11
2028 Republican nominee for President? [J.D. Vance]	45	55	+10
Florida Republican Governor nominee? [James Fishback]	16	6	-10
CPI year-over-year in May 2026? [Exactly 2.8%]	26	16	-10
How much will the US acquire Greenland for? [$0 / No Acquisition]	78	87	+9
Will the US acquire any new territory? [Before Jan 2027]	23	14	-9
When will DHS receive full-year funding? [Before Mar 20, 2026]	32	23	-9
Will Trump buy at least part of Greenland? [Before 2027]	14	5	-9
Who will be the next Prime Minister of the UK? [Rupert Lowe]	13	4	-9
Will the US take control of any part of Greenland? [Before 2027]	15	7	-8
Will Americans receive tariff stimulus checks? [Before 2027]	21	13	-8
Will Trump invoke the Insurrection Act? [Before Jan 20, 2029]	56	48	-8
California Governor winner? (Person) [Eric Swalwell]	50	42	-8
Will a court order a tariff refund? [Before 2027]	78	85	+7
2026 Texas Senate matchup? [Crockett vs. Paxton]	25	32	+7
Will Trump be impeached? [Before Jan 1, 2027]	14	7	-7
Ali Khamenei out as Supreme Leader? [Before July 1, 2026]	39	32	-7
Will Reza Pahlavi lead Iran in 2026?	18	12	-6

We think the forecaster is pretty good — the research it produces is thorough, and the rationales are well-reasoned. But we're sure it makes mistakes. We don't expect anyone to take these forecasts at face value, and we don't either. Instead, when we see something unexpected — like a big disagreement with the market — we read the full research and rationales to understand why the AI thinks what it does. More often than not, we walk away having learned something interesting about the question, even when we ultimately disagree with the forecast. The value isn't just the number — it's the structured, neutral research behind it.

The full research, rationales, and forecasts for all 153 questions from this run are available in this spreadsheet.

Example: Which companies will have a top-ranked AI model this year? [xAI] — Market 50%, Forecast 78%

To give a sense of the depth of research, here's what the system produced for one question. The AI forecasts 78% (median of Gemini at 62%, Opus at 78%, Opus 2 at 82%) versus the Kalshi market price of 50%.

Base rates research:

Frequency of Leaderboard Rank Changes: In the 2024–2025 period, the #1 spot on the Chatbot Arena leaderboard changed hands approximately 8-12 times. For example, during one 6-month stretch in 2024, the average duration for a model at #1 was only about 3 weeks. This high turnover rate suggests a high probability that any top-tier model currently in the top 5 will cycle into the #1 position as new versions are released.

xAI's Success Rate in Reaching #1: Based on historical performance from 2024 through early 2026, xAI has reached the #1 spot with at least two distinct major model iterations (Grok 3 in early 2025 and Grok 4.1 in late 2025/early 2026). Given that xAI has released roughly 3-4 major model updates per year, the base rate for an xAI flagship eventually hitting #1 at some point in its lifecycle is approximately 50-66% per major version.

Persistence of the 'Big Four' at the Top: Since the inception of the Chatbot Arena in May 2023, the #1 spot has been exclusively held by models from four companies: OpenAI, Anthropic, Google, and xAI. No other company has successfully broken into the #1 position for a sustained period. As xAI is firmly established in this "frontier" group, its baseline probability of hitting #1 is significantly higher than the broader field of AI developers.

xAI Model Release Cadence: Between November 2023 and February 2026, xAI has maintained an aggressive release cycle of a new flagship or major "point" release every 3 to 4 months on average. In a landscape where the #1 spot typically changes with every major "frontier" release, this cadence provides xAI with 3-4 distinct "at-bats" per year to claim the top rank.

Historical Rebound Probability: Models or companies that fall from #1 to #2 or #3 have a high historical frequency of "reclaiming" the top spot with a follow-up "Thinking" or "Turbo" variant. xAI's current #2 and #4 positions place it within a statistical "dead heat" (often within 5-10 Elo points), where a minor update or even a shift in voting distributions can trigger a rank change.

Impact of "Reasoning/Thinking" Variants: Starting in late 2024 and continuing through 2025, the release of "Thinking" or "Reasoning" variants has consistently resulted in a jump to #1. Every major lab that released a successful reasoning model saw that model reach #1 within weeks of its full Arena integration.

Opus forecast rationale (78%):

This question asks whether xAI will have a #1 ranked AI model at any point before January 1, 2027. The resolution is cumulative — it only needs to happen once.

Most important factor: Likely already achieved. The research consistently indicates that xAI's Grok 4.1 (Thinking) held the sole #1 position on the LMSYS Chatbot Arena in January 2026. Multiple sources corroborate this. If confirmed under Kalshi's resolution criteria, this would already satisfy the condition for YES resolution regardless of future performance.

Even setting aside January 2026: xAI is currently at #2 with an Elo of 1502, just 3 points behind Anthropic's Claude Opus 4.6 at 1505. With 309 days remaining, xAI has multiple "at-bats": the full Grok 4.20 release (currently in beta at #4), the anticipated Grok 5 (68% probability of release by June 30 per Polymarket), and likely additional iterations. Historical base rates show the #1 spot changes 8-12 times per year, and xAI has reached #1 with ~50-66% probability per major version.

Why I'm above the market price of 50%: The market appears to be anchored on current rankings (recency bias) and merger disruption concerns. The NO thesis about "integration tax" from the SpaceX merger is plausible but likely overweighted given xAI released Grok 4.20 Beta in mid-February, demonstrating continued shipping velocity post-merger.

Key Uncertainties:

Did January 2026 count as #1 under Kalshi's rules? If yes (~60% likely given multiple sources say "sole #1"), the question is effectively already resolved YES.

Will Grok 5 deliver on the parameter scaling thesis? If the 6T parameter model launches and underperforms, xAI's main catalyst disappears. But historical patterns of reasoning models reaching #1 strongly favor YES.

Will LMSYS methodology change significantly? A shift to agentic or safety-weighted benchmarks could structurally disadvantage xAI's "uncensored" approach.

Competitive response timing: If OpenAI's GPT-6 or Anthropic's Claude 5 launches with a step-function improvement, the threshold for #1 could move beyond xAI's reach.

Weighting all factors: the combination of likely already having achieved #1, extreme proximity to #1 currently, massive compute infrastructure, aggressive release cadence, and 309 days of remaining opportunity makes YES substantially more likely than the 50% market price suggests.

The Pipeline in Code

The core of the system is built on FutureSearch's agent_map and classify operations. Here's the screening stage:

from futuresearch.ops import agent_map, classify

result = await classify(
    INSIDER_SCREEN_TASK, session,
    categories=["safe", "unsafe"],
    input=df[["complete_question", "rules"]],
)

passing = set(result.data["complete_question"])
df = df[df["complete_question"].isin(passing)].reset_index(drop=True)

The research stage uses agent_map to run agents across every row in parallel:

# Stage 1: Current state (runs first since later stages depend on it)
class CurrentStateResponse(BaseModel):
    days_remaining: int
    resolution_date: str
    current_state: str

stage1_result = await agent_map(
    """You are a forecasting analyst researching a prediction question from Kalshi.

    For the provided question, answer:
    1. How much time is left until the outcome is determined?
    2. What is the current state of the world relevant to this question?
    3. Summarize key background and developments from the last 2 years.

    You MUST perform at least one web search to ensure your information is current.""",
    session,
    input=df,
    response_model=CurrentStateResponse,
)

# Stage 2: Five research agents run in parallel
(base_rates_result, key_factors_result, market_expert_result,
 yes_investor_result, no_investor_result) = await asyncio.gather(
    agent_map(base_rates_prompt, session, input=stage1_combined, response_model=BaseRatesResponse),
    agent_map(key_factors_prompt, session, input=stage1_combined, response_model=KeyFactorsResponse),
    agent_map(market_expert_prompt, session, input=stage1_combined, response_model=MarketExpertResponse),
    agent_map(yes_thesis_prompt, session, input=stage1_combined, response_model=YesThesisResponse),
    agent_map(no_thesis_prompt, session, input=stage1_combined, response_model=NoThesisResponse),
)

The forecasting stage runs three models in parallel and takes the median:

from futuresearch.task import LLM

class ForecastResponse(BaseModel):
    rationale: str
    probability: int

async def forecaster_gemini(input):
    return await agent_map(
        FORECASTER_PROMPT, session, input=input,
        response_model=ForecastResponse,
        llm=LLM["GEMINI_3_PREVIEW_HIGH"],
    )

async def forecaster_opus(input):
    return await agent_map(
        FORECASTER_PROMPT, session, input=input,
        response_model=ForecastResponse,
        llm=LLM["CLAUDE_4_6_OPUS_HIGH"],
    )

# Run all three in parallel (forecaster_opus_2 is identical to forecaster_opus)
gemini_result, opus_result, opus_2_result = await asyncio.gather(
    forecaster_gemini(all_research),
    forecaster_opus(all_research),
    forecaster_opus_2(all_research),
)

# Median of three forecasters
all_research['median_probability'] = all_research[
    ['gemini_probability', 'opus_probability', 'opus_2_probability']
].median(axis=1).astype(int)

Try It Yourself

Run forecasts directly at futuresearch.ai/app.

The default configuration forecasts a handful of top markets for ~$5. Scale up by adjusting AUTO_TOP_EVENTS in the configuration cell.

You can run the notebook yourself on different questions, or just use ours as a starting point. We built this to show how FutureSearch's tools can be applied to real problems — we hope it's useful for your own forecasting.

What's Next

We plan to track whether these forecasts would lead to a profitable portfolio using play money — not to encourage trading, but as a hard benchmark for accuracy. If the AI's probability estimates are better than the market's, a simulated portfolio that buys when the forecast is above the market and sells when it's below should make money over time. If it doesn't, we need to improve the forecaster.

The companion Kalshi Trader notebook takes a forecast CSV, fetches live order books, and simulates portfolio allocation — we'll use it to track performance over the coming weeks and report back on whether the AI is actually adding accuracy over the crowd.