The Opportunity
Prediction markets let you trade on the outcome of real-world events like elections, economic data, policy decisions, and weather. Prices reflect the crowd's probability estimate: a contract trading at 40 cents implies the market believes there's a 40% chance of YES.
The math is simple: if you can estimate probabilities more accurately than the market, you make money in expectation. Buy YES when you think the true probability is higher than the price. Buy NO when you think it's lower. Over enough trades, better accuracy wins.
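The arithmetic can be made concrete. A minimal sketch, assuming contracts that pay $1 on the correct outcome (the function names here are ours, not from any trading library):

```python
# Expected value of a YES contract that pays $1 if the event resolves YES.
# price: market price in dollars (e.g. 0.40); p_true: your probability estimate.
def expected_value_yes(price: float, p_true: float) -> float:
    return p_true * 1.0 - price

# A NO contract costs (1 - price) and pays $1 if the event resolves NO.
def expected_value_no(price: float, p_true: float) -> float:
    return (1.0 - p_true) * 1.0 - (1.0 - price)

# Market says 40%, you believe 55%: buying YES earns 15 cents per contract
# in expectation. Better accuracy compounds over many such trades.
edge = expected_value_yes(0.40, 0.55)  # 0.15
```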
The hard part isn't the math; it's the research and the judgment. Kalshi has thousands of open markets at any given time. Finding the ones where the market is wrong requires deep research on each question: What's the current state of the world? What are the base rates? What do experts think? What are the strongest arguments on both sides? You then have to synthesize all of that into a probability estimate, which means weighing conflicting evidence, calibrating your confidence, and resisting cognitive biases. No human can do this systematically across hundreds of markets. It's tedious, time-consuming work, and exactly the kind where confirmation bias, anchoring, and availability bias creep in and degrade both the research and the judgment.
This is a job ideally suited for AI.
We built a notebook that scans the highest-volume Kalshi markets, uses AI to do in-depth research and best-practice forecasting for each one, and highlights the markets where the AI's probability estimate disagrees most with the market price. The full research, including both sides of every argument, is included for every question.
We think this is useful for anyone trying to forecast underlying probabilities on prediction markets, whether you use the AI's estimates directly or just as a starting point for your own analysis. (This is less relevant for momentum traders, technical traders, or market makers as it's a fundamentals-first tool.)
A key advantage of doing this with AI is that the research is fully transparent and evenhanded. Unlike human analysts, who might omit information intentionally or through bias, the AI researches both sides with equal rigor. You can read every piece of research and every rationale and decide for yourself whether you agree.
| Metric | Value |
|---|---|
| Markets scanned | ~3,500 (all open Kalshi events) |
| Markets forecasted | 100 (top events by volume, up to 2 subquestions each) |
| Research agents per question | 6 (current state, base rates, key factors, expert opinions, YES thesis, NO thesis) |
| Total research agents | 700 (100 questions × 7 agent calls each: 1 initial research + 5 parallel research agents + 1 forecaster) |
| Forecasting model | Claude Opus with extended thinking |
| Total cost | ~$100 |
| Runtime | ~45 minutes |
How It Works
The system runs as a Google Colab notebook in four stages:
1. Find markets
The notebook fetches all open events from Kalshi's API (~3,500 events), ranks them by trading volume, and selects the top 100. For events with multiple subquestions (e.g., "Who will be the next Fed Chair?" has one subquestion per candidate), it picks the top 2 by volume. High-volume markets are more likely to have meaningful price signals and enough liquidity to actually trade.
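The selection logic can be sketched as a pure function. The event and market dicts below use a simplified, hypothetical shape, not Kalshi's actual API schema:

```python
# Sketch of the selection step. Assumes each event dict carries a total
# traded volume and a list of its subquestion markets (shape is ours).
def select_markets(events, top_events=100, subquestions_per_event=2):
    # Rank events by total traded volume, highest first.
    ranked = sorted(events, key=lambda e: e["volume"], reverse=True)[:top_events]
    selected = []
    for event in ranked:
        # Within each event, keep only the highest-volume subquestions.
        markets = sorted(event["markets"], key=lambda m: m["volume"], reverse=True)
        selected.extend(markets[:subquestions_per_event])
    return selected

events = [
    {"title": "Next Fed Chair?", "volume": 900,
     "markets": [{"ticker": "FED-A", "volume": 500},
                 {"ticker": "FED-B", "volume": 300},
                 {"ticker": "FED-C", "volume": 100}]},
    {"title": "NYC snow?", "volume": 400,
     "markets": [{"ticker": "SNOW-20", "volume": 400}]},
]
picks = select_markets(events, top_events=2)
# Keeps FED-A, FED-B (top 2 subquestions), and SNOW-20
```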
2. Research each market
This is where everyrow does the heavy lifting. For each of the 100 markets, six research agents run, each performing live web searches. The current-state agent runs first, since the other agents build on its findings; the remaining five run in parallel:
- Current state: What's happening right now that's relevant to this question?
- Base rates: How often have similar events occurred historically?
- Key factors: What will most influence the outcome?
- Expert and market opinions: What do other forecasters and prediction markets think?
- YES investor thesis: The strongest case that the market is underpricing YES
- NO investor thesis: The strongest case that the market is overpricing YES
The opposing YES/NO theses are critical. They force the system to steelman both sides before making a judgment, which helps counteract the confirmation bias that plagues human forecasters. This research structure is borrowed from superforecasting literature: the Good Judgment Project found that systematically considering multiple angles produces more accurate predictions than intuitive judgment alone.
This produces ~600 research agent calls (100 questions × 6 agents), run in parallel across questions. Each agent performs web searches to ground its analysis in current information rather than relying on training data.
3. Generate forecasts
Claude Opus with extended thinking receives the full research dossier and produces a probability estimate with a detailed written rationale.
Every forecast is fully transparent. It includes a multi-paragraph rationale explaining the reasoning, the key uncertainties, and what could change the probability. You can read the research and the rationale for every single position, because nothing is a black box.
4. Compare to market prices
For each market, the system compares the AI forecast to the current market price and highlights the biggest disagreements: the markets where the model thinks the crowd is most wrong.
The Pipeline in Code
The core of the system is built on everyrow's agent_map operation, which runs an LLM agent across every row of a dataframe in parallel. Here's the research stage:
```python
from everyrow.ops import agent_map
from pydantic import BaseModel

# Stage 1: Current state (runs first since later stages depend on it)
class CurrentStateResponse(BaseModel):
    days_remaining: int
    resolution_date: str
    current_state: str

stage1_result = await agent_map(
    """You are a forecasting analyst researching a prediction question from Kalshi.
For the provided question, answer:
1. How much time is left until the outcome is determined?
2. What is the current state of the world relevant to this question?
3. Summarize key background and developments from the last 2 years.
You MUST perform at least one web search to ensure your information is current.""",
    session,
    input=df,
    response_model=CurrentStateResponse,
)
```
```python
import asyncio

# Stage 2: Five research agents run in parallel.
# stage1_combined is the input dataframe joined with the Stage 1 results.
(base_rates_result, key_factors_result, market_expert_result,
 yes_investor_result, no_investor_result) = await asyncio.gather(
    agent_map(base_rates_prompt, session, input=stage1_combined, response_model=BaseRatesResponse),
    agent_map(key_factors_prompt, session, input=stage1_combined, response_model=KeyFactorsResponse),
    agent_map(market_expert_prompt, session, input=stage1_combined, response_model=MarketExpertResponse),
    agent_map(yes_thesis_prompt, session, input=stage1_combined, response_model=YesThesisResponse),
    agent_map(no_thesis_prompt, session, input=stage1_combined, response_model=NoThesisResponse),
)
```
The forecasting stage:
```python
from everyrow.task import LLM

class ForecastResponse(BaseModel):
    rationale: str
    probability: int  # percentage, 0-100

result = await agent_map(
    FORECASTER_PROMPT, session, input=all_research,
    response_model=ForecastResponse,
    llm=LLM["CLAUDE_4_5_OPUS_THINKING"],
)
all_research['probability'] = result.data['probability']
all_research['rationale'] = result.data['rationale']
```
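The comparison stage (step 4) isn't shown in the excerpt above, but it reduces to a sort over the forecast-versus-price gap. A sketch, where the `market_prob` column name and sample data are our assumptions:

```python
import pandas as pd

# Rank markets by how much the AI forecast disagrees with the market price.
# Both columns are percentages (0-100); 'market_prob' is an assumed name.
def rank_disagreements(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["edge"] = out["probability"] - out["market_prob"]
    # Position: buy YES when the forecast exceeds the price, NO otherwise.
    out["position"] = out["edge"].apply(lambda e: "YES" if e > 0 else "NO")
    out["disagreement"] = out["edge"].abs()
    return out.sort_values("disagreement", ascending=False)

sample = pd.DataFrame({
    "question": ["Snow in NYC?", "Fed cut in March?"],
    "market_prob": [37, 5],
    "probability": [58, 10],
})
ranked = rank_disagreements(sample)
# "Snow in NYC?" ranks first (disagreement 21 vs 5), position YES
```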
First Run: February 23, 2026
Here's what the system produced on its first live run.
Out of 100 markets researched, here are the 37 where Claude Opus's forecast disagrees most with the market price, sorted by the size of the disagreement:
| Question | Pos | Mkt% | Fcast% |
|---|---|---|---|
| Snow in New York City from Feb 21 - Feb 24? [Above 20.0 inches] | YES | 37 | 58 |
| Texas Democratic Senate nominee? [Jasmine Crockett] | YES | 22 | 35 |
| World leaders out in 2026? [Ali Khamenei] | NO | 64 | 52 |
| Will Americans receive tariff stimulus checks? [Before 2027] | NO | 28 | 18 |
| Best AI in Feb 2026? [Gemini] | YES | 3 | 12 |
| Who will win Survivor Season 50? [Aubry Bracco] | NO | 74 | 67 |
| What will Trump say during the State of the Union? [Crypto / Bitcoin] | YES | 25 | 30 |
| Florida Republican Governor nominee? [James Fishback] | NO | 15 | 10 |
| Snow in New York City in Feb 2026? [Above 25.0 inches] | YES | 19 | 24 |
| Fed decision in Mar 2026? [Cut 25bps] | YES | 5 | 10 |
| Rookie of the Year Winner? [Cooper Flagg] | YES | 74 | 79 |
| Will Americans receive tariff stimulus checks? [Before August] | NO | 12 | 8 |
| Oscar for Best Supporting Actress? [Teyana Taylor] | NO | 52 | 48 |
| Pro Basketball Playoff Qualifiers? [Charlotte] | NO | 52 | 48 |
| English Premier League Winner? [Man City] | NO | 38 | 34 |
| Will marijuana be rescheduled? [Before 2027] | NO | 57 | 53 |
| Will marijuana be rescheduled? [Before July 2026] | NO | 29 | 25 |
| Texas Democratic Senate nominee? [James Talarico] | NO | 79 | 75 |
| Maine Democratic Senate nominee? [Janet Mills] | YES | 36 | 40 |
| 2026 ICC Men's T20 World Cup Winner [India] | NO | 33 | 30 |
| When will DHS be funded again? [Before Mar 10, 2026] | NO | 40 | 37 |
| Maine Democratic Senate nominee? [Graham Platner] | NO | 64 | 62 |
| When will Bitcoin cross $100k again? [Before July 2026] | YES | 17 | 19 |
| Stanley Cup Champion? [Colorado Avalanche] | NO | 26 | 24 |
| Snow in New York City from Feb 21 - Feb 24? [Above 22.0 inches] | YES | 3 | 5 |
| MVP Winner? [Nikola Jokic] | NO | 22 | 20 |
| Oscar for Best Supporting Actor? [Stellan Skarsgard] | NO | 26 | 24 |
| Western Conference Champion? [San Antonio] | YES | 14 | 16 |
| 2026 ICC Men's T20 World Cup Winner [South Africa] | YES | 29 | 31 |
| Oscar for Best Supporting Actress? [Amy Madigan] | YES | 17 | 19 |
| Best AI in Feb 2026? [Claude] | NO | 93 | 91 |
| Ali Khamenei out as Supreme Leader? [Before March 1, 2026] | YES | 4 | 6 |
| How many pure album sales will Megan Moroney's 'Cloud 9' have...? [At least 65,000 albums] | NO | 95 | 93 |
| Men's College Basketball Champion [Arizona] | YES | 13 | 15 |
| Most Improved Player Winner? [Jalen Duren] | NO | 14 | 13 |
| Oscar for Best Picture? [Sinners] | YES | 14 | 15 |
| Oscar for Best Cinematography? [Sinners] | YES | 19 | 20 |
The breadth is worth noting. In a single 45-minute run, the system researched and produced forecasts across weather, geopolitics, entertainment awards, professional and college sports, economic policy, cryptocurrency, AI, and electoral politics. Each forecast comes with a full research dossier and rationale that you can read and evaluate. Every disagreement with the market is explained, not just asserted.
Example: Best AI in Feb 2026? [Gemini] — Market 3%, Forecast 12%
Here's the full rationale from Claude Opus for one of the larger disagreements. The question asks whether Google's Gemini will be the top-ranked LLM on the LMSYS Chatbot Arena on February 28, 2026. The market prices this at just 3%. Opus puts the true probability at 12%: still unlikely, but meaningfully higher than the market implies. Judge for yourself:
This question asks whether Google's Gemini will be the top-ranked LLM on the LMSYS Chatbot Arena on February 28, 2026. With 5 days remaining, I'll synthesize the key evidence.
Current State Assessment: As of February 23, 2026, Gemini 3.1 Pro (released Feb 19) is in an extremely tight race with Claude Opus 4.6. The Elo ratings show Claude at ~1506 and Gemini at ~1505 — a 1-point difference that is within daily statistical noise (standard deviation ~1.5 Elo points). Gemini leads in specific sub-categories like "Vision-to-Code" and "Abstract Logic" and has briefly touched #1 already.
Market Signals: Prediction markets are strongly bearish on Gemini: Kalshi at 3%, Polymarket at 10%, Manifold at ~5%, with Claude/Anthropic dominating at 86–93%. This consensus is striking given the research suggesting a near-tie in actual Elo rankings.
Key Uncertainties:
Elo Stabilization Direction: Gemini 3.1 Pro is only 4 days old, and its rating is still stabilizing. If the influx of new votes confirms its strong reasoning performance, it could overtake Claude. Conversely, if early "hype votes" fade, it may settle below Claude.
Competitor Stealth Drops: Historical patterns show competitors respond within 3–14 days. With 5 days remaining, there's meaningful probability (~15–25%) that Anthropic or OpenAI could release an update, though it would need rapid vote accumulation to affect rankings.
Resolution Criteria Interpretation: The exact string matching for model names and whether "Style Control" filters are applied could significantly affect resolution. Claude appears to have advantages in general conversational tasks that drive overall Arena rankings.
Vote Volume Dynamics: Weekend voting (Feb 28 is a Saturday) tends to increase volatility and favor models with broad appeal over niche strengths.
Weighing the Evidence: The YES thesis correctly identifies that a 1-point Elo difference is essentially a coin flip on any given day, making 3% seem too low. However, the strong market consensus suggests traders see structural advantages for Claude that the statistical tie obscures — likely Claude's consistent performance in head-to-head battles for general tasks, which drive overall rankings more than category-specific leads.
The NO thesis's concern about vote volume requirements (3,000+ net-positive votes to flip rankings) and the potential for Gemini's early momentum to level off is credible. However, the claim of needing to close a 5-point gap appears outdated — current research shows only a 1-point gap.
Base Rate Consideration: In highly competitive periods, rank swaps occur every 10–14 days. With Gemini at virtual parity and within its "honeymoon" period (typically 1–2 weeks), the probability of maintaining or achieving #1 is higher than the market suggests.
Conclusion: The market appears to be somewhat underpricing Gemini given the near-statistical-tie, but I weight the strong consensus of experienced market participants who likely have real-time information and understand resolution nuances. Claude's structural advantages in human preference voting for general tasks, combined with its current slight lead, justify Claude as the favorite. However, the 3% Kalshi price seems too extreme given the 1-point Elo difference.
Try It Yourself
The full system runs as a Google Colab notebook (no local setup required).
- Get a free API key at everyrow.io/api-key ($20 in free credits included)
- Open the Kalshi Forecaster notebook in Google Colab
- Add your API key to Colab Secrets
- Click Runtime > Run all
The default configuration forecasts a handful of top markets for ~$5. Scale up to 100 questions by adjusting AUTO_TOP_EVENTS in the configuration cell.
You can run the notebook yourself on different questions, or just use ours as a starting point. We built this to show how everyrow's tools can be applied to real problems. We hope it's useful for your own forecasting.