FutureSearch Logofuturesearch
  • Solutions
  • Pricing
  • Research
  • Docs
  • Evals
  • Blog
  • Company
  • Try it for free
FutureSearch Logo

General inquiry? You can reach us at hello@futuresearch.ai.

Company

Team & CareersPressPrivacy PolicyTerms of Service

Developers

SDK DocsAPI ReferenceCase StudiesGitHubSupport

Integrations

Claude CodeCursorChatGPT CodexClaude.ai

Follow Us

X (Twitter)@dschwarz26LinkedIn
FutureSearchdocs
Your research team
Installation
  • All install methods
  • Claude.ai
  • Claude Code
  • Web App
  • Python SDK
  • Skill
Reference
  • API Key
  • forecast
  • multi_agent
  • agent_map
  • rank
  • classify
  • merge
  • dedupe
  • MCP Server
  • Progress Monitoring
  • Chaining Operations
Guides
  • LLM-Powered Data Labeling
  • Add a Column via Web Research
  • Classify and Label Rows
  • Deduplicate Training Data
  • Error Handling
  • Filter a Dataset Intelligently
  • Find Profitable Prediction Market Trades
  • Forecast Outcomes for a List of Entities
  • Value a Private Company
  • Join Tables Without Shared Keys
  • Rank Data by External Metrics
  • Research a Question with a Team of Agents
  • Resolve Duplicate Entities
  • Scale Deduplication to 20K Rows
  • Turn Claude into an Accurate Forecaster
Case Studies
  • Deduplicate Contact Lists
  • Deduplicate CRM Records
  • Enrich Contacts with Company Data
  • Find Startups Selling to Frontier AI Labs
  • Forecast a Sum-of-the-Parts SpaceX IPO Valuation
  • Forecast Anthropic and OpenAI IPO Valuations
  • Forecast Founder Seed Valuations for AI Researchers
  • Forecast When Anthropic and OpenAI Will IPO
  • Fuzzy Match Across Tables
  • Link Records Across Medical Datasets
  • LLM Cost vs. Accuracy
  • Merge Costs and Speed
  • Merge Thousands of Records
  • Multi-Stage Lead Qualification
  • Research and Rank Web Data
  • Research Formal Verification for AI
  • Run 10,000 LLM Web Research Agents
  • Score Cold Leads via Web Research
  • Score Leads from Fragmented Data
  • Screen 10,000 Rows
  • Screen Job Listings
  • Screen Stocks by Economic Sensitivity
  • Screen Stocks by Investment Thesis
FutureSearchby futuresearch
by futuresearch

Research Formal Verification for AI

Some questions are too broad for one agent. "What is the state of formal verification for AI systems?" spans verification tooling, scalability theory, real deployments, and competing approaches. A single researcher would over-weight whichever angle it happened to start with.

multi_agent splits the question across five research agents, each with its own brief, then synthesizes their findings into one structured answer.

MetricValue
Research agents5
Effort levelhigh
Output6-field brief
Total cost$1.43
Time~4 minutes

Add FutureSearch to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

Then ask Claude:

Research the current state of formal verification methods applied to AI and
ML systems. Cover what techniques exist and how mature they are, what cannot
yet be verified, real-world deployments, and where the field is heading.

Claude calls FutureSearch's multi_agent tool, assigns each agent a distinct angle, and defines the answer's shape with a response_schema:

Tool: futuresearch_multi_agent
├─ task: "Research the current state of formal verification for AI/ML systems"
├─ effort_level: "high"
├─ directions: [
│    "Neural network verification tools and their capabilities (a,b-CROWN...)",
│    "The scalability problem for verifying LLMs and transformers",
│    "Real-world deployments in safety-critical applications",
│    "Theoretical and practical limits, including undecidability",
│    "Alternative and hybrid approaches to AI assurance" ]
└─ response_schema: {overall_maturity, what_works, key_limitations,
                     frontier_ai_gap, real_deployments, trajectory}

→ Submitted: multi-agent research starting.
  Session: https://futuresearch.ai/sessions/40b1aa59-b7f9-4a97-a8ec-adb22611813a
  Task ID: acb5...

Tool: futuresearch_progress
→ Running: 5 agents researching (60s elapsed)

...

Tool: futuresearch_progress
→ Completed in 249s. Synthesized 1 row, 6 fields.

Tool: futuresearch_results
→ Saved to /Users/you/formal_verification.csv

Add the FutureSearch connector if you haven't already. Then ask Claude:

Research the current state of formal verification methods applied to AI and ML systems. Cover what techniques exist and how mature they are, what cannot yet be verified, real-world deployments, and where the field is heading.

Results take about 4 minutes.

Go to futuresearch.ai/app and enter:

Research the current state of formal verification methods applied to AI and ML systems. Cover what techniques exist and how mature they are, what cannot yet be verified, real-world deployments, and where the field is heading.

The response_schema defines the fields of the synthesized answer. Without it, the result is a single answer string.

pip install futuresearch
export FUTURESEARCH_API_KEY=your_key_here  # Get one at futuresearch.ai/app/api-key
import asyncio
import pandas as pd
from futuresearch import create_session
from futuresearch.ops import multi_agent

SCHEMA = {
    "type": "object",
    "properties": {
        "overall_maturity": {"type": "string"},
        "what_works": {"type": "string"},
        "key_limitations": {"type": "string"},
        "frontier_ai_gap": {"type": "string"},
        "real_deployments": {"type": "string"},
        "trajectory": {"type": "string"},
    },
    "required": ["overall_maturity", "what_works", "key_limitations",
                 "frontier_ai_gap", "real_deployments", "trajectory"],
}

async def main():
    async with create_session(name="Formal verification for AI") as session:
        result = await multi_agent(
            session=session,
            task="Research the current state of formal verification methods "
                 "applied to AI and machine learning systems.",
            input=pd.DataFrame(),
            effort_level="high",
            directions=[
                "Neural network verification tools and their capabilities.",
                "The scalability problem for verifying LLMs and transformers.",
                "Real-world deployments in safety-critical applications.",
                "Theoretical and practical limits, including undecidability.",
                "Alternative and hybrid approaches to AI assurance.",
            ],
            response_schema=SCHEMA,
        )
        return result.data

print(asyncio.run(main()).iloc[0])

Results

The five agents synthesized into a six-field brief. Each field is one section of the answer:

FieldWhat it found
overall_maturityFormal verification for AI is at an inflection point: maturing fast for small-to-medium neural networks, but unable to scale to frontier-scale models.
what_worksBound propagation with branch-and-bound (the α,β-CROWN verifier, VNN-COMP winner five years running), SMT-based verification (Marabou 2.0), abstract interpretation, and runtime safety shields.
key_limitationsVerifying ReLU networks is NP-complete; the specification problem; the gap between proving software correctness and ensuring safe physical outcomes; undecidability of non-trivial semantic properties.
frontier_ai_gapDeterministic verification is limited to roughly BERT-sized models. Softmax attention's cross-nonlinearities defeat exact solvers, so frontier LLMs cannot be verified end-to-end.
real_deploymentsACAS Xu collision-avoidance in aviation, Airbus's use of Astrée, Mobileye's Responsibility-Sensitive Safety model, NASA DAIDALUS, and FDA-cleared medical AI devices.
trajectoryA move toward layered, portfolio assurance: AI-assisted proof generation with Lean and Coq, compositional verification, and runtime monitoring around unverified models.

The answer holds together because no single agent had to cover everything. The agent on real deployments surfaced concrete, named systems (ACAS Xu, Astrée, Mobileye RSS) while the agent on theoretical limits brought the undecidability results, and the synthesis step reconciled them into one consistent narrative rather than five disconnected summaries. Every claim in the output carries citations back to the specific pages the agents read, so each section can be checked against its sources.

Going deeper

  • Reference: multi_agent
  • Guide: Research a question with a team of agents
  • Case study: Find startups selling to AI labs