Research Formal Verification for AI
Some questions are too broad for one agent. "What is the state of formal verification for AI systems?" spans verification tooling, scalability theory, real deployments, and competing approaches. A single researcher would over-weight whichever angle it happened to start with.
multi_agent splits the question across five research agents, each with its own brief, then synthesizes their findings into one structured answer.
| Metric | Value |
|---|---|
| Research agents | 5 |
| Effort level | high |
| Output | 6-field brief |
| Total cost | $1.43 |
| Time | ~4 minutes |
Add FutureSearch to Claude Code if you haven't already:
claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp
Then ask Claude:
Research the current state of formal verification methods applied to AI and
ML systems. Cover what techniques exist and how mature they are, what cannot
yet be verified, real-world deployments, and where the field is heading.
Claude calls FutureSearch's multi_agent tool, assigns each agent a distinct angle, and defines the answer's shape with a response_schema:
Tool: futuresearch_multi_agent
├─ task: "Research the current state of formal verification for AI/ML systems"
├─ effort_level: "high"
├─ directions: [
│ "Neural network verification tools and their capabilities (a,b-CROWN...)",
│ "The scalability problem for verifying LLMs and transformers",
│ "Real-world deployments in safety-critical applications",
│ "Theoretical and practical limits, including undecidability",
│ "Alternative and hybrid approaches to AI assurance" ]
└─ response_schema: {overall_maturity, what_works, key_limitations,
frontier_ai_gap, real_deployments, trajectory}
→ Submitted: multi-agent research starting.
Session: https://futuresearch.ai/sessions/40b1aa59-b7f9-4a97-a8ec-adb22611813a
Task ID: acb5...
Tool: futuresearch_progress
→ Running: 5 agents researching (60s elapsed)
...
Tool: futuresearch_progress
→ Completed in 249s. Synthesized 1 row, 6 fields.
Tool: futuresearch_results
→ Saved to /Users/you/formal_verification.csv
Add the FutureSearch connector if you haven't already. Then ask Claude:
Research the current state of formal verification methods applied to AI and ML systems. Cover what techniques exist and how mature they are, what cannot yet be verified, real-world deployments, and where the field is heading.
Results take about 4 minutes.
Go to futuresearch.ai/app and enter:
Research the current state of formal verification methods applied to AI and ML systems. Cover what techniques exist and how mature they are, what cannot yet be verified, real-world deployments, and where the field is heading.
The response_schema defines the fields of the synthesized answer. Without it, the result is a single answer string.
pip install futuresearch
export FUTURESEARCH_API_KEY=your_key_here # Get one at futuresearch.ai/app/api-key
import asyncio
import pandas as pd
from futuresearch import create_session
from futuresearch.ops import multi_agent
SCHEMA = {
"type": "object",
"properties": {
"overall_maturity": {"type": "string"},
"what_works": {"type": "string"},
"key_limitations": {"type": "string"},
"frontier_ai_gap": {"type": "string"},
"real_deployments": {"type": "string"},
"trajectory": {"type": "string"},
},
"required": ["overall_maturity", "what_works", "key_limitations",
"frontier_ai_gap", "real_deployments", "trajectory"],
}
async def main():
async with create_session(name="Formal verification for AI") as session:
result = await multi_agent(
session=session,
task="Research the current state of formal verification methods "
"applied to AI and machine learning systems.",
input=pd.DataFrame(),
effort_level="high",
directions=[
"Neural network verification tools and their capabilities.",
"The scalability problem for verifying LLMs and transformers.",
"Real-world deployments in safety-critical applications.",
"Theoretical and practical limits, including undecidability.",
"Alternative and hybrid approaches to AI assurance.",
],
response_schema=SCHEMA,
)
return result.data
print(asyncio.run(main()).iloc[0])
Results
The five agents synthesized into a six-field brief. Each field is one section of the answer:
| Field | What it found |
|---|---|
overall_maturity | Formal verification for AI is at an inflection point: maturing fast for small-to-medium neural networks, but unable to scale to frontier-scale models. |
what_works | Bound propagation with branch-and-bound (the α,β-CROWN verifier, VNN-COMP winner five years running), SMT-based verification (Marabou 2.0), abstract interpretation, and runtime safety shields. |
key_limitations | Verifying ReLU networks is NP-complete; the specification problem; the gap between proving software correctness and ensuring safe physical outcomes; undecidability of non-trivial semantic properties. |
frontier_ai_gap | Deterministic verification is limited to roughly BERT-sized models. Softmax attention's cross-nonlinearities defeat exact solvers, so frontier LLMs cannot be verified end-to-end. |
real_deployments | ACAS Xu collision-avoidance in aviation, Airbus's use of Astrée, Mobileye's Responsibility-Sensitive Safety model, NASA DAIDALUS, and FDA-cleared medical AI devices. |
trajectory | A move toward layered, portfolio assurance: AI-assisted proof generation with Lean and Coq, compositional verification, and runtime monitoring around unverified models. |
The answer holds together because no single agent had to cover everything. The agent on real deployments surfaced concrete, named systems (ACAS Xu, Astrée, Mobileye RSS) while the agent on theoretical limits brought the undecidability results, and the synthesis step reconciled them into one consistent narrative rather than five disconnected summaries. Every claim in the output carries citations back to the specific pages the agents read, so each section can be checked against its sources.
Going deeper
- Reference: multi_agent
- Guide: Research a question with a team of agents
- Case study: Find startups selling to AI labs