FutureSearch Logofuturesearch
  • Solutions
  • Pricing
  • Research
  • Docs
  • Evals
  • Blog
  • Company
  • LiteLLM Checker
  • Get Researchers
FutureSearch Logo

General inquiry? You can reach us at hello@futuresearch.ai.

Company

Team & CareersPressPrivacy PolicyTerms of Service

Developers

SDK DocsAPI ReferenceCase StudiesGitHubSupport

Integrations

Claude CodeCursorChatGPT CodexClaude.ai

Follow Us

X (Twitter)@dschwarz26LinkedIn
FutureSearchdocs
Your research team
Installation
  • All install methods
  • Claude.ai
  • Claude Cowork
  • Claude Code
  • Web App
  • Python SDK
  • Skill
  • MCP Server
Reference
  • API Key
  • classify
  • dedupe
  • forecast
  • merge
  • rank
  • agent_map
  • Progress Monitoring
  • Chaining Operations
Guides
  • LLM-Powered Data Labeling
  • Add a Column via Web Research
  • Classify and Label Rows
  • Deduplicate Training Data
  • Filter a Dataset Intelligently
  • Find Profitable Polymarket Trades
  • Forecast Outcomes for a List of Entities
  • Value a Private Company
  • Join Tables Without Shared Keys
  • Rank Data by External Metrics
  • Resolve Duplicate Entities
  • Scale Deduplication to 20K Rows
  • Turn Claude into an Accurate Forecaster
Case Studies
  • Deduplicate Contact Lists
  • Deduplicate CRM Records
  • Enrich Contacts with Company Data
  • Fuzzy Match Across Tables
  • Link Records Across Medical Datasets
  • LLM Cost vs. Accuracy
  • Merge Costs and Speed
  • Merge Thousands of Records
  • Multi-Stage Lead Qualification
  • Research and Rank Web Data
  • Run 10,000 LLM Web Research Agents
  • Score Cold Leads via Web Research
  • Score Leads from Fragmented Data
  • Screen 10,000 Rows
  • Screen Job Listings
  • Screen Stocks by Economic Sensitivity
  • Screen Stocks by Investment Thesis
FutureSearchby futuresearch
by futuresearch

LLM-Powered Data Labeling

Human data labeling is slow and expensive. LLM-powered labeling produces structured labels at scale, matching human annotator accuracy at a fraction of the cost.

Here, we label 200 text samples from the DBpedia-14 dataset into 14 ontology categories, achieving 98.5% accuracy.

MetricValue
Labels produced200
Strict accuracy96.0%
Normalized accuracy98.5%
Time4.7 minutes
Cost$3.35

Add FutureSearch to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

Prepare a CSV with 200 text samples from the DBpedia-14 dataset. Tell Claude:

Classify each text in dbpedia_samples.csv into exactly one DBpedia ontology
category: Company, Educational Institution, Artist, Athlete, Office Holder,
Mean Of Transportation, Building, Natural Place, Village, Animal, Plant,
Album, Film, or Written Work.

Claude calls FutureSearch's agent MCP tool with the classification schema:

Tool: futuresearch_agent
├─ task: "Classify this text into exactly one DBpedia ontology category."
├─ input_csv: "/Users/you/dbpedia_samples.csv"
└─ response_schema: {"category": "enum of 14 DBpedia categories"}

→ Submitted: 200 rows for processing.
  Session: https://futuresearch.ai/sessions/5f5a052a-c240-43d8-91a4-ad7ad274f6e1
  Task ID: 5f5a...

Tool: futuresearch_progress
├─ task_id: "5f5a..."
→ Running: 0/200 complete, 200 running (15s elapsed)

...

Tool: futuresearch_progress
→ Completed: 200/200 (0 failed) in 279s.

Tool: futuresearch_results
├─ task_id: "5f5a..."
├─ output_path: "/Users/you/dbpedia_classified.csv"
→ Saved 200 rows to /Users/you/dbpedia_classified.csv

View the session.

Add the FutureSearch connector if you haven't already. Then upload a CSV with 200 text samples from the DBpedia-14 dataset and ask Claude:

Classify each text into exactly one DBpedia ontology category: Company, Educational Institution, Artist, Athlete, Office Holder, Mean Of Transportation, Building, Natural Place, Village, Animal, Plant, Album, Film, or Written Work.

Go to futuresearch.ai/app, upload a CSV with 200 text samples from the DBpedia-14 dataset, and enter:

Classify each text into exactly one DBpedia ontology category: Company, Educational Institution, Artist, Athlete, Office Holder, Mean Of Transportation, Building, Natural Place, Village, Animal, Plant, Album, Film, or Written Work.

Active Learning: Ground Truth vs LLM Oracle

Using EffortLevel.LOW keeps the cost to $0.26 per run ($0.0013 per label). The learning curves between LLM and human annotation overlap almost perfectly.

pip install futuresearch
export FUTURESEARCH_API_KEY=your_key_here  # Get one at futuresearch.ai/app/api-key
from typing import Literal

import pandas as pd
from pydantic import BaseModel, Field

from futuresearch import create_session
from futuresearch.ops import agent_map
from futuresearch.task import EffortLevel


LABEL_NAMES = {
    0: "Company", 1: "Educational Institution", 2: "Artist",
    3: "Athlete", 4: "Office Holder", 5: "Mean Of Transportation",
    6: "Building", 7: "Natural Place", 8: "Village",
    9: "Animal", 10: "Plant", 11: "Album", 12: "Film", 13: "Written Work",
}
CATEGORY_TO_ID = {v: k for k, v in LABEL_NAMES.items()}


class DBpediaClassification(BaseModel):
    category: Literal[
        "Company", "Educational Institution", "Artist",
        "Athlete", "Office Holder", "Mean Of Transportation",
        "Building", "Natural Place", "Village",
        "Animal", "Plant", "Album", "Film", "Written Work",
    ] = Field(description="The DBpedia ontology category")


async def query_llm_oracle(texts_df: pd.DataFrame) -> list[int]:
    async with create_session(name="Active Learning Oracle") as session:
        result = await agent_map(
            session=session,
            task="Classify this text into exactly one DBpedia ontology category.",
            input=texts_df[["text"]],
            response_model=DBpediaClassification,
            effort_level=EffortLevel.LOW,
        )
        return [CATEGORY_TO_ID.get(result.data["category"].iloc[i], -1)
                for i in range(len(texts_df))]

The full pipeline is available as a companion notebook on Kaggle. See also the full blog post.

Results

CategoryCount
Building22
Artist20
Mean Of Transportation18
Animal15
Educational Institution15
Company13
Album13
Office Holder12
Film12
Natural Place11

Of the 8 "strict" mismatches against ground truth, 5 were formatting variants (e.g., "WrittenWork" vs "Written Work"), not true errors. Only 3 were genuinely incorrect classifications: a Village labeled as Settlement, an Educational Institution labeled as University, and an Artist labeled as Writer. These are semantic near-misses, not random errors.

When used as an oracle in an active learning loop (10 iterations, 200 labels total, 10 independent repeats), the LLM oracle matches ground truth:

Data Labeling MethodFinal Accuracy (mean +/- std)
Human annotation (ground truth)80.6% +/- 1.0%
LLM annotation (FutureSearch)80.7% +/- 0.8%

Built with FutureSearch. Related guides: Classify DataFrame Rows (label data at scale), Deduplicate Training Data (clean ML datasets before training).