
LLM-Powered Data Labeling

Web App: go to futuresearch.ai/app, upload a CSV with 200 text samples from the DBpedia-14 dataset, and enter:

Classify each text into exactly one DBpedia ontology category: Company, Educational Institution, Artist, Athlete, Office Holder, Mean Of Transportation, Building, Natural Place, Village, Animal, Plant, Album, Film, or Written Work.

Claude.ai: add the everyrow connector if you haven't already, then upload the same CSV and give Claude the same prompt.

Either way: 200 labels produced in under 5 minutes with 98.5% normalized accuracy.

Claude Code's interactive classification is fine for labeling a dozen items in conversation. But when an active learning loop requests hundreds of labels programmatically, with a consistent schema and structured output every time, you need a labeling service that runs on demand.

Here, we get Claude Code to label 200 text samples from the DBpedia-14 dataset into 14 ontology categories, achieving 98.5% normalized accuracy.

| Metric | Value |
| --- | --- |
| Labels produced | 200 |
| Strict accuracy | 96.0% |
| Normalized accuracy | 98.5% |
| Time | 4.7 minutes |
| Cost | $3.35 |

Add everyrow to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

Prepare a CSV with 200 text samples from the DBpedia-14 dataset. Tell Claude:

Classify each text in dbpedia_samples.csv into exactly one DBpedia ontology
category: Company, Educational Institution, Artist, Athlete, Office Holder,
Mean Of Transportation, Building, Natural Place, Village, Animal, Plant,
Album, Film, or Written Work.

Claude calls everyrow's agent MCP tool with the classification schema:

Tool: everyrow_agent
├─ task: "Classify this text into exactly one DBpedia ontology category."
├─ input_csv: "/Users/you/dbpedia_samples.csv"
└─ response_schema: {"category": "enum of 14 DBpedia categories"}

→ Submitted: 200 rows for processing.
  Session: https://futuresearch.ai/sessions/5f5a052a-c240-43d8-91a4-ad7ad274f6e1
  Task ID: 5f5a...

Tool: everyrow_progress
├─ task_id: "5f5a..."
→ Running: 0/200 complete, 200 running (15s elapsed)

...

Tool: everyrow_progress
→ Completed: 200/200 (0 failed) in 279s.

Tool: everyrow_results
├─ task_id: "5f5a..."
├─ output_path: "/Users/you/dbpedia_classified.csv"
→ Saved 200 rows to /Users/you/dbpedia_classified.csv

200 labels in 4.7 minutes. View the session.
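
To inspect the label distribution, a quick check with pandas (assuming the output column is named category, matching the response schema above):

import pandas as pd

df = pd.read_csv("dbpedia_classified.csv")
print(df["category"].value_counts().head(10))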

| Category | Count |
| --- | --- |
| Building | 22 |
| Artist | 20 |
| Mean Of Transportation | 18 |
| Animal | 15 |
| Educational Institution | 15 |
| Company | 13 |
| Album | 13 |
| Office Holder | 12 |
| Film | 12 |
| Natural Place | 11 |

Of the 8 "strict" mismatches against ground truth, 5 were formatting variants (e.g., "WrittenWork" vs "Written Work"), not true errors. Only 3 were genuinely incorrect classifications: a Village labeled as Settlement, an Educational Institution labeled as University, and an Artist labeled as Writer. These are semantic near-misses, not random errors.

Human data labeling is slow and expensive. The everyrow SDK can replace the human annotator in an active learning loop, producing structured labels at scale. 200 labels in under 5 minutes for $0.26.

Active Learning: Ground Truth vs LLM Oracle

| Metric | Value |
| --- | --- |
| Labels per run | 200 |
| Cost per run | $0.26 |
| Cost per labeled item | $0.0013 |
| Final accuracy (LLM) | 80.7% ± 0.8% |
| Final accuracy (human) | 80.6% ± 1.0% |
| LLM-human label agreement | 96.1% ± 1.6% |
| Repeats | 10 |
| Dataset | DBpedia-14 (14-class text classification) |

pip install everyrow
export EVERYROW_API_KEY=your_key_here  # Get one at futuresearch.ai/api-key

We used a TF-IDF + LightGBM classifier with entropy-based uncertainty sampling. Each iteration selects the 20 most uncertain examples, sends them to the LLM for annotation, and retrains. 10 iterations, 200 labels total. We ran 10 independent repeats with different seeds, comparing the LLM oracle against ground truth labels. A sketch of the sampling loop follows the oracle code below.

from typing import Literal

import pandas as pd
from pydantic import BaseModel, Field

from everyrow import create_session
from everyrow.ops import agent_map
from everyrow.task import EffortLevel


LABEL_NAMES = {
    0: "Company", 1: "Educational Institution", 2: "Artist",
    3: "Athlete", 4: "Office Holder", 5: "Mean Of Transportation",
    6: "Building", 7: "Natural Place", 8: "Village",
    9: "Animal", 10: "Plant", 11: "Album", 12: "Film", 13: "Written Work",
}
CATEGORY_TO_ID = {v: k for k, v in LABEL_NAMES.items()}


class DBpediaClassification(BaseModel):
    category: Literal[
        "Company", "Educational Institution", "Artist",
        "Athlete", "Office Holder", "Mean Of Transportation",
        "Building", "Natural Place", "Village",
        "Animal", "Plant", "Album", "Film", "Written Work",
    ] = Field(description="The DBpedia ontology category")


async def query_llm_oracle(texts_df: pd.DataFrame) -> list[int]:
    async with create_session(name="Active Learning Oracle") as session:
        result = await agent_map(
            session=session,
            task="Classify this text into exactly one DBpedia ontology category.",
            input=texts_df[["text"]],
            response_model=DBpediaClassification,
            effort_level=EffortLevel.LOW,
        )
        # Map category names back to integer ids; -1 flags any response
        # that does not match a known category.
        return [CATEGORY_TO_ID.get(result.data["category"].iloc[i], -1)
                for i in range(len(texts_df))]
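
The loop around the oracle is not shown here, so below is a minimal sketch of the entropy-based uncertainty sampling described above, reusing query_llm_oracle from the previous block. The run_active_learning helper and its hyperparameters are our own placeholders, not the published pipeline.

import numpy as np
from lightgbm import LGBMClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


async def run_active_learning(
    pool_df: pd.DataFrame,
    test_df: pd.DataFrame,
    iterations: int = 10,
    batch_size: int = 20,
    seed: int = 0,
) -> float:
    vectorizer = TfidfVectorizer(max_features=20_000)
    X_pool = vectorizer.fit_transform(pool_df["text"])
    X_test = vectorizer.transform(test_df["text"])

    # Seed with a random batch; the model needs some labels before it can
    # estimate uncertainty.
    rng = np.random.default_rng(seed)
    candidates = rng.choice(len(pool_df), size=batch_size, replace=False).tolist()

    labeled_idx: list[int] = []
    labels: list[int] = []
    clf = LGBMClassifier(n_estimators=200)
    for _ in range(iterations):
        # Label the selected batch with the LLM oracle defined above.
        labels += await query_llm_oracle(pool_df.iloc[candidates])
        labeled_idx += candidates
        clf.fit(X_pool[labeled_idx], labels)

        # Entropy-based uncertainty sampling: pick the unlabeled rows whose
        # predicted class distribution has the highest entropy.
        proba = clf.predict_proba(X_pool)
        entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)
        entropy[labeled_idx] = -np.inf  # never re-select a labeled row
        candidates = np.argsort(entropy)[-batch_size:].tolist()

    return float((clf.predict(X_test) == test_df["label"].to_numpy()).mean())

One run labels iterations × batch_size = 200 rows; the experiment repeats this ten times with different seeds.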

The learning curves overlap almost perfectly. Final test accuracies averaged over 10 repeats:

| Data Labeling Method | Final Accuracy (mean ± std) |
| --- | --- |
| Human annotation (ground truth) | 80.6% ± 1.0% |
| LLM annotation (everyrow) | 80.7% ± 0.8% |

The LLM oracle is within noise of the ground truth baseline. The LLM agreed with ground truth labels 96.1% ± 1.6% of the time. Roughly 1 in 25 labels disagrees, but that does not hurt the downstream classifier.

The low cost ($0.26 per run) comes from using EffortLevel.LOW, which selects a small, fast model without web research. For more ambiguous tasks, use EffortLevel.MEDIUM or EffortLevel.HIGH for higher quality labels.
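
For instance, raising the oracle's label quality is a one-argument change to the agent_map call above:

result = await agent_map(
    session=session,
    task="Classify this text into exactly one DBpedia ontology category.",
    input=texts_df[["text"]],
    response_model=DBpediaClassification,
    effort_level=EffortLevel.MEDIUM,  # higher-quality labels at higher cost
)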

The full pipeline is available as a companion notebook on Kaggle. See also the full blog post.


Built with everyrow. Related guides: Classify DataFrame Rows (label data at scale), Deduplicate Training Data (clean ML datasets before training).