FutureSearch docs · by futuresearch

Scale Deduplication to 20K Rows

Web App

Go to futuresearch.ai/app, upload fda_products.csv, and enter:

Deduplicate this dataset. Two rows are duplicates if they have the same ingredient + same strength + same applicant + same dosage form.

20,000 rows deduplicated in about 22 minutes. 1,922 duplicates removed (9.6% reduction).

Claude.ai

Add the everyrow connector if you haven't already. Then upload fda_products.csv and ask Claude:

Deduplicate this dataset. Two rows are duplicates if they have the same ingredient + same strength + same applicant + same dosage form.

20,000 rows deduplicated in about 22 minutes. 1,922 duplicates removed (9.6% reduction).

Claude Code

Claude Code handles deduplication of a few hundred rows natively. Scaling to 20,000 rows needs a different approach: embeddings and clustering narrow the search space first, so LLM calls target only the ambiguous pairs instead of all ~200 million possible row pairs.

Here, we get Claude Code to deduplicate 20,000 FDA drug product records, using a funnel of embeddings, clustering, and targeted LLM calls.

| Metric | Value |
| --- | --- |
| Input rows | 20,000 |
| Unique after dedupe | 18,078 |
| Duplicates removed | 1,922 (9.6%) |
| Time | 22.5 minutes |
| Cost | $26.11 |

Add everyrow to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

Download fda_products.csv (20,000 rows from the FDA Drugs@FDA database with ingredient, strength, applicant, and dosage form columns). Tell Claude:

Deduplicate fda_products.csv. Two rows are duplicates if they have the same
ingredient + same strength + same applicant + same dosage form.

Claude calls everyrow's dedupe MCP tool:

Tool: everyrow_dedupe
├─ equivalence_relation: "Same ingredient + same strength + same applicant + same dosage form = duplicate"
└─ input_csv: "/Users/you/fda_products.csv"

→ Submitted: 20,000 rows for deduplication.
  Session: https://futuresearch.ai/sessions/71e68a7f-a856-43ba-8080-89e4093afb1c
  Task ID: 71e6...

Tool: everyrow_progress
├─ task_id: "71e6..."
→ Running: 0/20000 complete (60s elapsed)

...

Tool: everyrow_progress
→ Completed: 20000/20000 (0 failed) in 1350s.

Tool: everyrow_results
├─ task_id: "71e6..."
├─ output_path: "/Users/you/fda_deduplicated.csv"
→ Saved 20000 rows to /Users/you/fda_deduplicated.csv

20,000 rows deduplicated in 22.5 minutes for $26.11 ($1.31 per 1,000 rows). View the session.

| Cluster | Members | Pattern |
| --- | --- | --- |
| Oxytocin / Fresenius Kabi | 3 | Different package sizes: 10/100/300 USP units, same concentration |
| Gadodiamide / GE Healthcare | 3 | Different volumes: 287mg/mL in bulk vs 50mL vs 100mL |
| Diazepam / Hikma | 3 | Strength formatting: "50MG/10ML (5MG/ML)" vs "5MG/ML" |
| Acyclovir 800MG / Teva | 3 | Company variants: TEVA, IVAX SUB TEVA PHARMS, TEVA PHARMS |

The pipeline catches semantic duplicates across strength formatting variants, company name variations, and minor formatting differences. At 20,000 rows, precision is 1.000 (zero false merges) and recall is 0.992: the system only merges when it's confident.
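To see why exact string matching misses these clusters, consider a pair modeled on the Diazepam row above. The values here are invented for illustration; `pandas.DataFrame.drop_duplicates` keeps both rows because the strength strings differ, even though they describe the same product:

```python
import pandas as pd

# Toy rows modeled on the Diazepam cluster above (illustrative values only).
df = pd.DataFrame({
    "ingredient": ["DIAZEPAM", "DIAZEPAM"],
    "strength": ["50MG/10ML (5MG/ML)", "5MG/ML"],
    "applicant": ["HIKMA", "HIKMA"],
    "dosage_form": ["INJECTABLE", "INJECTABLE"],
})

# Exact string matching sees two distinct rows...
exact = df.drop_duplicates()
print(len(exact))  # 2 -- the formatting variant survives

# ...even though both describe the same 5 mg/mL product, which is the
# kind of equivalence a semantic (LLM) criterion is meant to capture.
```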

LLM-powered deduplication gives you semantic understanding that string matching can't, but naive pairwise comparison is quadratic. At 20,000 rows that's 200 million pairs. Everyrow's dedupe pipeline uses a funnel of embeddings, clustering, and targeted LLM calls to keep cost linear and accuracy high.
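Everyrow's internal funnel isn't spelled out here, but the idea behind narrowing candidates can be sketched with a toy blocking example (the sample rows and the exact-key blocking are invented for this sketch; the real pipeline uses embeddings and clustering rather than an exact key):

```python
import math
from collections import defaultdict

rows = [
    {"id": 1, "ingredient": "OXYTOCIN", "applicant": "FRESENIUS KABI"},
    {"id": 2, "ingredient": "OXYTOCIN", "applicant": "FRESENIUS KABI"},
    {"id": 3, "ingredient": "DIAZEPAM", "applicant": "HIKMA"},
    {"id": 4, "ingredient": "ACYCLOVIR", "applicant": "TEVA"},
]

# Naive pairwise comparison is quadratic: at 20,000 rows that is
print(math.comb(20_000, 2))  # 199,990,000 pairs

# Blocking: only compare rows that share a cheap key (here, ingredient).
# Embedding + clustering funnels achieve the same effect with fuzzier keys:
# candidate pairs per block, not global pairs.
blocks = defaultdict(list)
for r in rows:
    blocks[r["ingredient"]].append(r)

candidates = [
    (a["id"], b["id"])
    for block in blocks.values()
    for i, a in enumerate(block)
    for b in block[i + 1:]
]
print(candidates)  # [(1, 2)] -- only one pair left for the expensive LLM check
```

Cost stays roughly linear because each row lands in a small block, so the number of LLM-checked pairs grows with the row count rather than its square.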

FDA Drug Products — Deduplication at Scale

Error rates stay near zero as scale increases. Cost and LLM calls scale linearly. Runtime is under 5 minutes up to 10,000 rows and 25 minutes at 20,000.

Python SDK

pip install everyrow
export EVERYROW_API_KEY=your_key_here  # Get one at futuresearch.ai/api-key
import asyncio
import pandas as pd
from everyrow.ops import dedupe

data = pd.read_csv("fda_products.csv")

async def main():
    result = await dedupe(
        input=data,
        equivalence_relation=(
            "Same ingredient + same strength + same applicant "
            "+ same dosage form = duplicate"
        ),
    )

    clean = result.data[result.data["selected"]]
    print(f"Reduced {len(data)} to {len(clean)} unique records")
    clean.to_csv("deduplicated.csv", index=False)

asyncio.run(main())
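After a run, a quick pandas check confirms the reduction. This sketch mimics the shape of `result.data` from the example above (a boolean `selected` column marking the kept row in each cluster); the sample values are invented:

```python
import pandas as pd

# Mimic the result.data shape from the SDK example: a boolean "selected"
# column marking the row kept from each duplicate cluster.
result_data = pd.DataFrame({
    "ingredient": ["OXYTOCIN", "OXYTOCIN", "DIAZEPAM"],
    "selected": [True, False, True],
})

clean = result_data[result_data["selected"]]
removed = len(result_data) - len(clean)
print(f"Removed {removed} duplicates ({removed / len(result_data):.1%})")
# Removed 1 duplicates (33.3%)
```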

Cost stays between $0.90 and $1.50 per 1,000 rows across all datasets tested:

| Dataset | Entity | Rows | Dup % | F1 | Cost | $/1k rows |
| --- | --- | --- | --- | --- | --- | --- |
| Small Companies | company | 200 | 8% | 1.000 | $0.18 | $0.90 |
| Medium People | person | 1,000 | 20% | 0.994 | $1.18 | $1.18 |
| Medium Transactions | transaction | 1,000 | 20% | 0.945 | $1.41 | $1.41 |
| Large Companies (Messy) | company | 3,000 | 10% | 0.974 | $3.21 | $1.07 |
| Large Products (FDA) | product | 5,000 | 5% | 0.997 | $6.37 | $1.27 |
| Company Names | company | 8,628 | 10% | 0.976 | $12.58 | $1.46 |
| FDA Products | product | 20,000 | 10% | 0.996 | $22.40 | $1.12 |

Rough formula: $1 to $1.50 per 1,000 rows, depending on data complexity.
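The rough formula is simple enough to sketch as a one-liner. The helper name and default rate are invented here; the rate range comes from the table above, and the FDA Products row ($1.12/1k at 20,000 rows) serves as a check:

```python
def estimate_dedupe_cost(rows: int, rate_per_1k: float = 1.25) -> float:
    """Ballpark cost using the observed $0.90-$1.50 per 1,000 rows range."""
    return rows / 1_000 * rate_per_1k

# The FDA Products row above: 20,000 rows at $1.12 per 1k rows.
print(f"${estimate_dedupe_cost(20_000, 1.12):.2f}")  # $22.40
```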

Every deduplication system makes two kinds of mistakes. Over-merging (low precision) is data loss: distinct entities incorrectly grouped together. Under-merging (low recall) means your data stays messy, but nothing is lost. At 20,000 rows, precision is 1.000 (zero false merges) while recall is 0.992 (8 of ~2,000 duplicates were missed). The system only merges when it's confident.
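Precision and recall over duplicate pairs are straightforward to compute if you have labeled pairs. This is a toy illustration with invented pairs, not the evaluation harness used for the numbers above:

```python
def pair_metrics(predicted: set, truth: set) -> tuple[float, float]:
    """Precision/recall over unordered duplicate pairs."""
    tp = len(predicted & truth)  # pairs merged correctly
    precision = tp / len(predicted) if predicted else 1.0
    recall = tp / len(truth) if truth else 1.0
    return precision, recall

truth = {(1, 2), (3, 4), (5, 6)}   # labeled duplicate pairs
predicted = {(1, 2), (3, 4)}       # pairs the pipeline merged

p, r = pair_metrics(predicted, truth)
print(p, r)  # precision 1.0 (no false merges), recall ~0.667 (one pair missed)
```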

The equivalence_relation parameter is the single most important input. Be specific and enumerate the fields that must match:

# Good: mentions all matching fields
equivalence_relation="Same ingredient + same strength + same applicant + same dosage form = duplicate"

# Less good: vague
equivalence_relation="Same drug"

Built with everyrow. See the dedupe documentation for more options. Related guides: Resolve Duplicate Entities (500-row CRM walkthrough), Deduplicate Training Data (semantic dedup for ML datasets).