Deduplicate Training Data

Go to futuresearch.ai/app, upload a CSV of 3,000 sentences from the PAWS paraphrase dataset, and enter:

Deduplicate this dataset of sentences. Two sentences are duplicates if they convey the same meaning, even if phrased differently. This includes paraphrases, minor grammatical variations, and sentences about the same fact that would be redundant in a training set. They are NOT duplicates if they describe different facts, even if they share many words.

973 duplicates found and removed (32.4% reduction). Results take about 28 minutes.

Add the everyrow connector if you haven't already. Then upload a CSV of 3,000 sentences from the PAWS paraphrase dataset and ask Claude:

Deduplicate this dataset of sentences. Two sentences are duplicates if they convey the same meaning, even if phrased differently. This includes paraphrases, minor grammatical variations, and sentences about the same fact that would be redundant in a training set. They are NOT duplicates if they describe different facts, even if they share many words.

973 duplicates found and removed (32.4% reduction). Results take about 28 minutes.

Claude Code handles exact deduplication natively by writing Python to hash and compare rows. But deduplicating 3,000 sentences whose duplicates are paraphrases requires evaluating each candidate pair for semantic equivalence: "The cat sat on the mat" and "On the mat, the cat was sitting" share almost no exact n-grams but mean the same thing.
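To see why hashing falls short, here is a minimal sketch (plain pandas, using the example pair above plus one true exact duplicate) showing that exact-match deduplication leaves paraphrases untouched:

```python
import pandas as pd

# Two paraphrases plus one true exact duplicate
df = pd.DataFrame({"text": [
    "The cat sat on the mat",
    "On the mat, the cat was sitting",
    "The cat sat on the mat",
]})

# Exact-match dedupe compares raw strings only
deduped = df.drop_duplicates(subset="text")

print(len(deduped))  # 2 — the exact copy is dropped, the paraphrase survives
```

Only the byte-identical row is removed; catching the paraphrase requires a semantic comparison of each pair.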

Here, we get Claude Code to deduplicate 3,000 sentences from the PAWS paraphrase dataset, finding sentences that mean the same thing even when phrased differently.

| Metric | Value |
|---|---|
| Input rows | 3,000 |
| Unique after dedupe | 2,027 |
| Duplicates removed | 973 (32.4%) |
| Time | 28.2 minutes |
| Cost | $13.27 |

Add everyrow to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

The dataset is 3,000 sentences extracted from the PAWS paraphrase dataset, where many sentence pairs convey the same fact with different word order. With the CSV in your working directory, tell Claude:

Deduplicate this dataset of sentences. Two sentences are duplicates if they
convey the same meaning, even if phrased differently. This includes paraphrases,
minor grammatical variations, and sentences about the same fact that would be
redundant in a training set. They are NOT duplicates if they describe different
facts, even if they share many words.

Claude calls everyrow's dedupe MCP tool with your equivalence relation:

Tool: everyrow_dedupe
├─ equivalence_relation: "Two sentences are duplicates if they convey the same meaning..."
└─ input_csv: "/Users/you/paws_sentences.csv"

→ Submitted: 3,000 rows for deduplication.
  Session: https://futuresearch.ai/sessions/8eac50da-f318-49cf-9f00-67b35700eb8a
  Task ID: 8eac...

Tool: everyrow_progress
├─ task_id: "8eac..."
→ Running: 0/3000 complete (30s elapsed)

...

Tool: everyrow_progress
→ Completed: 3000/3000 (0 failed) in 1691s.

Tool: everyrow_results
├─ task_id: "8eac..."
├─ output_path: "/Users/you/paws_deduplicated.csv"
→ Saved 3000 rows to /Users/you/paws_deduplicated.csv

973 duplicates found and removed (32.4% reduction). View the session.

Examples of duplicates the system found:

| Cluster | Variants |
|---|---|
| Nick Smith and Duncan become friends | "Chris Egan (Nick Smith) settles down with his family..." / "Nick Smith (Chris Egan) settles in Summer Bay..." |
| WORHP software library description | "WORHP, also referred to as eNLP (European NLP Solver)..." / "WORHP, also referred to by ESA as eNLP..." |
| Baseball series in Havana | "Another series was played in Havana between Cincinnati Reds..." / "In Havana, another series was played between..." |

These are semantic duplicates that exact-match deduplication would miss entirely. The sentences have different word order, name variations, and grammatical structure, but they describe the same facts and would be redundant in a training set. The output includes equivalence_class_id, equivalence_class_name, and selected columns. Filter to selected == True to get the deduplicated dataset.
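As a quick sketch of working with that output (the column names come from above; the rows here are invented stand-ins for the saved CSV):

```python
import pandas as pd

# Stand-in for the saved output CSV: same schema as the dedupe output
out = pd.DataFrame({
    "text": ["sentence A", "sentence A, reworded", "sentence B"],
    "equivalence_class_id": [0, 0, 1],
    "equivalence_class_name": ["fact A", "fact A", "fact B"],
    "selected": [True, False, True],
})

# Keep the one canonical row per equivalence class
clean = out[out["selected"]]
print(len(clean))  # 2 — one row per cluster
```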

Near-duplicates in ML training data cause data leakage, overfitting, and memorization. The everyrow SDK finds and removes semantically similar examples that aren't exact matches: paraphrases, reformatted text, or records conveying the same information with different words.

| Metric | Value |
|---|---|
| Input rows | 3,000 |
| Unique after dedupe | 1,928 |
| Duplicates removed | 1,072 (35.7%) |
| Time | 5.3 minutes |
| Cost | $4.21 |
| Session | view |

Standard deduplication with pandas.drop_duplicates() only catches exact matches. MinHash/LSH (datasketch) works for near-exact text but not semantic similarity. Libraries like dedupe.io require labeled training data. None handle "same meaning, different words" without manual setup.
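To illustrate why token-overlap methods under-score paraphrases, here is a pure-Python word-bigram Jaccard similarity (a stand-in for the shingle overlap that MinHash/LSH estimates) applied to the example pair from earlier:

```python
import re

def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over word bigrams (the overlap MinHash approximates)."""
    def bigrams(s: str) -> set:
        toks = re.findall(r"[a-z]+", s.lower())
        return set(zip(toks, toks[1:]))
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)

# Paraphrases with identical meaning score well below near-duplicate thresholds
score = bigram_jaccard("The cat sat on the mat",
                       "On the mat, the cat was sitting")
print(score)  # 0.375
```

A near-duplicate pipeline tuned to catch reformatted text (Jaccard ≳ 0.8) would pass this pair through, even though the two sentences are redundant for training purposes.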

pip install everyrow datasets
export EVERYROW_API_KEY=your_key_here  # Get one at futuresearch.ai/api-key
import asyncio
import pandas as pd
from datasets import load_dataset
from everyrow.ops import dedupe

# Load a dataset with potential semantic duplicates
# Using PAWS - paraphrase pairs from Wikipedia
dataset = load_dataset(
    "google-research-datasets/paws",
    "labeled_final",
    split="train"
)

# Extract sentences into a dataframe
sentences = []
seen = set()
for row in dataset:
    for s in [row["sentence1"], row["sentence2"]]:
        if s not in seen:
            seen.add(s)
            sentences.append(s)
        if len(sentences) >= 3000:
            break
    if len(sentences) >= 3000:
        break

df = pd.DataFrame({"text": sentences})
print(f"Training examples: {len(df)}")

async def dedupe_training_data():
    result = await dedupe(
        input=df,
        equivalence_relation="""
            Two sentences are duplicates if they convey the same meaning,
            even if phrased differently. This includes:
            - Paraphrases (same meaning, different words or word order)
            - Minor grammatical variations
            - Sentences about the same fact that would be redundant

            NOT duplicates if they describe different facts, even if
            they share many words.
        """,
    )

    # Get deduplicated dataset
    clean_df = result.data[result.data["selected"] == True]
    print(f"After deduplication: {len(clean_df)}")

    return clean_df

clean_data = asyncio.run(dedupe_training_data())

The output includes three columns added to your data: equivalence_class_id groups duplicates together, equivalence_class_name gives each cluster a readable label, and selected marks the canonical example to keep. Filter to selected == True to get your deduplicated dataset.
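A short sketch of inspecting those clusters after the run (a toy frame with invented values, mimicking the three appended columns):

```python
import pandas as pd

# Toy frame mimicking the dedupe output schema (values invented)
result = pd.DataFrame({
    "text": ["variant 1", "variant 2", "variant 3", "unrelated"],
    "equivalence_class_id": [0, 0, 0, 1],
    "equivalence_class_name": ["same fact", "same fact", "same fact", "other"],
    "selected": [True, False, False, True],
})

# Cluster sizes: how many variants each duplicate group contains
sizes = result.groupby("equivalence_class_id").size()
print(sizes.to_dict())  # {0: 3, 1: 1}

# Sanity check: exactly one canonical (selected) row per cluster
per_cluster = result.groupby("equivalence_class_id")["selected"].sum()
assert (per_cluster == 1).all()
```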

Here are examples of duplicates the system found:

Cluster: "Glenn Howard's Ontario Championship win"
  ✓ Glenn Howard won the Ontario Championship for the 17th time as either third or skip.
    For the 17th time the Glenn Howard won the Ontario Championship as third or skip.

Cluster: "Chananian village location"
  ✓ Chananian is a village in Azad Kashmir, the Leepa Valley, Hattian Bala District, Pakistan.
    Chananian is a village in Leepa Valley, Hattian Bala District of Azad Kashmir, Pakistan.
    Chananian is a village in the Leepa Valley, Hattian Bala district of Azad Kashmir, Pakistan.

Cluster: "Person's birth and death details"
  ✓ David Spurlock was born on 18 November 1959 in Dallas, Texas, and moved to Memphis...
    J. David Spurlock was born on November 18, 1959 in Dallas, Texas. He moved to Memphis...

The 35.7% reduction rate is typical for datasets that weren't explicitly deduplicated during creation. The cost scales linearly: roughly $1.40 per 1,000 rows for text data of this complexity.
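Under that linear-scaling assumption, a back-of-the-envelope estimator (the $1.40/1,000-row figure is from this run, not a pricing guarantee):

```python
def estimate_dedupe_cost(rows: int, usd_per_1k: float = 1.40) -> float:
    """Rough cost estimate assuming linear scaling (~$1.40 per 1,000 rows observed here)."""
    return round(rows / 1000 * usd_per_1k, 2)

print(estimate_dedupe_cost(3_000))   # 4.2 — close to the $4.21 observed
print(estimate_dedupe_cost(20_000))  # 28.0
```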


Built with everyrow. See the dedupe documentation for more options including equivalence relation design.