Deduplicate Training Data

Near-duplicates in ML training data can cause data leakage, overfitting, and memorization. Exact-match deduplication misses paraphrases: "The cat sat on the mat" and "On the mat, the cat was sitting" are different strings, so an exact-match filter keeps both even though they mean the same thing.
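
To make the limitation concrete, here is a minimal sketch of a typical exact-match filter: it lowercases, collapses whitespace, and hashes each sentence. The helper names normalize and exact_dedupe are illustrative assumptions, not part of any library mentioned here.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, a common exact-match normalization."""
    return " ".join(text.lower().split())


def exact_dedupe(sentences: list[str]) -> list[str]:
    """Keep the first sentence seen for each distinct normalized hash."""
    seen: set[str] = set()
    kept: list[str] = []
    for s in sentences:
        digest = hashlib.sha256(normalize(s).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(s)
    return kept


sentences = [
    "The cat sat on the mat",
    "the cat  sat on the mat",          # identical after normalization
    "On the mat, the cat was sitting",  # paraphrase, but a different string
]
kept = exact_dedupe(sentences)
print(len(kept))
```

The filter drops the case and whitespace variant but keeps the paraphrase, which is the gap semantic deduplication is meant to close.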

Here, we deduplicate 3,000 sentences from the PAWS paraphrase dataset, finding sentences that mean the same thing even when phrased differently.

Metric	Value
Input rows	3,000
Unique after dedupe	2,027
Duplicates removed	973 (32.4%)
Time	28.2 minutes
Cost	$13.27
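
The derived figures in this table follow from simple arithmetic on the raw numbers. The sketch below recomputes them, assuming the 1,691 seconds reported in the run's progress log as the elapsed time; the variable names are illustrative.

```python
input_rows = 3000
unique_rows = 2027
elapsed_seconds = 1691   # as reported in the run's progress log
total_cost_usd = 13.27

removed = input_rows - unique_rows
removed_pct = 100 * removed / input_rows
minutes = elapsed_seconds / 60
cost_per_1000 = total_cost_usd / input_rows * 1000

print(f"removed={removed} ({removed_pct:.1f}%)")
print(f"time={minutes:.1f} min")
print(f"cost per 1,000 rows=${cost_per_1000:.2f}")
```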

Add FutureSearch to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

With the CSV of sentences in your working directory, tell Claude:

Deduplicate this dataset of sentences. Two sentences are duplicates if they
convey the same meaning, even if phrased differently. This includes paraphrases,
minor grammatical variations, and sentences about the same fact that would be
redundant in a training set. They are NOT duplicates if they describe different
facts, even if they share many words.

Claude calls FutureSearch's dedupe MCP tool:

Tool: futuresearch_dedupe
├─ equivalence_relation: "Two sentences are duplicates if they convey the same meaning..."
└─ input_csv: "/Users/you/paws_sentences.csv"

→ Submitted: 3,000 rows for deduplication.
  Session: https://futuresearch.ai/sessions/8eac50da-f318-49cf-9f00-67b35700eb8a
  Task ID: 8eac...

Tool: futuresearch_progress
├─ task_id: "8eac..."
→ Running: 0/3000 complete (30s elapsed)

...

Tool: futuresearch_progress
→ Completed: 3000/3000 (0 failed) in 1691s.

Tool: futuresearch_results
├─ task_id: "8eac..."
├─ output_path: "/Users/you/paws_deduplicated.csv"
→ Saved 3000 rows to /Users/you/paws_deduplicated.csv

View the session.

Alternatively, add the FutureSearch connector if you haven't already, upload a CSV of 3,000 sentences from the PAWS paraphrase dataset, and give Claude the same prompt as above.

You can also go to futuresearch.ai/app, upload the same CSV, and enter the same prompt.

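To run the same deduplication from Python, install the SDK and set your API key:

pip install futuresearch datasets
export FUTURESEARCH_API_KEY=your_key_here  # Get one at futuresearch.ai/app/api-key

Then collect 3,000 unique sentences from PAWS and pass them to dedupe:
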
import asyncio
import pandas as pd
from datasets import load_dataset
from futuresearch.ops import dedupe

dataset = load_dataset(
    "google-research-datasets/paws",
    "labeled_final",
    split="train"
)

sentences = []
seen = set()
for row in dataset:
    for s in [row["sentence1"], row["sentence2"]]:
        if s not in seen:
            seen.add(s)
            sentences.append(s)
        if len(sentences) >= 3000:
            break
    if len(sentences) >= 3000:
        break

df = pd.DataFrame({"text": sentences})

async def dedupe_training_data():
    result = await dedupe(
        input=df,
        equivalence_relation="""
            Two sentences are duplicates if they convey the same meaning,
            even if phrased differently. This includes:
            - Paraphrases (same meaning, different words or word order)
            - Minor grammatical variations
            - Sentences about the same fact that would be redundant

            NOT duplicates if they describe different facts, even if
            they share many words.
        """,
    )
    clean_df = result.data[result.data["selected"] == True]
    return clean_df

clean_data = asyncio.run(dedupe_training_data())

Results

Examples of duplicates the system found:

Cluster	Variants
Nick Smith and Duncan become friends	"Chris Egan (Nick Smith) settles down with his family..." / "Nick Smith (Chris Egan) settles in Summer Bay..."
WORHP software library description	"WORHP, also referred to as eNLP (European NLP Solver)..." / "WORHP, also referred to by ESA as eNLP..."
Baseball series in Havana	"Another series was played in Havana between Cincinnati Reds..." / "In Havana, another series was played between..."

These are semantic duplicates that exact-match deduplication would miss entirely. The sentences have different word order, name variations, and grammatical structure, but they describe the same facts and would be redundant in a training set.

The output includes equivalence_class_id, equivalence_class_name, and selected columns. Filter to selected == True to get the deduplicated dataset. Cost scales linearly with row count: this run's $13.27 for 3,000 rows works out to roughly $4.42 per 1,000 rows for text data of this complexity.
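
As a rough sketch of how the output columns can be used, the example below filters to the selected rows and groups sentences by equivalence_class_id using only the standard library. The sample rows are hypothetical, shortened stand-ins for real output, and the assumption that the CSV stores the selected column as "True"/"False" text is not confirmed by the source.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the shape described above; real rows come from the output CSV.
sample_csv = """text,equivalence_class_id,equivalence_class_name,selected
"Another series was played in Havana.",0,Baseball series in Havana,True
"In Havana, another series was played.",0,Baseball series in Havana,False
"WORHP is a software library.",1,WORHP software library description,True
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Assumption: the boolean column is serialized as the text "True" or "False".
kept = [r for r in rows if r["selected"].strip().lower() == "true"]

# Group every sentence by its equivalence class to inspect what was merged.
clusters = defaultdict(list)
for r in rows:
    clusters[r["equivalence_class_id"]].append(r["text"])

# Classes with more than one member are the ones where duplicates were found.
duplicate_clusters = {cid: texts for cid, texts in clusters.items() if len(texts) > 1}

print(len(rows), len(kept))
print(duplicate_clusters)
```

Reviewing the multi-member classes before discarding the unselected rows is a cheap way to check that the equivalence relation is neither too strict nor too loose for your data.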

Built with FutureSearch. See the dedupe documentation for more options including equivalence relation design.
