FutureSearch Logofuturesearch
  • Blog
  • Solutions
  • Research
  • Docs
  • Evals
  • Company
  • Get Researchers
FutureSearch Logo

General inquiry? You can reach us at hello@futuresearch.ai.

Company

Team & CareersPressPrivacy PolicyTerms of Service

Developers

SDK DocsAPI ReferenceCase StudiesGitHub

Follow Us

X (Twitter)@dschwarz26LinkedIn
FutureSearchdocs
Your research team
Installation
  • All install methods
  • Claude.ai
  • Claude Cowork
  • Claude Code
  • Web App
  • Python SDK
  • Skill
  • MCP Server
Reference
  • API Key
  • classify
  • dedupe
  • forecast
  • merge
  • rank
  • agent_map
  • screen
  • Progress Monitoring
  • Chaining Operations
Guides
  • LLM-Powered Data Labeling
  • Add a Column via Web Research
  • Classify and Label Rows
  • Deduplicate Training Data
  • Filter a Dataset Intelligently
  • Join Tables Without Shared Keys
  • Rank Data by External Metrics
  • Resolve Duplicate Entities
  • Scale Deduplication to 20K Rows
Case Studies
  • Deduplicate Contact Lists
  • Deduplicate CRM Records
  • Enrich Contacts with Company Data
  • Fuzzy Match Across Tables
  • Link Records Across Medical Datasets
  • LLM Cost vs. Accuracy
  • Merge Costs and Speed
  • Merge Thousands of Records
  • Multi-Stage Lead Qualification
  • Research and Rank Web Data
  • Run 10,000 LLM Web Research Agents
  • Score Cold Leads via Web Research
  • Score Leads from Fragmented Data
  • Screen 10,000 Rows
  • Screen Job Listings
  • Screen Stocks by Economic Sensitivity
  • Screen Stocks by Investment Thesis
FutureSearchby futuresearch
by futuresearch

Dedupe

dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. The duplicate criterion is semantic and LLM-powered: agents reason over the data and, when needed, search the web for external information to establish equivalence. This handles abbreviations, name variations, job changes, and entity relationships that no string similarity threshold can capture.

Examples

from everyrow.ops import dedupe

result = await dedupe(
    input=crm_data,
    equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())

The equivalence_relation is natural language. Be as specific as you need:

result = await dedupe(
    input=researchers,
    equivalence_relation="""
        Two rows are duplicates if they're the same person, even if:
        - They changed jobs (different org/email)
        - Name is abbreviated (A. Smith vs Alex Smith)
        - There are typos (Naomi vs Namoi)
        - They use a nickname (Bob vs Robert)
    """,
)
print(result.data.head())

Strategies

Control what happens after clusters are identified using the strategy parameter:

select (default)

Picks the best representative from each cluster. Three columns are added:

  • equivalence_class_id — rows with the same ID are duplicates of each other
  • equivalence_class_name — human-readable label for the cluster
  • selected — True for the canonical record in each cluster
result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="select",
    strategy_prompt="Prefer the record with the most complete contact information",
)
deduped = result.data[result.data["selected"] == True]

identify

Cluster only — no selection or combining. Useful when you want to review clusters before deciding what to do.

  • equivalence_class_id — rows with the same ID are duplicates of each other
  • equivalence_class_name — human-readable label for the cluster
result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="identify",
)

combine

Synthesizes a single combined row per cluster, merging the best information from all duplicates. Original rows are marked selected=False, and new combined rows are added with selected=True.

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="combine",
    strategy_prompt="For each field, keep the most recent and complete value",
)
combined = result.data[result.data["selected"] == True]

What you get back

Three columns added to your data (when using select or combine strategy):

  • equivalence_class_id — rows with the same ID are duplicates of each other
  • equivalence_class_name — human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
  • selected — True for the canonical record in each cluster (usually the most complete one)

To get just the deduplicated rows:

deduped = result.data[result.data["selected"] == True]

Example

Input:

name org email
A. Butoi Rycolab a.butoi@edu
Alexandra Butoi Ryoclab —
Namoi Saphra — nsaphra@alumni
Naomi Saphra Harvard nsaphra@harvard.edu

Output (selected rows only):

name org email
Alexandra Butoi Rycolab a.butoi@edu
Naomi Saphra Harvard nsaphra@harvard.edu

Parameters

Name Type Description
input DataFrame Data with potential duplicates
equivalence_relation str What makes two rows duplicates
strategy str "identify", "select" (default), or "combine"
strategy_prompt str Optional instructions for selection or combining
session Session Optional, auto-created if omitted

Performance

Rows Time Cost
200 ~90 sec ~$0.40
500 ~2 min ~$1.67
2,000 ~8 min ~$7

Via MCP

MCP tool: everyrow_dedupe

Parameter Type Description
csv_path string Path to input CSV file
equivalence_relation string What makes two rows duplicates

Related docs

Guides

  • Remove Duplicates from ML Training Data
  • Resolve Duplicate Entities

Case Studies

  • Dedupe CRM Company Records

Blog posts

  • CRM Deduplication
  • Researcher Deduplication