Dedupe

dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. Because the duplicate criterion is semantic and LLM-powered, it handles abbreviations, name variations, job changes, and entity relationships that a string-similarity threshold cannot capture.

Examples

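from futuresearch.ops import dedupe

# crm_data is a DataFrame of CRM records. dedupe is a coroutine, so call it
# from an async function or a notebook that supports top-level await.
result = await dedupe(
    input=crm_data,
    equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())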

The equivalence_relation is natural language. Be as specific as you need:

result = await dedupe(
    input=researchers,
    equivalence_relation="""
        Two rows are duplicates if they're the same person, even if:
        - They changed jobs (different org/email)
        - Name is abbreviated (A. Smith vs Alex Smith)
        - There are typos (Naomi vs Namoi)
        - They use a nickname (Bob vs Robert)
    """,
)
print(result.data.head())
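When the tolerated differences live in code or config, the relation string can be assembled programmatically. The following is a minimal sketch: `build_equivalence_relation` is a hypothetical helper, not part of futuresearch, and its output is simply a string to pass as `equivalence_relation`.

```python
def build_equivalence_relation(entity: str, exceptions: list[str]) -> str:
    """Assemble a natural-language equivalence relation from tolerated differences.

    Hypothetical helper for illustration; not part of the futuresearch API.
    """
    lines = [f"Two rows are duplicates if they're the same {entity}, even if:"]
    lines += [f"- {rule}" for rule in exceptions]
    return "\n".join(lines)


relation = build_equivalence_relation(
    "person",
    [
        "They changed jobs (different org/email)",
        "Name is abbreviated (A. Smith vs Alex Smith)",
        "There are typos (Naomi vs Namoi)",
        "They use a nickname (Bob vs Robert)",
    ],
)
print(relation)
```

Keeping the rules in a list makes it easier to add or drop one and rerun the dedupe to see how the clusters change.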

Strategies

Control what happens after clusters are identified using the strategy parameter:

`select` (default)

Picks the best representative from each cluster. Three columns are added:

equivalence_class_id: rows with the same ID are duplicates of each other
equivalence_class_name: human-readable label for the cluster
selected: True for the canonical record in each cluster

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="select",
    strategy_prompt="Prefer the record with the most complete contact information",
)
deduped = result.data[result.data["selected"] == True]

`identify`

Cluster only, with no selection or combining. Useful when you want to review clusters before deciding what to do.

equivalence_class_id: rows with the same ID are duplicates of each other
equivalence_class_name: human-readable label for the cluster

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="identify",
)
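To review the clusters, it can help to look only at rows whose cluster contains more than one member. The sketch below uses pandas on a mock DataFrame standing in for `result.data`; the rows, the integer cluster IDs, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Mock of what result.data might look like after strategy="identify".
# The rows and cluster IDs are invented for illustration.
clustered = pd.DataFrame(
    {
        "name": ["Acme Inc", "ACME Incorporated", "Globex", "Initech LLC"],
        "equivalence_class_id": [0, 0, 1, 2],
        "equivalence_class_name": ["Acme Inc", "Acme Inc", "Globex", "Initech LLC"],
    }
)

# Size of the cluster each row belongs to, aligned with the original rows.
cluster_sizes = clustered.groupby("equivalence_class_id")["name"].transform("size")

# Keep only rows that share a cluster with at least one other row.
to_review = clustered[cluster_sizes > 1].sort_values("equivalence_class_id")
print(to_review)
```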

`combine`

Synthesizes a single combined row per cluster, merging the best information from all duplicates. Original rows are marked selected=False, and new combined rows are added with selected=True.

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="combine",
    strategy_prompt="For each field, keep the most recent and complete value",
)
combined = result.data[result.data["selected"] == True]
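Because the original rows stay in the output with selected=False, each combined row can be audited against the rows it was built from. A minimal pandas sketch on mock data follows; the values, the cluster ID, and the exact shape of the combined row are assumptions, not output from the library.

```python
import pandas as pd

# Mock of result.data after strategy="combine". Values, IDs, and the shape
# of the combined row are assumptions made for illustration.
combine_output = pd.DataFrame(
    {
        "name": ["A. Butoi", "Alexandra Butoi", "Alexandra Butoi"],
        "org": ["Rycolab", "Ryoclab", "Rycolab"],
        "email": ["a.butoi@edu", None, "a.butoi@edu"],
        "equivalence_class_id": [0, 0, 0],
        "selected": [False, False, True],
    }
)

merged_rows = combine_output[combine_output["selected"]]
source_rows = combine_output[~combine_output["selected"]]

# Pair each original row with the combined row from the same cluster.
audit = source_rows.merge(
    merged_rows,
    on="equivalence_class_id",
    suffixes=("_original", "_combined"),
)
print(audit[["name_original", "name_combined", "org_original", "org_combined"]])
```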

What you get back

Three columns are added to your data when using the select or combine strategy (identify adds only the first two):

equivalence_class_id: rows with the same ID are duplicates of each other
equivalence_class_name: human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
selected: True for the canonical record in each cluster (usually the most complete one)

To get just the deduplicated rows:

deduped = result.data[result.data["selected"] == True]
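Before dropping the unselected rows, a quick sanity check that every cluster has exactly one selected row can catch surprises. This is a sketch on mock data in pandas; the rows and cluster IDs are invented, and the one-per-cluster expectation is inferred from the column description above rather than from a documented guarantee.

```python
import pandas as pd

# Mock of result.data after strategy="select"; rows and IDs are invented.
select_output = pd.DataFrame(
    {
        "name": ["Namoi Saphra", "Naomi Saphra", "A. Butoi", "Alexandra Butoi"],
        "equivalence_class_id": [0, 0, 1, 1],
        "selected": [False, True, False, True],
    }
)

# Number of selected rows in each cluster.
selected_per_cluster = select_output.groupby("equivalence_class_id")["selected"].sum()
bad_clusters = selected_per_cluster[selected_per_cluster != 1]

if not bad_clusters.empty:
    raise ValueError(f"clusters without exactly one selected row: {list(bad_clusters.index)}")

deduped = select_output[select_output["selected"]]
print(len(select_output), "rows in,", len(deduped), "rows out")
```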

Example

Input:

name	org	email
A. Butoi	Rycolab	a.butoi@edu
Alexandra Butoi	Ryoclab
Namoi Saphra		nsaphra@alumni
Naomi Saphra	Harvard	nsaphra@harvard.edu

Output (selected rows only):

name	org	email
Alexandra Butoi	Rycolab	a.butoi@edu
Naomi Saphra	Harvard	nsaphra@harvard.edu

Parameters

Name	Type	Description
`input`	DataFrame	Data with potential duplicates
`equivalence_relation`	str	What makes two rows duplicates
`strategy`	str	`"identify"`, `"select"` (default), or `"combine"`
`strategy_prompt`	str	Optional instructions for selection or combining
`session`	Session	Optional, auto-created if omitted

Performance

Rows	Time	Cost
200	~90 sec	~$0.40
500	~2 min	~$1.67
2,000	~8 min	~$7
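For a rough budget between the measured sizes, the table can be interpolated linearly. The sketch below assumes cost varies smoothly between the three reported points, which the table itself does not establish; `estimate_cost` is a hypothetical helper and its output is a guess, not a price quote.

```python
# (rows, reported cost in USD) copied from the table above.
REPORTED = [(200, 0.40), (500, 1.67), (2000, 7.00)]


def estimate_cost(rows: int) -> float:
    """Rough cost guess by linear interpolation between the reported points.

    Hypothetical helper. Only defined inside the measured range, because the
    table gives no basis for extrapolating beyond it.
    """
    lowest, highest = REPORTED[0][0], REPORTED[-1][0]
    if not lowest <= rows <= highest:
        raise ValueError(f"rows must be between {lowest} and {highest}")
    # Find the pair of neighbouring points that brackets `rows`.
    for (x0, y0), (x1, y1) in zip(REPORTED, REPORTED[1:]):
        if x0 <= rows <= x1:
            return y0 + (rows - x0) * (y1 - y0) / (x1 - x0)
    raise AssertionError("unreachable: range was checked above")


for n in (200, 1000, 2000):
    print(f"{n} rows: ~${estimate_cost(n):.2f}")
```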

Via MCP

MCP tool: futuresearch_dedupe

Parameter	Type	Description
`csv_path`	string	Path to input CSV file
`equivalence_relation`	string	What makes two rows duplicates
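For orientation, this is roughly what a call to the tool looks like at the protocol level. The sketch builds a generic MCP tools/call request (JSON-RPC 2.0) with the two parameters listed above; the CSV path and request id are placeholders, and most MCP clients construct this envelope for you.

```python
import json

# Arguments named in the parameter table above; the CSV path is a placeholder.
arguments = {
    "csv_path": "crm_data.csv",
    "equivalence_relation": "Two entries are duplicates if they represent the same legal entity",
}

# Generic MCP "tools/call" request envelope (JSON-RPC 2.0).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "futuresearch_dedupe", "arguments": arguments},
}
print(json.dumps(request, indent=2))
```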

Related docs

Case Studies

Dedupe CRM Company Records


