Dedupe
dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. The duplicate criterion is semantic and evaluated by an LLM, so it handles abbreviations, name variations, job changes, and entity relationships that string-similarity thresholds tend to miss.
Examples
from futuresearch.ops import dedupe
result = await dedupe(
input=crm_data,
equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())
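The snippets on this page call dedupe with a top-level await, which works in a notebook or async REPL but is a syntax error in a plain script. A minimal sketch of the usual workaround is to wrap the call in a coroutine and hand it to asyncio.run. Here fake_dedupe is a hypothetical stand-in that only mimics the call shape, so the sketch runs without the library; it is not the real function.

```python
import asyncio


# Hypothetical stand-in for `futuresearch.ops.dedupe` (an assumption for this
# sketch). It mimics the keyword arguments only and does no deduplication.
async def fake_dedupe(input, equivalence_relation):
    await asyncio.sleep(0)  # yield to the event loop once, like a real awaitable
    return {"rows": len(input), "relation": equivalence_relation}


async def main():
    crm_data = [{"name": "Acme Inc."}, {"name": "ACME Incorporated"}]
    # `await` is valid here because we are inside a coroutine.
    return await fake_dedupe(
        input=crm_data,
        equivalence_relation="Two entries are duplicates if they represent the same legal entity",
    )


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
print(result["rows"])
```

In a real script, the body of main would call dedupe itself in place of the stand-in.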
The equivalence_relation is natural language. Be as specific as you need:
result = await dedupe(
input=researchers,
equivalence_relation="""
Two rows are duplicates if they're the same person, even if:
- They changed jobs (different org/email)
- Name is abbreviated (A. Smith vs Alex Smith)
- There are typos (Naomi vs Namoi)
- They use a nickname (Bob vs Robert)
""",
)
print(result.data.head())
Strategies
Control what happens after clusters are identified using the strategy parameter:
select (default)
Picks the best representative from each cluster. Three columns are added:
- equivalence_class_id: rows with the same ID are duplicates of each other
- equivalence_class_name: human-readable label for the cluster
- selected: True for the canonical record in each cluster
result = await dedupe(
input=crm_data,
equivalence_relation="Same legal entity",
strategy="select",
strategy_prompt="Prefer the record with the most complete contact information",
)
deduped = result.data[result.data["selected"] == True]
identify
Cluster only, with no selection or combining. Useful when you want to review clusters before deciding what to do.
Two columns are added:
- equivalence_class_id: rows with the same ID are duplicates of each other
- equivalence_class_name: human-readable label for the cluster
result = await dedupe(
input=crm_data,
equivalence_relation="Same legal entity",
strategy="identify",
)
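One way to review clusters after an identify run is to keep only the clusters that contain more than one row and print them grouped by label. The sketch below uses a hand-built DataFrame as a stand-in for result.data. Only the two column names come from this page; the rows and the integer IDs are made up for illustration.

```python
import pandas as pd

# Stand-in for `result.data` after strategy="identify" (rows are invented;
# the real ID type may differ from the integers used here).
data = pd.DataFrame(
    {
        "name": ["Acme Inc.", "ACME Incorporated", "Globex", "Initech LLC", "Initech"],
        "equivalence_class_id": [0, 0, 1, 2, 2],
        "equivalence_class_name": ["Acme", "Acme", "Globex", "Initech", "Initech"],
    }
)

# Map each row to the size of its cluster, then keep clusters with 2+ rows.
cluster_sizes = data["equivalence_class_id"].map(
    data["equivalence_class_id"].value_counts()
)
to_review = data[cluster_sizes > 1].sort_values("equivalence_class_id")

# Print each multi-row cluster with its members for manual inspection.
for label, group in to_review.groupby("equivalence_class_name"):
    print(label, "->", list(group["name"]))
```

Singleton clusters (Globex here) are dropped because they have no duplicates to review.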
combine
Synthesizes a single combined row per cluster, merging the best information from all duplicates. Original rows are marked selected=False, and new combined rows are added with selected=True.
result = await dedupe(
input=crm_data,
equivalence_relation="Same legal entity",
strategy="combine",
strategy_prompt="For each field, keep the most recent and complete value",
)
combined = result.data[result.data["selected"] == True]
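To make the output layout of combine concrete, here is a deterministic toy version of the idea. It is not how the library combines rows: the real merge is guided by strategy_prompt, whereas this sketch uses "longest non-empty value wins" as a crude proxy for "most complete". The combine_cluster helper and the email field name are assumptions for illustration.

```python
def combine_cluster(rows):
    """Merge rows field by field, keeping the longest value seen for each field."""
    combined = {}
    for row in rows:
        for field, value in row.items():
            # setdefault ensures every field exists, even if always empty.
            if len(value) > len(combined.setdefault(field, "")):
                combined[field] = value
    return combined


# One cluster of duplicates (values taken from the example table on this page).
cluster = [
    {"name": "A. Butoi", "org": "Rycolab", "email": "a.butoi@edu"},
    {"name": "Alexandra Butoi", "org": "Ryoclab", "email": ""},
]

combined = combine_cluster(cluster)

# Mirror the documented layout: originals get selected=False,
# and the new combined row is appended with selected=True.
output = [dict(row, selected=False) for row in cluster]
output.append(dict(combined, selected=True))

for row in output:
    print(row)
```

Filtering output on selected then leaves only the combined row, which is what the result.data filter above does for a real run.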
What you get back
Three columns are added to your data when using the select or combine strategy:
- equivalence_class_id: rows with the same ID are duplicates of each other
- equivalence_class_name: human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
- selected: True for the canonical record in each cluster (usually the most complete one)
To get just the deduplicated rows:
deduped = result.data[result.data["selected"] == True]
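Before dropping the non-selected rows, it can be worth checking that each cluster has exactly one selected record, which is what the column descriptions above imply. A minimal sketch, using a hand-built DataFrame in place of result.data (the integer IDs are an assumption):

```python
import pandas as pd

# Stand-in for `result.data` after a select run; column names follow this page.
data = pd.DataFrame(
    {
        "name": ["A. Butoi", "Alexandra Butoi", "Namoi Saphra", "Naomi Saphra"],
        "equivalence_class_id": [0, 0, 1, 1],
        "selected": [False, True, False, True],
    }
)

# Summing booleans counts the True values in each cluster.
selected_per_cluster = data.groupby("equivalence_class_id")["selected"].sum()
assert (selected_per_cluster == 1).all(), "expected one canonical row per cluster"

# A boolean column can be used directly as a mask.
deduped = data[data["selected"]]
print(deduped["name"].tolist())
```

The deduplicated frame should then have as many rows as there are distinct equivalence_class_id values.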
Example
Input:
| name | org | email |
|---|---|---|
| A. Butoi | Rycolab | a.butoi@edu |
| Alexandra Butoi | Ryoclab | |
| Namoi Saphra | | nsaphra@alumni |
| Naomi Saphra | Harvard | nsaphra@harvard.edu |
Output (selected rows only):
| name | org | email |
|---|---|---|
| Alexandra Butoi | Rycolab | a.butoi@edu |
| Naomi Saphra | Harvard | nsaphra@harvard.edu |
Parameters
| Name | Type | Description |
|---|---|---|
| input | DataFrame | Data with potential duplicates |
| equivalence_relation | str | What makes two rows duplicates |
| strategy | str | "identify", "select" (default), or "combine" |
| strategy_prompt | str | Optional instructions for selection or combining |
| session | Session | Optional, auto-created if omitted |
Performance
| Rows | Time | Cost |
|---|---|---|
| 200 | ~90 sec | ~$0.40 |
| 500 | ~2 min | ~$1.67 |
| 2,000 | ~8 min | ~$7 |
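For row counts between the published figures, a rough budget can be read off by linear interpolation. This is only a sketch: the table gives three approximate data points, nothing on this page says cost scales linearly between them, and the estimate_cost helper is hypothetical.

```python
# (rows, approximate cost in USD), copied from the table above.
BENCHMARKS = [(200, 0.40), (500, 1.67), (2000, 7.00)]


def estimate_cost(rows):
    """Linearly interpolate cost between neighbouring published data points."""
    if not BENCHMARKS[0][0] <= rows <= BENCHMARKS[-1][0]:
        # Refuse to extrapolate beyond what the table covers.
        raise ValueError("row count is outside the published range")
    for (r0, c0), (r1, c1) in zip(BENCHMARKS, BENCHMARKS[1:]):
        if r0 <= rows <= r1:
            return c0 + (c1 - c0) * (rows - r0) / (r1 - r0)


print(round(estimate_cost(1000), 2))
```

Treat the result as an order-of-magnitude guide, since actual cost presumably depends on how large and how similar the rows are.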
Via MCP
MCP tool: futuresearch_dedupe
| Parameter | Type | Description |
|---|---|---|
| csv_path | string | Path to input CSV file |
| equivalence_relation | string | What makes two rows duplicates |
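Because the MCP tool takes a file path rather than a DataFrame, the input has to be written to CSV first. The sketch below prepares a small CSV with the standard library and assembles the two documented parameters into a dictionary. How that dictionary is passed to futuresearch_dedupe depends on your MCP client and is not shown; the column names and file name are illustrative assumptions.

```python
import csv
import os
import tempfile

# Illustrative rows; the column names are assumptions, not a required schema.
rows = [
    {"name": "A. Butoi", "org": "Rycolab", "email": "a.butoi@edu"},
    {"name": "Alexandra Butoi", "org": "Ryoclab", "email": ""},
]

# Write the rows to a CSV file in a fresh temporary directory.
csv_path = os.path.join(tempfile.mkdtemp(), "crm_data.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "org", "email"])
    writer.writeheader()
    writer.writerows(rows)

# The two parameters listed in the table above, as they might be handed to a client.
tool_arguments = {
    "csv_path": csv_path,
    "equivalence_relation": "Two rows are duplicates if they're the same person",
}
print(tool_arguments)
```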