Resolve Duplicate Entities
Go to futuresearch.ai/app, upload case_01_crm_data.csv, and enter:
```
Deduplicate this CRM dataset. Two entries are duplicates if they represent the same company.
```
500 records resolve to about 156 unique entities (68.8% duplicates removed). The run takes about 15 minutes.
Add the everyrow connector if you haven't already. Then upload case_01_crm_data.csv and ask Claude:
```
Deduplicate this CRM dataset. Two entries are duplicates if they represent the same company.
```
500 records resolve to about 156 unique entities (68.8% duplicates removed). The run takes about 15 minutes.
Claude Code is great at writing normalization code to standardize company names. Stripping "Inc." and lowercasing gets you some matches. But "AbbVie Inc.", "AbbVie Pharmaceutical", "Abbvie", and "Abvie Inc" need more than normalization. Two of those are typos, one is a brand variant, and one is a division name.
Here, we get Claude Code to resolve 500 messy CRM records down to their unique entities.
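To see why normalization alone falls short, here is a minimal sketch of the usual approach. The `normalize` helper is hypothetical (not part of everyrow or Claude Code), written only to show where string cleanup stops helping:

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", "", name)                     # remove punctuation
    name = re.sub(r"\b(inc|corp|corporation|llc|ltd)\b", "", name)
    return " ".join(name.split())                           # collapse whitespace

variants = ["AbbVie Inc.", "AbbVie Pharmaceutical", "Abbvie", "Abvie Inc"]
print({v: normalize(v) for v in variants})
# normalize() merges "AbbVie Inc." with "Abbvie", but the typo "Abvie Inc"
# and the division name "AbbVie Pharmaceutical" still normalize to
# distinct strings -- no amount of suffix-stripping fixes those.
```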
| Metric | Value |
|---|---|
| Records processed | 500 |
| Unique entities | 156 |
| Duplicates resolved | 344 (68.8%) |
| Cost | $2.02 |
| Time | 15.3 minutes |
Add everyrow to Claude Code if you haven't already:
```bash
claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp
```
Download the dataset: case_01_crm_data.csv (500 messy company records with typos, abbreviations, and missing fields). With the CSV in your working directory, tell Claude:
```
Deduplicate this CRM dataset. Two entries are duplicates if they represent
the same company.
```
Claude calls everyrow's dedupe MCP tool:
```
Tool: everyrow_dedupe
├─ equivalence_relation: "Two entries are duplicates if they represent the same company."
└─ input_csv: "/Users/you/case_01_crm_data.csv"
→ Submitted: 500 rows for deduplication.
Session: https://futuresearch.ai/sessions/d5d1ef67-653e-4dd9-90e7-9653cda2af85
Task ID: d5d1...

Tool: everyrow_progress
├─ task_id: "d5d1..."
→ Running: 0/500 complete (30s elapsed)
...
Tool: everyrow_progress
→ Completed: 500/500 (0 failed) in 920s.

Tool: everyrow_results
├─ task_id: "d5d1..."
├─ output_path: "/Users/you/crm_deduplicated.csv"
→ Saved 500 rows to /Users/you/crm_deduplicated.csv
```
500 records resolved to 156 unique entities. View the session.
| Cluster | Records | Variants |
|---|---|---|
| Palo Alto Networks | 8 | Pallow Alto, PANW, Palo Alto Net Inc, Paloalto Networks |
| Walmart Inc. | 8 | W-Mart, Wall-Mart, WMT Corp, Walmart Corporation |
| Uber Technologies | 8 | Ubar, Ubr, Uber Tech, Uber Corporation |
| ServiceNow | 6 | Service Now, Service-Now, SerivceNow, Service Now Inc |
The system handles cases that string similarity misses entirely. "AAPL" matches "Apple Inc." because the model knows the ticker symbol. "Big Blue" matches "IBM Corporation" because that's IBM's nickname. The output includes `equivalence_class_id` and `selected` columns; filter to `selected == True` to get one record per entity.
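That filtering step can be sketched with pandas on a toy frame standing in for `crm_deduplicated.csv` (the column names come from the output described above; the rows are invented for illustration):

```python
import pandas as pd

# Toy stand-in for the everyrow output: every input row is kept,
# equivalence_class_id groups duplicates, selected marks the keeper.
df = pd.DataFrame({
    "company_name": ["Walmart Inc.", "W-Mart", "Apple Inc.", "AAPL"],
    "equivalence_class_id": [0, 0, 1, 1],
    "selected": [True, False, True, False],
})

unique = df[df["selected"] == True]  # one row per entity
print(f"{len(df)} rows -> {len(unique)} unique entities")  # 4 rows -> 2 unique entities
```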
Identifying records that represent the same entity across messy data typically requires labeled training data, manual blocking rules, or extensive threshold tuning. The everyrow SDK uses LLMs to do this at high accuracy in a single method call.
| Metric | Value |
|---|---|
| Records processed | 500 |
| Unique entities | 131 |
| Duplicates resolved | 369 |
| Cost | $0.74 |
| Time | ~100 seconds |
| Session | view |
```bash
pip install everyrow
export EVERYROW_API_KEY=your_key_here  # Get one at futuresearch.ai
```
We'll use a messy CRM dataset of 500 company records. To follow along, right-click this link and save the CSV file into your working directory.
```python
import asyncio

import pandas as pd
from everyrow.ops import dedupe

data = pd.read_csv("case_01_crm_data.csv").fillna("")

async def main():
    result = await dedupe(
        input=data,
        equivalence_relation="Two entries are duplicates if they represent the same company.",
    )
    # Filter to keep only the best record per entity
    unique = result.data[result.data["selected"] == True]
    print(f"Reduced {len(data)} records to {len(unique)} unique entities")

asyncio.run(main())
```
The input data contains variations like these, all representing the same company:
| company_name | contact_name | email_address |
|---|---|---|
| AbbVie Inc. | Richard Gonzales | info@abbvie-bio.com |
| AbbVie Pharmaceutical | Richard Gonzales | |
| Abbvie | | info@abbvie-bio.com |
| Abvie Inc | Richard Gonzales | |
The SDK clusters these into a single entity and selects the most complete record. The output DataFrame includes `equivalence_class_id` and `equivalence_class_name` columns showing which records were grouped together, plus a `selected` boolean indicating which record to keep.
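To inspect what was grouped, you can iterate over the clusters. This is a sketch on a toy frame with the output schema just described (the values are invented, not real everyrow output):

```python
import pandas as pd

# Toy stand-in for result.data, using the documented output columns.
out = pd.DataFrame({
    "company_name": ["AbbVie Inc.", "AbbVie Pharmaceutical", "Abvie Inc"],
    "equivalence_class_id": [7, 7, 7],
    "equivalence_class_name": ["AbbVie Inc."] * 3,
    "selected": [True, False, False],
})

# Show every raw variant grouped under each resolved entity.
for name, group in out.groupby("equivalence_class_name"):
    print(f"{name}: {sorted(group['company_name'])}")
```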
This approach handles cases that string similarity misses entirely. "AAPL" matches to "Apple Inc." because the model knows the ticker symbol. "Big Blue" matches to "IBM Corporation" because that's IBM's nickname. "W-Mart" and "Wallmart" match to "Walmart Inc." despite having different typos.
The equivalence relation is flexible. For matching people: "Two entries are duplicates if they refer to the same person, accounting for name variations and nicknames." For products: "Two entries represent the same product if they're the same item sold under different names or SKUs."
See the full notebook for additional examples including how to merge the clustered records into consolidated entries.
Built with everyrow. See the dedupe documentation for more options.