Deduplicate CRM Records
Messy CRM data has entries like "PANW", "Pallow Alto", and "Paloalto Networks" all referring to the same company. This case study deduplicates 500 CRM records down to unique entities, handling ticker symbols, nicknames, and typos.
| Metric | Value |
|---|---|
| Records processed | 500 |
| Unique entities | 146 |
| Duplicates removed | 354 (70.8%) |
| Cost | $1.38 |
| Time | 7.0 minutes |
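The duplicate count and rate in the table follow directly from the first two rows; a quick arithmetic check:

```python
# Sanity-check the headline numbers: duplicates removed = records processed - unique entities.
records, unique = 500, 146
removed = records - unique
print(removed, f"{removed / records:.1%}")  # → 354 70.8%
```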
Claude Code

Add FutureSearch to Claude Code if you haven't already:

```shell
claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp
```
Download case_01_crm_data.csv. Tell Claude:

```
Deduplicate this CRM dataset. Two entries are duplicates if they include data
for the same legal entity.
```
Claude calls FutureSearch's dedupe MCP tool:
```
Tool: futuresearch_dedupe
├─ equivalence_relation: "Two entries are duplicates if they include data for the same legal entity."
└─ input_csv: "/Users/you/case_01_crm_data.csv"
→ Submitted: 500 rows for deduplication.
  Session: https://futuresearch.ai/sessions/0f6aa459-6e83-4df0-b9e8-bdb8ec594d91
  Task ID: 0f6a...

Tool: futuresearch_progress
├─ task_id: "0f6a..."
→ Running: 0/500 complete (30s elapsed)

...

Tool: futuresearch_progress
→ Completed: 500/500 (0 failed) in 422s.

Tool: futuresearch_results
├─ task_id: "0f6a..."
├─ output_path: "/Users/you/crm_deduplicated.csv"
→ Saved 500 rows to /Users/you/crm_deduplicated.csv
```
500 records resolved to 146 unique entities. View the session.
Claude app

Add the FutureSearch connector if you haven't already. Then upload case_01_crm_data.csv and ask Claude:

```
Deduplicate this CRM dataset. Two entries are duplicates if they include data for the same legal entity.
```
Web app

Go to futuresearch.ai/app, upload case_01_crm_data.csv, and enter:

```
Deduplicate this CRM dataset. Two entries are duplicates if they include data for the same legal entity.
```
Python SDK

```shell
pip install futuresearch
export FUTURESEARCH_API_KEY=your_key_here  # Get one at futuresearch.ai/app/api-key
```
```python
import asyncio

import pandas as pd

from futuresearch import create_session
from futuresearch.ops import dedupe

data = pd.read_csv("case_01_crm_data.csv")

async def main():
    async with create_session(name="CRM Deduplication") as session:
        result = await dedupe(
            session=session,
            input=data,
            equivalence_relation="Two entries are duplicates if they include data for the same legal entity.",
        )
    # Keep only the selected (canonical) record from each equivalence class.
    deduplicated = result.data[result.data["selected"]]
    return deduplicated

clean_data = asyncio.run(main())
```
Results
| Cluster | Records | Variants |
|---|---|---|
| Palo Alto Networks | 8 | Pallow Alto, PANW, Paloalto Networks, Palo Alto Net Inc |
| Walmart | 8 | W-Mart, Wall-Mart, WMT Corp, Wallmart, Wal-Mart Stores |
| Uber | 8 | Ubar, Ubr, Uber Tech, Uber Corporation |
| ServiceNow | 6 | Service Now, Service-Now, SerivceNow, Service Now Inc |
| Nike | 4 | Nyke, Nike Corp, Nike Incorporated, Nike Inc. |
The output includes `equivalence_class_id` and `selected` columns. Filter to `selected == True` to get one record per entity. The system uses embeddings for initial clustering, then pairwise LLM comparisons for accuracy. It handles ticker symbols (PANW to Palo Alto Networks), nicknames (Big Blue to IBM), and typos (Wallmart to Walmart).
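As a concrete illustration of that post-processing step, here is a minimal sketch on a synthetic frame with the same `equivalence_class_id` and `selected` columns (the rows here are made up for the example; a real run produces this shape from the dedupe output):

```python
import pandas as pd

# Synthetic stand-in for the dedupe output: each row keeps its cluster id,
# and exactly one row per cluster is flagged as the selected canonical record.
out = pd.DataFrame({
    "company": ["PANW", "Palo Alto Networks", "Wallmart", "Walmart", "Nike"],
    "equivalence_class_id": [0, 0, 1, 1, 2],
    "selected": [False, True, False, True, True],
})

# One canonical record per entity.
unique_entities = out[out["selected"]]
print(len(unique_entities))  # → 3

# Cluster sizes show how many raw records collapsed into each entity.
print(out.groupby("equivalence_class_id").size().tolist())  # → [2, 2, 1]
```

The same boolean filter works on the CSV saved by the MCP flow after loading it with `pd.read_csv`.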