
Resolve Duplicate Entities

Go to futuresearch.ai/app, upload case_01_crm_data.csv, and enter:

Deduplicate this CRM dataset. Two entries are duplicates if they represent the same company.

500 records resolved to about 156 unique entities (68.8% duplicates removed). Results take about 15 minutes.

Add the everyrow connector if you haven't already. Then upload case_01_crm_data.csv and ask Claude:

Deduplicate this CRM dataset. Two entries are duplicates if they represent the same company.

500 records resolved to about 156 unique entities (68.8% duplicates removed). Results take about 15 minutes.

Claude Code is great at writing normalization code to standardize company names. Stripping "Inc." and lowercasing gets you some matches. But "AbbVie Inc.", "AbbVie Pharmaceutical", "Abbvie", and "Abvie Inc" need more than normalization. Two of those are typos, one is a brand variant, and one is a division name.
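To make the limits of normalization concrete, here is a minimal sketch of the kind of code described above. The suffix list and the `normalize` helper are illustrative assumptions, not part of everyrow.

```python
import re

# Legal suffixes to strip. An illustrative list, not an exhaustive one.
SUFFIXES = re.compile(r"\b(inc|corp|corporation|llc|ltd)\b\.?", re.IGNORECASE)

def normalize(name: str) -> str:
    """Lowercase, drop legal suffixes and punctuation, collapse whitespace."""
    name = SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())

names = ["AbbVie Inc.", "AbbVie Pharmaceutical", "Abbvie", "Abvie Inc"]
normalized = {n: normalize(n) for n in names}
distinct = set(normalized.values())

for original, cleaned in normalized.items():
    print(f"{original!r} -> {cleaned!r}")
print(f"{len(names)} names collapse to {len(distinct)} distinct keys")
```

"AbbVie Inc." and "Abbvie" collapse to the same key, but "AbbVie Pharmaceutical" and "Abvie Inc" still look like separate companies, which is the gap the rest of this guide addresses.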

Here, we get Claude Code to resolve 500 messy CRM records down to their unique entities.

Metric               Value
Records processed    500
Unique entities      156
Duplicates resolved  344 (68.8%)
Cost                 $2.02
Time                 15.3 minutes
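The duplicate count and percentage follow directly from the first two rows of the table. A short worked check, using only the figures reported above:

```python
records = 500
unique_entities = 156

# Every record beyond one per entity counts as a resolved duplicate.
duplicates = records - unique_entities
duplicate_rate = duplicates / records

print(f"{duplicates} duplicates ({duplicate_rate:.1%})")
```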

Add everyrow to Claude Code if you haven't already:

claude mcp add futuresearch --scope project --transport http https://mcp.futuresearch.ai/mcp

Download the dataset: case_01_crm_data.csv (500 messy company records with typos, abbreviations, and missing fields). With the CSV in your working directory, tell Claude:

Deduplicate this CRM dataset. Two entries are duplicates if they represent
the same company.

Claude calls everyrow's dedupe MCP tool:

Tool: everyrow_dedupe
├─ equivalence_relation: "Two entries are duplicates if they represent the same company."
└─ input_csv: "/Users/you/case_01_crm_data.csv"

→ Submitted: 500 rows for deduplication.
  Session: https://futuresearch.ai/sessions/d5d1ef67-653e-4dd9-90e7-9653cda2af85
  Task ID: d5d1...

Tool: everyrow_progress
├─ task_id: "d5d1..."
→ Running: 0/500 complete (30s elapsed)

...

Tool: everyrow_progress
→ Completed: 500/500 (0 failed) in 920s.

Tool: everyrow_results
├─ task_id: "d5d1..."
├─ output_path: "/Users/you/crm_deduplicated.csv"
→ Saved 500 rows to /Users/you/crm_deduplicated.csv

500 records resolved to 156 unique entities.

Cluster             Records  Variants
Palo Alto Networks  8        Pallow Alto, PANW, Palo Alto Net Inc, Paloalto Networks
Walmart Inc.        8        W-Mart, Wall-Mart, WMT Corp, Walmart Corporation
Uber Technologies   8        Ubar, Ubr, Uber Tech, Uber Corporation
ServiceNow          6        Service Now, Service-Now, SerivceNow, Service Now Inc

The system handles cases that string similarity misses entirely. "AAPL" matches to "Apple Inc." because the model knows the ticker symbol. "Big Blue" matches to "IBM Corporation" because that's IBM's nickname. The output includes equivalence_class_id and selected columns. Filter to selected == True to get one record per entity.
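The filtering step can be sketched with pandas. The rows below are hypothetical stand-ins for the saved output file. Only the equivalence_class_id and selected column names come from the description above, and the string comparison is an assumption that covers the flag being read back from CSV as either a boolean or text.

```python
import io
import pandas as pd

# Hypothetical rows standing in for crm_deduplicated.csv.
sample_csv = io.StringIO(
    "company_name,equivalence_class_id,selected\n"
    "Apple Inc.,0,True\n"
    "AAPL,0,False\n"
    "IBM Corporation,1,True\n"
    "Big Blue,1,False\n"
)
df = pd.read_csv(sample_csv)

# Compare as text so the filter works whether pandas parsed the column
# as bool or left it as strings.
is_selected = df["selected"].astype(str).str.lower() == "true"
unique = df[is_selected]

print(f"Reduced {len(df)} records to {len(unique)} unique entities")
print(unique["company_name"].tolist())
```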

Identifying matching records that represent the same entity across messy data typically requires labeled training data, manual blocking rules, or extensive threshold tuning. The everyrow SDK uses LLMs to solve this at high accuracy in a single method call.

Metric               Value
Records processed    500
Unique entities      131
Duplicates resolved  369
Cost                 $0.74
Time                 ~100 seconds
Session              view

pip install everyrow
export EVERYROW_API_KEY=your_key_here  # Get one at futuresearch.ai

We'll use a messy CRM dataset with 500 company records. To follow along, download case_01_crm_data.csv and save it to your working directory.

import asyncio
import pandas as pd
from everyrow.ops import dedupe

data = pd.read_csv("case_01_crm_data.csv").fillna("")

async def main():
    result = await dedupe(
        input=data,
        equivalence_relation="Two entries are duplicates if they represent the same company.",
    )

    # Filter to keep only the best record per entity
    unique = result.data[result.data["selected"] == True]
    print(f"Reduced {len(data)} records to {len(unique)} unique entities")

asyncio.run(main())

The input data contains variations like these, all representing the same company:

company_name           contact_name      email_address
AbbVie Inc.            Richard Gonzales  info@abbvie-bio.com
AbbVie Pharmaceutical  Richard Gonzales
Abbvie                                   info@abbvie-bio.com
Abvie Inc              Richard Gonzales

The SDK clusters these into a single entity and selects the most complete record. The output DataFrame includes equivalence_class_id and equivalence_class_name columns showing which records were grouped together, plus a selected boolean indicating which record to keep.
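One way to inspect the grouping is to summarize the output by cluster. This is a sketch over a hand-built DataFrame that stands in for `result.data`. The column names come from the description above, while the ID values and class names are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for result.data. Real IDs and names may differ.
clustered = pd.DataFrame({
    "company_name": [
        "AbbVie Inc.", "AbbVie Pharmaceutical", "Abbvie", "Abvie Inc",
        "IBM Corporation",
    ],
    "equivalence_class_id": [0, 0, 0, 0, 1],
    "equivalence_class_name": ["AbbVie"] * 4 + ["IBM"],
    "selected": [True, False, False, False, True],
})

# One row per cluster: how many records it holds and which variants it absorbed.
summary = (
    clustered
    .groupby(["equivalence_class_id", "equivalence_class_name"])["company_name"]
    .agg(records="count", variants=lambda s: ", ".join(s))
    .reset_index()
    .sort_values("records", ascending=False)
)
print(summary)

# Sanity check: each cluster should have exactly one record marked to keep.
one_kept_per_cluster = (
    clustered.groupby("equivalence_class_id")["selected"].sum().eq(1).all()
)
print("One selected record per cluster:", bool(one_kept_per_cluster))
```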

This approach handles cases that string similarity misses entirely. "AAPL" matches to "Apple Inc." because the model knows the ticker symbol. "Big Blue" matches to "IBM Corporation" because that's IBM's nickname. "W-Mart" and "Wallmart" match to "Walmart Inc." despite having different typos.
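For comparison, here is a rough sketch of what a character-level similarity score sees for these pairs. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for string similarity in general. Other metrics give different numbers, but the pattern is similar.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Abvie Inc", "AbbVie Inc."),      # a typo: characters mostly overlap
    ("AAPL", "Apple Inc."),            # a ticker symbol
    ("Big Blue", "IBM Corporation"),   # a nickname
]
scores = {pair: similarity(*pair) for pair in pairs}

for (a, b), score in scores.items():
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

The typo scores high, while the ticker and the nickname share too few characters for any reasonable threshold to link them. Matching those requires knowing what the names refer to.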

The equivalence relation is flexible. For matching people: "Two entries are duplicates if they refer to the same person, accounting for name variations and nicknames." For products: "Two entries represent the same product if they're the same item sold under different names or SKUs."
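Swapping the relation might look like the sketch below. It reuses the `dedupe` call shown earlier and assumes it accepts any DataFrame with the same two arguments. The `RELATIONS` dictionary, the `dedupe_as` helper, and the `contacts.csv` filename are hypothetical names introduced for illustration.

```python
import asyncio
import pandas as pd

# Relation strings quoted from this guide, keyed by a label of our choosing.
RELATIONS = {
    "company": "Two entries are duplicates if they represent the same company.",
    "person": (
        "Two entries are duplicates if they refer to the same person, "
        "accounting for name variations and nicknames."
    ),
    "product": (
        "Two entries represent the same product if they're the same item "
        "sold under different names or SKUs."
    ),
}

async def dedupe_as(data: pd.DataFrame, kind: str) -> pd.DataFrame:
    """Hypothetical helper: dedupe `data` under the relation for `kind`."""
    # Imported here so this file loads even where everyrow is not installed.
    from everyrow.ops import dedupe

    result = await dedupe(input=data, equivalence_relation=RELATIONS[kind])
    return result.data[result.data["selected"] == True]

# Example call (needs EVERYROW_API_KEY and network access, so it is left
# commented out; contacts.csv is a placeholder filename):
# people = asyncio.run(dedupe_as(pd.read_csv("contacts.csv").fillna(""), "person"))
```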

See the full notebook for additional examples including how to merge the clustered records into consolidated entries.
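As a rough idea of what consolidation can involve, the sketch below fills each field from the first record in the cluster that has a value, preferring the selected record. This is one plausible approach and not necessarily what the notebook does. The rows are hypothetical, arranged so that the selected record is missing a field.

```python
import pandas as pd

# Hypothetical cluster in which the selected record lacks a contact name.
records = pd.DataFrame({
    "company_name": ["AbbVie Inc.", "AbbVie Pharmaceutical", "Abbvie"],
    "contact_name": ["", "Richard Gonzales", ""],
    "email_address": ["info@abbvie-bio.com", "", "info@abbvie-bio.com"],
    "equivalence_class_id": [0, 0, 0],
    "selected": [True, False, False],
})

def first_non_empty(values: pd.Series) -> str:
    """Return the first non-empty string in the group, or "" if none."""
    for value in values:
        if value != "":
            return value
    return ""

fields = ["company_name", "contact_name", "email_address"]
consolidated = (
    records
    # Put the selected record first so its values win when present.
    .sort_values("selected", ascending=False, kind="stable")
    .groupby("equivalence_class_id", as_index=False)
    .agg({field: first_non_empty for field in fields})
)
print(consolidated)
```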


Built with everyrow. See the dedupe documentation for more options.