Dedupe

dedupe groups duplicate rows in a DataFrame based on a natural-language equivalence relation, assigns cluster IDs, and selects a canonical row per cluster. Because the duplicate criterion is semantic and LLM-powered, it handles abbreviations, name variations, job changes, and entity relationships that string-similarity thresholds miss.

Examples

from futuresearch.ops import dedupe

result = await dedupe(
    input=crm_data,
    equivalence_relation="Two entries are duplicates if they represent the same legal entity",
)
print(result.data.head())

The equivalence_relation is natural language. Be as specific as you need:

result = await dedupe(
    input=researchers,
    equivalence_relation="""
        Two rows are duplicates if they're the same person, even if:
        - They changed jobs (different org/email)
        - Name is abbreviated (A. Smith vs Alex Smith)
        - There are typos (Naomi vs Namoi)
        - They use a nickname (Bob vs Robert)
    """,
)
print(result.data.head())

Strategies

Control what happens after clusters are identified using the strategy parameter:

select (default)

Picks the best representative from each cluster. Three columns are added:

  • equivalence_class_id: rows with the same ID are duplicates of each other
  • equivalence_class_name: human-readable label for the cluster
  • selected: True for the canonical record in each cluster

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="select",
    strategy_prompt="Prefer the record with the most complete contact information",
)
deduped = result.data[result.data["selected"] == True]

identify

Cluster only, with no selection or combining. Useful when you want to review clusters before deciding what to do.

  • equivalence_class_id: rows with the same ID are duplicates of each other
  • equivalence_class_name: human-readable label for the cluster

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="identify",
)
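
One way to review clusters after an identify run is to look only at the groups that contain more than one row. The sketch below uses a hand-written DataFrame as a stand-in for result.data; the ID values, their integer type, and the "Bob Lee" row are assumptions for illustration, not real dedupe output.

```python
import pandas as pd

# Hand-written stand-in for result.data after strategy="identify".
# IDs, labels, and the "Bob Lee" row are illustrative assumptions.
clusters = pd.DataFrame(
    {
        "name": ["A. Butoi", "Alexandra Butoi", "Namoi Saphra", "Naomi Saphra", "Bob Lee"],
        "equivalence_class_id": [0, 0, 1, 1, 2],
        "equivalence_class_name": [
            "Alexandra Butoi", "Alexandra Butoi",
            "Naomi Saphra", "Naomi Saphra",
            "Bob Lee",
        ],
    }
)

# Number of rows in each cluster, aligned to the original rows.
cluster_size = clusters.groupby("equivalence_class_id")["name"].transform("size")

# Keep only rows that share a cluster with at least one other row.
to_review = clusters[cluster_size > 1]

# Print each multi-row cluster with the names it groups together.
for label, group in to_review.groupby("equivalence_class_name"):
    print(label, "->", list(group["name"]))
```

Singleton clusters are dropped from the review set because they contain nothing to merge.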

combine

Synthesizes a single combined row per cluster, merging the best information from all duplicates. Original rows are marked selected=False, and new combined rows are added with selected=True.

result = await dedupe(
    input=crm_data,
    equivalence_relation="Same legal entity",
    strategy="combine",
    strategy_prompt="For each field, keep the most recent and complete value",
)
combined = result.data[result.data["selected"] == True]

What you get back

Three columns are added to your data when using the select or combine strategy (identify adds only the first two):

  • equivalence_class_id: rows with the same ID are duplicates of each other
  • equivalence_class_name: human-readable label for the cluster ("Alexandra Butoi", "Naomi Saphra", etc.)
  • selected: True for the canonical record in each cluster (usually the most complete one)

To get just the deduplicated rows:

deduped = result.data[result.data["selected"] == True]
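
Before dropping rows, it can help to check that the selected flag behaves as expected. The following is a minimal sketch on a hand-written stand-in for result.data; it assumes the select strategy marks exactly one row per cluster, and the values are illustrative.

```python
import pandas as pd

# Hand-written stand-in for result.data after strategy="select" (illustrative values).
data = pd.DataFrame(
    {
        "name": ["A. Butoi", "Alexandra Butoi", "Namoi Saphra", "Naomi Saphra"],
        "equivalence_class_id": [0, 0, 1, 1],
        "selected": [False, True, False, True],
    }
)

# Count selected rows per cluster; the assumption is exactly one each.
selected_per_cluster = data.groupby("equivalence_class_id")["selected"].sum()
unexpected = selected_per_cluster[selected_per_cluster != 1]
if not unexpected.empty:
    raise ValueError(f"Clusters without exactly one selected row: {list(unexpected.index)}")

# Keep the canonical rows and report how many duplicates were dropped.
deduped = data[data["selected"]]
removed = len(data) - len(deduped)
print(f"{len(data)} rows -> {len(deduped)} rows ({removed} duplicates removed)")
```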

Example

Input:

| name | org | email |
|---|---|---|
| A. Butoi | Rycolab | a.butoi@edu |
| Alexandra Butoi | Ryoclab | |
| Namoi Saphra | | nsaphra@alumni |
| Naomi Saphra | Harvard | nsaphra@harvard.edu |

Output (selected rows only):

| name | org | email |
|---|---|---|
| Alexandra Butoi | Rycolab | a.butoi@edu |
| Naomi Saphra | Harvard | nsaphra@harvard.edu |

Parameters

| Name | Type | Description |
|---|---|---|
| input | DataFrame | Data with potential duplicates |
| equivalence_relation | str | What makes two rows duplicates |
| strategy | str | "identify", "select" (default), or "combine" |
| strategy_prompt | str | Optional instructions for selection or combining |
| session | Session | Optional, auto-created if omitted |

Performance

| Rows | Time | Cost |
|---|---|---|
| 200 | ~90 sec | ~$0.40 |
| 500 | ~2 min | ~$1.67 |
| 2,000 | ~8 min | ~$7 |
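
For rough budgeting, the figures in the table above can be turned into a per-row cost and a simple interpolated estimate. This is only a planning sketch: it assumes cost varies linearly between the listed row counts, which the table does not guarantee, and actual cost likely depends on the data and the equivalence relation. The estimate_cost helper is hypothetical and not part of the SDK.

```python
# (rows, approximate cost in USD) copied from the performance table.
BENCHMARKS = [(200, 0.40), (500, 1.67), (2000, 7.00)]

# Approximate per-row cost at each benchmark size.
for rows, cost in BENCHMARKS:
    print(f"{rows} rows: ~${cost / rows:.4f} per row")


def estimate_cost(n_rows: int) -> float:
    """Linearly interpolate cost between the benchmark points (hypothetical helper)."""
    lo_rows, hi_rows = BENCHMARKS[0][0], BENCHMARKS[-1][0]
    if not lo_rows <= n_rows <= hi_rows:
        raise ValueError(f"n_rows must be between {lo_rows} and {hi_rows}")
    # Find the pair of neighbouring benchmarks that brackets n_rows.
    for (r0, c0), (r1, c1) in zip(BENCHMARKS, BENCHMARKS[1:]):
        if r0 <= n_rows <= r1:
            fraction = (n_rows - r0) / (r1 - r0)
            return c0 + fraction * (c1 - c0)
    raise AssertionError("unreachable: the range check above covers all cases")


estimate_1000 = estimate_cost(1000)
print(f"1000 rows: ~${estimate_1000:.2f} (interpolated)")
```

The helper refuses to extrapolate outside the measured range, since the table gives no evidence about cost below 200 or above 2,000 rows.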

Via MCP

MCP tool: futuresearch_dedupe

| Parameter | Type | Description |
|---|---|---|
| csv_path | string | Path to input CSV file |
| equivalence_relation | string | What makes two rows duplicates |

Related docs

Guides

  • Remove Duplicates from ML Training Data
  • Resolve Duplicate Entities

Case Studies

  • Dedupe CRM Company Records

Blog posts

  • CRM Deduplication
  • Researcher Deduplication