
Deduplicating 200 Contacts with Conflicting Signals

How I used LLMs to cheaply and accurately clean a dataset

The User's Problem

This case study is inspired by a common Salesforce pain point: companies purchase AI tools like Agentforce but can't use them because their CRM data is a mess—same contacts listed 3-4 different ways, missing emails, and records linked to wrong accounts.

"We bought Agentforce licenses but they're just sitting there. Same companies listed 3-4 different ways, contacts missing emails, opportunities linked to wrong accounts. We need clean data before any AI can actually work." — r/salesforce user

I faced a similar problem building a database of AI researchers from academic lab websites. Scraping produced:

  • Name variations: "Julie Kallini" vs "J. Kallini", "Vincenzo Moscato" vs "Moscato, Vincenzo"
  • Typos: "Hanshen Chen" vs "Hansheng Chen", "Bryan Wiledr" vs "Bryan Wilder"
  • Career changes: Same person listed at "BAIR Lab" and later at "BAIR Lab (Former)" with different emails
  • GitHub handles: Sometimes the only reliable link between records like "Alexandra Butoi" and "A. Butoi" sharing butoialexandra
  • Nicknames: Some researchers go by shortened names (Robert/Bob, William/Bill)

The Dataset

We used a synthetic dataset designed to mirror real CRM deduplication challenges. The schema includes name, position, organization, email, university, and GitHub—with realistic data quality issues:

| row_id | name | position | organization | email | university | github |
|---|---|---|---|---|---|---|
| 2 | A. Butoi | PhD Student | Rycolab | [email protected] | ETH Zurich | butoialexandra |
| 8 | Alexandra Butoi | | Ryoclab | | | butoialexandra |
| 43 | Namoi Saphra | Research Fellow | | nsaphra@alumni | - | nsaphra |
| 47 | Naomi Saphra | | Harvard University | [email protected] | | nsaphra |
| 18 | T. Gupta | PhD Student | AUTON Lab (Former) | | Carnegie Mellon | tejus-gupta |
| 26 | Tejus Gupta | PhD Student | AUTON Lab | [email protected] | Carnegie Mellon | tejus-gupta |

The dataset contains 200 rows representing 160 unique people, with duplicates ranging from easy (exact GitHub match) to expert-level (minimal distinguishing information, adversarial pairs).

Alternative Approaches

| Approach | Limitation |
|---|---|
| Fuzzy string matching (fuzzywuzzy, rapidfuzz) | Threshold problem: too low catches false positives, too high misses "A. Butoi" ↔ "Alexandra Butoi", which has low character overlap |
| Email/ID matching | Fails when the same person has different emails at different institutions |
| Power Query Fuzzy Lookup | Single-column only, struggles with row alignment after merge; users report giving up |
| Rule-based systems | Requires maintaining hundreds of edge cases for name variations, nicknames, abbreviations |
| Enterprise MDM (Cloudingo, DataGroomr) | $1,000-2,500/year, requires deployment and oversight; overkill for one-off cleanups |
| Manual review | 200 rows × ~5 comparisons each = hours of tedious work |

The core issue: deduplication requires semantic understanding. Is "Robert" the same as "Bob"? Is "BAIR Lab (Former)" the same organization as "BAIR Lab"? These are judgment calls that need context.

How everyrow.io/clean Works

The deduplication pipeline has five stages:

  1. Semantic Item Comparison: Each row is compared against others using an LLM that understands context—recognizing that "A. Butoi" and "Alexandra Butoi" are likely the same person, or that "BAIR Lab (Former)" indicates a career transition rather than a different organization.

  2. Association Matrix Construction: Pairwise comparison results are assembled into a matrix of match/no-match decisions. To scale efficiently, items are first clustered by embedding similarity, so only semantically similar items are compared.

  3. Equivalence Class Creation: Connected components in the association graph form equivalence classes. If A matches B and B matches C, then A, B, and C form a single cluster representing one entity (see the code sketch after this list).

  4. Validation: Each multi-member cluster is re-evaluated to catch false positives—cases where the initial comparison was too aggressive.

  5. Candidate Selection: For each equivalence class, the most complete/canonical record is selected as the representative (e.g., preferring "Alexandra Butoi" over "A. Butoi").
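
Stages 2 through 5 can be sketched in ordinary Python. This is an illustration of the clustering logic, not everyrow's implementation: llm_same_person is a stand-in for the stage-1 LLM comparison (here it just compares GitHub handles), candidate pairs are not pre-filtered by embeddings, and the example email is a placeholder.

from itertools import combinations

# Toy records modeled on the dataset above; the email is a placeholder.
rows = [
    {"row_id": 18, "name": "T. Gupta", "org": "AUTON Lab (Former)", "email": "", "github": "tejus-gupta"},
    {"row_id": 26, "name": "Tejus Gupta", "org": "AUTON Lab", "email": "t.gupta@example.edu", "github": "tejus-gupta"},
    {"row_id": 6, "name": "Rohan Saha", "org": "Alberta", "email": "", "github": "simpleParadox"},
]

def llm_same_person(a, b):
    # Stand-in for the stage-1 semantic comparison. The real pipeline asks an
    # LLM and only compares pairs that land in the same embedding cluster.
    return a["github"] == b["github"]

# Stage 3: union-find turns pairwise match decisions into equivalence classes.
parent = {r["row_id"]: r["row_id"] for r in rows}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

for a, b in combinations(rows, 2):
    if llm_same_person(a, b):
        parent[find(a["row_id"])] = find(b["row_id"])

clusters = {}
for r in rows:
    clusters.setdefault(find(r["row_id"]), []).append(r)

# Stage 5: keep the most complete record of each class as its representative.
for members in clusters.values():
    canonical = max(members, key=lambda r: sum(bool(v) for v in r.values()))
    print(sorted(m["row_id"] for m in members), "->", canonical["name"])

Running this prints the two Gupta rows as one cluster with "Tejus Gupta" as the representative, and "Rohan Saha" as a singleton.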

The Solution: 20 Lines of Code

import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe


async def main():
    input_df = pd.read_csv("researchers.csv")  # 200 scraped researcher rows

    async with create_client() as client:
        async with create_session(client, name="Researcher Dedupe") as session:
            result = await dedupe(
                session=session,
                input=input_df,
                equivalence_relation=(
                    "Two rows are duplicates if they represent the same person "
                    "despite different email/organization (career changes). "
                    "Consider name variations like typos, nicknames (Robert/Bob), "
                    "and format differences (John Smith/J. Smith)."
                ),
            )
            result.data.to_csv("deduplicated.csv", index=False)


asyncio.run(main())

The equivalence_relation parameter tells the AI what counts as a duplicate. Unlike regex or fuzzy matching, this is natural language that captures the semantic intent.
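
The same call adapts to the CRM scenario from the introduction by changing the relation text. A hypothetical sketch, assuming an exported accounts.csv (the file name and relation wording are mine, not part of the case study):

import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe


async def dedupe_accounts():
    async with create_client() as client:
        async with create_session(client, name="Account Dedupe") as session:
            result = await dedupe(
                session=session,
                input=pd.read_csv("accounts.csv"),  # hypothetical CRM export
                equivalence_relation=(
                    "Two rows are duplicates if they refer to the same company, "
                    "even when the name is abbreviated, punctuated differently "
                    "(Acme, Inc. vs Acme Inc), or listed under a subsidiary."
                ),
            )
            result.data.to_csv("accounts_deduplicated.csv", index=False)


asyncio.run(dedupe_accounts())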

Results

| Metric | Value |
|---|---|
| Input rows | 200 |
| Output rows (unique) | 156 |
| Expected unique | 160 |
| Duplicates found | 44 |
| Expected duplicates | 40 |
| Cluster accuracy | 156/160 (97.5%) |
| Row accuracy | 98% |
| Processing time | 90 seconds |
| Total cost | $0.42 |

Accuracy by difficulty:

  • Easy: 100% (12/12)
  • Medium: 93% (13/14)
  • Hard: 100% (10/10)
  • Expert: 50% (2/4 — these are adversarial cases with minimal distinguishing info)
  • Distractor: 95% (19/20 same-first-name pairs correctly identified as different people)
  • Singletons: 100% (100/100)
[Screenshot: everyrow.io dedupe session interface]

View full session →

Comparison Table

What About Fuzzy String Matching?

A common first approach is fuzzy string matching using libraries like rapidfuzz or fuzzywuzzy. This method compares all row pairs using token-sorted string similarity and groups rows exceeding a threshold using Union-Find clustering. We benchmarked this approach at two thresholds:

| Metric | Fuzzy (t=0.75) | Fuzzy (t=0.90) | everyrow.io |
|---|---|---|---|
| Row accuracy | 86% | 82% | 98% |
| Cluster accuracy | 82% | 78% | 98% |
| Easy duplicates | 58% (7/12) | 17% (2/12) | 100% (12/12) |
| Hard duplicates | 70% (7/10) | 10% (1/10) | 100% (10/10) |
| Distractor accuracy | 90% (18/20) | 100% (20/20) | 95% (19/20) |
| Processing time | 0.04s | 0.04s | 90s |
| Cost | $0 | $0 | $0.42 |

Fuzzy matching is ~2000x faster and free, but has a 12-16% accuracy gap. At aggressive thresholds (0.75), it catches more duplicates but risks false merges. At conservative thresholds (0.90), it avoids false merges but misses most semantic duplicates like "T. Gupta" ↔ "Tejus Gupta".
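
For reference, a minimal version of that fuzzy baseline looks roughly like this. It is a sketch, assuming each row's fields are concatenated into one string before comparison; the benchmark's exact preprocessing may differ.

from itertools import combinations

import pandas as pd
from rapidfuzz import fuzz

df = pd.read_csv("researchers.csv")
THRESHOLD = 0.90  # 0.75 for the aggressive setting

# Collapse each row's fields into a single comparison string.
texts = df.fillna("").astype(str).apply(" ".join, axis=1).tolist()

# Union-Find: link every pair whose token-sorted similarity clears the threshold.
parent = list(range(len(texts)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

for i, j in combinations(range(len(texts)), 2):
    if fuzz.token_sort_ratio(texts[i], texts[j]) / 100.0 >= THRESHOLD:
        parent[find(i)] = find(j)

# Keep the first row of each cluster as its representative.
df["cluster"] = [find(i) for i in range(len(texts))]
deduped = df.drop_duplicates(subset="cluster").drop(columns="cluster")
print(f"{len(df)} rows -> {len(deduped)} unique")

Whether a pair like "T. Gupta" ↔ "Tejus Gupta" clears the bar depends entirely on the threshold, which is exactly the tuning problem described above.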

Qualitative Comparison

| Capability | Fuzzy Matching | Manual Review | everyrow.io |
|---|---|---|---|
| Handles nicknames (T. ↔ Tejus) | ✗ | ✓ | ✓ |
| Handles career changes | ✗ | ✓ | ✓ |
| Avoids false merges (distractors) | ✓ (at high threshold) | ✓ | ✓ |
| Scales to 10K rows | ✓ | ✗ | ✓ |

What About ChatGPT?

A natural question: why not just upload the CSV to ChatGPT and ask it to deduplicate?

We tested this with the prompt:

Deduplicate this dataset. Remove duplicate rows, keeping one representative
row per unique entity. Equivalence rule: Same person = same individual.
Select the best representative from each equivalence class you identify.

Result: 56% row accuracy (vs everyrow's 98%)

| Metric | ChatGPT | everyrow.io |
|---|---|---|
| Row accuracy | 56% | 98% |
| Output rows | 72 | 156 |
| Expected unique | 160 | 160 |
| Data loss (over-merged) | 88 people deleted | 4 people deleted |

The problem: ChatGPT over-merged aggressively, "hallucinating" duplicates that don't exist:

  • 88 unique people were incorrectly deleted (merged into other records)
  • Only 33% of singletons preserved — people with no duplicates were merged into unrelated records
  • Only 25% distractor accuracy — people with the same first name but different identities (like "Rohan Saha" and "Rohan Chandra") were incorrectly merged

This is the opposite failure mode from traditional deduplication. Rather than missing duplicates (under-merging), ChatGPT created false matches (over-merging), resulting in permanent data loss.

Example Matches

Correctly Identified Duplicates:

✓ Match: Name abbreviation
Row 2: "A. Butoi" — ETH Zurich, butoialexandra
Row 8: "Alexandra Butoi" — Ryoclab, butoialexandra
Matched via GitHub handle despite name abbreviation
✓ Match: Typo detection
Row 43: "Namoi Saphra" — nsaphra
Row 47: "Naomi Saphra" — Harvard, nsaphra
Detected typo in first name, matched via GitHub
✓ Match: Career transition
Row 18: "T. Gupta" — AUTON Lab (Former), tejus-gupta
Row 26: "Tejus Gupta" — AUTON Lab, tejus-gupta
Expanded initial, recognized organization continuity

Correctly Identified as Different People:

✗ No match: Different people
Row 6: "Rohan Saha" — Alberta, simpleParadox
Row 141: "Rohan Chandra" — UT Austin, rohanchandra30
Same first name, but different last name, institution, and GitHub

Lost clusters (the 4 people everyrow.io over-merged):

⚠ Over-merged: Same institution
Row 31: "Sarah Ball" — MCML, PhD student
Merged with: "Wen (Lavine) Lai" — MCML, PhD student
Same institution caused false match — different people
⚠ Over-merged: Co-authors
Rows 87, 109: "Marwa Abdulhai" — BAIR Lab
Merged with: "Tejus Gupta" — AUTON Lab
Collaboration signal overrode identity — they co-authored a paper
⚠ Over-merged: Co-authors + username-only names
Rows 55, 155: "Suvir Mirchandani" / smirchan
Rows 139, 175: "Igor Oliveira" / isobrrj
Merged with: "Vishnu Sarukkai" — Stanford SAIL
All three co-authored the same paper — username-only names made it harder

Key Takeaways

  • 98% row-level accuracy on a challenging dataset with conflicting signals (same person, different email/organization due to career changes)
  • $0.42 total cost and 90 seconds to process 200 records with AI-powered semantic matching
  • 20 lines of code using the everyrow SDK
  • Correctly handled name variations like "A. Butoi" ↔ "Alexandra Butoi", typos like "Namoi" ↔ "Naomi", and abbreviations like "T. Gupta" ↔ "Tejus Gupta"
  • 95% accuracy on "distractor" pairs (people with same first name but different identities)

When to Use everyrow.io/clean

This approach works best when:

  • Semantic judgment required: Name variations, abbreviations, nicknames
  • Conflicting signals: Same person with different emails/organizations over time
  • No single reliable key: Can't rely on email or ID alone
  • Moderate scale: 100-10,000 rows where manual review is painful but not impossible

Try It Yourself

Run your own dedupe:

  1. Get an API key at everyrow.io
  2. Install the SDK: uv pip install everyrow
  3. Write your equivalence relation in plain English
  4. View results in the web interface with full audit trail