
Using LLMs for Data Cleaning At Scale

Deduplicating 20,000 rows with 11,000 LLM calls cheaply and accurately

Previously we wrote about how we use LLMs to deduplicate data, but that was only 200 rows. So we asked: can we scale this up 100x? We then worked to keep the cost low while still correctly matching rows using semantic knowledge, even when each row has 20,000 potential rows to match against.

Results

Experiment 1 (FDA drug products, 20,000 rows):

  • F1 (harmonic mean of precision and recall) = 0.996
  • ~11,000 LLM calls in a single 25-minute run, with no failures
  • Cost of ~$1.12 per 1,000 rows

Experiment 2 (company names, 8,628 rows):

  • F1 = 0.976, robust across entity types
  • Cost scales linearly and predictably

We also tested on 5 additional datasets across people, transactions, and products, each with different equivalence rules, to confirm the LLMs were solving the problem in full generality rather than fitting one schema.

Scaling Experiment 1: FDA Drug Products

A synthetic dataset based on the FDA Approved Drug Products database, tested at increasing sizes (500 to 20,000 rows) with a 10% duplicate rate. Each row contains a trade name, active ingredient, applicant, strength, and dosage form. Duplicates are created through typos, abbreviation changes, and reordering: the kind of variation that defeats exact matching but is obvious to a human reader.
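To make those perturbations concrete, here is a minimal sketch of how such duplicates could be generated. The abbreviation table, the random choices, and the helper itself are illustrative, not the actual dataset generator.

import random

# Illustrative perturbations only; the real dataset generator is not shown here.
ABBREVIATIONS = {"HYDROCHLORIDE": "HCL", "PHARMACEUTICALS": "PHARMS"}

def make_duplicate(row: dict) -> dict:
    dup = dict(row)
    # Abbreviation change, e.g. "MINOCYCLINE HYDROCHLORIDE" -> "MINOCYCLINE HCL"
    for long_form, short_form in ABBREVIATIONS.items():
        if long_form in dup["trade_name"]:
            dup["trade_name"] = dup["trade_name"].replace(long_form, short_form)
            break
    # Typo: drop one character from the applicant name
    if len(dup["applicant"]) > 3:
        i = random.randrange(len(dup["applicant"]))
        dup["applicant"] = dup["applicant"][:i] + dup["applicant"][i + 1:]
    # Reordering: swap the semicolon-separated parts of the dosage form
    dup["dosage_form"] = ";".join(reversed(dup["dosage_form"].split(";")))
    return dup

row = {
    "trade_name": "MINOCYCLINE HYDROCHLORIDE",
    "ingredient": "MINOCYCLINE HYDROCHLORIDE",
    "applicant": "SUN PHARM INDUSTRIES",
    "strength": "EQ 75MG BASE",
    "dosage_form": "TABLET;ORAL",
}
print(make_duplicate(row))
# e.g. trade_name "MINOCYCLINE HCL", a one-character typo in the applicant, dosage_form "ORAL;TABLET"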

Equivalence rule

Same ingredient + same strength + same applicant + same dosage form = duplicate. Products must match on ALL four fields.
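In the dedupe call shown under Steps to Reproduce below, that rule is passed as plain text. The sketch below shows what that could look like for this dataset; the rule wording, file name, and session name are illustrative, and the call pattern simply mirrors the snippet at the end of this post.

import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe

# One illustrative phrasing of the FDA equivalence rule above.
FDA_RULE = (
    "Two products are duplicates only if ingredient, strength, applicant, "
    "and dosage form all match; ignore trade name formatting, typos, and "
    "abbreviation differences."
)

async def main() -> None:
    # Columns: trade_name, ingredient, applicant, strength, dosage_form
    fda_df = pd.read_csv("fda_products.csv")
    async with create_client() as client:
        async with create_session(client, name="FDA Dedupe") as session:
            result = await dedupe(
                session=session,
                input=fda_df,
                equivalence_relation=FDA_RULE,
            )
            result.data.to_csv("fda_deduplicated.csv", index=False)

asyncio.run(main())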

Figure: FDA deduplication scaling analysis showing error rates, runtime, cost, and LLM calls from 500 to 20,000 rows.

The FDA dataset shows near-perfect precision across all sizes: the system almost never incorrectly merges distinct products. The false negative rate (missed duplicates) stays below 1% even at 20,000 rows. Cost scales linearly at approximately $1.12 per 1,000 rows.
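Because the scaling is linear, the per-1,000-row rate is all you need for a back-of-the-envelope estimate. The helper below is just that arithmetic applied to the measured ~$1.12 rate; nothing else about it is specific to the tool.

# Rough cost estimate from the measured linear rate (~$1.12 per 1,000 rows).
COST_PER_1K_ROWS_USD = 1.12

def estimated_cost_usd(n_rows: int) -> float:
    return n_rows / 1000 * COST_PER_1K_ROWS_USD

print(f"{estimated_cost_usd(20_000):.2f}")  # ~22.40 USD for the full 20,000-row run
print(f"{estimated_cost_usd(500):.2f}")     # ~0.56 USD for the smallest size tested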

trade_name | ingredient | applicant | strength | dosage_form
IRBESARTAN | IRBESARTAN | CHARTWELL RX | 75MG | TABLET;ORAL
TRIKAFTA (COPACKAGED) | ELEXACAFTOR, IVACAFTOR, TEZACAFTOR; IVACAFTOR | VERTEX PHARMS INC | 100MG,75MG,50MG; 150MG | TABLET;ORAL
QUETIAPINE FUMARATE | QUETIAPINE FUMARATE | MEDICAP LABS | EQ 50MG BASE | TABLET, EXTENDED RELEASE;ORAL
MINOCYCLINE HYDROCHLORIDE | MINOCYCLINE HYDROCHLORIDE | SUN PHARM INDUSTRIES | EQ 75MG BASE | TABLET;ORAL

Scaling Experiment 2: Company Names

A synthetic dataset of company names at increasing sizes (500 to ~8,600 rows) with a 10% duplicate rate. Duplicates are created through abbreviations ("Inc" to "Incorporated"), case changes, punctuation removal, and minor typos: the kind of real-world variation found in CRM exports and financial data.

Equivalence rule

Same company = same legal entity.

Figure: Company name deduplication scaling analysis showing error rates, runtime, cost, and LLM calls from 500 to 8,628 rows.

Company name deduplication is inherently harder than structured pharmaceutical data. With only a name and sector to work with, the system must make semantic judgments: is "AXON ENTERPRISE, INC." the same as "Axon Enterprise Inc"? (Yes.) Is "VIRTU FINANCIAL INC CLASS A" the same as "VIRTU FINANCIAL INC"? (Depends on the rule.) Despite this ambiguity, F1 holds at 0.976 at the largest scale tested.
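To see why this takes semantic judgment rather than string cleanup, consider a naive normalizer that only lowercases and strips punctuation. This is purely illustrative, not how the dedupe tool works: it handles the Axon pair, but it cannot equate "Inc" with "Incorporated" or decide whether a share-class qualifier matters under the chosen rule.

import re

def normalize(name: str) -> str:
    # Lowercase and strip punctuation: handles case and comma/period noise only.
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

# Case and punctuation differences: normalization is enough.
print(normalize("AXON ENTERPRISE, INC.") == normalize("Axon Enterprise Inc"))          # True

# Abbreviation expansion and rule-dependent qualifiers: normalization cannot decide these.
print(normalize("Axon Enterprise Incorporated") == normalize("Axon Enterprise Inc"))   # False
print(normalize("VIRTU FINANCIAL INC CLASS A") == normalize("VIRTU FINANCIAL INC"))    # False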

For a step by step walkthrough of deduplicating company and contact data, see How to Deduplicate a Contact List with Fuzzy Name Matching.


Cost and Accuracy Scaling

Everyone loves scaling laws, so we ran 5 more experiments with duplicates across different entity types and with equivalence rules of varying difficulty, i.e. how hard it is to tell whether two rows actually match.

Dataset | Entity | Rows | Dup % | F1 | Precision | Recall | Cost | $/1k rows
Small Companies | company | 200 | 8% | 1.000 | 1.000 | 1.000 | $0.18 | $0.90
Medium People | person | 1,000 | 20% | 0.994 | 0.993 | 0.996 | $1.18 | $1.18
Medium Transactions | tsx | 1,000 | 20% | 0.945 | 0.928 | 0.963 | $1.41 | $1.41
Large Companies (Messy) | company | 3,000 | 10% | 0.974 | 0.973 | 0.976 | $3.21 | $1.07
Large Products (FDA-like) | product | 5,000 | 5% | 0.997 | 0.998 | 0.996 | $6.37 | $1.27

All datasets use ground truth labels. F1 is the harmonic mean of Precision (how many merges were correct) and Recall (how many true duplicates were found). Cost is total LLM inference cost in USD.

The hardest dataset, Medium Transactions, involves property sale records where the same sale can appear with different address formats, and adjacent properties share parcel IDs. Even here, F1 reaches 0.945.

Understanding deduplication quality requires distinguishing two separate error types:

  • Over-merging (low Precision): Distinct entities are incorrectly grouped together. This causes data loss: you lose real records. This is the dangerous failure mode.
  • Under-merging (low Recall): True duplicates are missed. Your data stays messy, but nothing is lost. This is the safe failure mode.

EveryRow's dedupe tool is tuned to strongly prefer under-merging to over-merging. In the FDA dataset at 20,000 rows, Precision is 1.000 (zero false merges) while Recall is 0.992 (a small number of duplicates are missed). The system is conservative: it only merges when it is confident.

F1, the harmonic mean of Precision and Recall, summarizes overall quality in a single number. An F1 of 0.996 means the system gets almost every decision right.
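As a worked example using the FDA figures above (standard formula, nothing product-specific):

# F1 as the harmonic mean of precision and recall,
# using the 20,000-row FDA figures quoted above.
precision = 1.000  # zero false merges
recall = 0.992     # a small number of missed duplicates

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.996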


Steps to Reproduce

The FDA dataset can be found at FDA Drugs@FDA Data Files. Or you can deduplicate your own dataset:

  1. Get an API key at everyrow.io/api-key (based on our scaling, the free tier should support about a 15k-row dedupe)

  2. Upload your CSV either in the app, with Claude Code, or using the code snippet below.

That's it.

import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe


async def main() -> None:
    # Load the CSV you want to deduplicate.
    input_df = pd.read_csv("your_data.csv")

    async with create_client() as client:
        async with create_session(client, name="Dedupe") as session:
            # State in plain language when two rows count as the same entity.
            result = await dedupe(
                session=session,
                input=input_df,
                equivalence_relation=(
                    "Same company = same legal entity"
                ),
            )
            # Write the deduplicated rows back out.
            result.data.to_csv("deduplicated.csv", index=False)


asyncio.run(main())

Install with: pip install everyrow

Resources