TL;DR: Fuzzy string matching breaks down quickly with real-world data. LLMs and LLM-web-agents can merge data, but it takes some elbow grease to get good accuracy without costing an arm and a leg. I built a solution in the everyrow.io SDK that implements a tiered merge approach (exact → fuzzy → LLM → web) that optimizes for cost.
I, like every data engineer, have tried to join two tables that don't have a shared key. Maybe it's companies, or people's names, or addresses, or parts. Those strings never match well enough, but pandas.merge() or VLOOKUP requires exact matches:
# This only works when "company" columns match exactly
merged = pd.merge(left_table, right_table, on="company")
Consider these variations that represent the same company:
| Table A | Table B |
|---|---|
| Microsoft Corporation | Microsoft Corp. |
| Apple Inc. | AAPL |
| Meta | Facebook |
A standard merge would miss all of these. So next I would try fuzzy merge libraries like fuzzywuzzy or rapidfuzz, which compute string similarity scores:
from rapidfuzz import fuzz
similarity = fuzz.ratio("Microsoft Corporation", "Microsoft Corp.")
# Returns ~85% similarity
This effectively calculates an edit distance between the strings (like the Levenshtein distance) and normalizes it to get a similarity score between 0 and 1:
A similarity ≥ threshold qualifies as a fuzzy match.
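In code, a thresholded fuzzy lookup could look like this (a minimal sketch using rapidfuzz; fuzzy_lookup is just an illustrative helper, and rapidfuzz scores on a 0–100 scale, so the 0–1 threshold is scaled up):
from rapidfuzz import fuzz, process

def fuzzy_lookup(name, candidates, threshold=0.9):
    # extractOne returns (best_match, score, index), or None if nothing clears the cutoff
    best = process.extractOne(
        name, candidates,
        scorer=fuzz.ratio,
        score_cutoff=threshold * 100,  # rapidfuzz scores range from 0 to 100
    )
    return best[0] if best else None

candidates = ["Microsoft Corporation", "Apple Inc.", "Meta"]
fuzzy_lookup("Micposoft Corporation", candidates)  # -> "Microsoft Corporation" (one typo, ~0.95 similarity)
fuzzy_lookup("AAPL", candidates)                   # -> None: no string-level similarity to "Apple Inc."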
But this won't work even on my three-company example. There's no threshold that will match "Facebook" to "Meta" without also introducing huge numbers of false positives. And if an optimal threshold even exists, it varies from dataset to dataset.
If you really want high accuracy, like I do, then world knowledge is sometimes required, like "McGraw Hill" ↔ "S&P Global".
LLMs make much, much higher quality semantic matching possible. First, you can use embeddings, which convert text into high-dimensional vectors where semantically similar entities cluster together. This works well for simple relationships, and the OpenAI embeddings API is extremely cheap. But the quality still isn't great.
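For reference, the embedding route looks something like this. This is a minimal sketch assuming the official OpenAI Python client; the model name and the "take the argmax" matching step are illustrative, not the SDK's method:
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # One API call embeds a whole batch of strings
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

left = embed(["Microsoft Corporation", "Apple Inc.", "Meta"])
right = embed(["Microsoft Corp.", "AAPL", "Facebook"])

# Cosine similarity between every left/right pair
left_n = left / np.linalg.norm(left, axis=1, keepdims=True)
right_n = right / np.linalg.norm(right, axis=1, keepdims=True)
similarity = left_n @ right_n.T

# For each left entity, take the most similar right entity
best = similarity.argmax(axis=1)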
Full-powered LLMs can fundamentally do this as well as humans. And LLM web agents can do this as well as humans with access to Google. So I started running experiments with these.
The first thing you immediately notice is that LLM web agents are overkill for most rows. I needed to use the simplest method that works for each row, so I implemented a hierarchical cascade, escalating to more expensive methods only when it looks like the cheaper ones would fail (sketched after the list below):
- Exact string match: Basically a lookup-table
- Fuzzy string match: Levenshtein-based with a high threshold of 0.9
- LLM merge: A cheap LLM, run in batches, finds a match if it is confident
- LLM-Web-Agent merge: Browse the web to find out which entities really correspond to each other
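Conceptually, the cascade for a single row looks something like this. This is a simplified sketch, not the SDK internals; llm_match and web_agent_match are placeholders for the two LLM-based tiers:
from rapidfuzz import fuzz, process

def cascade_match(left_value, right_values, fuzzy_threshold=0.9):
    # 1. Exact string match: basically a lookup table
    if left_value in right_values:
        return left_value, "exact"
    # 2. Fuzzy string match: Levenshtein-based with a high threshold
    best = process.extractOne(left_value, right_values,
                              scorer=fuzz.ratio,
                              score_cutoff=fuzzy_threshold * 100)
    if best:
        return best[0], "fuzzy"
    # 3. Cheap LLM, run in batches: only returns a match if it is confident
    candidate = llm_match(left_value, right_values)  # placeholder
    if candidate:
        return candidate, "llm"
    # 4. Web agent: browse the web to resolve what the entities really are
    return web_agent_match(left_value, right_values), "web"  # placeholder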
Cost and Accuracy of the Progressive Intelligence Approach
Let's look at a real-world example.
I used a dataset of 438 S&P 500 companies with columns for company name, ticker, fair value, price, market cap, shares, and CEO. This gave us ground truth for validation.
data = pd.read_csv("data/companies.csv")
# Columns: company, ticker, fair_value, price, mkt_cap, shares, CEO
For each experiment, I split the data into two tables, corrupted values, and removed columns. The original dataset then serves as ground truth for the different experiments.
We ran three experiments, each testing a different merging challenge:
- Fuzzy matching with noise from corrupt company names
- LLM merge without common columns by matching companies to tickers
- Web merge for association by matching CEOs to companies
Note that this third one is not a typical join. CEOs and companies are not the same entity under different names; the merge builds a semantic relationship between them. LLMs are strictly needed for this.
My solution is now part of the everyrow.io SDK, so re-doing this experiment is pretty straightforward:
from everyrow import create_session
from everyrow.ops import merge
from everyrow.generated.models import LLMEnum
async with create_session(name="Merge companies") as session:
    task = await merge(
        session=session,
        task="...",  # Natural language description of what you're merging (helps the LLM, not necessary in general)
        left_table=left_df,  # table to merge into, as a pandas DataFrame
        right_table=right_df,  # table to merge from, as a pandas DataFrame
        merge_on_left="...",  # Optional: specify merge column
        merge_on_right="...",  # Optional: specify merge column
        merge_model=LLMEnum.GEMINI_3_FLASH_MINIMAL,  # Optional: specify LLM model (default: GEMINI_3_FLASH_MINIMAL)
    )
    result = await task.await_result()  # result is the DataFrame with the merged tables
Parameters:
- task: Natural language description of what you're merging
- left_table / right_table: Pandas DataFrames
- merge_on_left / merge_on_right: Optional column names for string/fuzzy matching
The result includes a research column with justifications for each match, allowing you to audit the matching logic.
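Assuming the result comes back as a pandas DataFrame, as in the snippet above, a quick way to spot-check those justifications is:
# The research column holds a short justification for each matched row
print(result["research"].head())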
You will also see a link to your session in the everyrow.io UI, where you can watch the merge process in real time.
Experiment 1: Fuzzy String Matching Under Noise
In our first experiment, we progressively corrupt the data to simulate real-world data quality issues. We corrupt company names with a simple randomization function:
import random
import string

def randomize_string(text, p=0.1):
    """Replace each character with probability p."""
    result = []
    for char in text:
        if random.random() < p:
            result.append(random.choice(string.ascii_letters))
        else:
            result.append(char)
    return ''.join(result)

def randomize_string_column(df, column, p=0.1):
    # Assumed helper: applies randomize_string to every value in the given column
    df = df.copy()
    df[column] = df[column].apply(lambda s: randomize_string(s, p))
    return df

left_table = data.copy()
left_table.drop(columns=["fair_value"], inplace=True)
right_table = data[["company", "fair_value"]]
right_table = randomize_string_column(right_table, "company", p=0.1)
The noise level p is the probability that a character gets corrupted (removed or replaced). With p=0.1 (i.e. 10%), "Microsoft" might become "Micposoft" or "Micrsft".
Running the experiment for the different noise levels, we use these arguments:
result = await merge(
    session=session,
    task="Merge the tables on company name",
    left_table=left_table,
    right_table=right_table,
    merge_on_left="company",
    merge_on_right="company",
)
Let us look at the distribution of matching methods, accuracy, and price across all rows.
| Noise Level | Matched | Exact Match | Fuzzy Match | LLM Match | Web Match | Accuracy | Price |
|---|---|---|---|---|---|---|---|
| 0% | 100% | 100% | 0% | 0% | 0% | 100% | $0.13 |
| 5% | 100% | 49.8% | 30.6% | 19.6% | 0% | 100% | $0.32 |
| 10% | 100% | 26.5% | 30.8% | 42.7% | 0% | 100% | $0.44 |
(The $0.13 at 0% noise, where no LLM matching is needed, is overhead: the tool still uses an LLM to set up the problem.) At 10% character corruption, exact string matching handles only 27% of rows. Fuzzy matching catches another 30%, but 43% require LLM matching to resolve.
Let us look at some examples where fuzzy matching failed but the LLM succeeded:
| company name | company name (with noise) |
|---|---|
| Universal Health Services | Univebsal Health ServucQv |
| Boston Scientific | BostonNSchentifia |
| FactSet | FuctSet |
As we see, for short company names ("FactSet"), a single wrong letter is enough to get below the fuzzy string threshold of 0.9.
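You can check this directly with rapidfuzz (scores are on a 0–100 scale):
from rapidfuzz import fuzz

fuzz.ratio("FactSet", "FuctSet")  # ≈ 86, below the 0.9 (i.e. 90) cutoff
fuzz.ratio("Universal Health Services", "Univebsal Health ServucQv")  # ≈ 84, also below the cutoff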
Despite the heavy corruption, we achieved 100% accuracy with 0% false positives across all noise levels. The cascade automatically escalated to more sophisticated methods as needed, at a correspondingly higher cost.
Link to public session (p=0)
Link to public session (p=0.05)
Link to public session (p=0.1)
Experiment 2: LLM Merge for Semantic Relationships
Here we look at a more complex scenario: the two tables share no common column or identifier, and we have to match purely based on the table contents.
Left table:
| company | price | mkt_cap |
|---|---|---|
| Apple Inc. | 24.71 | 541.49B |
| Microsoft | 28.93 | 412.32B |
| Alphabet | 31.22 | 298.17B |
| ... | ... | ... |
Right Table:
| ticker | fair_value |
|---|---|
| AAPL | 1593.98 |
| MSFT | 1392.16 |
| GOOGL | 1428.66 |
| ... | ... |
This is where semantic matching can shine. We can give the LLM a hint that the ticker is an important column to identify the company:
result = await merge(
    session=session,
    task="Merge the tables based on company name and ticker",
    left_table=left_table,
    right_table=right_table,
)
Here we left out the merge columns, which automatically skips the exact and fuzzy string matching steps. Instead, the merge cascade goes straight to LLM reasoning and, if that fails, falls back to web verification.
| Matched | Exact Match | Fuzzy Match | LLM Match | Web Match | Accuracy | False Positives | Price |
|---|---|---|---|---|---|---|---|
| 100% | 0% | 0% | 99.8% | 0.2% | 100% | 0% | $1.00 |
437 of 438 rows matched using pure LLM reasoning. One edge case required web verification.
This should not be surprising: company tickers are stable, and LLMs have memorized them well.
Link to public session
Experiment 3: Web Merge for Dynamic Data
Let us now look at a case where the data is dynamic and web verification is needed for several entities.
From our dataset, the right table keeps only the CEO names, which we match back to the companies.
Left table:
| company | price | mkt_cap |
|---|---|---|
| Apple Inc. | 24.71 | 541.49B |
| Microsoft | 28.93 | 412.32B |
| Alphabet | 31.22 | 298.17B |
| ... | ... | ... |
Right table:
| CEO |
|---|
| Tim Cook |
| Satya Nadella |
| Sundar Pichai |
| ... |
In this case, we do not want to specify merge columns and we adapt the prompt slightly:
result = await merge(
    session=session,
    task="Merge the CEO to the company information, use web search if unsure",
    left_table=left_table,
    right_table=right_table,
)
What is interesting to vary here, however, is the LLM model: the ability to pick out the right row in a large table (the "needle in a haystack" problem) differs between models, as do their confidence in a match and how well they follow the user prompt.
Therefore we compared GPT-5-MINI, Gemini-3-FLASH, and Claude-4.5-Sonnet, all with minimal CoT. Note, however, that these models only do the semantic matching; the web search itself is done by web agents with fixed LLM choices.
This time, the results show that the task was considerably harder for the web agents:
| LLM | Matched | LLM Match | Web Match | Accuracy | False Positives | Cost |
|---|---|---|---|---|---|---|
| GPT-5-MINI | 95.7% | 59.8% | 35.8% | 96.7% | 3.2% | $3.69 |
| GEMINI-3-FLASH | 100% | 89.9% | 1.1% | 87.7% | 12.3% | $0.75 |
| CLAUDE-4.5-SONNET | 99.5% | 69.6% | 29.9% | 91.6% | 8.4% | $9.57 |
We see that the agents now use web support in a significant number of cases, but how often varies greatly with the LLM choice, as do their willingness to commit to a match, the accuracy, and the false-positive rate.
Link to public session (GPT-5-MINI)
Link to public session (GEMINI-3-FLASH)
Link to public session (CLAUDE-4.5-SONNET)
You (or Claude Code) can easily try this yourself, or learn more, via the everyrow.io SDK docs.