
LLM Agents Solve the Table Merging Problem

How I join large tables with no matching keys cheaply and accurately with LLM web agents


TL;DR: Fuzzy string matching breaks down quickly with real-world data. LLMs and LLM-web-agents can merge data, but it takes some elbow grease to get good accuracy without costing an arm and a leg. I built a solution in the everyrow.io SDK that implements a tiered merge approach (exact → fuzzy → LLM → web) that optimizes for cost.

I, like every data engineer, have tried to join two tables that don't share a key. Maybe it's companies, or people's names, or addresses, or parts. The strings never match well enough, yet pandas.merge() or VLOOKUP requires exact matches:

# This only works when "company" columns match exactly
merged = pd.merge(left_table, right_table, on="company")

Consider these variations that represent the same company:

| Table A | Table B |
| --- | --- |
| Microsoft Corporation | Microsoft Corp. |
| Apple Inc. | AAPL |
| Facebook | Meta |

A standard merge would miss all of these. So next I would try fuzzy matching libraries, like fuzzywuzzy or rapidfuzz, which compute string similarity scores:

from rapidfuzz import fuzz

similarity = fuzz.ratio("Microsoft Corporation", "Microsoft Corp.")
# Returns ~85% similarity

This effectively calculates a distance $d(s_1, s_2)$ between strings $s_1$ and $s_2$ (such as the Levenshtein distance) and normalizes it to get a similarity score between 0 and 1:

$$\text{similarity}(s_1, s_2) \approx 1 - \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)}$$

A pair with similarity at or above a chosen threshold qualifies as a fuzzy match.
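
In practice you take, for each left-side value, the best-scoring right-side candidate above the cutoff. Here is a minimal sketch with rapidfuzz, which scores on a 0-100 scale (so a 0.9 threshold becomes a cutoff of 90):

from rapidfuzz import fuzz, process

def fuzzy_lookup(name, candidates, cutoff=90):
    """Return the best candidate scoring at or above cutoff, else None."""
    hit = process.extractOne(name, candidates, scorer=fuzz.ratio, score_cutoff=cutoff)
    return hit[0] if hit else None

fuzzy_lookup("Microsoft Corporation", ["Microsoft Corp.", "AAPL", "Meta"])
# -> None: no candidate clears the cutoff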

But this fails even on my three-company example. No threshold will match "Facebook" to "Meta" without also introducing huge numbers of false positives. And if an optimal threshold exists at all, it varies from dataset to dataset.

If you really want high accuracy, like I do, then world knowledge is sometimes required, like "McGraw Hill" ↔ "S&P Global".

LLMs make much higher-quality semantic matching possible. The first option is embeddings: converting text into high-dimensional vectors where semantically similar entities cluster together. This works well for simple relationships, and the OpenAI embeddings API is extremely cheap, but the quality still isn't great.
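
As a concrete sketch, assuming the official OpenAI Python client and the text-embedding-3-small model (any embedding model works the same way):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

left = ["Microsoft Corporation", "Apple Inc.", "Facebook"]
right = ["Microsoft Corp.", "AAPL", "Meta"]
L, R = embed(left), embed(right)
# OpenAI embeddings come back unit-length, so a dot product is cosine similarity.
nearest = (L @ R.T).argmax(axis=1)  # closest right-side row for each left-side row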

Full-powered LLMs can fundamentally do this as well as humans. And LLM web agents can do this as well as humans with access to Google. So I started running experiments with these.

The first thing you notice is that LLM web agents are overkill for most rows. I needed the simplest method that works for each row, so I implemented a hierarchical cascade (sketched after the list below), escalating to a more expensive method only when it looks like the cheaper ones would fail:

  1. Exact string match: basically a lookup table
  2. Fuzzy string match: Levenshtein-based, with a high threshold of 0.9
  3. LLM merge: a cheap LLM, run in batches, finds a match only if it is confident
  4. LLM-web-agent merge: browse the web to work out which entities really correspond to each other
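
Conceptually, the per-row control flow looks like this. It is a simplified illustration, not the SDK's actual internals: match_with_llm and match_with_web_agent are hypothetical stand-ins for the two LLM tiers, and the real implementation batches rows.

from rapidfuzz import fuzz

def cascade_match(name, candidates, fuzzy_threshold=0.9):
    # Tier 1: exact match is just a membership lookup.
    if name in candidates:
        return name, "exact"
    # Tier 2: fuzzy match with a high cutoff to avoid false positives.
    best = max(candidates, key=lambda c: fuzz.ratio(name, c))
    if fuzz.ratio(name, best) / 100 >= fuzzy_threshold:
        return best, "fuzzy"
    # Tier 3: a cheap LLM proposes a match only if it is confident.
    match = match_with_llm(name, candidates)  # hypothetical helper
    if match is not None:
        return match, "llm"
    # Tier 4: a web agent researches the entity as a last resort.
    return match_with_web_agent(name, candidates), "web"  # hypothetical helper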

Cost And Accuracy of the Progressive Intelligence Approach

Let's look at a real-world example.

I used a dataset of 438 S&P 500 companies with columns for company name, ticker, fair value, price, market cap, shares, and CEO. This gave us ground truth for validation.

data = pd.read_csv("data/companies.csv")
# Columns: company, ticker, fair_value, price, mkt_cap, shares, CEO

For the different experiments, I split the data into two tables, corrupted values, and removed columns; the original dataset then serves as ground truth.

We ran three experiments, each testing a different merging challenge:

  1. Fuzzy matching with noise from corrupt company names
  2. LLM merge without common columns by matching companies to tickers
  3. Web merge for association by matching CEOs to companies

Note that the third one is not a typical join: a CEO and a company are not the same entity under different names, so the merge has to build a semantic relationship. LLMs are strictly necessary here.

My solution is now part of the everyrow.io SDK, so re-doing this experiment is pretty straightforward:

from everyrow import create_session
from everyrow.ops import merge
from everyrow.generated.models import LLMEnum

async with create_session(name="Merge companies") as session:
    task = await merge(
            session=session,
            task="...", # Natural language description of what you're merging (helps the LLM, not necessary in general)
            left_table=left_df, # table to merge into as a pandas DataFrame
            right_table=right_df, # table to merge from as a pandas DataFrame
            merge_on_left="...",  # Optional: specify merge column
            merge_on_right="...", # Optional: specify merge column
            merge_model=LLMEnum.GEMINI_3_FLASH_MINIMAL # Optional: specify LLM model (default: GEMINI_3_FLASH_MINIMAL)
    )
    result = await task.await_result() # result is the dataframe with merged tables

Parameters:

  • task: Natural language description of what you're merging
  • left_table / right_table: Pandas DataFrames
  • merge_on_left / merge_on_right: Optional column names for string/fuzzy matching

The result includes a research column with justifications for each match, allowing you to audit the matching logic.
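For example, after a run you can spot-check a few justifications (a minimal sketch; apart from research, the exact columns depend on your input tables):

# Inspect the LLM's reasoning for the first few matched rows.
print(result[["company", "research"]].head())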

You will also see a link to your session in the everyrow.io UI, where you can watch the merge process in real time.

Experiment 1: Fuzzy String Matching Under Noise

In the first experiment, we progressively corrupt the data to simulate real-world data quality issues, randomizing company names with a simple function:

import random
import string

def randomize_string(text, p=0.1):
    """Replace each character with probability p."""
    result = []
    for char in text:
        if random.random() < p:
            result.append(random.choice(string.ascii_letters))
        else:
            result.append(char)
    return ''.join(result)

left_table = data.copy()
left_table.drop(columns=["fair_value"], inplace=True)
right_table = data[["company", "fair_value"]]
right_table = randomize_string_column(right_table, "company", p=0.1)
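
The randomize_string_column helper is not defined in the snippet above; a minimal version, applying randomize_string to one column, might look like this:

def randomize_string_column(df, column, p=0.1):
    """Return a copy of df with the strings in `column` corrupted."""
    df = df.copy()
    df[column] = df[column].apply(lambda s: randomize_string(s, p=p))
    return df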

The noise level p is the probability that a character gets corrupted (removed or replaced). With p=0.1 (i.e. 10%), "Microsoft" might become "Micposoft" or "Micrsft".

We run the experiment at several noise levels with these arguments:

result = await merge(
    session=session,
    task="Merge the tables on company name",
    left_table=left_table,
    right_table=right_table,
    merge_on_left="company",
    merge_on_right="company",
)

Let us look at the distribution of matching methods, accuracy, and price across all rows.

| Noise Level | Matched | Exact Match | Fuzzy Match | LLM Match | Web Match | Accuracy | Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0% | 100% | 100% | 0% | 0% | 0% | 100% | $0.13 |
| 5% | 100% | 49.8% | 30.6% | 19.6% | 0% | 100% | $0.32 |
| 10% | 100% | 26.5% | 30.8% | 42.7% | 0% | 100% | $0.44 |

(The $0.13 at zero noise, where no LLM matching happens, is overhead from the tool itself, which uses LLMs to set up the problem.) At 10% character corruption, exact string matching handles only 27% of rows. Fuzzy matching catches another 31%, but 43% require LLM matching to resolve.

Let us look at some examples where fuzzy matching failed but the LLM succeeded:

| Company name | Company name (with noise) |
| --- | --- |
| Universal Health Services | Univebsal Health ServucQv |
| Boston Scientific | BostonNSchentifia |
| FactSet | FuctSet |

As we can see, for short company names ("FactSet"), a single wrong letter is enough to fall below the fuzzy threshold of 0.9.
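
You can verify this directly: rapidfuzz's fuzz.ratio counts a substitution as two insert/delete edits, so one wrong letter in a 7-character name costs 2 of 14 total characters:

from rapidfuzz import fuzz

fuzz.ratio("FactSet", "FuctSet")  # (1 - 2/14) * 100 ≈ 85.7, below the 90 (0.9) cutoff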

Despite the heavy corruption, we achieved 100% accuracy with 0% false positives across all noise levels. The cascade automatically escalated to more sophisticated methods as needed, at a correspondingly higher cost.

Link to public session (p=0)

Link to public session (p=0.05)

Link to public session (p=0.1)

Experiment 2: LLM Merge for Semantic Relationships

Here we look at a more complex scenario: the two tables share no common column or identifier, and we have to match purely on the information in the tables.

Left table:

| company | price | mkt_cap |
| --- | --- | --- |
| Apple Inc. | 24.71 | 541.49B |
| Microsoft | 28.93 | 412.32B |
| Alphabet | 31.22 | 298.17B |
| ... | ... | ... |

Right table:

| ticker | fair_value |
| --- | --- |
| AAPL | 1593.98 |
| MSFT | 1392.16 |
| GOOGL | 1428.66 |
| ... | ... |

This is where semantic matching shines. We can give the LLM a hint that the ticker is an important column for identifying the company:

result = await merge(
    session=session,
    task="Merge the tables based on company name and ticker",
    left_table=left_table,
    right_table=right_table,
)

Here we left out the merge columns, which automatically skips the exact and fuzzy string matching steps. The cascade goes straight to LLM reasoning and, if that fails, falls back to web verification.

| Matched | Exact Match | Fuzzy Match | LLM Match | Web Match | Accuracy | False Positives | Price |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 100% | 0% | 0% | 99.8% | 0.2% | 100% | 0% | $1.00 |

437 of 438 rows matched using pure LLM reasoning. One edge case required web verification.

This should not be surprising: company tickers are stable, and LLMs have memorized them well.

Link to public session

Experiment 3: Web Merge for Dynamic Data

Let us now look at a case where the data is dynamic and web verification is needed for several entities.

From our dataset, we match only the CEO names to the company information.

Left table:

| company | price | mkt_cap |
| --- | --- | --- |
| Apple Inc. | 24.71 | 541.49B |
| Microsoft | 28.93 | 412.32B |
| Alphabet | 31.22 | 298.17B |
| ... | ... | ... |

Right table:

| CEO |
| --- |
| Tim Cook |
| Satya Nadella |
| Sundar Pichai |
| ... |

In this case we again do not specify merge columns, and we adapt the prompt slightly:

result = await merge(
    session=session,
    task="Merge the CEO to the company information, use web search if unsure",
    left_table=left_table,
    right_table=right_table,
)

What is interesting to vary, however, is the LLM model: models differ in their ability to pay attention to a single row in a large table (the "needle in a haystack" problem), in their confidence when matching, and in how well they follow the user prompt.

We therefore compared GPT-5-MINI, GEMINI-3-FLASH, and CLAUDE-4.5-SONNET, all with minimal CoT. Note, however, that these models only do the semantic matching; the web search itself is done by web agents with fixed LLM choices.
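
Each run changes only merge_model. The non-Gemini enum members below are assumptions, named by analogy with the LLMEnum.GEMINI_3_FLASH_MINIMAL member shown earlier:

# Hypothetical LLMEnum members for the model sweep.
for model in (LLMEnum.GPT_5_MINI, LLMEnum.GEMINI_3_FLASH, LLMEnum.CLAUDE_4_5_SONNET):
    result = await merge(
        session=session,
        task="Merge the CEO to the company information, use web search if unsure",
        left_table=left_table,
        right_table=right_table,
        merge_model=model,
    )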

This time, the results show that the task was considerably harder:

| LLM | Matched | LLM Match | Web Match | Accuracy | False Positives | Cost |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-5-MINI | 95.7% | 59.8% | 35.8% | 96.7% | 3.2% | $3.69 |
| GEMINI-3-FLASH | 100% | 89.9% | 1.1% | 87.7% | 12.3% | $0.75 |
| CLAUDE-4.5-SONNET | 99.5% | 69.6% | 29.9% | 91.6% | 8.4% | $9.57 |

We see that the models now escalate to web support in a significant share of cases, but the escalation rate varies vastly with the LLM choice, as do their confidence in matching rows, the accuracy, and the false positive rate.

Link to public session (GPT-5-MINI)

Link to public session (GEMINI-3-FLASH)

Link to public session (CLAUDE-4.5-SONNET)

You (or Claude Code) can easily try this yourself, or learn more, via the everyrow.io SDK docs.