SourceVerify Research
Paper · May 2026

Bigger models hallucinate less.
On niche topics, every model hallucinates more.
Both effects are predictable.

We asked language models to cite real scholarly work on 24 research topics, then verified every citation. The picture that emerges is simple: an LLM stores concepts in overlapping superposition, so each fact is read out against background interference from every other fact. Larger models have a lower noise floor, and topics with more training data sit higher above it. Niche topics hurt every model; small models miss on every topic.

38
models
24
topics
8,829
references
Paper

Smith, Shock, Segun, Olatunji, and Bissyandé, 2026. Factual recall errors in large language models follow log-linear scaling laws.

arXiv preprint, May 2026.

Cite
@article{smith2026recall,
  title         = {Factual recall errors in large
                   language models follow log-linear
                   scaling laws},
  author        = {Smith and Shock and Segun and
                   Olatunji and Bissyand\'e},
  year          = {2026},
  archivePrefix = {arXiv}
}
§ 01 · Findings

Two variables. Sixty percent of the variance.

The log of the model's parameter count, and the log of how often a topic appears in the literature. Together those two numbers explain most of what we see across 743 model-by-topic cells. Size carries about three times the weight of topic, but topic matters enough that even frontier models miss on underrepresented topics.

§ 01 · A

Size sets the floor.

A model's parameter count strongly shapes how often it can produce real references. The relationship is log-linear within a family and sigmoidal across the population: a sharp climb from 10B through 400B, with diminishing returns beyond.

Within-family R²
0.79
Half-max
~135B parameters
Models
16 dense, 4 families

Citation quality vs total parameters, log scale. R² = 0.79 within dense families; family offsets reflect training procedure, not scale.

§ 01 · B

Topic sets the signal strength.

The same model that nails climate change fabricates citations on rural school-dropout policy. Reading down a column shows how recall collapses as topics get rarer; reading across a row shows the size effect within a topic.

Topic range
S ≈ 32 → 1.2M papers
Best-case spread
0.90 → 0.43 (same model)
Topic effect
≈ ⅓ the size effect

Citation quality, model × topic. Topics ordered by OpenAlex publication count (most-represented left). Five orders of magnitude in topic representation; quality falls along both axes.

§ 01 · C

If signal strength matters, this is what we should see.

A within-topic test of the same hypothesis. If recall really is gated by training-data frequency, then for any single topic, smaller models should pull only the most-cited papers and larger ones should be able to reach further down the citation tail. They do: the median citation count of correctly recalled references drops log-linearly with model size. The same noise-floor story holds at the level of individual papers, not just whole topics.

Slope
−0.35 per decade of params
8B median
~2,260 citations
405B median
~660 citations

Median citations of correctly recalled work, n = 10 dense models with ≥50 matched refs. Slope −0.35, R² = 0.59, Spearman ρ = −0.79 (p = 0.007).
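A quick arithmetic check of that slope against the two medians above; a sketch only, since the published slope is fit across all ten models and the endpoints need not land on it exactly:

import math

# 8B -> 405B spans about 1.7 decades of parameters.
decades = math.log10(405e9 / 8e9)

# Endpoint-implied slope from the two medians quoted above.
slope = math.log10(660 / 2260) / decades
print(round(slope, 2))  # -0.31, in the neighborhood of the fitted -0.35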

§ 01 · D

Hallucination isn't pass or fail.

It's tempting to think a model either gets a citation right or makes one up. The data says no. Across the Llama family, the share of fully-verified references climbs and the share of total fabrications drops, but a wide intermediate band persists at every size: real papers with a wrong year, a missing subtitle, a fabricated coauthor. As capacity falls, the fully-verified bucket shrinks and the corrupted-but-real bucket fills in before anything turns into pure invention. Recall degrades; it doesn't fail.

Llama 405B
85% real · 12% corrupted · 3% fake
Llama 70B
63% real · 19% corrupted · 18% fake
Llama 8B
30% real · 28% corrupted · 42% fake

Verification-status mix for Llama 8B → 70B → 405B. The 'verified-with-error' band sits in the middle at every size — confabulation moves through a graded continuum, not a switch from real to fake.

§ 02 · Why this happens

The model is louder than the signal,
until it isn't.

A modern LLM compresses the patterns from trillions of training tokens into a few billion parameters. There isn't enough room to give each pattern its own slot, so they end up sharing: overlapping in the same directions and interfering with each other. Researchers call this superposition.

Recalling a fact is a contest between two quantities: how strongly the concept was written into the model (its signal), and how loud the interference is from everything else (the noise floor). Whether the model produces a real citation or a confident fabrication depends on which side wins.

The fitted law
q = σ(α · log₁₀ P + β · log₁₀ S + c)
P
total parameters → sets the noise floor (∝ 1/√P)
S
topic representation → sets the signal strength
σ
sigmoid: bounded recall in [0, 1]
c
family offset: data quality, training procedure

Both contributions enter the SNR multiplicatively, which is what produces a sigmoid in their log-sum. The fit holds at R² = 0.60 across 743 cells.
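The law is compact enough to sketch in code. The coefficients below are back-solved from the two DeepSeek V4 Pro data points quoted at the end of this section (0.90 at S ≈ 1.2M, 0.43 at S = 32) together with the roughly 3:1 size-to-topic weight ratio; they are illustrative, not the paper's published fit.

import math

# Illustrative coefficients, back-solved from the worked example below;
# not the published fit.
ALPHA = 1.65   # weight per decade of parameters
BETA  = 0.54   # weight per decade of topic representation
C     = -21.2  # family offset

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def quality(params: float, topic_papers: float) -> float:
    """q = sigmoid(alpha * log10(P) + beta * log10(S) + c)"""
    return sigmoid(ALPHA * math.log10(params)
                   + BETA * math.log10(topic_papers) + C)

print(round(quality(1.6e12, 1.2e6), 2))  # ~0.90: climate change
print(round(quality(1.6e12, 32), 2))     # ~0.44: rural school dropout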

Three regimes

Floor

Noise drowns signal. Models produce templated fabrications: plausible authors, real journals, papers that don't exist.

Ramp

The log-linear regime. Each decade of parameters or each decade of topic data lifts quality by a fixed amount.

Ceiling

Capacity exceeds the concept inventory. Quality saturates near 1; further scaling buys little.

The largest model we tested, DeepSeek V4 Pro at 1.6T parameters, scores 0.90 on climate change (S ≈ 1.2M papers). On school dropout prevention in rural areas (S = 32) it scores 0.43. Solving the fit for what it would take to reach 0.90 on the second topic at today's training-data density: roughly 50 trillion parameters. Thirty times the largest model that exists.
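The 50-trillion figure is just this fit inverted; a sketch using the same illustrative coefficients as above:

import math

ALPHA, BETA, C = 1.65, 0.54, -21.2  # illustrative values from the sketch above

def params_needed(q_target: float, topic_papers: float) -> float:
    """Invert q = sigmoid(...) for P at fixed topic representation S."""
    logit = math.log(q_target / (1.0 - q_target))
    return 10.0 ** ((logit - BETA * math.log10(topic_papers) - C) / ALPHA)

print(f"{params_needed(0.90, 32):.1e}")  # ~5e13: roughly 50 trillion parameters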

§ 03 · What this means

Scaling is coverage expansion,
not uniform improvement.

01

Confabulation is structural, not random.

Hallucinations track a predictable inequality in how knowledge is encoded. High-frequency concepts sit above the interference floor and are recalled reliably; low-frequency ones sit below it and are filled in with plausible-looking guesses. The pattern is consistent enough across 38 models and four families that we can fit it with two variables and a sigmoid.

02

Aggregate benchmarks hide topical disparities.

Averaging over the frequency spectrum overstates reliability on the long tail. The same DeepSeek V4 Pro that scores 0.90 on climate change scores 0.43 on school dropout in rural areas. A leaderboard number averaged across these is a polite fiction; the same disparity surfaces across geographies and languages whenever training-data density is uneven.

03

Same parameter count, different family, different quality.

Beyond size and topic, persistent gaps remain between model families at the same parameter count. Llama 3.1 70B outperforms Llama 3.3 70B on this task; MoE models tend to sit below the dense trend line when compared at total parameters. We don't try to attribute these gaps to a single cause: pre-training data, post-training methods, and architecture all vary together across families.

04

Above the floor, curation. Below it, retrieval.

For concepts near the noise floor, modest signal increases (better data, targeted pre-training, in-context priming) push them across the recall threshold. For concepts well below the floor, no realistic amount of scaling gets there: pushing the rural-school-dropout topic to 0.90 quality at today's data density would take roughly 50 trillion parameters. The appropriate response there is retrieval augmentation, which bypasses parametric recall entirely.

§ 04 · Try it

Predict your own query.

Pick a topic and a model size. We pull the topic's OpenAlex coverage in real time, run it through the fitted sigmoid, and report how many of ten requested citations are likely to survive verification.

[Interactive demo: choose an architecture and a total-parameter count on a slider with ticks at 8B, 70B, 405B, and 3T (default ~70B).]
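A minimal sketch of the demo's plumbing, assuming OpenAlex's public works endpoint as the coverage source and reusing the illustrative coefficients from § 02; the hosted demo runs the paper's actual fit.

import math
import requests

ALPHA, BETA, C = 1.65, 0.54, -21.2  # illustrative, from the § 02 sketch

def topic_coverage(topic: str) -> int:
    """Proxy for S: number of OpenAlex works matching the topic."""
    r = requests.get("https://api.openalex.org/works",
                     params={"search": topic, "per-page": 1}, timeout=30)
    r.raise_for_status()
    return r.json()["meta"]["count"]

def expected_verified(topic: str, params: float, requested: int = 10) -> float:
    """Expected number of the requested citations likely to verify."""
    s = topic_coverage(topic)
    q = 1.0 / (1.0 + math.exp(-(ALPHA * math.log10(params)
                                + BETA * math.log10(s) + C)))
    return requested * q

print(expected_verified("school dropout prevention in rural areas", 70e9))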

§ 05 · Leaderboard

How the field stacks up.

Ten of 38 models, ranked by overall citation quality (authenticity × topic relevance) across all 24 topics. The full ranking, including small models and earlier generations, lives on the leaderboard.

  1. DeepSeek V4 Pro · MoE · 49B / 1.6T · quality 0.85
  2. Claude Opus 4.6 · Closed · quality 0.81
  3. GPT-5 · Closed · quality 0.76
  4. GPT-5.4 · Closed · quality 0.76
  5. DeepSeek V3 · MoE · 37B / 671B · quality 0.75
  6. Llama 3.1 405B Hermes · Dense · 405B · quality 0.70
  7. Claude Sonnet 4.6 · Closed · quality 0.69
  8. Grok 3 · Closed · quality 0.69
  9. Qwen3.5 · MoE · 17B / 397B · quality 0.67
  10. Llama 3.1 405B base · Dense · 405B · quality 0.64
§ 06 · Methods

How the dataset was built.

We asked each of 38 language models to produce ten APA-formatted scholarly citations for each of 24 research topics, at temperature zero, with no token cap. Every returned reference was passed through SourceVerify (using the SVRIS comparison standard), which resolves five fields (title, identifier, authors, year, venue) against OpenAlex, Crossref, and the open web. A separate Kimi K2 judge rated topic relevance. The product of authenticity and relevance is the per-cell quality score plotted throughout.

Reference accounting
Requested: 9,120 (38 models × 24 topics × 10 refs)
Produced: 8,913 (−207: cells where the model returned fewer than ten parseable APA citations)
Analyzed: 8,829 (−84: normalized-title duplicates collapsed within each model × topic)
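One plausible reading of the duplicate-collapse step, since the paper does not spell out its exact normalization rule:

import re

def normalize_title(title: str) -> str:
    # Assumed rule: lowercase, strip punctuation, collapse whitespace.
    t = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return re.sub(r"\s+", " ", t).strip()

def collapse_duplicates(titles: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized title within a cell."""
    seen: set[str] = set()
    kept = []
    for t in titles:
        key = normalize_title(t)
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept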
Human validation

Four independent reviewers, blinded to the SourceVerify verdict, manually searched 288 unique references against the open literature. SourceVerify's field-level scoring agrees with human judgment at κ = 0.887, with no false positives across the audit set: every citation SourceVerify flagged as real was confirmed by a reviewer.

Recall
88.9%
Specificity
100%
Cohen's κ
0.887
Ratings
n = 301

The 17 references SourceVerify marked as unverified that reviewers later judged real all carried at least one citation defect: a wrong subtitle, a misattributed coauthor, a first-edition year paired with a later publisher. SourceVerify isn't missing clean citations; it's flagging defective ones for review.
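The audit's confusion matrix is fully determined by the numbers above (17 false negatives, zero false positives, 88.9% recall, and n = 301 give tp = 136 and tn = 148), so the κ figure is straightforward to reproduce with a generic two-rater implementation:

def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Chance-corrected agreement between two binary raters."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)   # both say "verified" by chance
    p_no  = ((fn + tn) / n) * ((fp + tn) / n)   # both say "unverified" by chance
    p_chance = p_yes + p_no
    return (p_obs - p_chance) / (1 - p_chance)

# Counts back-solved from the published recall, specificity, and n.
print(round(cohens_kappa(tp=136, fp=0, fn=17, tn=148), 3))  # 0.887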