SourceVerify Research

ARQ Leaderboard

ARQ · Academic Reference Quality. We ask language models to cite 10 scholarly references for each of 24 research topics (240 references in total). Every reference is verified for authenticity by SourceVerify.ai and judged for topic relevance by an independent LLM judge.
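Each reference thus earns two scores: one for authenticity (SourceVerify.ai) and one for topic relevance (the LLM judge). A minimal sketch of one plausible per-model aggregation, assuming (not stated above) per-reference scores in [0, 1] and an overall quality equal to the product of the two per-model averages; the function name and sample scores are hypothetical:

```python
def arq_quality(authenticity_scores, relevance_scores):
    """Aggregate per-reference scores into per-model authenticity,
    relevance, and quality (assumed here to be their product)."""
    n = len(authenticity_scores)
    authenticity = sum(authenticity_scores) / n
    relevance = sum(relevance_scores) / n
    quality = authenticity * relevance
    return authenticity, relevance, quality

# Hypothetical scores for five references from one model:
auth, rel, qual = arq_quality(
    [1.0, 0.9, 0.8, 1.0, 0.9],   # authenticity per reference
    [0.9, 0.8, 0.9, 1.0, 0.8],   # relevance per reference
)
```

The top-ranked entry's three scores are consistent with this product rule within rounding (0.924 × 0.872 ≈ 0.806).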

Score bands: ≥ 0.70 · 0.50–0.69 · 0.30–0.49 · < 0.30
Architecture filters: Dense · CoT · MoE · Search
| # | Model | Org | Released | Arch | Params | Authenticity | Relevance | Quality | % Real | Refs |
|---|-------|-----|----------|------|--------|--------------|-----------|---------|--------|------|
| 🥇 | Claude Opus 4.6 | Anthropic | Mar 2026 | Dense | UNK | 0.924 | 0.872 | 0.806 | 97.0% | 233 |
| 🥈 | DeepSeek V4 Pro | — | — | MoE | 1.6T (49B active) | 0.891 | 0.952 | 0.847 | 95.0% | 240 |
| 🥉 | GPT-5.4 | OpenAI | Mar 2026 | CoT | UNK | 0.884 | 0.858 | 0.761 | 94.6% | 240 |
| 4 | Llama 3.1 405B Hermes 3 | Meta / NousResearch | Feb 2026 | Dense | 405B | 0.879 | 0.788 | 0.703 | 92.2% | 231 |
| 5 | GPT-5 | OpenAI | Jan 2025 | CoT | UNK | 0.876 | 0.868 | 0.762 | 92.1% | 228 |
| 6 | Claude Sonnet 4.6 | Anthropic | Mar 2026 | Dense | UNK | 0.861 | 0.804 | 0.693 | 92.5% | 240 |
| 7 | DeepSeek V3 | DeepSeek | Mar 2026 | MoE | 671B (37B active) | 0.851 | 0.885 | 0.754 | 90.9% | 230 |
| 8 | GPT-5 Mini | OpenAI | Mar 2026 | CoT | UNK | 0.795 | 0.792 | 0.623 | 84.0% | 237 |
| 9 | Llama 3.1 405B | Meta | Feb 2025 | Dense | 405B | 0.779 | 0.821 | 0.638 | 84.4% | 231 |
| 10 | DeepSeek R1 | DeepSeek | Feb 2025 | MoE · CoT | 671B (37B active) | 0.758 | 0.822 | 0.626 | 83.6% | 238 |
| 11 | Grok 3 | — | — | CoT | UNK | 0.749 | 0.938 | 0.690 | 79.2% | 240 |
| 12 | Llama 4 Maverick | Meta | Feb 2025 | MoE | 400B (17B active) | 0.735 | 0.820 | 0.605 | 79.2% | 236 |
| 13 | Llama 3.1 70B | — | — | Dense | 70B | 0.724 | 0.855 | 0.607 | 77.1% | 214 |
| 14 | Qwen3.5 | Alibaba | Mar 2026 | MoE | 397B (17B active) | 0.711 | 0.932 | 0.665 | 71.0% | 221 |
| 15 | Sonar Reasoning Pro* | Perplexity | Mar 2026 | Search · CoT | 70B | 0.690 | 0.753 | 0.530 | 75.0% | 124 |
| 16 | Mistral Medium 3.1 | Mistral | Feb 2025 | Dense | 250B | 0.678 | 0.819 | 0.560 | 74.6% | 232 |
| 17 | Kimi K2 | Moonshot | Feb 2025 | MoE | 1T (32B active) | 0.672 | 0.839 | 0.570 | 70.7% | 239 |
| 18 | Mixtral 8x22B | Mistral | Mar 2026 | MoE | 141B (39B active) | 0.571 | 0.856 | 0.493 | 64.2% | 240 |
| 19 | Llama 3.3 70B | Meta | Feb 2025 | Dense | 70B | 0.570 | 0.812 | 0.446 | 59.3% | 236 |
| 20 | GPT-5 Nano | OpenAI | Jan 2025 | CoT | UNK | 0.548 | 0.824 | 0.432 | 56.8% | 227 |
| 21 | Sonar | Perplexity | Feb 2025 | Search | 70B | 0.522 | 0.781 | 0.389 | 56.5% | 223 |
| 22 | Mistral Large 2 | Mistral | Feb 2025 | Dense | 123B | 0.449 | 0.860 | 0.364 | 48.1% | 237 |
| 23 | Llama 3.1 8B | Meta | Feb 2025 | Dense | 8B | 0.425 | 0.770 | 0.322 | 45.1% | 233 |
| 24 | Mistral Small 3.2 | Mistral | Feb 2025 | Dense | 24B | 0.396 | 0.818 | 0.309 | 42.1% | 233 |
| 25 | Gemma 4 31B | — | — | Dense | 31B | 0.365 | 0.927 | 0.326 | 38.4% | 237 |
| 26 | Qwen3 32B | Alibaba | Feb 2026 | Dense | 32B | 0.362 | 0.819 | 0.285 | 37.7% | 239 |
| 27 | MiniMax M2.5 | MiniMax | Feb 2025 | MoE | 230B (10B active) | 0.354 | 0.894 | 0.309 | 36.4% | 239 |
| 28 | Mixtral 8x7B | — | — | MoE | 47B (13B active) | 0.304 | 0.881 | 0.266 | 32.6% | 221 |
| 29 | Qwen3 14B | Alibaba | Feb 2026 | Dense | 14B | 0.274 | 0.810 | 0.227 | 29.7% | 239 |
| 30 | Gemma 3 27B | Google | Feb 2025 | Dense | 27B | 0.273 | 0.813 | 0.220 | 27.3% | 238 |
| 31 | Llama 4 Scout | Meta | Feb 2025 | MoE | 109B (17B active) | 0.269 | 0.811 | 0.209 | 27.4% | 237 |
| 32 | Qwen3 32B (think) | Alibaba | Feb 2026 | Dense · CoT | 32B | 0.263 | 0.862 | 0.213 | 28.2% | 238 |
| 33 | Qwen3 14B (think) | Alibaba | Feb 2026 | Dense · CoT | 14B | 0.219 | 0.802 | 0.174 | 20.8% | 240 |
| 34 | Qwen3 30B-A3B | — | — | MoE | 30B (3B active) | 0.215 | 0.929 | 0.195 | 22.5% | 240 |
| 35 | GPT-OSS 120B | Groq | Feb 2025 | MoE | 120B (5.1B active) | 0.162 | 0.858 | 0.136 | 17.5% | 240 |
| 36 | Gemma 3 12B | Google | Feb 2025 | Dense | 12B | 0.143 | 0.779 | 0.113 | 13.8% | 239 |
| 37 | Qwen3 8B (think) | Alibaba | Feb 2026 | Dense · CoT | 8B | 0.139 | 0.832 | 0.107 | 10.0% | 239 |
| 38 | Gemma 3 4B | Google | Feb 2025 | Dense | 4B | 0.100 | 0.852 | 0.082 | 9.2% | 240 |
| 39 | Llama 3.2 1B | Meta | Mar 2026 | Dense | 1B | 0.087 | 0.873 | 0.078 | 10.2% | 215 |
| 40 | Qwen3 8B | Alibaba | Feb 2026 | Dense | 8B | 0.067 | 0.887 | 0.054 | 3.4% | 236 |

\* Sonar Reasoning Pro produced 124 of 240 requested refs (15/24 topics); it refused to generate citations when search results lacked complete metadata.

Key Findings

Scaling works: quality climbs from Llama 3.1 8B (0.322) to 70B (0.607) to 405B (0.638). More parameters, fewer hallucinations.

Topic specificity matters: broad topics (e.g. Health) score high, while narrow topics (e.g. insecticide-treated bed nets) score low.

Representation drives quality: Quality scales with log(topic representation) in OpenAlex — R² = 0.44
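An R² like the one above can be reproduced with an ordinary least-squares fit of topic quality against log(representation). A sketch with hypothetical data points (the real ARQ topic counts are not listed here); the helper name is assumed:

```python
import math

def log_fit_r2(representation, quality):
    """R^2 of a least-squares line fit of quality vs. log10(representation)."""
    xs = [math.log10(x) for x in representation]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(quality) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, quality))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(xs, quality))
    ss_tot = sum((y - my) ** 2 for y in quality)
    return 1.0 - ss_res / ss_tot

# Hypothetical (OpenAlex works count, topic quality) pairs, NOT the real topics:
works = [500, 5_000, 50_000, 500_000]
qual = [0.30, 0.45, 0.55, 0.72]
r2 = log_fit_r2(works, qual)
```

Because each works count is 10× the previous one, the fit is nearly perfectly log-linear here; noisier real-world topic data would pull R² down toward the reported 0.44.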

MoE active params matter: Llama 4 Scout (17B active of 109B total) scores 0.209, in line with 14B-class dense models such as Qwen3 14B (0.227). Active parameters, not total, determine quality.

Even the best models fail often: “% Useful” measures how many of the requested references are both real and relevant. Even GPT-5 delivers only 66% usable references, and Llama 405B reaches ~52%, barely better than a coin flip.
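A sketch of the “% Useful” metric as defined above, assuming per-reference boolean flags from the two checks; field names and the sample counts are hypothetical:

```python
def percent_useful(references, requested):
    """Share of the REQUESTED references that are both real and on-topic.
    References the model never produced count against it."""
    useful = sum(1 for r in references if r["real"] and r["relevant"])
    return 100.0 * useful / requested

refs = [
    {"real": True,  "relevant": True},   # verified and on-topic: useful
    {"real": True,  "relevant": False},  # real paper, wrong topic
    {"real": False, "relevant": True},   # hallucinated citation
]
# The model was asked for 4 references but delivered only 3:
pct = percent_useful(refs, requested=4)  # 25.0
```

Dividing by the requested count rather than the delivered count is what penalizes models like Sonar Reasoning Pro, which produced only 124 of 240 requested references.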