# ARQ Leaderboard
**ARQ · Academic Reference Quality.** We ask each language model to cite 10 scholarly references for each of 24 research topics (240 references in total). Every reference is verified for authenticity by SourceVerify.ai and judged for topic relevance by an independent LLM judge.
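In every row of the table below, the third score column equals the product of the first two, which suggests a combined quality score of the form authenticity × relevance. Here is a minimal sketch of how per-reference judgments might be aggregated into such columns — the 0.5 relevance cutoff for "useful", the field names, and the per-delivered-reference denominators are assumptions for illustration, not the benchmark's published method:

```python
from dataclasses import dataclass

@dataclass
class Ref:
    is_real: bool     # passed the authenticity check (SourceVerify.ai in the benchmark)
    relevance: float  # LLM judge's topic-relevance score in [0, 1]

def arq_scores(refs: list[Ref]) -> dict[str, float]:
    """Aggregate per-reference judgments into leaderboard-style columns."""
    n = len(refs) or 1  # guard against a model that delivers zero references
    auth = sum(r.is_real for r in refs) / n            # mean authenticity
    rel = sum(r.relevance for r in refs) / n           # mean judged relevance
    useful = sum(1 for r in refs                       # both real AND relevant
                 if r.is_real and r.relevance >= 0.5) / n
    return {"authenticity": auth,
            "relevance": rel,
            "quality": auth * rel,   # the product relationship visible in the table
            "pct_useful": useful}

# Example: 3 of 4 delivered references are real, with varying relevance.
refs = [Ref(True, 0.9), Ref(True, 0.8), Ref(True, 0.4), Ref(False, 0.7)]
print(arq_scores(refs))
```

Dividing by the number of *delivered* references (rather than the 240 requested) is consistent with rows that deliver fewer than 240 refs yet still post high percentages.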
| # | Model | Vendor | Date | Type | Params | Auth. | Rel. | Quality | % Useful | Refs |
|---|-------|--------|------|------|--------|-------|------|---------|----------|------|
| 🥇 | Claude Opus 4.6 | Anthropic | Mar 2026 | Dense | UNK | 0.924 | 0.872 | 0.806 | 97.0% | 233 |
| 🥈 | DeepSeek V4 Pro | DeepSeek | — | MoE | 1.6T (49B active) | 0.891 | 0.952 | 0.847 | 95.0% | 240 |
| 🥉 | GPT-5.4 | OpenAI | Mar 2026 | CoT | UNK | 0.884 | 0.858 | 0.761 | 94.6% | 240 |
| 4 | Llama 3.1 405B Hermes 3 | Meta / NousResearch | Feb 2026 | Dense | 405B | 0.879 | 0.788 | 0.703 | 92.2% | 231 |
| 5 | GPT-5 | OpenAI | Jan 2025 | CoT | UNK | 0.876 | 0.868 | 0.762 | 92.1% | 228 |
| 6 | Claude Sonnet 4.6 | Anthropic | Mar 2026 | Dense | UNK | 0.861 | 0.804 | 0.693 | 92.5% | 240 |
| 7 | DeepSeek V3 | DeepSeek | Mar 2026 | MoE | 671B (37B active) | 0.851 | 0.885 | 0.754 | 90.9% | 230 |
| 8 | GPT-5 Mini | OpenAI | Mar 2026 | CoT | UNK | 0.795 | 0.792 | 0.623 | 84.0% | 237 |
| 9 | Llama 3.1 405B | Meta | Feb 2025 | Dense | 405B | 0.779 | 0.821 | 0.638 | 84.4% | 231 |
| 10 | DeepSeek R1 | DeepSeek | Feb 2025 | MoE, CoT | 671B (37B active) | 0.758 | 0.822 | 0.626 | 83.6% | 238 |
| 11 | Grok 3 | xAI | — | CoT | UNK | 0.749 | 0.938 | 0.690 | 79.2% | 240 |
| 12 | Llama 4 Maverick | Meta | Feb 2025 | MoE | 400B (17B active) | 0.735 | 0.820 | 0.605 | 79.2% | 236 |
| 13 | Llama 3.1 70B | Meta | — | Dense | 70B | 0.724 | 0.855 | 0.607 | 77.1% | 214 |
| 14 | Qwen3.5 | Alibaba | Mar 2026 | MoE | 397B (17B active) | 0.711 | 0.932 | 0.665 | 71.0% | 221 |
| 15 | Sonar Reasoning Pro † | Perplexity | Mar 2026 | Search, CoT | 70B | 0.690 | 0.753 | 0.530 | 75.0% | 124 |
| 16 | Mistral Medium 3.1 | Mistral | Feb 2025 | Dense | 250B | 0.678 | 0.819 | 0.560 | 74.6% | 232 |
| 17 | Kimi K2 | Moonshot | Feb 2025 | MoE | 1T (32B active) | 0.672 | 0.839 | 0.570 | 70.7% | 239 |
| 18 | Mixtral 8x22B | Mistral | Mar 2026 | MoE | 141B (39B active) | 0.571 | 0.856 | 0.493 | 64.2% | 240 |
| 19 | Llama 3.3 70B | Meta | Feb 2025 | Dense | 70B | 0.570 | 0.812 | 0.446 | 59.3% | 236 |
| 20 | GPT-5 Nano | OpenAI | Jan 2025 | CoT | UNK | 0.548 | 0.824 | 0.432 | 56.8% | 227 |
| 21 | Sonar | Perplexity | Feb 2025 | Search | 70B | 0.522 | 0.781 | 0.389 | 56.5% | 223 |
| 22 | Mistral Large 2 | Mistral | Feb 2025 | Dense | 123B | 0.449 | 0.860 | 0.364 | 48.1% | 237 |
| 23 | Llama 3.1 8B | Meta | Feb 2025 | Dense | 8B | 0.425 | 0.770 | 0.322 | 45.1% | 233 |
| 24 | Mistral Small 3.2 | Mistral | Feb 2025 | Dense | 24B | 0.396 | 0.818 | 0.309 | 42.1% | 233 |
| 25 | Gemma 4 31B | Google | — | Dense | 31B | 0.365 | 0.927 | 0.326 | 38.4% | 237 |
| 26 | Qwen3 32B | Alibaba | Feb 2026 | Dense | 32B | 0.362 | 0.819 | 0.285 | 37.7% | 239 |
| 27 | MiniMax M2.5 | MiniMax | Feb 2025 | MoE | 230B (10B active) | 0.354 | 0.894 | 0.309 | 36.4% | 239 |
| 28 | Mixtral 8x7B | Mistral | — | MoE | 47B (13B active) | 0.304 | 0.881 | 0.266 | 32.6% | 221 |
| 29 | Qwen3 14B | Alibaba | Feb 2026 | Dense | 14B | 0.274 | 0.810 | 0.227 | 29.7% | 239 |
| 30 | Gemma 3 27B | Google | Feb 2025 | Dense | 27B | 0.273 | 0.813 | 0.220 | 27.3% | 238 |
| 31 | Llama 4 Scout | Meta | Feb 2025 | MoE | 109B (17B active) | 0.269 | 0.811 | 0.209 | 27.4% | 237 |
| 32 | Qwen3 32B (think) | Alibaba | Feb 2026 | Dense, CoT | 32B | 0.263 | 0.862 | 0.213 | 28.2% | 238 |
| 33 | Qwen3 14B (think) | Alibaba | Feb 2026 | Dense, CoT | 14B | 0.219 | 0.802 | 0.174 | 20.8% | 240 |
| 34 | Qwen3 30B-A3B | Alibaba | — | MoE | 30B (3B active) | 0.215 | 0.929 | 0.195 | 22.5% | 240 |
| 35 | GPT-OSS 120B | Groq | Feb 2025 | MoE | 120B (5.1B active) | 0.162 | 0.858 | 0.136 | 17.5% | 240 |
| 36 | Gemma 3 12B | Google | Feb 2025 | Dense | 12B | 0.143 | 0.779 | 0.113 | 13.8% | 239 |
| 37 | Qwen3 8B (think) | Alibaba | Feb 2026 | Dense, CoT | 8B | 0.139 | 0.832 | 0.107 | 10.0% | 239 |
| 38 | Gemma 3 4B | Google | Feb 2025 | Dense | 4B | 0.100 | 0.852 | 0.082 | 9.2% | 240 |
| 39 | Llama 3.2 1B | Meta | Mar 2026 | Dense | 1B | 0.087 | 0.873 | 0.078 | 10.2% | 215 |
| 40 | Qwen3 8B | Alibaba | Feb 2026 | Dense | 8B | 0.067 | 0.887 | 0.054 | 3.4% | 236 |

Models are ranked by Auth.; Quality = Auth. × Rel.

† Produced 124 of 240 requested refs (15/24 topics). Refused to generate citations when search results lacked complete metadata.
## Key Findings
- **Scaling works:** Llama 3.1 quality climbs from 0.322 (8B) to 0.607 (70B) to 0.638 (405B) — more parameters, fewer hallucinations.
- **Topic specificity matters:** broad topics ("Health") score high; narrow topics ("Insecticide-treated bed nets") score low.
- **Representation drives quality:** quality scales with log(topic representation) in OpenAlex (R² = 0.44).
- **MoE active params matter:** Llama 4 Scout (17B active / 109B total) scores 0.209, close to the 27B dense Gemma 3 (0.220) — active parameters, not total, determine quality.
- **Even the best models slip:** "% Useful" measures the share of delivered references that are both real and relevant. Claude Opus 4.6 tops out at 97.0% and GPT-5 reaches 92.1%, while mid-table models such as Llama 3.3 70B (59.3%) are barely better than a coin flip.
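The representation finding can be reproduced in miniature: fit quality against log₁₀ of a topic's OpenAlex work count and report R². The work counts and quality scores below are invented illustration data — the real benchmark fits 24 topics and lands at R² = 0.44; this toy set is much cleaner than that.

```python
import math

# Hypothetical (topic, OpenAlex work count, ARQ quality) tuples — invented
# for illustration; only the topic names come from the leaderboard text.
topics = [
    ("Health",                        2_500_000, 0.78),
    ("Climate change",                  800_000, 0.70),
    ("Microfinance",                     90_000, 0.55),
    ("Insecticide-treated bed nets",      4_000, 0.31),
]

xs = [math.log10(n) for _, n, _ in topics]  # log(topic representation)
ys = [q for _, _, q in topics]              # quality score

def ols_r2(xs: list[float], ys: list[float]) -> float:
    """R^2 of a simple least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = my - b * mx               # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

print(f"R^2 = {ols_r2(xs, ys):.2f}")
```

With real per-topic data the scatter is far larger, which is why the benchmark's fit explains only 44% of the variance.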