# ARQ Leaderboard
**ARQ · Academic Reference Quality.** We ask each language model to cite 10 scholarly references for each of 24 research topics (240 references in total). Every reference is verified for authenticity by SourceVerify.ai and judged for topic relevance by an independent LLM judge.
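In every row of the table below, the third score column equals the product of the first two, which suggests a combined quality score of the form authenticity × relevance. Here is a minimal sketch of how per-reference judgments might be aggregated into such columns — the 0.5 relevance cutoff for "useful", the field names, and the per-delivered-reference denominators are assumptions for illustration, not the benchmark's published method:

```python
from dataclasses import dataclass

@dataclass
class Ref:
    is_real: bool     # passed the authenticity check (SourceVerify.ai in the benchmark)
    relevance: float  # LLM judge's topic-relevance score in [0, 1]

def arq_scores(refs: list[Ref]) -> dict[str, float]:
    """Aggregate per-reference judgments into leaderboard-style columns."""
    n = len(refs) or 1  # guard against a model that delivers zero references
    auth = sum(r.is_real for r in refs) / n            # mean authenticity
    rel = sum(r.relevance for r in refs) / n           # mean judged relevance
    useful = sum(1 for r in refs                       # both real AND relevant
                 if r.is_real and r.relevance >= 0.5) / n
    return {"authenticity": auth,
            "relevance": rel,
            "quality": auth * rel,   # the product relationship visible in the table
            "pct_useful": useful}

# Example: 3 of 4 delivered references are real, with varying relevance.
refs = [Ref(True, 0.9), Ref(True, 0.8), Ref(True, 0.4), Ref(False, 0.7)]
print(arq_scores(refs))
```

Dividing by the number of *delivered* references (rather than the 240 requested) is consistent with rows that deliver fewer than 240 refs yet still post high percentages.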
| # | Model | Vendor | Date | Type | Params | Auth. | Rel. | Quality | % Useful | Refs |
|---|-------|--------|------|------|--------|-------|------|---------|----------|------|
| 🥇 | Claude Opus 4.6 | Anthropic | Mar 2026 | Dense | UNK | 0.924 | 0.872 | 0.806 | 97.0% | 233 |
| 🥈 | DeepSeek V4 Pro | DeepSeek | — | MoE | 1.6T (49B active) | 0.891 | 0.952 | 0.847 | 95.0% | 240 |
| 🥉 | GPT-5.4 | OpenAI | Mar 2026 | CoT | UNK | 0.884 | 0.858 | 0.761 | 94.6% | 240 |
| 4 | Llama 3.1 405B Hermes 3 | Meta / NousResearch | Feb 2026 | Dense | 405B | 0.879 | 0.788 | 0.703 | 92.2% | 231 |
| 5 | GPT-5 | OpenAI | Jan 2025 | CoT | UNK | 0.876 | 0.868 | 0.762 | 92.1% | 228 |
| 6 | Claude Sonnet 4.6 | Anthropic | Mar 2026 | Dense | UNK | 0.861 | 0.804 | 0.693 | 92.5% | 240 |
| 7 | DeepSeek V3 | DeepSeek | Mar 2026 | MoE | 671B (37B active) | 0.851 | 0.885 | 0.754 | 90.9% | 230 |
| 8 | GPT-5 Mini | OpenAI | Mar 2026 | CoT | UNK | 0.795 | 0.792 | 0.623 | 84.0% | 237 |
| 9 | Llama 3.1 405B | Meta | Feb 2025 | Dense | 405B | 0.779 | 0.821 | 0.638 | 84.4% | 231 |
| 10 | DeepSeek R1 | DeepSeek | Feb 2025 | MoE, CoT | 671B (37B active) | 0.758 | 0.822 | 0.626 | 83.6% | 238 |
| 11 | Grok 3 | xAI | — | CoT | UNK | 0.749 | 0.938 | 0.690 | 79.2% | 240 |
| 12 | Llama 4 Maverick | Meta | Feb 2025 | MoE | 400B (17B active) | 0.735 | 0.820 | 0.605 | 79.2% | 236 |
| 13 | Llama 3.1 70B | Meta | — | Dense | 70B | 0.724 | 0.855 | 0.607 | 77.1% | 214 |
| 14 | Qwen3.5 | Alibaba | Mar 2026 | MoE | 397B (17B active) | 0.711 | 0.932 | 0.665 | 71.0% | 221 |
| 15 | Sonar Reasoning Pro † | Perplexity | Mar 2026 | Search, CoT | 70B | 0.690 | 0.753 | 0.530 | 75.0% | 124 |
| 16 | Mistral Medium 3.1 | Mistral | Feb 2025 | Dense | 250B | 0.678 | 0.819 | 0.560 | 74.6% | 232 |
| 17 | Kimi K2 | Moonshot | Feb 2025 | MoE | 1T (32B active) | 0.672 | 0.839 | 0.570 | 70.7% | 239 |
| 18 | Mixtral 8x22B | Mistral | Mar 2026 | MoE | 141B (39B active) | 0.571 | 0.856 | 0.493 | 64.2% | 240 |
| 19 | Llama 3.3 70B | Meta | Feb 2025 | Dense | 70B | 0.570 | 0.812 | 0.446 | 59.3% | 236 |
| 20 | GPT-5 Nano | OpenAI | Jan 2025 | CoT | UNK | 0.548 | 0.824 | 0.432 | 56.8% | 227 |
| 21 | Sonar | Perplexity | Feb 2025 | Search | 70B | 0.522 | 0.781 | 0.389 | 56.5% | 223 |
| 22 | Mistral Large 2 | Mistral | Feb 2025 | Dense | 123B | 0.449 | 0.860 | 0.364 | 48.1% | 237 |
| 23 | Llama 3.1 8B | Meta | Feb 2025 | Dense | 8B | 0.425 | 0.770 | 0.322 | 45.1% | 233 |
| 24 | Mistral Small 3.2 | Mistral | Feb 2025 | Dense | 24B | 0.396 | 0.818 | 0.309 | 42.1% | 233 |
| 25 | Gemma 4 31B | Google | — | Dense | 31B | 0.365 | 0.927 | 0.326 | 38.4% | 237 |
| 26 | Qwen3 32B | Alibaba | Feb 2026 | Dense | 32B | 0.362 | 0.819 | 0.285 | 37.7% | 239 |
| 27 | MiniMax M2.5 | MiniMax | Feb 2025 | MoE | 230B (10B active) | 0.354 | 0.894 | 0.309 | 36.4% | 239 |
| 28 | Mixtral 8x7B | Mistral | — | MoE | 47B (13B active) | 0.304 | 0.881 | 0.266 | 32.6% | 221 |
| 29 | Qwen3 14B | Alibaba | Feb 2026 | Dense | 14B | 0.274 | 0.810 | 0.227 | 29.7% | 239 |
| 30 | Gemma 3 27B | Google | Feb 2025 | Dense | 27B | 0.273 | 0.813 | 0.220 | 27.3% | 238 |
| 31 | Llama 4 Scout | Meta | Feb 2025 | MoE | 109B (17B active) | 0.269 | 0.811 | 0.209 | 27.4% | 237 |
| 32 | Qwen3 32B (think) | Alibaba | Feb 2026 | Dense, CoT | 32B | 0.263 | 0.862 | 0.213 | 28.2% | 238 |
| 33 | Qwen3 14B (think) | Alibaba | Feb 2026 | Dense, CoT | 14B | 0.219 | 0.802 | 0.174 | 20.8% | 240 |
| 34 | Qwen3 30B-A3B | Alibaba | — | MoE | 30B (3B active) | 0.215 | 0.929 | 0.195 | 22.5% | 240 |
| 35 | GPT-OSS 120B | Groq | Feb 2025 | MoE | 120B (5.1B active) | 0.162 | 0.858 | 0.136 | 17.5% | 240 |
| 36 | Gemma 3 12B | Google | Feb 2025 | Dense | 12B | 0.143 | 0.779 | 0.113 | 13.8% | 239 |
| 37 | Qwen3 8B (think) | Alibaba | Feb 2026 | Dense, CoT | 8B | 0.139 | 0.832 | 0.107 | 10.0% | 239 |
| 38 | Gemma 3 4B | Google | Feb 2025 | Dense | 4B | 0.100 | 0.852 | 0.082 | 9.2% | 240 |
| 39 | Llama 3.2 1B | Meta | Mar 2026 | Dense | 1B | 0.087 | 0.873 | 0.078 | 10.2% | 215 |
| 40 | Qwen3 8B | Alibaba | Feb 2026 | Dense | 8B | 0.067 | 0.887 | 0.054 | 3.4% | 236 |

Models are ranked by Auth.; Quality = Auth. × Rel.

† Produced 124 of 240 requested refs (15/24 topics). Refused to generate citations when search results lacked complete metadata.
## Key Findings
- **Scaling works:** Llama 3.1 quality climbs from 0.322 (8B) to 0.607 (70B) to 0.638 (405B) — more parameters, fewer hallucinations.
- **Topic specificity matters:** broad topics ("Health") score high; narrow topics ("Insecticide-treated bed nets") score low.
- **Representation drives quality:** quality scales with log(topic representation) in OpenAlex (R² = 0.44).
- **MoE active params matter:** Llama 4 Scout (17B active / 109B total) scores 0.209, close to the 27B dense Gemma 3 (0.220) — active parameters, not total, determine quality.
- **Even the best models slip:** "% Useful" measures the share of delivered references that are both real and relevant. Claude Opus 4.6 tops out at 97.0% and GPT-5 reaches 92.1%, while mid-table models such as Llama 3.3 70B (59.3%) are barely better than a coin flip.
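The representation finding can be reproduced in miniature: fit quality against log₁₀ of a topic's OpenAlex work count and report R². The work counts and quality scores below are invented illustration data — the real benchmark fits 24 topics and lands at R² = 0.44; this toy set is much cleaner than that.

```python
import math

# Hypothetical (topic, OpenAlex work count, ARQ quality) tuples — invented
# for illustration; only the topic names come from the leaderboard text.
topics = [
    ("Health",                        2_500_000, 0.78),
    ("Climate change",                  800_000, 0.70),
    ("Microfinance",                     90_000, 0.55),
    ("Insecticide-treated bed nets",      4_000, 0.31),
]

xs = [math.log10(n) for _, n, _ in topics]  # log(topic representation)
ys = [q for _, _, q in topics]              # quality score

def ols_r2(xs: list[float], ys: list[float]) -> float:
    """R^2 of a simple least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope
    a = my - b * mx               # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

print(f"R^2 = {ols_r2(xs, ys):.2f}")
```

With real per-topic data the scatter is far larger, which is why the benchmark's fit explains only 44% of the variance.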