GENERAL PUB_DATE: 2026.01.09

ARTIFICIAL ANALYSIS V4.0 SHIFTS MODEL RANKING TO REAL WORK OUTPUTS

Artificial Analysis overhauled its Intelligence Index to weight four pillars (Agents, Coding, Scientific Reasoning, General Knowledge) and introduced GDPval-AA, which scores models on agentic, workflow-grade tasks across 44 occupations and nine industries. Saturated exams (MMLU-Pro, AIME 2025, LiveCodeBench) were dropped and the scale was reset so that top models score ≤50; GPT-5.2 leads overall, Claude Opus 4.5 leads on SWE-Bench Verified, and Gemini 3 Pro is close behind, underscoring that no single model wins everywhere.

[ WHY_IT_MATTERS ]
01.

Model selection should pivot to workflow-grade outputs instead of exam benchmarks.

02.

Leaders vary by task, making multi-model strategies and task-specific evals more practical.

[ WHAT_TO_TEST ]
  • terminal

    Re-score shortlisted models on repo-level coding tasks (e.g., SWE-Bench Verified) and domain deliverables (docs, spreadsheets) with agentic runs, tracking success, latency, and cost.

  • terminal

    Assemble a small GDPval-like eval from tickets/RFCs/migrations and compare models with and without extended reasoning effort.
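The two tests above can be sketched as a single harness: run each shortlisted model over a task list, with and without extended reasoning, and aggregate pass rate, latency, and cost. This is a minimal sketch, not any vendor's API — `run_model` is a hypothetical adapter you would replace with a real call to your provider; here it is a deterministic stub so the harness itself can be exercised.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    success: bool
    latency_s: float
    cost_usd: float

# Hypothetical adapter: swap in a real agentic run against your model provider.
# The stub behavior (extended reasoning solves everything; otherwise only
# "easy" tasks pass) is illustrative only.
def run_model(model: str, task: str, extended_reasoning: bool = False) -> RunResult:
    ok = extended_reasoning or "easy" in task
    return RunResult(
        success=ok,
        latency_s=2.0 if extended_reasoning else 0.5,
        cost_usd=0.02 if extended_reasoning else 0.005,
    )

def score(model: str, tasks: list[str], extended_reasoning: bool) -> dict:
    results = [run_model(model, t, extended_reasoning) for t in tasks]
    n = len(results)
    return {
        "pass_rate": sum(r.success for r in results) / n,
        "avg_latency_s": sum(r.latency_s for r in results) / n,
        "total_cost_usd": sum(r.cost_usd for r in results),
    }

# Tasks drawn from tickets/RFCs/migrations, per the test above (placeholders).
tasks = ["easy: fix typo in README", "hard: repo-level migration", "hard: spreadsheet rollup"]
baseline = score("model-a", tasks, extended_reasoning=False)
extended = score("model-a", tasks, extended_reasoning=True)
print(baseline["pass_rate"], extended["pass_rate"])
```

Comparing the two dicts side by side makes the reasoning-effort trade-off explicit: higher pass rate, but also higher latency and total cost.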

[ BROWNFIELD_PERSPECTIVE ]

Legacy codebase integration strategies...

  • 01.

    Replace MMLU/LiveCodeBench gates with v4-style task suites and reset thresholds to the new ≤50 scale.

  • 02.

    Pilot pillar-based model routing (coding vs reasoning) by swapping APIs behind existing orchestration without touching core code.
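Pillar-based routing can be piloted behind a single completion interface. A minimal sketch, assuming nothing about your orchestration layer: the model names and the keyword classifier are illustrative placeholders (a production router would use a cheap classifier model, not keywords).

```python
# Hypothetical route table: map each pillar to whichever model leads it in
# your own v4-style evals. Names are placeholders, not real endpoints.
ROUTES = {
    "coding": "model-best-at-swe-bench",
    "reasoning": "model-best-at-science",
    "default": "model-best-overall",
}

def classify(prompt: str) -> str:
    """Toy pillar classifier; replace with a cheap classifier model."""
    p = prompt.lower()
    if any(k in p for k in ("diff", "stack trace", "unit test", "refactor")):
        return "coding"
    if any(k in p for k in ("prove", "derive", "hypothesis")):
        return "reasoning"
    return "default"

def route(prompt: str) -> str:
    # The orchestration layer calls this instead of a hard-coded model name,
    # so no core code changes are needed to swap models per pillar.
    return ROUTES[classify(prompt)]

print(route("Refactor this function and add a unit test"))
```

Because the route table is data, re-running your evals after a new release only requires editing `ROUTES`, not the calling code.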

[ GREENFIELD_PERSPECTIVE ]

Fresh architecture paradigms...

  • 01.

    Design eval-first pipelines where agents emit reviewable artifacts (PRs, docs, diagrams) and log traces for regressions.

  • 02.

    Plan for multi-model selection upfront since rankings differ by pillar and task type.
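The eval-first pipeline above can be sketched as a single step wrapper: every agent step emits a reviewable artifact plus a JSON trace keyed by a content hash, so runs can be diffed for regressions. This assumes no particular agent framework; `run_step` and its file layout are hypothetical conventions.

```python
import hashlib
import json
from pathlib import Path
from tempfile import mkdtemp

def run_step(name: str, produce_artifact, out_dir: Path) -> dict:
    """Run one agent step, persisting both the artifact and a trace record."""
    artifact = produce_artifact()  # e.g. PR body, doc, diagram source
    digest = hashlib.sha256(artifact.encode()).hexdigest()[:12]
    (out_dir / f"{name}.artifact.txt").write_text(artifact)
    trace = {"step": name, "artifact_sha": digest, "chars": len(artifact)}
    (out_dir / f"{name}.trace.json").write_text(json.dumps(trace))
    return trace

out = Path(mkdtemp())
# Placeholder artifact producer; a real pipeline would call an agent here.
trace = run_step("draft_pr", lambda: "PR: migrate auth module to v4 suite", out)
print(trace["artifact_sha"])
```

Storing the hash rather than only the text lets a regression check compare traces cheaply across runs and models, flagging any step whose artifact changed.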