GPT-5 Dominates Graph Theory Benchmark, Leaving Rivals Far Behind on Mathematical Reasoning
GPT-5 has emerged as the clear leader in mathematical reasoning, achieving near-perfect performance on foundational graph theory problems while other frontier models lag substantially behind. Researchers at academic institutions introduced GTBench, a curriculum-grounded benchmark designed to evaluate how well large language models (LLMs), which are AI systems trained on vast amounts of text data, can serve as mathematical research assistants. The findings reveal a pronounced performance hierarchy that has significant implications for how these tools are trusted in education and scientific research.
How Do AI Models Compare on Mathematical Problem-Solving?
The study evaluated five frontier models across three difficulty levels of problems in graph theory, the branch of mathematics that studies networks and connections. The researchers tested GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 on problems ranging from basic definitions to graduate-level proof construction.
The performance differences were striking. GPT-5 came close to perfect accuracy on introductory undergraduate problems, scoring 95.8% in zero-shot evaluation, meaning the model solved problems without any examples or step-by-step guidance. More impressively, GPT-5 held up on graduate-level proof construction, reaching 82% accuracy on the most difficult problems. In contrast, all other models degraded substantially as difficulty increased, with Llama 3.3 70B scoring 0% under human evaluation on graduate-level problems when given no examples.
What Makes Graph Theory Such a Rigorous Test for AI?
Graph theory occupies a unique position in mathematics education. It begins with simple, intuitive definitions but progresses to deeply challenging proofs that require structural intuition developed through sustained practice. This makes it an ideal stress-test for AI models because it demands more than pattern-matching; it requires genuine reasoning about relational structures, combinatorial properties, and formal proof construction.
The benchmark organized problems into three groups of increasing complexity. Group 1 assessed foundational knowledge of standard graph families, degree sequences, and basic counting arguments. Group 2 required application of graph algorithms like breadth-first search and depth-first search, as well as reasoning about connectivity and traversal. Group 3 moved beyond algorithmic reasoning to require deeper mathematical justification and proof construction involving advanced concepts.
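To make the first two groups concrete, here is a minimal Python sketch of the kind of task each one covers: a degree sequence and an edge count via the handshake lemma (Group 1 style), and breadth-first search distances (Group 2 style). The graph and the function names are hypothetical illustrations, not problems or code taken from GTBench.

```python
from collections import deque

# Hypothetical example graph (not from GTBench): a 4-cycle a-b-c-d
# with one pendant vertex e attached to d, stored as adjacency lists.
GRAPH = {
    "a": ["b", "d"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["a", "c", "e"],
    "e": ["d"],
}


def degree_sequence(graph):
    """Return the vertex degrees sorted in non-increasing order."""
    return sorted((len(neighbors) for neighbors in graph.values()), reverse=True)


def edge_count(graph):
    """Count edges using the handshake lemma: the degrees sum to 2|E|."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


def bfs_distances(graph, source):
    """Breadth-first search: hop distance from source to every reachable vertex."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in distance:  # first visit gives the shortest hop count
                distance[v] = distance[u] + 1
                queue.append(v)
    return distance


print(degree_sequence(GRAPH))     # [3, 2, 2, 2, 1]
print(edge_count(GRAPH))          # 5
print(bfs_distances(GRAPH, "a"))  # {'a': 0, 'b': 1, 'd': 1, 'c': 2, 'e': 2}
```

Tasks like these have a single checkable answer, which is why "correct algorithm, wrong execution" is a meaningful failure category: a model can name breadth-first search correctly and still report a wrong distance.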
How Does GTBench Evaluate AI Models in Mathematics?
- Curriculum-Grounded Design: GTBench organizes problems according to the standard progression of graph theory instruction in undergraduate and graduate programs, drawing on verified academic sources including textbooks and university course materials, ensuring the benchmark reflects real educational pathways.
- Multiple Evaluation Methods: The researchers used exact-match evaluation for basic problems, LLM-as-judge evaluation for intermediate problems, and a hybrid approach combining human expert judgment with automated evaluation for graduate-level proofs to capture nuanced reasoning.
- Failure Mode Analysis: The study identified specific error patterns. "Correct algorithm, wrong execution" errors dominated introductory and intermediate problems, while graduate-level problems revealed incomplete reasoning and systematic disagreement between human evaluators and automated judges.
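The exact-match evaluation mentioned in the list above can be sketched roughly as follows. This is an illustration only: the article does not describe how GTBench normalizes answers, so the `normalize` rule (trim, lowercase, collapse whitespace) and both function names are assumptions.

```python
def normalize(answer: str) -> str:
    """Trim, lowercase, and collapse whitespace (an assumed normalization rule)."""
    return " ".join(answer.strip().lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up answers for illustration; two of the three match their reference.
model_answers = ["5", "  Bipartite ", "6"]
gold_answers = ["5", "bipartite", "4"]
score = exact_match_accuracy(model_answers, gold_answers)
```

Exact matching works for short answers such as a number or a graph property, but it cannot grade a proof, which is presumably why the harder groups rely on LLM and human judges instead.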
The research revealed important limitations in how AI models handle mathematical reasoning. Even on Group 1 problems, which test basic definitions and properties, models other than GPT-5 showed significant gaps. The failure analysis showed that models often understood the correct approach but made execution errors, suggesting they grasp conceptual frameworks but struggle with precise implementation.
A particularly concerning finding emerged at the graduate level. The study noted systematic disagreement between human evaluators and automated judges, particularly on verbose or near-complete proofs. This disagreement ranged from moderate to substantial, indicating that automated evaluation alone may not capture the full quality of mathematical reasoning, especially when proofs are lengthy or nearly correct.
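One common way to quantify how well two graders agree is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The article does not say which statistic the study used, so the sketch below is an assumption; the labels "moderate" and "substantial" are the conventional names for kappa values of roughly 0.41 to 0.60 and 0.61 to 0.80. The verdict lists are made-up data, not results from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on eight proofs (1 = accepted, 0 = rejected).
human_verdicts = [1, 1, 0, 0, 1, 0, 1, 0]
judge_verdicts = [1, 1, 1, 0, 1, 0, 1, 1]
kappa = cohens_kappa(human_verdicts, judge_verdicts)  # 0.5, in the "moderate" band
```

In this made-up example the automated judge accepts two proofs the human rejects, the pattern the study associates with verbose or near-complete proofs.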
The implications extend beyond academic curiosity. Researchers and students increasingly rely on LLMs as reasoning assistants in their daily work. The gap between assumed and actual capability in a domain as foundational as graph theory carries real consequences for how these tools are trusted and adopted in educational and research settings. GTBench provides the first systematic evaluation framework designed specifically to test whether LLMs are trustworthy enough to serve as mathematical research assistants, tools that scientists or students might rely on to understand, verify, or extend their knowledge of technical domains.
The stark performance differences suggest that not all AI models are equally suitable for mathematical reasoning tasks. Organizations considering deploying these tools in educational or research contexts should carefully evaluate which models meet their reliability requirements. GPT-5's strong performance across all difficulty levels indicates it may be more suitable for serious mathematical work, while other models may be better suited for less demanding applications or require human oversight when used for mathematical reasoning.