FrontierNews.ai

Why AI Struggles With Logic That Symbolic Solvers Solve in Microseconds

Frontier language models are failing at a type of logical reasoning that symbolic solvers have mastered for decades. A new benchmark called DeFAb (Defeasible Abduction Benchmark) exposes a stark gap: while rule-based logic solvers achieve 100% accuracy on defeasible reasoning tasks in under 50 microseconds, the best frontier language model tested reaches only 23.5% accuracy once the same problems are presented in varied surface formats. This gap reveals something fundamental about how current AI systems think.

What Is Defeasible Reasoning and Why Does It Matter?

Defeasible reasoning is the ability to construct hypotheses that explain exceptions to general rules while preserving unrelated expectations. It's how humans revise their beliefs when confronted with anomalies. Imagine learning that most proteins have fixed three-dimensional structures, then discovering intrinsically disordered proteins that lack structure yet remain functionally essential. A truly intelligent system would need to override the structure-function default for this specific case without dismantling the entire framework of protein biology.
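The protein example can be made concrete with a toy sketch. The Python below is an illustration only, not DeFAb's formalism: the `Default` class, the `conclusions` function, and the fact names are all hypothetical, and rules are applied in a single pass with no chaining.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Default:
    """A revisable rule: if `premise` holds, conclude `conclusion` unless blocked."""
    premise: str
    conclusion: str


def conclusions(facts, defaults, exceptions):
    """Apply each default once, skipping it when a listed exception fact is present.

    `exceptions` maps a conclusion to the set of facts that defeat it.
    """
    derived = set(facts)
    for d in defaults:
        blocked = exceptions.get(d.conclusion, set()) & facts
        if d.premise in facts and not blocked:
            derived.add(d.conclusion)
    return derived


defaults = [
    Default("protein", "fixed_structure"),
    Default("protein", "encoded_by_gene"),
]

# Hypothesis: intrinsic disorder defeats the structure default and nothing else.
exceptions = {"fixed_structure": {"intrinsically_disordered"}}

typical = conclusions({"protein"}, defaults, exceptions)
disordered = conclusions({"protein", "intrinsically_disordered"}, defaults, exceptions)

print(sorted(typical))
print(sorted(disordered))
```

The ordinary protein keeps both default conclusions, while the disordered protein loses only `fixed_structure` and still counts as `encoded_by_gene`. That is the targeted override the paragraph describes: one exception, with the rest of the framework left intact.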

This type of reasoning matters because it underpins scientific discovery, legal reasoning, and medical diagnosis. Yet current AI models struggle with it. The DeFAb benchmark tested four frontier language models and found rendering-robust Level 2 accuracy ranging from 7.8% to 23.5%, meaning that performance collapsed when the same logical content was presented in different surface formats. For one model, Kimi-K2.5, 80.9% of responses could not be decoded at all.
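The article does not spell out how rendering-robust accuracy is computed. One plausible reading, assumed here, is that an instance counts as solved only if the model answers it correctly under every surface rendering. The sketch below contrasts that with plain per-response accuracy. The function names, rendering labels, and result tuples are made up for illustration.

```python
from collections import defaultdict


def plain_accuracy(results):
    """Fraction of individual (instance, rendering) responses that are correct."""
    flags = [is_correct for _instance, _rendering, is_correct in results]
    return sum(flags) / len(flags) if flags else 0.0


def rendering_robust_accuracy(results):
    """Fraction of instances answered correctly under ALL of their renderings.

    Assumption: this all-renderings rule is one reading of "rendering-robust",
    not a definition taken from the benchmark.
    """
    by_instance = defaultdict(list)
    for instance_id, _rendering, is_correct in results:
        by_instance[instance_id].append(is_correct)
    if not by_instance:
        return 0.0
    solved = sum(all(flags) for flags in by_instance.values())
    return solved / len(by_instance)


# Hypothetical results: four instances, two renderings each.
results = [
    ("q1", "prose", True), ("q1", "formal", True),
    ("q2", "prose", True), ("q2", "formal", False),
    ("q3", "prose", False), ("q3", "formal", False),
    ("q4", "prose", True), ("q4", "formal", True),
]

print(plain_accuracy(results))             # 5 of 8 responses correct
print(rendering_robust_accuracy(results))  # 2 of 4 instances correct everywhere
```

Under this reading, a model that gets `q2` right in one format and wrong in another earns no credit for it, which is why a robust score can sit well below a single-format score.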

Why Are Language Models Losing to Symbolic Solvers?

The research identifies three entangled capability gaps in current foundation models:

  • Grounding Deficit: Models lack an explicit epistemic structure that distinguishes strict knowledge from revisable defaults, and cannot trace predictions back to their supporting evidence.
  • Novelty Deficit: Without knowing which beliefs are revisable, models cannot identify where creative exceptions might apply, limiting their ability to hypothesize genuinely new concepts.
  • Belief Revision Deficit: Even when models do update knowledge, they lack formal machinery to ensure updates respect the principle of minimal change, potentially disturbing unrelated commitments when making targeted fixes.

The symbolic solver, by contrast, uses Answer Set Programming with the same defeasible reasoning algorithm used to generate the benchmark itself. It resolves every instance with perfect accuracy because it operates on explicit logical rules where every derivation, every default, and every piece of evidence is traceable and verifiable.

How Large Is This Benchmark and What Does It Cover?

DeFAb is substantial in scope. The benchmark contains 372,648 instances across 33.75 million materialized rules drawn from 18 knowledge sources, among them OpenCyc, YAGO, Wikidata, ConceptNet, and UMLS. The sources cover biology, law, materials science, and specialized rule-of-engagement scenarios. The benchmark is stratified into three difficulty levels. A harder variant, DeFAb-Hard, contains 235 Level 3 instances on which the strongest model reached only 53.3% accuracy while the symbolic solver maintained 100%.

The researchers also released CONJURE, a variant comprising 560 Lean 4 and Mathlib instances where gold-standard answers are mathematical definitions that the proof-assistant kernel did not previously contain. Under a three-tier novelty specification, a pilot study found zero genuinely novel concepts generated by the model, establishing a falsification target for future work.

How Can This Benchmark Improve AI Training?

The benchmark's polynomial-time verifier serves as an exact reward function for preference optimization techniques, including RLVR (Reinforcement Learning with Verifiable Rewards) and GRPO (Group Relative Policy Optimization). This means researchers can use the benchmark not just to evaluate models, but to train them. Because every correct hypothesis passes formal checks for valid derivation, conservativity, and minimality, the verifier provides a mathematically rigorous signal for which model outputs deserve reward.
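To give a feel for what the three checks mean, here is a toy verifier in Python. It is a sketch under strong simplifying assumptions, not DeFAb's verifier: a hypothesis is a set of hypothetical `(conclusion, blocker)` pairs, defaults fire in a single pass without chaining, and all names are invented for the example.

```python
def derive(facts, defaults, exceptions):
    """Fire each (premise, conclusion) default unless a matching blocker is among the facts."""
    derived = set(facts)
    for premise, conclusion in defaults:
        blocked = any(c == conclusion and b in facts for c, b in exceptions)
        if premise in facts and not blocked:
            derived.add(conclusion)
    return derived


def verify(hypothesis, defaults, anomaly, expectations):
    """Check a hypothesis (set of (conclusion, blocker) pairs) three ways."""

    def resolves(h):
        # Valid: the unwanted conclusion is no longer derived for the anomalous case.
        facts, unwanted = anomaly
        return unwanted not in derive(facts, defaults, h)

    def conservative(h):
        # Conservative: every unrelated expectation is still derived.
        return all(exp in derive(facts, defaults, h) for facts, exp in expectations)

    # In this toy model, adding exceptions can only block more conclusions, so if
    # any proper subset works, some one-element removal works too. Checking
    # one-element removals therefore suffices and keeps the test polynomial.
    minimal = not any(
        resolves(hypothesis - {e}) and conservative(hypothesis - {e})
        for e in hypothesis
    )
    return {
        "valid": resolves(hypothesis),
        "conservative": conservative(hypothesis),
        "minimal": minimal,
    }


defaults = [("protein", "fixed_structure"), ("protein", "encoded_by_gene")]
anomaly = ({"protein", "intrinsically_disordered"}, "fixed_structure")
expectations = [
    ({"protein"}, "fixed_structure"),
    ({"protein", "intrinsically_disordered"}, "encoded_by_gene"),
]

good = {("fixed_structure", "intrinsically_disordered")}
overbroad = {("fixed_structure", "protein")}
padded = good | {("encoded_by_gene", "unrelated_fact")}

for name, h in [("good", good), ("overbroad", overbroad), ("padded", padded)]:
    print(name, verify(h, defaults, anomaly, expectations))
```

The `overbroad` hypothesis explains the anomaly by switching off the structure default for every protein, so it fails conservativity. The `padded` hypothesis works but carries an unnecessary extra exception, so it fails minimality. Only `good` passes all three checks.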

This approach differs fundamentally from traditional reward models, which rely on human judgment or proxy metrics. A verifiable reward function eliminates ambiguity: either a hypothesis satisfies the formal constraints or it does not. This precision could enable more efficient training of models on reasoning tasks where correctness is mathematically definable.
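A pass-or-fail verifier maps naturally onto the kind of reward GRPO consumes. The sketch below is a minimal illustration rather than the authors' training setup. It assumes the verifier returns a dictionary of boolean checks (a hypothetical interface), converts it to a 0/1 reward, and then computes group-relative advantages by normalizing each reward against the mean and standard deviation of its sampling group.

```python
from statistics import mean, pstdev


def binary_reward(checks):
    """1.0 only if every formal check passed, otherwise 0.0."""
    return 1.0 if all(checks.values()) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical verifier outputs for four sampled answers to one instance.
group_checks = [
    {"valid": True, "conservative": True, "minimal": True},
    {"valid": True, "conservative": False, "minimal": True},
    {"valid": False, "conservative": True, "minimal": True},
    {"valid": True, "conservative": True, "minimal": False},
]

rewards = [binary_reward(c) for c in group_checks]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Only the fully verified answer receives a positive advantage; the other three are pushed down. When every sample in a group passes, or every sample fails, all advantages are zero and that group contributes no learning signal, which is one practical reason the difficulty of training instances matters.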

Steps to Leverage Verifiable Benchmarks for AI Development

  • Adopt Formal Verification: Use polynomial-time verifiable benchmarks as reward signals during training rather than relying solely on human feedback or proxy metrics, enabling more precise optimization of reasoning capabilities.
  • Integrate Knowledge Infrastructure: Activate existing public knowledge bases like Wikidata, ConceptNet, and UMLS as formal scaffolds for evaluation and training, treating them as active logical frameworks rather than passive evaluation backdrops.
  • Test Rendering Robustness: Evaluate models across multiple surface presentations of the same logical content to identify brittleness in decoding and reasoning, not just accuracy on a single format.

The DeFAb dataset, pipeline, and evaluation harness are released under the MIT license on Hugging Face, making the infrastructure available to the research community. This represents a shift in how AI evaluation can work: away from rewarding fluent-sounding prose that may quietly break the underlying theory, and toward the disciplined construction of theory revisions that preserve logical consistency.

The gap between 23.5% and 100% is not a minor engineering problem. It reflects a structural mismatch between how current foundation models process information and the explicit belief-revision operations that rigorous reasoning demands. Closing that gap may require rethinking how language models represent and update knowledge itself.