How AI Reward Systems Get Fooled: New Research Tackles the Verifier Problem
New research reveals how AI reward systems get fooled by imperfect verifiers and introduces two lightweight correction methods that restore accuracy.
137 articles
AI agents achieve 20x better efficiency with new benchmark AA-AgentPerf, the first standard test for measuring real-world agent performance.
AI models can compress reasoning steps without losing accuracy when trained on sufficient data, with composed reasoning outperforming explicit methods by...
OpenAI o3 saves 37 minutes per week versus Gemini Sheets' 19 minutes in spreadsheet tests, but the speed comes with compliance tradeoffs.
AI models perform dramatically better with more test-time compute, but current benchmarks hide this gap, forcing labs to rethink safety evaluations.
New AI training method EAPO boosts medical answer diversity by 22% while improving clinical accuracy, solving the exploration vs reliability dilemma.
OpenAI researcher Noam Brown argues AI benchmarks are broken because they ignore computational costs, where models spending $30,000 per question beat...
DeepSeek R1 costs 94% less than GPT-4 with similar performance, but engineers warn of serious privacy and security risks in production systems.
NVIDIA-backed startup Span pays homeowners $150 monthly to host GPU servers on their homes, solving AI infrastructure delays at one-fifth the cost.
DeepSeek-R1 mimics human reasoning patterns rather than genuinely thinking through problems, with researchers finding repetitive loops in over 10,000...
AI coding agents significantly outperform traditional search when exploring large code repositories, using strategic navigation instead of keyword...
AI systems are learning to grade themselves using structured rubrics instead of simple scores, revolutionizing how models train and improve.
AI models unlock hidden abilities through test-time reasoning techniques, with one method boosting safety awareness from 14.6% to 40.3% without retraining.
Anthropic files for IPO ahead of OpenAI with $47B revenue forecast, positioning itself as the stronger trillion-dollar candidate in the AI race.
AI agents fail to remember conversations across multiple people and sessions, with new research revealing major gaps in real-world memory capabilities.
OpenAI transformed ChatGPT from a simple research preview into a comprehensive work platform by continuously upgrading capabilities beneath the same...
OpenAI's o1 and o3 reasoning models have eliminated multistep agent chains, allowing complex tasks to be solved in a single call instead of five separate...
NVIDIA's new 550B parameter AI model uses RLVR training to achieve 6x faster inference while cutting agent deployment costs by 30 percent.
GPT-5 merges OpenAI's reasoning and speed into one model, offering four modes from instant responses to deep thinking for all users.
MIT research reveals AI agents excel at answering questions but fail at asking them, though inference-time reasoning boosts performance 10x.
New AI training method MGSD improves vision models' spatial planning abilities by 19.3%, bridging the gap between visual perception and reasoning.
Despite comprehensive surveillance infrastructure tracking our every move, AI systems can't identify who you are without explicit consent like loyalty...
Diffusion language models generate multiple tokens simultaneously instead of one-by-one, delivering thousands of tokens per second while matching ChatGPT.
Researchers built DisasterVL, a 2-billion-parameter AI model that matches GPT-4o's disaster reasoning accuracy while running on drones and edge devices.
Bengali AI models score just 7.72% to 55.42% on new hallucination tests, exposing major reliability gaps for the world's sixth most spoken language.
AI systems now think in parallel instead of step-by-step, cutting response times by 3x while smaller models beat larger ones at 1% of the cost.
Microsoft unveils MAI-Thinking-1, its first reasoning model built from scratch, marking a bold break from OpenAI to compete independently.
New research reveals AI agents fail at planning in hidden ways, with a diagnostic benchmark exposing systematic weaknesses across 12 major models.
ChatGPT's new GPT-5.5 models offer three tiers for different reasoning needs, from $8 instant responses to $200 unlimited deep thinking capabilities.
DeepSeek-R1 and major AI models hit a hard 22-step reasoning limit due to architectural constraints, achieving only 24-42% accuracy on complex tasks.
New SuperARC benchmark reveals leading AI models are failing at true reasoning, with some newer versions performing worse than earlier ones.
New research shows AI models like GPT-5 achieve 96% accuracy on basic graph theory but drop to 82% on graduate proofs, revealing critical reasoning limits.
Researchers built the first AI-powered computer worm that writes its own attack code in real time, bypassing traditional cybersecurity defenses.
IBM's new Abstract Chain-of-Thought technique cuts AI reasoning costs by 11 times using symbols instead of words, solving DeepSeek-R1's expense problem.
NVIDIA's new Vera CPU delivers 50% higher performance per core than previous generations, targeting AI's unexpected bottleneck as agents shift workloads.
A company accidentally spent $500 million on Anthropic's Claude in one month after forgetting to set usage limits, exposing critical gaps in AI cost...
New AI system Ptah generates factually accurate research reports with integrated visuals by using specialized agents and verification at every stage.
Most on-premises AI systems underperform because they lack the multi-layer verification architecture that makes commercial services reliable.
DeepSeek's reasoning model achieves 86% accuracy on complex medical tasks versus 56.6% for its faster variant, proving speed isn't always better.
XCENA raised $135 million to solve AI's hidden bottleneck: memory traffic between chips, not computing power, which could cut infrastructure costs.
OpenAI's o3 model cuts errors by 20% compared to o1, signaling a major shift toward reasoning-focused AI for enterprise applications in 2026.
AI vision models suffer catastrophic accuracy collapse from 88% to 0.53% with large category lists, but new divide-and-conquer technique fixes it.
AI agent discovers test-time scaling algorithms that cut computational costs by 70% while boosting accuracy, outperforming human-designed methods.
OpenAI's o3 model scores 87.5% on reasoning tests while Google's Gemini dominates long-context tasks, reshaping how AI models compete.
AI professionals are adopting RAG, chain-of-thought reasoning, and agentic workflows to overcome ChatGPT's limitations in accuracy and complex reasoning.
Wall Street's cautious AI adoption is costing billions as finance firms struggle to bridge the gap between tech capabilities and industry needs.
Researchers developed Knowledge-to-Verification (K2V), a new framework that teaches AI language models to verify their reasoning process in knowledge-intensive...
While experts argue about when AGI will arrive, AI models are already solving problems once thought impossible, from Olympic-level math to production software...
Researchers developed a method that teaches AI coding agents to reason through complex tasks using only final answers, not detailed reasoning steps, achieving...
As AI models learn to reason longer at test time, a massive infrastructure bottleneck looms: the power grid, cooling systems, and skilled workers needed to...