The One-Line Code Change That Makes AI Math Reasoning 19 Times More Efficient
CMU researchers found a one-line code change that makes AI math reasoning up to 19 times more efficient by reweighting how models learn from hard problems.
144 articles
An 8B AI agent trained with the HOTE framework outperformed models four times its size on deep research benchmarks by learning to improve itself without...
OpenAI's o1 outperformed doctors on clinical reasoning, but experts say AI still can't replace physicians who rely on physical exams and sensory judgment.
Multiple AI models improve complex tasks, but only when the cost of being wrong exceeds the extra expense; here is the escalation framework developers...
AI labs are now spending hours on a single answer using test-time compute, a shift that trades speed for deeper reasoning without costly retraining.
Budget models like DeepSeek-V3 and Llama-3.3-70B can match Claude on 80-90% of dev tasks if you restructure prompts using four key dimensions.
Researchers introduced UXBench, a new benchmark showing that AI models struggle to evaluate user experience from screenshots.
Researchers discovered that AI safety guardrails designed to protect autonomous agents can be weaponized through denial-of-service attacks that exploit...
REFT boosts AI reasoning accuracy by 7.3% through a simple trick: diversifying the first word in reasoning chains while keeping costs flat.
Open source AI advocates push distributed computing to break tech giants' control, but latency and power efficiency create massive technical hurdles.
Hugging Face reproduced DeepSeek-R1's reasoning capabilities using only open-source tools, proving AI breakthroughs can be rebuilt by researchers.
New research reveals how AI reward systems get fooled by imperfect verifiers and introduces two lightweight correction methods that restore accuracy.
AI agents achieve 20x better efficiency with new benchmark AA-AgentPerf, the first standard test for measuring real-world agent performance.
AI models can compress reasoning steps without losing accuracy when trained on sufficient data, with composed reasoning outperforming explicit methods by...
OpenAI o3 saves 37 minutes per week versus Gemini Sheets' 19 minutes in spreadsheet tests, but the speed comes with compliance tradeoffs.
AI models perform dramatically better with more test-time compute, but current benchmarks hide this gap, forcing labs to rethink safety evaluations.
New AI training method EAPO boosts medical answer diversity by 22% while improving clinical accuracy, solving the exploration vs reliability dilemma.
OpenAI researcher Noam Brown argues AI benchmarks are broken because they ignore computational costs, where models spending $30,000 per question beat...
DeepSeek R1 costs 94% less than GPT-4 with similar performance, but engineers warn of serious privacy and security risks in production systems.
NVIDIA-backed startup Span pays homeowners $150 monthly to host GPU servers on their homes, solving AI infrastructure delays at one-fifth the cost.
DeepSeek-R1 mimics human reasoning patterns rather than genuinely thinking through problems, with researchers finding repetitive loops in over 10,000...
AI coding agents significantly outperform traditional search when exploring large code repositories, using strategic navigation instead of keyword...
AI systems are learning to grade themselves using structured rubrics instead of simple scores, revolutionizing how models train and improve.
AI models unlock hidden abilities through test-time reasoning techniques, with one method boosting safety awareness from 14.6% to 40.3% without retraining.
Anthropic files for IPO ahead of OpenAI with $47B revenue forecast, positioning itself as the stronger trillion-dollar candidate in the AI race.
AI agents fail to remember conversations across multiple people and sessions, with new research revealing major gaps in real-world memory capabilities.
OpenAI transformed ChatGPT from a simple research preview into a comprehensive work platform by continuously upgrading capabilities beneath the same...
OpenAI's o1 and o3 reasoning models have eliminated multistep agent chains, allowing complex tasks to be solved in a single call instead of five separate...
NVIDIA's new 550B parameter AI model uses RLVR training to achieve 6x faster inference while cutting agent deployment costs by 30 percent.
GPT-5 merges OpenAI's reasoning and speed into one model, offering four modes from instant responses to deep thinking for all users.
MIT research reveals AI agents excel at answering questions but fail at asking them, though inference-time reasoning boosts performance 10x.
New AI training method MGSD improves vision models' spatial planning abilities by 19.3%, bridging the gap between visual perception and reasoning.
Despite comprehensive surveillance infrastructure tracking our every move, AI systems can't identify who you are without explicit consent like loyalty...
Diffusion language models generate multiple tokens simultaneously instead of one-by-one, delivering thousands of tokens per second while matching ChatGPT.
Researchers built DisasterVL, a 2-billion-parameter AI model that matches GPT-4o's disaster reasoning accuracy while running on drones and edge devices.
Bengali AI models score just 7.72% to 55.42% on new hallucination tests, exposing major reliability gaps for the world's sixth most spoken language.
AI systems now think in parallel instead of step-by-step, cutting response times by 3x while smaller models beat larger ones at 1% of the cost.
Microsoft unveils MAI-Thinking-1, its first reasoning model built from scratch, marking a bold break from OpenAI to compete independently.
New research reveals AI agents fail at planning in hidden ways, with a diagnostic benchmark exposing systematic weaknesses across 12 major models.
ChatGPT's new GPT-5.5 models offer three tiers for different reasoning needs, from $8 instant responses to $200 unlimited deep thinking capabilities.
DeepSeek-R1 and major AI models hit a hard 22-step reasoning limit due to architectural constraints, achieving only 24-42% accuracy on complex tasks.
New SuperARC benchmark reveals leading AI models are failing at true reasoning, with some newer versions performing worse than earlier ones.
New research shows AI models like GPT-5 achieve 96% accuracy on basic graph theory but drop to 82% on graduate proofs, revealing critical reasoning limits.
Researchers built the first AI-powered computer worm that writes its own attack code in real time, bypassing traditional cybersecurity defenses.
IBM's new Abstract Chain-of-Thought technique cuts AI reasoning costs by 11 times using symbols instead of words, solving DeepSeek-R1's expense problem.
NVIDIA's new Vera CPU delivers 50% higher performance per core than previous generations, targeting AI's unexpected bottleneck as agents shift workloads.
A company accidentally spent $500 million on Anthropic's Claude in one month after forgetting to set usage limits, exposing critical gaps in AI cost...
New AI system Ptah generates factually accurate research reports with integrated visuals by using specialized agents and verification at every stage.
Most on-premises AI systems underperform because they lack the multi-layer verification architecture that makes commercial services reliable.
DeepSeek's reasoning model achieves 86% accuracy on complex medical tasks versus 56.6% for its faster variant, proving speed isn't always better.