FrontierNews.ai

The Token Trap: Why AI Companies Are Realizing They're Paying for the Wrong Thing

The AI industry has been pricing intelligence by the wrong metric for two years, and the bill is finally catching up. Companies like Microsoft, Uber, Meta, and Amazon have all recently pulled back on AI spending after realizing they were paying for tokens, a unit that measures computational exhaust rather than actual value delivered. This fundamental mismeasurement is driving a reckoning across the inference chip market, where custom silicon makers like Etched, Groq, and Cerebras are positioning themselves as alternatives to NVIDIA.

The problem is deceptively simple: a token measures what a machine burned to produce an answer, not whether the answer was right. It's like pricing a car by how much smoke leaves the tailpipe rather than by how far it travels. Microsoft reportedly cancelled a usage-priced coding assistant for roughly 5,000 engineers after per-engineer bills ran $500 to $2,000 per month and exhausted budgets ahead of schedule. Uber burned its entire annual AI-coding budget in about four months and imposed a hard cap of roughly $1,500 per engineer per tool.

"Something has gone completely wrong. I am paying for tokens that create no value," said Alex Karp, CEO of Palantir, on CNBC in early July 2026.

The uncomfortable truth is that these are not naive buyers who will learn better next quarter. They are, almost by definition, the experts in AI deployment. If they mispriced the exhaust and got burned, the problem is not an information gap that education closes; it is structural, baked into the unit itself. The entire market is currently measuring in the wrong unit, and every correction now underway, from spending caps to usage audits, is a scramble toward a metric none of them yet has.

Why Are Inference Chips Becoming the New Battleground?

Inference, the stage where trained models answer user queries in real time, now accounts for the majority of AI compute cost, and its workloads are stable enough to hard-wire into purpose-built silicon. This is why every operator with scale is building a custom chip, and why shipments of AI servers built on custom application-specific integrated circuits (ASICs) are growing far faster than those built on merchant GPUs. The inference market is drawing enormous attention because it is where costs, latency, and energy efficiency matter most at scale.

NVIDIA CEO Jensen Huang announced in March 2026 that the revenue opportunity for AI chips could reach at least $1 trillion through 2027, a significant step up from a prior forecast of $500 billion through 2026. NVIDIA is aggressively defending its position by unveiling new CPUs and AI systems built on technology licensed from Groq, a chip startup from which NVIDIA acquired technology for $17 billion in December 2025.

Etched, a startup building inference-specialized chips, has achieved a $5 billion valuation and already booked $1 billion under contract for its inference systems. The company is positioning itself specifically in the inference layer rather than in model training, where NVIDIA's GPUs have dominated. Etched's reported $1 billion in contracted sales suggests at least some large customers are willing to commit to alternatives before the hardware is widely deployed, which is notable in a market where NVIDIA's software ecosystem (CUDA) has been a powerful competitive advantage.

How Will Custom Inference Chips Affect API Costs for Smaller Companies?

The real question for small and medium enterprises (SMEs) is not whether prices will drop, but when and by how much. Three major investment waves are converging to reshape inference economics over the next two years.

  • Memory Supply Expansion: South Korea, along with Samsung and SK Hynix, has announced plans to invest $1 trillion in expanding memory chip manufacturing and building AI data centers by 2047. SK Hynix holds about 50% of the high-bandwidth memory (HBM) supply and Samsung about 40%, so the two together control roughly 90% of the market. Semiconductor fabrication facilities take at least 2 to 3 years from construction start to full operation, so the added supply will likely reach the market in late 2027 to 2028 at the earliest. A realistic outlook suggests a 20 to 30 percent decrease in GPU prices compared to current levels by around 2028, translating to a roughly 10 to 15 percent reduction in inference costs.
  • Amazon's Deployment Acceleration: Amazon's newly established FDE (Fast Deployment Engineering) team sends engineers to client companies to implement and deploy AI agents within weeks, backed by an investment of 100 billion yen. As of 2025, AWS Bedrock's inference costs had already dropped by approximately 40 to 60 percent year over year, and this trend is expected to continue. Between late 2025 and 2026, they are likely to drop by an additional 20 to 30 percent, which would lower not only the unit price of APIs but also the total cost of implementation.
  • Specialized Chip Competition: NVIDIA almost entirely dominates the current AI inference market, a near-monopoly that keeps prices high. If specialized chip manufacturers like Etched gain a foothold, the cost structure of inference processing itself will change. Startups like Groq, Cerebras, and SambaNova are also raising funds for inference-specialized chips, and Google's TPU, Amazon's Trainium and Inferentia, and Microsoft's Maia are all competing to crack NVIDIA's dominance. If widespread adoption of inference-specialized chips begins between 2027 and 2028, API fees could drop by 50 to 70 percent compared to current levels.

For context, SMEs currently using OpenAI's GPT-4o pay $2.50 per 1 million input tokens and $10 per 1 million output tokens. Running a single chatbot for customer inquiries can cost between 50,000 and 100,000 yen per month. If these three investment waves materialize as expected, an SME currently paying 100,000 yen per month in API fees could save 10,000 to 15,000 yen per month, or 120,000 to 180,000 yen annually, from memory supply improvements alone (the 10 to 15 percent reduction in inference costs cited above). Combined with competition-driven price drops, the savings could be far more dramatic.
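
The arithmetic in the paragraph above can be reproduced with a short Python sketch. The prices and percentage ranges come from the article; the function names and the monthly traffic volume (10 million input tokens, 2 million output tokens) are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: function names and traffic volumes are hypothetical.
GPT4O_INPUT_USD_PER_M = 2.50    # from the article: USD per 1M input tokens
GPT4O_OUTPUT_USD_PER_M = 10.00  # from the article: USD per 1M output tokens


def monthly_api_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token-metered bill in USD at the per-million prices quoted above."""
    return (input_tokens / 1_000_000 * GPT4O_INPUT_USD_PER_M
            + output_tokens / 1_000_000 * GPT4O_OUTPUT_USD_PER_M)


def savings_range_yen(monthly_bill_yen: int, low_pct: int, high_pct: int):
    """Return ((monthly_low, monthly_high), (annual_low, annual_high)) in yen.

    Percentages are whole numbers so the arithmetic stays in exact integers.
    """
    monthly = (monthly_bill_yen * low_pct // 100,
               monthly_bill_yen * high_pct // 100)
    annual = (monthly[0] * 12, monthly[1] * 12)
    return monthly, annual


# Assumed traffic for one month: 10M input tokens, 2M output tokens.
print(monthly_api_cost_usd(10_000_000, 2_000_000))  # 45.0

# Memory-supply scenario: 10 to 15 percent off a 100,000 yen monthly bill.
print(savings_range_yen(100_000, 10, 15))  # ((10000, 15000), (120000, 180000))
```

Passing 50 and 70 as the percentages instead gives the range implied by the specialized-chip scenario, under the same assumption that usage stays constant while prices fall.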

What's the Real Cost of AI Inference?

The physics of AI inference reveals why tokens are such a poor metric. The cost of AI is not the arithmetic; it is moving data across boundaries. Priced against the physical minimum energy per operation, the movement dwarfs the math by 10,000 to 100,000 times. The flashy number everyone quotes, calculations per second, is the cheapest part. The cost and the value live in the movement.

This is why custom inference chips are not mainly a bet on cheaper math; they are a bet on reducing data movement. Because each design merely shifts the binding boundary toward memory, packaging, interconnect, power, or verification, the durable economics accrue below and around the chip, while accounting recognition lags the asset's true competitive life. The decision is the asset; the token is the exhaust.

The honest metric is decisions-per-joule: verified successful task outcomes divided by total energy, including model, memory, networking, cooling, and retry loops. Today AI is measured by tokens-per-watt, which counts fuel burned rather than correct outcomes per unit of energy. Two companies can show identical accounting profits and very different decision-joule fitness; that difference appears in operating metrics before it appears in audited results.
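
As a minimal sketch of how such a metric might be computed, the Python below divides verified successes by summed energy across the components named above. The class, function, and all numbers are hypothetical; the article does not specify a measurement procedure, and in practice the hard part is verifying outcomes and attributing energy, not the division.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyJoules:
    """Energy used in one measurement window, split by component (joules)."""
    model: float
    memory: float
    networking: float
    cooling: float
    retries: float

    def total(self) -> float:
        return (self.model + self.memory + self.networking
                + self.cooling + self.retries)


def decisions_per_joule(verified_successes: int, energy: EnergyJoules) -> float:
    """Verified successful task outcomes divided by total energy."""
    total = energy.total()
    if total <= 0:
        raise ValueError("total energy must be positive")
    return verified_successes / total


# Hypothetical window: both deployments burn the same 1,000 J,
# but only one of them turns most of that energy into verified outcomes.
energy = EnergyJoules(model=400.0, memory=300.0, networking=100.0,
                      cooling=150.0, retries=50.0)
deployment_a = decisions_per_joule(800, energy)
deployment_b = decisions_per_joule(400, energy)
print(deployment_a, deployment_b)
```

Two deployments with identical energy use, and potentially identical token counts, come out a factor of two apart on this measure, which is the gap a token-based meter cannot see.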

Physics wears out an inference chip's competitive economic usefulness in roughly 2 to 3 years, faster in some workloads. Standard accounting spreads its cost over 5 to 7 years. Those two clocks disagree, and the reported profitability of the whole AI-infrastructure complex leans on that one assumption. The honest read of this industry is therefore the cash-flow statement, the spend to keep replacing chips, not the profit line.
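
The disagreement between the two clocks can be made concrete with straight-line depreciation. This is a simplified sketch: the chip cost of 600 (in arbitrary currency units) and the specific lives of 3 and 6 years are hypothetical values picked from within the ranges above, and real schedules may use other depreciation methods or residual values.

```python
def straight_line_annual_expense(cost: float, useful_life_years: float) -> float:
    """Annual depreciation when cost is spread evenly over the useful life."""
    if useful_life_years <= 0:
        raise ValueError("useful life must be positive")
    return cost / useful_life_years


def annual_expense_gap(cost: float, accounting_life_years: float,
                       economic_life_years: float) -> float:
    """Yearly expense implied by the economic life minus what is booked."""
    return (straight_line_annual_expense(cost, economic_life_years)
            - straight_line_annual_expense(cost, accounting_life_years))


chip_cost = 600.0  # hypothetical cost, arbitrary currency units

print(straight_line_annual_expense(chip_cost, 6))  # 100.0 booked per year
print(straight_line_annual_expense(chip_cost, 3))  # 200.0 economic wear per year
print(annual_expense_gap(chip_cost, 6, 3))         # 100.0 not yet recognized
```

Under these assumed numbers, reported expense is half the economic wear in each of the first three years, which is why the replacement spend on the cash-flow statement is the more revealing figure.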

When Will These Changes Actually Reach the Market?

The timeline is critical for SMEs trying to plan their AI budgets. Memory supply improvements are expected to begin showing up in lower GPU prices by late 2027 to 2028. Amazon's FDE-driven cost reductions are already underway and will accelerate through 2026. Specialized inference chip competition, if it materializes as expected, will intensify between 2027 and 2028. This is not just prediction; it is an extension of trends already occurring. OpenAI's API prices have dropped by about 90 percent over the past two years, and GPT-3.5-level performance is now available at less than one-tenth of what GPT-4 cost two years ago.
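
For a sense of scale, a cumulative drop of about 90 percent over two years can be converted into an implied annual rate. The sketch below assumes the decline compounds at a constant rate each year, which is a simplification (real price cuts arrive in steps); the function name is hypothetical.

```python
def annualized_decline(total_decline: float, years: float) -> float:
    """Constant yearly decline that compounds to total_decline over `years`."""
    if not 0.0 <= total_decline < 1.0:
        raise ValueError("total_decline must be in [0, 1)")
    if years <= 0:
        raise ValueError("years must be positive")
    remaining = 1.0 - total_decline          # fraction of the price left
    return 1.0 - remaining ** (1.0 / years)  # per-year fractional drop


rate = annualized_decline(0.90, 2)
print(f"{rate:.1%}")  # 68.4%
```

In other words, the cited trend corresponds to prices falling by roughly two-thirds per year, a pace the projected 50 to 70 percent drop over 2027 to 2028 would not exceed.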

The critical unknown is whether Etched and other inference startups can deliver on their performance and cost promises. Etched's contracted revenue only matters if the chips ship on schedule and meet performance claims. The broader startup activity in inference silicon, including companies like Groq and Cerebras, will determine whether NVIDIA's dominance truly cracks or whether the company successfully defends its position through its own inference offerings.