The $2 Billion Bet Against Large Language Models: Why AI's Next Frontier Is Learning Physics, Not Text
The artificial intelligence field is splitting into two competing visions of how machines should learn. On one side are large language models (LLMs), which predict the next word in a sequence. On the other are "world models," AI systems designed to learn how physical reality works. Two newly published research papers from leading AI labs show how far current world models still fall short, even as AMI Labs and World Labs have together raised over $2 billion on the bet that this is where the future lies.
What Exactly Is a World Model, and Why Does It Matter?
A world model learns an internal representation of how an environment behaves, allowing a machine to predict the consequences of its actions before taking them. Unlike language models that work with text, world models are designed to understand spatial relationships, physics, and cause-and-effect in the real world. This distinction is not merely academic. For robotics, autonomous vehicles, and simulation, the ability to reason about physical consequences could unlock capabilities that text-based AI cannot achieve.
Yann LeCun, the Turing Award winner who left Meta in late 2025 after 12 years as its chief AI scientist, is leading one of the most aggressive bets on this direction. His Paris-based startup, AMI Labs, raised $1.03 billion at a $3.5 billion pre-money valuation in March 2026, reported as the largest seed round in European startup history. As of early June 2026, the company has no commercial product, a team of roughly a dozen people, and a research agenda measured in years.
"The world is unpredictable. If you try to build a generative model that predicts every detail of the future, it will fail. JEPA is not generative AI. It is a system that learns to represent videos really well," LeCun explained in an exclusive interview with MIT Technology Review.
LeCun, who serves as executive chairman at AMI Labs, is unambiguous in his argument against current LLMs. He stated that "people have had this illusion, or delusion, that it is a matter of time until we can scale them up to having human-level intelligence, and that is simply false." His position is that language models "are limited to the discrete world of text," cannot truly reason or plan, and "can't predict the consequences of their actions."
How Do World Models Actually Work Differently From Language Models?
The technical foundation of AMI Labs' approach is the Joint Embedding Predictive Architecture, or JEPA, a framework LeCun proposed in 2022. The key innovation is counterintuitive: instead of trying to predict every pixel or word token, JEPA encodes inputs into an abstract representation and predicts in that latent space. This means the system discards what it cannot predict and keeps only high-level structure. By predicting representations rather than raw outputs, a JEPA system can theoretically learn the underlying rules of an environment "like a baby learning about gravity," as LeCun put it.
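The difference between predicting in latent space and predicting raw outputs can be illustrated with a toy sketch. This is not the JEPA architecture itself, which uses learned neural encoders and predictors; the observation layout, the hand-written encode and predict_latent functions, and the fixed velocity are all assumptions made for illustration.

```python
import random

def encode(observation):
    """Toy encoder: keep the predictable part (position), drop the noisy detail."""
    position, _texture_noise = observation
    return position

def predict_latent(latent, velocity=1.0):
    """Toy predictor: advance the abstract state by one time step."""
    return latent + velocity

def latent_loss(obs_now, obs_next):
    """Squared error between the predicted and the actual representation."""
    predicted = predict_latent(encode(obs_now))
    target = encode(obs_next)
    return (predicted - target) ** 2

def raw_loss(obs_now, obs_next):
    """Squared error when the unpredictable detail must be predicted too."""
    position, texture = obs_now
    predicted = (position + 1.0, texture)  # best guess: the texture stays the same
    return sum((p - t) ** 2 for p, t in zip(predicted, obs_next))

rng = random.Random(0)
obs_now = (0.0, rng.gauss(0, 1))   # (position, random texture detail)
obs_next = (1.0, rng.gauss(0, 1))  # position moved by 1, texture re-drawn

print(latent_loss(obs_now, obs_next))    # 0.0
print(raw_loss(obs_now, obs_next) > 0)   # True
```

In this toy, the latent-space objective is satisfied exactly because the encoder has discarded the random detail, while the raw-output objective is penalized for noise that no model could forecast.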
Fei-Fei Li's World Labs, which emerged from stealth in 2024 and raised $1 billion in February 2026, takes a complementary but distinct approach. The company published an essay on June 3, 2026, arguing that the term "world model" now covers three fundamentally different kinds of systems, each serving different purposes.
What Are the Three Types of World Models?
- Renderers: Systems that output pixels meant for human eyes, judged primarily on visual fidelity. Video-generation models and Google's interactive Genie 3 fall into this category, but they carry no explicit understanding of three-dimensional structure.
- Simulators: Systems that output state as a geometrically and physically faithful representation that programs can compute on. Their contract is structural: geometry that holds under inspection and physics that respects Newton's laws.
- Planners: Systems that output actions, answering what an agent should do next. Vision-language-action systems and "World Action Models" are attempts at this category.
World Labs argues that the simulator is "the linchpin" because the same underlying knowledge of geometry and physics can be projected into pixels for a renderer and into action predictions for a planner. A model that masters simulation can serve both purposes, while a model that only renders or only plans cannot.
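A minimal sketch can show why a simulator might serve both of the other roles. Everything here is a hypothetical illustration rather than World Labs' design: the one-dimensional State, the Euler-step simulate function, the text-based render, and the brute-force plan are assumed names chosen to keep the example small.

```python
from dataclasses import dataclass

@dataclass
class State:
    x: float  # object position in metres
    v: float  # object velocity in metres per second

def simulate(state, force, dt=0.1, mass=1.0):
    """Simulator: advance the physical state by one explicit Euler step."""
    acceleration = force / mass
    return State(x=state.x + state.v * dt, v=state.v + acceleration * dt)

def render(state, width=10):
    """Renderer: project the state into a 1D 'image' meant for human eyes."""
    idx = max(0, min(width - 1, round(state.x)))
    return "".join("#" if i == idx else "." for i in range(width))

def plan(state, goal_x, candidates=(-1.0, 0.0, 1.0), horizon=5):
    """Planner: pick the force whose simulated rollout ends closest to the goal."""
    def final_distance(force):
        s = state
        for _ in range(horizon):
            s = simulate(s, force)
        return abs(s.x - goal_x)
    return min(candidates, key=final_distance)

print(render(State(x=3.0, v=0.0)))             # ...#......
print(plan(State(x=0.0, v=0.0), goal_x=5.0))   # 1.0
```

Both render and plan consume the same State, which is the sense in which the simulator sits underneath the other two roles in this sketch.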
What Do Recent Benchmarks Reveal About Current World Models?
Two preprints posted in late May 2026 provide the first rigorous public test of whether the science behind these billion-dollar bets is tracking toward its goals. The first paper, "When Does LeJEPA Learn a World Model?," was submitted on May 25 by researchers including David Klindt of Cold Spring Harbor Laboratory and LeCun. It proves that the LeJEPA architecture can achieve "linear identifiability," meaning it can recover the true hidden variables behind raw observations, such as an object's position and velocity, up to a linear transformation.
However, the guarantee comes with significant conditions. It holds only when latent variables follow a Gaussian distribution and evolve under stationary, additive-noise dynamics, and when training data approximates broad, roughly uniform exploration of the state space. The practical implication is blunt: goal-directed training data, the kind most robotic pipelines rely on, can quietly push observations into a regime where the guarantee no longer applies.
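What "up to a linear transformation" means in practice can be checked with a simple linear-fit test. The sketch below is a one-dimensional illustration, not the paper's method or its experiments: the two hand-made codes and the r_squared helper are assumptions introduced here.

```python
def r_squared(xs, ys):
    """R^2 of the best straight-line fit between xs and ys (1D least squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Hypothetical ground-truth hidden variable: positions from -2.0 to 2.0.
true_position = [i / 10 for i in range(-20, 21)]

# A code that is a linear function of the truth: linearly identifiable.
linear_code = [3.0 * z - 0.5 for z in true_position]

# A code that folds left and right together: no linear readout recovers position.
folded_code = [z * z for z in true_position]

print(round(r_squared(linear_code, true_position), 3))  # 1.0
print(round(r_squared(folded_code, true_position), 3))  # 0.0
```

A representation that passes this kind of test lets a simple linear readout recover the hidden variable; one that fails may still predict well while hiding the quantity a planner needs.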
The second paper, the stable-worldmodel benchmark posted May 20 by a team led by Lucas Maes of Mila and Université de Montréal, delivers a sobering verdict on current systems: they remain brittle. On a standard task requiring pushing an object into a target position, one tested model succeeded about 50 percent of the time under clean conditions. Success fell to roughly 12 percent when the agent's color changed and to about 6 percent when the background color shifted, with added visual distractors producing a collapse across every baseline.
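The size of those drops is easier to judge in relative terms. The short calculation below uses only the approximate success rates quoted above; the relative_drop helper is a name introduced here for illustration.

```python
def relative_drop(clean_rate, shifted_rate):
    """Fraction of the clean-condition success rate lost under a visual shift."""
    return (clean_rate - shifted_rate) / clean_rate

clean = 0.50  # approximate success rate under clean conditions
shifted = {
    "agent color changed": 0.12,
    "background color changed": 0.06,
}

for name, rate in shifted.items():
    print(f"{name}: {rate:.0%} success, {relative_drop(clean, rate):.0%} relative drop")
# agent color changed: 12% success, 76% relative drop
# background color changed: 6% success, 88% relative drop
```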
A subtler finding cuts deeper: prediction accuracy proved a poor proxy for planning success. A model can forecast the next frame correctly while having latched onto a background color rather than the task's geometry. Low prediction error alone therefore cannot certify that a model has learned the structure a planner needs.
How Is World Labs Addressing the Simulator Problem?
World Labs' first product, Marble, takes a different technical approach. The system takes a multimodal prompt (text, images, video, or coarse 3D layouts) and produces an explorable 3D environment in two distinct representations simultaneously. Its visual output uses 3D Gaussian splatting, or 3DGS, which models a scene as millions of semitransparent particles, each carrying position, scale, color, and opacity. This represents a sharp break from the polygon-mesh pipeline that has dominated 3D graphics for decades.
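The particle representation can be sketched as a small data structure plus the front-to-back alpha compositing commonly used to turn splats into a pixel color. This is a simplified illustration, not Marble's implementation: it assumes every splat on the ray fully covers the pixel, whereas a real 3DGS renderer also weights each splat by its projected Gaussian falloff. The Splat class and composite function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Splat:
    position: tuple  # (x, y, z); z is the depth along the viewing ray
    scale: tuple     # per-axis extent (unused in this simplified sketch)
    color: tuple     # RGB, each channel in [0, 1]
    opacity: float   # in [0, 1]

def composite(splats_on_ray):
    """Blend splats front to back; return (pixel RGB, remaining transmittance)."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light still passing through
    for s in sorted(splats_on_ray, key=lambda s: s.position[2]):
        weight = s.opacity * transmittance
        for channel in range(3):
            pixel[channel] += weight * s.color[channel]
        transmittance *= 1.0 - s.opacity
    return tuple(pixel), transmittance

half_red = Splat((0, 0, 1.0), (1, 1, 1), (1.0, 0.0, 0.0), 0.5)
solid_blue = Splat((0, 0, 2.0), (1, 1, 1), (0.0, 0.0, 1.0), 1.0)

print(composite([solid_blue, half_red]))  # ((0.5, 0.0, 0.5), 0.0)
```

The half-transparent red splat in front lets half the light through to the opaque blue splat behind it, so the pixel ends up an even mix of the two.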
Alongside the splats, Marble outputs collider meshes, low-fidelity geometry that a physics engine can operate on, plus higher-quality triangle meshes for compatibility with standard tools. This dual output is the engineering decision that, in the company's words, "dissolves the boundary between the renderer and the simulator." One model produces both what a scene looks like and a structure a program can run physics against.
The practical payoff is concrete. NVIDIA has published a technical workflow showing that a Marble scene, exported as Gaussian splats and a collider mesh, can be converted and imported into NVIDIA Isaac Sim to build a photorealistic, simulation-ready training environment. By NVIDIA's account, this approach compresses setup that once took weeks into hours. For robotics, where demonstrations and 3D environments are expensive and scarce, cheap, varied, physically usable worlds address a structural bottleneck.
What Are the Remaining Challenges in World Model Research?
- Visual-Physical Reconciliation: AI-generated geometry can look correct while containing self-intersections or wrong scale that produce nonsensical physics. Reconciling visual beauty with the precision a robot needs is, according to World Labs, "the defining open problem in world model research today."
- Real-World Deployment: Robotics demonstrations to date have been confined to heavily constrained laboratory setups. The gap between a compelling demo reel and a robot that reliably works in a kitchen, warehouse, or operating room remains vast.
- Data Scarcity: 3D assets with explicit geometry and physical annotations are "orders of magnitude scarcer" than the internet video that visual renderers train on, creating a fundamental bottleneck for simulator development.
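The scale problem named in the first item can be made concrete with basic free-fall arithmetic. The sketch below is an illustration under stated assumptions, not a World Labs or NVIDIA procedure: it supposes a table modeled as 1 metre tall whose geometry is misread at 100 times that size (for example, centimetre units interpreted as metres), and it ignores air resistance.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def fall_time(height_m):
    """Seconds for an object to fall from rest through height_m."""
    return math.sqrt(2.0 * height_m / G)

table_height = 1.0  # hypothetical intended height in metres

print(f"{fall_time(table_height):.2f} s")        # 0.45 s
print(f"{fall_time(table_height * 100):.2f} s")  # 4.52 s
```

A scene that looks identical on screen would have objects taking ten times longer to hit the floor, which is the kind of silent error that can corrupt a robot's training data.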
World Labs frames Marble as "the first chapter" toward a unified world model that can render, simulate, and plan from one system, a destination the company acknowledges is years away.
How Are Other AI Labs Approaching World Models?
AMI Labs and World Labs are not alone in this race. Google DeepMind's Genie line of generative interactive environments marks another front in the competition. NVIDIA runs a parallel simulation stack around Omniverse and its Cosmos world-foundation models. A field of well-funded startups is racing to solve the planner problem. The approaches differ in their technical foundations and strategic priorities, but all share the conviction that understanding physical reality is essential for the next leap in AI capability.
The convergence of massive funding, rigorous benchmarking, and competing technical approaches suggests the field is at an inflection point. Whether world models will prove superior to scaled language models for reasoning and planning remains an open question, but the $2 billion bet shows the AI industry is taking the possibility seriously.