FrontierNews.ai

SpaceX Is Building Its Own AI Training Software to Squeeze More Power From 220,000 GPUs

SpaceX is nearing completion of a fully custom artificial intelligence training system written in C and optimized for 220,000 NVIDIA GB300 graphics processing units (GPUs), designed to extract maximum performance from bare-metal hardware. Elon Musk revealed the project on X, noting that it makes heavy use of pipeline parallelism and 800G network interface cards. The implied speedup over existing frameworks like JAX for large training runs could be substantial, though Musk gave no performance figures and left the exact gains as a cliffhanger in his post.

Why Is SpaceX Building Its Own AI Training Software?

Most artificial intelligence labs train models using frameworks like JAX, PyTorch, or custom derivatives that abstract away the underlying hardware details. Those abstractions are convenient for developers, but they carry overhead. Every layer of software between a training job and the GPU costs computing cycles. At the scale SpaceX is operating, that overhead compounds fast and wastes valuable computing resources.
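To put "compounds fast" in concrete terms, here is a back-of-the-envelope sketch in Python. The 5% overhead figure is purely an illustrative assumption, not a number from Musk's post or from any published benchmark, and the function name is hypothetical. The only input taken from the reporting is the 220,000 GPU count.

```python
def overhead_cost(num_gpus: int, overhead_fraction: float, hours: float = 24.0) -> dict:
    """Estimate compute lost to software overhead across a cluster.

    overhead_fraction is the ASSUMED share of each GPU's time spent on
    framework overhead instead of useful training work (hypothetical input).
    """
    if num_gpus < 0 or hours < 0 or not 0.0 <= overhead_fraction <= 1.0:
        raise ValueError("num_gpus and hours must be >= 0, overhead_fraction in [0, 1]")
    return {
        # How many GPUs' worth of capacity the overhead consumes.
        "equivalent_idle_gpus": num_gpus * overhead_fraction,
        # GPU-hours lost over the given time window.
        "lost_gpu_hours": num_gpus * overhead_fraction * hours,
    }


if __name__ == "__main__":
    # 220,000 GPUs is from the article; 5% overhead is an assumption.
    estimate = overhead_cost(num_gpus=220_000, overhead_fraction=0.05)
    print(f"Equivalent idle GPUs: {estimate['equivalent_idle_gpus']:,.0f}")
    print(f"GPU-hours lost per day: {estimate['lost_gpu_hours']:,.0f}")
```

Under that assumed 5%, the loss is equivalent to roughly 11,000 GPUs sitting idle around the clock. The real overhead of any given framework on this cluster is not public, so the point is the scaling, not the specific figure.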

Writing directly in C and mapping the software exactly to the physical hardware eliminates those intermediate layers. Pipeline parallelism splits a model into sequential stages placed on different devices, then feeds small batches of data through the stages so that all devices work at the same time instead of waiting on one another. That scheduling becomes far easier to tune when you control the entire stack. The intended result is a training cluster that runs closer to its theoretical maximum throughput than any general-purpose framework can achieve.
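Why pipeline tuning matters can be seen with a small Python sketch. It models a simplified, forward-only pipeline schedule (in the style of GPipe) in which stage s works on micro-batch k at time step s + k. This is a textbook idealization, not SpaceX's actual scheduler, whose design has not been published; the function names and the example stage and micro-batch counts are assumptions for illustration.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Closed-form idle ("bubble") fraction of the simplified schedule."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def simulated_idle_fraction(num_stages: int, num_microbatches: int) -> float:
    """Count idle (stage, time step) slots by walking the schedule directly."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    total_steps = num_stages + num_microbatches - 1
    busy_slots = 0
    for step in range(total_steps):
        for stage in range(num_stages):
            microbatch = step - stage  # which micro-batch reaches this stage now
            if 0 <= microbatch < num_microbatches:
                busy_slots += 1
    return 1.0 - busy_slots / (total_steps * num_stages)


if __name__ == "__main__":
    # Hypothetical configurations: 8 pipeline stages, varying micro-batch counts.
    for microbatches in (8, 32, 128):
        idle = bubble_fraction(8, microbatches)
        print(f"stages=8 microbatches={microbatches}: idle fraction {idle:.3f}")
```

In this model, idle time shrinks as more micro-batches are pushed through the same number of stages, which is the kind of knob that is easier to turn when the scheduler, memory layout, and network code are all under one team's control.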

What Computing Infrastructure Is SpaceX Using?

The numbers show the scale of the ambition. According to verified reporting, SpaceX operates two major AI training clusters. Colossus I houses over 220,000 NVIDIA GPUs, including H100s, H200s, and GB200s. It is the cluster that Anthropic has contracted to use, paying $1.25 billion per month through May 2029. Colossus II, which came online at gigawatt-scale power in January 2026, holds approximately 550,000 to 555,000 NVIDIA Blackwell-series GPUs, primarily GB200 and GB300 chips, with plans to scale toward one million GPUs.

Colossus II is currently running seven parallel artificial intelligence model training jobs, including variants scaling up to 10 trillion parameters. That is the environment this new training stack is being built for: not a research cluster, but an active, production-scale AI factory.

How Does This Fit Into SpaceX's Broader AI Strategy?

SpaceX's S-1 registration statement, filed ahead of its planned initial public offering (IPO), explicitly reframes the company as a vertically integrated AI infrastructure platform encompassing compute, satellite networking, orbital data centers, and energy systems. The filing argues that future AI competition will be decided by control of underlying physical systems: chips, energy, networking, manufacturing, and deployment capacity.

A proprietary training stack is a direct extension of that thesis. If SpaceX can train models meaningfully faster than competitors using the same NVIDIA hardware, it gains an asymmetric advantage: more model iterations per dollar, faster research cycles, and a competitive moat that does not depend on getting scarce next-generation chips first.

Key Advantages of SpaceX's Custom Training Stack

  • Hardware Optimization: Writing in C and mapping software directly to physical hardware eliminates abstraction layers that waste computing cycles in traditional frameworks like JAX and PyTorch.
  • Pipeline Parallelism Efficiency: Splitting models across multiple devices becomes far easier to tune when engineers control the entire software stack from code to GPU.
  • Competitive Speed Advantage: Training models faster than competitors using identical NVIDIA hardware enables more iterations per dollar and accelerates research cycles.
  • Vertical Integration Moat: Owning the full stack from C code to GPU to data center power supply creates a competitive advantage that does not depend on access to scarce next-generation chips.

This is not a novel idea in principle. Google built TPUs and XLA for exactly this reason. However, doing it at this scale in C, targeting commodity NVIDIA hardware rather than custom silicon, represents a different kind of bet.

The broader Musk ecosystem stands to benefit too. Tesla's Dojo supercomputer has pursued a similar philosophy of custom silicon, custom interconnects, and custom software for autonomous driving training. A proven bare-metal training stack developed at SpaceX could inform or accelerate that work, though the two remain separate organizations with separate compute infrastructure.

Version 1.0 of the training stack is not finished yet, and benchmark numbers have not been published. But the direction is clear: SpaceX is betting that owning the full stack, from the C code to the GPU to the data center power supply, is the only way to compete at the frontier of artificial intelligence. Whether the speed gains match the ambition is the question that Version 1.0 will have to answer.