Why the U.S. Military Is Ditching Massive AI Models for Smaller, Smarter Ones
The U.S. federal government faces a critical modernization challenge: aging Python and Java codebases powering mission-critical defense systems need to move to secure, modern foundations, but manually refactoring millions of lines of code would take years of engineering effort. Rather than rely on large frontier models running on cloud servers, defense contractors and systems integrators are building specialized AI agent networks using smaller, locally-deployed models that can work in disconnected environments without sending sensitive code to external endpoints.
Why Aren't Defense Teams Using the Biggest AI Models Available?
The instinct in most AI applications is to reach for the largest model available, assuming more parameters and more computing power automatically deliver better results. That logic breaks down completely in national security environments where internet connectivity cannot be guaranteed, sensitive code cannot leave secure facilities, and infrastructure costs must remain predictable.
Large frontier models introduce three critical problems for mission-critical systems. First, they create unpredictable inference behavior at scale, meaning the same request might produce different outputs depending on server load or model state. Second, they demand massive GPU (graphics processing unit) memory and generate long reasoning chains that multiply token usage and latency costs. Third, they depend on external cloud connections, which violate security protocols in environments that are offline by design.
For U.S. federal agencies operating under the General Services Administration's AI governance framework, the mandate is clear: operationalize responsible AI before deploying agents to handle anything that matters. That means getting the infrastructure right first, with models that can run locally, predictably, and without external dependencies.
What Models Are Actually Performing Well in Defense Modernization Work?
Testing revealed that model capability still correlates with context length and parameter count, particularly for code migration tasks requiring understanding of large software systems. However, the models selected for this defense modernization project were chosen based on practical constraints, not raw capability alone.
The current model selections for the agent harness assign specialized models to different tasks:
- mistralai/Devstral-Small-2-24B-Instruct: Designated for coding agents, featuring a 256,000-token context window (roughly the equivalent of processing 200,000 words at once) and strong performance on coding benchmarks, making it well suited to software analysis and refactoring tasks.
- mistralai/Ministral-3-14B-Reasoning: Designated for non-coding agents, featuring a 256,000-token context window and effective performance on structured reasoning, dependency analysis, and orchestration across migration workflows.
- gpt-oss-120B: Used for building knowledge graphs representing the global structure of the codebase through a retrieval-augmented generation (RAG) approach, which allows AI systems to reference external information sources.
- intfloat/e5-mistral-7B-instruct: An embedding model used for indexing and vector retrieval within the knowledge graph system, enabling semantic search across millions of lines of code.
These were deliberate engineering decisions based on the available compute environment and the specific requirements of offline, mission-critical work. The context limitations encountered are a function of GPU capacity, not a constraint of the architecture itself. As infrastructure scales, the model selection strategy can scale with it.
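To make "run locally, predictably, and without external dependencies" concrete, here is a minimal sketch, assuming the models listed above are served on in-facility GPUs behind vLLM's OpenAI-compatible endpoints. The hostnames, ports, prompt text, and code snippet are hypothetical placeholders, not details from the project.

```python
# Minimal sketch: querying locally served models over vLLM's OpenAI-compatible
# API. Hostnames, ports, prompts, and code snippets are illustrative placeholders.
from openai import OpenAI

# Each model runs as its own in-facility vLLM server; nothing leaves the enclave.
coder = OpenAI(base_url="http://vllm-coder.internal:8000/v1", api_key="EMPTY")
embedder = OpenAI(base_url="http://vllm-embed.internal:8001/v1", api_key="EMPTY")

# Ask the coding model for a targeted refactor of one module.
refactor = coder.chat.completions.create(
    model="mistralai/Devstral-Small-2-24B-Instruct",
    messages=[
        {"role": "system", "content": "You are a code-migration agent. Return only code."},
        {"role": "user", "content": "Port this Java 8 DAO to the new persistence API: ..."},
    ],
    temperature=0.0,  # favor repeatable output over creativity
)

# Index a code chunk with the embedding model for later semantic retrieval.
chunk = "public class FlightScheduler { /* ... */ }"
vector = embedder.embeddings.create(
    model="intfloat/e5-mistral-7B-instruct",
    input=[chunk],
).data[0].embedding

print(refactor.choices[0].message.content[:200], len(vector))
```

Because every endpoint resolves inside the enclave, the same calling pattern works in fully disconnected environments.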
How to Build an AI System for Legacy Code Modernization
- Design for offline operation: Ensure all models run locally without requiring external cloud APIs, GitHub connections, or remote package registries that could introduce vulnerabilities or points of failure in national security environments.
- Use specialized agents for different tasks: Deploy separate AI agents optimized for coding work, dependency analysis, orchestration, and knowledge indexing rather than relying on a single general-purpose model to handle all migration challenges.
- Implement an agentic harness architecture: Build a modular orchestration framework that coordinates multiple specialized agents, allowing them to share state and interoperate across complex modernization programs without requiring human intervention at every step (a minimal sketch of this pattern follows this list).
- Prioritize deterministic behavior over raw model scale: Select models that deliver fast, repeatable execution and low-latency responses suitable for automation tasks like tool-calling and structured reasoning, rather than models that produce variable outputs requiring human review.
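As a rough illustration of the harness pattern, the sketch below shows one coordinator routing migration tasks to specialized agents and recording shared state along the way. The class names, task kinds, and stand-in agents are hypothetical; a production harness would wrap locally served models rather than string functions.

```python
# Minimal sketch of an agentic harness: one coordinator, several specialized
# agents, shared state. Names and roles are illustrative, not the real system.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    kind: str          # e.g. "refactor", "analyze_deps", "index"
    payload: str       # code, module path, or query
    result: str = ""


@dataclass
class Harness:
    # Map a task kind to the agent function that handles it.
    agents: Dict[str, Callable[["Task"], str]] = field(default_factory=dict)
    state: List[Task] = field(default_factory=list)  # shared migration state

    def register(self, kind: str, agent: Callable[["Task"], str]) -> None:
        self.agents[kind] = agent

    def run(self, tasks: List[Task]) -> List[Task]:
        for task in tasks:
            task.result = self.agents[task.kind](task)  # route to the specialist
            self.state.append(task)                     # record for later agents
        return tasks


# Stand-in agents; in practice each would call a locally served model.
def coding_agent(task: Task) -> str:
    return f"refactored:{task.payload}"

def dependency_agent(task: Task) -> str:
    return f"deps:{task.payload}"


harness = Harness()
harness.register("refactor", coding_agent)
harness.register("analyze_deps", dependency_agent)
print(harness.run([Task("analyze_deps", "billing_module"),
                   Task("refactor", "billing_module")]))
```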
Why Are Smaller Models Better for Automation Tasks?
Small language models (SLMs) are well-suited for agentic workflows because these systems require reliable, repeatable execution and low-latency responses, not just raw model scale. Smaller models can deliver fast, deterministic behavior that works well for automation tasks like tool-calling, orchestration, and structured reasoning.
When a model only "knows" one domain well, the probability distribution over its outputs becomes sharper, and it tends to produce the same answer more often. This consistency is critical for mission-critical work where unpredictable behavior could delay operations or introduce security vulnerabilities. Additionally, smaller models are easier to fine-tune and operate locally, enabling teams to deploy fleets of domain-specific agents without relying on external endpoints.
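One hedged illustration of how teams pin down that repeatability in practice is to fix the decoding settings when calling a locally served small model for an automation step. The endpoint, model routing, and prompt below are assumptions for illustration, not details from the project.

```python
# Sketch: deterministic decoding for an automation step against a local model.
# Endpoint, model choice, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://vllm-reasoner.internal:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning",
    messages=[{
        "role": "user",
        "content": "List the modules that import legacy_auth, as a JSON array of strings.",
    }],
    temperature=0.0,   # greedy decoding: always pick the most likely token
    seed=1234,         # fix any remaining sampling randomness, where supported
    max_tokens=256,    # cap output so latency stays predictable
)

print(response.choices[0].message.content)
```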
The challenge with larger models like Claude is that they sometimes overthink straightforward engineering tasks, producing unnecessarily long reasoning chains that increase token usage and latency. At migration scale, where systems contain millions of lines of code, that behavior compounds quickly into significant compute cost and operational overhead.
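A back-of-the-envelope sketch shows why that compounding matters; every figure here (codebase size, token ratios, throughput) is an invented assumption for illustration, not a measurement from the program.

```python
# Back-of-the-envelope sketch of how verbose reasoning compounds at migration
# scale. Every figure below is an assumption for illustration only.
lines_of_code = 5_000_000        # hypothetical legacy codebase size
tokens_per_line = 12             # rough prompt tokens of context per source line
output_ratio_small = 1.5         # output tokens per prompt token, terse small model
output_ratio_large = 6.0         # same ratio for a model that "overthinks"
throughput_tok_per_s = 2_000     # assumed aggregate generation throughput

def workload(output_ratio: float) -> tuple[float, float]:
    prompt_tokens = lines_of_code * tokens_per_line
    output_tokens = prompt_tokens * output_ratio
    hours = output_tokens / throughput_tok_per_s / 3600
    return output_tokens, hours

for name, ratio in [("terse small model", output_ratio_small),
                    ("long-reasoning model", output_ratio_large)]:
    toks, hrs = workload(ratio)
    print(f"{name}: {toks / 1e6:,.0f}M output tokens, ~{hrs:,.0f} hours of generation")
```

Under these assumed numbers, a single pass over the codebase generates roughly four times as many output tokens with the long-reasoning model, and the gap repeats on every iterative pass.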
What Are the Real Obstacles in Modernizing Legacy Defense Systems?
Brownfield migration, which means moving real production code rather than building a proof-of-concept from scratch, is fundamentally different from greenfield development. You cannot simply prompt an AI model and call the work complete.
Legacy systems present multiple interconnected challenges that require careful orchestration:
- Aging architectures: Systems designed around assumptions, frameworks, and infrastructure that reflect the engineering norms of an earlier era, making them difficult for modern AI models to understand without extensive context.
- Limited or uneven test coverage: Large portions of codebases were written before automated testing became a standard discipline, leaving gaps in documentation about expected behavior and edge cases.
- Implicit behavioral contracts: Dependencies between modules and services that exist in practice but were never formally documented or enforced, creating hidden assumptions that AI systems must infer.
- Accumulated technical debt: Layers of workarounds, deprecated libraries, and compatibility fixes that have gradually become part of the production system, requiring AI agents to understand both the original intent and the accumulated modifications.
For national security missions, the stakes of getting this wrong are not abstract. A failed migration does not just break an application; it can ground a workflow, delay a mission, or introduce a security vulnerability into infrastructure that protects people.
How Does This Approach Scale Across Multiple Modernization Programs?
The heart of this platform is an agentic harness, a modular orchestration framework that coordinates multiple specialized AI agents, each responsible for a specific part of the migration workflow. This harness runs on OpenShift AI, leveraging vLLM for efficient, low-latency model inference.
Over time, this pattern evolves toward what practitioners describe as an agent mesh, a "harness of harnesses" architecture where multiple agentic workflows can interoperate, coordinate tasks, and share state across complex modernization programs. This approach allows systems integrators to supervise modernization at scale rather than perform every step by hand, addressing the years of engineering effort that manual refactoring would require.
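Here is a minimal sketch of that "harness of harnesses" idea, under the assumption that harnesses interoperate by reading and extending a shared state store. The class names, stand-in harnesses, and in-memory store are placeholders for illustration.

```python
# Sketch of an agent mesh ("harness of harnesses"): several harnesses, one
# coordinator, shared state. Names and the in-memory store are illustrative.
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Mesh:
    harnesses: Dict[str, Callable[[str, dict], dict]] = field(default_factory=dict)
    shared_state: dict = field(default_factory=dict)  # facts every harness can read

    def register(self, name: str, harness: Callable[[str, dict], dict]) -> None:
        self.harnesses[name] = harness

    def dispatch(self, name: str, job: str) -> dict:
        # Each harness sees the shared state and may extend it for the others.
        update = self.harnesses[name](job, self.shared_state)
        self.shared_state.update(update)
        return update


# Stand-in harnesses; in practice each wraps its own fleet of specialized agents.
def analysis_harness(job: str, state: dict) -> dict:
    return {f"deps:{job}": ["legacy_auth", "billing"]}

def migration_harness(job: str, state: dict) -> dict:
    deps = state.get(f"deps:{job}", [])
    return {f"migrated:{job}": f"ported with {len(deps)} known dependencies"}


mesh = Mesh()
mesh.register("analysis", analysis_harness)
mesh.register("migration", migration_harness)
mesh.dispatch("analysis", "scheduler_service")
print(mesh.dispatch("migration", "scheduler_service"))
```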
The practical implication is significant: instead of assigning teams of engineers to manually update millions of lines of code, organizations can deploy AI agents to handle the repetitive, pattern-based work while human engineers focus on validating results and handling edge cases that require domain expertise or judgment calls about architectural decisions.