FrontierNews.ai

Your Graphics Card's Memory Is the Real Limit for Local AI, Not Model Hype

Your graphics card's memory capacity silently decides which AI models will run on your machine, how fast they respond, and whether they're actually usable. While tech forums obsess over benchmark scores and model cleverness, the practical reality is far simpler: a model that doesn't fit your available memory becomes either unusably slow or impossible to run at all.

Why Does VRAM Matter More Than Processing Power?

Generating text with an AI model is fundamentally a memory problem, not a computing problem. To produce each token (roughly a word or word fragment), your graphics card must read all the model weights it needs from memory, so the rate at which it can move data sets the ceiling far more than raw calculating power does. This is why a card that holds your model comfortably beats a theoretically faster card that has to spill data into your computer's main memory, which is dramatically slower.
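A rough way to see that ceiling is to divide memory bandwidth by the bytes read per token. The sketch below assumes every weight is read once per generated token and ignores compute time, cache reads, and software overhead, so it gives an upper bound rather than a prediction. The function name, the bandwidth figures (about 1,008 GB/s for an RTX 4090 and an illustrative 60 GB/s for system RAM), and the model sizes are assumptions for illustration, not measurements from this article.

```python
def decode_ceiling_tok_s(weights_gb, gpu_bw_gbs, gpu_fraction=1.0, cpu_bw_gbs=60.0):
    """Upper bound on tokens/s when decoding is limited by reading weights.

    Assumes each weight is read once per token: the share held in VRAM is
    read at gpu_bw_gbs and the spilled share at cpu_bw_gbs (both in GB/s).
    """
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_seconds = weights_gb * gpu_fraction / gpu_bw_gbs
    cpu_seconds = weights_gb * (1.0 - gpu_fraction) / cpu_bw_gbs
    return 1.0 / (gpu_seconds + cpu_seconds)


# Illustrative inputs: a ~5 GB 4-bit 8B model and a ~42 GB 4-bit 70B model.
fits = decode_ceiling_tok_s(5.0, gpu_bw_gbs=1008.0)
spills = decode_ceiling_tok_s(42.0, gpu_bw_gbs=1008.0, gpu_fraction=0.5)
print(f"fully in VRAM: up to {fits:.0f} tok/s")
print(f"half spilled:  up to {spills:.1f} tok/s")
```

Under these assumptions the spilled half dominates the time per token, which is why adding a faster GPU does little once a large share of the weights lives in main memory.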

Think of your graphics card's memory as a fixed budget shared by three competing demands at the same time: the model's weights, the conversation history the model stores to avoid redoing work, and temporary scratch space for current calculations. The moment one grows, the others get squeezed. This creates hard limits that no amount of model-shopping can overcome.

What Can Different Graphics Cards Actually Run?

The jumps between memory tiers are not smooth; they are cliffs. Here's what you can realistically expect from common graphics card configurations:

  • 8 GB cards (RTX 4060, 3050): Run 7B to 8B parameter models at 4-bit quantization with modest conversation length. A 13B model sits at the edge and gets risky; anything larger spills to main memory and crawls at 1 to 3 tokens per second.
  • 12 GB cards (RTX 3060, 5070): Handle 13B models at 4-bit quantization comfortably, but conversation length is limited to roughly 4,000 tokens with little room for anything else running alongside.
  • 16 GB cards (RTX 5080, 5070 Ti): Run 13B to 14B models at full speed with headroom. A 70B model only partly fits, enough to move it from completely unusable to merely slow.
  • 24 GB cards (RTX 4090, 3090): Run 30B to 34B models at 4-bit quantization comfortably. This is the serious-hobbyist sweet spot and the practical floor for running a squeezed 70B model by splitting it between GPU and CPU.
  • 32 GB cards (RTX 5090): Hold 30B-plus models fully in memory and stretch to a low-bit 70B. The real edge over the 4090 comes from memory speed, delivering roughly 67% faster performance.
  • 48 GB and above (RTX 6000 Ada, RTX PRO 6000, DGX Spark): Run a 70B model at 4-bit quantization comfortably; 96 to 128 GB unified machines reach 120B-plus models. This is workstation and desktop AI territory.

The concrete numbers tell the story. A 70-billion-parameter model at 16-bit precision (two bytes per parameter) needs about 140 gigabytes just for its weights, before a single word of conversation. No consumer card holds that. Squeezed hard to 4-bit quantization, that same model still needs around 42 gigabytes once quantization overhead is included, which overflows a 24 GB card and forces part of the model into much slower main system memory. When most of the model ends up there, output can drop to roughly 1 to 3 tokens per second, which is not usable for real-time conversation.
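The arithmetic behind those figures is simple enough to check by hand. The sketch below multiplies parameter count by bits per weight; the helper name is hypothetical, and the 4.8 bits-per-weight value is an assumption chosen to account for the scaling data that 4-bit formats typically store alongside the weights, which is one way to arrive at roughly 42 GB.

```python
def weight_gb(params_billions, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 10**9 bytes)."""
    return params_billions * bits_per_weight / 8


# Bits per weight for a few precisions; 4.8 is an assumed effective rate.
precisions = [
    ("16-bit", 16),
    ("8-bit", 8),
    ("4-bit, nominal", 4),
    ("4-bit, with overhead", 4.8),
]

for label, bits in precisions:
    print(f"70B at {label}: {weight_gb(70, bits):.0f} GB")
```

The same function applied to a 7B or 8B model shows why those sizes fit on an 8 GB card at 4-bit but not at 16-bit.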

How to Match Your Hardware to the Right Model

  • Start with your VRAM, not the leaderboard: Flip the conventional approach. Instead of asking "which model is smartest," ask "what can my graphics card sustain at a speed I'll actually tolerate?" An 8 GB card running a 7B model at 4-bit delivers about 20 to 25 tokens per second. Push the same card to a 30B model and it collapses to 1 to 3 tokens per second. The card didn't change; the fit did.
  • Account for conversation memory alongside model weights: The model's weights are only part of the memory budget. As the model generates text, it stores what it has already worked out for every earlier token so it doesn't have to redo that work at each step. This storage, usually called the key-value (KV) cache, grows with every token and scales with conversation length and the number of simultaneous users. In a realistic setup on a 24 GB card, conversation memory for a long chat can match or exceed the model weights themselves.
  • Understand the three-way trade-off: You can spend your memory budget on a bigger model, a longer conversation, or more simultaneous users, but not all three at once. Every local setup is a negotiation between those three claims on the same fixed pool. The hardware drew those lines, and no amount of model-shopping moves them.
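To make the trade-off concrete, conversation memory can be estimated from a model's shape. The sketch below uses the standard key-value cache sizing (one key and one value vector per layer, per KV head, per token) and a simple fit check. The function names, the 1 GiB scratch allowance, the 4.6 GiB weight figure, and the model shape (32 layers, 8 KV heads, head size 128, similar to an 8B Llama-style model) are assumptions for illustration.

```python
def kv_cache_gib(layers, kv_heads, head_dim, tokens, users=1, bytes_per_value=2):
    """Size of the key-value cache in GiB.

    Each token stores one key and one value vector per layer per KV head.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens * users / 2**30


def fits(vram_gib, weights_gib, cache_gib, scratch_gib=1.0):
    """True if weights, cache, and scratch space fit in VRAM together."""
    return weights_gib + cache_gib + scratch_gib <= vram_gib


# Assumed 8B-class shape with a 16-bit cache and 8,192-token conversations.
one_chat = kv_cache_gib(32, 8, 128, tokens=8192)
eight_chats = kv_cache_gib(32, 8, 128, tokens=8192, users=8)

print(f"1 user, 8K tokens:  {one_chat:.1f} GiB")
print(f"8 users, 8K tokens: {eight_chats:.1f} GiB")
print("8 GiB card, 1 user: ", fits(8, 4.6, one_chat))
print("8 GiB card, 8 users:", fits(8, 4.6, eight_chats))
```

With these inputs a single long chat fits next to the weights while eight concurrent chats do not, which is the three-way negotiation in miniature: shrinking the model, the context, or the user count are the only levers.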

The practical implication is stark: a solo developer on a 12 GB card and a startup on a 48 GB workstation simply are not running the same models, holding the same conversations, or serving the same number of users. The hardware made those decisions before either of them downloaded anything.

What Speed Should You Actually Expect?

Speed figures without context tell you almost nothing. Here are real-world benchmarks tied to exact setups:

  • Llama 3.1 8B at 4-bit on an RTX 4090: About 92 tokens per second, measured using LM Studio.
  • Same 8B model on an RTX 5090: About 213 tokens per second, a gain that comes almost entirely from higher memory bandwidth.
  • Llama 2 70B at 4-bit with 40 layers offloaded to the GPU on a 24 GB 4090: About 18 tokens per second, using 23 GB of VRAM.
  • Same model with only 20 layers offloaded to the GPU: Falls to 8 tokens per second. The model didn't change; the fit did.
  • Batching server versus single-stream tool on the same card: 16 to 20 times the total throughput across users, just by managing conversation memory better.
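One reason batching pays off so well can be sketched with the same memory-bound reasoning: a decoding step reads the weights once for the whole batch, so that cost is shared across users, while each user adds only their own cache reads. The model below is a simplified estimate with hypothetical inputs (5 GB of weights, 0.1 GB of cache per conversation, 1,008 GB/s of bandwidth). It ignores compute limits and scheduling overhead, and it is not a reproduction of the measurements listed above.

```python
def aggregate_tok_s(batch, weights_gb, kv_gb_per_seq, bw_gbs):
    """Memory-bound estimate of total tokens/s across a batch.

    One decoding step reads the weights once (shared by the whole batch)
    plus each sequence's own cache, and emits one token per sequence.
    """
    step_seconds = (weights_gb + batch * kv_gb_per_seq) / bw_gbs
    return batch / step_seconds


# Hypothetical inputs: 5 GB weights, 0.1 GB cache per chat, 1,008 GB/s.
single = aggregate_tok_s(1, weights_gb=5.0, kv_gb_per_seq=0.1, bw_gbs=1008.0)
for batch in (1, 8, 32):
    total = aggregate_tok_s(batch, weights_gb=5.0, kv_gb_per_seq=0.1, bw_gbs=1008.0)
    print(f"batch {batch:>2}: {total:.0f} tok/s total, {total / single:.1f}x single-stream")
```

In this model the gain flattens as per-conversation cache reads start to rival the weight reads, which is why how a server packs and manages conversation memory determines how many users it can batch at once.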

Memory speed, not raw calculating power, is the spec to watch. This is why choosing by fit rather than by leaderboard is the difference between a tool people actually use and an expensive demo that sits idle. Usability, not raw cleverness, is what turns a model into real work.