Logo
FrontierNews.ai

One Developer Built an AI Council on His Gaming GPU,and It's Changing How He Thinks About Local Models

Running multiple local AI models simultaneously on consumer hardware sounds impossible, but one developer found a way to make them work together. By adapting Andrej Karpathy's "LLM Council" concept to work with local models running on his RTX 4070 Ti GPU through Ollama, a popular tool for self-hosting AI models, he discovered that having three models debate and critique each other's answers often produces better results than asking any single model for the final word.

What Is the LLM Council Concept?

Karpathy's LLM Council is a three-stage process designed to improve AI reasoning by introducing debate and critique. In the first stage, multiple models independently answer the same question. In the second stage, each model anonymously reviews and ranks the others' responses. Finally, in the third stage, a designated "chairman" model reads all the reviews and produces a final answer that synthesizes the best elements from each perspective.

The original concept was built for cloud-based APIs, where multiple models could run in parallel on dedicated servers. Adapting it to work on a single consumer GPU required significant rethinking. The developer tested three models: DeepSeek-R1 8B, Qwen 3.5 9B, and Gemma 4 E4B, all running through Ollama, an open-source platform that lets users run large language models (LLMs) locally without relying on cloud services.

Why Did Running Three Models at Once Fail on Consumer Hardware?

The original LLM Council implementation fired requests to all models simultaneously, which works fine when cloud providers have dedicated hardware for each model. But on a single 12GB GPU, attempting to run three models with 8 to 9 billion parameters each in parallel caused one model to silently fail without returning any response. The solution was to process each model sequentially instead of in parallel, trading speed for reliability.

This sequential approach meant each stage took longer to complete, but it ensured every model could generate a full response. Once the workflow was stable, the real question became whether the extra computation time and effort actually produced better answers than simply asking the strongest single model.

How Did Each Model Perform Individually?

Before building the council, the developer tested all three models on his RTX 4070 Ti across four different categories of prompts. Each model had distinct strengths and weaknesses:

  • DeepSeek-R1 8B: Excelled at reasoning tasks but hallucinated (generated false information) under pressure when questions became complex.
  • Qwen 3.5 9B: Demonstrated broader knowledge than the other two models but often buried good answers in verbose, lengthy responses that were hard to parse.
  • Gemma 4 E4B: Proved strongest at organizing and synthesizing information from multiple sources but occasionally produced factually incorrect answers.

Initially, the developer chose Gemma 4 as his default model because it was the best synthesizer. However, he found himself manually comparing all three responses whenever a prompt truly mattered, acting as a judge to weigh each model's unique perspective.

What Did the Council Actually Produce?

When tested on a complex technical question about Cloudflare Tunnel versus Pangolin, the council produced an unexpected result: all three models ranked themselves first while disagreeing on how to order the others. Technically, this meant the "street cred leaderboard" showed a perfect three-way tie, but the real value emerged in stage three.

The chairman model (Gemma 4) didn't rely on the rankings alone. Instead, it read the detailed reasoning behind each model's self-assessment and synthesized them into a single response. DeepSeek provided the most detailed practical deployment information, Qwen framed trade-offs in an approachable way, and Gemma added structural clarity that tied everything together. The council's strength wasn't in determining a winner; it was in forcing each model to read and respond to the others' logic.

How to Set Up a Local AI Council on Your Own Hardware

  • Choose Your Models: Select three models with complementary strengths, not three identical ones. Pair a reasoning-focused model with a knowledge-focused model and a synthesis-focused model for maximum diversity.
  • Install Ollama: Download Ollama and configure it to expose an OpenAI-compatible API endpoint, which allows you to swap cloud API calls for local model calls without rewriting the entire application.
  • Adapt the Workflow for Sequential Processing: If running on consumer hardware with limited VRAM, modify the council implementation to process each model one at a time rather than in parallel, accepting longer response times in exchange for reliability.
  • Build a User Interface: Create a clear UI that shows each stage of the council process, including color-coding for stages, a leaderboard of model rankings, and separated sections so you can follow how the council reached its conclusion.
  • Test on Real Prompts: Run the council on questions that actually matter to you, comparing the council's final answer against what you would get from a single model to determine if the extra computation is worth the improvement.

What Are the Trade-Offs of Running a Local AI Council?

The council approach is slower and more demanding on hardware than asking a single model. Each request requires three separate inference passes plus two additional review stages, multiplying the computational load. For routine questions, the developer still opens Gemma directly because the speed advantage outweighs the marginal improvement in answer quality.

However, for complex prompts where accuracy and nuance matter, the council's ability to synthesize multiple perspectives justifies the extra time. The developer noted that synthesis is a fundamentally different skill from generation or self-grading. A single model might generate a correct answer, but it cannot reliably evaluate whether its own answer is better than an alternative perspective without seeing that alternative first.

The privacy and cost benefits of running everything locally on consumer hardware remain significant. There are no API fees, no data sent to cloud providers, and no dependency on external services. For developers and researchers who value control and privacy, the trade-off between speed and autonomy makes the council approach worth exploring, even if it is not practical for every single query.