Can Local Models Replace Claude and GPT for Daily Coding?

Developers on Hacker News discuss whether local AI models can fully replace cloud-based tools like Claude and GPT for professional coding work. The community shares setups, performance metrics, and real-world experiences.

You are choosing between keeping Claude or GPT-4 as your daily coding assistant and switching to a local model running on your own hardware. The question is not whether local models have gotten better - they have - it is whether they are good enough to replace a frontier model for the work you actually do every day, not just for demos. A thread on Hacker News this week, titled Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?, drew several hundred responses from engineers who have tried this for real. The answers are more useful than most benchmark comparisons.

The number that anchors every setup in the thread

60 tok/s

The approximate threshold where local model responses stop feeling like waiting and start feeling like a tool

Sixty tokens per second comes up repeatedly as the dividing line. Below it, the experience feels like watching code type itself. Above it, the model starts to feel interactive. This is not a universal law, but it matches what you would expect from how developers actually work: short completions and quick back-and-forth questions dominate the workflow, not long document generation, so latency per token compounds quickly. The setups hitting 60+ tok/s in the thread are almost all running quantized versions of Qwen 2.5 Coder 32B or DeepSeek Coder V2 on machines with an RTX 4090 or a Mac with M-series silicon and at least 64 GB unified memory. A 4090 with 24 GB VRAM can run a Q4_K_M quantized 32B model at roughly 55 to 70 tok/s depending on the model architecture. A MacBook Pro M4 Max with 128 GB memory pushes higher - some reports in the thread land at 80 to 90 tok/s for 32B models. What changes if you halve that number? A 7B model running on a mid-range GPU might give you 30 tok/s. The model is technically responding. But for a task like asking it to trace a call stack across five files, the 8-second wait for a 240-token answer erodes the habit. Engineers in the thread consistently reported that they stopped using the local model for exploratory questions and only kept it for completions when it felt slow. That is not a replacement workflow. It is a fallback.

Why context window size matters more than benchmark scores

The raw benchmark scores for models like Qwen 2.5 Coder 32B are competitive with older Claude versions on standard coding tasks. That is not the issue. The issue is what happens when you paste in a real file. Think of a context window like a whiteboard in a meeting room. A frontier model via API gives you a whiteboard that covers an entire wall. A 32B model running locally on 24 GB of VRAM gives you a whiteboard roughly the size of a standard A1 sheet. You can still write code on it. But when your codebase file is 800 lines, your test file is 400 lines, and you want to ask about the interaction between three functions, you are already taping sheets together and hoping nothing falls off. The practical ceiling for most local 32B setups is around 16K to 32K tokens in context before you start seeing quality degrade or speed drop sharply. Claude 3.7 Sonnet handles 200K tokens. That gap matters enormously for refactoring work, for reading unfamiliar codebases, and for any task where you are asking the model to hold multiple files in mind simultaneously. The engineers in the HN thread who reported successful full replacements were, almost without exception, working on projects with tight scope: personal tools, single-file scripts, hobby projects with small surface areas. The ones maintaining production systems with large codebases were using local models as a supplement, not a replacement. They would run the local model for fast completions and fall back to Claude for anything requiring broad context. That is a sensible split, but it is not a replacement.

A decision tree for your specific situation

The right answer depends almost entirely on your hardware, your codebase size, and your tolerance for setup friction. Here is how to map your situation to a decision. If you have an M3 Pro or M3 Max Mac with 36 GB+ unified memory: Run Qwen 2.5 Coder 32B via llama.cpp or LM Studio. You will get usable speeds, reasonable context, and zero API cost. This setup works as a genuine daily driver for projects under roughly 50K tokens of relevant context. Pair it with Cursor or another editor that supports local model endpoints. If you have an NVIDIA GPU with 24 GB VRAM (RTX 4090 or equivalent): The story is similar. You can run 32B quantized models at 55 to 70 tok/s. The setup is more involved on Windows, slightly more straightforward on Linux. If your daily work is Python or TypeScript with moderate file sizes, this is viable. If you regularly paste in large files or want cross-repo understanding, you will feel the context limit. If your machine has 16 GB VRAM or less: You are in 7B or 13B model territory for anything running at interactive speeds. These models are useful for autocomplete and small function generation. They are not replacements for Claude on non-trivial tasks. The step down in reasoning quality is real and compounds on anything requiring more than two to three steps of logic. If your concern is primarily cost: The math shifts. A developer spending $80 to $120 per month on Claude Pro or API usage could recoup hardware costs within 18 to 24 months on a high-end GPU or Apple Silicon machine - assuming they were going to buy new hardware anyway. If you are buying a GPU specifically for inference, the break-even extends and the opportunity cost of the quality gap has to be factored in. If your concern is privacy or air-gapped environments: This is the strongest case for going local. No data leaves your machine. The quality tradeoff is worth it for many compliance-sensitive situations. DeepSeek models running locally have become a popular choice for this use case in the thread, though the political context around DeepSeek has made some organizations cautious about using it even offline. If you want to understand what local models can do before committing hardware: The comparison at Qwen vs Claude for local model use covers the quality gap in more detail. The post on running Gemma 4 locally via LM Studio gives a concrete setup path if you want to test before buying.

When to act on this and when to wait

If you are on Apple Silicon with 64 GB or more unified memory, the earliest you could act on this productively is right now. The tooling is mature enough - llama.cpp, LM Studio, and Ollama all have stable releases - and the model quality at 32B has cleared the bar for daily single-repo coding work. Download a Q4_K_M quantized Qwen 2.5 Coder 32B, point your editor at the local endpoint, and run it for one week on real tasks before making a decision. If you are on a Windows or Linux machine with a mid-range GPU, wait for the next generation of consumer GPUs with 32 GB+ VRAM. The 24 GB ceiling is workable but tight. NVIDIA's next consumer tier is expected to push VRAM higher, and that single change resolves most of the context limitations that make current local setups feel like a compromise rather than a solution. Until then, the split workflow - local model for fast completions, Claude for broad context tasks - is the more honest recommendation. See the Cursor vs GitHub Copilot comparison for more on how editor-level tooling affects this kind of hybrid setup.