Run Gemma 4 Locally with LM Studio's New CLI

LM Studio's headless CLI now exposes Gemma 4 as an OpenAI-compatible API endpoint, letting you build a local coding agent with zero cloud costs and complete data privacy. Setup takes 10-20 minutes.

Is running a capable coding model locally on your own hardware actually practical in 2026, or is it still the kind of thing that requires a weekend of configuration and a high tolerance for documentation gaps?

The answer changed recently. LM Studio's headless CLI shipped a clean server mode that exposes local models as OpenAI-compatible endpoints. Point any tool that speaks the OpenAI API - Cursor, Claude Code with a custom backend, Goose - at localhost, and the integration works without modification. The setup is 10-20 minutes on a machine that meets the RAM requirements.

This guide covers exactly what that looks like with Gemma 4.

Why Gemma 4 and why now

Google released Gemma 4 in early 2026 as an open-weight model. Open-weight means the weights are publicly available - you download them and run them on your hardware, with no calls back to Google's infrastructure. That's a fundamentally different relationship with the model than using Claude or GPT-4o, which only run on company servers.

Gemma 4 comes in four sizes. The right choice depends on what your machine can handle.

Variant	Parameters	RAM needed	Best for
Gemma 4 1B	1 billion	4GB	Older hardware, limited use cases
Gemma 4 4B	4 billion	8GB	Basic coding tasks on constrained machines
Gemma 4 12B	12 billion	16GB	M1/M2 Macs, everyday development work
Gemma 4 27B	27 billion	32GB+	M3 Pro and higher, demanding tasks

The 12B variant on a machine with 16GB of RAM handles the majority of routine coding tasks without noticeable lag. On an M4 Max with 48GB, the 27B model generates 30-40 tokens per second - fast enough that you stop noticing it's local.

The performance assessment: Gemma 4 handles routine work well. Function generation from docstrings, code explanation, unit test writing, refactoring suggestions, syntax error identification - these work reliably. Architectural reasoning across a large unfamiliar codebase, complex multi-step refactors, and nuanced judgment calls are where it falls short of Claude Opus. That gap matters for maybe 20-30% of coding tasks. For the other 70%, the local model is sufficient.

Setting up the server

Download LM Studio from lmstudio.ai. Open it, search for Gemma 4, and download the variant that fits your hardware. The 12B model is around 7GB compressed. Plan for the download time if your connection is slow.

Once downloaded, start the headless server:

lms server start --model gemma-4-12b

The model loads and serves an OpenAI-compatible API at http://localhost:1234/v1. LM Studio logs token generation speed to the terminal - check this number. It tells you whether your hardware is handling the model comfortably or struggling under load.

The server runs as a background process. You can close the terminal and it keeps running. Stop it with:

lms server stop

Connecting your coding tools

This is where LM Studio's OpenAI-compatible API design pays off. Any tool that accepts a custom API endpoint works without modification.

For Claude Code with a custom backend, set these environment variables before launching:

export ANTHROPIC_API_BASE_URL=http://localhost:1234/v1
export ANTHROPIC_API_KEY=local

The API key value doesn't matter - LM Studio doesn't validate it locally. Any string works. After setting these, launch Claude Code normally. It connects to the local server instead of Anthropic's infrastructure.

For Cursor, go to Settings, then Models, then add a custom model with base URL http://localhost:1234/v1 and any non-empty API key. Select it from the model dropdown when you want to use local inference.

For Goose, which was designed for model-agnostic use from the start, point the provider configuration at the local endpoint. The tool treats local inference identically to cloud APIs.

What to expect on real work

Two variables matter most: your hardware and your task type.

On hardware: 16GB RAM is the realistic minimum for the 12B model. Below that, you're looking at the 4B variant with meaningfully lower output quality. At 32GB, the 27B model runs comfortably. At 48GB or more, you're getting performance that rivals cloud API response quality for routine tasks.

On task type: Gemma 4 performs confidently on generating functions from clear specifications, writing tests for existing code, explaining how code works, catching syntax and logic errors in short snippets, and translating between languages or frameworks it knows well. It struggles when the codebase is large and unfamiliar, when the task requires reasoning across many files simultaneously, or when the right answer requires judgment about architectural tradeoffs rather than pattern recognition.

The practical approach developers end up with: route the first category through local Gemma, keep cloud API access for the second. The hybrid setup cuts cloud spend significantly while maintaining quality on the tasks that actually need it.

70%

of routine coding queries can be handled by local Gemma 4 on adequate hardware - cloud API reserved for complex reasoning tasks

Other models worth comparing

Gemma 4 is not the only option LM Studio supports well. If you want to compare before committing to a setup, three others are worth testing on your specific hardware and workload.

Mistral Small 3 at 22B parameters performs well on multilingual projects and has strong benchmark results on code tasks. It fits comfortably on 32GB machines. DeepSeek Coder V3 was purpose-built for coding and matches or beats Gemma 4 on several code-specific benchmarks with a slightly smaller footprint. Microsoft's Phi-4 at 14B parameters performs above its parameter count on reasoning tasks and runs fast on consumer hardware - a good choice if your primary constraint is response speed rather than task complexity.

The best way to choose: run the same ten prompts from your actual codebase through each model. Benchmark numbers don't translate cleanly to individual codebases. Your own work is the only benchmark that matters for your setup.

Local vs cloud: choosing by situation
Local inference is the right choice when your API bill has become a real budget line, when your company's data policy prohibits sending proprietary code to third-party servers, when you need predictable latency without queueing, or when you're building something that will make a large volume of model calls and per-token costs would accumulate to unreasonable levels.

Cloud remains the right choice when you code infrequently and the setup overhead isn't worth it, when your tasks regularly require frontier-level reasoning that local models fall short on, or when your hardware doesn't meet the 16GB minimum for useful inference. A single month of Claude subscription may cost less than upgrading hardware to run 27B models comfortably.

The hybrid path - local for routine work, cloud for complex problems - works well for developers who are already paying cloud API rates and have hardware that can handle a 12B or larger model. The setup is a one-time cost. The savings are ongoing.

Verification checklist before you rely on this setup

LM Studio server started and terminal shows token generation speed above 10 tokens/second
Environment variables ANTHROPIC_API_BASE_URL and ANTHROPIC_API_KEY are set before launching your tool
First test prompt generates a response without authentication errors
Token generation speed during a real coding prompt matches the idle speed logged at startup
CPU and RAM usage are stable after 10-15 minutes of continuous use (not climbing toward 100%)
Output quality on a sample of your actual routine coding tasks is acceptable
You have a fallback (cloud API or key) configured for tasks where local model output isn't sufficient