Google's Gemma 4 isn't the smartest local LLM I've run, but it's the one I reach for most
At a glance:
- Google's Gemma 4 family includes four multimodal models: 31B dense, 26B-A4B MoE, E4B, and E2B edge variants.
- The 26B-A4B MoE model activates only 3.8B parameters per token, delivering near-31B quality at much faster speeds.
- Edge models (E4B/E2B) support native audio input, OCR, and chart understanding, enabling fully offline voice assistants.
Google's Gemma 4 family brings a new balance of speed and smarts
If you've spent any time running local LLMs, you know the drill. You find a model that's smart enough but too slow, or fast enough but too dumb, and you spend hours swapping between quantizations trying to find the right balance. It's a constant trade-off between quality and speed, and it's one that I've largely accepted as the cost of keeping my inference local. Google's new Gemma 4 models don't eliminate that trade-off entirely, but they come closer than anything else I've tried.
What makes Gemma 4 different from previous open-weight releases is the sheer range of what Google shipped. There are four models in the family, all built from the same Gemini 3 research, all released under the Apache 2.0 license, and all natively multimodal. That license change is a big deal, because it makes Gemma 4 far easier to use as a foundation for commercial and self-hosted projects, rather than something whose licensing terms you have to build around. You get the 31B dense model for when you want the best quality, the 26B-A4B Mixture of Experts model that activates only 3.8 billion parameters per token for speed, and then two edge models, the E4B and E2B.
I've been testing all four models, and the 26B-A4B has been by far my favorite. I've had it running on the ThinkStation PGX since it came out, and it's the one I keep coming back to. It's not perfect, and the inference speed story has some caveats, but it's very good, and that's more than I can say for most local models at this size.
Four models from the same DNA
A different focus for each
The biggest shift with Gemma 4 is that Google didn't just release a set of similar models at different sizes and call it a day. The four variants are meaningfully different in how they're built and what they're designed for, yet they all come from the same research base and shared training approach. That matters in practice, because they share the same training pipeline and the same quirks, even though they serve very different purposes.
The 31B dense model is the flagship. It scores 89.2% on AIME 2026, 80% on LiveCodeBench, and 76.9% on the agentic tau2-bench. Those are numbers that put it comfortably in line with, and even ahead of, models with ten times the parameter count, which is exactly what Google was going for with its "intelligence-per-parameter" pitch. The 26B-A4B MoE model trails only slightly behind, hitting 88.3% on AIME and 68.2% on tau2-bench, but it does so while only activating 3.8 billion parameters per forward pass. That means it runs almost as fast as a 4B model while giving you most of the 31B's quality.
Where the two really diverge is in hardware requirements. Google's own base-weight estimates put the 31B at roughly 17.4GB in 4-bit and the 26B-A4B at roughly 15.6GB in 4-bit, though real-world requirements rise once you account for runtime overhead and KV cache. On my ThinkStation PGX with its 128GB of unified memory, both fit without issue, but on a more typical consumer GPU, the 26B with a slightly aggressive quantization is a more realistic option. And because it activates so few parameters per token, you're not giving up much quality for that smaller footprint. While I haven't tested it, it may also be possible to offload the expert weights to system RAM to conserve more VRAM, just like I did with gpt-oss-120b.
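To put those base-weight numbers in context, here's a back-of-envelope sketch of where the memory goes. The bits-per-weight figure and the cache shape below are illustrative assumptions, not Gemma 4's actual configuration: real quants mix precisions per layer (which is why Google's 15.6GB figure for the 26B doesn't match a flat 4.5-bit estimate), and Gemma's sliding-window attention layers shrink the real cache considerably.

```python
# Back-of-envelope VRAM estimate: quantized weights plus KV cache.
# Bits-per-weight and cache dimensions are illustrative assumptions,
# not official Gemma 4 numbers.

def weight_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer, per token, per KV head.
    This treats every layer as global attention, so it's an upper bound."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

# ~4.5 bits/weight is typical of a Q4_K_M-style quant.
print(f"31B weights @ ~4.5 bpw: {weight_gb(31, 4.5):.1f} GB")
print(f"26B weights @ ~4.5 bpw: {weight_gb(26, 4.5):.1f} GB")

# Hypothetical cache shape (60 layers, 8 KV heads, 128-dim) at 32K tokens.
print(f"KV cache @ 32K ctx (all-global upper bound): "
      f"{kv_cache_gb(60, 8, 128, 32_768):.1f} GB")
```

The point of the exercise: the weights are only part of the story, and a long context can quietly add gigabytes of cache on top of whatever the quant size suggests.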
The E4B and E2B are where things get interesting, and even their names need a bit of explanation. Google brands them as Effective 4B and Effective 2B models rather than leading with the full parameter counts, because they're optimized around how much of the model is meaningfully in play rather than the biggest headline number. Both support native audio input on top of text and vision, which the larger models don't. It's not just for transcription either, as these models handle speech recognition and audio understanding natively, on-device, with no cloud round-trip. They run with 128K context windows, and the E2B is optimized for speed, delivering roughly three times faster inference than the E4B while using 60% less battery on mobile hardware.
Google is being unusually specific about where these are meant to be deployed, showing viable deployment paths through AI Edge Gallery and LiteRT-LM across Android, iOS, desktop, web, and even edge devices like the Raspberry Pi.
They may feel like afterthought releases akin to the 270M or 1B models of the Gemma 3 family, but they certainly aren't. They're fully multimodal models that can do OCR, chart understanding, document parsing, and speech recognition on a phone with no internet connection. If you've ever wanted to build a local voice assistant that actually understands context, the E2B and E4B make that practical in a way that previous edge models didn't.
The 26B MoE is the one I keep using
It's incredibly fast
I've been running the 26B-A4B more than the rest of the models here, and for good reason. At Q4_K_M quantization, the model fits comfortably in about 18GB, and even the Q8 version leaves the PGX with memory to spare. And because the MoE architecture only routes to 3.8 billion active parameters per token, generation speed is noticeably faster than running the dense 31B, as the memory bandwidth of the PGX is by far its biggest weak point.
To illustrate the problem, the PGX's memory bandwidth comes in at 273 GB/s, which is decent, but Apple's M3 Ultra pushes roughly 800 GB/s. Since token generation speed in local inference is almost entirely memory-bandwidth-bound, that gap shows up in dense models like the 31B variant. The PGX compensates with its 6,144 CUDA cores and the broader Nvidia software stack, which matters if you're doing fine-tuning or quantization alongside inference. However, if raw tokens-per-second on a single model is all you care about, Apple Silicon still has an edge in bandwidth when it comes to its higher-tier SKUs, and so do most dedicated GPUs.
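Since decode is memory-bandwidth-bound, you can sketch the ceiling with simple arithmetic: every generated token has to stream the active weights from memory once, so tokens per second is roughly usable bandwidth divided by active weight bytes. The 70% efficiency factor and the bits-per-weight figure here are my assumptions, not measured values.

```python
# Rough bandwidth-bound decode estimate:
#   tokens/s ≈ (bandwidth × efficiency) / bytes of *active* weights.
# Efficiency and bits-per-weight are assumptions for illustration.

def decode_tps(bandwidth_gbs: float, active_params_b: float,
               bits_per_weight: float, efficiency: float = 0.7) -> float:
    active_gb = active_params_b * bits_per_weight / 8
    return bandwidth_gbs * efficiency / active_gb

# ThinkStation PGX: ~273 GB/s of unified memory bandwidth.
print(f"31B dense @ ~4.5 bpw:   {decode_tps(273, 31, 4.5):.0f} tok/s")
print(f"26B-A4B (3.8B active):  {decode_tps(273, 3.8, 4.5):.0f} tok/s")
```

The dense estimate lands close to the roughly 11 tokens per second I actually see from the 31B on this box, while the MoE's 3.8B active parameters push the ceiling several times higher. That's the whole appeal of the A4B in one equation.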
The quality difference between the 26B and the 31B is marginal for most of what I throw at them. For local tool calling and research, smart home voice assistants, general reasoning, and more, the 26B-A4B model handles it without much fuss. Where the 31B would pull ahead is on harder coding tasks and complex multi-step reasoning, but those are also the tasks where I'm more likely to reach for Qwen3 Coder Next or a cloud model anyway. For day-to-day local work, the 26B-A4B hits a balance that I haven't found with other MoE models: it feels fast without feeling shallow.
The 256K context window helps too. I can pass entire repositories or long documents in a single prompt, where older local models would have needed careful chunking. Once you stop worrying about whether your input fits, it's a lot easier to be carefree in how you use your local model and what you use it for.
Tool calling that's baked into the architecture
This model is great for practical uses
One of the things that has held back local LLMs in agentic workflows is tool calling. Anyone who has spent time with open-weight models has probably watched them struggle to call tools consistently and effectively. Gemma 4 handles it reliably, as Google baked function calling directly into the model's architecture with six dedicated special tokens for tool declarations, function calls, and responses.
You can define tools as JSON schemas or even raw Python functions with type hints, and the model will generate structured function call objects that you can parse and execute. You can also enable the model's internal reasoning process alongside tool calling, which helps it decide when to trigger a tool and how to structure the parameters. I've been playing around with this for use with my SearXNG MCP server and even for coding harnesses such as Pi. Sometimes, smaller models still struggle to know when to call a tool versus just answering directly, but the 26B-A4B model that I've been using doesn't have that problem.
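The application-side half of that loop is easy to sketch. The snippet below derives a JSON-schema tool declaration from a type-hinted Python function and dispatches a structured call object back to it; the model-side special-token format isn't reproduced here, and the `search_web` function and the example call dict are hypothetical stand-ins, not anything from Google's documentation.

```python
import inspect
import json
from typing import get_type_hints

# Map Python annotation types to JSON-schema type names.
PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def declare_tool(fn):
    """Build a JSON-schema tool declaration from a type-hinted function."""
    hints = get_type_hints(fn)
    hints.pop("return", None)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {
            "type": "object",
            "properties": {n: {"type": PY_TO_JSON[t]} for n, t in hints.items()},
            "required": list(hints),
        },
    }

def dispatch(call: dict, registry: dict):
    """Execute a structured function-call object emitted by the model."""
    return registry[call["name"]](**call["arguments"])

def search_web(query: str, max_results: int) -> str:
    """Search the web and return result snippets."""
    return f"{max_results} results for {query!r}"  # stub for illustration

print(json.dumps(declare_tool(search_web), indent=2))

# Pretend the model emitted this structured call:
call = {"name": "search_web",
        "arguments": {"query": "Gemma 4 MoE", "max_results": 3}}
print(dispatch(call, {"search_web": search_web}))
```

In a real loop, the declaration goes into the prompt, the model's call object gets parsed out of its dedicated tokens, and the tool's return value is fed back for the next turn.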
The E2B and E4B support function calling too. You could have a local model on your phone calling tools and querying APIs without any cloud dependency whatsoever. If you pair that with the native audio input on the edge models, you've got the building blocks for a voice-controlled agent running entirely on a phone, and that's exactly what people have been doing. It supports "native skills" as part of Google's AI Edge Gallery, so that the LLM can execute actual commands on your device.
Speculative decoding gives the 31B a speed boost
Though Google left performance on the table
Google trained the Gemma 4 models with Multi-Token Prediction heads, which would normally allow the model to predict several tokens at once and dramatically speed up inference. But when they published the weights to Hugging Face, those MTP heads were stripped out. They're only available through Google's proprietary LiteRT framework. Without them, the 31B runs at roughly 11 tokens per second on the PGX where models like Qwen3 Coder Next, more than twice its size, can run at 50 or more. That's a problem.
MTP matters because the prediction heads share the model's internal representations, so they produce higher-quality draft tokens than a separate draft model would. DeepSeek V3 demonstrated acceptance rates of about 2.4 tokens per verification step using this approach, which translates to an almost two-times speedup with zero quality loss. Google trained Gemma 4 with exactly this capability, then chose not to ship it. It's hard not to read that as a deliberate decision to keep the best inference performance locked to their own framework.
Community workarounds are already appearing for this, though. A team trained an EAGLE3 draft head for the 31B that adds just 277MB to the model's footprint and achieves an almost-two-times speedup on conversational benchmarks, with trained acceptance rates between 0.75 and 0.82. It works by conditioning on the target model's hidden states from three points in the network (early, middle, and late layers) and proposing multiple tokens that the full model verifies in a single forward pass. Because it's verification-based, the output is mathematically identical to standard generation. Getting it to work wasn't straightforward either, as Gemma 4's hybrid attention architecture uses 50 sliding-window layers and 10 global layers with different head counts. This meant the team had to fix multiple bugs in the serving stack that standard transformer models never trigger.
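Those acceptance rates translate to speedup via a standard bit of arithmetic: if each of the k drafted tokens is accepted independently with probability alpha (a simplification, since real acceptance is position-dependent), the expected tokens generated per target forward pass is a truncated geometric sum.

```python
# Expected tokens per target-model forward pass in speculative decoding,
# assuming independent per-token acceptance probability `alpha` and `k`
# drafted tokens per step (a standard simplification).

def tokens_per_step(alpha: float, k: int) -> float:
    # One guaranteed token from verification plus the chain of accepts:
    # sum_{i=0..k} alpha^i = (1 - alpha^(k+1)) / (1 - alpha)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.75, 0.82):
    print(f"alpha={alpha}: {tokens_per_step(alpha, 3):.2f} tokens/step")
```

With three drafted tokens per step, acceptance rates of 0.75 to 0.82 give roughly 2.7 to 3 tokens per verification pass before overhead, which is consistent with the almost-two-times wall-clock speedup once the draft head's own cost is subtracted.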
You can also use traditional speculative decoding by pairing one of the smaller Gemma 4 models as a draft model with the 31B as the target. Because they share the same tokenizer and training pipeline, the acceptance rate is solid. If you're already running the E2B for edge tasks, repurposing it as a draft model for the 31B is essentially free. If you're running the 31B on something like the ThinkStation PGX or another bandwidth-constrained device and you care about speed, speculative decoding is practically required. Tooling in vLLM and SGLang is also catching up quickly, and llama.cpp supports it as well, though it took some time to get right.
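In llama.cpp terms, that pairing is a single server invocation. This is a hypothetical sketch: the GGUF filenames are placeholders, and the draft-model flags (`-md`, `--draft-max`, `-ngld`) reflect recent llama.cpp builds, so check your version's `--help` output.

```shell
# Hypothetical llama-server launch pairing the E2B as a draft model for
# the 31B target. Filenames are placeholders; flag names are from recent
# llama.cpp builds and may differ in older versions.
llama-server \
  -m  gemma-4-31b-Q4_K_M.gguf \
  -md gemma-4-e2b-Q8_0.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 8 --draft-min 1 \
  -c 32768 --port 8080
```

The draft model needs its own layers offloaded (`-ngld`) or it becomes the bottleneck it's meant to remove.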
Google's decision to withhold the MTP heads is frustrating, and it does undercut the "open" in open-weight a bit. But even without them, the Gemma 4 family is the most complete set of local models I've used. The 26B-A4B handles most of what I need at a speed that feels natural, the edge models open up use cases that weren't practical before, and the 31B is there when I need it, even if it needs a little help to get up to speed. For a home lab that's already set up for local inference, this is the model family to start with.
FAQ
What are the four Gemma 4 models and how do they differ?
Why is the 26B-A4B model the author's favorite?
What tool-calling capabilities do Gemma 4 models have?