
Gemma 4: Everything You Need to Know About Google's Most Capable Open Model

April 5, 2026 · 12 min read · By T.W. Ghost
Gemma 4 · Google · Open Source · MoE · Edge AI · Local AI · Apache 2.0 · Ollama

Google Just Changed the Open Model Game

On April 2, 2026, Google DeepMind released Gemma 4. Four models. Apache 2.0 license. Built from the same research behind Gemini 3. And for the first time, a Google model family that runs on your phone, your laptop, and your server with no licensing restrictions.

That last part matters more than the benchmarks.

Previous Gemma releases shipped under a custom license that let Google update usage terms unilaterally, required developers to enforce those terms across derivative projects, and created enough legal ambiguity that corporate legal teams routinely flagged it. Gemma 4 ships under standard Apache 2.0. No custom clauses. No prohibited-use carve-outs. No restrictions on commercial deployment or redistribution. The same license used by Qwen, Mistral, and most of the open-weight ecosystem.

But the license is just the door. What is behind it is a model family engineered for the agentic era, with native tool use, hybrid attention for 256K context, and a mixture-of-experts architecture that activates only 3.8 billion parameters per token while delivering near-frontier intelligence.

Here is everything you need to know.


The Model Family

Gemma 4 is not one model. It is four, each designed for a different deployment target.

| Model | Total Params | Active Params | Type | Context Window | Target |
|---|---|---|---|---|---|
| 31B | 31B | 31B | Dense | 256K | Workstations, servers |
| 26B A4B | 25.2B | 3.8B | MoE | 256K | Consumer GPUs |
| E4B | 7.9B | 4.5B effective | Dense + PLE | 128K | Laptops, edge devices |
| E2B | 5.1B | 2.3B effective | Dense + PLE | 128K | Phones, IoT |

The naming tells you the story. "31B" is the powerhouse. "26B A4B" means 26 billion total parameters but only 4 billion active. "E4B" and "E2B" are effective parameter counts for edge deployment.

Let us break each one down.


31B Dense: The Flagship

The 31B is a straightforward dense transformer. Every parameter fires on every token. No routing, no sparsity, no tricks. This is Google saying "here is the best quality we can give you at this size."

Benchmarks:

| Benchmark | Gemma 4 31B | What It Measures |
|---|---|---|
| MMLU Pro | 85.2% | Broad knowledge |
| GPQA Diamond | 84.3% | Graduate-level science |
| AIME 2026 | 89.2% | Competition math |
| LiveCodeBench v6 | 80.0% | Real-world coding |
| Codeforces ELO | 2150 | Competitive programming |
| t2-bench (Agentic) | 86.4% | Multi-step agent tasks |

The 256K context window uses a hybrid attention mechanism that alternates between local sliding-window attention (512-1024 tokens) and global full-context attention placed less frequently. Local attention keeps per-token compute linear. Global attention handles long-range dependencies. The combination lets you analyze entire codebases or run multi-turn agentic conversations without running out of context.
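The local/global alternation is easy to picture as a pair of attention masks. A minimal sketch, assuming a 512-token window and an illustrative one-global-in-six layer schedule (the exact ratio is not stated here):

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key position j.

    window=None -> global causal attention over the full context.
    window=W    -> local sliding-window causal attention over the last W tokens.
    """
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    causal = j <= i
    if window is None:
        return causal
    return causal & (j > i - window)

# Hypothetical layer schedule: mostly local layers with an occasional global
# layer. The local:global ratio here is an assumption for illustration.
def layer_schedule(n_layers, global_every=6, window=512):
    return [None if (layer + 1) % global_every == 0 else window
            for layer in range(n_layers)]

local_positions = attention_mask(8, window=4).sum()   # linear in seq length
global_positions = attention_mask(8).sum()            # quadratic in seq length
print(local_positions, global_positions)
```

The local mask caps each query at `window` keys, which is what keeps per-token compute linear as context grows; the sparse global layers are what stitch distant parts of the context together.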

Who should use it: Anyone who wants maximum quality and has a 24GB+ GPU or Apple Silicon Mac. This is the model you run when accuracy matters more than speed.


26B A4B: The Engineering Marvel

This is the model that will get the most attention, and it deserves it.

The 26B A4B is a Mixture of Experts model with 128 small experts. On each token, a router selects 8 experts plus 1 shared always-on expert to process the input. The shared expert is 3x the size of individual experts and holds general knowledge that should always be active. The routed experts contain specialized knowledge activated on demand.

The result: only 3.8 billion parameters fire per forward pass, but the model has access to 25.2 billion parameters worth of knowledge. It achieves roughly 97% of the dense 31B model's quality at a fraction of the compute.

How Gemma's MoE Differs from DeepSeek and Qwen:

Most MoE implementations replace the MLP (feedforward) blocks in transformer layers with sparse expert blocks. Gemma does something different. It adds MoE blocks as separate layers alongside the standard MLP blocks and sums their outputs. The MLP layer always runs. The expert layer adds on top of it. This trades some efficiency for architectural simplicity and training stability.
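The side-by-side arrangement can be sketched in a few lines. This is a toy illustration, not Gemma's implementation: the widths, activation, and scaling are invented, but the routing shape (top-8 of 128 plus an always-on shared expert roughly 3x an individual expert) and the summed outputs match the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 128, 8   # toy width; routing shape matches the text

# Hypothetical weights: one dense MLP, 128 small routed experts, one shared
# expert about 3x the size of a routed expert, and a linear router.
mlp_w      = rng.normal(size=(d, d)) / np.sqrt(d)
experts    = rng.normal(size=(n_experts, d, d)) / np.sqrt(d)
shared_in  = rng.normal(size=(d, 3 * d)) / np.sqrt(d)
shared_out = rng.normal(size=(3 * d, d)) / np.sqrt(3 * d)
router_w   = rng.normal(size=(d, n_experts)) / np.sqrt(d)

def gemma_style_block(x):
    """Dense MLP and MoE layer run side by side; their outputs are summed."""
    mlp_out = np.tanh(x @ mlp_w)                   # standard MLP always runs
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]              # router picks top-8 experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                             # softmax over chosen experts
    routed = sum(g * np.tanh(x @ experts[e]) for g, e in zip(gate, top))
    shared = np.tanh(np.tanh(x @ shared_in) @ shared_out)  # always-on expert
    return mlp_out + routed + shared               # added on top, not replacing

y = gemma_style_block(rng.normal(size=d))
print(y.shape)
```

Note that only 9 of the 128 expert matrices are touched per token, while the dense `mlp_w` path runs unconditionally, which is the stability-for-efficiency trade described above.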

Benchmarks:

| Benchmark | 26B A4B | 31B Dense | Gap |
|---|---|---|---|
| MMLU Pro | 82.6% | 85.2% | -2.6% |
| GPQA Diamond | 82.3% | 84.3% | -2.0% |
| AIME 2026 | 88.3% | 89.2% | -0.9% |
| LiveCodeBench v6 | 77.1% | 80.0% | -2.9% |

Less than 3 percentage points behind the dense model on most benchmarks, while activating roughly one-eighth as many parameters per token. That is the MoE value proposition.

The catch: Despite only 3.8B active parameters, all 25.2B must be loaded into VRAM. The router needs access to every expert's weights to decide which ones to activate. You save compute, not memory. An important distinction for hardware planning.
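A rough back-of-envelope makes the distinction concrete. This uses the standard ~2 FLOPs per parameter per token estimate and an idealized 4 bits per weight; real Q4_K_M files run somewhat larger (the table below lists ~18GB) because of embeddings and mixed-precision layers.

```python
total_params  = 25.2e9   # every expert must be resident for the router
active_params = 3.8e9    # parameters that actually fire per token

# Memory: you load everything, regardless of sparsity.
weights_gb = total_params * 0.5 / 1e9          # ~4-bit weights, 0.5 bytes each
print(f"weights alone: ~{weights_gb:.1f} GB (plus KV cache and overhead)")

# Compute: forward-pass FLOPs track *active* parameters (~2 FLOPs/param/token).
moe_flops   = 2 * active_params
dense_flops = 2 * 31e9                         # the 31B dense flagship
print(f"per-token compute vs 31B dense: {moe_flops / dense_flops:.2f}x")
```

So VRAM planning follows the 25.2B figure, while latency and throughput follow the 3.8B figure.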

Who should use it: Developers who want near-frontier quality with faster inference. Ideal for agentic workflows where you need many calls and cannot afford the latency of a full dense model.


E4B and E2B: AI on Your Phone

The edge models are where Gemma 4 gets genuinely interesting for a different audience.

Both E4B and E2B use Per-Layer Embeddings (PLE), a technique where each decoder layer gets its own small embedding table for every token. These tables are large (explaining the gap between total and effective parameter counts) but only need quick lookups, not active computation. They can be cached for faster inference and reduced memory usage on constrained devices.
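The key property is that PLE access is a table lookup, not a matrix multiply. A toy sketch with invented sizes (real tables are far larger):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_ple, n_layers = 1000, 16, 4   # toy sizes for illustration only

# One small embedding table per decoder layer. Because access is a pure row
# lookup, these tables can sit in slower memory or flash and be cached,
# instead of occupying accelerator memory like ordinary weights.
ple_tables = [rng.normal(size=(vocab, d_ple)) for _ in range(n_layers)]

def ple_lookup(layer, token_ids):
    """Per-layer embeddings for the given tokens: an index, not a computation."""
    return ple_tables[layer][token_ids]

h = ple_lookup(0, np.array([3, 42, 7]))   # injected at layer 0 alongside activations
print(h.shape)
```

This is why the "total" and "effective" parameter counts diverge: the PLE tables count toward total size but barely count toward per-token compute.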

Multimodal Capabilities

The edge models are not text-only. Both include dedicated encoders:

  • Vision encoder (~150M params): Object detection, document and PDF parsing, screen and UI understanding, chart comprehension, multilingual OCR, handwriting recognition, and pointing
  • Audio encoder (~300M params): Automatic speech recognition, speech-to-translated-text across multiple languages

An E2B model with 2.3 billion effective parameters can see, hear, and respond in over 140 languages. On a phone.

Edge Benchmarks

| Benchmark | E4B | E2B |
|---|---|---|
| AIME 2026 | 42.5% | 37.5% |
| LiveCodeBench | 52.0% | 44.0% |

These numbers look modest next to the 31B. But consider: the E2B with 2.3B effective parameters beats the previous Gemma 3 27B on most benchmarks. That is a generational leap in intelligence-per-parameter.

Supported Edge Hardware

  • Android phones via Google AI Edge Gallery
  • NVIDIA Jetson (Nano through Thor)
  • Raspberry Pi-class devices
  • Any device with 4-8GB RAM for quantized E2B
  • Laptops with CPU-only inference (E2B)
  • NVIDIA T4 GPU (E4B runs comfortably)

Who should use them: Mobile developers, IoT builders, anyone who needs AI inference without a cloud connection. The combination of vision, audio, and 140 languages in a 2B model is unprecedented.


Native Tool Use and Function Calling

Every Gemma 4 model, from E2B to 31B, supports native function calling. This is not prompt-engineered tool use. It is built into the model architecture.

Define a function schema, and Gemma 4 returns valid JSON matching that schema. No special prompting required. No output parsing. The model understands it is supposed to call tools and does so reliably.

This matters for agentic workflows. When a model needs to plan multi-step actions, query APIs, navigate apps, or execute code, native tool use is the difference between reliable automation and fragile string matching.
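The workflow looks roughly like this. The schema below is a hypothetical example in the common function-calling format, and the model reply is stubbed by hand for illustration; a real reply would come from a runtime serving Gemma 4, such as Ollama.

```python
import json

# Hypothetical tool schema in the common function-calling format.
get_weather = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Stubbed model reply for illustration. The point: it is plain JSON that
# matches the schema, so no regex or fragile string matching is needed.
raw_reply = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'

call = json.loads(raw_reply)
assert call["name"] == get_weather["name"]
missing = set(get_weather["parameters"]["required"]) - set(call["arguments"])
print("missing args:", missing or "none")
```

Because the output is guaranteed-parseable JSON, the dispatch layer of an agent reduces to `json.loads` plus a schema check.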

Google specifically calls out:

  • Multi-step planning: The model can reason about which tools to call and in what order
  • Structured JSON output: Guaranteed schema compliance
  • System instructions: Native support for configuring agent behavior
  • Android integration: Tool calling powers Agent Skills in Google AI Edge Gallery

For builders already working with function-calling models like Claude or GPT-4o, Gemma 4 brings the same capability to local, self-hosted deployments with zero API costs.



How It Compares

vs. Qwen 3.5 (27B)

Qwen 3.5 and Gemma 4 are the closest competitors in the open model space right now.

| | Gemma 4 31B | Qwen 3.5 27B |
|---|---|---|
| AIME 2026 | 89.2% | ~85% |
| MMLU Pro | 85.2% | 86.1% |
| GPQA Diamond | 84.3% | 85.5% |
| License | Apache 2.0 | Apache 2.0 |
| Token efficiency | ~2.5x fewer tokens | Baseline |

Gemma wins on math and reasoning. Qwen edges ahead on knowledge benchmarks. Both are Apache 2.0. The real differentiator: Gemma uses approximately 2.5x fewer output tokens for similar tasks. Fewer tokens means faster responses and lower compute costs in production.

However, Qwen currently wins on speed. The 26B MoE runs at roughly 11 tokens/second on an RTX 4090. Qwen 3.5 35B-A3B hits 60+ tok/s on the same hardware. That is a significant gap for real-time applications.

vs. Llama 4 Scout (109B / 17B active)

Llama 4 Scout has more total parameters but Gemma 4 31B generally leads on reasoning benchmarks despite being much smaller. And Llama 4 ships under a custom community license with a 700 million monthly active user cap. For commercial deployment, Gemma's Apache 2.0 is cleaner.

vs. Phi-4 (14B)

Different weight class. Phi-4 is exceptional for its size (80.4% on MATH), but Gemma 4 31B operates at a higher tier overall. Phi-4 is the better choice if you are constrained to 14B parameters. Otherwise, Gemma 4 offers more headroom.

Multilingual Edge

Early community testing consistently reports Gemma 4 as "in a tier of its own" for non-English tasks. German, Arabic, Vietnamese, French, and other languages show notably better performance than Qwen 3.5 and Llama 4. If your deployment serves a global audience, this matters.


Running It Locally

Hardware Requirements

26B A4B (MoE):

| Quantization | Download Size | Min VRAM | Notes |
|---|---|---|---|
| Q4_K_M | ~18GB | 16GB | RTX 5060 Ti 16GB, RTX 4090 |
| Q8_0 | ~28GB | 32GB | RTX A6000 or dual GPU |
| BF16 (full) | ~52GB | 48GB+ | A100, H100 |

31B Dense:

| Quantization | Download Size | Min VRAM | Notes |
|---|---|---|---|
| Q4_K_M | ~20GB | 24GB | RTX 3090/4090 (45K context max) |
| Q8_0 | ~33GB | 40GB+ | 48GB workstation GPU |
| BF16 (full) | ~62GB | 64GB+ | Apple Silicon ideal |

Edge Models:

| Model | Q4_K_M | Q8_0 |
|---|---|---|
| E2B | 7.2GB | 8.1GB |
| E4B | 9.6GB | 12GB |

System RAM rule of thumb: You need roughly 2x the model size in system RAM for weights plus working space.
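Applied to the download sizes above, the rule works out to:

```python
def min_system_ram_gb(model_file_gb):
    """~2x rule of thumb: weights plus working space (OS, cache, buffers)."""
    return 2 * model_file_gb

for name, size_gb in [("26B A4B Q4_K_M", 18), ("31B Q4_K_M", 20),
                      ("E2B Q4_K_M", 7.2)]:
    print(f"{name}: ~{min_system_ram_gb(size_gb):.0f} GB system RAM")
```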

Apple Silicon note: Gemma 4 is well-optimized for unified memory. An M2 Ultra with 128GB can run the 31B with large context windows more comfortably than most PC GPU setups.

Ollama Setup

```bash
# Pull and run
ollama run gemma4          # Default: E4B
ollama run gemma4:e2b      # Smallest
ollama run gemma4:26b      # MoE variant
ollama run gemma4:31b      # Dense flagship

# Specific quantizations
ollama pull gemma4:26b-a4b-it-q4_K_M
ollama pull gemma4:e4b-it-q8_0
ollama pull gemma4:e2b-it-q4_K_M
```

Once running, the model is accessible on localhost:11434 via the standard Ollama API, compatible with OpenAI-format clients. Connect it to n8n, Claude Code, or any tool that speaks the OpenAI API format.
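As a sketch, an OpenAI-format request aimed at that local endpoint might look like the following. The payload is only constructed here, not sent; the model tag and prompt are placeholders, and any HTTP client can POST it.

```python
import json

# OpenAI-compatible chat endpoint exposed by a local Ollama instance.
url = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "gemma4:26b",                 # any pulled Gemma 4 tag
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the attached log in one line."},
    ],
    "temperature": 0.2,
}

body = json.dumps(payload)                 # POST this body with any HTTP client
print(url)
```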

Other Platforms

Available on Hugging Face, Kaggle, LM Studio, Docker, and Google AI Studio for interactive testing.

For NVIDIA hardware, Gemma 4 supports NVFP4 quantization on Blackwell GPUs and runs via vLLM, llama.cpp, and NVIDIA NIM microservices. All four models fit on a single H100.


The Speed Problem (Honest Take)

The benchmarks are impressive. The architecture is clever. The license is finally right. But the community found a catch within 24 hours.

Gemma 4 is slow.

The 26B MoE hits roughly 11 tokens per second on an RTX 4090. For comparison, Qwen 3.5's equivalent MoE model runs at 60+ tok/s on the same card. The 31B dense is even slower.

Memory consumption is also higher than expected. The 26B Q4 with 20K context barely fits on an RTX 5090. Qwen 3.5 27B Q4 fits with 190K context on the same card.

And the tooling was not ready at launch. HuggingFace Transformers did not recognize the gemma4 architecture on day one. PEFT could not handle a new layer type. QLoRA fine-tuning was broken. A new mm_token_type_ids field was required even for text-only data, breaking existing pipelines.

Google shipped a great model with a rough developer experience. The quality is there. The speed, memory efficiency, and ecosystem support need work. These are solvable problems, and Google's track record suggests they will be solved. But if you are deploying today, test throughput carefully before committing.


Who Should Care

If you run local AI: Gemma 4 is now a top-tier option with no licensing risk. The 26B MoE gives you near-frontier intelligence on a consumer GPU. Test it against whatever you are running now.

If you build agents: Native tool use across all model sizes, including edge, opens up local agentic workflows that previously required cloud API calls. An E4B on a Jetson board can plan and execute multi-step tasks autonomously.

If you serve global users: The multilingual performance is genuinely ahead of the competition. 140 languages with audio and vision support on a 2B model is a capability that did not exist before this release.

If you are an enterprise: Apache 2.0 removes the last legal barrier that kept Gemma off the approved list. Deploy commercially, fine-tune, redistribute, no custom license review required.

If you need raw speed today: Wait. Or use Qwen 3.5. Gemma 4's inference performance needs optimization, and the tooling ecosystem needs a few weeks to catch up.


The Bigger Picture

Gemma 4 is not just a model release. It is Google signaling that the open model race is now about deployability, not just benchmarks.

Apache 2.0. Native tool use. Edge models with multimodal input. A MoE architecture that brings frontier intelligence to consumer hardware. These are the building blocks for a world where AI runs on your own infrastructure, on your own terms, at every scale from phone to data center.

The speed will improve. The tooling will stabilize. The model quality is already there.

If you have been waiting for Google to ship an open model you can actually use without reading a custom license, checking a prohibited-use list, or wondering if the terms might change next quarter, Gemma 4 is it.


*We run open models alongside Claude Code on a $7/month VPS. When we say "test it in production," we mean it. Browse all notes at Phantom Notes.*