This paper narrows the benchmark question to a same-family deployment problem: what changes when Gemma 4 moves from a smaller checkpoint served in BF16 on a workstation GPU to a larger checkpoint served 4-bit quantized on a lower-power GB10 system. That framing is more operationally useful than a generic model bake-off because it holds the model family constant, letting practitioners reason about the trade-off between raw speed and system efficiency in isolation.
The contribution is therefore twofold. First, the study provides a direct Gemma-to-Gemma comparison across materially different hardware envelopes. Second, it contextualizes the result with a concurrent Qwen3.5-4B run on the same GB10 system, showing that the GB10 platform is not merely slower than a workstation GPU but also operates in a far smaller power and thermal envelope, even while hosting two multimodal endpoints at once.
Gemma 4 E2B was served on an RTX 3090 workstation in BF16. Gemma 4 E4B was served on a GB10 DGX Spark system using 4-bit Unsloth quantization. The benchmark was initiated from a Mac client over HTTP, with GPU telemetry sampled every 5 seconds. The reported suite included basic Q&A, reasoning, coding, multilingual prompting, summarization, image description, color identification, transcription, and audio question answering.
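Since the run is driven from a client over HTTP, the reported per-task latency and throughput can be derived on the client side. A minimal sketch, assuming a blocking request hook; the `call` and `tokens_of` parameters are hypothetical placeholders standing in for the actual client and tokenizer, which the report does not specify:

```python
import time
from typing import Callable

def timed_request(call: Callable[[], str],
                  tokens_of: Callable[[str], int]) -> dict:
    """Time one generation request and derive latency and throughput.

    `call` performs the HTTP round trip (hypothetical hook; any client
    works) and `tokens_of` counts generated tokens in the response text.
    """
    start = time.perf_counter()
    text = call()                              # blocking round trip
    latency_s = time.perf_counter() - start
    n_tokens = tokens_of(text)
    return {
        "latency_s": latency_s,
        "tok_per_s": n_tokens / latency_s if latency_s > 0 else 0.0,
    }
```

Averaging these per-request records over the nine tasks yields the summary figures reported below.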
| Item | Gemma 4 E2B | Gemma 4 E4B / Unsloth | Qwen3.5-4B reference |
|---|---|---|---|
| Primary deployment | RTX 3090 workstation | GB10 DGX Spark | GB10 DGX Spark |
| Precision mode | BF16 | bnb-4bit via Unsloth | BF16 |
| Modalities in run | Text, image, audio, video | Text, image, audio, video | Text, image |
| Benchmark coverage | 9/9 tasks | 9/9 tasks | 7/7 tasks |
| Observed average throughput | 48.5 tok/s | 11.6 tok/s | 12.4 tok/s |
| Observed average latency | 2.87s | 9.33s | 15.0s |
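The power and temperature figures reported later come from GPU telemetry sampled every 5 seconds. A sketch of such a sampler, assuming an NVIDIA-style `nvidia-smi` query on the RTX 3090 host; the 5-second cadence matches the run description, while the specific query fields are an assumption:

```python
import subprocess
import time

def parse_sample(line: str) -> tuple[float, float]:
    """Parse one 'power.draw, temperature.gpu' CSV line into (watts, celsius)."""
    power, temp = (float(field) for field in line.split(","))
    return power, temp

def summarize(samples: list[tuple[float, float]]) -> dict:
    """Reduce raw samples to the average/peak figures used in the report."""
    powers = [p for p, _ in samples]
    temps = [t for _, t in samples]
    return {
        "avg_power_w": sum(powers) / len(powers),
        "peak_power_w": max(powers),
        "avg_temp_c": sum(temps) / len(temps),
        "peak_temp_c": max(temps),
    }

def poll(n_samples: int, interval_s: float = 5.0) -> dict:
    """Sample GPU telemetry every `interval_s` seconds, then summarize."""
    samples = []
    for _ in range(n_samples):
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=power.draw,temperature.gpu",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        samples.append(parse_sample(out.strip().splitlines()[0]))
        time.sleep(interval_s)
    return summarize(samples)
```

On the GB10 side the equivalent counters are read at the system level, which is why its power figures are shared-system totals.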
The benchmark compares two very different operating envelopes. The RTX 3090 offers far higher instantaneous throughput and dedicated VRAM, while the GB10 system emphasizes compactness, unified memory, and lower energy draw. Because the E4B system is also quantized, the study is best read as a deployment comparison rather than a pure architectural comparison between two unmodified checkpoints.
| Platform metric | RTX 3090 host | GB10 DGX Spark |
|---|---|---|
| GPU class | Ampere discrete GPU | Grace Blackwell GB10 SoC |
| Memory model | 24 GB dedicated GDDR6X VRAM | 128 GB unified memory |
| Cooling profile | Air cooled | Liquid cooled |
| Observed average power in run | 77.6 W for E2B | 27.7 W total for E4B + Qwen |
| Observed peak temperature | 53 C | 48 C |
Gemma 4 E2B leads every major throughput category in this dataset. Its best text results sit near 50 tok/s across reasoning, coding, multilingual prompting, and summarization. The E4B/Unsloth deployment stays in the 11.5 to 13.0 tok/s range on every task except basic Q&A, where it dropped to 4.8 tok/s. The same-family comparison therefore indicates that the throughput penalty from the lower-power quantized deployment is approximately 4x on the reported averages.
| Task | Gemma 4 E2B | Gemma 4 E4B / Unsloth | Qwen3.5-4B |
|---|---|---|---|
| Basic Q&A | 40.7 tok/s | 4.8 tok/s | 12.4 tok/s |
| Reasoning | 50.2 tok/s | 12.9 tok/s | 7.3 tok/s |
| Coding | 50.5 tok/s | 13.0 tok/s | 7.3 tok/s |
| Multilingual | 50.7 tok/s | 12.9 tok/s | 8.5 tok/s |
| Summarization | 50.2 tok/s | 12.7 tok/s | 17.0 tok/s |
| Image description | 49.4 tok/s | 12.4 tok/s | 17.5 tok/s |
| Color identification | 47.2 tok/s | 12.3 tok/s | 17.1 tok/s |
| Transcription | 47.9 tok/s | 11.7 tok/s | Not applicable |
| Audio Q&A | 49.9 tok/s | 11.5 tok/s | Not applicable |
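The ~4x figure can be checked directly by dividing the E2B throughput by the E4B/Unsloth throughput for each task in the table above; this is a back-of-the-envelope verification using only the reported numbers:

```python
# Throughput in tok/s, taken from the per-task table.
e2b = {"reasoning": 50.2, "coding": 50.5, "multilingual": 50.7,
       "summarization": 50.2, "image description": 49.4}
e4b = {"reasoning": 12.9, "coding": 13.0, "multilingual": 12.9,
       "summarization": 12.7, "image description": 12.4}

# Slowdown of the 4-bit GB10 deployment relative to BF16 on the RTX 3090.
slowdown = {task: e2b[task] / e4b[task] for task in e2b}

# Overall ratio from the reported averages: 48.5 / 11.6 ≈ 4.2x.
overall = 48.5 / 11.6
```

Every listed task lands close to 4x; basic Q&A (40.7 vs 4.8 tok/s) is the one outlier with a much larger ratio.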
The throughput gap is mirrored by a latency gap. Gemma 4 E2B averages 2.87 seconds across the full benchmark, while E4B / Unsloth averages 9.33 seconds. Despite that difference, both Gemma deployments passed all nine benchmark tasks, which indicates that the quantized GB10 setup preserves functionality even when it sacrifices responsiveness.
| Metric | Gemma 4 E2B | Gemma 4 E4B / Unsloth | Qwen3.5-4B |
|---|---|---|---|
| Average latency | 2.87s | 9.33s | 15.0s |
| Pass rate | 9 / 9 | 9 / 9 | 7 / 7 |
| Audio support in run | Supported | Supported | Not supported |
| Video support in run | Supported | Supported | Not supported |
The efficiency result is the most important systems-level counterweight to the RTX throughput lead. The RTX 3090 run averaged 77.6 W and peaked at 219.8 W for a single model. The GB10 system averaged 27.7 W and peaked at 31 W while concurrently hosting both the E4B / Unsloth endpoint and the Qwen reference endpoint. That is a materially smaller operational envelope.
| Metric | Gemma 4 E2B | Gemma 4 E4B / Unsloth |
|---|---|---|
| Average power draw | 77.6 W | ~27.7 W shared-system total |
| Peak power draw | 219.8 W | 31.0 W shared-system total |
| Average temperature | 46 C | 44.5 C |
| Peak temperature | 53 C | 48 C |
The same-family result supports a clean deployment split. E2B defines the throughput frontier within Gemma 4 for this benchmark. E4B / Unsloth defines the compact-efficiency frontier by preserving full pass rate and multimodal coverage at a much smaller power and thermal envelope. Qwen3.5-4B further reinforces that the GB10 platform is useful for efficient text-and-vision serving even when it does not match workstation-class generation speed.
| Deployment regime | Preferred model | Why it stays on the frontier |
|---|---|---|
| Interactive multimodal workstation | Gemma 4 E2B | Highest throughput and lowest average latency in the benchmark |
| Low-power always-on local serving | Gemma 4 E4B / Unsloth | Maintains 9/9 pass rate with far smaller observed system power |
| Text-and-vision efficiency reference | Qwen3.5-4B on GB10 | Shows the same GB10 hardware can remain useful for efficient multi-model hosting |
The benchmark is not presented as a large human-rated quality study, but the reported outputs show that both Gemma variants completed the same multimodal task set successfully. The practical difference is therefore not whether E4B / Unsloth can answer, but how long the user waits and what hardware envelope must be provisioned to obtain that answer.
Because both compared Gemma checkpoints pass the full task set, the result reads less like a capability gap and more like a deployment tuning decision. BF16 on RTX 3090 buys responsiveness. 4-bit Unsloth on GB10 buys efficiency and system compactness.
The comparison is intentionally deployment-oriented and therefore combines multiple differences at once: model size, hardware platform, and quantization method. That means the paper should not be interpreted as a pure architectural comparison between E2B and E4B alone. The power numbers on GB10 are also shared-system figures because two model services were active simultaneously. In addition, quality is summarized as task success rather than large-scale blind evaluation.
The benchmark supports a straightforward conclusion. Gemma 4 E2B on RTX 3090 is the better operator choice when speed, responsiveness, and sustained multimodal throughput matter most. Gemma 4 E4B / Unsloth on GB10 is the better choice when power draw, thermal stability, and a compact always-on local deployment are the main constraints. The more useful claim is therefore selective rather than absolute: the Gemma family spans two different deployment frontiers, and both remain practically relevant.