Gemma 4 E2B vs Gemma 4 E4B / Unsloth: A Same-Family Cross-Platform Benchmark Across RTX 3090 and GB10 DGX Spark
A Research Benchmark by DLYog Lab
Tarun Chawdhury · Mousumi Chawdhury
DLYog Lab Research Services LLC
April 2026
Research Preview v1

Abstract

Background. Lightweight open multimodal models are increasingly deployed under hardware constraints that make architecture family, precision mode, and power envelope matter as much as raw answer quality. Same-family comparisons are especially useful because they isolate deployment effects from large cross-architecture differences.

Methods. We benchmarked Gemma 4 E2B on an NVIDIA RTX 3090 in BF16 and Gemma 4 E4B on an NVIDIA GB10 DGX Spark system using 4-bit Unsloth quantization. The benchmark covered five text tasks, two vision tasks, and two audio tasks. Qwen3.5-4B was run in parallel on the GB10 platform as an efficiency reference for the text-and-vision subset. We recorded tokens per second, end-to-end latency, pass rate, power draw, temperature, and deployment notes.

Results. Gemma 4 E2B delivered materially higher raw throughput, averaging 48.5 tok/s versus 11.6 tok/s for Gemma 4 E4B / Unsloth. Both Gemma variants passed all nine reported tests. The GB10 system maintained a much lower total observed power draw, averaging 27.7 W while simultaneously serving both the E4B/Unsloth endpoint and the Qwen reference endpoint. The RTX 3090 averaged 77.6 W for the E2B run alone. E4B therefore preserves functional multimodal coverage at a much smaller operating envelope, but with a substantial throughput penalty.

Conclusion. The comparison suggests a clear split frontier within the Gemma family. Gemma 4 E2B on RTX 3090 is the better choice when throughput and interactive responsiveness dominate. Gemma 4 E4B / Unsloth on GB10 is compelling when lower power, smaller thermal footprint, and compact always-on deployment matter more than raw generation speed.
Keywords: Gemma 4, Unsloth, quantization, GB10, DGX Spark, RTX 3090, multimodal benchmarking, deployment efficiency, edge inference, thermal analysis

1. Research Contribution

This paper narrows the benchmark question to a same-family deployment problem: what changes when Gemma 4 moves from a smaller BF16 workstation configuration to a larger 4-bit quantized deployment on a lower-power GB10 system? That framing is more operationally useful than a generic model bake-off because it helps practitioners reason about the trade-off between raw speed and system efficiency while keeping the model family constant.

The contribution is therefore twofold. First, the study provides a direct Gemma-to-Gemma comparison across materially different hardware envelopes. Second, it contextualizes the result with a concurrent Qwen3.5-4B run on the same GB10 system, showing that the GB10 platform is not merely slower than a workstation GPU but also markedly more efficient per watt in practical multimodal serving.

2. Experimental Setup

Gemma 4 E2B was served on an RTX 3090 workstation in BF16. Gemma 4 E4B was served on a GB10 DGX Spark system using 4-bit Unsloth quantization. The benchmark was initiated from a Mac client over HTTP, with GPU telemetry sampled every 5 seconds. The reported suite included basic Q&A, reasoning, coding, multilingual prompting, summarization, image description, color identification, transcription, and audio question answering.
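The per-task numbers in this setup come from wall-clock timing around each request. A minimal sketch of that measurement follows; the `generate` callable is a stand-in for the real HTTP endpoint, and whitespace splitting is only a rough proxy for the server-side token count (both are assumptions, not the lab's actual harness).

```python
import time

def measure_task(generate, prompt):
    """Time one generation call and derive end-to-end latency and throughput.

    `generate` stands in for the HTTP request to the model endpoint and is
    expected to return the completion text. Counting tokens by whitespace
    splitting is only a rough proxy for the server-side tokenizer.
    """
    start = time.perf_counter()
    output = generate(prompt)
    latency_s = time.perf_counter() - start
    tokens = len(output.split())
    return {"latency_s": latency_s,
            "tok_per_s": tokens / latency_s if latency_s > 0 else 0.0}

# Usage with a stub endpoint that returns a fixed answer.
result = measure_task(lambda p: "The capital of France is Paris.",
                      "What is the capital of France?")
```

In a real harness the latency would also fold in network round-trip time from the Mac client, which is why end-to-end latency and server-side throughput are reported as separate metrics.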

Item | Gemma 4 E2B | Gemma 4 E4B / Unsloth | Qwen3.5-4B reference
Primary deployment | RTX 3090 workstation | GB10 DGX Spark | GB10 DGX Spark
Precision mode | BF16 | bnb-4bit via Unsloth | BF16
Modalities in run | Text, image, audio, video | Text, image, audio, video | Text, image
Benchmark coverage | 9/9 tasks | 9/9 tasks | 7/7 tasks
Observed average throughput | 48.5 tok/s | 11.6 tok/s | 12.4 tok/s
Observed average latency | 2.87 s | 9.33 s | 15.0 s
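The 5-second GPU telemetry can be reproduced by polling `nvidia-smi --query-gpu=power.draw,temperature.gpu --format=csv,noheader,nounits` and summarizing the samples. The sketch below parses lines in that CSV format; the sample values are synthetic illustrations, not the run's actual telemetry.

```python
def parse_gpu_sample(line):
    """Parse one CSV line of power draw (W) and temperature (C),
    e.g. "77.60, 46" as emitted by nvidia-smi with noheader,nounits."""
    power_s, temp_s = (field.strip() for field in line.split(","))
    return float(power_s), float(temp_s)

def summarize(samples):
    """Average and peak power/temperature over a run's samples."""
    powers = [p for p, _ in samples]
    temps = [t for _, t in samples]
    return {
        "avg_power_w": sum(powers) / len(powers),
        "peak_power_w": max(powers),
        "avg_temp_c": sum(temps) / len(temps),
        "peak_temp_c": max(temps),
    }

# Synthetic example lines, as if polled every 5 seconds during a run.
samples = [parse_gpu_sample(s) for s in ["76.1, 45", "79.2, 47", "77.5, 46"]]
stats = summarize(samples)
```

In production the parser would be fed from a `subprocess` call to `nvidia-smi` inside a 5-second sleep loop; only the parsing and summarization are shown here so the sketch stays self-contained.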

3. Hardware Platform Comparison

The benchmark compares two very different operating envelopes. The RTX 3090 offers far higher instantaneous throughput and dedicated VRAM, while the GB10 system emphasizes compactness, unified memory, and lower energy draw. Because the E4B system is also quantized, the study is best read as a deployment comparison rather than a pure architectural comparison between two unmodified checkpoints.

Platform metric | RTX 3090 host | GB10 DGX Spark
GPU class | Ampere discrete GPU | Grace Blackwell GB10 SoC
Memory model | 24 GB dedicated GDDR6X VRAM | 128 GB unified memory
Cooling profile | Air cooled | Liquid cooled
Observed average power in run | 77.6 W (E2B only) | 27.7 W (E4B + Qwen total)
Observed peak temperature | 53 °C | 48 °C
Interpretive note. The GB10 measurement is a shared-system number because Gemma 4 E4B / Unsloth and Qwen3.5-4B were hosted simultaneously. Even with that caveat, the result is operationally significant: the full dual-model setup remained well below the single-model power observed in the RTX 3090 run.

4. Throughput Findings

Gemma 4 E2B leads every major throughput category in this dataset. Its best text results are near 50 tok/s across reasoning, coding, multilingual prompting, and summarization. The E4B/Unsloth deployment stays in the 11.5 to 13.0 tok/s range on most tasks. The same-family comparison therefore indicates that the throughput penalty from the lower-power quantized deployment is approximately 4x in this benchmark.

Task | Gemma 4 E2B | Gemma 4 E4B / Unsloth | Qwen3.5-4B
Basic Q&A | 40.7 tok/s | 4.8 tok/s | 12.4 tok/s
Reasoning | 50.2 tok/s | 12.9 tok/s | 7.3 tok/s
Coding | 50.5 tok/s | 13.0 tok/s | 7.3 tok/s
Multilingual | 50.7 tok/s | 12.9 tok/s | 8.5 tok/s
Summarization | 50.2 tok/s | 12.7 tok/s | 17.0 tok/s
Image description | 49.4 tok/s | 12.4 tok/s | 17.5 tok/s
Color identification | 47.2 tok/s | 12.3 tok/s | 17.1 tok/s
Transcription | 47.9 tok/s | 11.7 tok/s | Not applicable
Audio Q&A | 49.9 tok/s | 11.5 tok/s | Not applicable
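The ~4x figure can be checked directly against the per-task numbers; Basic Q&A is the one outlier, where the gap widens to roughly 8.5x. A quick arithmetic check over the text tasks, with values copied from the table above:

```python
# Per-task throughput (tok/s) for the five text tasks, from the table.
e2b = {"basic_qa": 40.7, "reasoning": 50.2, "coding": 50.5,
       "multilingual": 50.7, "summarization": 50.2}
e4b = {"basic_qa": 4.8, "reasoning": 12.9, "coding": 13.0,
       "multilingual": 12.9, "summarization": 12.7}

# Per-task slowdown of the quantized GB10 deployment relative to BF16 E2B.
ratios = {task: round(e2b[task] / e4b[task], 2) for task in e2b}

# Run-level ratio of the reported averages, the basis of the ~4x claim.
overall = 48.5 / 11.6
```

Most per-task ratios cluster just under 4x; the Basic Q&A outlier pulls a naive mean of per-task ratios above the ratio of the run-level averages, which is why the run-level figure is the more representative summary.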

5. Latency and Pass Rate

The throughput gap is mirrored by a latency gap. Gemma 4 E2B averages 2.87 seconds across the full benchmark, while E4B / Unsloth averages 9.33 seconds. Despite that difference, both Gemma deployments passed all nine benchmark tasks, which indicates that the quantized GB10 setup preserves functionality even when it sacrifices responsiveness.

Metric | Gemma 4 E2B | Gemma 4 E4B / Unsloth | Qwen3.5-4B
Average latency | 2.87 s | 9.33 s | 15.0 s
Pass rate | 9/9 | 9/9 | 7/7
Audio support in run | Supported | Supported | Not supported
Video support in run | Supported | Supported | Not supported

6. Power and Thermal Efficiency

The efficiency result is the most important systems-level counterweight to the RTX throughput lead. The RTX 3090 run averaged 77.6 W and peaked at 219.8 W for a single model. The GB10 system averaged 27.7 W and peaked at 31 W while concurrently hosting both the E4B / Unsloth endpoint and the Qwen reference endpoint. That is a materially smaller operational envelope.

Metric | Gemma 4 E2B | Gemma 4 E4B / Unsloth
Average power draw | 77.6 W | ~27.7 W shared-system total
Peak power draw | 219.8 W | 31.0 W shared-system total
Average temperature | 46 °C | 44.5 °C
Peak temperature | 53 °C | 48 °C
Operational takeaway. If the objective is a fast single-model workstation setup, E2B on RTX 3090 is clearly ahead. If the objective is a quieter and lower-power desk-side serving setup that can keep multiple lightweight models live at once, the GB10 deployment is more attractive even though it is slower.
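The per-watt picture depends heavily on how the shared GB10 draw is attributed, so the following is only back-of-envelope arithmetic over the reported averages, not a measured result. Charging the full shared draw to E4B alone makes the RTX run look more efficient per watt; crediting both concurrently hosted endpoints flips the comparison, which is the sense in which the GB10 platform reads as efficient for multi-model serving:

```python
# Back-of-envelope tokens-per-second-per-watt from the reported run averages.
e2b = 48.5 / 77.6                      # RTX 3090, single model: ~0.63 tok/s per W
e4b_alone = 11.6 / 27.7                # GB10, full shared draw charged to E4B: ~0.42
gb10_combined = (11.6 + 12.4) / 27.7   # both GB10 endpoints credited: ~0.87
```

Neither attribution is exact, since the two GB10 endpoints were hosted concurrently rather than generating in lockstep, but the bracketing shows why the shared-system caveat matters when quoting efficiency-per-watt numbers.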

7. Deployment-Frontier Interpretation

The same-family result supports a clean deployment split. E2B defines the throughput frontier within Gemma 4 for this benchmark. E4B / Unsloth defines the compact-efficiency frontier by preserving full pass rate and multimodal coverage at a much smaller power and thermal envelope. Qwen3.5-4B further reinforces that the GB10 platform is useful for efficient text-and-vision serving even when it does not match workstation-class generation speed.

Deployment regime | Preferred model | Why it stays on the frontier
Interactive multimodal workstation | Gemma 4 E2B | Highest throughput and lowest average latency in the benchmark
Low-power always-on local serving | Gemma 4 E4B / Unsloth | Maintains 9/9 pass rate with far smaller observed system power
Text-and-vision efficiency reference | Qwen3.5-4B on GB10 | Shows the same GB10 hardware can remain useful for efficient multi-model hosting

8. Qualitative Observations

The benchmark is not presented as a large human-rated quality study, but the reported outputs show that both Gemma variants completed the same multimodal task set successfully. The practical difference is therefore not whether E4B / Unsloth can answer, but how long the user waits and what hardware envelope must be provisioned to obtain that answer.

Same-family interpretation

Because both compared Gemma checkpoints pass the full task set, the result reads less like a capability gap and more like a deployment tuning decision. BF16 on RTX 3090 buys responsiveness. 4-bit Unsloth on GB10 buys efficiency and system compactness.

9. Limitations

The comparison is intentionally deployment-oriented and therefore combines multiple differences at once: model size, hardware platform, and quantization method. That means the paper should not be interpreted as a pure architectural comparison between E2B and E4B alone. The power numbers on GB10 are also shared-system figures because two model services were active simultaneously. In addition, quality is summarized as task success rather than large-scale blind evaluation.

10. Conclusions

The benchmark supports a straightforward conclusion. Gemma 4 E2B on RTX 3090 is the better operator choice when speed, responsiveness, and sustained multimodal throughput matter most. Gemma 4 E4B / Unsloth on GB10 is the better choice when power draw, thermal stability, and a compact always-on local deployment are the main constraints. The more useful claim is therefore selective rather than absolute: the Gemma family spans two different deployment frontiers, and both remain practically relevant.
