Gemma 4 E2B vs Phi-4 vs Qwen3.5-4B: A Cross-Platform Multimodal Deployment Frontier Across RTX 3090 and GB10 DGX Spark
A Research Benchmark by DLYog Lab
Tarun Chawdhury · Mousumi Chawdhury
DLYog Lab Research Services LLC
April 2026
Research Preview v2
Demo video: https://youtu.be/_rwdFBZ6jps

Abstract

Background. Open-weights multimodal language models are increasingly judged not only by answer quality but also by deployment characteristics such as throughput, latency, memory footprint, power draw, and thermal behavior across different hardware platforms.

Methods. We benchmarked Gemma 4 E2B and Phi-4 Multimodal sequentially on an NVIDIA RTX 3090 (24 GB VRAM, Rocky Linux) and Qwen3.5-4B separately on an NVIDIA DGX Spark system powered by the GB10 Grace Blackwell Superchip and 128 GB of unified memory. The benchmark suite covered text, vision, and audio tasks where supported, with additional video analysis results reported for Gemma 4 from the same benchmark program. Measured outputs included tokens per second, end-to-end latency, observed memory use, GPU power draw, and peak temperature. We then reinterpret the raw results as a deployment-frontier problem rather than a single leaderboard, asking which models remain non-dominated once speed, efficiency, modality coverage, and context length are considered jointly.

Results. Gemma 4 E2B achieved the highest overall throughput, leading most text and vision tasks and retaining exclusive video capability in this comparison. Phi-4 remained competitive on some short-form text interactions and showed strong audio transcription throughput on the updated audio set. Qwen3.5-4B was materially slower in raw token generation but delivered the best power efficiency at approximately 0.59 tokens per second per watt on GB10, versus 0.20 for Gemma 4 and 0.22 for Phi-4 on the RTX 3090. Under this frontier view, Gemma dominates for broad multimodal throughput, Qwen dominates for efficiency and context-scaled deployment, and Phi-4 occupies a narrower niche centered on concise text response and ASR throughput.

Conclusion. The updated benchmark no longer supports a single-model recommendation for every environment. Gemma 4 E2B is the strongest general-purpose choice when throughput, long context, and full multimodal coverage matter. Phi-4 remains usable where existing pipelines and concise outputs are preferred. Qwen3.5-4B on GB10 is compelling where low power, quiet operation, ARM-based deployment, and very long context windows are the primary constraints. The paper's main novelty is therefore not another same-hardware speed chart, but an explicit cross-platform account of how open multimodal models occupy different operational Pareto frontiers.
Keywords: multimodal LLMs, Gemma 4 E2B, Phi-4 Multimodal, Qwen3.5-4B, DGX Spark, NVIDIA DGX OS, GB10, benchmarking, power efficiency, thermal analysis, edge deployment, Pareto frontier, deployment science

1. Research Contribution

The manuscript's contribution is a shift in framing. Most lightweight multimodal benchmark notes stop at quality or speed comparisons on a single GPU. This study instead treats model selection as a deployment-frontier problem across two materially different hardware envelopes: a throughput-oriented discrete GPU workstation and a lower-power unified-memory GB10 system. That framing makes it possible to ask not only which model is fastest, but which model remains non-dominated under realistic operator constraints such as power budget, modality coverage, thermal ceiling, and context length.

Three concrete contributions follow from that framing. First, the paper compares multimodal open-weight models across heterogeneous hardware rather than forcing a misleading single-device ranking. Second, it reports operational telemetry alongside task outcomes, making efficiency a first-class benchmark variable rather than an appendix metric. Third, it identifies a split frontier: Gemma leads on multimodal throughput and video capability, Qwen leads on context-scaled energy efficiency, and Phi-4 retains a narrower but measurable niche in selected short-form and ASR settings.

2. Experimental Setup

The revised benchmark spans two hardware environments. Gemma 4 E2B and Phi-4 Multimodal were served sequentially on a Rocky Linux host with a single NVIDIA RTX 3090 (24 GB GDDR6X VRAM, CUDA 12.4). Qwen3.5-4B was served separately on an NVIDIA DGX Spark system running NVIDIA DGX OS on ARM aarch64 with CUDA 13 and 128 GB LPDDR5X unified memory. Requests were initiated from a Mac client over the local network. GPU telemetry was polled every 5 seconds during the full benchmark run.

The benchmark script referenced by the updated blog post is benchmark_v2.py. Gemma 4 E2B was run in BF16, Phi-4 Multimodal in auto FP16/BF16 mode, and Qwen3.5-4B in BF16 within a Dockerized DGX OS deployment. Gemma and Phi-4 were not run simultaneously on the RTX 3090; Qwen was benchmarked on the GB10 system as a separate service.
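The telemetry collector itself is not published with the benchmark, so the names below (GpuSample, poll_telemetry) are illustrative rather than taken from benchmark_v2.py. A minimal sketch of a 5-second poller built on nvidia-smi's CSV query mode could look like this:

```python
import subprocess
import time
from dataclasses import dataclass

@dataclass
class GpuSample:
    power_w: float
    temp_c: float
    mem_used_mib: float

def parse_smi_line(line: str) -> GpuSample:
    """Parse one CSV row produced by
    `nvidia-smi --query-gpu=power.draw,temperature.gpu,memory.used
                --format=csv,noheader,nounits`."""
    power, temp, mem = (field.strip() for field in line.split(","))
    return GpuSample(power_w=float(power), temp_c=float(temp), mem_used_mib=float(mem))

def poll_telemetry(interval_s: float = 5.0, samples: int = 3) -> list[GpuSample]:
    """Poll GPU power, temperature, and memory use every `interval_s` seconds."""
    query = [
        "nvidia-smi",
        "--query-gpu=power.draw,temperature.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ]
    readings = []
    for _ in range(samples):
        # First line corresponds to GPU 0 on a single-GPU host.
        out = subprocess.check_output(query, text=True).strip().splitlines()[0]
        readings.append(parse_smi_line(out))
        time.sleep(interval_s)
    return readings
```

The parser is deliberately separated from the polling loop so that it can be unit-tested against captured nvidia-smi output without GPU hardware.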

Item | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B
Model identifier | google/gemma-4-E2B-it | microsoft/Phi-4-multimodal-instruct | Qwen/Qwen3.5-4B
Reported parameters | 2.3B active, 5.1B total | ~5.6B | 4B
Architecture | Dense + PLE | Dense Transformer | Gated DeltaNet + MoE
Context window | 128K tokens | 128K tokens | 262K tokens
Modalities in this deployment | Text, image, audio, video | Text, image, audio | Text, image
Primary hardware | RTX 3090 / Rocky Linux | RTX 3090 / Rocky Linux | DGX Spark / NVIDIA DGX OS
Observed memory baseline | 10.3 GB VRAM | 11.6 GB VRAM | ~8 GB unified memory
Thinking mode | Native support | Not reported | Native support

3. Hardware Platform Comparison

The updated benchmark is not a same-card-only comparison. Gemma 4 E2B and Phi-4 Multimodal remain direct peers on the RTX 3090, but Qwen3.5-4B adds a second deployment axis by running on NVIDIA's DGX Spark platform under NVIDIA DGX OS. This makes the paper relevant to both throughput-oriented workstation deployments and low-power edge-style inference scenarios.

Platform metric | RTX 3090 host | GB10 DGX Spark
Architecture | Ampere GA102 discrete GPU | Blackwell GB10 SoC
Memory model | 24 GB dedicated GDDR6X VRAM | 128 GB LPDDR5X unified memory
Reported GPU memory bandwidth | 936 GB/s | 273 GB/s
TDP / system envelope | 350 W rated GPU TDP | ~140 W total system TDP
Idle power during benchmark period | ~17 W GPU | ~11 W GPU
Inference power | 156-205 W average GPU draw | 25.8 W average GPU draw
Peak observed temperature | 58-61 °C | 51 °C
Interpretive note. The RTX 3090 provides far higher memory bandwidth and therefore substantially higher raw generation speed. The GB10 platform trades that throughput for much lower power draw, lower thermal output, a unified memory model, and a deployment profile better suited to quiet desk-side or edge environments.

4. Generation Throughput

The updated benchmark covers five text tasks, two vision tasks, and two audio tasks. Qwen3.5-4B participates only in text and vision because the current DGX OS deployment is text-and-image only. Gemma 4 E2B leads the sustained text and vision categories overall, while Phi-4 records the highest throughput on the transcription task in the revised audio run.

Task | Phi-4 tok/s | Gemma 4 tok/s | Qwen3.5 tok/s | Observed outcome
Basic Q&A | 32.9 | 17.6 | 8.3 | Gemma wins on latency; Phi-4 leads token rate on this short answer
Reasoning | 38.1 | 49.4 | 17.5 | Gemma highest sustained throughput
Python coding | 38.1 | 49.5 | 17.3 | Gemma highest sustained throughput
Multilingual | 38.2 | 50.0 | 12.0 | Gemma highest sustained throughput
Summarization | 37.8 | 49.3 | 16.6 | Gemma highest sustained throughput
Image description | 28.2 | 48.7 | 17.5 | Gemma leads vision throughput
Color identification | 24.6 | 47.2 | 17.2 | Gemma leads vision throughput
Transcription (ASR) | 33.4 | 13.9 | Not applicable | Phi-4 higher token rate; Gemma lower latency
Audio Q&A | 33.4 | 44.0 | Not applicable | Gemma higher token rate and lower latency
Benchmark note. The reasoning benchmark uses a snail-puzzle prompt. The updated source indicates that Gemma 4 and Qwen both produced correct reasoning chains but were cut off by the benchmark token limit before printing the final short answer. This behavior is treated as a measurement artifact rather than a reasoning failure.

5. Latency and Accuracy

Throughput alone does not determine usability. The benchmark therefore also reports end-to-end latency and a simple task-success assessment. Gemma 4 E2B records the lowest average latency across the full benchmark set at 2.78 s, followed by Phi-4 at 3.76 s and Qwen3.5-4B at 8.73 s for the text-and-vision subset it supports.

Modality | Test | Phi-4 latency | Gemma latency | Qwen latency | Accuracy summary
Text | Basic Q&A | 2.24 s | 0.68 s | 1.20 s | All reported correct
Text | Reasoning | 8.02 s | 6.57 s | 17.15 s | All reasoning chains reported correct; Gemma and Qwen truncated
Text | Python coding | 3.94 s | 5.22 s | 14.45 s | All reported correct
Text | Multilingual | 1.77 s | 3.05 s | 7.41 s | All reported correct
Text | Summarization | 2.75 s | 1.33 s | 3.61 s | All reported correct
Vision | Image description | 2.88 s | 4.22 s | 11.45 s | All reported correct
Vision | Color identification | 1.13 s | 2.13 s | 5.83 s | All reported correct
Audio | Transcription | 6.13 s | 0.44 s | Not applicable | Both supported models reported correct
Audio | Audio Q&A | 5.02 s | 1.38 s | Not applicable | Both supported models reported correct
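The headline averages (2.78 s for Gemma, 3.76 s for Phi-4, 8.73 s for Qwen over its text-and-vision subset) can be reproduced directly from the per-task rows:

```python
# Per-task end-to-end latencies (seconds) transcribed from the table above.
latencies = {
    "Phi-4": [2.24, 8.02, 3.94, 1.77, 2.75, 2.88, 1.13, 6.13, 5.02],
    "Gemma": [0.68, 6.57, 5.22, 3.05, 1.33, 4.22, 2.13, 0.44, 1.38],
    "Qwen":  [1.20, 17.15, 14.45, 7.41, 3.61, 11.45, 5.83],  # text + vision only
}

# Simple arithmetic mean per model, rounded to two decimals as in the text.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in latencies.items()}
print(averages)  # {'Phi-4': 3.76, 'Gemma': 2.78, 'Qwen': 8.73}
```

Note that the three means are taken over different task sets (nine tasks for the RTX models, seven for Qwen), so they describe each deployment's own workload rather than a strictly identical denominator.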

6. Memory, Power, and Thermal Efficiency

The revised benchmark adds operational metrics beyond speed. On the RTX 3090, Gemma 4 E2B uses less memory than Phi-4 despite the larger total parameter count reported in the source material. On the GB10 platform, Qwen3.5-4B uses only a small fraction of the 128 GB unified memory pool and achieves a much stronger token-per-watt profile than either RTX-based model.

Metric | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B
Baseline memory use | 10.3 GB VRAM | 11.6 GB VRAM | ~8 GB unified memory
Average inference power | 204.8 W | 156.5 W | 25.8 W
Peak power in run | 227.3 W | 239.8 W | 27.1 W
Average / peak temperature | 58.8 °C / 61 °C | 55.4 °C / 58 °C | 47.7 °C / 51 °C
Average throughput | 41.1 tok/s | 33.9 tok/s | 15.2 tok/s
Tokens per second per watt | 0.20 | 0.22 | 0.59
Operational takeaway. Qwen3.5-4B on GB10 is roughly three times more power-efficient than Gemma 4 on the RTX 3090 and substantially cooler in the reported run. The trade-off is lower raw throughput. This makes GB10 attractive for continuous low-noise inference where electrical and thermal budgets matter more than peak tokens per second.
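The tokens-per-watt figures follow directly from the averages in the table, which makes the roughly threefold efficiency gap easy to check:

```python
# (average throughput in tok/s, average inference power in W) from the table above.
profiles = {
    "Gemma 4 E2B": (41.1, 204.8),
    "Phi-4 Multimodal": (33.9, 156.5),
    "Qwen3.5-4B": (15.2, 25.8),
}

# Tokens per second per watt, rounded to two decimals as reported.
efficiency = {model: round(tps / watts, 2) for model, (tps, watts) in profiles.items()}

# Qwen on GB10 vs Gemma on the RTX 3090: ~3x better energy efficiency.
ratio = efficiency["Qwen3.5-4B"] / efficiency["Gemma 4 E2B"]
```

This uses average GPU draw only; a whole-system comparison would also need host CPU and memory power, which the benchmark does not report.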

7. Deployment-Frontier Analysis

A more instructive way to read the benchmark is to ask which configurations are dominated, that is, beaten by another configuration on every decision-relevant axis. Under raw throughput alone, Gemma appears dominant. Under energy efficiency and thermal load, Qwen on GB10 becomes dominant. Under audio transcription throughput, Phi-4 remains competitive. The comparison therefore yields not a global winner but a structured frontier with different optima for different deployment regimes.

Deployment regime | Primary decision variables | Model on frontier | Reason it remains non-dominated
General multimodal workstation | Throughput, modality breadth, video support | Gemma 4 E2B | Highest average throughput, best vision speed, only reported video support
Energy- and thermal-constrained edge inference | Tokens per watt, temperature, noise, long context | Qwen3.5-4B on GB10 | Approximately 3x better token-per-watt than RTX 3090 runs and largest context window
Audio-centric concise pipeline | ASR throughput, shorter outputs, existing compatibility | Phi-4 Multimodal | Best transcription token rate and competitive short-answer behavior on RTX 3090
Novelty claim. The principal contribution of the paper is this deployment-frontier interpretation. The benchmark argues that model evaluation for small and mid-sized multimodal systems should be framed as an operator decision problem across heterogeneous hardware, not as a single scalar race for the highest tokens per second.
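The non-dominance test behind this frontier can be made explicit. The sketch below uses four higher-is-better axes drawn from Sections 2, 4, and 6 (average throughput, tokens per watt, context window in K tokens, and modality count as deployed); the axis choice is this paper's framing, not a standard metric set.

```python
def dominates(a: tuple, b: tuple) -> bool:
    """a dominates b if a is at least as good on every axis
    and strictly better on at least one (all axes higher-is-better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

# Axes: (avg tok/s, tok/s per watt, context window in K tokens, modalities deployed)
systems = {
    "Gemma 4 E2B / RTX 3090": (41.1, 0.20, 128, 4),
    "Phi-4 / RTX 3090":       (33.9, 0.22, 128, 3),
    "Qwen3.5-4B / GB10":      (15.2, 0.59, 262, 2),
}

# A system is on the Pareto frontier if no other system dominates it.
frontier = [
    name for name, v in systems.items()
    if not any(dominates(w, v) for other, w in systems.items() if other != name)
]
```

With these axes, all three systems survive the dominance check, which is exactly the paper's claim: each one wins somewhere, so none can be eliminated outright.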

8. Qualitative Findings by Modality

The updated blog content also broadens the qualitative interpretation of the benchmark. In the summarization task, all three models produce accurate two-sentence summaries with different style profiles: Gemma 4 is most concise, Phi-4 is plain and direct, and Qwen is more expansive while remaining accurate. In the image-description task, all three models correctly identify the main scene elements, although Gemma 4 and Qwen produce more structured responses than Phi-4.

Vision Example

Gemma 4 and Qwen both return structured scene breakdowns with distinct element-level labeling, whereas Phi-4 responds with a concise paragraph. The source interpretation is that answer quality is broadly comparable across models and that Gemma 4 wins primarily on speed rather than on an obvious quality gap.

Audio Example

The updated audio discussion distinguishes between real speech and a synthetic tone test. For the revised five-sample speech benchmark, Gemma 4 and Phi-4 both achieve acceptable task quality, with Gemma 4 showing much lower latency. For ambiguous synthetic audio, Phi-4 is reported to loop while Gemma 4 terminates cleanly. Qwen3.5-4B does not participate because the current deployment does not expose audio input.

9. Video and Context-Length Findings

The most consequential capability difference remains video support. Gemma 4 E2B is still the only model in this comparison with reported video understanding capability. In the referenced benchmark, Gemma 4 processed an 8-second 1280x720 MP4 sampled at 1 fps and produced a frame-aware description in 13.3 seconds at 42.9 tok/s. Phi-4 Multimodal has no video support in this benchmark, and Qwen3.5-4B as deployed on DGX OS GB10 does not expose video capability.
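The benchmark does not publish its frame sampler. Assuming frames are decoded with timestamps via PyAV (Reference 7), a 1 fps sampler reduces to picking the first decoded frame at or after each whole second; the function below is an illustrative sketch, not the benchmark's code.

```python
def sample_at_one_fps(frame_times: list[float]) -> list[int]:
    """Return indices of the first decoded frame at or after each whole second."""
    selected: list[int] = []
    next_t = 0.0
    for i, t in enumerate(frame_times):
        if t >= next_t:
            selected.append(i)
            next_t += 1.0  # advance the sampling boundary by one second
    return selected

# An 8-second clip decoded at 30 fps yields presentation times 0, 1/30, 2/30, ...
timestamps = [i / 30 for i in range(8 * 30)]
picked = sample_at_one_fps(timestamps)  # 8 frames, one per second of video
```

At 1 fps the 8-second clip contributes only 8 image frames to the model, which is why the reported 13.3-second end-to-end time is dominated by generation rather than decoding.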

Context length now forms a second major differentiator. Qwen3.5-4B provides the largest context window at 262K tokens, while both Gemma 4 E2B and Phi-4 Multimodal expose 128K-token context windows in their current official model documentation. This means that Qwen still holds the largest-context position in the comparison, but Phi-4 should no longer be interpreted as a short-context outlier on paper. Even so, Qwen may still be a better fit for document-scale or repository-scale prompts when power budgets are constrained.

Capability metric | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B
Video understanding | Supported | Not supported | Not supported in this deployment
Reported video benchmark | 13.3 s, 42.9 tok/s | Not applicable | Not applicable
Maximum context window | 128K | 128K | 262K

10. Limitations

This study has important limitations. Qwen3.5-4B is not evaluated on the same GPU as Gemma 4 and Phi-4, so claims of absolute model superiority across all settings would be overstated. The audio and video comparisons are asymmetric because supported modalities differ by deployment. Task accuracy is reported as a compact pass/fail-style summary rather than a large human-rated evaluation set. These design choices are acceptable for deployment-oriented systems work, but the results should be read as a study of practical operating envelopes rather than a universal measure of intrinsic model capability.

Those limitations also clarify what is novel here. The manuscript is strongest when it makes a systems claim about model-hardware fit under real operating constraints. It is weaker if read as a pure architecture bake-off. Future work should therefore include same-model cross-hardware replications, larger standardized multimodal test sets, and first-token latency measurements in addition to end-to-end completion time.

11. Conclusions and Recommendations

The updated evidence changes the framing of the paper from a simple Gemma-versus-Phi result to a three-way deployment trade-off. Gemma 4 E2B remains the strongest overall multimodal model for production-style workloads that require sustained generation speed, rich multimodal coverage, long context, and especially video analysis. Phi-4 remains viable for existing pipelines that already depend on its prompt behavior and for workloads that value concise answers and stable text-image-audio support on a standard GPU stack.

Qwen3.5-4B materially extends the benchmark by introducing an energy-efficient path. On NVIDIA DGX Spark running NVIDIA DGX OS, it offers a substantially better power and thermal profile than the RTX 3090 deployments while still delivering usable text and vision quality and the largest context window of the group. For edge deployment, quiet office operation, ARM-native systems, or always-on inference with strict power budgets, Qwen3.5-4B on GB10 is now a legitimate first choice even though it is not the fastest model in absolute terms.

The stronger paper-level claim is therefore selective rather than absolute: no single model wins every axis once hardware envelope is treated as part of the benchmark. Gemma 4 defines the throughput frontier, Qwen defines the efficiency frontier, and Phi-4 survives mainly where audio throughput or compatibility with existing concise-output workflows matters. That tri-partite result is more novel and more useful than a single winner-take-all ranking.

Reproducibility summary. The benchmark references benchmark_v2.py, five text cases, two vision cases, two audio cases, GPU telemetry collection, separate API endpoints for the RTX and GB10 systems, and a results artifact reported in the updated blog as results/benchmark_v2_20260405_204546.json.

12. References

  1. Google DeepMind. google/gemma-4-E2B model card. Available at: https://huggingface.co/google/gemma-4-E2B
  2. Microsoft / NVIDIA. Phi-4-multimodal-instruct model card. Available at: https://build.nvidia.com/microsoft/phi-4-multimodal-instruct/modelcard
  3. Qwen Team. Qwen3.5-4B-Base model card. Available at: https://huggingface.co/Qwen/Qwen3.5-4B-Base
  4. NVIDIA. NVIDIA DGX Spark product and specifications page. Available at: https://www.nvidia.com/en-us/products/workstations/dgx-spark/
  5. hexgrad. Kokoro-82M model card. Hugging Face. Available at: https://huggingface.co/hexgrad/Kokoro-82M
  6. Black Forest Labs. FLUX.1 [dev] model card. Hugging Face. Available at: https://huggingface.co/black-forest-labs/FLUX.1-dev
  7. PyAV Project. PyAV. Python Package Index. Available at: https://pypi.org/project/av/
  8. DLYog. Gemma-4-E2B-vs-PHI4-Multimodal-TestData. GitHub repository for benchmark test data and verification. Available at: https://github.com/dlyog/gemma-4-E2B-vs-PHI4-Multimodal-TestData