The manuscript's contribution is a shift in framing. Most lightweight multimodal benchmark notes stop at quality or speed comparisons on a single GPU. This study instead treats model selection as a deployment-frontier problem across two materially different hardware envelopes: a throughput-oriented discrete GPU workstation and a lower-power unified-memory GB10 system. That framing makes it possible to ask not only which model is fastest, but which model remains non-dominated under realistic operator constraints such as power budget, modality coverage, thermal ceiling, and context length.
Three concrete contributions follow from that framing. First, the paper compares multimodal open-weight models across heterogeneous hardware rather than forcing a misleading single-device ranking. Second, it reports operational telemetry alongside task outcomes, making efficiency a first-class benchmark variable rather than an appendix metric. Third, it identifies a split frontier: Gemma leads on multimodal throughput and video capability, Qwen leads on context-scaled energy efficiency, and Phi-4 retains a narrower but measurable niche in selected short-form and ASR settings.
The revised benchmark spans two hardware environments. Gemma 4 E2B and Phi-4 Multimodal were served sequentially on a Rocky Linux host with a single NVIDIA RTX 3090 (24 GB GDDR6X VRAM, CUDA 12.4). Qwen3.5-4B was served separately on an NVIDIA DGX Spark system running NVIDIA DGX OS on ARM aarch64 with CUDA 13 and 128 GB LPDDR5X unified memory. Requests were initiated from a Mac client over the local network. GPU telemetry was polled every 5 seconds during the full benchmark run.
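The 5-second telemetry cadence can be reproduced with a small polling loop. The sketch below is illustrative rather than the project's actual collection code: it assumes `nvidia-smi`'s CSV query interface, and the helper names (`parse_gpu_sample`, `poll_gpu`) are my own.

```python
import csv
import subprocess
import time
from io import StringIO

QUERY = "power.draw,temperature.gpu,memory.used"

def parse_gpu_sample(line: str) -> dict:
    # One CSV row from `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`,
    # e.g. "204.8, 59, 10547" -> watts, degrees C, MiB.
    power, temp, mem = next(csv.reader(StringIO(line)))
    return {"power_w": float(power), "temp_c": float(temp), "mem_mib": float(mem)}

def poll_gpu(interval_s: float = 5.0, samples: int = 12) -> list[dict]:
    # Fixed-interval polling; requires an NVIDIA driver on the host.
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            text=True,
        )
        readings.append(parse_gpu_sample(out.strip().splitlines()[0]))
        time.sleep(interval_s)
    return readings
```

Averaging the `power_w` readings over a run yields the kind of per-model draw figures reported later in this section.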
The benchmark script referenced by the updated blog post is benchmark_v2.py. Gemma 4 E2B was run in BF16, Phi-4 Multimodal in auto FP16/BF16 mode, and Qwen3.5-4B in BF16 within a Dockerized DGX OS deployment. Gemma and Phi-4 were not run simultaneously on the RTX 3090; Qwen was benchmarked on the GB10 system as a separate service.
| Item | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B |
|---|---|---|---|
| Model identifier | google/gemma-4-E2B-it | microsoft/Phi-4-multimodal-instruct | Qwen/Qwen3.5-4B |
| Reported parameters | 2.3B active, 5.1B total | ~5.6B | 4B |
| Architecture | Dense + PLE | Dense Transformer | Gated DeltaNet + MoE |
| Context window | 128K tokens | 128K tokens | 262K tokens |
| Modalities in this deployment | Text, image, audio, video | Text, image, audio | Text, image |
| Primary hardware | RTX 3090 / Rocky Linux | RTX 3090 / Rocky Linux | DGX Spark / NVIDIA DGX OS |
| Observed memory baseline | 10.3 GB VRAM | 11.6 GB VRAM | ~8 GB unified memory |
| Thinking mode | Native support | Not reported | Native support |
The updated benchmark is not a same-card-only comparison. Gemma 4 E2B and Phi-4 Multimodal remain direct peers on the RTX 3090, but Qwen3.5-4B adds a second deployment axis by running on NVIDIA's DGX Spark platform under NVIDIA DGX OS. This makes the paper relevant to both throughput-oriented workstation deployments and low-power edge-style inference scenarios.
| Platform metric | RTX 3090 host | GB10 DGX Spark |
|---|---|---|
| Architecture | Ampere GA102 discrete GPU | Blackwell GB10 SoC |
| Memory model | 24 GB dedicated GDDR6X VRAM | 128 GB LPDDR5X unified memory |
| Reported GPU memory bandwidth | 936 GB/s | 273 GB/s |
| TDP / system envelope | 350 W rated GPU TDP | ~140 W total system TDP |
| Idle power during benchmark period | ~17 W GPU | ~11 W GPU |
| Inference power | 156-205 W average GPU draw | 25.8 W average GPU draw |
| Peak observed temperature | 58-61 C | 51 C |
The updated benchmark covers five text tasks, two vision tasks, and two audio tasks. Qwen3.5-4B participates only in text and vision because the current DGX OS deployment is text-and-image only. Gemma 4 E2B leads the sustained text and vision categories overall, while Phi-4 records the highest throughput on the transcription task in the revised audio run.
| Task | Phi-4 tok/s | Gemma 4 tok/s | Qwen3.5 tok/s | Observed outcome |
|---|---|---|---|---|
| Basic Q&A | 32.9 | 17.6 | 8.3 | Gemma wins on latency; Phi-4 leads token rate on this short answer |
| Reasoning | 38.1 | 49.4 | 17.5 | Gemma highest sustained throughput |
| Python coding | 38.1 | 49.5 | 17.3 | Gemma highest sustained throughput |
| Multilingual | 38.2 | 50.0 | 12.0 | Gemma highest sustained throughput |
| Summarization | 37.8 | 49.3 | 16.6 | Gemma highest sustained throughput |
| Image description | 28.2 | 48.7 | 17.5 | Gemma leads vision throughput |
| Color identification | 24.6 | 47.2 | 17.2 | Gemma leads vision throughput |
| Transcription (ASR) | 33.4 | 13.9 | Not applicable | Phi-4 higher token rate; Gemma lower latency |
| Audio Q&A | 33.4 | 44.0 | Not applicable | Gemma higher token rate and lower latency |
Throughput alone does not determine usability. The benchmark therefore also reports end-to-end latency and a simple task-success assessment. Gemma 4 E2B records the lowest average latency across the full benchmark set at 2.78 s, followed by Phi-4 at 3.76 s and Qwen3.5-4B at 8.73 s for the text-and-vision subset it supports.
| Modality | Test | Phi-4 latency | Gemma latency | Qwen latency | Accuracy summary |
|---|---|---|---|---|---|
| Text | Basic Q&A | 2.24s | 0.68s | 1.20s | All reported correct |
| Text | Reasoning | 8.02s | 6.57s | 17.15s | All reasoning chains reported correct; Gemma and Qwen truncated |
| Text | Python coding | 3.94s | 5.22s | 14.45s | All reported correct |
| Text | Multilingual | 1.77s | 3.05s | 7.41s | All reported correct |
| Text | Summarization | 2.75s | 1.33s | 3.61s | All reported correct |
| Vision | Image description | 2.88s | 4.22s | 11.45s | All reported correct |
| Vision | Color identification | 1.13s | 2.13s | 5.83s | All reported correct |
| Audio | Transcription | 6.13s | 0.44s | Not applicable | Both supported models reported correct |
| Audio | Audio Q&A | 5.02s | 1.38s | Not applicable | Both supported models reported correct |
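Per-request latency and token rate of the kind tabulated above can be derived from a wall-clock timer plus the completion token count. The helpers below are an illustrative sketch, not the benchmark's actual instrumentation; `call` and `count_tokens` stand in for whatever client and response schema the harness uses.

```python
import time

def request_stats(completion_tokens: int, latency_s: float) -> dict:
    # End-to-end latency plus the sustained token rate it implies.
    return {"latency_s": round(latency_s, 2),
            "tok_per_s": round(completion_tokens / latency_s, 1)}

def timed(call, count_tokens) -> dict:
    # `call` issues one inference request; `count_tokens` extracts the
    # generated-token count from its response. Both are caller-supplied.
    start = time.perf_counter()
    response = call()
    return request_stats(count_tokens(response), time.perf_counter() - start)
```

For example, a 2.78 s response carrying 114 generated tokens implies 41.0 tok/s, the same scale as the Gemma averages reported here.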
The revised benchmark adds operational metrics beyond speed. On the RTX 3090, Gemma 4 E2B uses less memory than Phi-4 despite the larger total parameter count reported in the source material. On the GB10 platform, Qwen3.5-4B uses only a small fraction of the 128 GB unified memory pool and achieves a much stronger token-per-watt profile than either RTX-based model.
| Metric | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B |
|---|---|---|---|
| Baseline memory use | 10.3 GB VRAM | 11.6 GB VRAM | ~8 GB unified memory |
| Average inference power | 204.8 W | 156.5 W | 25.8 W |
| Peak power in run | 227.3 W | 239.8 W | 27.1 W |
| Average / peak temperature | 58.8 C / 61 C | 55.4 C / 58 C | 47.7 C / 51 C |
| Average throughput | 41.1 tok/s | 33.9 tok/s | 15.2 tok/s |
| Tokens per second per watt | 0.20 | 0.22 | 0.59 |
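The tokens-per-second-per-watt row is a direct ratio of the two averages above it. A minimal check, using the reported table values:

```python
def tokens_per_watt(avg_tok_s: float, avg_power_w: float) -> float:
    # Efficiency as reported: average throughput divided by average inference draw.
    return round(avg_tok_s / avg_power_w, 2)

# (average tok/s, average inference power in W) from the table above.
runs = {
    "Gemma 4 E2B": (41.1, 204.8),
    "Phi-4 Multimodal": (33.9, 156.5),
    "Qwen3.5-4B": (15.2, 25.8),
}
for model, (tps, watts) in runs.items():
    # Reproduces the 0.20 / 0.22 / 0.59 column.
    print(model, tokens_per_watt(tps, watts))
```

The roughly 3x gap between Qwen on GB10 and either RTX 3090 deployment falls straight out of this ratio.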
A more novel interpretation of the benchmark is to ask which configurations are Pareto-dominated, that is, beaten by another configuration on every decision-relevant axis at once. Under raw throughput alone, Gemma appears dominant. Under energy efficiency and thermal load, Qwen on GB10 becomes dominant. Under audio transcription throughput, Phi-4 remains competitive. The comparison therefore yields not a global winner but a structured frontier with different optima for different deployment regimes.
| Deployment regime | Primary decision variables | Model on frontier | Reason it remains non-dominated |
|---|---|---|---|
| General multimodal workstation | Throughput, modality breadth, video support | Gemma 4 E2B | Highest average throughput, best vision speed, only reported video support |
| Energy- and thermal-constrained edge inference | Tokens per watt, temperature, noise, long context | Qwen3.5-4B on GB10 | Approximately 3x better token-per-watt than RTX 3090 runs and largest context window |
| Audio-centric concise pipeline | ASR throughput, shorter outputs, existing compatibility | Phi-4 Multimodal | Best transcription token rate and competitive short-answer behavior on RTX 3090 |
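The non-dominated reading can be made precise with a standard Pareto filter. The sketch below is illustrative: the three axes (average throughput, tokens per watt, modality count) and their values are lifted from the tables in this section, each treated as higher-is-better.

```python
def dominates(a: tuple, b: tuple) -> bool:
    # a dominates b when a is at least as good everywhere and strictly better somewhere.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates: dict) -> set:
    # Keep every candidate that no other candidate dominates.
    return {name for name, vec in candidates.items()
            if not any(dominates(other, vec)
                       for o, other in candidates.items() if o != name)}

# (avg tok/s, tok/s per watt, modality count) -- higher is better on each axis.
models = {
    "Gemma 4 E2B": (41.1, 0.20, 4),
    "Phi-4 Multimodal": (33.9, 0.22, 3),
    "Qwen3.5-4B": (15.2, 0.59, 2),
}
```

On these axes all three models survive the filter, which is exactly the split-frontier result the table describes.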
The updated blog content also broadens the qualitative interpretation of the benchmark. In the summarization task, all three models produce accurate two-sentence summaries with different style profiles: Gemma 4 is most concise, Phi-4 is plain and direct, and Qwen is more expansive while remaining accurate. In the image-description task, all three models correctly identify the main scene elements, although Gemma 4 and Qwen produce more structured responses than Phi-4.
Gemma 4 and Qwen both return structured scene breakdowns with distinct element-level labeling, whereas Phi-4 responds with a concise paragraph. The source interpretation is that answer quality is broadly comparable across models and that Gemma 4 wins primarily on speed rather than on an obvious quality gap.
The updated audio discussion distinguishes between real speech and a synthetic tone test. For the revised five-sample speech benchmark, Gemma 4 and Phi-4 both achieve acceptable task quality, with Gemma 4 showing much lower latency. For ambiguous synthetic audio, Phi-4 is reported to loop while Gemma 4 terminates cleanly. Qwen3.5-4B does not participate because the current deployment does not expose audio input.
The most consequential capability difference remains video support. Gemma 4 E2B is still the only model in this comparison with reported video understanding capability. In the referenced benchmark, Gemma 4 processed an 8-second 1280x720 MP4 sampled at 1 fps and produced a frame-aware description in 13.3 seconds at 42.9 tok/s. Phi-4 Multimodal has no video support in this benchmark, and Qwen3.5-4B as deployed on DGX OS GB10 does not expose video capability.
Context length now forms a second major differentiator. Qwen3.5-4B provides the largest context window at 262K tokens, while both Gemma 4 E2B and Phi-4 Multimodal expose 128K-token context windows in their current official model documentation. This means that Qwen still holds the largest-context position in the comparison, but Phi-4 should no longer be interpreted as a short-context outlier on paper. Even so, Qwen may still be a better fit for document-scale or repository-scale prompts when power budgets are constrained.
| Capability metric | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B |
|---|---|---|---|
| Video understanding | Supported | Not supported | Not supported in this deployment |
| Reported video benchmark | 13.3 s, 42.9 tok/s | Not applicable | Not applicable |
| Maximum context window | 128K | 128K | 262K |
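For document- or repository-scale prompts, the practical question is whether the prompt plus the generation budget fits inside the window. A trivial budget check, using the rounded window figures from the table (real serving limits may differ slightly; the helper name is illustrative):

```python
# Rounded window sizes from the capability table above.
CONTEXT_WINDOWS = {
    "Gemma 4 E2B": 128_000,
    "Phi-4 Multimodal": 128_000,
    "Qwen3.5-4B": 262_000,
}

def fits_context(model: str, prompt_tokens: int, max_new_tokens: int) -> bool:
    # A prompt fits when prompt plus generation budget stays inside the window.
    return prompt_tokens + max_new_tokens <= CONTEXT_WINDOWS[model]
```

A 200K-token repository dump with a 4K generation budget fits only Qwen's window, which is the scenario where its largest-context position matters.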
The paper also has important limitations. Qwen3.5-4B is not evaluated on the same GPU as Gemma 4 and Phi-4, so claims about absolute model superiority across all settings would be overstated. The audio and video comparisons are asymmetric because supported modalities differ by deployment. Task accuracy is reported as a compact pass/fail-style summary rather than a large human-rated evaluation set. These design choices are acceptable for deployment-oriented systems work, but they should be read as a study of practical operating envelopes rather than a universal measure of intrinsic model capability.
Those limitations also clarify what is novel here. The manuscript is strongest when it makes a systems claim about model-hardware fit under real operating constraints. It is weaker if read as a pure architecture bake-off. Future work should therefore include same-model cross-hardware replications, larger standardized multimodal test sets, and first-token latency measurements in addition to end-to-end completion time.
The updated evidence changes the framing of the paper from a simple Gemma-versus-Phi result to a three-way deployment trade-off. Gemma 4 E2B remains the strongest overall multimodal model for production-style workloads that require sustained generation speed, rich multimodal coverage, long context, and especially video analysis. Phi-4 remains viable for existing pipelines that already depend on its prompt behavior and for workloads that value concise answers and stable text-image-audio support on a standard GPU stack.
Qwen3.5-4B materially extends the benchmark by introducing an energy-efficient path. On NVIDIA DGX Spark running NVIDIA DGX OS, it offers a substantially better power and thermal profile than the RTX 3090 deployments while still delivering usable text and vision quality and the largest context window of the group. For edge deployment, quiet office operation, ARM-native systems, or always-on inference with strict power budgets, Qwen3.5-4B on GB10 is now a legitimate first choice even though it is not the fastest model in absolute terms.
The stronger paper-level claim is therefore selective rather than absolute: no single model wins every axis once hardware envelope is treated as part of the benchmark. Gemma 4 defines the throughput frontier, Qwen defines the efficiency frontier, and Phi-4 survives mainly where audio throughput or compatibility with existing concise-output workflows matters. That tri-partite result is more novel and more useful than a single winner-take-all ranking.
The benchmark harness described in the updated blog consists of benchmark_v2.py, five text cases, two vision cases, two audio cases, GPU telemetry collection, separate API endpoints for the RTX and GB10 systems, and a results artifact reported as results/benchmark_v2_20260405_204546.json.
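A results artifact like this is straightforward to post-process. The loader below is hypothetical: the JSON schema of the actual file is not documented here, so the record fields (`model`, `tok_per_s`) are assumptions standing in for whatever benchmark_v2.py really writes.

```python
import json
from pathlib import Path

def mean_throughput_by_model(path: str) -> dict:
    # Assumed schema: a list of records, each with "model" and "tok_per_s" keys.
    records = json.loads(Path(path).read_text())
    rates: dict[str, list[float]] = {}
    for rec in records:
        rates.setdefault(rec["model"], []).append(rec["tok_per_s"])
    return {m: round(sum(v) / len(v), 1) for m, v in rates.items()}
```

Under that assumed schema, aggregating the per-task records would reproduce the average-throughput row in the efficiency table.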