The manuscript's contribution is a shift in framing. Most lightweight multimodal benchmark notes stop at quality or speed comparisons on a single GPU. This study instead treats model selection as a deployment-frontier problem across two materially different hardware envelopes: a throughput-oriented discrete GPU workstation and a lower-power unified-memory GB10 system. That framing makes it possible to ask not only which model is fastest, but which model remains non-dominated under realistic operator constraints such as power budget, modality coverage, thermal ceiling, and context length.
Three concrete contributions follow from that framing. First, the paper compares multimodal open-weight models across heterogeneous hardware rather than forcing a misleading single-device ranking. Second, it reports operational telemetry alongside task outcomes, making efficiency a first-class benchmark variable rather than an appendix metric. Third, it identifies a split frontier: Gemma leads on multimodal throughput and video capability, Qwen leads on context-scaled energy efficiency, and Phi-4 retains a narrower but measurable niche in selected short-form and ASR settings.
The revised benchmark spans two hardware environments. Gemma 4 E2B and Phi-4 Multimodal were served sequentially on a Rocky Linux host with a single NVIDIA RTX 3090 (24 GB GDDR6X VRAM, CUDA 12.4). Qwen3.5-4B was served separately on an NVIDIA DGX Spark system running NVIDIA DGX OS on ARM aarch64 with CUDA 13 and 128 GB LPDDR5X unified memory. Requests were initiated from a Mac client over the local network. GPU telemetry was polled every 5 seconds during the full benchmark run.
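The 5-second telemetry cadence can be reproduced with a small polling loop. The sketch below is illustrative rather than the project's actual collection code: it assumes `nvidia-smi`'s CSV query interface, and the helper names (`parse_gpu_sample`, `poll_gpu`) are my own.

```python
import csv
import subprocess
import time
from io import StringIO

QUERY = "power.draw,temperature.gpu,memory.used"

def parse_gpu_sample(line: str) -> dict:
    # One CSV row from `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`,
    # e.g. "204.8, 59, 10547" -> watts, degrees C, MiB.
    power, temp, mem = next(csv.reader(StringIO(line)))
    return {"power_w": float(power), "temp_c": float(temp), "mem_mib": float(mem)}

def poll_gpu(interval_s: float = 5.0, samples: int = 12) -> list[dict]:
    # Fixed-interval polling; requires an NVIDIA driver on the host.
    readings = []
    for _ in range(samples):
        out = subprocess.check_output(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            text=True,
        )
        readings.append(parse_gpu_sample(out.strip().splitlines()[0]))
        time.sleep(interval_s)
    return readings
```

Averaging the `power_w` readings over a run yields the kind of per-model draw figures reported later in this section.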
The benchmark script referenced by the updated blog post is benchmark_v2.py. Gemma 4 E2B was run in BF16, Phi-4 Multimodal in auto FP16/BF16 mode, and Qwen3.5-4B in BF16 within a Dockerized DGX OS deployment. Gemma and Phi-4 were not run simultaneously on the RTX 3090; Qwen was benchmarked on the GB10 system as a separate service.
| Item | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B |
|---|---|---|---|
| Model identifier | google/gemma-4-E2B-it | microsoft/Phi-4-multimodal-instruct | Qwen/Qwen3.5-4B |
| Reported parameters | 2.3B active, 5.1B total | ~5.6B | 4B |
| Architecture | Dense + PLE | Dense Transformer | Gated DeltaNet + MoE |
| Context window | 128K tokens | 128K tokens | 262K tokens |
| Modalities in this deployment | Text, image, audio, video | Text, image, audio | Text, image |
| Primary hardware | RTX 3090 / Rocky Linux | RTX 3090 / Rocky Linux | DGX Spark / NVIDIA DGX OS |
| Observed memory baseline | 10.3 GB VRAM | 11.6 GB VRAM | ~8 GB unified memory |
| Thinking mode | Native support | Not reported | Native support |
The updated benchmark is not a same-card-only comparison. Gemma 4 E2B and Phi-4 Multimodal remain direct peers on the RTX 3090, but Qwen3.5-4B adds a second deployment axis by running on NVIDIA's DGX Spark platform under NVIDIA DGX OS. This makes the paper relevant to both throughput-oriented workstation deployments and low-power edge-style inference scenarios.
| Platform metric | RTX 3090 host | GB10 DGX Spark |
|---|---|---|
| Architecture | Ampere GA102 discrete GPU | Blackwell GB10 SoC |
| Memory model | 24 GB dedicated GDDR6X VRAM | 128 GB LPDDR5X unified memory |
| Reported GPU memory bandwidth | 936 GB/s | 273 GB/s |
| TDP / system envelope | 350 W rated GPU TDP | ~140 W total system TDP |
| Idle power during benchmark period | ~17 W GPU | ~11 W GPU |
| Inference power | 156-205 W average GPU draw | 25.8 W average GPU draw |
| Peak observed temperature | 58-61 C | 51 C |
The updated benchmark covers five text tasks, two vision tasks, and two audio tasks. Qwen3.5-4B participates only in text and vision because the current DGX OS deployment is text-and-image only. Gemma 4 E2B leads the sustained text and vision categories overall, while Phi-4 records the highest throughput on the transcription task in the revised audio run.
| Task | Phi-4 tok/s | Gemma 4 tok/s | Qwen3.5 tok/s | Observed outcome |
|---|---|---|---|---|
| Basic Q&A | 32.9 | 17.6 | 8.3 | Gemma wins on latency; Phi-4 leads token rate on this short answer |
| Reasoning | 38.1 | 49.4 | 17.5 | Gemma highest sustained throughput |
| Python coding | 38.1 | 49.5 | 17.3 | Gemma highest sustained throughput |
| Multilingual | 38.2 | 50.0 | 12.0 | Gemma highest sustained throughput |
| Summarization | 37.8 | 49.3 | 16.6 | Gemma highest sustained throughput |
| Image description | 28.2 | 48.7 | 17.5 | Gemma leads vision throughput |
| Color identification | 24.6 | 47.2 | 17.2 | Gemma leads vision throughput |
| Transcription (ASR) | 33.4 | 13.9 | Not applicable | Phi-4 higher token rate; Gemma lower latency |
| Audio Q&A | 33.4 | 44.0 | Not applicable | Gemma higher token rate and lower latency |
Throughput alone does not determine usability. The benchmark therefore also reports end-to-end latency and a simple task-success assessment. Gemma 4 E2B records the lowest average latency across the full benchmark set at 2.78 s, followed by Phi-4 at 3.76 s and Qwen3.5-4B at 8.73 s for the text-and-vision subset it supports.
| Modality | Test | Phi-4 latency | Gemma latency | Qwen latency | Accuracy summary |
|---|---|---|---|---|---|
| Text | Basic Q&A | 2.24s | 0.68s | 1.20s | All reported correct |
| Text | Reasoning | 8.02s | 6.57s | 17.15s | All reasoning chains reported correct; Gemma and Qwen truncated |
| Text | Python coding | 3.94s | 5.22s | 14.45s | All reported correct |
| Text | Multilingual | 1.77s | 3.05s | 7.41s | All reported correct |
| Text | Summarization | 2.75s | 1.33s | 3.61s | All reported correct |
| Vision | Image description | 2.88s | 4.22s | 11.45s | All reported correct |
| Vision | Color identification | 1.13s | 2.13s | 5.83s | All reported correct |
| Audio | Transcription | 6.13s | 0.44s | Not applicable | Both supported models reported correct |
| Audio | Audio Q&A | 5.02s | 1.38s | Not applicable | Both supported models reported correct |
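Per-request latency and token rate of the kind tabulated above can be derived from a wall-clock timer plus the completion token count. The helpers below are an illustrative sketch, not the benchmark's actual instrumentation; `call` and `count_tokens` stand in for whatever client and response schema the harness uses.

```python
import time

def request_stats(completion_tokens: int, latency_s: float) -> dict:
    # End-to-end latency plus the sustained token rate it implies.
    return {"latency_s": round(latency_s, 2),
            "tok_per_s": round(completion_tokens / latency_s, 1)}

def timed(call, count_tokens) -> dict:
    # `call` issues one inference request; `count_tokens` extracts the
    # generated-token count from its response. Both are caller-supplied.
    start = time.perf_counter()
    response = call()
    return request_stats(count_tokens(response), time.perf_counter() - start)
```

For example, a 2.78 s response carrying 114 generated tokens implies 41.0 tok/s, the same scale as the Gemma averages reported here.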
The revised benchmark adds operational metrics beyond speed. On the RTX 3090, Gemma 4 E2B uses less memory than Phi-4 despite the larger total parameter count reported in the source material. On the GB10 platform, Qwen3.5-4B uses only a small fraction of the 128 GB unified memory pool and achieves a much stronger token-per-watt profile than either RTX-based model.
| Metric | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B |
|---|---|---|---|
| Baseline memory use | 10.3 GB VRAM | 11.6 GB VRAM | ~8 GB unified memory |
| Average inference power | 204.8 W | 156.5 W | 25.8 W |
| Peak power in run | 227.3 W | 239.8 W | 27.1 W |
| Average / peak temperature | 58.8 C / 61 C | 55.4 C / 58 C | 47.7 C / 51 C |
| Average throughput | 41.1 tok/s | 33.9 tok/s | 15.2 tok/s |
| Tokens per second per watt | 0.20 | 0.22 | 0.59 |
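The tokens-per-second-per-watt row is a direct ratio of the two averages above it. A minimal check, using the reported table values:

```python
def tokens_per_watt(avg_tok_s: float, avg_power_w: float) -> float:
    # Efficiency as reported: average throughput divided by average inference draw.
    return round(avg_tok_s / avg_power_w, 2)

# (average tok/s, average inference power in W) from the table above.
runs = {
    "Gemma 4 E2B": (41.1, 204.8),
    "Phi-4 Multimodal": (33.9, 156.5),
    "Qwen3.5-4B": (15.2, 25.8),
}
for model, (tps, watts) in runs.items():
    # Reproduces the 0.20 / 0.22 / 0.59 column.
    print(model, tokens_per_watt(tps, watts))
```

The roughly 3x gap between Qwen on GB10 and either RTX 3090 deployment falls straight out of this ratio.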
A more novel interpretation of the benchmark is to ask which configurations are Pareto-dominated, that is, beaten by another configuration on every decision-relevant axis at once. Under raw throughput alone, Gemma appears dominant. Under energy efficiency and thermal load, Qwen on GB10 becomes dominant. Under audio transcription throughput, Phi-4 remains competitive. The comparison therefore yields not a global winner but a structured frontier with different optima for different deployment regimes.
| Deployment regime | Primary decision variables | Model on frontier | Reason it remains non-dominated |
|---|---|---|---|
| General multimodal workstation | Throughput, modality breadth, video support | Gemma 4 E2B | Highest average throughput, best vision speed, only reported video support |
| Energy- and thermal-constrained edge inference | Tokens per watt, temperature, noise, long context | Qwen3.5-4B on GB10 | Approximately 3x better token-per-watt than RTX 3090 runs and largest context window |
| Audio-centric concise pipeline | ASR throughput, shorter outputs, existing compatibility | Phi-4 Multimodal | Best transcription token rate and competitive short-answer behavior on RTX 3090 |
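The non-dominated reading can be made precise with a standard Pareto filter. The sketch below is illustrative: the three axes (average throughput, tokens per watt, modality count) and their values are lifted from the tables in this section, each treated as higher-is-better.

```python
def dominates(a: tuple, b: tuple) -> bool:
    # a dominates b when a is at least as good everywhere and strictly better somewhere.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(candidates: dict) -> set:
    # Keep every candidate that no other candidate dominates.
    return {name for name, vec in candidates.items()
            if not any(dominates(other, vec)
                       for o, other in candidates.items() if o != name)}

# (avg tok/s, tok/s per watt, modality count) -- higher is better on each axis.
models = {
    "Gemma 4 E2B": (41.1, 0.20, 4),
    "Phi-4 Multimodal": (33.9, 0.22, 3),
    "Qwen3.5-4B": (15.2, 0.59, 2),
}
```

On these axes all three models survive the filter, which is exactly the split-frontier result the table describes.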
The updated blog content also broadens the qualitative interpretation of the benchmark. In the summarization task, all three models produce accurate two-sentence summaries with different style profiles: Gemma 4 is most concise, Phi-4 is plain and direct, and Qwen is more expansive while remaining accurate. In the image-description task, all three models correctly identify the main scene elements, although Gemma 4 and Qwen produce more structured responses than Phi-4.
Gemma 4 and Qwen both return structured scene breakdowns with distinct element-level labeling, whereas Phi-4 responds with a concise paragraph. The source interpretation is that answer quality is broadly comparable across models and that Gemma 4 wins primarily on speed rather than on an obvious quality gap.
The updated audio discussion distinguishes between real speech and a synthetic tone test. For the revised five-sample speech benchmark, Gemma 4 and Phi-4 both achieve acceptable task quality, with Gemma 4 showing much lower latency. For ambiguous synthetic audio, Phi-4 is reported to loop while Gemma 4 terminates cleanly. Qwen3.5-4B does not participate because the current deployment does not expose audio input.
The most consequential capability difference remains video support. Gemma 4 E2B is still the only model in this comparison with reported video understanding capability. In the referenced benchmark, Gemma 4 processed an 8-second 1280x720 MP4 sampled at 1 fps and produced a frame-aware description in 13.3 seconds at 42.9 tok/s. Phi-4 Multimodal has no video support in this benchmark, and Qwen3.5-4B as deployed on DGX OS GB10 does not expose video capability.
Context length now forms a second major differentiator. Qwen3.5-4B provides the largest context window at 262K tokens, while both Gemma 4 E2B and Phi-4 Multimodal expose 128K-token context windows in their current official model documentation. This means that Qwen still holds the largest-context position in the comparison, but Phi-4 should no longer be interpreted as a short-context outlier on paper. Even so, Qwen may still be a better fit for document-scale or repository-scale prompts when power budgets are constrained.
| Capability metric | Gemma 4 E2B | Phi-4 Multimodal | Qwen3.5-4B |
|---|---|---|---|
| Video understanding | Supported | Not supported | Not supported in this deployment |
| Reported video benchmark | 13.3 s, 42.9 tok/s | Not applicable | Not applicable |
| Maximum context window | 128K | 128K | 262K |
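For document- or repository-scale prompts, the practical question is whether the prompt plus the generation budget fits inside the window. A trivial budget check, using the rounded window figures from the table (real serving limits may differ slightly; the helper name is illustrative):

```python
# Rounded window sizes from the capability table above.
CONTEXT_WINDOWS = {
    "Gemma 4 E2B": 128_000,
    "Phi-4 Multimodal": 128_000,
    "Qwen3.5-4B": 262_000,
}

def fits_context(model: str, prompt_tokens: int, max_new_tokens: int) -> bool:
    # A prompt fits when prompt plus generation budget stays inside the window.
    return prompt_tokens + max_new_tokens <= CONTEXT_WINDOWS[model]
```

A 200K-token repository dump with a 4K generation budget fits only Qwen's window, which is the scenario where its largest-context position matters.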
The paper also has important limitations. Qwen3.5-4B is not evaluated on the same GPU as Gemma 4 and Phi-4, so claims about absolute model superiority across all settings would be overstated. The audio and video comparisons are asymmetric because supported modalities differ by deployment. Task accuracy is reported as a compact pass/fail-style summary rather than a large human-rated evaluation set. These design choices are acceptable for deployment-oriented systems work, but they should be read as a study of practical operating envelopes rather than a universal measure of intrinsic model capability.
Those limitations also clarify what is novel here. The manuscript is strongest when it makes a systems claim about model-hardware fit under real operating constraints. It is weaker if read as a pure architecture bake-off. Future work should therefore include same-model cross-hardware replications, larger standardized multimodal test sets, and first-token latency measurements in addition to end-to-end completion time.
The updated evidence changes the framing of the paper from a simple Gemma-versus-Phi result to a three-way deployment trade-off. Gemma 4 E2B remains the strongest overall multimodal model for production-style workloads that require sustained generation speed, rich multimodal coverage, long context, and especially video analysis. Phi-4 remains viable for existing pipelines that already depend on its prompt behavior and for workloads that value concise answers and stable text-image-audio support on a standard GPU stack.
Qwen3.5-4B materially extends the benchmark by introducing an energy-efficient path. On NVIDIA DGX Spark running NVIDIA DGX OS, it offers a substantially better power and thermal profile than the RTX 3090 deployments while still delivering usable text and vision quality and the largest context window of the group. For edge deployment, quiet office operation, ARM-native systems, or always-on inference with strict power budgets, Qwen3.5-4B on GB10 is now a legitimate first choice even though it is not the fastest model in absolute terms.
The stronger paper-level claim is therefore selective rather than absolute: no single model wins every axis once hardware envelope is treated as part of the benchmark. Gemma 4 defines the throughput frontier, Qwen defines the efficiency frontier, and Phi-4 survives mainly where audio throughput or compatibility with existing concise-output workflows matters. That tri-partite result is more novel and more useful than a single winner-take-all ranking.
The benchmark harness described in the updated blog consists of benchmark_v2.py, five text cases, two vision cases, two audio cases, GPU telemetry collection, separate API endpoints for the RTX and GB10 systems, and a results artifact reported as results/benchmark_v2_20260405_204546.json.
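A results artifact like this is straightforward to post-process. The loader below is hypothetical: the JSON schema of the actual file is not documented here, so the record fields (`model`, `tok_per_s`) are assumptions standing in for whatever benchmark_v2.py really writes.

```python
import json
from pathlib import Path

def mean_throughput_by_model(path: str) -> dict:
    # Assumed schema: a list of records, each with "model" and "tok_per_s" keys.
    records = json.loads(Path(path).read_text())
    rates: dict[str, list[float]] = {}
    for rec in records:
        rates.setdefault(rec["model"], []).append(rec["tok_per_s"])
    return {m: round(sum(v) / len(v), 1) for m, v in rates.items()}
```

Under that assumed schema, aggregating the per-task records would reproduce the average-throughput row in the efficiency table.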