A practical performance comparison of two open-weights multimodal LLMs running locally on a single RTX 3090 — measuring throughput, latency, VRAM efficiency, and response quality across text, vision, audio, and video modalities. Gemma 4 uniquely supports video; Phi-4 does not.
Models compared: `google/gemma-4-E2B-it` and `microsoft/Phi-4-multimodal-instruct`. Gemma 4 E2B uses 1.5 GB less VRAM than Phi-4 despite a comparable total parameter count (5.1B vs 5.6B). The Per-Layer Embedding (PLE) architecture, where embedding tables are stored separately and accessed via fast lookups, means the active computation memory footprint is smaller.
Both models run comfortably on 24 GB. Gemma 4's lower VRAM footprint leaves more headroom for longer context windows and KV-cache when processing large documents or multi-turn conversations.
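To make the PLE idea concrete, here is a minimal illustrative sketch (not Gemma 4's actual implementation): per-layer embedding tables live outside active compute memory, and only the rows selected by the current token ids are gathered per layer. All sizes and names below are toy values chosen for illustration.

```python
# Illustrative sketch of per-layer embedding (PLE) lookup, NOT Gemma 4's
# real code: tables are stored off to the side, and only the rows needed
# for the current tokens are materialized into active memory.

import array

NUM_LAYERS = 4
VOCAB = 8   # toy vocabulary size
DIM = 3     # toy embedding width

# One flat table per layer, kept "off to the side" (e.g. host RAM).
tables = [array.array("f", range(VOCAB * DIM)) for _ in range(NUM_LAYERS)]

def ple_lookup(layer: int, token_ids: list[int]) -> list[list[float]]:
    """Gather only the embedding rows needed for one layer."""
    t = tables[layer]
    return [list(t[tid * DIM:(tid + 1) * DIM]) for tid in token_ids]

# Active memory holds len(token_ids) rows per layer, not all VOCAB rows.
rows = ple_lookup(0, [2, 5])
print(len(rows), len(rows[0]))  # 2 3
```

The point of the sketch: the cost of the lookup scales with the number of active tokens, not the vocabulary size, which is why the full tables do not need to sit in GPU VRAM.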
Gemma 4 E2B's solution:

```python
import math

def is_prime(n: int) -> bool:
    """
    Checks if a given integer is a prime number.
    Uses trial division up to sqrt(n) for efficiency.
    """
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0: return False
    return True
```
Phi-4's solution:

```python
def is_prime(n: int) -> bool:
    """Check if a number is prime.

    A prime number is a natural number greater
    than 1 that has no positive divisors other
    than 1 and itself.
    """
    if n < 2: return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0: return False
    return True
```
Gemma 4's solution uses `math.sqrt` for clarity and skips even numbers after 2 (slightly more efficient). Phi-4's solution is slightly simpler and also correct.
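As a quick sanity check, the two variants agree on small inputs. They are renamed here so both can coexist in one file; the bodies are taken verbatim from the solutions above.

```python
import math

def is_prime_gemma(n: int) -> bool:
    # Gemma 4's variant: math.sqrt bound, odd divisors only after 2.
    if n < 2: return False
    if n == 2: return True
    if n % 2 == 0: return False
    for i in range(3, int(math.sqrt(n)) + 1, 2):
        if n % i == 0: return False
    return True

def is_prime_phi(n: int) -> bool:
    # Phi-4's variant: n**0.5 bound, all divisors from 2.
    if n < 2: return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0: return False
    return True

# Both implementations agree across a range of inputs.
assert all(is_prime_gemma(n) == is_prime_phi(n) for n in range(200))
primes = [n for n in range(30) if is_prime_gemma(n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```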
For text tasks requiring >50 tokens, Gemma 4 E2B averages ~49.8 tok/s vs Phi-4's ~39.2 tok/s — a 27% throughput advantage. This gap widens on longer outputs. For very short responses (1–2 sentences), Phi-4 can appear faster due to Gemma 4's slightly higher first-token latency.
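The tok/s figures above are wall-clock measurements. A minimal sketch of how tokens-per-second can be computed follows; the `generate` callable is a stub standing in for the real model-server call, which is an assumption of this sketch.

```python
import time

def measure_throughput(generate, prompt: str) -> tuple[int, float, float]:
    """Time one generation call and derive tokens per second.

    `generate` is any callable returning a list of output tokens.
    Here it is stubbed; the real call would hit a local model endpoint.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens), elapsed, len(tokens) / elapsed

# Stub generator returning a fixed number of tokens (hypothetical).
fake_generate = lambda prompt: ["tok"] * 100

n, secs, tps = measure_throughput(fake_generate, "Explain PLE briefly.")
print(f"{n} tokens in {secs:.4f}s -> {tps:.1f} tok/s")
```

First-token latency would be measured separately, by timestamping the first chunk of a streaming response rather than the whole call.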
Gemma 4's vision token throughput (~48 tok/s) is roughly 1.7× Phi-4's (~28 tok/s average). The vision encoder in Gemma 4 E2B (~150M params) processes the 400×300 test image efficiently. Response quality is also richer: Gemma 4 consistently produces structured markdown with labeled elements, while Phi-4 gives accurate but more compact prose.
The audio benchmark was re-run with 5 real speech recordings generated by an open-source TTS library covering diverse domains: solar system facts, tea-making instructions, a medical clinical note, a tech news update, and a Spanish language lesson. Both models scored 1.00 accuracy — correctly transcribing and summarising all 5 clips. However, Gemma 4 is 37% faster (44.0 vs 32.2 tok/s) with a dramatically lower first-token latency (1.2s vs 11.0s average). In a separate clinical-session-style recording, Gemma 4 produced a rich contextual analysis that correctly identified the session type, speaker tone, and key themes — a task where Phi-4 returned a shorter, less detailed response. Gemma 4 wins on audio.
Despite a total parameter count comparable to Phi-4's (5.1B with embeddings vs ~5.6B), Gemma 4 E2B uses 1.5 GB less GPU memory. The Per-Layer Embedding (PLE) design means embedding tables are accessed via lookup operations rather than held in active computation memory. This leaves more VRAM headroom for KV-cache and longer contexts.
Gemma 4 consistently uses markdown formatting (bullet points, bold headings, code blocks with imports). This makes it better suited for document-style outputs and agentic tasks. Phi-4 tends toward concise, direct prose — often better for chat-style applications where brevity is valued. Neither approach is universally superior; it depends on the use case.
Gemma 4 E2B supports enable_thinking=True which activates step-by-step internal reasoning before the final answer. This is critical for complex multi-step problems and is not available in Phi-4. Gemma 4's 128K context window (vs ~16K for Phi-4) makes it suitable for long-document tasks: summarising entire codebases, legal documents, or research papers.
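A hedged sketch of the request shape this implies, built without loading the model: the `enable_thinking` kwarg and message layout follow this write-up's description, but the exact supported signature should be confirmed against the model card before relying on it.

```python
# Hedged sketch: constructs chat-template inputs as described above,
# without loading any model. The enable_thinking kwarg is taken from
# this write-up; verify the exact signature against the model card.

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "A train leaves at 3pm at 80 km/h..."},
    ]},
]

template_kwargs = {
    "add_generation_prompt": True,
    "enable_thinking": True,  # step-by-step reasoning before the final answer
}

# With a real processor this would be passed along the lines of:
#   inputs = processor.apply_chat_template(messages, **template_kwargs)
print(template_kwargs["enable_thinking"])
```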
Gemma 4 E2B can process video up to 60 seconds by sampling frames at 1 fps (up to 60 frames). It also extracts and processes audio embedded in video (load_audio_from_video=True). Tested with a real 8-second 1280×720 MP4 — Gemma 4 delivered a frame-by-frame timestamped analysis in 13.3s at 42.9 tok/s, correctly identifying the person, setting, clothing, expressions, and hand gestures across all 8 frames. Phi-4 has zero video support — this is an exclusive capability of Gemma 4.
"video" field. On the server, it is written to a temp file and passed as
{"type": "video", "video": path} in the Gemma 4 message content.
processor.apply_chat_template(..., processor_kwargs={"num_frames": 8, "load_audio_from_video": True})
extracts 8 frames uniformly via the av (PyAV) backend.
The model receives all frames as a visual sequence and generates a single unified response.
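The upload path above can be sketched end to end. This is a hedged illustration: the `"video"` field and the `{"type": "video", "video": path}` content shape come from this write-up, while the payload builder, the `"prompt"` field, and the server function are hypothetical names for illustration.

```python
# Hedged sketch of the video upload path: base64 in a "video" field,
# written server-side to a temp file, temp path placed into the message
# content. The endpoint and field names beyond "video" are hypothetical.

import base64
import json
import tempfile

def build_request(video_bytes: bytes, prompt: str) -> str:
    """Client side: serialize the clip as base64 JSON."""
    return json.dumps({
        "prompt": prompt,
        "video": base64.b64encode(video_bytes).decode("ascii"),
    })

def server_side(payload: str) -> list[dict]:
    """Server side: decode to a temp file, build Gemma 4 message content."""
    req = json.loads(payload)
    with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as f:
        f.write(base64.b64decode(req["video"]))
        path = f.name
    return [
        {"type": "video", "video": path},
        {"type": "text", "text": req["prompt"]},
    ]

content = server_side(build_request(b"\x00\x00\x00\x18ftypmp42", "Describe the clip."))
print(content[0]["type"], "->", content[1]["text"])
```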
Gemma 4 is faster on text (+30%), 2.8× faster on vision (49.5 vs 17.9 tok/s), 37% faster on audio (44.0 vs 32.2 tok/s), uses less VRAM (−1.5 GB), offers a larger context (128K vs 16K), supports thinking mode, shows superior real-speech comprehension, and has an exclusive capability Phi-4 simply cannot match: video understanding. Gemma 4 E2B analyzed an 8-second clinical video frame-by-frame in 13.3s at 42.9 tok/s. Phi-4 scores higher on text accuracy (1.00 vs 0.80) but trades throughput to get there. For production multimodal workloads, Gemma 4 is the clear choice.
(1) Benchmark `enable_thinking=True` on hard reasoning (AIME, GPQA) and compare quality.
(2) Test Gemma 4's 128K context with long clinical session recordings vs Phi-4's ~16K limit.
(3) Measure WER on real speech with known ground-truth transcripts.
(4) Test longer videos (30–60 s) to see how frame sampling affects analysis quality.
(5) Production batching — both APIs currently use a single-thread lock; Gemma 4 with vLLM/TGI could push throughput 3–5× higher.
To reproduce, use `generate_dataset.py` and `benchmark_dataset.py` (video tests require `pip install av`):

```shell
python3 generate_dataset.py
python3 benchmark_dataset.py --url http://<model-endpoint> --model phi4 --suite all
python3 benchmark_dataset.py --url http://<model-endpoint> --model gemma4 --suite all
bash start.sh  # start local benchmark UI
```