Context
We’ve been benchmarking Qwen3-Embedding-4B on vLLM v0.19.0 using genai-bench and getting perplexing, disappointing results. We also benchmarked a similarly sized text-generation model (Qwen3-4B), and with all other settings equal, the generation model performs much better than the embedding model. No matter which parameters we adjust, the embedding-model runs show:
- latency seems higher than it should be
- throughput is relatively low
- requests per second actually drop as concurrency rises
- we never get anywhere close to saturating the GPU
We greatly appreciate any guidance you all might be able to provide!
Questions
- Why is the GPU not saturated for the embedding model benchmark?
- Why does RPS drop as concurrency rises, and only for the embedding-model run?
- Is vLLM well suited in the first place for embedding models like Qwen3-Embedding-4B?
- What might be the most effective adjustments we can make to get better throughput, especially as concurrency rises?
Benchmark Metadata
Below is the metadata for both the text-generation and the embedding benchmark runs.
Qwen3-4B on g7e.2xlarge with D(8,50) (vLLM v0.19.0)
- vLLM Params
  - Qwen/Qwen3-4B
  - --served-model-name "Qwen/Qwen3-4B" "Qwen-Qwen3-4B"
  - --uvicorn-log-level "warning"
  - --gpu-memory-utilization "0.95"
  - --quantization "fp8"
  - --max-num-batched-tokens "4096"
  - --max-num-seqs "128"
  - --performance-mode "interactivity"
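Put together, those arguments correspond roughly to the following launch command (a sketch, assuming the standard vllm serve entrypoint):

  vllm serve Qwen/Qwen3-4B \
    --served-model-name Qwen/Qwen3-4B Qwen-Qwen3-4B \
    --uvicorn-log-level warning \
    --gpu-memory-utilization 0.95 \
    --quantization fp8 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 128 \
    --performance-mode interactivity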
- genai-bench Command
  - genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-4B --task text-to-text --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-4B --traffic-scenario "D(8,50)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2
- E2E Latencies
  - P50 – 348.3 ms
  - P90 – 363.8 ms
  - P99 – 404.7 ms
- Token Throughput per Sec
  - Input – 2793.7
  - Output – 8692.0
- RPS – 173.8
- GPU Utilization – 100%
- Graphs
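Sanity check on these numbers: 8692.0 output tokens/s ÷ 173.8 RPS ≈ 50 output tokens per request, which matches the D(8,50) scenario, and 2793.7 ÷ 173.8 ≈ 16 input tokens per request (the 8-token prompt plus, presumably, chat-template overhead).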
Qwen3-Embedding-4B on g7e.2xlarge with E(8) (vLLM v0.19.0)
- vLLM Params
  - Qwen/Qwen3-Embedding-4B
  - --served-model-name "Qwen/Qwen3-Embedding-4B" "Qwen-Qwen3-Embedding-4B"
  - --uvicorn-log-level "warning"
  - --gpu-memory-utilization "0.95"
  - --quantization "fp8"
  - --max-num-batched-tokens "4096"
  - --max-num-seqs "128"
  - --performance-mode "interactivity"
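The launch command is the same sketch shown for the generation run, with Qwen/Qwen3-Embedding-4B substituted as the model.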
- genai-bench Command
  - genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-Embedding-4B --task text-to-embeddings --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-Embedding-4B --traffic-scenario "E(8)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2
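If we understand genai-bench's vLLM backend correctly, each request it issues here is an OpenAI-style call against the server's /v1/embeddings endpoint, roughly equivalent to this sketch (the host/port and document text are placeholders):

  curl -s http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer x" \
    -d '{"model": "Qwen-Qwen3-Embedding-4B", "input": ["example document text"]}'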
- E2E Latencies
  - P50 – 98.1 ms
  - P90 – 107.6 ms
  - P99 – 126.9 ms
- Token Throughput per Sec
  - Input – 1858.9
  - Output – N/A
- RPS – 8.5
- GPU Utilization – 19%
- Graphs
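Back-of-envelope on the embedding run: 1858.9 input tokens/s ÷ 8.5 RPS ≈ 219 tokens per request. If we read --max-num-batched-tokens 4096 correctly, that is room for only ~18 such requests per scheduling step, and at 19% utilization the GPU is clearly not compute-bound, which is what prompted the questions above.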

