Context
We’ve been benchmarking Qwen3-Embedding-4B on vLLM v0.19.0 using genai-bench and getting perplexing, disappointing results. We also benchmarked a similarly sized text-generation model (Qwen3-4B), and with all other settings equal, the generation model performs much better than the embedding model. No matter which parameters we adjust, the embedding-model runs show:
- latency seems higher than it should be
- throughput is relatively low
- requests per second actually drop as concurrency rises
- we never get anywhere close to saturating the GPU
We greatly appreciate any guidance you all might be able to provide!
Questions
- Why is the GPU not saturated for the embedding model benchmark?
- Why does RPS drop as concurrency rises, and only for the embedding-model run?
- Is vLLM well suited in the first place for embedding models like Qwen3-Embedding-4B?
- What might be the most effective adjustments we can make to get better throughput, especially as concurrency rises?
Benchmark Metadata
Below is the metadata for both the text-generation and the embedding benchmark runs.
Qwen3-4B on g7e.2xlarge with D(8,50) (vLLM v0.19.0)
- vLLM Params
  - Qwen/Qwen3-4B
  - --served-model-name "Qwen/Qwen3-4B" "Qwen-Qwen3-4B"
  - --uvicorn-log-level "warning"
  - --gpu-memory-utilization "0.95"
  - --quantization "fp8"
  - --max-num-batched-tokens "4096"
  - --max-num-seqs "128"
  - --performance-mode "interactivity"
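Put together, those arguments correspond roughly to the following launch command (a sketch, assuming the standard vllm serve entrypoint):

  vllm serve Qwen/Qwen3-4B \
    --served-model-name Qwen/Qwen3-4B Qwen-Qwen3-4B \
    --uvicorn-log-level warning \
    --gpu-memory-utilization 0.95 \
    --quantization fp8 \
    --max-num-batched-tokens 4096 \
    --max-num-seqs 128 \
    --performance-mode interactivity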
- genai-bench Command
  - genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-4B --task text-to-text --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-4B --traffic-scenario "D(8,50)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2
- E2E Latencies
  - P50 – 348.3 ms
  - P90 – 363.8 ms
  - P99 – 404.7 ms
- Token Throughput per Sec
  - Input – 2793.7
  - Output – 8692.0
- RPS – 173.8
- GPU Utilization – 100%
- Graphs
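Sanity check on these numbers: 8692.0 output tokens/s ÷ 173.8 RPS ≈ 50 output tokens per request, which matches the D(8,50) scenario, and 2793.7 ÷ 173.8 ≈ 16 input tokens per request (the 8-token prompt plus, presumably, chat-template overhead).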
Qwen3-Embedding-4B on g7e.2xlarge with E(8) (vLLM v0.19.0)
- vLLM Params
  - Qwen/Qwen3-Embedding-4B
  - --served-model-name "Qwen/Qwen3-Embedding-4B" "Qwen-Qwen3-Embedding-4B"
  - --uvicorn-log-level "warning"
  - --gpu-memory-utilization "0.95"
  - --quantization "fp8"
  - --max-num-batched-tokens "4096"
  - --max-num-seqs "128"
  - --performance-mode "interactivity"
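The launch command is the same sketch shown for the generation run, with Qwen/Qwen3-Embedding-4B substituted as the model.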
- genai-bench Command
  - genai-bench benchmark --api-backend vllm --api-base --api-model-name Qwen-Qwen3-Embedding-4B --task text-to-embeddings --api-key "x" --num-workers 16 --server-gpu-count 1 --model-tokenizer Qwen/Qwen3-Embedding-4B --traffic-scenario "E(8)" --metrics-time-unit ms --spawn-rate 50 --max-requests-per-run 5000 --max-time-per-run 2
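If we understand genai-bench's vLLM backend correctly, each request it issues here is an OpenAI-style call against the server's /v1/embeddings endpoint, roughly equivalent to this sketch (the host/port and document text are placeholders):

  curl -s http://localhost:8000/v1/embeddings \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer x" \
    -d '{"model": "Qwen-Qwen3-Embedding-4B", "input": ["example document text"]}'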
- E2E Latencies
  - P50 – 98.1 ms
  - P90 – 107.6 ms
  - P99 – 126.9 ms
- Token Throughput per Sec
  - Input – 1858.9
  - Output – N/A
- RPS – 8.5
- GPU Utilization – 19%
- Graphs
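Back-of-envelope on the embedding run: 1858.9 input tokens/s ÷ 8.5 RPS ≈ 219 tokens per request. If we read --max-num-batched-tokens 4096 correctly, that is room for only ~18 such requests per scheduling step, and at 19% utilization the GPU is clearly not compute-bound, which is what prompted the questions above.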

