How to improve tokenization speed for embeddings generation?

RunLLM · August 28, 2025, 9:25am

Tokenization in vLLM is CPU-bound, not GPU-bound. The “Adding requests” stage is slow because the prompts in a large batch (like your 6M prompts) are tokenized on the CPU, which becomes a bottleneck at that input size. This is a known limitation and is not specific to your hardware: even on a powerful CPU, tokenization throughput is much lower than GPU inference throughput, and vLLM currently neither parallelizes tokenization across multiple CPU cores by default nor offloads it to the GPU. See the discussion in vLLM GitHub Issue #1026.

To improve performance, you can parallelize tokenization yourself (e.g., with multiprocessing or a custom tokenizer pool), or pre-tokenize your inputs before passing them to vLLM. Note that vLLM’s built-in batching and tokenization are not optimized for extremely large input arrays in a single call. vLLM GitHub Issue #1026 covers this bottleneck and possible workarounds in more detail.
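As a rough illustration of the pre-tokenization idea, here is a minimal sketch that splits the prompts into chunks, tokenizes the chunks on a worker pool, and wraps each result as a `{"prompt_token_ids": [...]}` dict. The helper names (`chunked`, `pretokenize`, `toy_tokenize_batch`), the chunk size, and the worker count are all assumptions made for this example, not part of vLLM. The toy tokenizer is only a stand-in so the snippet runs without downloading a model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List, Sequence

# A batch tokenizer: takes a list of texts, returns one list of token ids per text.
TokenizeBatch = Callable[[Sequence[str]], List[List[int]]]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pretokenize(
    prompts: Sequence[str],
    tokenize_batch: TokenizeBatch,
    chunk_size: int = 10_000,   # assumed value; tune for your data
    workers: int = 8,           # assumed value; tune for your CPU
) -> List[Dict[str, List[int]]]:
    """Tokenize `prompts` chunk by chunk on a worker pool, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in submission order, so the output
        # order matches the input order even if chunks finish out of order.
        per_chunk = pool.map(tokenize_batch, chunked(prompts, chunk_size))
        return [
            {"prompt_token_ids": ids}
            for chunk_ids in per_chunk
            for ids in chunk_ids
        ]


def toy_tokenize_batch(batch: Sequence[str]) -> List[List[int]]:
    """Stand-in tokenizer (hypothetical): one id per whitespace-split word."""
    return [[len(word) for word in text.split()] for text in batch]


if __name__ == "__main__":
    token_prompts = pretokenize(
        ["first prompt", "second one here"], toy_tokenize_batch, chunk_size=1
    )
    print(token_prompts)
```

With a Hugging Face tokenizer `tok`, `tokenize_batch` could be something like `lambda batch: tok(list(batch))["input_ids"]`, with truncation set to match your model's maximum length. Threads only help if the tokenizer releases the GIL while encoding; for a pure-Python tokenizer, swapping in `ProcessPoolExecutor` is the likely alternative. The resulting dicts follow the token-prompt form that vLLM accepts in place of text prompts, but check this against the vLLM version you are running, and consider submitting them in several calls rather than one 6M-element list.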

Would you like a step-by-step guide on how to pre-tokenize or parallelize tokenization for your use case?

Sources:

vLLM GitHub Issue #1026
