Tokenization in vLLM is CPU-bound, not GPU-bound. The “Adding requests” stage is slow because vLLM tokenizes prompts on the CPU, and with a batch as large as your 6M prompts this step becomes the bottleneck. This is a known limitation and is not specific to your hardware: even on a powerful CPU, tokenization throughput can fall well below GPU inference throughput, and vLLM does not currently parallelize tokenization across multiple CPU cores by default or offload it to the GPU. See the discussion in vLLM GitHub Issue #1026.
To improve performance, you can parallelize tokenization yourself (e.g., with multiprocessing or a custom tokenizer pool) or pre-tokenize your inputs before passing them to vLLM. Keep in mind that vLLM’s built-in batching and tokenization are not optimized for extremely large input arrays in a single call. vLLM GitHub Issue #1026 covers this bottleneck and possible workarounds in more detail.
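As a rough illustration of the pre-tokenization idea, here is a minimal sketch that splits the prompts into chunks and tokenizes the chunks in parallel worker processes. Everything named here (`chunked`, `toy_encode_batch`, `pretokenize`, and the `workers` / `chunk_size` values) is hypothetical and not part of vLLM. `toy_encode_batch` is only a stand-in so the example runs on its own; in practice you would swap in a function that calls your model's real tokenizer.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence, Type


def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def toy_encode_batch(texts: Sequence[str]) -> List[List[int]]:
    """Hypothetical stand-in tokenizer: maps each character to its code point.

    Replace this with a module-level function that calls your real tokenizer
    and returns one list of token IDs per input text.
    """
    return [[ord(ch) for ch in text] for text in texts]


def pretokenize(
    prompts: Sequence[str],
    encode_batch: Callable[[Sequence[str]], List[List[int]]] = toy_encode_batch,
    workers: int = 4,            # assumption: tune to your CPU core count
    chunk_size: int = 10_000,    # assumption: tune to your prompt lengths
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[List[int]]:
    """Tokenize `prompts` in parallel, preserving the input order."""
    chunks = list(chunked(prompts, chunk_size))
    with executor_cls(max_workers=workers) as pool:
        # Executor.map returns results in the same order as the input chunks.
        per_chunk = pool.map(encode_batch, chunks)
        return [ids for chunk_ids in per_chunk for ids in chunk_ids]
```

With the default `ProcessPoolExecutor`, `encode_batch` has to be a picklable module-level function, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard. `ThreadPoolExecutor` can be passed as `executor_cls` for quick experiments. The resulting token ID lists can then be handed to vLLM in place of text prompts; the exact way to pass token IDs to `LLM.generate` differs between vLLM versions, so check the documentation for the release you are running.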
Would you like a step-by-step guide on how to pre-tokenize or parallelize tokenization for your use case?