How to improve tokenization speed for embeddings generation?

Hi, I’ve been trying to generate embeddings with the Qwen/Qwen3-Embedding-8B model, and the “Adding requests” stage is extremely slow. As I understand it, tokenization is the slow part. Is it CPU- or GPU-bound?

Adding requests: 100%| 6400627/6400627 [17:48<00:00, 5988.69it/s]

I’m using an H200 GPU and a Xeon 8568Y+ CPU with 2 TB of RAM. 5988.69 it/s is incredibly slow. Any suggestions on how to improve generation performance?

        model = LLM(
            model=args.model,
            task="embed",
            trust_remote_code=True,
            max_model_len=args.max_model_len,
            gpu_memory_utilization=args.gpu_memory_utilization,
            enforce_eager=args.enforce_eager,
            max_num_seqs=args.max_num_seqs,
            dtype=args.dtype,
            max_num_batched_tokens=args.max_num_batched_tokens,
        )

        outputs = model.embed(instructed_texts)

where instructed_texts is an array of ~6M prompts. The model is run with the following params:

  --model "Qwen/Qwen3-Embedding-8B" \
  --embedding-dim 4096 \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 1048576 \
  --max-num-seqs 2048 \
  --max-model-len 8192 \
  --dtype bfloat16 \
 Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='Qwen/Qwen3-Embedding-8B', speculative_config=None, tokenizer='Qwen/Qwen3-Embedding-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Embedding-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512

Tokenization in vLLM is CPU-bound, not GPU-bound. The “Adding requests” stage is slow because tokenization for large batches (like your 6M prompts) is performed on the CPU, which becomes a bottleneck at this scale. This is a known limitation and is not specific to your hardware: even on a powerful CPU, tokenization throughput is far lower than GPU inference throughput, and vLLM does not currently parallelize tokenization across multiple CPU cores by default, nor offload it to the GPU. See the discussion in vLLM GitHub Issue #1026.

To improve performance, you can parallelize tokenization yourself (e.g., with multiprocessing or a custom tokenizer pool), or pre-tokenize your inputs before passing them to vLLM. vLLM’s built-in batching and tokenization are not optimized for extremely large input arrays submitted in a single call.
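As a sketch of the parallel pre-tokenization idea: split the prompt list into chunks and tokenize the chunks across worker processes, then hand the resulting token IDs to vLLM (recent vLLM versions accept pre-tokenized input via `vllm.TokensPrompt`). The chunking/pooling logic below is generic; the stand-in `tokenize_chunk` body and all parameter values are assumptions about your pipeline, not vLLM or Transformers APIs.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import chain


def chunked(seq, size):
    """Split a list into contiguous chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def tokenize_chunk(texts):
    # Stand-in for a real tokenizer call. In practice you would load the
    # fast tokenizer once per worker (not per chunk) and batch-encode, e.g.:
    #   tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-8B")
    #   return tok(texts, add_special_tokens=True)["input_ids"]
    return [t.split() for t in texts]


def parallel_tokenize(texts, workers=8, chunk_size=10_000):
    """Tokenize `texts` across `workers` processes, preserving order."""
    chunks = chunked(texts, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(tokenize_chunk, chunks)
    return list(chain.from_iterable(results))
```

If your vLLM version supports it, the pre-tokenized IDs can then be wrapped as `TokensPrompt(prompt_token_ids=ids)` and passed to `model.embed(...)`, so the “Adding requests” stage skips string tokenization entirely. Note that Hugging Face fast (Rust-backed) tokenizers already parallelize within a single batched `tok(texts)` call, so benchmark before adding multiprocessing on top.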
