How to improve tokenization speed for embeddings generation?

Hi, I’ve been trying to generate embeddings with the Qwen/Qwen3-Embedding-8B model, and the “Adding requests” stage is extremely slow. As I understand it, tokenization is the slow part. Is it CPU- or GPU-bound?

Adding requests: 100%| 6400627/6400627 [17:48<00:00, 5988.69it/s]

I’m using an H200 GPU and a Xeon 8568Y+ CPU with 2 TB of RAM. 5988.69 it/s is incredibly slow. Any suggestions on how to improve generation performance?

        model = LLM(
            model=args.model,
            task="embed",
            trust_remote_code=True,
            max_model_len=args.max_model_len,
            gpu_memory_utilization=args.gpu_memory_utilization,
            enforce_eager=args.enforce_eager,
            max_num_seqs=args.max_num_seqs,
            dtype=args.dtype,
            max_num_batched_tokens=args.max_num_batched_tokens,
        )

        outputs = model.embed(instructed_texts)

where instructed_texts is an array of ~6M prompts. The model is run with the following params:

  --model "Qwen/Qwen3-Embedding-8B" \
  --embedding-dim 4096 \
  --gpu-memory-utilization 0.98 \
  --max-num-batched-tokens 1048576 \
  --max-num-seqs 2048 \
  --max-model-len 8192 \
  --dtype bfloat16 \
 Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='Qwen/Qwen3-Embedding-8B', speculative_config=None, tokenizer='Qwen/Qwen3-Embedding-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Embedding-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512

Tokenization in vLLM is CPU-bound, not GPU-bound. The “Adding requests” stage is slow because tokenization for large batches (like your 6M prompts) is performed on the CPU, which becomes a bottleneck at this scale. This is a known limitation and is not specific to your hardware: even on a powerful CPU, tokenization throughput is far lower than GPU inference throughput, and vLLM does not currently parallelize tokenization across multiple CPU cores by default, nor offload it to the GPU. See the discussion in vLLM GitHub Issue #1026.

To improve performance, you can parallelize tokenization yourself (e.g., with multiprocessing or a custom tokenizer pool), or pre-tokenize your inputs before passing them to vLLM. vLLM’s built-in batching and tokenization are not optimized for extremely large input arrays submitted in a single call.
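As a sketch of the parallel pre-tokenization idea: split the prompt list into chunks and tokenize the chunks across worker processes, then hand the resulting token IDs to vLLM (recent vLLM versions accept pre-tokenized input via `vllm.TokensPrompt`). The chunking/pooling logic below is generic; the stand-in `tokenize_chunk` body and all parameter values are assumptions about your pipeline, not vLLM or Transformers APIs.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import chain


def chunked(seq, size):
    """Split a list into contiguous chunks of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def tokenize_chunk(texts):
    # Stand-in for a real tokenizer call. In practice you would load the
    # fast tokenizer once per worker (not per chunk) and batch-encode, e.g.:
    #   tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-8B")
    #   return tok(texts, add_special_tokens=True)["input_ids"]
    return [t.split() for t in texts]


def parallel_tokenize(texts, workers=8, chunk_size=10_000):
    """Tokenize `texts` across `workers` processes, preserving order."""
    chunks = chunked(texts, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(tokenize_chunk, chunks)
    return list(chain.from_iterable(results))
```

If your vLLM version supports it, the pre-tokenized IDs can then be wrapped as `TokensPrompt(prompt_token_ids=ids)` and passed to `model.embed(...)`, so the “Adding requests” stage skips string tokenization entirely. Note that Hugging Face fast (Rust-backed) tokenizers already parallelize within a single batched `tok(texts)` call, so benchmark before adding multiprocessing on top.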
