Hi, I’ve been trying to generate embeddings with the Qwen/Qwen3-Embedding-8B model, and the "Adding requests" stage is extremely slow. As I understand it, this stage is dominated by tokenization. Is it CPU-bound or GPU-bound?
Adding requests: 100%| 6400627/6400627 [17:48<00:00, 5988.69it/s]
I’m using an H200 GPU and a Xeon 8568Y+ CPU with 2 TB of RAM. 5988.69 it/s is incredibly slow for this hardware. Any suggestions on how to improve generation performance?
model = LLM(
    model=args.model,
    task="embed",
    trust_remote_code=True,
    max_model_len=args.max_model_len,
    gpu_memory_utilization=args.gpu_memory_utilization,
    enforce_eager=args.enforce_eager,
    max_num_seqs=args.max_num_seqs,
    dtype=args.dtype,
    max_num_batched_tokens=args.max_num_batched_tokens,
)
outputs = model.embed(instructed_texts)
where instructed_texts is a list of ~6.4M prompts. The model is run with the following parameters:
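One workaround I'm considering is splitting the 6.4M prompts into chunks, so tokenization is amortized over several embed calls instead of one enormous "Adding requests" pass. A sketch; the chunked helper is mine, and the commented loop assumes the LLM object above:

```python
def chunked(seq, size):
    """Yield successive slices of seq with at most `size` elements each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Hypothetical usage with the model above (not executed here):
# outputs = []
# for batch in chunked(instructed_texts, 100_000):
#     outputs.extend(model.embed(batch))
```

This doesn't make tokenization faster, but it overlaps it with GPU work and keeps memory for pending requests bounded.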
--model "Qwen/Qwen3-Embedding-8B" \
--embedding-dim 4096 \
--gpu-memory-utilization 0.98 \
--max-num-batched-tokens 1048576 \
--max-num-seqs 2048 \
--max-model-len 8192 \
--dtype bfloat16
Initializing a V1 LLM engine (v0.10.2.dev2+gf5635d62e.d20250807) with config: model='Qwen/Qwen3-Embedding-8B', speculative_config=None, tokenizer='Qwen/Qwen3-Embedding-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Embedding-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=True, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512
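If tokenization turns out to be single-threaded on the CPU, another option would be pre-tokenizing in a multiprocessing pool and handing vLLM token IDs directly via TokensPrompt, which I believe recent vLLM versions accept in embed(). A sketch with a stdlib stand-in tokenizer so it runs anywhere; in practice the worker would call the HF fast tokenizer for Qwen/Qwen3-Embedding-8B:

```python
from multiprocessing import Pool

def encode(text):
    """Stand-in tokenizer. In practice, replace with the real one, e.g.
    AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-8B").encode(text)."""
    return [hash(tok) % 151_936 for tok in text.split()]  # fake token IDs

def pretokenize(texts, workers=8, chunksize=1000):
    """Tokenize texts in parallel across CPU cores."""
    with Pool(workers) as pool:
        return pool.map(encode, texts, chunksize=chunksize)

# Hypothetical hand-off to vLLM (not executed here; assumes my setup above):
# from vllm.inputs import TokensPrompt
# prompts = [TokensPrompt(prompt_token_ids=ids) for ids in pretokenize(texts)]
# outputs = model.embed(prompts)
```

With 6.4M prompts and a many-core Xeon, parallel pre-tokenization should cut the 17:48 wall time roughly by the number of workers, assuming the tokenizer releases no shared state across processes.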