TL;DR: top shows 100% CPU (i.e., a single core) while GPU usage is very low for extended periods when using grammar.
I’d expect to see something closer to 13×100% on a VM with 13 physical cores if it were perfectly efficient; certainly more than one core.
Are there any settings I am missing to say “when applying grammar, use all my gazillion cores that for some reason Google requires me to have when using an H100 card”, or is this a known limitation of the xgrammar backend?
In detail:
I’m using today’s version of the CUDA server-variant Docker image for vLLM, on a GCE a3-highgpu-1g (26 vCPU, scads of regular RAM, one NVIDIA H100 80GB GPU).
I’m playing with the DeepSeek distills, currently deepseek-ai/DeepSeek-R1-Distill-Llama-8B (unquantized), using the default prompt template.
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=xxx" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B
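(If it matters, the one setting I could think to experiment with is the guided-decoding backend selector; a sketch of what I mean, assuming the --guided-decoding-backend flag exists in this image version and accepts these values, which may differ between releases:)

docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=xxx" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B --guided-decoding-backend outlines
# or pass --guided-decoding-backend xgrammar explicitly, to rule out backend auto-selection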
I’m using guided_grammar to restrict the output to a somewhat sophisticated grammar, my own conversion of a JSON schema. I’ve played with different forms of support for the <think> prelude in the grammar, including treating [^{]* as an entire acceptable prelude before the "{" that starts the JSON. I’m aware of the issue with DeepSeek having <think> baked into the prompt template (and hence not present in the grammar-checked response). (Strangely, llama.cpp didn’t have that issue, and grammar with <think> worked fine there, whereas here with vLLM, if you omit <think> from a custom template you get no thinking. But I digress…)
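For concreteness, here is the prelude idea reduced to a minimal sketch in GBNF-style notation (rule names are placeholders; the real json-object rules come from my JSON-schema conversion, and the exact dialect passed depends on the backend):

root        ::= prelude json-object
prelude     ::= [^{]*                 # accept any thinking text up to the first "{"
json-object ::= "{" [^}]* "}"         # stand-in for the real schema-derived rules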
I’m aware that grammar effectively restricts each token, and that the model needs to “want” to generate matching text. So I’m engineering my prompt to ensure that the output I get when grammar isn’t being used typically matches (or nearly matches) the grammar restrictions.
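For reference, the shape of request I’m timing looks roughly like this (prompt and grammar cut down to stand-ins; the real grammar is the full JSON-schema conversion):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
        "messages": [{"role": "user", "content": "Summarise the weather as JSON."}],
        "max_tokens": 1024,
        "guided_grammar": "root ::= [^{]* \"{\" [^}]* \"}\""
      }'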
When I run the same kinds of queries without grammar, I get around 8 seconds turnaround time from the chat completions endpoint, and GPU usage is approaching 100% for the majority of that time. With grammar, I get around 24 seconds turnaround time, and GPU usage spikes a couple of times to maybe 40% but sits mostly around 15% (according to nvidia-smi -l 1).
Meanwhile, top shows almost exactly 100% CPU, constantly, throughout the request, dropping to near zero before and after.
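(For completeness, this is roughly how I’m watching the two sides; any per-core CPU view tells the same story:)

nvidia-smi -l 1     # GPU utilisation, sampled every second
top                 # then press "1" to show per-core CPU usage
mpstat -P ALL 1     # per-core view, if sysstat is installed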