Grammar CPU-bound performance

TL;DR: when using a grammar, top shows 100% CPU while GPU usage stays very low for extended periods.

I’d expect to see perhaps 13 × 100% on a 13 (real) core VM if it were perfectly efficient. Certainly more than one core.

Are there any settings I am missing to say “when applying a grammar, use all of my gazillion cores that for some reason Google requires me to have when using an H100 card”, or is this a known limitation of the xgrammar backend?

In detail:

I’m using today’s version of the CUDA server variant of the vLLM Docker image, on a GCE a3-highgpu-1g (26 vCPU, scads of regular RAM, 1 NVIDIA H100 80 GB GPU).

I’m playing with the deepseek distills, currently deepseek-ai/DeepSeek-R1-Distill-Llama-8B (unquantized). Using the default prompt template.

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B

I’m using guided_grammar to restrict output to a somewhat sophisticated grammar, my own conversion of a JSON schema. I’ve played with different forms of support for a grammar prelude, including treating [^{]* as an entire acceptable prelude before the "{" JSON start. I’m aware of the issues with DeepSeek having <think> baked into the prompt template (and hence not present in the grammar-checked response). (Strangely, llama.cpp didn’t have that issue, and the grammar with <think> worked fine there, whereas here with vLLM, if you omit <think> in a custom template you get no thinking. But I digress…)
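For reference, the requests look roughly like this (a minimal sketch against vLLM’s OpenAI-compatible endpoint; the real prompt and grammar are far more elaborate, and the toy grammar here is only for illustration):

from openai import OpenAI

# Points at the vLLM server started by the docker command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Toy grammar: tolerate a free-form thinking prelude, then force one JSON object.
grammar = r'''
root   ::= [^{]* "{" [ \n\r\t]* "\"myMessage\"" [ \n\r\t]* ":" [ \n\r\t]* string [ \n\r\t]* "}"
string ::= "\"" [^\"\\]* "\""
'''

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Summarise your mood as JSON."}],
    extra_body={"guided_grammar": grammar},  # vLLM-specific guided decoding parameter
)
print(completion.choices[0].message.content)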

I’m aware that a grammar effectively restricts each token, and that the model needs to “want” to generate matching text. So I’m engineering my prompt to ensure that the output I get when the grammar isn’t applied typically matches (or nearly matches) the grammar restrictions.

When I run the same kinds of queries: without the grammar I get around 8 seconds turnaround on the chat completions endpoint, with GPU usage approaching 100% for most of that time; with the grammar I get around 24 seconds turnaround, with GPU usage mostly around 15%, spiking a couple of times to maybe 40%.

(This is according to nvidia-smi -l 1)

Meanwhile, top shows almost exactly 100% CPU, constantly, throughout the request; it drops to near zero before and after.

Update:

I’ve found dramatically faster results using guided_json, but there’s a bug:

I also found this, which seems related:

You may want to try the guidance backend, though you would likely need to rewrite your grammar, or better yet use the %json { ... } syntax in the Lark grammar; see llguidance/docs/syntax.md at main · guidance-ai/llguidance · GitHub.

Guidance should have near zero pre-compilation times for even the most complex schemas.
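To give a rough idea, here is an untested sketch of what that might look like for a simple object (a free-form thinking prelude, then a JSON object described inline by a JSON schema via %json; check syntax.md above for the exact details):

start: PRELUDE answer
PRELUDE: /[^{]*/
answer: %json {
    "type": "object",
    "properties": {
        "message": { "type": "string" }
    },
    "required": ["message"]
}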

That’s super helpful @mmoskal. It took some digging but I found the reference to it in the vLLM docs. I’ll try that and will post back.

Hi @supersteves, I think the speed issue may also be related to the specific grammar. There could be some specific components in the grammar that make it slow. Could you share the grammar?

Besides, we are working with the vLLM team to enhance xgrammar, significantly increasing the speed of some slower grammars and making it robust to all potential vLLM use cases; this is expected to land in an upcoming version. It could be much better after that!

@Ubospica thanks for your reply.

Although we use a much more extensive grammar (effectively our own JSON-schema compilation), the same degradation happens with this simpler grammar. It’s a basic “raw JSON” grammar augmented to tolerate an arbitrary thinking prelude (terminated by “{”), handling the vLLM / DeepSeek case (without incompatible reasoning-process args) whereby <think> is in the template and thus omitted from the output. (Note that llama.cpp seems to avoid this quirk by way of its own custom-written equivalent of the prompt template, I think. A tangential point.)

root           ::= [^{]* json-value

object ::= "{" ( whitespace | whitespace string whitespace ":" value ( "," whitespace string whitespace ":" value )* ) "}"

array ::= "[" ( whitespace | value ( "," value )* ) "]"

value ::= whitespace ( string | number | object | array | boolean | null ) whitespace

boolean ::= ( "true" | "false" )

null ::= "null"

string ::= "\"" ( [^\"\\\x7F\x00-\x1F\x80-\x9F] | "\\" ( [\"\\/bfnrt] | "u" [0-9a-fA-F]{4,4} ) )* "\""

number ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( "." [0-9]{1,30} )? ( [Ee] [+\x2d]? [0-9]{1,30} )?

integer ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( [Ee] [+]? [0-9]{1,30} )?

whitespace ::= [ \n\r\t]{0,21}

n ::= "{"  whitespace "\"myMessage\"" whitespace ":"  whitespace string whitespace   "," whitespace "\"sentiment\"" whitespace ":"  whitespace number whitespace   "}" whitespace

json-value ::= n

Hi @supersteves, I’ve studied your grammar. I believe the issue lies in how whitespace is handled: the current analyzer handles this whitespace rule slowly. For now, you can use this version of the grammar, where whitespace is inlined directly into the specific rules and the bounded repetition range ({0,21}) has been replaced with a star, which I think should significantly speed things up. This issue should be fixed in a future version.

root           ::= [^{]* json-value

object ::= "{" ( [ \n\r\t]* | [ \n\r\t]* string [ \n\r\t]* ":" value ( "," [ \n\r\t]* string [ \n\r\t]* ":" value )* ) "}"

array ::= "[" ( [ \n\r\t]* | value ( "," value )* ) "]"

value ::= [ \n\r\t]* ( string | number | object | array | boolean | null ) [ \n\r\t]*

boolean ::= ( "true" | "false" )

null ::= "null"

string ::= "\"" ( [^\"\\\x7F\x00-\x1F\x80-\x9F] | "\\" ( [\"\\/bfnrt] | "u" [0-9a-fA-F]{4,4} ) )* "\""

number ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( "." [0-9]{1,30} )? ( [Ee] [+\x2d]? [0-9]{1,30} )?

integer ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( [Ee] [+]? [0-9]{1,30} )?

n ::= "{"  [ \n\r\t]* "\"myMessage\"" [ \n\r\t]* ":"  [ \n\r\t]* string [ \n\r\t]*   "," [ \n\r\t]* "\"sentiment\"" [ \n\r\t]* ":"  [ \n\r\t]* number [ \n\r\t]*   "}"

json-value ::= n
"""

@Ubospica Hey, thank you for that. So, extra rules aren’t just syntactic sugar, they have an impact? I can see this being more problematic with our wider json schema-compiled grammar.

And the * change: so bounded repetition has a cost. Fine. It was there to limit the runaway generation of whitespace you sometimes see with lower-grade models, or where the prompt is a poor match for the grammar. It can be solved with max_tokens in the request instead.
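Concretely, capping it per request is just something like this (a sketch; the cap value is arbitrary, and the client and grammar are as in my earlier snippet):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Summarise your mood as JSON."}],
    max_tokens=1024,                         # hard cap on generated tokens; bounds runaway whitespace
    extra_body={"guided_grammar": grammar},  # `grammar` as earlier, now with unbounded whitespace
)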

So, extra rules aren’t just syntactic sugar, they have an impact

Thanks for asking this. Due to current parser limitations, in a few edge cases, splitting the rules can indeed lead to worse performance. That said, improving the parser is already on our roadmap (there are several enhanced algorithms), and we expect to deliver a more performant version within the next few weeks.

It was there to limit runaway generation of whitespace which you sometimes see with lower grade models or where the prompt is a poor match for the grammar. It can be solved with max tokens in the request, instead.

Yes, I agree: smaller models do sometimes output excessive whitespace. This can indeed be addressed via a max-token constraint, and we should also be able to achieve similar performance improvements in the near future.

I had a little more time to play with some of the suggestions here. @Ubospica, your suggestions may have improved performance slightly, but not significantly. However, please note that using a JSON schema instead of GBNF is way faster; I don’t understand how, but I have a working solution.
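For completeness, the JSON-schema route I’m comparing against looks roughly like this (a sketch; the real schema is much larger):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# JSON-schema equivalent of the "n" rule in the grammars above.
schema = {
    "type": "object",
    "properties": {
        "myMessage": {"type": "string"},
        "sentiment": {"type": "number"},
    },
    "required": ["myMessage", "sentiment"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "How do you feel about grammars? Answer as JSON."}],
    extra_body={"guided_json": schema},  # vLLM-specific: constrain output to this schema
)
print(completion.choices[0].message.content)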

In my simple test, asking for an answer in a specific json format (just one sample for each test):

  • unconstrained: 3.5s
  • json format: 3.7s
  • your gbnf: 18s

(I have not yet had time to explore llguidance.)

Separately, I wonder how the JSON-format route works. AFAIK, the guided_grammar approach uses the same xgrammar backend as guided_json (and as response_format json_schema, which I gather is the same thing). I would have thought the JSON schema was decomposed into GBNF, but that can’t be the case given the performance difference.

I also don’t understand how the model is able to “fit” the JSON schema even when I remove all prompting about the schema and change the property names to really obscure strings of random characters. It’s as if the JSON schema itself were included in the prompt behind the scenes, but I don’t see any evidence of that in the logs. I’ve also verified that the JSON schema descriptions are ignored. Maybe the model tries every possible token, even those far down the probability list, until valid output is produced? I’m amazed that it works, without any speed penalty. For example, I give a JSON schema with property name “123576345982ryq09vs” but give an example in the prompt where it is “message”, and within a few seconds it has successfully matched the schema.
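My current mental model, for anyone else puzzling over this (as I understand it; a conceptual sketch, not vLLM’s or xgrammar’s actual code): the backend doesn’t make the model retry anything. At each decoding step the grammar engine works out which tokens are legal next and masks out the rest before sampling, so even a token the model considers very unlikely, such as an obscure property name, can end up being the best remaining choice, with no extra forward passes and hence no speed penalty.

import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_ids: list[int]) -> int:
    # Mask every token the grammar engine says is illegal at this step...
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    # ...then pick (here, greedily) among what remains; if only the "weird"
    # property-name tokens are legal, one of them wins by default.
    return int(np.argmax(masked))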