Grammar CPU-bound performance

TL;DR: when using a grammar, top shows 100% CPU while GPU usage stays very low for extended periods.

I’d expect to see perhaps 13 × 100% on a 13 (real) core VM if it were perfectly efficient. Certainly more than one core.

Are there any settings I am missing to say “when applying a grammar, use all of my gazillion cores that for some reason Google requires me to have when using an H100 card”, or is this a known limitation of the xgrammar backend?

In detail:

I’m using today’s version of the CUDA server variant of the vLLM Docker image, on a GCE a3-highgpu-1g (26 vCPU, scads of regular RAM, 1 NVIDIA H100 80 GB GPU).

I’m playing with the deepseek distills, currently deepseek-ai/DeepSeek-R1-Distill-Llama-8B (unquantized). Using the default prompt template.

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=xxx" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B

I’m using guided_grammar to restrict output to a somewhat sophisticated grammar, my own conversion of a JSON schema. I’ve played with different forms of support for a grammar prelude, including treating [^{]* as an entire acceptable prelude before the "{" JSON start. I’m aware of the issues with DeepSeek having <think> baked into the prompt template (and hence not present in the grammar-checked response). (Strangely, llama.cpp didn’t have that issue, and the grammar with <think> worked fine there, whereas here with vLLM, if you omit <think> in a custom template you get no thinking. But I digress…)
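For reference, the requests look roughly like this (a minimal sketch against vLLM’s OpenAI-compatible endpoint; the real prompt and grammar are far more elaborate, and the toy grammar here is only for illustration):

from openai import OpenAI

# Points at the vLLM server started by the docker command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Toy grammar: tolerate a free-form thinking prelude, then force one JSON object.
grammar = r'''
root   ::= [^{]* "{" [ \n\r\t]* "\"myMessage\"" [ \n\r\t]* ":" [ \n\r\t]* string [ \n\r\t]* "}"
string ::= "\"" [^\"\\]* "\""
'''

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Summarise your mood as JSON."}],
    extra_body={"guided_grammar": grammar},  # vLLM-specific guided decoding parameter
)
print(completion.choices[0].message.content)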

I’m aware that a grammar effectively restricts each token, and that the model needs to “want” to generate matching text. So I’m engineering my prompt to ensure that the output I get when the grammar isn’t applied typically matches (or nearly matches) the grammar restrictions.

When I run the same kinds of queries: without the grammar I get around 8 seconds turnaround on the chat completions endpoint, with GPU usage approaching 100% for most of that time; with the grammar I get around 24 seconds turnaround, with GPU usage mostly around 15%, spiking a couple of times to maybe 40%.

(This is according to nvidia-smi -l 1)

Meanwhile, top shows almost exactly 100% CPU, constantly, throughout the request; it drops to near zero before and after.

Update:

I’ve found dramatically faster results using guided_json, but there’s a bug:

I also found this, which seems related:

You may want to try the guidance backend, though you would likely need to rewrite your grammar, or better yet use the %json { ... } syntax in the Lark grammar; see llguidance/docs/syntax.md at main · guidance-ai/llguidance · GitHub.

Guidance should have near zero pre-compilation times for even the most complex schemas.
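To give a rough idea, here is an untested sketch of what that might look like for a simple object (a free-form thinking prelude, then a JSON object described inline by a JSON schema via %json; check syntax.md above for the exact details):

start: PRELUDE answer
PRELUDE: /[^{]*/
answer: %json {
    "type": "object",
    "properties": {
        "message": { "type": "string" }
    },
    "required": ["message"]
}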

That’s super helpful @mmoskal. It took some digging but I found the reference to it in the vLLM docs. I’ll try that and will post back.

Hi @supersteves, I think the speed issue may also be related to the specific grammar. There could be some specific components in the grammar that make it slow. Could you share the grammar?

Besides, we are working with the vLLM team to enhance xgrammar, significantly increasing the speed of some slower grammars and making it robust to all potential vLLM use cases; this is expected to land in an upcoming version. It could be much better after that!

@Ubospica thanks for your reply.

Although we use a much more extensive grammar (effectively our own JSON-schema compilation), the same degradation happens with this simpler grammar. It’s a basic “raw JSON” grammar augmented to tolerate an arbitrary thinking prelude (terminated by “{”), handling the vLLM / DeepSeek case (without incompatible reasoning-process args) whereby <think> is in the template and thus omitted from the output. (Note that llama.cpp seems to avoid this quirk by way of its own custom-written equivalent of the prompt template, I think. A tangential point.)

root           ::= [^{]* json-value

object ::= "{" ( whitespace | whitespace string whitespace ":" value ( "," whitespace string whitespace ":" value )* ) "}"

array ::= "[" ( whitespace | value ( "," value )* ) "]"

value ::= whitespace ( string | number | object | array | boolean | null ) whitespace

boolean ::= ( "true" | "false" )

null ::= "null"

string ::= "\"" ( [^\"\\\x7F\x00-\x1F\x80-\x9F] | "\\" ( [\"\\/bfnrt] | "u" [0-9a-fA-F]{4,4} ) )* "\""

number ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( "." [0-9]{1,30} )? ( [Ee] [+\x2d]? [0-9]{1,30} )?

integer ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( [Ee] [+]? [0-9]{1,30} )?

whitespace ::= [ \n\r\t]{0,21}

n ::= "{"  whitespace "\"myMessage\"" whitespace ":"  whitespace string whitespace   "," whitespace "\"sentiment\"" whitespace ":"  whitespace number whitespace   "}" whitespace

json-value ::= n

Hi @supersteves, I’ve studied your grammar. I believe the issue lies in how whitespace is handled: the current analyzer handles this whitespace rule slowly. For now, you can use this version of the grammar, where whitespace is inlined directly into the specific rules and the bounded repetition range ({0,21}) has been replaced with a star, which I think should significantly speed things up. This issue should be fixed in a future version.

root           ::= [^{]* json-value

object ::= "{" ( [ \n\r\t]* | [ \n\r\t]* string [ \n\r\t]* ":" value ( "," [ \n\r\t]* string [ \n\r\t]* ":" value )* ) "}"

array ::= "[" ( [ \n\r\t]* | value ( "," value )* ) "]"

value ::= [ \n\r\t]* ( string | number | object | array | boolean | null ) [ \n\r\t]*

boolean ::= ( "true" | "false" )

null ::= "null"

string ::= "\"" ( [^\"\\\x7F\x00-\x1F\x80-\x9F] | "\\" ( [\"\\/bfnrt] | "u" [0-9a-fA-F]{4,4} ) )* "\""

number ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( "." [0-9]{1,30} )? ( [Ee] [+\x2d]? [0-9]{1,30} )?

integer ::= "\x2d"? ( "0" | [1-9] [0-9]{0,30} ) ( [Ee] [+]? [0-9]{1,30} )?

n ::= "{"  [ \n\r\t]* "\"myMessage\"" [ \n\r\t]* ":"  [ \n\r\t]* string [ \n\r\t]*   "," [ \n\r\t]* "\"sentiment\"" [ \n\r\t]* ":"  [ \n\r\t]* number [ \n\r\t]*   "}"

json-value ::= n
"""

@Ubospica Hey, thank you for that. So, extra rules aren’t just syntactic sugar, they have an impact? I can see this being more problematic with our wider json schema-compiled grammar.

And the * change: so bounded repetition has a cost. Fine. It was there to limit the runaway generation of whitespace you sometimes see with lower-grade models, or where the prompt is a poor match for the grammar. It can be solved with max_tokens in the request instead.
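Concretely, capping it per request is just something like this (a sketch; the cap value is arbitrary, and the client and grammar are as in my earlier snippet):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Summarise your mood as JSON."}],
    max_tokens=1024,                         # hard cap on generated tokens; bounds runaway whitespace
    extra_body={"guided_grammar": grammar},  # `grammar` as earlier, now with unbounded whitespace
)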

So, extra rules aren’t just syntactic sugar, they have an impact

Thanks for asking this. Due to current parser limitations, in a few edge cases, splitting the rules can indeed lead to worse performance. That said, improving the parser is already on our roadmap (there are several enhanced algorithms), and we expect to deliver a more performant version within the next few weeks.

It was there to limit runaway generation of whitespace which you sometimes see with lower grade models or where the prompt is a poor match for the grammar. It can be solved with max tokens in the request, instead.

Yes, I agree: smaller models do sometimes output excessive whitespace. This can indeed be addressed via a max-token constraint, and we should also be able to achieve similar performance improvements in the near future.

I had a little more time to play with some of the suggestions here. @Ubospica, your suggestions may have improved performance slightly, but not significantly. However, please note that using a JSON schema instead of GBNF is way faster; I don’t understand how, but I have a working solution.
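For completeness, the JSON-schema route I’m comparing against looks roughly like this (a sketch; the real schema is much larger):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# JSON-schema equivalent of the "n" rule in the grammars above.
schema = {
    "type": "object",
    "properties": {
        "myMessage": {"type": "string"},
        "sentiment": {"type": "number"},
    },
    "required": ["myMessage", "sentiment"],
    "additionalProperties": False,
}

completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "How do you feel about grammars? Answer as JSON."}],
    extra_body={"guided_json": schema},  # vLLM-specific: constrain output to this schema
)
print(completion.choices[0].message.content)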

In my simple test, asking for an answer in a specific json format (just one sample for each test):

  • unconstrained: 3.5s
  • json format: 3.7s
  • your gbnf: 18s

(I have not yet had time to explore llguidance.)

Separately, I wonder how the JSON-format route works. AFAIK, the guided_grammar approach uses the same xgrammar backend as guided_json (and as response_format json_schema, which I gather is the same thing). I would have thought the JSON schema was decomposed into GBNF, but that can’t be the case given the performance difference.

I also don’t understand how the model is able to “fit” the JSON schema even when I remove all prompting about the schema and change the property names to really obscure strings of random characters. It’s as if the JSON schema itself were included in the prompt behind the scenes, but I don’t see any evidence of that in the logs. I’ve also verified that the JSON schema descriptions are ignored. Maybe the model tries every possible token, even those far down the probability list, until valid output is produced? I’m amazed that it works, without any speed penalty. For example, I give a JSON schema with property name “123576345982ryq09vs” but give an example in the prompt where it is “message”, and within a few seconds it has successfully matched the schema.
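My current mental model, for anyone else puzzling over this (as I understand it; a conceptual sketch, not vLLM’s or xgrammar’s actual code): the backend doesn’t make the model retry anything. At each decoding step the grammar engine works out which tokens are legal next and masks out the rest before sampling, so even a token the model considers very unlikely, such as an obscure property name, can end up being the best remaining choice, with no extra forward passes and hence no speed penalty.

import numpy as np

def constrained_next_token(logits: np.ndarray, allowed_ids: list[int]) -> int:
    # Mask every token the grammar engine says is illegal at this step...
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    # ...then pick (here, greedily) among what remains; if only the "weird"
    # property-name tokens are legal, one of them wins by default.
    return int(np.argmax(masked))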