Hi,
I have a test setup for disaggregated prefill where running the same prompt multiple times produces different K and V tensors on the prefill instance. This seems odd, since I would expect the KV cache to contain exactly the same tensors for an identical prompt. Is there a minimum requirement for this feature that isn't mentioned in the documentation, or some other feature that must be disabled? I did see in the GitHub issues that chunked prefill should be disabled, but disabling it has no effect on the problem in my setup.
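For context, the `value hash` / `key hash` lines in the logs below come from a small debugging helper I added on the prefill side. It is just a sketch along these lines (SHA-256 over the tensor's raw bytes; `tensor_hash` is my own helper, not part of vLLM):

```python
import hashlib

import torch


def tensor_hash(t: torch.Tensor) -> str:
    """SHA-256 over the tensor's raw bytes.

    Copy to CPU and force a contiguous layout first, so the hash
    depends only on the values, not on device or strides.
    """
    return hashlib.sha256(
        t.detach().cpu().contiguous().numpy().tobytes()
    ).hexdigest()


# Identical tensors hash identically; any changed element changes the hash.
k = torch.zeros(16, 5, 8, 64)
print("key hash:", tensor_hash(k), "shape:", k.shape)
```

With this check, two runs of the same prompt should print the same digest for K and V, which is exactly what does not happen below.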
Example:
OK:
# Request
[logger.py:39] Received request cmpl-29f4aad8e4a84dbca5eed6ccf77cfc3f-0: prompt: 'San Francisco is the', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [128000, 24661, 13175, 374, 279], lora_request: None, prompt_adapter_request: None.
# Computed tensors on prefill instance
value hash: 72d20f0c1592472a4da7a3516664c19a96e715cb419ced30a97315c329e69ec1 shape: torch.Size([16, 5, 8, 64]) device: cuda:0
key hash: 7468212fdf4f77fc8dea0ea9f7411917aeffdc7fa7253230433c5ba259ee56c9 shape: torch.Size([16, 5, 8, 64]) device: cuda:0
hidden hash: 7eb84ba21e705e5b7fde6ec16b2e10e22c59bd7708fac4515a833cb8e7775728 shape: torch.Size([5, 2048]) device: cuda:0
# Reply from decode
--- Response Headers ---
Date: Wed, 09 Apr 2025 17:17:40 GMT
Server: uvicorn
Content-Length: 443
Content-Type: application/json
--- Response Body ---
{"id":"cmpl-3a43d853b8824cb0a1d11967eebdf030","object":"text_completion","created":1744219061,"model":"neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8","choices":[{"index":0,"text":" most of the San Andreas Fault.\n\nThe San Andreas Fault is a fault line in","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":21,"completion_tokens":16,"prompt_tokens_details":null}}
Replaying the same prompt:
# Request
[logger.py:39] Received request cmpl-29f4aad8e4a84dbca5eed6ccf77cfc3f-0: prompt: 'San Francisco is the', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=1, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: [128000, 24661, 13175, 374, 279], lora_request: None, prompt_adapter_request: None.
# Computed tensors on prefill instance
value hash: ed65aa6cd4c3c20eefec772d287528488d48fa71ca91ab357a65292dcf6a605e shape: torch.Size([16, 5, 8, 64]) device: cuda:0
key hash: c7b61b699de694d474f6ef2c5e174fdf058347e1b681e48208ff937b8c085f7e shape: torch.Size([16, 5, 8, 64]) device: cuda:0
hidden hash: 7eb84ba21e705e5b7fde6ec16b2e10e22c59bd7708fac4515a833cb8e7775728 shape: torch.Size([5, 2048]) device: cuda:0
# Bogus reply from decode
--- Response Headers ---
Date: Wed, 09 Apr 2025 17:17:48 GMT
Server: uvicorn
Content-Length: 414
Content-Type: application/json
--- Response Body ---
{"id":"cmpl-0ce1fe4330da4cbab08d712578eb6e2a","object":"text_completion","created":1744219068,"model":"neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8","choices":[{"index":0,"text":" most than than than the f f f f f f f f f f f","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":21,"completion_tokens":16,"prompt_tokens_details":null}}