Where is the KV cache refreshed for DeepSeek models in vLLM?

yangshanjun · March 12, 2026, 7:36am

@RunLLM Where is the code for communication between TP ranks during model execution?

RunLLM · March 12, 2026, 7:37am

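In vLLM, the core code for communication between TP (tensor parallel) ranks lives in the vllm.distributed.communication_op module. Its main functions include tensor_model_parallel_all_reduce, tensor_model_parallel_all_gather, and tensor_model_parallel_reduce_scatter, which call into the underlying communication group to synchronize data across GPUs. Both the API documentation and the source code describe them in detail.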
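
As a rough illustration of what an all-reduce achieves in a tensor-parallel layer, the sketch below simulates two TP ranks in plain Python, with no GPUs and no vLLM imports. The helper `simulated_all_reduce` and the row-parallel split are assumptions made for this example; they are not vLLM's implementation.

```python
# Simulate a row-parallel linear layer on 2 TP ranks (pure Python, no vLLM).
# Each rank holds half of the input features and the matching weight rows;
# summing the partial outputs (an all-reduce) reproduces the full result.

def matvec(x, w):
    """Compute x @ w for a vector x (length k) and a matrix w (k x n)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def simulated_all_reduce(partials):
    """Hypothetical stand-in for an all-reduce: every rank receives the sum."""
    total = [sum(values) for values in zip(*partials)]
    return [list(total) for _ in partials]


x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]]

# Reference result computed on a single device.
full = matvec(x, w)

# Split the reduction dimension across the TP ranks.
tp_size = 2
shard = len(x) // tp_size
partials = [
    matvec(x[r * shard:(r + 1) * shard], w[r * shard:(r + 1) * shard])
    for r in range(tp_size)
]

# After the all-reduce, every rank holds the same full output.
reduced = simulated_all_reduce(partials)
```

Here `reduced[0]` and `reduced[1]` both equal `full`, which is the property the real collective provides across GPUs.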

Would you like a specific call chain or a detailed explanation of one of the communication functions?


Hint: Mention @RunLLM in the post for followups.

yangshanjun · March 12, 2026, 7:44am

@RunLLM After the model finishes executing, is there an allreduce during or after sampling?

RunLLM · March 12, 2026, 7:44am

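No. Once the forward pass is done, the sampling stage itself does not normally issue another allreduce. Allreduce is mainly used during the forward pass to combine the partial outputs of the TP (tensor parallel) shards. One caveat, to my understanding: when the LM head is sharded along the vocabulary dimension, the logits are first collected across the TP ranks with a gather or all-gather (not an allreduce). After that, sampling runs locally to produce token ids and needs no further cross-GPU synchronization. See the API documentation for details.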
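
The step between the forward pass and sampling can be sketched as follows, assuming the logits are sharded along the vocabulary dimension across two TP ranks. The helpers `simulated_all_gather` and `greedy_sample` are hypothetical names for this illustration, not vLLM functions.

```python
# Pure-Python sketch: vocab-sharded logits are gathered, then each rank
# samples locally. No collective is needed inside the sampling step itself.

def simulated_all_gather(shards):
    """Hypothetical stand-in for an all-gather along the vocab dimension."""
    full = [value for shard in shards for value in shard]
    return [list(full) for _ in shards]


def greedy_sample(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Each TP rank holds the logits for its half of a 6-token vocabulary.
logit_shards = [[0.1, 2.5, -1.0], [0.3, 4.0, 1.2]]

# After the gather, every rank sees the full vocabulary.
gathered = simulated_all_gather(logit_shards)

# Sampling is purely local and yields the same token on both ranks.
tokens = [greedy_sample(logits) for logits in gathered]
```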

Would you like a more detailed execution flow or notes on special cases?

Sources:

vllm.distributed.communication_op — vLLM


yangshanjun · March 12, 2026, 8:03am

@RunLLM If tp=2 and rank 6 and rank 7 are both last ranks, do they sample separately?

RunLLM · March 12, 2026, 8:03am

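Yes. With tp=2 and pp=4 (8 ranks in total, with the TP ranks of one stage numbered consecutively), ranks 6 and 7 are the two TP ranks of the last pipeline stage. Each of them runs the sampling step locally, and the sampler itself does not synchronize across TP or PP ranks. Both ranks work on the same batch, so they are expected to see the same logits once those have been gathered across the TP group. See the official architecture overview for the underlying design.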
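
The rank layout in question can be checked with a small sketch. It assumes that TP is the innermost dimension (ranks of one pipeline stage are consecutive) and that there is no data parallelism; `rank_to_coords` is a hypothetical helper, not a vLLM API.

```python
# Map a global rank to (pp_rank, tp_rank), assuming TP is the innermost
# dimension and the world consists only of tp_size * pp_size ranks.

def rank_to_coords(rank, tp_size, pp_size):
    world_size = tp_size * pp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    return rank // tp_size, rank % tp_size


tp_size, pp_size = 2, 4
layout = {
    rank: rank_to_coords(rank, tp_size, pp_size)
    for rank in range(tp_size * pp_size)
}

# Ranks that belong to the last pipeline stage.
last_stage_ranks = [
    rank for rank, (pp_rank, _) in layout.items() if pp_rank == pp_size - 1
]
```

Under these assumptions `layout[6]` is `(3, 0)` and `layout[7]` is `(3, 1)`: both ranks sit in the last stage and differ only in their TP rank.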

Would you like a detailed description of the rank layout or the sampling flow?

Sources:

Official architecture overview


yangshanjun · March 12, 2026, 10:00am

@RunLLM So rank 6 and rank 7 each end up generating their own token ids? Are they not merged? Which result does the model finally use?

RunLLM · March 12, 2026, 10:00am

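My understanding is that with pp+tp, ranks 6 and 7 each run sampling independently, but they are TP peers in the same last pipeline stage and process the same requests. Given identical logits and identical random seeds, they should arrive at the same token ids, so there is nothing to merge: the engine only needs the output of one of them (as far as I can tell, the first TP rank of the last stage, which is rank 6 here), and the other copy is simply not used. See the related discussion in the official issues.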
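
A minimal sketch of why two last-stage TP ranks do not need to merge results, assuming both see the same logits and use the same seed. The choice of `output_rank = world_size - tp_size` is an assumption about which rank's result is kept, and `sample_token` is a hypothetical helper rather than vLLM's sampler.

```python
import math
import random


def sample_token(logits, seed):
    """Sample one token id from softmax(logits) with a seeded RNG."""
    rng = random.Random(seed)
    peak = max(logits)
    # Unnormalized softmax weights; random.choices normalizes internally.
    weights = [math.exp(value - peak) for value in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [0.1, 2.5, -1.0, 0.3, 4.0, 1.2]

tp_size, pp_size = 2, 4
world_size = tp_size * pp_size

# Ranks of the last pipeline stage, assuming TP is the innermost dimension.
last_stage = list(range(world_size - tp_size, world_size))

# Every last-stage rank samples locally with the same seed and logits.
sampled = {rank: sample_token(logits, seed=0) for rank in last_stage}

# Assumption: only the first TP rank of the last stage reports its result.
output_rank = world_size - tp_size
final_token = sampled[output_rank]
```

Because the inputs and the seed match, both ranks draw the same token, and keeping a single rank's output loses nothing.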

Would you like more detail on the rank layout, how the sampling result is selected, or the scheduling logic?

Sources:

https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/gpu_model_runner.py


yangshanjun · March 27, 2026, 3:39am

@RunLLM

    def get_kv_cache_config_from_groups(
        vllm_config: VllmConfig,
        kv_cache_groups: list[KVCacheGroupSpec],
        available_memory: int,
    ) -> KVCacheConfig:
        """
        Generate the KV cache configuration from the KV cache groups and spec
        of each layer.

        Args:
            vllm_config: The global VllmConfig
            kv_cache_groups: The KV cache groups
            available_memory: Memory available for KV cache in bytes
        Returns:
            The generated KVCacheConfig
        """
        if len(kv_cache_groups) == 0:
            # Attention free models do not have KV cache.
            # Return num_blocks=1 as BlockPool always needs a null_block.
            return KVCacheConfig(
                num_blocks=1,
                kv_cache_tensors=[],
                kv_cache_groups=kv_cache_groups,
            )

        # Determine how model runners should initialize the KV cache tensors.
        if len(kv_cache_groups) == 1 and isinstance(
            kv_cache_groups[0].kv_cache_spec, UniformTypeKVCacheSpecs
        ):
            # Special case: all layers have the same type of KV cache but with
            # different hidden size. Allocate different amount of memory for each
            # layer based on its hidden size.
            num_blocks = (
                available_memory // kv_cache_groups[0].kv_cache_spec.page_size_bytes
            )
            num_blocks = may_override_num_blocks(vllm_config, num_blocks)
            per_layer_specs = kv_cache_groups[0].kv_cache_spec.kv_cache_specs
            kv_cache_tensors = [
                KVCacheTensor(
                    size=per_layer_specs[layer_name].page_size_bytes * num_blocks,
                    shared_by=[layer_name],
                )
                for layer_name in kv_cache_groups[0].layer_names
            ]
        else:
            # General case:
            # We will have group_size memory pools, each is shared by one layer from
            # each group. As layers of different groups have different block table,
            # they will use different parts of the shared Tensor.
            # The memory layout for 3 groups (full.0, full.1), (sw.0, sw.2),
            # (sw.1, padding) will be: (group_size = 2)
            # full.0, sw.0, sw.1: share a Tensor with size=available_memory//2
            # full.1, sw.2: share another Tensor with size=available_memory//2
            group_size = max(len(group.layer_names) for group in kv_cache_groups)

            page_size = get_uniform_page_size(
                [group.kv_cache_spec for group in kv_cache_groups]
            )
            assert group_size > 0, "group_size must be greater than 0"
            num_blocks = get_num_blocks(
                vllm_config, group_size, available_memory, page_size
            )
            kv_cache_tensors = []
            for i in range(group_size):
                shared_by = []
                for j in range(len(kv_cache_groups)):
                    if i < len(kv_cache_groups[j].layer_names):
                        shared_by.append(kv_cache_groups[j].layer_names[i])
                kv_cache_tensors.append(
                    KVCacheTensor(size=page_size * num_blocks, shared_by=shared_by)
                )

        return KVCacheConfig(
            num_blocks=num_blocks,
            kv_cache_tensors=kv_cache_tensors,
            kv_cache_groups=kv_cache_groups,
        )

In this function, why does the first if branch compute num_blocks without dividing by num_layers, while get_num_blocks in the else branch does divide by num_layers?

RunLLM · March 27, 2026, 3:40am

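In the first if branch (len(kv_cache_groups) == 1 and isinstance(..., UniformTypeKVCacheSpecs)), every layer gets its own KV cache tensor, sized as that layer's page_size_bytes × num_blocks. The divisor there is the group-level kv_cache_spec.page_size_bytes, which, to my understanding, already covers one block across all layers in the group (the sum of the per-layer page sizes). The layer count is therefore already accounted for, and no extra division by num_layers is needed. In the else branch, page_size is the uniform size of one page in one tensor, and group_size tensors are allocated, each shared by one layer from every group. The available memory is divided by page_size and by group_size (passed to get_num_blocks as the layer count) so that all group_size tensors fit. See the source comments and design notes for details.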
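
The arithmetic of the two branches can be compared with a small worked example. It is a simplified sketch: it ignores may_override_num_blocks, uses made-up byte sizes, and assumes the group-level page size in the first branch is the sum of the per-layer page sizes. The function names are hypothetical.

```python
# Compare the two num_blocks calculations with toy numbers (sizes in bytes).

def num_blocks_uniform_branch(available_memory, per_layer_page_sizes):
    """One tensor per layer; assumes group page size = sum of layer pages."""
    group_page_size = sum(per_layer_page_sizes.values())
    num_blocks = available_memory // group_page_size
    tensor_sizes = {
        name: size * num_blocks for name, size in per_layer_page_sizes.items()
    }
    return num_blocks, tensor_sizes


def num_blocks_general_branch(available_memory, page_size, group_size):
    """group_size shared tensors, each holding num_blocks uniform pages."""
    num_blocks = available_memory // page_size // group_size
    tensor_sizes = [page_size * num_blocks for _ in range(group_size)]
    return num_blocks, tensor_sizes


available = 1_000_000

# First branch: two layers with different page sizes.
uniform_blocks, uniform_tensors = num_blocks_uniform_branch(
    available, {"layer.0": 100, "layer.1": 300}
)

# General branch: uniform page size, two shared tensors.
general_blocks, general_tensors = num_blocks_general_branch(
    available, page_size=200, group_size=2
)
```

In both cases the tensors together stay within `available`; the branches differ only in where the layer count enters the division.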

Would you like a more detailed explanation of the memory allocation or a concrete example?

Sources:

kv_cache_utils.py


yangshanjun · March 27, 2026, 3:45am

@RunLLM In the PP case, is there still no division by num_layers?

RunLLM · March 27, 2026, 3:45am

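With PP (pipeline parallelism), each worker allocates KV cache only for the layers assigned to it. The logic is otherwise unchanged. In the first if branch (one tensor per layer), num_blocks is still the available memory divided by the group-level page_size_bytes, with no separate division by num_layers. In the else branch (tensors shared across groups), the available memory is divided by page_size and by group_size, the largest number of layers in any group, so that all shared tensors fit. See the source code for details.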
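
The effect of PP on the per-worker calculation can be sketched as follows. This assumes an even layer split, the same page size for every layer, and the same amount of free memory on each worker; `split_layers` and `local_num_blocks` are hypothetical helpers for illustration only.

```python
# Each PP worker sizes its KV cache from its own layers only.

def split_layers(num_layers, pp_size):
    """Evenly split layer indices across pipeline stages (illustrative)."""
    per_stage = num_layers // pp_size
    return [
        list(range(stage * per_stage, (stage + 1) * per_stage))
        for stage in range(pp_size)
    ]


def local_num_blocks(available_memory, page_size, local_layers):
    """Blocks per layer when every local layer gets its own tensor."""
    return available_memory // (page_size * len(local_layers))


stages = split_layers(num_layers=8, pp_size=2)

# Each worker divides its memory among its 4 local layers only.
blocks_per_stage = [
    local_num_blocks(1_000_000, 100, layers) for layers in stages
]

# For comparison: a single worker holding all 8 layers.
blocks_without_pp = local_num_blocks(1_000_000, 100, list(range(8)))
```

With half the layers per worker, the same memory budget yields twice as many blocks per layer, without any change to the formula itself.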

Would you like a detailed PP allocation flow or a memory layout diagram?

Sources:

Source code


yangshanjun · March 27, 2026, 6:47am

@RunLLM What is the difference between UniformTypeKVCacheSpecs and AttentionSpec?

RunLLM · March 27, 2026, 6:47am

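UniformTypeKVCacheSpecs is a collection of KV cache specs for multiple layers that share the same attention type (for example, full attention or sliding-window attention); it holds a KVCacheSpec for each layer. AttentionSpec is the base class for the KV cache spec of a single attention layer and describes that layer's KV cache parameters, such as block_size and head_size. See the API documentation.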
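
The relationship between the two can be illustrated with simplified stand-in classes. These are hypothetical and much smaller than the real vLLM classes; in particular, the field set, the factor of 2 for keys and values, and the summed group page size are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class SimpleAttentionSpec:
    """Hypothetical stand-in for a single layer's KV cache spec."""
    block_size: int    # tokens per block
    num_kv_heads: int
    head_size: int
    dtype_size: int    # bytes per element

    @property
    def page_size_bytes(self) -> int:
        # Factor 2: one slot for keys and one for values (assumption).
        return (
            2 * self.block_size * self.num_kv_heads
            * self.head_size * self.dtype_size
        )


@dataclass
class SimpleUniformSpecs:
    """Hypothetical stand-in for a collection of same-type layer specs."""
    kv_cache_specs: dict  # layer name -> SimpleAttentionSpec

    @property
    def page_size_bytes(self) -> int:
        # Assumption: one group-level page spans one block in every layer.
        return sum(spec.page_size_bytes for spec in self.kv_cache_specs.values())


wide = SimpleAttentionSpec(block_size=16, num_kv_heads=8, head_size=128, dtype_size=2)
narrow = SimpleAttentionSpec(block_size=16, num_kv_heads=8, head_size=64, dtype_size=2)
group = SimpleUniformSpecs(kv_cache_specs={"layer.0": wide, "layer.1": narrow})
```

The single-layer spec describes one layer's page, while the collection aggregates several such specs that share a type but may differ in size.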

Would you like a more detailed description of the class structure or usage scenarios?


yangshanjun · March 27, 2026, 6:49am

@RunLLM Is the logic for computing num_blocks different between them?

RunLLM · March 27, 2026, 6:50am

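Yes, although the difference comes from the two branches of get_kv_cache_config_from_groups rather than from the spec classes themselves. When the single group's spec is a UniformTypeKVCacheSpecs, each layer gets its own KV cache tensor and num_blocks = available memory // group-level page_size_bytes. Otherwise (for example, groups described by ordinary AttentionSpec instances), tensors are shared across groups and num_blocks = available memory // page_size // group_size, so that the memory is split across the group_size tensors. See the source code for details.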

Would you like a detailed code example or an explanation of the allocation logic?

Sources:

kv_cache_utils.py


yangshanjun · March 27, 2026, 6:51am

@RunLLM What does group_size mean?

RunLLM · March 27, 2026, 6:51am

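group_size is the largest number of layers in any KV cache group, that is, max(len(group.layer_names)) over all groups. For hybrid attention models (for example, full attention plus sliding window), vLLM sorts layers of different types into groups and allocates group_size KV cache tensors; the i-th tensor is shared by the i-th layer of every group that has one, which lets the groups draw on the same memory pools. See the official design document.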
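
The sharing pattern can be reproduced with plain lists, using the three groups named in the comment of get_kv_cache_config_from_groups quoted earlier in the thread. This is a sketch with bare layer-name lists instead of the real KVCacheGroupSpec and KVCacheTensor objects.

```python
# Groups from the source comment: (full.0, full.1), (sw.0, sw.2), (sw.1, padding).
groups = [["full.0", "full.1"], ["sw.0", "sw.2"], ["sw.1"]]

# group_size is the largest group, so shorter groups are implicitly padded.
group_size = max(len(layer_names) for layer_names in groups)

# Tensor i is shared by the i-th layer of every group that has one.
tensors = []
for i in range(group_size):
    shared_by = [layer_names[i] for layer_names in groups if i < len(layer_names)]
    tensors.append(shared_by)
```

The result is two tensors, one shared by full.0, sw.0 and sw.1 and the other by full.1 and sw.2, matching the layout described in the comment.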

Would you like details on the grouping algorithm or a sketch of the memory layout?

Sources:

Hybrid KV Cache Manager design document


yangshanjun · March 27, 2026, 6:54am

@RunLLM Does num_blocks count blocks per layer or across all layers?

RunLLM · March 27, 2026, 6:54am

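num_blocks is the number of blocks in each KV cache tensor, not a total summed over all layers. With one tensor per layer (the UniformTypeKVCacheSpecs branch), it is the number of blocks per layer; with tensors shared across groups (the general branch), it is the number of blocks in each shared tensor. The same num_blocks value applies to every tensor in the resulting KVCacheConfig. See the official API documentation.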
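
One way to picture a per-tensor num_blocks is that the same block id is valid in every tensor, only the byte range differs with the page size. The sketch below assumes blocks are laid out contiguously inside each tensor, which is a simplification for illustration; `block_byte_range` is a hypothetical helper.

```python
# The same block id addresses a different byte range in each layer's tensor.

def block_byte_range(block_id, page_size_bytes, num_blocks):
    """Return (start, end) byte offsets of a block inside one tensor."""
    if not 0 <= block_id < num_blocks:
        raise IndexError(f"block {block_id} is outside 0..{num_blocks - 1}")
    start = block_id * page_size_bytes
    return start, start + page_size_bytes


num_blocks = 4  # shared by every tensor
layer_page_sizes = {"layer.0": 100, "layer.1": 300}

# Look up the last block (id 3) in each layer's tensor.
ranges = {
    name: block_byte_range(3, size, num_blocks)
    for name, size in layer_page_sizes.items()
}
```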

Would you like a more detailed allocation example or an explanation of the memory layout?

