Welcome to Richelieu's Tech Space 🚀

This is where Richelieu writes about LLM inference optimization (vLLM/SGLang), along with reflections on life and personal growth.

A Deep Dive into Attention Sink and Sliding Window Attention in vLLM

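Overview

This post takes a close look at two key attention mechanisms in the vLLM inference engine: Attention Sink and Sliding Window Attention (SWA). It covers their design rationale, their implementation in the vLLM code, the KV cache management flow, and how the DeepSeek V4 model uses them. All analysis is based on the DeepSeek V4 implementation in the vLLM codebase. ...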

vLLM Daily — 2026-06-04

63 commits touching 375 files, +13551/-4389 lines changed. ...

vLLM Daily — 2026-06-03

79 commits touching 371 files, +13084/-2832 lines changed. ...

A Deep Dive into the DeepSeek V4 MegaMoE Kernel

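Preface

The classic MoE computation

Before diving into MegaMoE, let's walk through the full computation of a classic MoE (Mixture of Experts) layer, using a concrete configuration as an example:

    T = 8     # number of tokens in the current batch
    H = 7168  # hidden size
    I = 2048  # intermediate size (FFN intermediate dimension of each expert)
    E = 256   # total number of experts
    K = 6     # number of experts activated per token (top-K)

Step 1: Routing

The input hidden_states has shape [T, H] (that is, [8, 7168]). It passes through the gate linear layer:

    gate = hidden_states @ W_gate^T   # W_gate: [E, H] → gate: [T, E] = [8, 256]
                                      # score of each token for each expert

A scoring function (such as softmax or sqrt(softplus)) is applied to each token's scores, and then the top-K experts are selected: ...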
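The routing step described in this excerpt can be illustrated with a small pure-Python sketch. It uses toy sizes rather than the [8, 7168] configuration above, picks softmax as the scoring function, and does not renormalize the selected weights, since the excerpt does not say whether that happens. The helper names `softmax` and `route` are hypothetical and do not come from vLLM.

```python
import math


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def route(hidden_states, w_gate, k):
    """Hypothetical helper: hidden_states is [T][H], w_gate is [E][H].

    Returns (topk_ids, topk_weights), each of shape [T][k].
    """
    topk_ids, topk_weights = [], []
    for token in hidden_states:
        # gate = hidden_states @ W_gate^T: one logit per expert
        logits = [sum(h * w for h, w in zip(token, expert_row))
                  for expert_row in w_gate]
        scores = softmax(logits)
        # keep the k experts with the highest score
        ids = sorted(range(len(scores)), key=lambda e: scores[e],
                     reverse=True)[:k]
        topk_ids.append(ids)
        topk_weights.append([scores[e] for e in ids])
    return topk_ids, topk_weights


# Toy sizes: T=2 tokens, H=2 hidden dims, E=4 experts, K=2
hidden_states = [[1.0, 0.0], [0.0, 1.0]]
w_gate = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, -1.0]]
ids, weights = route(hidden_states, w_gate, k=2)
print(ids)
```

Each token ends up with K expert indices and K scores; a real kernel would do the same work as a batched matrix multiply followed by a fused top-K.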

DeepSeek V4 MoE Quantization Explained

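Preface

This post is compiled from a technical discussion of the DeepSeek V4 MoE module in the vLLM codebase. It covers a comparison of the MXFP4 and NVFP4 quantization schemes, the design rationale of block-quantized GEMM, the FP4 packed storage format, and the concrete implementation of FP8×FP4 in the DeepGEMM library on Blackwell hardware.

1. Overview of the core DeepSeek V4 MoE optimizations

The DeepSeek V4 MoE module as implemented in vLLM includes a large number of optimizations:

| Optimization | Description |
| --- | --- |
| DeepGEMM MegaMoE | Fuses EP dispatch + L1 GEMM + SwiGLU + L2 GEMM + EP combine into a single mega-kernel; NVLink communication overlaps with compute |
| FP4 (MXFP4/NVFP4) weight quantization | 4-bit floating-point weights + UE8M0 block scale |
| Expert Parallelism with multiple backends | Several all-to-all strategies, including DeepEP, FlashInfer NVLink, MORI, and NIXL |
| Fused TopK Bias Routing | sqrt(softplus) scoring function, e_score_correction_bias, hash MoE |
| EPLB | Tracks expert load per layer and rebalances dynamically |
| Fused MLA Kernel | Q-norm + RoPE + KV quant + cache insert fused into a single CUDA kernel |
| MTP (Multi-Token Prediction) | Speculative decoding that shares the MoE architecture |

2. Differences between MXFP4 and NVFP4

DeepSeek V4 Flash uses FP4 weights, with two options: MXFP4 (an OCP open standard) and NVFP4 (an NVIDIA proprietary format). The choice is controlled by the moe_quant_algo field in the HuggingFace config. ...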
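To make "4-bit floating-point weights + UE8M0 block scale" concrete, here is a rough sketch of quantizing one block to E2M1 values with a shared power-of-two scale. It assumes the scale exponent is floor(log2(amax)) minus 2 (2 being the largest E2M1 exponent) and uses round-to-nearest with saturation. The function names are hypothetical, the block is far shorter than a real one, and this is not the vLLM or DeepGEMM code path.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Hypothetical helper: returns (scale_exponent, quantized_values).

    The scale is a pure power of two, which is what an exponent-only
    (E8M0-style) scale can represent.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Assumption: shift the block so amax lands near the top E2M1 binade
    exp = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exp
    quantized = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)  # saturate at the E2M1 maximum
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        quantized.append(math.copysign(nearest, x))
    return exp, quantized


def dequantize_block(exp, quantized):
    """Multiply the shared power-of-two scale back in."""
    return [v * 2.0 ** exp for v in quantized]


exp_a, q_a = quantize_block([12.0, -1.0, 4.0, 0.3])
restored_a = dequantize_block(exp_a, q_a)
print(exp_a, q_a, restored_a)
```

Values that already sit on the scaled E2M1 grid round-trip exactly; everything else snaps to the nearest grid point, which is where the quantization error comes from.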

vLLM PCP/DCP Technical Notes

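Background

vLLM introduces two context-parallel dimensions: Prefill Context Parallel (PCP) and Decode Context Parallel (DCP). These notes record a source-code analysis covering the configuration entry points, process-group initialization, communication patterns, KV cache layout, the LSE merging mechanism, support across the attention backends, and some mathematical derivations. ...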

vLLM SP/AsyncTP/Quant Technical Notes

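Background

vLLM implements a variety of parallelism strategies and compilation optimizations. This post focuses on three closely related threads, Sequence Parallelism (SP), Async Tensor Parallelism (AsyncTP), and quantization, and on how they work together. It covers conceptual distinctions, the principle of GEMM-communication fusion, quantization-aware rewriting, and configuration details. ...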

The vLLM Compilation System, Fully Explained

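vLLM's compilation system adds substantial customization on top of standard PyTorch torch.compile: piecewise compilation, bytecode hooks, AOT caching, dynamic shape management, and more. Starting from several real debugging problems, this post systematically walks through the core mechanisms of the vLLM compilation system. ...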

DeepSeek-v2 Routed Scaling Factor: When It Is Applied

Background

The DeepSeek-V2/V3 family of models uses an MoE (Mixture of Experts) architecture, in which routed_scaling_factor is an important hyperparameter that scales the output of the routed experts. The factor comes from the model config and is initialized in DeepseekV2MoE.__init__:

    self.routed_scaling_factor = config.routed_scaling_factor

The default is usually 1.0, while the DeepSeek-V2 family (for example deepseek-v2 and deepseek-coder-v2) typically sets it to 2.5 or 1.0, depending on the specific sub-model. The control switch lives in vLLM's deepseek_v2.py; the key code is:

    apply_routed_scale_to_output = not self.is_rocm_aiter_moe_enabled
    routed_scaling_factor=self.routed_scaling_factor,
    apply_routed_scale_to_output=not self.is_rocm_aiter_moe_enabled,

This bool decides who handles routed_scaling_factor: the kernel internally, or the runner externally. ...
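The point of such a switch is that the factor must be applied exactly once. The sketch below is a hypothetical model of the two paths, not the vLLM code: when `apply_routed_scale_to_output` is true the combined output is scaled afterwards, otherwise the factor is folded into the routing weights as a stand-in for a kernel handling it internally. Both function names are made up for illustration.

```python
import math


def combine(expert_outputs, topk_weights, scale=1.0):
    """Weighted sum of the selected experts' outputs for one token.

    `scale` is folded into each routing weight, which stands in for a
    kernel that applies the factor internally.
    """
    hidden = len(expert_outputs[0])
    out = [0.0] * hidden
    for w, y in zip(topk_weights, expert_outputs):
        for i in range(hidden):
            out[i] += (scale * w) * y[i]
    return out


def moe_forward(expert_outputs, topk_weights, routed_scaling_factor,
                apply_routed_scale_to_output):
    """Hypothetical runner: scale outside or inside, never both."""
    if apply_routed_scale_to_output:
        out = combine(expert_outputs, topk_weights)
        return [routed_scaling_factor * v for v in out]
    return combine(expert_outputs, topk_weights, scale=routed_scaling_factor)


expert_outputs = [[1.0, 2.0], [3.0, 4.0]]  # two selected experts, hidden size 2
topk_weights = [0.5, 0.25]
outside = moe_forward(expert_outputs, topk_weights, 2.5, True)
inside = moe_forward(expert_outputs, topk_weights, 2.5, False)
print(outside, inside)
```

Because the combine step is linear in the weights, the two paths agree up to floating-point rounding. The bug to avoid is applying the factor in both places, or in neither.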

Extending vLLM's embed.py: Adding Custom Non-EngineArgs Arguments

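Background

The official vLLM example examples/basic/offline_inference/embed.py shows a standard embedding inference flow. Its argument parsing follows this pattern:

    parser = FlexibleArgumentParser()
    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    llm = LLM(**vars(args))

The core flow is:

1. Create a FlexibleArgumentParser.
2. Register every EngineArgs field as a CLI argument via EngineArgs.add_cli_args(parser).
3. After parsing, unpack the arguments straight into the LLM constructor with LLM(**vars(args)).

This pattern works well when only EngineArgs are involved. Once you need application-specific custom arguments (such as --batch-size or --input-file), you hit a problem: the non-EngineArgs arguments are passed to LLM() along with everything else, which raises a TypeError.

The problem

LLM.__init__ only accepts fields defined in EngineArgs. If extra custom arguments are added to the parser, vars(args) contains those extra fields, and unpacking them directly into LLM() produces an error like this:

    TypeError: LLM.__init__() got an unexpected keyword argument 'custom_param'

The core idea of the fix is to strip the custom arguments out before passing anything to LLM().

Three solutions

Method 1: Filter manually by field (the most direct)

    def parse_args():
        parser = FlexibleArgumentParser()
        parser = EngineArgs.add_cli_args(parser)
        parser.add_argument("--custom-param", type=str, default=None)
        parser.add_argument("--batch-size", type=int, default=32)
        return parser.parse_args()

    def main(args: Namespace):
        # Keep only the keys that are declared fields of EngineArgs
        engine_args = {k: v for k, v in vars(args).items()
                       if k in EngineArgs.__dataclass_fields__}
        llm = LLM(**engine_args)
        print(f"Custom param: {args.custom_param}")
        print(f"Batch size: {args.batch_size}")

How it works: EngineArgs is a dataclass, and __dataclass_fields__ holds the metadata for every declared field. Iterate over the vars(args) dict and keep only the entries whose key exists in __dataclass_fields__. ...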
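The filtering idea in Method 1 can be tried without installing vLLM. The sketch below swaps in a hypothetical `FakeEngineArgs` dataclass and a plain `argparse.ArgumentParser` as stand-ins for `EngineArgs` and `FlexibleArgumentParser`; the two engine fields and the `split_args` helper are illustrative only.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class FakeEngineArgs:
    """Hypothetical stand-in for vLLM's EngineArgs dataclass."""
    model: str = "dummy-model"
    max_model_len: int = 1024


def build_parser():
    parser = argparse.ArgumentParser()
    # Stand-in for EngineArgs.add_cli_args(parser)
    parser.add_argument("--model", type=str, default="dummy-model")
    parser.add_argument("--max-model-len", type=int, default=1024)
    # Application-specific arguments the engine must never see
    parser.add_argument("--custom-param", type=str, default=None)
    parser.add_argument("--batch-size", type=int, default=32)
    return parser


def split_args(args):
    """Split a parsed namespace into engine kwargs and custom kwargs."""
    engine_keys = {f.name for f in fields(FakeEngineArgs)}
    all_args = vars(args)
    engine_kwargs = {k: v for k, v in all_args.items() if k in engine_keys}
    custom_kwargs = {k: v for k, v in all_args.items() if k not in engine_keys}
    return engine_kwargs, custom_kwargs


args = build_parser().parse_args(["--model", "m", "--batch-size", "8"])
engine_kwargs, custom_kwargs = split_args(args)
engine = FakeEngineArgs(**engine_kwargs)  # no TypeError: only known fields
print(engine, custom_kwargs)
```

Here `dataclasses.fields()` is the public accessor for the same metadata that `__dataclass_fields__` exposes, so the two spellings select the same keys.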