https://brook-cycle-730.notion.site/DeepSeek-V3-2-Pushing-the-Frontier-of-Open-Large-Language-Models-2c299f6063e78011b9cdf6721e0e1023?source=copy_link

Architecture

DeepSeek-V3.2 builds on the DeepSeek-V3 architecture and adds DeepSeek Sparse Attention (DSA)


Lightning Indexer: a small MLP, quantized in FP8, computes a coarse attention map and marks the top-k (k = 2048) most similar tokens for each query
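A minimal NumPy sketch of coarse indexer scoring plus top-k selection. The scoring form (per-head ReLU'd dot products combined by learned head weights) and all names here are illustrative assumptions; FP8 quantization is omitted.

```python
import numpy as np

def lightning_indexer_scores(q_idx, k_idx, w):
    # q_idx: (H, D) indexer queries for one query token (H heads, D dims)
    # k_idx: (L, D) indexer keys for all L context tokens
    # w:     (H,)   learned per-head combination weights
    # Assumed scoring form: score[s] = sum_h w[h] * relu(q_idx[h] . k_idx[s])
    dots = q_idx @ k_idx.T              # (H, L) per-head similarities
    return w @ np.maximum(dots, 0.0)    # (L,) coarse score per context token

def select_topk(scores, k):
    # Indices of the k highest-scoring context tokens (unordered)
    k = min(k, scores.shape[0])
    return np.argpartition(scores, -k)[-k:]
```

`argpartition` gives the top-k set in O(L) without fully sorting all L scores, which matters when L is a 128K context.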


Fine-grained token selection mechanism: each query then attends only to the key-value entries of its top-k selected tokens
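The selection step can be sketched as ordinary softmax attention restricted to the gathered top-k rows; this is a simplified single-query, single-head illustration, not the actual kernel.

```python
import numpy as np

def sparse_attention(q, K, V, topk_idx):
    # q: (d,) query; K, V: (L, d) full key/value caches
    # topk_idx: token indices chosen by the indexer for this query
    Ks, Vs = K[topk_idx], V[topk_idx]        # gather only selected tokens
    logits = Ks @ q / np.sqrt(q.shape[0])    # scaled dot-product scores
    weights = np.exp(logits - logits.max())  # stable softmax over k entries
    weights /= weights.sum()
    return weights @ Vs                      # (d,) attention output
```

When `topk_idx` covers all positions this reduces exactly to dense attention; with k = 2048 the softmax and value mix run over 2048 rows regardless of context length.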


How to train the Lightning Indexer

Starting from DeepSeek-V3.1, keep the dense-attention model frozen and train only the indexer to imitate the original attention map, using a KL-divergence loss
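The warm-up objective can be sketched as a KL divergence between the frozen model's attention distribution (the target) and a softmax over the indexer's raw scores; the exact normalization and aggregation across heads are assumptions here.

```python
import numpy as np

def indexer_kl_loss(indexer_scores, dense_attn):
    # dense_attn:     (L,) target attention distribution from the frozen
    #                 dense model (already sums to 1)
    # indexer_scores: (L,) raw indexer scores, softmaxed into a distribution
    logits = indexer_scores - indexer_scores.max()  # numerical stability
    q = np.exp(logits)
    q /= q.sum()
    eps = 1e-12  # avoid log(0)
    # KL(dense || indexer): only the indexer receives gradients in warm-up
    return float(np.sum(dense_attn * (np.log(dense_attn + eps) - np.log(q + eps))))
```

The loss is zero exactly when the indexer's softmax reproduces the dense attention map, which is the stated warm-up target.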


After the warm-up stage converges, optimize all parameters with sparse attention enabled
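The two-stage schedule amounts to a freezing policy: warm-up trains only the indexer against the frozen dense model, then everything is unfrozen for sparse-attention training. A toy sketch with hypothetical parameter names:

```python
def set_trainable(model_params, stage):
    # model_params: dict name -> {"trainable": bool}; names are illustrative.
    # "warmup": only indexer parameters learn (dense model stays frozen).
    # "sparse": after warm-up converges, all parameters are optimized.
    for name, param in model_params.items():
        if stage == "warmup":
            param["trainable"] = name.startswith("indexer.")
        else:
            param["trainable"] = True
```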


Performance

Sparse attention gives a strong efficiency advantage at the 128K context window
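A back-of-the-envelope check of where the advantage comes from: with top-k selection, the expensive attention step touches only k of the L cached tokens per query (the lightweight indexer still scans all L, but at much lower per-token cost).

```python
def sparse_fraction(context_len, k):
    # Fraction of key/value pairs each query touches under top-k selection,
    # relative to dense attention over the full context.
    return min(k, context_len) / context_len

# At a 128K (131072-token) context with k = 2048, each query attends to
# 2048 / 131072 = 1/64 of the keys in the main attention step.
```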