https://brook-cycle-730.notion.site/DeepSeek-V3-2-Pushing-the-Frontier-of-Open-Large-Language-Models-2c299f6063e78011b9cdf6721e0e1023?source=copy_link
Based on DeepSeek-V3, add DeepSeek Sparse Attention

Lightning Indexer: use a smaller MLP (quantized in FP8) to compute a coarse attention map and mark the top-k (k = 2048) most similar tokens

fine-grained token selection mechanism
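The indexer-plus-selection idea can be sketched as follows. This is a minimal illustration, not the actual DeepSeek implementation: the function name, the tiny ReLU MLP, and all shapes are assumptions; FP8 quantization is omitted for clarity.

```python
import numpy as np

def lightning_indexer_topk(q, k_cache, w1, w2, top_k=2048):
    """Coarse scoring + top-k token selection (sketch).

    q: (d,) current query; k_cache: (T, d) cached key-like features.
    w1: (d, h), w2: (h, d): weights of a small indexer MLP (assumed shapes).
    Returns the indices of the top_k highest-scoring cached tokens.
    """
    h = np.maximum(q @ w1, 0.0)        # small MLP hidden layer (ReLU)
    scores = (h @ w2) @ k_cache.T      # coarse similarity for every cached token
    k_eff = min(top_k, k_cache.shape[0])
    idx = np.argpartition(-scores, k_eff - 1)[:k_eff]  # unordered top-k
    return idx[np.argsort(-scores[idx])], scores       # sorted by score
```

Only these k selected positions are then passed to the full (expensive) attention, which is what makes the selection fine-grained at token level rather than block level.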

Starting from DeepSeek-V3.1, keep the dense attention frozen and train the indexer to mimic the original attention map with a KL-divergence loss (warm-up stage)
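The warm-up objective can be sketched as a per-query KL divergence between the frozen dense attention distribution and the indexer's softmaxed scores. A minimal sketch; function names and the direction KL(dense || indexer) are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def indexer_kl_loss(indexer_scores, dense_attn_probs, eps=1e-9):
    """KL(dense || indexer) for one query position (sketch).

    dense_attn_probs: (T,) attention weights from the frozen dense model.
    indexer_scores:   (T,) raw coarse scores from the indexer MLP.
    Minimizing this trains the indexer to reproduce the dense attention map.
    """
    p = dense_attn_probs
    q = softmax(indexer_scores)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

The loss is zero exactly when the indexer's score distribution matches the dense map, so gradients flow only into the indexer while the main model stays fixed.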

After the warm-up stage converges, optimize all parameters with sparse attention enabled
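Once training switches to sparse mode, attention is computed only over the indexer-selected positions. A hypothetical single-query sketch (names and shapes assumed); with idx covering all positions it reduces to ordinary dense attention:

```python
import numpy as np

def sparse_attention(q, k_cache, v_cache, idx):
    """Attend only over the selected token positions (sketch).

    q: (d,) query; k_cache, v_cache: (T, d); idx: selected positions.
    Cost scales with len(idx), not the full context length T.
    """
    k_sel = k_cache[idx]                       # gather top-k keys
    v_sel = v_cache[idx]                       # gather top-k values
    logits = k_sel @ q / np.sqrt(q.shape[-1])  # scaled dot-product scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                               # softmax over selected tokens only
    return w @ v_sel
```

Because k is fixed at 2048, per-token attention cost stops growing with context length, which is where the long-context efficiency gain comes from.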

Shows a strong efficiency advantage at a 128K context window