https://brook-cycle-730.notion.site/DeepSeek-V3-2-Pushing-the-Frontier-of-Open-Large-Language-Models-2c299f6063e78011b9cdf6721e0e1023?source=copy_link
Based on DeepSeek-V3, add DeepSeek Sparse Attention

Lightning Indexer: use a smaller MLP (quantized in FP8) to compute a coarse attention map and mark the top-k (k = 2048) most similar tokens

fine-grained token selection mechanism
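The indexer-plus-selection idea can be sketched as follows. This is a minimal illustration, not the actual DeepSeek implementation: the function name, the tiny ReLU MLP, and all shapes are assumptions; FP8 quantization is omitted for clarity.

```python
import numpy as np

def lightning_indexer_topk(q, k_cache, w1, w2, top_k=2048):
    """Coarse scoring + top-k token selection (sketch).

    q: (d,) current query; k_cache: (T, d) cached key-like features.
    w1: (d, h), w2: (h, d): weights of a small indexer MLP (assumed shapes).
    Returns the indices of the top_k highest-scoring cached tokens.
    """
    h = np.maximum(q @ w1, 0.0)        # small MLP hidden layer (ReLU)
    scores = (h @ w2) @ k_cache.T      # coarse similarity for every cached token
    k_eff = min(top_k, k_cache.shape[0])
    idx = np.argpartition(-scores, k_eff - 1)[:k_eff]  # unordered top-k
    return idx[np.argsort(-scores[idx])], scores       # sorted by score
```

Only these k selected positions are then passed to the full (expensive) attention, which is what makes the selection fine-grained at token level rather than block level.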

Starting from DeepSeek-V3.1, keep the dense attention frozen and train the indexer to mimic the original attention map with a KL-divergence loss (warm-up stage)
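The warm-up objective can be sketched as a per-query KL divergence between the frozen dense attention distribution and the indexer's softmaxed scores. A minimal sketch; function names and the direction KL(dense || indexer) are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def indexer_kl_loss(indexer_scores, dense_attn_probs, eps=1e-9):
    """KL(dense || indexer) for one query position (sketch).

    dense_attn_probs: (T,) attention weights from the frozen dense model.
    indexer_scores:   (T,) raw coarse scores from the indexer MLP.
    Minimizing this trains the indexer to reproduce the dense attention map.
    """
    p = dense_attn_probs
    q = softmax(indexer_scores)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

The loss is zero exactly when the indexer's score distribution matches the dense map, so gradients flow only into the indexer while the main model stays fixed.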

After the warm-up stage converges, optimize all parameters with sparse attention enabled
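Once training switches to sparse mode, attention is computed only over the indexer-selected positions. A hypothetical single-query sketch (names and shapes assumed); with idx covering all positions it reduces to ordinary dense attention:

```python
import numpy as np

def sparse_attention(q, k_cache, v_cache, idx):
    """Attend only over the selected token positions (sketch).

    q: (d,) query; k_cache, v_cache: (T, d); idx: selected positions.
    Cost scales with len(idx), not the full context length T.
    """
    k_sel = k_cache[idx]                       # gather top-k keys
    v_sel = v_cache[idx]                       # gather top-k values
    logits = k_sel @ q / np.sqrt(q.shape[-1])  # scaled dot-product scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                               # softmax over selected tokens only
    return w @ v_sel
```

Because k is fixed at 2048, per-token attention cost stops growing with context length, which is where the long-context efficiency gain comes from.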

Shows a strong efficiency advantage at a 128K context window