Saturday, November 08, 2025

Meet Kimi Linear: Faster long-context AI that uses less memory and beats full attention


There’s a new paper on making large AI models faster and more memory-efficient. The original abstract is written for specialists, which is perfect for readers deep in the field but tougher for everyone else. To make the ideas easier to engage with, I’m sharing the original abstract alongside two friendlier rewrites, labeled like ski slopes.
Pick your slope: the black slope links to the original, the blue version is for people who know LLMs but don’t follow the research details, and the green version is for well-educated novices who use AI but don’t speak the jargon. If it would help, I can add a bunny-hill version for absolute newcomers.

Green Slope 

Kimi Linear is a new architecture for large AI models that is faster and more memory-efficient than today’s standard “full attention” designs, while also scoring higher in our tests. It works well on short inputs, on very long inputs, and in reinforcement learning settings.

At its core is Kimi Delta Attention (KDA), a linear attention module that treats part of the model like a small, fixed-size working memory. A finer-grained set of gates helps the model decide, channel by channel, what to keep and what to forget, so it uses that limited memory more effectively. We also process text in hardware-friendly chunks and use a streamlined form of the underlying math that cuts computation without changing what the model learns.
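To make the “working memory plus gates” idea concrete, here is a toy sketch of my own, not code from the paper: a matrix-valued state that is decayed channel by channel by a gate, then updated with a delta-rule-style write that corrects what the memory currently stores. The dimensions, the random sigmoid gate, and the fixed write strength `beta` are all illustrative assumptions; the real KDA update is more involved.

```python
# Toy sketch of a gated "working memory" (not the paper's exact KDA update).
# A matrix-valued state S is decayed by a per-channel gate, then written to
# with a delta-rule-style correction. Dimensions, gate, and beta are made up.
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 4, 4                    # tiny dimensions for readability
S = np.zeros((d_k, d_v))           # the "working memory"

for t in range(8):                 # pretend token stream
    k = rng.standard_normal(d_k)   # key for token t (would come from the model)
    v = rng.standard_normal(d_v)   # value for token t
    k /= np.linalg.norm(k)         # unit-norm key keeps the update stable

    # Fine-grained gate: one forget value per key channel in (0, 1),
    # rather than a single scalar decay for the whole memory.
    a = 1.0 / (1.0 + np.exp(-rng.standard_normal(d_k)))
    beta = 0.5                     # write strength (learned in a real model)

    S = a[:, None] * S                          # forget: decay each channel separately
    prediction = S.T @ k                        # what the memory currently returns for k
    S = S + beta * np.outer(k, v - prediction)  # delta-rule write: correct the error

    out = S.T @ k                  # read-out used for this token's output
```

The finer-grained gating is the `a[:, None] * S` line: each memory channel decays independently, instead of the whole memory fading at once.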

We trained a hybrid model with 3B activated parameters (48B total) that interleaves KDA layers with Multi-Head Latent Attention (MLA) layers. With an identical training recipe, it beat a full-attention baseline on every task we checked, while using up to 75% less key-value cache memory (the short-term memory used during generation) and decoding up to 6× faster at a 1-million-token context. In practice, you can swap Kimi Linear in for a full-attention model and get better accuracy and efficiency, especially on long inputs and outputs.
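The 75% figure is easiest to see with back-of-the-envelope arithmetic. The sketch below is mine, with made-up layer counts, cache widths, and a 3-to-1 ratio of KDA to full-attention layers; only the shape of the argument matters: full-attention layers keep a cache that grows with every token, while KDA layers keep a fixed-size state.

```python
# Back-of-the-envelope KV-cache arithmetic. The layer count, cache width,
# precision, and 3:1 KDA-to-MLA ratio are illustrative assumptions,
# not the released model's configuration.
def kv_cache_bytes(context_len, n_caching_layers, kv_width, bytes_per_elem=2):
    # Only full-attention layers store keys/values for every past token.
    return context_len * n_caching_layers * kv_width * bytes_per_elem

n_layers = 48
kv_width = 1024          # cached K+V width per layer (made up for the example)
ctx = 1_000_000          # 1M-token context

full_attention = kv_cache_bytes(ctx, n_layers, kv_width)
# Hybrid: if 3 of every 4 layers are KDA (constant-size state, no growing cache),
# only a quarter of the layers still need a per-token KV cache.
hybrid = kv_cache_bytes(ctx, n_layers // 4, kv_width)

print(f"full attention: {full_attention / 2**30:.0f} GiB")
print(f"hybrid:         {hybrid / 2**30:.0f} GiB ({1 - hybrid / full_attention:.0%} smaller)")
```

Changing the assumed ratio or cache width changes the absolute numbers, but the structure of the saving is the same: only the full-attention layers pay a per-token cache cost.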

We are releasing the KDA kernel, vLLM integrations, and both pretrained and instruction-tuned checkpoints for others to use.

Blue Slope

We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule.
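For readers who want to see what “Diagonal-Plus-Low-Rank” buys you, here is a small illustration of my own, not the paper’s chunkwise algorithm: a DPLR transition can be applied to the recurrent state without ever materializing the full matrix, which drops the per-step cost from quadratic to linear in the key dimension. The decay values and the delta-rule special case shown are assumptions for illustration.

```python
# Why DPLR (diagonal-plus-low-rank) transitions are cheap to apply.
# A = diag(d) + outer(u, w) acting on state S can be computed without
# forming A, in O(d_k * d_v) instead of O(d_k^2 * d_v).
# Values below are illustrative; this is not the paper's chunkwise kernel.
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 64, 64
S = rng.standard_normal((d_k, d_v))   # recurrent state (fast-weight memory)

d = rng.uniform(0.9, 1.0, size=d_k)   # per-channel decay gates; with d = 1 the
                                      # transition reduces to the classical delta rule
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
beta = 0.5
u, w = -beta * k, k                   # low-rank part of the delta-rule update

# Dense application: build A explicitly, then multiply. O(d_k^2 * d_v).
A = np.diag(d) + np.outer(u, w)
S_dense = A @ S

# Structured application: never build A. O(d_k * d_v).
S_fast = d[:, None] * S + np.outer(u, w @ S)

assert np.allclose(S_dense, S_fast)
```

The abstract’s claim is that KDA uses a specialized variant of this family that further reduces computation compared to the general DPLR formulation while staying closer to the classical delta rule (the d = 1 case above).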

We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths.
To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
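As a rough picture of what a “layerwise hybrid” means, here is a toy sketch of an interleaved stack; the 3:1 KDA-to-MLA pattern and layer count are assumptions for illustration, not the released configuration.

```python
# Toy sketch of a layerwise hybrid stack: blocks of linear-attention (KDA)
# layers interleaved with occasional full-attention (MLA) layers.
# The 3:1 ratio and layer count are illustrative assumptions.
def layer_plan(n_layers: int, kda_per_mla: int = 3) -> list[str]:
    pattern = ["KDA"] * kda_per_mla + ["MLA"]
    return [pattern[i % len(pattern)] for i in range(n_layers)]

print(layer_plan(8))  # ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']
```

A plausible reading of the design: the few MLA layers retain exact access to the full context, while the KDA layers do the bulk of the sequence processing with a constant-size state.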

Black Slope

