Introduction
Today, the biggest problem with Large Language Models, especially for individuals, organizations, and researchers who want to run them locally, is cost versus performance. In a paper published by a team from NVIDIA, this problem is attributed to the $O(n^2)$ complexity of full attention (which drives up computational cost) and the KV cache it requires (the model's temporary memory for a given conversation, which drives up memory cost). As a solution, the paper presents PostNAS (Post Neural Architecture Search) and the related JetBlock. According to the paper, models created with this method match or exceed the performance of comparably sized models without sacrificing speed.
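To make these two costs concrete, here is a minimal sketch (my own illustration, not code from the paper): the attention-score matrix holds one entry per pair of tokens, which is where the $O(n^2)$ term comes from, and during generation each layer additionally caches the K and V vectors of every past token, which is the KV cache.

```python
# Minimal full attention in numpy; illustrative only.
import numpy as np

def full_attention(q, k, v):
    # q, k, v: (n, d). `scores` is (n, n), so compute and memory for this
    # step grow quadratically with the sequence length n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n, d = 4096, 64
q = k = v = np.random.default_rng(0).standard_normal((n, d))
print(full_attention(q, k, v).shape)  # (4096, 64), via a 4096 x 4096 score matrix
```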
What is PostNAS?
Post Neural Architecture Search is the main contribution proposed in the paper. PostNAS is essentially a search pipeline that discovers an efficient model architecture starting from an already trained model.
In this pipeline, a pre-trained 'full attention' model is taken as the starting point, and its MLP layers are kept frozen. A coarse-to-fine search is then conducted over attention block designs: it first determines the best positions for the remaining full attention layers, then selects the best linear attention block (or designs a new one), and finally chooses the most suitable hyperparameters.
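The flow can be summarized with a toy sketch. Everything below is my illustration of the described stages, not the paper's code: the block names and hyperparameter space are made up, and the scoring function is a stand-in for actually evaluating the frozen-MLP model on benchmark tasks.

```python
from itertools import combinations, product

CANDIDATE_BLOCKS = ["gated-linear", "mamba-style", "jetblock"]   # illustrative names
HPARAM_SPACE = {"conv_kernel": [3, 5, 7], "state_dim": [64, 128]}

def proxy_score(block, placement, hparams):
    # Stand-in for "run the frozen-MLP model with this attention design and
    # measure accuracy/throughput"; a deterministic dummy so the sketch runs.
    return (sum(placement) + len(block) * 10 + sum(hparams.values())) % 97

def postnas_search(num_layers=12, full_attention_budget=2):
    # Coarse step 1: decide which layers keep full attention.
    placement = max(combinations(range(num_layers), full_attention_budget),
                    key=lambda p: proxy_score("full", p, {}))
    # Coarse step 2: with the placement fixed, choose the linear attention block.
    block = max(CANDIDATE_BLOCKS, key=lambda b: proxy_score(b, placement, {}))
    # Fine step: with the block fixed, tune its hyperparameters.
    hparams = max((dict(zip(HPARAM_SPACE, vals))
                   for vals in product(*HPARAM_SPACE.values())),
                  key=lambda h: proxy_score(block, placement, h))
    return placement, block, hparams

print(postnas_search())
```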
A key point of this approach, as noted in the paper, is that models attempting to balance efficiency and accuracy by hybridizing full and linear attention already exist; however, these models tend to lag behind full attention models, especially on benchmarks like MMLU and MMLU-Pro.
Concretely, the search proceeds through four main stages:
- Placement and elimination of full attention layers (to preserve accuracy)
- Selection of the linear attention block (choosing the most suitable linear attention block once the full attention layers are placed). Because candidate blocks are evaluated directly on the frozen pre-trained model, PostNAS makes the small proxy models (roughly 50M-150M parameters) used in traditional architecture search unnecessary.
- New attention block design (JetBlock: thanks to dynamic convolution, this block adapts its convolution kernels to the input text rather than using static kernels like other blocks do, making it powerful and flexible; a conceptual sketch follows this list)
- Hardware-aware architecture search (selecting hyperparameters with generation throughput on the target hardware in mind)
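To illustrate the 'dynamic' part of JetBlock's dynamic convolution, here is a conceptual numpy sketch (my simplification; the paper's actual block combines this idea with linear attention and differs in its details): the convolution kernel is generated from the input itself, so each sequence is filtered with its own kernel instead of one static kernel shared by all inputs.

```python
# Conceptual "dynamic convolution"; my simplification, not the paper's JetBlock.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d, k = 16, 32, 4                    # tokens, feature dim, kernel size

x = rng.standard_normal((seq_len, d))        # input token features
w_gen = rng.standard_normal((d, k)) * 0.1    # kernel-generator weights (learned)

# Generate the convolution kernel from the input itself (dynamic), instead
# of using one fixed kernel for every sequence (static).
kernel = np.tanh(x.mean(axis=0) @ w_gen)     # shape (k,), depends on x

# Apply it as a causal convolution over the time axis.
y = np.zeros_like(x)
for t in range(seq_len):
    for j in range(k):
        if t - j >= 0:
            y[t] += kernel[j] * x[t - j]

print(y.shape)  # (16, 32): same shape as x, filtered with an input-dependent kernel
```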
The Purpose of PostNAS
In the previous section, we mentioned that the KV cache is a central problem; the paper attributes models' hardware bottlenecks and slowness (the result of trade-offs) largely to it.
The paper states that when the KV cache size is held constant and the number of parameters is increased, speed is maintained while accuracy improves. It also notes that Jet-Nemotron-2B runs up to 47 times faster than the Qwen3-1.7B-Base model while producing more accurate results. The reason is that discarding the slow full attention layers in the earlier steps shrinks the KV cache enormously, increasing memory read/write speed. The freed memory budget, in turn, lets the newly added parameters be used effectively, so no slowdown occurs.
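A back-of-the-envelope calculation (my illustrative dimensions, not the paper's numbers) shows why dropping most full attention layers shrinks the KV cache so dramatically: only the remaining full attention layers must cache K and V per past token, while linear attention layers keep a small constant-size state.

```python
# KV cache size ~ 2 (K and V) x full-attn layers x KV heads x head dim
# x context length x bytes per value; all dimensions below are illustrative.
def kv_cache_gb(full_attn_layers, kv_heads=8, head_dim=128,
                seq_len=64_000, bytes_per_val=2):
    return (2 * full_attn_layers * kv_heads * head_dim
            * seq_len * bytes_per_val) / 1e9

print(kv_cache_gb(28))  # all 28 layers full attention: ~7.3 GB
print(kv_cache_gb(2))   # hybrid keeping 2 full attention layers: ~0.5 GB
```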
As a result, we see that the primary purpose of PostNAS (and, by extension, JetBlock) is to reduce cost. A model like Jet-Nemotron is claimed to run faster, more energy-efficiently, and more smoothly on less powerful hardware.
A secondary purpose addresses researchers and developers: the techniques and discovery process used in PostNAS reduce not only the cost but also the risk of traditional Large Language Model architecture exploration. After all, pre-training is a difficult and expensive process that must be designed and iterated on, and it is typically undertaken only by large companies with the financial capacity for it. The paper instead concentrates on the question: 'How can we take an already proven model and make it more efficient?'. NVIDIA, after all, was already known on the Large Language Model scene for its regular Nemotron models before this paper.
Conclusion
NVIDIA's paper encourages us to rethink the efficiency-accuracy balance through the innovations of the PostNAS method and the JetBlock design, offering concrete answers to the trade-offs in that balance. With these advances, the paper aims to deliver models like Jet-Nemotron to researchers, developers, and end users alike.