Memory plays a pivotal role in human cognition, granting individuals the ability to sift through the incessant buzz of information that saturates daily life. Unlike the richly nuanced memory systems of human beings, however, large language models (LLMs) lack any comparable capability. These models indiscriminately store and process all previous inputs, which can significantly hamper their performance and escalate operational costs, especially during long-duration tasks.

In a manner reminiscent of the human brain's selective memory retention, where vital information is conserved while trivial details fade into obscurity, artificial intelligence systems also require mechanisms for smart memory management. Without such mechanisms, demands for computational resources and memory grow without bound as models become increasingly expansive.

Researchers have long sought methods to imbue AI systems with memory capabilities that closely mirror those of humans.

Traditional approaches relied heavily on preset rules to manage memory within the models, selectively retaining or discarding data based on temporal order or attention scores. Such methods have proven too mechanical, often failing to discriminate the significance of information as effectively as human memory does. While aiming for efficiency, they sometimes inadvertently compromise the overall performance of these models.

Against this backdrop, a research team at the Japanese startup Sakana AI has proposed an innovative solution: Neural Attention Memory Models (NAMMs). The approach draws inspiration from the pivotal role of evolutionary processes in shaping human memory systems. By harnessing evolutionary algorithms to train a dedicated neural network, NAMMs can actively choose and retain significant information, thereby enhancing both efficiency and model performance.

NAMMs operate on principles akin to those guiding human memory: they assess the importance of information based on its long-term utility.

The underlying mechanism comprises three core components: the feature extraction system, the memory management network, and the evolutionary optimization strategy.
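To make the overall pipeline concrete, the sketch below scores each cached token from its attention history and keeps only tokens above a threshold. The scoring function, threshold, and array shapes here are placeholders standing in for NAMMs' learned memory network, not the actual implementation.

```python
import numpy as np

def prune_kv_cache(attention_matrix, score_fn, threshold=0.0):
    """Toy illustration: score each cached token from its attention
    history and keep only tokens whose score exceeds a threshold.
    `score_fn` is a stand-in for the learned memory network."""
    # attention_matrix: (num_queries, num_cached_tokens) of attention weights
    scores = np.array([score_fn(attention_matrix[:, t])
                       for t in range(attention_matrix.shape[1])])
    return scores > threshold  # boolean retention mask, one entry per token

# Example with a stand-in scorer: mean attention received, minus a bias.
rng = np.random.default_rng(0)
attn = rng.random((8, 16))
mask = prune_kv_cache(attn, score_fn=lambda col: col.mean() - 0.5)
print(mask.shape)  # one keep/drop decision per cached token
```

Tokens whose mask entry is False would simply be evicted from the KV cache, which is what shrinks memory use at inference time.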

First comes the feature extraction mechanism. NAMMs apply a short-time Fourier transform (STFT) to the column vectors of the attention matrix. Specifically, the method uses a Hann window of size 32 to produce spectrogram representations with 17 complex-valued frequency bins. This representation not only preserves the frequency characteristics of attention values over time but also significantly compresses the data. Experimental evidence suggests that this spectral representation is notably more effective than simply using raw attention values or manually crafted features.
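A minimal sketch of this feature extraction, assuming a hop size and framing scheme the article does not specify: a real FFT over a Hann window of 32 samples yields exactly 32 // 2 + 1 = 17 complex frequency bins per frame, matching the figure quoted above.

```python
import numpy as np

def stft_features(attn_column, win_size=32, hop=16):
    """Spectrogram of one attention-matrix column. The window size of
    32 matches the paper's description; the hop length is an
    illustrative assumption."""
    window = np.hanning(win_size)
    frames = []
    for start in range(0, len(attn_column) - win_size + 1, hop):
        segment = attn_column[start:start + win_size] * window
        frames.append(np.fft.rfft(segment))  # 17 complex values per frame
    return np.stack(frames)                  # shape: (num_frames, 17)

spec = stft_features(np.random.default_rng(1).random(128))
print(spec.shape[1])  # 17 frequency bins
```

Each token's attention history is thereby compressed into a small, fixed-width spectral summary that the memory network can score.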

Next comes the design of the Backward Attention Memory (BAM) architecture.

This innovation forms a critical cornerstone of NAMMs, introducing a distinctive attention mechanism that permits tokens to attend only to relevant "future" content in the KV cache. This design creates a rivalry between tokens, allowing the model to learn which tokens are the most informative to retain. For instance, when confronted with repeated sentences or words, the model favors the most recently encountered instance, as it carries the fuller context.
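The "backward" masking idea can be sketched as the mirror image of a causal mask: each cached token attends only to tokens that arrive at or after it. The details below (mask construction, plain softmax) are illustrative assumptions, not the paper's exact BAM architecture.

```python
import numpy as np

def backward_attention(scores):
    """Mask attention scores so token i attends only to positions
    j >= i (the reverse of a causal mask), then softmax."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool))    # allow j >= i only
    masked = np.where(mask, scores, -np.inf)
    # standard softmax over the allowed positions
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = backward_attention(np.random.default_rng(2).random((4, 4)))
print(w[-1])  # the newest token can only attend to itself
```

Under this masking, older duplicates of a repeated token see their newer copies and can be outcompeted by them, which matches the retention behavior described above.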

For the optimization strategy, the research team adopted the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Traditional gradient-descent methods struggle with the discrete decisions inherent in memory management. In contrast, CMA-ES emulates natural evolutionary processes to directly optimize non-differentiable objective functions. The team further employed an incremental evolution scheme, beginning with a single task and gradually expanding the number of training tasks, which acts as a regularizer and improves the model's generalization.
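To show why a gradient-free method fits here, the sketch below runs a simplified (mu, lambda) evolution strategy on a non-differentiable step-function objective. Full CMA-ES additionally adapts a covariance matrix over the search distribution; this stripped-down version is a stand-in to illustrate the principle, not the algorithm the team used.

```python
import numpy as np

def simple_es(objective, dim, pop=32, elite=8, sigma=0.5, iters=60, seed=3):
    """Minimal (mu, lambda) evolution strategy: sample a population
    around the current mean, keep the elite, and recentre on them.
    Lower objective values are better."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(iters):
        samples = mean + sigma * rng.standard_normal((pop, dim))
        fitness = np.array([objective(x) for x in samples])
        elite_idx = np.argsort(fitness)[:elite]  # best `elite` candidates
        mean = samples[elite_idx].mean(axis=0)
        sigma *= 0.97                            # slowly anneal the step size
    return mean

# A non-differentiable objective (stepped distance to the target):
# gradient descent gets no signal here, but the ES still converges.
target = np.ones(4)
best = simple_es(lambda x: np.floor(10 * np.abs(x - target)).sum(), dim=4)
print(best)
```

The same mechanism extends to memory management: the "genome" would parameterize the scoring network, and fitness would be downstream task performance with the pruned cache.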

The team selected Llama 3-8b as the foundational model for training NAMMs, conducting comprehensive evaluations on the LongBench, InfiniteBench, and ChouBun benchmarks.

The results show that NAMMs significantly improve the performance of the Llama 3-8b transformer, surpassing existing hand-crafted memory management techniques such as H2O and L2.

For instance, on the LongBench benchmark, NAMMs reduced the KV cache to just 25% of its original size while simultaneously achieving an 11% performance gain. On InfiniteBench, performance rose from a baseline of 1.05% to 11%, all while the cache shrank to a mere 40% of its original size.

Additionally, NAMMs exhibit a noteworthy property: impressive zero-shot transferability. The research team found that NAMMs trained solely on language tasks transfer seamlessly to other architectures and modalities. For example, when applied to the Llava Next Video-7B model, NAMMs delivered commendable results on the LongVideoBench and MLVU benchmarks, achieving a 1% gain on visual tasks while reducing the video-frame cache by 72%.

In the domain of reinforcement learning, NAMMs yielded a 9% performance increase in D4RL benchmark tests utilizing decision transformers, concurrently shrinking the cache size to 81% of its original size.

Delving deeper into the operational schema of NAMMs, the research team uncovered a sophisticated memory management strategy emerging within the model.

By observing retention patterns across layers, they found that the model tends to preserve more older tokens in the early and middle layers, likely because these layers process and aggregate information across long distances. In data-intensive coding tasks, by contrast, the model learned to retain a relatively larger share of tokens.

Ultimately, NAMMs continue Sakana AI's established research methodology of drawing insights from nature and optimizing AI systems by mimicking natural evolutionary processes. This research trajectory aligns with the company's expertise in model merging and evolutionary optimization.

Similar to the automated “evolution” algorithm previously developed by Sakana AI, which autonomously identifies and integrates superior models, NAMMs operationalize evolutionary algorithms to enhance memory management systems, achieving continual performance improvements without human intervention.

This approach has helped propel the year-old startup to a $1.5 billion valuation, on the back of $210 million in Series A funding.

Looking forward, the research team aspires to explore more complex memory model designs, potentially investigating finer-grained feature extraction techniques or examining how NAMMs can be synergized with other optimization technologies.

They stated, “This work is just the beginning of exploring the potential of our new class of memory models, and we anticipate that it may unlock a multitude of new opportunities for the evolution of Transformers in future generations.”