ML Benkyoukai: Mamba

Are we at the precipice of a foundation model architectural revolution?

In the majority of our study groups, we’ve looked almost exclusively at large language models, often billed as “the closest thing we have to artificial general intelligence”. These large language models all use the transformer architecture. Transformers have proved very powerful, but are hamstrung by attention’s quadratic scaling with sequence length (i.e. context window size). In the past few years there have been many attempts to modify the transformer architecture to scale better without harming performance, to varying degrees of effectiveness (Performer, Reformer, Random Feature Attention, Transformers are RNNs, LongNet, to name a few). But what if, instead of modifying the transformer to be more efficient, we could replace it with an architecture designed for efficiency from the ground up?
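To make that scaling complaint concrete, here is a minimal, illustrative sketch of single-head scaled dot-product attention (plain numpy; the function name and shapes are our own, not any particular library’s API). The (L, L) score matrix is where the quadratic cost in sequence length L comes from.

```python
# Minimal single-head scaled dot-product attention, just to show where
# the quadratic cost lives: the (L, L) score matrix grows with the
# square of the sequence length L.
import numpy as np

def attention(Q, K, V):
    """Q, K, V: (L, d) arrays for a sequence of length L."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (L, L): O(L^2) time and memory
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # (L, d)
```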

The much-hyped Mamba, released in December 2023 by Gu and Dao, promises to be just such an architecture. Mamba (so called because it uses a Selective Structured State Space Sequence model, or SSSSS🐍) replaces the attention mechanism, which scales quadratically with sequence length, with a structured state space sequence model that is linear in sequence length during training and constant (!) in sequence length when deployed. Structured state space sequence models are not Mamba’s invention; Gu and Dao’s trick is to drop the linear time-invariance constraint adopted by previous authors, then optimize hard for training and inference on GPUs to make dropping that constraint feasible. In evaluations, the Mamba architecture achieves state-of-the-art performance across modalities such as language, audio, and genomics, matching transformers twice its size.
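For intuition, here is a heavily simplified sketch of the recurrent view of a selective state space layer. Everything here (the function name `selective_ssm_scan`, the single-channel shapes, the Euler-style discretization of B) is our own illustration rather than Gu and Dao’s hardware-aware implementation; the point is just that the per-token parameters Δ, B, C depend on the input (the “selective” part), the loop is linear in sequence length, and the state carried between steps has a fixed size.

```python
# A minimal, illustrative selective state space recurrence (one channel).
# The "selective" twist: B_t, C_t and delta_t are computed from the input,
# so the recurrence is no longer linear time-invariant.
import numpy as np

def selective_ssm_scan(x, A, B_t, C_t, delta_t):
    """Run h_t = Abar_t * h_{t-1} + Bbar_t * x_t,  y_t = C_t . h_t.

    x:       (L,)   input sequence (single channel, for clarity)
    A:       (N,)   diagonal state matrix (shared across time)
    B_t:     (L, N) input projections, computed from x (selective)
    C_t:     (L, N) output projections, computed from x (selective)
    delta_t: (L,)   per-token step sizes, computed from x (selective)
    """
    L, N = B_t.shape
    h = np.zeros(N)                          # constant-size state: O(1) memory per step
    y = np.empty(L)
    for t in range(L):                       # O(L) in sequence length
        Abar = np.exp(delta_t[t] * A)        # discretize A (zero-order hold, diagonal)
        Bbar = delta_t[t] * B_t[t]           # simplified Euler-style discretization of B
        h = Abar * h + Bbar * x[t]           # update hidden state
        y[t] = C_t[t] @ h                    # read out
    return y
```

In the real architecture, Δ, B, and C are produced by learned projections of the input, there are many channels, and the recurrence is computed with a fused, hardware-aware scan rather than a Python loop.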

What should we make of this? Will the selective structured state space sequence model come to replace the transformer, as the transformer replaced the convolutional neural network before it? How does the SSSSS compare to previous recurrent architectures like the long short-term memory network or the neural Turing machine? Is a 2x improvement in parameter efficiency even a big deal when you can 100x your performance with a <1% increase in training compute using reinforcement learning from human feedback?

All these questions and more will be answered in the first session of AI Safety Tokyo’s machine learning benkyoukai. We will be going much deeper into the details than in our usual benkyoukai: expect to hear terms like activation function, skip connection, and convolution thrown around without explanation. Unlike our regular benkyoukai, we’ll expect attendees to have read the paper beforehand.

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers’ computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. […] We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

[…]

We empirically validate Mamba’s potential as a general sequence FM backbone, in both pretraining quality and domain-specific task performance, on several types of modalities and settings:

• Synthetics. On important synthetic tasks such as copying and induction heads that have been proposed as being key to large language models, Mamba not only solves them easily but can extrapolate solutions indefinitely long (>1M tokens).

• Audio and Genomics. Mamba out-performs prior state-of-the-art models such as SaShiMi, Hyena, and Transformers on modeling audio waveforms and DNA sequences, both in pretraining quality and downstream metrics (e.g. reducing FID on a challenging speech generation dataset by more than half). In both settings, its performance improves with longer context up to million-length sequences.

• Language Modeling. Mamba is the first linear-time sequence model that truly achieves Transformer-quality performance, both in pretraining perplexity and downstream evaluations. With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023).

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces, Albert Gu and Tri Dao, 2023.
