Resurging Recurrence, renegade 12-step program to build AGI god and other stories
Resurging Recurrence
There has been a resurgence of new recurrent neural network architectures: S4, Mamba, RWKV, etc. They aim to fix the drawbacks that older recurrent networks such as the LSTM and GRU have relative to Transformers with self-attention, while retaining their benefits.
To me, recurrence is something architectures require in some form, for the following reasons.
If you want an AI system that can process long-range sequences, you need some form of memory that you can write to and read from. RNNs provide this with their hidden state, which you can think of as an abstract form of memory. Vanilla transformers avoid the need for this by attending to all of their inputs, but their O(n²) complexity is not reasonable for extremely long-range sequence tasks.
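To make the contrast concrete, here is a minimal sketch of the two memory models. None of this comes from any particular paper; the shapes and weight names are made up for illustration. The recurrent cell reads and writes a fixed-size hidden state, while self-attention has to keep and touch every past token.

```python
import jax
import jax.numpy as jnp

d = 64                                   # model width (arbitrary, for illustration)
key = jax.random.PRNGKey(0)
Wh, Wx = jax.random.normal(key, (2, d, d))

def rnn_step(h, x):
    # The only memory is `h`: a fixed-size vector, no matter how long the past is.
    return jnp.tanh(h @ Wh + x @ Wx)

def attention_step(past_tokens, x):
    # The memory is every past token; each new token attends to all of them,
    # so total work over a length-n sequence grows as O(n^2).
    scores = past_tokens @ x / jnp.sqrt(d)       # (n,)
    return jax.nn.softmax(scores) @ past_tokens  # (d,)
```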
The second reason is more esoteric. If you think the way to build generally capable agents is to take inspiration from animals (including humans) as a template for intelligence, even in a black-box manner, then we should avoid architectures that require storing the raw inputs as memory. While humans can conceivably do something like self-attention over all the sensory inputs at a particular moment, we certainly don't store snapshots of raw sensory data over time and reprocess them. This might seem like an artificial constraint for computers, but my intuition is that this constraint, along with the constraints of embodiment, is essential for turning sensory data into intelligent behaviour. Maybe having an infinite raw-sensory memory in your cognitive architecture impairs you.
Let’s revisit the pros and cons of the older recurrent neural networks:
Pro: During inference, the computational cost per step of LSTMs and GRUs doesn’t depend on sequence length, because the entire past is compressed into the memory/hidden-state representation.
Con: Unlike transformers, LSTMs/GRUs cannot be parallelized across the sequence length during training (sketched below).
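As a rough illustration of why (again, a toy sketch, not any specific LSTM/GRU implementation): the nonlinearity wraps the dependence on the previous hidden state, so step t cannot begin until step t-1 has finished, and a scan over the sequence has to run strictly one step at a time.

```python
import jax
import jax.numpy as jnp

d = 64
key = jax.random.PRNGKey(0)
Wh, Wx = jax.random.normal(key, (2, d, d))

def nonlinear_cell(h, x):
    # The tanh wraps the dependence on h, so step t needs step t-1's result first.
    h_next = jnp.tanh(h @ Wh + x @ Wx)
    return h_next, h_next                # (carry, output) as lax.scan expects

xs = jax.random.normal(key, (1024, d))   # a toy sequence of 1024 token embeddings
h0 = jnp.zeros(d)
_, hidden_states = jax.lax.scan(nonlinear_cell, h0, xs)  # strictly one step at a time
```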
The new variants want the best of both. The key thing that unifies them is the use of linear recurrence. If the update to the memory/hidden state is linear, then training can at the very least be cleverly parallelized across the sequence length using something called an associative scan: any operator that respects associativity, applied across a sequence, can be parallelized this way.
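Here is a minimal sketch of that idea, assuming a simple element-wise linear recurrence h_t = a_t * h_{t-1} + b_t rather than the full S4/Mamba parameterization. Composing two such updates yields another update of the same form, which is exactly the associativity that a parallel scan (here `jax.lax.associative_scan`) exploits.

```python
import jax
import jax.numpy as jnp

def combine(left, right):
    # Composing h -> a_l*h + b_l and then h -> a_r*h + b_r gives
    # h -> (a_r*a_l)*h + (a_r*b_l + b_r), which is again of the same linear form.
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

key_a, key_b = jax.random.split(jax.random.PRNGKey(0))
T, d = 1024, 64
a = jax.random.uniform(key_a, (T, d))    # per-step decay / transition (toy values)
b = jax.random.normal(key_b, (T, d))     # per-step input contribution (toy values)

# O(log T) parallel depth instead of T strictly sequential steps.
_, h = jax.lax.associative_scan(combine, (a, b), axis=0)
# h[t] matches the sequential loop h_t = a_t*h_{t-1} + b_t with h_{-1} = 0.
```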
There are lots of other differences; S4 (which preceded Mamba), motivated by linear time-invariant state-space models from control theory, could ...