History of State Space Models (SSM) in 2022

Community Article Published April 11, 2024
A French version is available on my blog.

Introduction

In the previous article, we defined a State Space Model (SSM) from a continuous-time system, then discretized it to obtain its recurrent and convolutional views. The point is to be able to train the model in its convolutional form and then run inference in its recurrent form over very long sequences.

image
Figure 1: Image from blog post « Structured State Spaces: Combining Continuous-Time, Recurrent, and Convolutional Models » by Albert GU et al. (2022)

This view was introduced by Albert GU in his LSSL and S4 papers published in 2021, S4 being to SSMs what "Attention is all you need" is to transformers.
In this article, we will review the SSM literature published in 2022. Papers appearing in 2023 will be covered in the next article. The aim is to show how these models evolved over the months while remaining concise (i.e. I won't go into all the details of the papers listed). During 2022, the various advances focused on applying different discretization algorithms and on replacing the HiPPO matrix with a simpler one.


Theoretical models

In this section, we will review the theoretical work behind the proposed improvements to the S4 architecture. Then, in a different section, we'll look at concrete applications to different tasks (audio, vision, etc.).

S4 V2

On March 4, 2022, the authors of S4 updated their paper to include a section on the importance of the HiPPO matrix (see section 4.4 of the most recent version of the paper).
In a nutshell, it reports the results of ablations performed on the sequential CIFAR-10 dataset. Instead of using an SSM with the HiPPO matrix, the authors tried various parameterizations such as a random dense matrix and a random diagonal matrix.

image
Figure 2: Accuracy on the CIFAR-10 validation split, from figure 3 of the S4 paper

The use of HiPPO is therefore important. But is the performance due to its specific intrinsic qualities, or could any Normal Plus Low-Rank (NPLR) matrix suffice?

image
Figure 3: Accuracy on the CIFAR-10 validation split with different initializations and parameterizations, taken from figure 4 of the S4 paper

Initializing an NPLR matrix with HiPPO significantly enhances performance. Thus, according to these experiments, the HiPPO matrix is essential to obtain a high-performance model.
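
For concreteness, here is a minimal NumPy sketch of one common way of writing the HiPPO-LegS matrix discussed above (conventions and normalizations vary slightly between the HiPPO, LSSL and S4 papers, and the function name is mine, so treat this as an illustration rather than the exact S4 code):

```python
import numpy as np

def hippo_legs(N: int):
    """One common convention for the (negated) HiPPO-LegS state matrix A and input vector B:
      A[n, k] = -sqrt((2n+1)(2k+1))  if n > k
      A[n, n] = -(n + 1)
      A[n, k] = 0                    if n < k
      B[n]    = sqrt(2n+1)
    """
    n = np.arange(N)
    A = np.tril(-np.sqrt(np.outer(2 * n + 1, 2 * n + 1)), k=-1) - np.diag(n + 1.0)
    B = np.sqrt(2 * n + 1.0)
    return A, B

A, B = hippo_legs(4)
print(np.round(A, 3))   # lower-triangular matrix with diagonal -1, -2, -3, -4
```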

The authors of S4 have further developed their work, which they presented on June 24, 2022 in the paper How to Train Your HiPPO. This is an extremely detailed paper of over 39 pages.

In this paper, the authors focus on a more intuitive interpretation of SSMs as convolutional models whose kernel is a linear combination of particular basis functions, which leads to several generalizations and new methods.
For example, they prove that the $\mathbf{A}$ matrix of S4 produces exponentially scaled Legendre polynomials (LegS). This gives the system a better ability to model long-term dependencies via very smooth kernels.
The authors also derive a new SSM that produces approximations of truncated Fourier functions (FouT). This method generalizes short-time Fourier transforms and local convolutions (i.e. a standard ConvNet). This SSM can also encode state-of-the-art functions to solve classical memory tasks.
Note that it is mainly HiPPO-FouT that is introduced in this paper; HiPPO-LegS had been introduced in the original HiPPO paper two years earlier, as had HiPPO-LegT (truncated Legendre polynomials).

image
Figure 4: The different HiPPO variants

The colors represent the first 4 basis functions $K_n(t)$ (the convolution kernel) for each of the methods (we invite the reader to look at Table 1 of the paper to see what $K_n(t)$ equals for each method).

In addition, the authors also study the time step $\Delta$, which, independently of any notion of discretization, can be interpreted simply as controlling the length of the dependencies or the width of the SSM kernels. The authors also detail how to choose a good value of $\Delta$ for a given task.

This work improves the S4 results by more than 5.5 points on the LRA benchmark of TAY, DEHGHANI et al. (2020):

image
Figure 5: S4 v2 benchmark results

The model resulting from this paper is generally referred to in the literature as "S4 V2" or "S4 updated", as opposed to the "original S4" or "S4 V1".

DSS: Diagonal State Spaces

On March 27, 2022, Ankit GUPTA introduced Diagonal State Spaces (DSS) in his paper Diagonal State Spaces are as Effective as Structured State Spaces.
It seems that following this paper, he and Albert GU began working together on an updated version of it (GU subsequently appears as co-author of v2 and v3 of the article) and on S4D (see next section).
The main thing to remember is that this approach is significantly simpler than S4. Indeed, DSS is based on diagonal state matrices (i.e. without the low-rank correction of S4, i.e. without the HiPPO matrix) which, when initialized appropriately, perform as well as or better than the original S4. Using a diagonal matrix in place of the HiPPO matrix for $\mathbf{A}$ has since become standard practice.

Nevertheless, let's dwell on the few complexities/limitations contained in this paper. Listing them will help us understand the contributions of the following methods, which aim to simplify things even further.


1. Discretization
The DSS uses the same system of differential equations as the S4:

$$\begin{aligned} x' &= \mathbf{A}x + \mathbf{B}u \\ y &= \mathbf{C}x \end{aligned}$$

However, it uses a different discretization to obtain the convolutional and recurrent views, namely the zero-order hold (ZOH), which assumes that the sampled signal is constant between two sampling points, instead of the bilinear discretization.
Below is a table comparing the values of $\mathbf{\bar{A}}$, $\mathbf{\bar{B}}$ and $\mathbf{\bar{C}}$ for the two discretizations in the recurrent view, as well as the expression of the convolution kernel in the convolutional view:

| Discretization | Bilinear | ZOH |
|---|---|---|
| Recurrence | $\mathbf{\bar{A}} = (\mathbf{I} - \frac{\Delta}{2} \mathbf{A})^{-1}(\mathbf{I} + \frac{\Delta}{2} \mathbf{A})$ <br> $\mathbf{\bar{B}} = (\mathbf{I} - \frac{\Delta}{2} \mathbf{A})^{-1} \Delta \mathbf{B}$ <br> $\mathbf{\bar{C}} = \mathbf{C}$ | $\mathbf{\bar{A}} = e^{\mathbf{A}\Delta}$ <br> $\mathbf{\bar{B}} = (\mathbf{\bar{A}} - \mathbf{I})\mathbf{A}^{-1}\mathbf{B}$ <br> $\mathbf{\bar{C}} = \mathbf{C}$ |
| Convolution | $\mathbf{\bar{K}} = (\mathbf{\bar{C}}\mathbf{\bar{B}},\ \mathbf{\bar{C}}\mathbf{\bar{A}}\mathbf{\bar{B}},\ \dots,\ \mathbf{\bar{C}}\mathbf{\bar{A}}^{L-1}\mathbf{\bar{B}})$ | $\mathbf{\bar{K}} = (\ \mathbf{C} e^{\mathbf{A}\cdot k\Delta} (e^{\mathbf{A}\Delta} - \mathbf{I})\mathbf{A}^{-1}\mathbf{B}\ )_{0 \leq k < L}$ |

For the ZOH, after running through the calculations, we end up with $y_k = \sum_{j=0}^k \bar{C}\bar{A}^j\bar{B}\, u_{k-j} = \sum_{j=0}^k \bar{K}_j\, u_{k-j}$.

Computing $y$ from $u$ and $\bar{K}$ is then done by Fast Fourier Transform (FFT) in $O(L \log L)$, with $L$ the length of the sequence, by computing the product of two polynomials of degree $L-1$.
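
To make these formulas concrete, here is a small self-contained NumPy/SciPy sketch (my own illustration with hypothetical function names, not the DSS code) that discretizes a toy SSM with both rules, builds the kernel $\bar{K}_k = \bar{C}\bar{A}^k\bar{B}$ naively, and applies the causal convolution via FFT:

```python
import numpy as np
from scipy.linalg import expm

def discretize(A, B, dt, method="zoh"):
    I = np.eye(A.shape[0])
    if method == "bilinear":
        Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
        Bb = np.linalg.solve(I - dt / 2 * A, dt * B)
    else:  # zero-order hold
        Ab = expm(A * dt)
        Bb = (Ab - I) @ np.linalg.solve(A, B)
    return Ab, Bb

def ssm_kernel(Ab, Bb, C, L):
    # K_k = C Ab^k Bb for k = 0..L-1 (naive O(N*L) loop, just for illustration)
    K, x = [], Bb
    for _ in range(L):
        K.append((C @ x).item())
        x = Ab @ x
    return np.array(K)

def causal_conv_fft(K, u):
    L = len(u)
    n = 2 * L  # zero-pad to avoid circular wrap-around
    return np.fft.irfft(np.fft.rfft(K, n) * np.fft.rfft(u, n), n)[:L]

rng = np.random.default_rng(0)
N, L = 4, 16
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # a stable-ish toy state matrix
B, C = rng.standard_normal((N, 1)), rng.standard_normal((1, N))
Ab, Bb = discretize(A, B, dt=0.1, method="zoh")
u = rng.standard_normal(L)
y = causal_conv_fft(ssm_kernel(Ab, Bb, C, L), u)
print(y.shape)  # (16,)
```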



2. DSSsoftmax and DSSexp

Short version

GUPTA formulates a proposition for obtaining DSS that are as expressive as S4, resulting in two variants: DSSexp and DSSsoftmax. The key information about them can be summarized in the following table:

| Approach | DSSexp | DSSsoftmax |
|---|---|---|
| Convolutional view | $K = \bar{K}_{\Delta, L}(\Lambda,\ \mathbb{I}_{1 \leq i \leq N},\ \widetilde{w}) = \widetilde{w} \cdot \Lambda^{-1} (e^{\Lambda\Delta} - I) \cdot \text{elementwise-exp}(P)$ | $K = \bar{K}_{\Delta, L}(\Lambda,\ ((e^{L\lambda_i\Delta} - 1)^{-1})_{1\leq i \leq N},\ w) = w \cdot \Lambda^{-1} \cdot \text{row-softmax}(P)$ |
| Recurrent view | $\bar{A} = \mathrm{diag}(e^{\lambda_1\Delta}, \ldots, e^{\lambda_N\Delta})$ <br> $\bar{B} = \left(\lambda_i^{-1} (e^{\lambda_i\Delta} - 1) \right)_{1\leq i \leq N}$ | $\bar{A} = \mathrm{diag}(e^{\lambda_1\Delta}, \ldots, e^{\lambda_N\Delta})$ <br> $\bar{B} = \left( {e^{\lambda_i\Delta} - 1 \over \lambda_i (e^{\lambda_i\Delta L} - 1)} \right)_{1\leq i \leq N}$ |
| Interpretation | Acts like the forget gate of an LSTM | If $\Re(\lambda) \ll 0$: retains local information; if $\Re(\lambda) \gg 0$: can capture information at very long range |

So we are working here on ℂ and not ℝ.
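
As an illustration of the DSSexp kernel formula above, it can be computed in a few lines (a naive sketch with hypothetical function and variable names; the official implementation adds normalizations and numerical-stability tricks):

```python
import numpy as np

def dss_exp_kernel(Lambda, w, dt, L):
    """DSS_exp kernel following formula (a):
    K_k = sum_i w_i * (exp(lambda_i*dt) - 1) / lambda_i * exp(lambda_i * k * dt)."""
    k = np.arange(L)
    P = Lambda[:, None] * k[None, :] * dt               # P[i, k] = lambda_i * k * dt
    coeff = w * (np.exp(Lambda * dt) - 1.0) / Lambda     # w . Lambda^-1 (e^{Lambda dt} - I)
    return (coeff @ np.exp(P)).real                      # real part -> real-valued kernel

rng = np.random.default_rng(0)
N, L, dt = 8, 32, 0.1
Lambda = -np.exp(rng.standard_normal(N)) + 1j * rng.standard_normal(N)  # Re(lambda) < 0
w = rng.standard_normal(N) + 1j * rng.standard_normal(N)
print(dss_exp_kernel(Lambda, w, dt, L)[:5])
```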


Long version

GUPTA formulates the following proposition to obtain DSS that are as expressive as S4:

Let $K \in \mathbb{R}^{1\times L}$ be the kernel of length $L$ of a given state space $(A, B, C)$ and sampling time $\Delta > 0$, where $A \in \mathbb{C}^{N \times N}$ is diagonalizable over $\mathbb{C}$ with eigenvalues $\lambda_1,\ldots,\lambda_N$ such that $\forall i$, $\lambda_i \neq 0$ and $e^{L\lambda_i\Delta} \neq 1$. Let $P \in \mathbb{C}^{N \times L}$ with $P_{i,k} = \lambda_i k\Delta$ and $\Lambda$ the diagonal matrix of the $\lambda_1,\ldots,\lambda_N$. Then there exists $\widetilde{w}, w \in \mathbb{C}^{1\times N}$ such that:

  • (a): $K = \bar{K}_{\Delta, L}(\Lambda,\ (1)_{1 \leq i \leq N},\ \widetilde{w}) = \widetilde{w} \cdot \Lambda^{-1} (e^{\Lambda\Delta} - I) \cdot \text{elementwise-exp}(P)$
  • (b): $K = \bar{K}_{\Delta, L}(\Lambda,\ ((e^{L\lambda_i\Delta} - 1)^{-1})_{1\leq i \leq N},\ w) = w \cdot \Lambda^{-1} \cdot \text{row-softmax}(P)$

(a) suggests that we can parameterize the state space via $\Lambda, \widetilde{w} \in \mathbb{C}^N$ and compute the kernel as shown. Unfortunately, in practice, the real part of the elements of $\Lambda$ may become positive during training, making the kernel unstable for long inputs. To solve this problem, the authors propose two variants: DSSexp and DSSsoftmax.

2.1 Convolutional view
In DSSexp, the real parts of $\Lambda$ are constrained to be negative. We then have $\Lambda = - \text{elementwise-exp}(\Lambda_\mathrm{re}) + i\cdot \Lambda_\mathrm{im}$ and $\Delta = \exp(\Delta_{\log}) \in \mathbb{R}_{> 0}$. $K$ is then computed as in part (a) of the proposition.
In DSSsoftmax, each row of $\text{elementwise-exp}(P)$ is normalized by the sum of its elements (row-softmax). We have $\Lambda = \Lambda_\mathrm{re} + i\cdot \Lambda_\mathrm{im}$ and $\Delta = \exp(\Delta_{\log}) \in \mathbb{R}_{> 0}$. $K$ is then computed as in part (b) of the proposition.
Note that the softmax over $\mathbb{C}$ is not always defined (for example $\text{softmax}(0, i\pi)$); the authors use a corrected version of the softmax to prevent this problem (see Appendix A.2 of the paper).

2.2 Recurrence view
In DSSexp, using the recurrence formulas in the table above, we obtain $\bar{A} = \mathrm{diag}(e^{\lambda_1\Delta}, \ldots, e^{\lambda_N\Delta})$ and $\bar{B} = \left(\lambda_i^{-1} (e^{\lambda_i\Delta} - 1) \right)_{1\leq i \leq N}$, where in both equalities $\lambda_i$ is the $i$-th eigenvalue of $\Lambda$.
Since $\bar{A}$ is diagonal, we can compute each $x_{i,k}$ independently: $x_{i,k} = e^{\lambda_i\Delta} x_{i,k-1} + \lambda_i^{-1} (e^{\lambda_i\Delta} - 1)u_k$.
We can then deduce that if $|\lambda_i|\Delta \approx 0$, we have $x_{i,k} \approx x_{i,k-1}$, allowing the state to copy history over many time steps. On the other hand, if $\mathrm{Re}(\lambda_i)\Delta \ll 0$, then $x_{i,k} \approx -\lambda_i^{-1}u_k$ and the information from previous time steps is forgotten, similar to a "forget" gate in LSTMs.

In DSSsoftmax, using the recurrence formula in the table above, we obtain $\bar{A} = \mathrm{diag}(e^{\lambda_1\Delta}, \ldots, e^{\lambda_N\Delta})$ and $\bar{B} = \left( {e^{\lambda_i\Delta} - 1 \over \lambda_i (e^{\lambda_i\Delta L} - 1)} \right)_{1\leq i \leq N}$.
Hence $x_{i,k} = e^{\lambda_i\Delta} x_{i,k-1} + {u_k(e^{\lambda_i\Delta} - 1) \over \lambda_i (e^{\lambda_i\Delta L} - 1)}$.
Note that $e^{\lambda_i\Delta}$ can be numerically unstable. We must then distinguish two cases depending on the sign of $\mathrm{Re}(\lambda)$, introducing an intermediate state $\widetilde{x}_{k}$.
• If $\mathrm{Re}(\lambda) \leq 0$: $\widetilde{x}_{k} = e^{\lambda\Delta}\,\widetilde{x}_{k-1} + u_k\ ,\quad x_k = \widetilde{x}_k \cdot {(e^{\lambda\Delta} - 1) \over \lambda (e^{\lambda\Delta L} - 1) }$.

In particular, if $\mathrm{Re}(\lambda) \ll 0$ then $\widetilde{x}_k \approx u_k$ and $x_k \approx u_k / \lambda$, resulting in a focus on local information (previous steps are ignored).

• If $\mathrm{Re}(\lambda) > 0$: $\widetilde{x}_{k} = \widetilde{x}_{k-1} + e^{-k\lambda\Delta} u_k\ ,\quad x_k = \widetilde{x}_k \cdot {e^{\lambda\Delta (k-(L-1))} \over \lambda}\cdot {e^{-\lambda\Delta}-1 \over e^{-\lambda\Delta L} - 1 }$

Similarly, if $\mathrm{Re}(\lambda) \gg 0$ then $\widetilde{x}_0 \approx u_0$ and $\widetilde{x}_k \approx \widetilde{x}_{k-1} \approx u_0$, so $x_{k < L-1} \approx 0$ and $x_{L-1} \approx u_0 / \lambda$: the model can capture information at very long range. In practice, the authors of S4D note that $\mathrm{Re}(\lambda) \gg 0$ does not work if $L$ is very large (the kernel $K(t) = C e^{tA} B$ explodes as $t \rightarrow \infty$).


3. Initialization
The real and imaginary parts of $w$ are initialized from $\mathcal{N}(0,1)$, and $\Delta_{\log}$ is initialized so that $\Delta = \exp(\Delta_{\log})$ lies between 0.001 and 0.1 (uniformly on a log scale). $\Lambda$ is initialized via the eigenvalues of the HiPPO matrix; the authors wonder whether a simpler initialization of $\Lambda$ could be found, but note that a random initialization leads to poorer results.

Regarding the results, DSS has been tested on LRA and Speech Commands by WARDEN (2018):

image
Figure 6: DSS results on LRA

The DSS (in softmax or exp versions) achieves better average results than the original S4 for this benchmark. DSSsoftmax seems to perform slightly better than DSSexp. Another interesting aspect of this paper is that it is the first to reproduce the S4 results, thus confirming that SSMs pass this benchmark.

image
Figure 7: DSS results on Speech Commands

On Speech Commands, the S4 maintains its advantage over DSS.

To dig deeper
The official implementation is available on GitHub.
This paper was the subject of a spotlight talk at NeurIPS 2022.



S4D: the diagonal S4

On June 23, 2022, GU, GUPTA et al. introduced S4D in their paper On the Parameterization and Initialization of Diagonal State Space Models.
Initialization of the DSS state matrix A\mathbf{A} relies on a particular approximation of the S4 HiPPO matrix. While the S4 matrix has a mathematical interpretation for dealing with long-range dependencies, the effectiveness of the diagonal approximation remains theoretically unexplained.
With S4D, the authors introduce a diagonal SSM combining the best of S4 computation and parameterization and DSS initialization. The result is a very simple, theoretically sound and empirically effective method.

A comparison of the three methods is given in Table 1 of the paper:

image
Figure 8: Comparison of S4, DSS and S4D

S4D can use either the bilinear discretization of S4 or the ZOH discretization of DSS.

In S4D, the discretized convolution kernel of the equation $y = u \ast \mathbf{\overline{K}}$ is computed as follows:
$$\mathbf{\overline{K}}_\ell = \sum_{n = 0}^{N-1} \mathbf{C}_n \mathbf{\overline{A}}_n^\ell \mathbf{\overline{B}}_n \implies \mathbf{\overline{K}} = (\mathbf{\overline{B}}^\top \circ \mathbf{C}) \cdot \mathcal{V}_L(\mathbf{\overline{A}})$$
where:
$\circ$ denotes the Hadamard (elementwise) matrix product,
$\cdot$ a standard matrix product,
$\mathcal{V}_L$ is the Vandermonde matrix, i.e.:
$$\mathcal{V} = \begin{bmatrix} 1&\alpha _{1}&{\alpha _{1}}^{2}&\dots &{\alpha _{1}}^{n-1}\\ 1&\alpha _{2}&{\alpha _{2}}^{2}&\dots &{\alpha _{2}}^{n-1}\\ 1&\alpha _{3}&{\alpha _{3}}^{2}&\dots &{\alpha _{3}}^{n-1}\\ \vdots &\vdots &\vdots & \vdots \\ 1&\alpha _{m}&{\alpha _{m}}^{2}&\dots &{\alpha _{m}}^{n-1} \end{bmatrix}.$$

In other words, the coefficient in row $i$ and column $j$ is $V_{i,j}={\alpha _{i}}^{j-1}$.

Finally, in S4D,

$$\mathbf{\overline{K}} = \begin{bmatrix} \mathbf{\overline{B}}_0 \mathbf{C}_0 & \dots & \mathbf{\overline{B}}_{N-1} \mathbf{C}_{N-1} \end{bmatrix} \begin{bmatrix} 1 & \mathbf{\overline{A}}_0 & \mathbf{\overline{A}}_0^2 & \dots & \mathbf{\overline{A}}_0^{L-1} \\ 1 & \mathbf{\overline{A}}_1 & \mathbf{\overline{A}}_1^2 & \dots & \mathbf{\overline{A}}_1^{L-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \mathbf{\overline{A}}_{N-1} & \mathbf{\overline{A}}_{N-1}^2 & \dots & \mathbf{\overline{A}}_{N-1}^{L-1} \\ \end{bmatrix} \qquad \text{where } \mathcal{V}_L(\mathbf{\overline{A}})_{n, \ell} = \mathbf{\overline{A}}_n^\ell$$

All of this can be computed in $O(N+L)$ operations and space.
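
A naive NumPy transcription of this Vandermonde formulation is given below (my own sketch, with hypothetical names). Note that it materializes the full $N \times L$ Vandermonde matrix, whereas the actual S4D implementation evaluates the product in $O(N+L)$ without storing it; the factor 2 and the real part correspond to the implicit conjugate pairs discussed just below:

```python
import numpy as np

def s4d_kernel(Lambda, B, C, dt, L):
    """Diagonal SSM kernel via the Vandermonde formulation (naively materialized).
    Lambda, B, C: (N,) complex parameters; returns a real kernel of length L with
    K[l] = 2 * Re( sum_n C_n * Ab_n**l * Bb_n ), using the ZOH discretization."""
    Ab = np.exp(Lambda * dt)                    # discretized diagonal state matrix
    Bb = (Ab - 1.0) / Lambda * B                # discretized input vector
    V = Ab[:, None] ** np.arange(L)[None, :]    # Vandermonde matrix V[n, l] = Ab_n**l  (O(N*L) here)
    return 2 * ((Bb * C) @ V).real              # factor 2 / real part: implicit conjugate pairs

rng = np.random.default_rng(0)
N, L, dt = 32, 64, 1e-2
Lambda = -0.5 + 1j * np.pi * np.arange(N)       # eigenvalues with negative real part
B = np.ones(N, dtype=complex)
C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
print(s4d_kernel(Lambda, B, C, dt, L).shape)    # (64,)
```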

The parametrization of the various matrices is as follows:

  • $\mathbf{A} = -\exp(\Re(\mathbf{A})) + i \cdot \Im(\mathbf{A})$.
    The authors point out that the exponential can be replaced by any positive function.
  • $\mathbf{B}$ is initialized to 1 and then trained.
  • $\mathbf{C}$ is initialized randomly with a standard deviation of 1 and then trained.

Note that S4 works with real numbers, whereas S4D works with complex numbers, parameterizing with a state size of $N/2$ and implicitly adding the conjugate pairs to the parameters. We then have the equivalent of $N$ real parameters, ensuring that the output is real.

Concerning initialization, the authors introduce two schemes:

  • S4D-Inv, an approximation of S4-LegS: $\mathbf{A}_n = -\frac{1}{2} + i \frac{N}{\pi} \left( \frac{N}{2n+1}-1 \right)$
  • S4D-Lin, an approximation of S4-FouT: $\mathbf{A}_n = -\frac{1}{2} + i \pi n$.

We invite the reader to consult part 4 of the paper for more details on these equations.
From an interpretability point of view, the real part of $\mathbf{A}_n$ controls the rate of decay, and the imaginary part of $\mathbf{A}_n$ controls the oscillation frequency, of the basis function $K_n(t) = e^{t\mathbf{A}}\mathbf{B}$.
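
For reference, a direct transcription of these two initializations (the reparameterization $\mathbf{A} = -\exp(\Re(\mathbf{A})) + i \cdot \Im(\mathbf{A})$ described above would then be applied on top during training; the function names are mine):

```python
import numpy as np

def s4d_inv_init(N):
    # S4D-Inv, approximation of S4-LegS:  A_n = -1/2 + i * (N/pi) * (N/(2n+1) - 1)
    n = np.arange(N)
    return -0.5 + 1j * (N / np.pi) * (N / (2 * n + 1) - 1)

def s4d_lin_init(N):
    # S4D-Lin, approximation of S4-FouT:  A_n = -1/2 + i * pi * n
    n = np.arange(N)
    return -0.5 + 1j * np.pi * n

A = s4d_lin_init(8)
print(A.real)  # constant decay rate of -1/2
print(A.imag)  # linearly increasing oscillation frequencies
```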

Finally, the authors put forward a few results:

  1. Computing the kernel with a softmax (as in DSS) instead of the Vandermonde product doesn't make much difference.
  2. Training B always gives better results.
  3. There are no significant differences between the two possible discretizations.
  4. Restricting the real part of A leads to better results (although not significantly).
  5. All the modifications tested for the initialization degraded the results, namely: applying a coefficient to the imaginary part, using a random imaginary part, using a random real part, or randomizing both.

As this method is much easier to implement than the others (the Vandermonde product requires just two lines of code), S4D has replaced S4 in common usage (in Table 6 of the Mamba paper, for example, the authors use the name S4 to designate S4D).


To dig deeper
The official implementation is available on GitHub.
On December 1, 2022, GUPTA et al. presented a sequel to DSS with Simplifying and Understanding State Space Models with Diagonal Linear RNNs, which introduces DLR. They get rid of the discretization step and propose a model based on diagonal linear RNNs (DLR) that can operate on around 1 million positions (unlike classic RNNs). The code for this model is available on GitHub.



GSS: Gated State Space

Five days after S4D, on June 27, 2022, MEHTA, GUPTA et al. introduce GSS in their paper Long Range Language Modeling via Gated State Spaces.
In this work, they focus on autoregressive sequence modeling (whereas previous work on SSMs focused mainly on sequence classification tasks) of English books, GitHub source code and ArXiv mathematics articles. They show that their layer, called Gated State Space (GSS), trains significantly faster than DSS (2 to 3 times faster). They also show that exploiting self-attention to model local dependencies further improves GSS performance.

image
Figure 9: Comparison of DSS vs GSS. Models are trained on sequences of 4K length and then evaluated on sequences of up to 65K tokens.

Based on the observation that SSMs (S4/DSS) train more slowly than expected on TPU, the authors modified the architecture to reduce the dimensionality of specific operations that proved to be bottlenecks. These modifications are inspired by well-established empirical observations regarding the effectiveness of gating units (Language Modeling with Gated Convolutional Networks by DAUPHIN et al. (2016), GLU Variants Improve Transformer by SHAZEER (2020), etc.). More specifically, the authors draw inspiration from the paper Transformer Quality in Linear Time by HUA et al. (2022). The latter showed, with their FLASH model, that replacing the feed-forward layer of the Transformer with gating units makes it possible to use a weaker single-headed attention with minimal quality loss. They call this component the Gated Attention Unit (GAU).

image
Figure 10: The Gated Attention Unit.
This is not exactly the same figure as the one in the paper: I have flipped it so that the input is at the bottom rather than at the top, to facilitate the parallel with the Mega figure shown below.

The authors of GSS have therefore extended the use of gating units to SSMs, and observe a reduction in dimensionality when performing FFT operations.

image
Figure 11: Adaptation of GAU to SSM.

Note that, unlike HUA et al., the authors do not observe much benefit from using ReLU² or Swish activations instead of GELU, hence its retention.
In addition, GSS uses a time step fixed at 1 (the authors observing that this reduces the computational time required to create the kernels and simplifies their calculation).
A particularly interesting point is that, in contrast to the observations made for S4 and DSS, the model's performance on language modeling tasks turns out to be much less sensitive to initialization, which allows the state space variables to be initialized randomly and the model to be trained successfully. This is a very important result, as it shows that neither the HiPPO matrix (S4) nor the HiPPO initialization (DSS) is necessary.
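
Schematically, a gated SSM block in the spirit of GSS can be sketched as below. This is my own simplified reading of Figure 11, not the authors' code: the exact dimensions, projections and the DSS kernel itself (replaced here by a toy causal filter) are all assumptions.

```python
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def toy_ssm(u, decay=0.9):
    """Stand-in for the DSS layer: a causal, per-channel EMA-like filter."""
    y = np.zeros_like(u)
    for t in range(u.shape[0]):
        y[t] = u[t] + (decay * y[t - 1] if t > 0 else 0)
    return y

def gss_block(x, W_v, W_u, W_up, W_o):
    v = gelu(x @ W_v)           # gating branch, expanded dimension
    u = gelu(x @ W_u)           # branch fed to the SSM, reduced dimension
    y = toy_ssm(u)              # sequence mixing (the only non-pointwise operation)
    g = y @ W_up                # project back up to the gating dimension
    return (v * g) @ W_o + x    # multiplicative gate + residual connection

rng = np.random.default_rng(0)
L, d, d_ff, d_ssm = 10, 8, 16, 4
x = rng.standard_normal((L, d))
W_v, W_u = rng.standard_normal((d, d_ff)) / np.sqrt(d), rng.standard_normal((d, d_ssm)) / np.sqrt(d)
W_up, W_o = rng.standard_normal((d_ssm, d_ff)) / np.sqrt(d_ssm), rng.standard_normal((d_ff, d)) / np.sqrt(d_ff)
print(gss_block(x, W_v, W_u, W_up, W_o).shape)  # (10, 8)
```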

As for the GSS-Transformer hybrid, it simply consists of sparingly interspersing traditional Transformer blocks with GSS layers. The hybrid model achieves lower perplexity than the pure SSM model:

Figure 12: Performance of the GSS-Transformer hybrid model.

To dig deeper
The official implementation is available on GitHub.



Mega

On September 21, 2022, MA, ZHOU et al. published Mega: Moving Average Equipped Gated Attention.
Mega is a transformer with a single-headed attention mechanism using the GAU gating system, equipped with a damped exponential moving average (EMA) to incorporate a positional inductive bias.

image
Figure 13: Overview of the Mega, figure designed from my understanding of the paper.
The authors replace the ReLU² of the GAU with a Laplace function, which is more stable (cf. figure 4 in the paper).

The authors also propose a variant, Mega-chunk, which effectively divides the entire sequence into several fixed-length chunks. Here, they take up the principle already present and explained in the FLASH model (see figure 4 of the paper on this model). This offers linear complexity in terms of time and space with minimal loss of quality.

image
Figure 14: The Mega chunk

This offers linear complexity, simply applying attention locally to each fixed-length chunk.
More precisely, the sequences of queries, keys and values are divided into chunks of length $c$: for example $\mathbf{Q} = \{\mathbf{Q}_1, \dots, \mathbf{Q}_k\}$, where $k = n/c$ is the number of chunks. The attention operation is applied individually to each chunk, giving a linear complexity $\mathcal{O}(kc^2) = \mathcal{O}(nc)$ with respect to $n$.
Nevertheless, this method suffers from a critical limitation, namely the loss of contextual information from other blocks. But the EMA sublayer mitigates this problem by capturing local contextual information in the vicinity of each token whose results are used as inputs to the attention sublayer. In this way, the effective context exploited by attention at block level can extend beyond the block boundary.

Mega is extremely competitive, as it becomes the best model on the LRA:

image
Figure 15: Mega's results on the LRA benchmark

So what's a transformer doing in a blog post about SSM? Let's look at damped EMA to understand the link between Mega and S4D.

Reminder on "classic" EMA:
The equation of a "classical" EMA is $𝐲_t = 𝜶 ⊙ 𝐱_t + (1-𝜶) ⊙ 𝐲_{t-1}$, with $𝜶 \in [0,1]^d$ the EMA coefficient representing the degree of weighting decrease and $⊙$ the Hadamard (elementwise) product.
A higher 𝜶 discounts older observations more quickly.
We therefore impose an inductive bias here: the weight of the dependency between two tokens decreases exponentially over time with an input-agnostic 𝜶 factor. This property favors local dependencies and limits long-term dependencies.
The EMA calculation can be represented as n individual convolutions that can be efficiently calculated by FFT.

EMA used in the Mega:
Mega uses a multidimensional "damped" EMA. In the damped EMA equation, $𝐲_t = 𝜶 ⊙ 𝐱_t + (1-𝜶 ⊙ 𝜹) ⊙ 𝐲_{t-1}$, a parameter $𝜹 \in [0,1]^d$ is introduced, representing the damping factor. In addition, $x$ is expanded to $h$ dimensions via an expansion matrix $𝜷 \in \mathbb{R}^{d \times h}$.
The equation then becomes $𝐲_{t, j} = 𝜼_j^{\intercal} 𝐡_t^{(j)}$ with $𝐡_t^{(j)} = 𝜶_j ⊙ 𝐮_t^{(j)} + (1-𝜶_j ⊙ 𝜹_j) ⊙ 𝐡_{t-1}^{(j)}$, where $𝜼 \in \mathbb{R}^{d \times h}$ is the projection matrix that maps the $h$-dimensional hidden state back to the 1-dimensional output $𝐲_{t, j} \in \mathbb{R}$.

Proof that the multidimensional "damped" EMA can be computed as a convolution, and therefore by FFT (setting $d = 1$ for $𝜶$ and $𝜹$):

We have $𝐲_t = 𝜼^{\intercal} 𝐡_t$ with $𝐡_t = 𝜶 ⊙ 𝐮_t + (1-𝜶 ⊙ 𝜹) ⊙ 𝐡_{t-1}$. Let us write $ϕ = 1-𝜶 ⊙ 𝜹$.
Then: $𝐡_t = 𝜶 ⊙ 𝐮_t + (1-𝜶 ⊙ 𝜹) ⊙ 𝐡_{t-1} = 𝜶 ⊙ 𝜷 𝐱_t + ϕ ⊙ 𝐡_{t-1}$
    and $𝐲_t = 𝜼^{\intercal} 𝐡_t = 𝜼^{\intercal} (𝜶 ⊙ 𝜷 𝐱_t + ϕ ⊙ 𝐡_{t-1})$
Then, unrolling the two equations above, we explicitly obtain:
Step 0: $𝐡_1 = 𝜶 ⊙ 𝜷 𝐱_1 + ϕ ⊙ 𝐡_0$
Step 1: $𝐡_2 = 𝜶 ⊙ 𝜷 𝐱_2 + ϕ ⊙ 𝐡_1 = 𝜶 ⊙ 𝜷 𝐱_2 + ϕ ⊙ (𝜶 ⊙ 𝜷 𝐱_1 + ϕ ⊙ 𝐡_0) = 𝜶 ⊙ 𝜷 𝐱_2 + ϕ ⊙ 𝜶 ⊙ 𝜷 𝐱_1 + ϕ^2 ⊙ 𝐡_0$
...

And likewise:
Step 0: $𝐲_1 = 𝜼^{\intercal} (𝜶 ⊙ 𝜷 𝐱_1 + ϕ ⊙ 𝐡_0) = 𝜼^{\intercal} 𝜶 ⊙ 𝜷 𝐱_1 + 𝜼^{\intercal} ϕ ⊙ 𝐡_0$
Step 1: $𝐲_2 = 𝜼^{\intercal} 𝜶 ⊙ 𝜷 𝐱_2 + 𝜼^{\intercal} ϕ ⊙ 𝐡_1 = 𝜼^{\intercal} 𝜶 ⊙ 𝜷 𝐱_2 + 𝜼^{\intercal} ϕ ⊙ 𝜶 ⊙ 𝜷 𝐱_1 + 𝜼^{\intercal} ϕ^2 ⊙ 𝐡_0$
...
Step $t$: $𝐲_t = 𝜼^{\intercal} 𝜶 ⊙ 𝜷 𝐱_t + 𝜼^{\intercal} ϕ ⊙ 𝜶 ⊙ 𝜷 𝐱_{t-1} + \dots + 𝜼^{\intercal} ϕ^{t-1} ⊙ 𝜶 ⊙ 𝜷 𝐱_1 + 𝜼^{\intercal} ϕ^t ⊙ 𝐡_0$.

And so $𝐲 = \mathbf{K} * 𝐱 + 𝜼^{\intercal} ϕ^t ⊙ 𝐡_0$ with $\mathbf{K} = (𝜼^{\intercal} (𝜶 ⊙ 𝜷),\ 𝜼^{\intercal} (ϕ ⊙ 𝜶 ⊙ 𝜷),\ \dots,\ 𝜼^{\intercal}(ϕ^t ⊙ 𝜶 ⊙ 𝜷)) \in \mathbb{R}^n$.
K\mathbf{K} is calculated in the Mega via the Vandermonde product, which reminds us of the method used in S4D.
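
A tiny NumPy check of the derivation above, in the scalar case $d = 1$ with an $h$-dimensional hidden state and $𝐡_0 = 0$ (the numerical values of $𝜶$, $𝜹$, $𝜷$, $𝜼$ are arbitrary choices of mine): the recurrent computation and the convolution with the kernel $\mathbf{K}$ give the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
h, L = 4, 16                      # expansion dimension h, sequence length L (d = 1 here)
alpha, delta = 0.6, 0.5           # EMA coefficient and damping factor, both in [0, 1]
beta = rng.standard_normal(h)     # expansion vector  (1 -> h)
eta = rng.standard_normal(h)      # projection vector (h -> 1)
phi = 1 - alpha * delta
x = rng.standard_normal(L)

# Recurrent computation: h_t = alpha * beta * x_t + phi * h_{t-1},  y_t = eta . h_t
hs = np.zeros(h)
y_rec = np.zeros(L)
for t in range(L):
    hs = alpha * beta * x[t] + phi * hs
    y_rec[t] = eta @ hs

# Convolutional computation with kernel K_j = eta . (phi**j * alpha * beta)   (h_0 = 0)
K = np.array([eta @ (phi**j * alpha * beta) for j in range(L)])
y_conv = np.array([K[: t + 1][::-1] @ x[: t + 1] for t in range(L)])

print(np.allclose(y_rec, y_conv))  # True
```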


To dig deeper
The official implementation is available on GitHub.
The model is also available at Transformers.
For more details on the links between Mega and S4, the reader is invited to consult the messages exchanged between Albert GU and the Mega authors in the comments to the Mega submission on Open Review. In summary, by linking the SSM discretization step to the damped EMA, Mega can be viewed as a hybrid SSM/attention model that simplifies S4 to be real-valued rather than complex-valued.



Liquid-S4: Liquid Structural State-Space Models

On September 26, 2022, HASANI, LECHNER et al. put online Liquid Structural State-Space Models introducing Liquid-S4. In this paper, the authors use the structural SSM (S4) formulation to obtain instances of linear liquid networks possessing the approximation capabilities of both S4 and LTC (liquid time-constant).
LTC neural networks are continuous-time causal neural networks with an input-dependent state transition module, enabling them to learn to adapt to inputs during inference. This can be seen as a kind of selection mechanism.
To learn more about liquid networks, you can consult a previous paper by the same authors: Liquid Time-constant Networks (2021).
In the context of Liquid-S4, all you need to know is that the state of an LTC at each time step is given by :

$$\frac{d\textbf{x}(t)}{dt} = - \underbrace{\Big[\mathbf{A} + \mathbf{B} \odot f(\textbf{x}(t),\textbf{u}(t), t, \theta)\Big]}_\text{Liquid time-constant} \odot \textbf{x}(t) + \mathbf{B} \odot f(\textbf{x}(t), \textbf{u}(t), t, \theta).$$

where:

  • $\textbf{x}^{(N \times 1)}(t)$ is the hidden state vector of size $N$,
  • $\textbf{u}^{(m \times 1)}(t)$ is an input signal with $m$ features,
  • $\mathbf{A}^{(N \times 1)}$ is a time-constant state transition mechanism,
  • $\mathbf{B}^{(N \times 1)}$ is a bias vector,
  • $\odot$ denotes the Hadamard product,
  • $f(.)$ is a bounded nonlinearity parametrized by $\theta$.

In practice, an SSM using a liquid network is formulated via the following system of differential equations:

$$\begin{aligned} x' &= (\mathbf{A} + \mathbf{B} u) x + \mathbf{B}u\\ y &= \mathbf{C}x \end{aligned}$$

This dynamic system can be solved efficiently via the same parameterization as the S4, giving rise to an additional convolutional kernel that takes into account the similarities of the shifted signals. The resulting model is Liquid-S4. Let's explain this with a little math.

The recurrent view of Liquid-S4 is obtained by discretizing the system using the trapezoidal rule (bilinear form). The result is:

$$x_k = \big( \overline{\textbf{A}} + \overline{\textbf{B}}~u_k\big)~x_{k-1} + \overline{\textbf{B}}~u_k,~~~~~~y_k = \overline{\textbf{C}}~x_k$$

As with S4, the convolutional view is obtained by unrolling the recursive view in time (assuming x1=0x_{-1} = 0):

$$\begin{align} x_0 &= \overline{\textbf{B}} u_0 \\ \nonumber y_0 &= \overline{\textbf{C}}\overline{\textbf{B}} u_0 \\\\ \nonumber x_1 &= \overline{\textbf{A}}\overline{\textbf{B}} u_0 + \overline{\textbf{B}} u_1 {\color{violet} +~ \overline{\textbf{B}}^2 u_0 u_1} \\ \nonumber y_1 &= \overline{\textbf{C}}\overline{\textbf{A}}\overline{\textbf{B}} u_0 + \overline{\textbf{C}}\overline{\textbf{B}} u_1 {\color{violet} + \overline{\textbf{C}}\overline{\textbf{B}}^2 u_0 u_1} \\\\ \nonumber x_2 &= \overline{\textbf{A}}^2\overline{\textbf{B}} u_0 + \overline{\textbf{A}}\overline{\textbf{B}} u_1 + \overline{\textbf{B}} u_2 {\color{violet} +~ \overline{\textbf{A}}\overline{\textbf{B}}^2 u_0 u_1 +~ \overline{\textbf{A}}\overline{\textbf{B}}^2 u_0 u_2 +~ \overline{\textbf{B}}^2 u_1 u_2 +~ \overline{\textbf{B}}^3 u_0 u_1 u_2 } \\ \nonumber y_2 &= \overline{\textbf{C}}\overline{\textbf{A}}^2\overline{\textbf{B}} u_0 + \overline{\textbf{C}}\overline{\textbf{A}}\overline{\textbf{B}} u_1 + \overline{\textbf{C}}\overline{\textbf{B}} u_2 {\color{violet} +~ \overline{\textbf{C}}\overline{\textbf{A}}\overline{\textbf{B}}^2 u_0 u_1 +~ \overline{\textbf{C}}\overline{\textbf{A}}\overline{\textbf{B}}^2 u_0 u_2 +~ \overline{\textbf{C}}\overline{\textbf{B}}^2 u_1 u_2 +~ \overline{\textbf{C}}\overline{\textbf{B}}^3 u_0 u_1 u_2 },~~ \dots \nonumber\\\\ y_k &= \overline{\textbf{C}}\overline{\textbf{A}}^k\overline{\textbf{B}} u_0 + \overline{\textbf{C}}\overline{\textbf{A}}^{k-1}\overline{\textbf{B}} u_1 + \dots \overline{\textbf{C}}\overline{\textbf{A}}\overline{\textbf{B}} u_{k-1} + \overline{\textbf{C}}\overline{\textbf{B}} u_k \nonumber~+ \\ & {\color{violet} \sum_{p=2}^\mathcal{P} \forall \binom{k+1}{p} \text{~of~} u_i u_{i+1}~\dots~u_p~ \overline{\textbf{C}}\overline{\textbf{A}}^{(k+1-p-i)}\overline{\textbf{B}}^p u_i u_{i+1}~\dots~u_p} \nonumber ~~\text{for} ~i \in \mathbb{Z} \text{~and~} i \geq 0\\\\ &~~~~~\rightarrow~~~~~~~~y =~ \overline{\textbf{K}} * u \nonumber +{\color{violet} \overline{\textbf{K}}_{\text{liquid}} * u_{\text{correlations}}} \end{align}$$


You can see two colors in the formulas. They correspond to two types of weight configurations:

  1. In black, the weights of the individual independent time inputs, i.e. the convolutional kernel of the S4.
  2. In purple, the weights associated with all orders of auto-correlation of the input signal. This is an additional input correlation kernel, called the liquid kernel by the authors.

Finally, the convolution kernel is expressed as follows:

$$\overline{\textbf{K}}_{\text{liquid}} \in \mathbb{R}^{\tilde{L}} := \mathcal{K}_L(\overline{\textbf{C}},\overline{\textbf{A}},\overline{\textbf{B}}) := \big(\overline{\textbf{C}}\overline{\textbf{A}}^{(\tilde{L}-i-p)}\overline{\textbf{B}}^p\big)_{i \in [\tilde{L}],~ p \in [\mathcal{P}]} = \big( \overline{\textbf{C}}\overline{\textbf{A}}^{\tilde{L}-2}\overline{\textbf{B}}^2, \dots, \overline{\textbf{C}}\overline{\textbf{B}}^p \big)$$

The authors then show that this is efficiently computable via a process similar to what was applied in S4 (HiPPO, Woodbury, Inverse Fourier Transform, etc.). We refer the reader to Algorithm 1 in the paper for further details.
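
To illustrate the recurrent view above, here is a naive sketch of the Liquid-S4 recurrence (my own reading, with hypothetical names: the input-dependent term $\overline{\textbf{B}}u_k$ acts elementwise on the state, consistently with the Hadamard products of the LTC equation; the real implementation works with the convolutional and liquid kernels instead):

```python
import numpy as np

def liquid_s4_recurrence(Ab, Bb, C, u):
    """Recurrent view of Liquid-S4: x_k = (Ab + Bb u_k) x_{k-1} + Bb u_k, y_k = C x_k,
    with x_{-1} = 0 and Bb*u_k taken to act elementwise on the state (an assumption)."""
    x = np.zeros((Ab.shape[0], 1))
    ys = []
    for uk in u:
        x = Ab @ x + (Bb * uk) * x + Bb * uk   # extra (Bb u_k) ⊙ x term vs. a plain SSM
        ys.append((C @ x).item())
    return np.array(ys)

rng = np.random.default_rng(0)
N, L = 4, 12
Ab = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))   # toy discretized matrices
Bb = 0.1 * rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
print(liquid_s4_recurrence(Ab, Bb, C, rng.standard_normal(L)))
```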


Tested on the LRA, this approach appears to be the best. Only Mega, published a few days earlier and therefore absent from the paper, does better:

Figure 16: Liquid-S4 results on the LRA benchmark

HASANI, LECHNER et al. also apply their model to the Speech Commands, sCIFAR and BIDMC Vital Signs datasets of PIMENTEL et al. and establish the new state of the art.

To dig deeper
The official implementation is available on GitHub.
Exchanges on Open Review.
The official LTC implementation is available on GitHub.



S5: Simplified State Space Layers for Sequence Modeling

Chronologically, Simplified State Space Layers for Sequence Modeling by SMITH, WARRINGTON and LINDERMAN, introducing the S5 model, was unveiled on August 9, 2022, i.e. before Mega and Liquid-S4. However, I'm covering this paper after them because on October 6, 2022, the S5 authors updated their publication, improving their model by more than 5 points on the LRA compared with V1. What's more, they propose a comparison covering all the SSMs released in 2022. It therefore seemed more appropriate to approach S5 from its V2.

In S5, the authors propose to replace the formulation of S4 using a bank of independent single-input/single-output (SISO) SSMs with a multi-input/multi-output (MIMO) SSM, which has a reduced latent dimension.

Figure 17: S4 vs. S5 internal behavior

The reduced latent dimension of the MIMO system allows the use of the parallel scan algorithm, which simplifies the calculations required to apply the S5 layer as a sequence-to-sequence transformation. The resulting model thus loses the convolutional view of the SSM and focuses solely on the recurrent view (obtained by ZOH discretization). The authors' approach is therefore to operate in the time domain rather than the frequency domain. They use a diagonal approximation of the HiPPO matrix, enabling efficient initialization and parameterization for their MIMO system.

Figure 18: Full comparison of S4 vs. S5 behaviour

As the use of parallel scan is a component used in other SSMs in the future (notably Mamba), let's take a closer look at how it works in S5, so as to familiarize ourselves with this algorithm right from this article. The easiest way to do this is to use the example given in appendix H of the paper, where the authors apply it to a sequence of length L=4L = 4.

To calculate a parallel scan, two things are required:

  • The initial elements on which the scan will operate.
    We define the initial elements of a sequence of length $L$ as $c_{1:L}$, so that each element $c_k$ is the tuple $c_k = (c_{k,a}, c_{k,b}) := (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_k)$. In the case of $L = 4$, we therefore have $(\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_1)$, $(\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_2)$, $(\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_3)$ and $(\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_4)$.
  • A binary associative operator $\bullet$ used to combine the elements, i.e. one satisfying $I \bullet J \bullet K = (I \bullet J)\bullet K = I \bullet (J \bullet K)$.

If we were to proceed sequentially with the scan, setting $s_0:=(\mathbf{I},0)$, we would have to perform 4 computations to obtain the 4 outputs $s_i$:
$s_1 = s_0 \bullet c_1 = (\mathbf{I},\enspace 0) \bullet (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_1) = (\overline{\mathbf{A}}\mathbf{I},\enspace \overline{\mathbf{A}}0+\overline{\mathbf{B}}u_1) = (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_1)$
$s_2 = s_1 \bullet c_2 = (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_1) \bullet (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_2) = (\overline{\mathbf{A}}^2,\enspace \overline{\mathbf{A}}\overline{\mathbf{B}}u_1 + \overline{\mathbf{B}}u_2)$
$s_3 = s_2 \bullet c_3 = (\overline{\mathbf{A}}^2,\enspace \overline{\mathbf{A}}\overline{\mathbf{B}}u_1 + \overline{\mathbf{B}}u_2) \bullet (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_3) = (\overline{\mathbf{A}}^3,\enspace \overline{\mathbf{A}}^2\overline{\mathbf{B}}u_1 + \overline{\mathbf{A}}\overline{\mathbf{B}}u_2 + \overline{\mathbf{B}}u_3)$
$s_4 = s_3 \bullet c_4 = (\overline{\mathbf{A}}^3,\enspace \overline{\mathbf{A}}^2\overline{\mathbf{B}}u_1 + \overline{\mathbf{A}}\overline{\mathbf{B}}u_2 + \overline{\mathbf{B}}u_3) \bullet (\overline{\mathbf{A}},\enspace \overline{\mathbf{B}}u_4) = (\overline{\mathbf{A}}^4,\enspace \overline{\mathbf{A}}^3\overline{\mathbf{B}}u_1 + \overline{\mathbf{A}}^2\overline{\mathbf{B}}u_2 + \overline{\mathbf{A}}\overline{\mathbf{B}}u_3 + \overline{\mathbf{B}}u_4)$

To obtain the xix_i states, we would then have to take the second element of each sis_i tuple.

Proceeding sequentially is not the most efficient way, since it is possible to parallelize the calculation of a recurrence with parallel scan. Here's an illustration of how it works with our sequence of size LL = 4 :

image
Figure 19: How parallel scan works in S5

Again, to obtain the states $x_i$, we take the second element of each tuple $s_i$.
Notice that it is possible to compute $s_2$ and the partial combination $c_3 \bullet c_4$ in parallel, then $s_1$, $s_3$ and $s_4$ in parallel. This reduces the number of sequential steps from 4 to just 2. More generally, the sequential depth of the parallel scan is $O(\log(L))$.
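
Below is a small sketch of this scan (my own code, with hypothetical names): the binary operator of S5 and a divide-and-conquer prefix scan, written sequentially here although the two recursive calls and the final combines are exactly the parts that could run in parallel, checked against the naive recurrence.

```python
import numpy as np

def combine(e1, e2):
    """Binary associative operator of S5:  (A1, b1) . (A2, b2) = (A2 @ A1, A2 @ b1 + b2)."""
    A1, b1 = e1
    A2, b2 = e2
    return A2 @ A1, A2 @ b1 + b2

def prefix_scan(elems):
    """All-prefix scan by divide and conquer: scan each half, then combine the last
    prefix of the left half into every prefix of the right half."""
    if len(elems) == 1:
        return elems
    mid = len(elems) // 2
    left, right = prefix_scan(elems[:mid]), prefix_scan(elems[mid:])
    return left + [combine(left[-1], r) for r in right]

rng = np.random.default_rng(0)
N, L = 3, 4
Ab = 0.9 * np.eye(N) + 0.05 * rng.standard_normal((N, N))
Bu = [rng.standard_normal((N, 1)) for _ in range(L)]          # the B_bar u_k terms
elems = [(Ab, bu) for bu in Bu]

states = [b for _, b in prefix_scan(elems)]                   # x_1, ..., x_4

# Check against the sequential recurrence x_k = Ab x_{k-1} + Bu_k  (x_0 = 0)
x, ok = np.zeros((N, 1)), True
for k in range(L):
    x = Ab @ x + Bu[k]
    ok &= np.allclose(x, states[k])
print(ok)  # True
```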

image
Figure 20: How parallel scan works in general. Based on the animation by Scott Linderman.

In terms of performance, the S5 ranks second on the LRA :

image
Figure 21: S5 results on the LRA benchmark

In addition to the LRA, the authors of S5 compare their model on the Speech Commands, the pendulum regression dataset as well as sMNIST, psMNIST and sCIFAR. The full results are available in the paper's appendix, which also contains an ablation study.

To dig deeper
The official implementation is available on GitHub.
Exchanges on Open Review.

SGConv

On October 17, 2022, in What Makes Convolutional Models Great on Long Sequence Modeling?, LI, CAI et al. report that they find S4 too complex, requiring sophisticated parameterization and initialization schemes (i.e. HiPPO), and consequently less intuitive and difficult to use for people with limited prior knowledge. Their aim is therefore to demystify S4 by focusing on its convolutional view. They identify two critical principles from which S4 benefits and which are sufficient to build a high-performance global convolutional model:

  1. The parameterization of the convolutional kernel must be efficient in the sense that the number of parameters must increase sub-linearly with sequence length.
  2. The kernel must have a decreasing structure in which the weights for convolution with the nearest neighbors are greater than those for the farthest neighbors.
image
Figure 22: Respecting the two critical principles set out by the authors means having convolution kernels resembling those shown in the figure.

Based on these two principles, they propose an efficient convolutional model called Structured Global Convolution (SGConv).

Figure 23: SGConv constructs convolution kernels as the concatenation of successively longer sinusoids of lower norm. The advantage of this form is that it enables very fast convolution in the frequency domain.
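
The construction principle can be sketched as follows (a rough illustration of the two principles above, with made-up sub-kernels, decay factor and function names; the actual SGConv uses carefully normalized sub-kernels and FFT-based convolution):

```python
import numpy as np

def sgconv_like_kernel(subkernels, decay=0.5):
    """Build a long convolution kernel from a few short sub-kernels:
    sub-kernel i is upsampled to length base * 2**i and scaled by decay**i,
    then all pieces are concatenated. Parameter count grows sub-linearly with
    the kernel length, and distant weights have smaller norm (the two principles)."""
    pieces = []
    for i, k in enumerate(subkernels):
        target_len = len(subkernels[0]) * 2 ** i
        # simple linear-interpolation upsampling of the short sub-kernel
        up = np.interp(np.linspace(0, len(k) - 1, target_len), np.arange(len(k)), k)
        pieces.append(decay ** i * up)
    return np.concatenate(pieces)

rng = np.random.default_rng(0)
base, scales = 8, 5
subkernels = [rng.standard_normal(base) for _ in range(scales)]  # O(log L) parameters
K = sgconv_like_kernel(subkernels)
print(len(K))  # 8 + 16 + 32 + 64 + 128 = 248
```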

The SGConv authors report better results than the S4 on several tasks (text, audio, image). We won't go into detail here. Let's just look at the LRA:

image
Figure 24: SGConv results on LRA.

Indeed, looking at this table, we can see that SGConv does better than both versions of S4. Nevertheless, it is curious that the authors do not include Mega, Liquid-S4 or S5 in their comparison, all of which achieve better results while using a convolution kernel that is a sum of decaying exponential functions.
What's more, while all the models evaluated on the LRA treat data as 1D sequences, SGConv implicitly incorporates a 2D inductive bias for image tasks, including PathX, which is questionable.

In the end, SGConv seems to perform similarly to the most recent SSM variants, but loses the recurrent view of S4.
Nevertheless, this paper appears to be the first to focus solely on the convolutional view of an SSM.

To dig deeper
The official implementation is available on GitHub.
Exchanges on Open Review.



Other models

Two other "theoretical" papers were published in 2022: Pretraining Without Attention by WANG et al., introducing BiGS, and Hungry Hungry Hippos: Towards Language Modeling with State Space Models by FU, DAO et al., introducing H3.
Given their late publication (December 20 and 28, 2022 respectively) and the fact that they were mainly publicized in 2023 with the V2 of each paper, I will cover these models in the next article in the SSM series.



SSM applications

SaShiMi

In It's Raw! Audio Generation with State-Space Models published on February 20, 2022, GOEL, GU et al. apply S4 to causal audio generation.
In contrast to methods based on conditioning on text, spectrograms, etc., this method operates directly on the raw input signal, enabling comparisons with WaveNet by OORD et al. (2016).
SaShiMi can be trained directly on sequences of over 100K steps (8 s of audio) on a single V100 GPU, in contrast to the context-length limitations faced by models like WaveNet. It makes effective use of this long context to improve density estimation.
The authors have compared their model on various benchmarks including piano music generation and speech (enunciation of numbers).
The generated audios can be viewed here.

image
Figure 25: Sashimi architecture overview

To dig deeper
The official implementation is available on GitHub.



ViS4mer

On April 4, 2022, ISLAM and BERTASIUS introduce the ViS4mer in their Long Movie Clip Classification with State-Space Video Models.
It's a hybrid between an S4 and a Transformer for (long) video classification. More precisely, the model uses a standard Transformer encoder for short-range spatiotemporal feature extraction, and a multi-scale temporal S4 decoder for long-range temporal reasoning. The resulting model appears to be 2.6 times faster and 8 times more memory-efficient than a Transformer.
To my knowledge, this is the first paper to highlight the benefits of hybridizing SSMs and Transformers.

Figure 26: ViS4mer decoder overview

ViS4mer achieves top results in 6 out of 9 long video classification tasks on the benchmark Long Video Understanding (LVU) by WU and KRÄHENBÜHL (2021), which consists of classifying videos with a duration of 1 to 3 min. The model also appears to have good generalization capabilities, achieving competitive results on the Breakfast and COIN procedural activity datasets despite having seen 275 times less data.

To dig deeper
The official implementation is available on GitHub.



CCNN

On June 7, 2022, ROMERO, KNIGGE et al. introduced CCNN in their paper Towards a General Purpose CNN for Long Range Dependencies in ND.
Their starting point is the idea that convolutional neural networks are powerful but need to be adapted specifically to each task:

  • Input length: 32x32, 1024x1024 → How to model long-range dependencies?
  • Input resolution: 8kHz, 16kHz → Long-range dependencies, resolution agnosticity?
  • Input dimensionality: 1D, 2D, 3D → How to define convolutional kernels?
  • Task: classification, segmentation, ... → How to define up- and down-sampling strategies?

Is it then possible to design a single architecture with which tasks can be solved independently of dimensionality, resolution and input length, without modifying the architecture? Yes, thanks to CCNN, which uses continuous convolution kernels.

ROMERO, KNIGGE et al. drew inspiration from the S4 to create a variant of efficient residual blocks, which they call S4 block. However, unlike the S4, which only works with 1D signals, the CCNN easily models ND signals.

Figure 27: Overview of CCNN

To dig deeper
The official implementation is available on GitHub.
Slides from a presentation of the paper are available here



$\mathbf{SSSD^{S4}}$

On August 19, 2022, LOPEZ ALCARAZ and STRODTHOFF proposed, in their paper Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models, a hybrid between an S4 and a diffusion model for predicting missing data in time series. Their model is called $\mathbf{SSSD^{S4}}$ (or simply SSSD).

image
Figure 28: SSSD overview

To dig deeper
The official implementation is available on GitHub.



S4ND

On October 12, 2022, NGUYEN, GOEL, GU et al. present S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces.

This model extends S4 (which is 1D) to multidimensional continuous signals such as images and videos (where ConvNets and ViT learn on discrete pixels). To do this, they transform the standard S4 ODE into a multidimensional PDE:

$$\begin{aligned} x'(t) &= \mathbf{A}x(t) + \mathbf{B}u(t) \\ y(t) &= \mathbf{C}x(t) \end{aligned}$$

becomes:

$$\begin{aligned} \frac{\partial}{\partial t^{(1)}} x(t^{(1)}, t^{(2)}) &= (\mathbf{A}^{(1)} x^{(1)}(t^{(1)}, t^{(2)}), x^{(2)}(t^{(1)}, t^{(2)})) + \mathbf{B}^{(1)} u(t^{(1)}, t^{(2)}) \\ \frac{\partial}{\partial t^{(2)}} x(t^{(1)}, t^{(2)}) &= (x^{(1)}(t^{(1)}, t^{(2)}), \mathbf{A}^{(2)} x^{(2)}(t^{(1)}, t^{(2)})) + \mathbf{B}^{(2)} u(t^{(1)}, t^{(2)}) \\ y(t^{(1)}, t^{(2)}) &= \langle \mathbf{C}, x(t^{(1)}, t^{(2)}) \rangle \end{aligned}$$

with $\mathbf{A}^{(\tau)} \in \mathbb{C}^{N^{(\tau)} \times N^{(\tau)}}$, $\mathbf{B}^{(\tau)} \in \mathbb{C}^{N^{(\tau)} \times 1}$, $\mathbf{C} \in \mathbb{C}^{N^{(1)} \times N^{(2)}}$, and the initial condition of the linear PDE $x(0, 0) = 0$.
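
My rough reading of how this is used in practice: with a low-rank (here rank-1) $\mathbf{C}$, the 2D kernel reduces to an outer product of two 1D SSM kernels, which can then be applied with a 2D FFT. The sketch below builds such a kernel from the diagonal 1D kernels seen earlier; all sizes, initializations and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def diag_ssm_kernel(Lambda, B, C, dt, L):
    """1D diagonal SSM kernel (ZOH discretization), as in the S4D section above."""
    Ab = np.exp(Lambda * dt)
    Bb = (Ab - 1.0) / Lambda * B
    V = Ab[:, None] ** np.arange(L)[None, :]
    return ((Bb * C) @ V).real

rng = np.random.default_rng(0)
N, H, W = 16, 24, 24
Lam1 = -0.5 + 1j * np.pi * np.arange(N)
Lam2 = -0.5 + 1j * np.pi * np.arange(N)
C1, C2 = rng.standard_normal(N) + 0j, rng.standard_normal(N) + 0j
K1 = diag_ssm_kernel(Lam1, np.ones(N), C1, 1.0 / H, H)   # kernel along height
K2 = diag_ssm_kernel(Lam2, np.ones(N), C2, 1.0 / W, W)   # kernel along width
K2d = np.outer(K1, K2)                                   # 2D global kernel

img = rng.standard_normal((H, W))
out = np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(K2d)).real   # circular 2D convolution
print(out.shape)  # (24, 24)
```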

Depending on the dataset tested, the authors obtain results similar to or better than those of ViT or ConvNext.
The main advantage of S4ND is that it can operate at different resolutions via different sampling rates. The authors highlight this feature in two experiments:

  1. In zero-shot, S4ND outperforms Conv2D by more than 40 points when trained on $8\times8$ images and tested on $32\times32$ images.
  2. With progressive resizing, S4ND can speed up training by 22% with a decrease in final accuracy of ∼1% compared to training at high resolution alone.
image
Figure 30: Example of S4ND for 2D images

To dig deeper
The official implementation is available on GitHub.
A completely useless but amusing piece of information: the authors have called their TeX file Darude S4NDstorm.




Conclusion

We have thus reviewed the various SSMs released in 2022, a year in which work focused mainly on improving and simplifying S4 via various approaches (diagonalization, gating, LTC, etc.). The year 2022 also saw the first applications of SSMs.
With SGConv and S5, we can also see the beginnings of a phenomenon which, as we'll see in the following article, will become more pronounced in 2023. Namely, the emergence of works focusing solely on the convolutional view of SSMs (e.g. Hyena and its derivatives) or focusing solely on the recursive view of SSMs (e.g. Mamba).



Citation

@inproceedings{ssm_in_2022_blog_post,
author = {Loïck BOURDOIS},
title = {Évolution des State Space Models (SSM) en 2022},
year = {2023},
url = {https://lbourdois.github.io/blog/ssm/ssm_en_2022}
}


