Methodology Overview#

This package implements Echo State Networks (ESNs), which were introduced by Jaeger [2001]. ESNs are a Recurrent Neural Network architecture that are in a class of techniques referred to as Reservoir Computing. One defining characteristic of these techniques is that all internal weights are determined by a handful of global or “macro-scale” scalar parameters, thereby avoiding problems during backpropagation and reducing training time dramatically.

This page describes our two ESN implementations. First is a “standard” ESN architecture that is useful for small systems that can easily fit into memory on a computer. Second is a distributed ESN architecture that can be used for larger scale systems.

Standard ESN Architecture#

The basic ESN architecture that is implemented by the xesn.ESN class is defined by the following discrete timestepping equations:

\[\mathbf{r}(n + 1) = (1 - \alpha) \mathbf{r}(n) + \alpha \tanh( \mathbf{W}\mathbf{r}(n) + \mathbf{W}_\text{in}\mathbf{u}(n) + \mathbf{b})\]

\[\hat{\mathbf{v}}(n + 1) = \mathbf{W}_\text{out} \mathbf{r}(n+1)\]

Here \(\mathbf{r}(n)\in\mathbb{R}^{N_r}\) is the hidden, or reservoir, state, \(u(n)\in\mathbb{R}^{N_\text{in}}\) is the input system state, and \(\hat{\mathbf{v}}(n)\in\mathbb{R}^{N_\text{out}}\) is the estimated target or output system state, all at timestep \(n\). The sizes of each of these vectors is specified to xesn.ESN via the n_reservoir, n_input, and n_output parameters, respectively. This form of the ESN has a “leaky” reservoir, where the leak_rate parameter \(\alpha\) determines how much of the previous hidden state to propagate forward in time. This ESN implementation is eager, in the sense that all of the inputs, network parameters, targets (during training), and predictions are held in memory.

Also, this form of the ESN assumes a linear readout, i.e., we do not transform the hidden state nonlinearly or augment it with the input or output state before passing it to the readout matrix. This choice is made based on testing by Platt et al. [2022], who showed that doing either of these two operations provided no additional prediction skill over a simple linear readout.

Internal Weights#

The adjacency matrix \(\mathbf{W}\in\mathbb{R}^{N_r \times N_r}\), the input matrix \(\mathbf{W}_\text{in}\in\mathbb{R}^{N_r \times N_u}\), and the bias vector \(\mathbf{b}\in\mathbb{R}^{N_r}\) are initialized with random elements, and usually re-scaled. These three matrices are generated as follows:

\[\mathbf{W} = \dfrac{\rho}{f(\hat{\mathbf{W}})} \hat{\mathbf{W}}\]

\[\mathbf{W}_\text{in} = \dfrac{\sigma}{g(\hat{\mathbf{W}}_\text{in})} \hat{\mathbf{W}}_\text{in}\]

\[\mathbf{b} = \sigma_b\hat{\mathbf{b}}\]

where \(\rho\), \(\sigma\), and \(\sigma_b\) are scaling factors, \(f(\cdot)\) and \(g(\cdot)\) are normalization factors, \(\hat{\mathbf{W}}\) and \(\hat{\mathbf{W}}_\text{in}\) are randomly generated matrices, and \(\hat{\mathbf{b}}\) is a randomly generated vector. Each of these components can be specified by the user via the adjacency_kwargs, input_kwargs, and bias_kwargs options passed to xesn.ESN.

As an example, it is common to use a very sparse adjacency matrix, \(\hat{\mathbf{W}}\), with nonzero elements chosen from a uniform distribution ranging from -1 to 1, and to normalize the matrix by its spectral radius. In many cases, the scaling factor is chosen to be close to 1. These options could be selected by choosing:

adjacency_kwargs={
    "factor": 0.99,
    "distribution": "uniform",
    "normalization": "eig",
    "is_sparse": True,
    "connectedness": 10,
}

As another example, in Smith et al. [2023] the authors used a dense input matrix with elements randomly chosen from a uniform distribution from -1 to 1, and normalized the matrix by its largest singular value. This could be achieved as follows:

input_kwargs={
    "factor": 0.5,
    "distribution": "uniform",
    "normalization": "svd",
}

with the factor=0.5 just for the sake of an example, and note that is_sparse=False is the default if the option is not provided.

The options to the bias vector are even more simple, as there is no option for sparsity and there are no normalization options.

Note

Internally, all of the options shown above are passed to the xesn.RandomMatrix and xesn.SparseRandomMatrix classes, where the is_sparse option selects between the two. Please see these two class descriptions for all available options, and numerous examples for creating different matrices. Also note that the number of rows and columns for each matrix and the length of the bias vector are automatically chosen based on the sizes set within the ESN.

Training#

The weights in the readout matrix \(\mathbf{W}_\text{out}\) are learned during training, xesn.ESN.train(), which aims to minimize the following loss function

\[\mathcal{J}(\mathbf{W}_\text{out}) = \dfrac{1}{2}\sum_{n=1}^{N_{\text{train}}} ||\mathbf{W}_\text{out}\mathbf{r}(n) - \mathbf{v}(n)||_2^2 + \dfrac{\beta}{2}||\mathbf{W}_\text{out}||_F^2\]

Here \(\mathbf{v}(n)\) is the training data at timestep \(n\), \(||\mathbf{A}||_F = \sqrt{Tr(\mathbf{A}\mathbf{A}^T)}\) is the Frobenius norm, \(N_{\text{train}}\) is the number of timesteps used for training, and \(\beta\) is a Tikhonov regularization parameter chosen to improve numerical stability and prevent overfitting, specified via the tikhonov_parameter option to xesn.ESN.

Due to the fact that the weights in the adjacency matrix, input matrix, and bias vector are fixed, the readout matrix weights can be compactly written as the solution to the linear ridge regression problem

\[\mathbf{W}_\text{out} = \mathbf{V}\mathbf{R}^T \left(\mathbf{R}\mathbf{R}^T + \beta\mathbf{I}\right)^{-1}\]

where we obtain the solution from scipy.linalg.solve on CPUs or cupy.linalg.solve on GPUs. Here \(\mathbf{I}\) is the identity matrix and the hidden and target states are expressed in matrix form by concatenating each time step “column-wise”: \(\mathbf{R} = (\mathbf{r}(1) \, \mathbf{r}(2) \, \cdots \, \mathbf{r}(N_{\text{train}}))\) and similarly \(\mathbf{V} = (\mathbf{v}(1) \, \mathbf{v}(2) \, \cdots \, \mathbf{v}(N_{\text{train}}))\).

Distributed ESN Architecture#

It is common to use hidden states that are \(\mathcal{O}(10)\) to \(\mathcal{O}(100)\) times larger than the target system dimension. In applications that have high dimensional system states, it becomes necessary to employ a parallelization strategy to distribute the target and hidden states across many semi-independent networks. xesn.LazyESN accomplishes this with a generalization of the algorithm introduced by Pathak et al. [2018], where we use dask to parallelize the computations.

The xesn.LazyESN architecture inherits most of its functionality from xesn.ESN. The key difference between the two is how they interact with the underlying data they’re working with. While the standard ESN had a single network that is eagerly operated on in memory, xesn.LazyESN distributes multiple networks to different subdomains of a single dataset and invokes lazy operations via the dask.Array API. This process is described with an example below.

Example: SQG Turbulence Dataset#

We describe the parallelization strategy based on the dataset used by Smith et al. [2023], which was generated by this model, written by Jeff Whitaker, following [Blumen, 1978, Held et al., 1995] for Surface Quasi-Geostrophic turbulence. For the purposes of this discussion, all that matters is the size of the dataset, which is illustrated below, and more details can be found in Section 2 of Smith et al. [2023].

The dataset has 3 spatial dimensions \((x, y, z)\), and evolves in time, so that the shape is \((N_x = 64, N_y = 64, N_z = 2, N_{time})\). Parallelization is achieved in this case by subdividing the domain into smaller chunks along the \(x\) and \(y\) dimensions, akin to domain decomposition techniques in General Circulation Models. The subdivisions are defined by specifying a chunk size to the model via esn_chunks. In the case of our example, the chunk size is

esn_chunks={"x": 16, "y": 16, "z": 2}

and these chunks are denoted by the black lines across the domain in the figure above. Under the hood, xesn.LazyESN assigns a local network to each chunk, where each chunk becomes a separate dask task. Note that unlike xesn.ESN, xesn.LazyESN does not have n_input and n_output parameters, but these are instead inferred from the multi-dimensional chunksize, given by esn_chunks.

Communication between chunks is enabled by defining an overlap region, harnessing dask’s flexible overlap function (see this explanation in the dask documentation for additional description of this function). The overlap is defined by specifying the size of the overlap in each direction. For example

overlap={"x": 1, "y": 1, "z": 0}

defines a single grid cell overlap in \(x\) and \(y\), but no overlap in the vertical. Note that this argument is very similar to what can be provided to dask, except that the dimensions are labelled here, rather than numeral indices, due to our reliance on xarray. As an example, the overlap region is indicated by the white box in the figure above, where this overlap extends to both vertical levels for the chunk.

Note

Because of how xesn.LazyESN relies on dask chunks to define the bounds of each distributed region, the time dimension is not allowed to be chunked, nor can it have an overlap. That is, the size passed to esn_chunks must be the size of the time dimension, or {"time":-1} (shorthand). The only option allowed for overlap is {"time":0}. These are the defaults if nothing is provided for time as they are the only acceptable options.

We have to tell the xesn.LazyESN how to handle overlaps on the boundaries. See here for available options, since this is passed directly to dask’s overlap function. In the case above, the domain is periodic in \(x\) and \(y\), so we can simply write

boundary="periodic"
# or equally
boundary={"x": "periodic", "y": "periodic"}

Note once again that if a dict is used like the second case, then as with overlap, the difference between what is expected here and with dask is that the dimensions should be labelled, rather than provided as numeral indices.

One final option to xesn.LazyESN is persist. When dask arrays are told to .persist() it means that they are brought into memory, using the memory of the resources available (this means all data are brought into memory if on a local machine). In xesn.LazyESN, the persist option is a boolean, where if True, then .persist() is called in the following places:

in xesn.LazyESN.train and xesn.LazyESN.predict on the input data, after calling dask’s overlap function
in xesn.LazyESN.train on the resulting readout matrix, xesn.LazyESN.Wout, after all computations
in xesn.LazyESN.predict on the resulting prediction, after all computations

See this StackOverflow post for some discussion about persisting data, and see this page in the dask documentation for more information.

More Generally#

Here we make some notes for extending the description beyond this example. The dimensions that are chosen to be chunked (here \(x\) and \(y\)) should be first in the dimension order, and time needs to be last. Additionally, the time dimension needs to be labelled “time”, whereas the names of all other dimensions do not matter. Finally, currently only two dimensions are regularly tested, but the capability to add more could be added in the future.

Mathematical Definition#

The parallelization is achieved by subdividing the domain into \(N_g\) chunks, and assigning individual ESNs to each chunk. That is, we generate the sets \(\{\mathbf{u}_k \subset \mathbf{u} | k = \{1, 2, ..., N_g\}\}\), and where each local input vector \(\mathbf{u}_k\) includes the overlap region discussed above. The distributed ESN equations are

\[\mathbf{r}_k(n + 1) = (1 - \alpha) \mathbf{r}_k(n) + \alpha \tanh( \mathbf{W}\mathbf{r}_k(n) + \mathbf{W}_\text{in}\mathbf{u}_k(n) + \mathbf{b})\]

\[\hat{\mathbf{v}}_k(n + 1) = \mathbf{W}_\text{out}^k \mathbf{r}_k(n+1)\]

Here \(\mathbf{r}_k, \, \mathbf{u}_k \, \mathbf{W}_\text{out}^k, \, \hat{\mathbf{v}}_k\) are the hidden state, input state, readout matrix, and estimated output state associated with the \(k^{th}\) data chunk. The local output state \(\hat{\mathbf{v}}_k\) does not include the overlap region. Note that the various macro-scale paramaters \(\{\alpha, \rho, \sigma, \sigma_b, \beta\}\) are fixed for all chunks. Therefore the only components that drive unique hidden states on each chunk are the different input states \(\mathbf{u}_k\) and the readout matrices \(\mathbf{W}_\text{out}^k\). Additionally, because the solution to the ridge regression problem is linear, the readout matrices are simply:

\[\mathbf{W}^k_\text{out} = \mathbf{V}_k\mathbf{R}_k^T \left(\mathbf{R}_k\mathbf{R}_k^T + \beta\mathbf{I}\right)^{-1}\]

where each readout matrix can be computed independently of one another.

Macro-Scale Parameters#

From our experience, the most important macro-scale parameters that must be specified by the user are the

input matrix scaling, \(\sigma\), input_kwargs["factor"]
adjacency matrix scaling, \(\rho\), adjacency_kwargs["factor"]
bias vector scaling, \(\sigma_b\), bias_kwargs["factor"]
Tikhonov parameter, \(\beta\), tikhonov_parameter
leak rate, \(\alpha\), leak_rate

See this example for a demonstration of using the surrogate modeling toolbox to perform Bayesian optimization and find well performing parameter values.