Deep convolutional architectures for extrapolative forecasts in time-dependent flow problems

Physical systems whose dynamics are governed by partial differential equations (PDEs) find numerous applications in science and engineering. Obtaining solutions of such PDEs may be computationally expensive for large-scale and parameterized problems. In this work, deep learning techniques developed for time-series forecasting, such as LSTM and TCN, or for spatial-feature extraction, such as CNN, are employed to model the system dynamics of advection-dominated problems. This paper proposes a Convolutional Autoencoder (CAE) model for compression and a CNN future-step predictor for forecasting. These models take as input a sequence of high-fidelity vector solutions at consecutive time steps obtained from the PDEs and forecast the solutions for subsequent time steps by auto-regression, thereby reducing the computation time and power needed to obtain such high-fidelity solutions. Non-intrusive reduced-order modeling techniques such as deep autoencoder networks are used to compress the high-fidelity snapshots before feeding them to the forecasting models, reducing the complexity and the required computations in both the online and offline stages. The models are tested on numerical benchmarks (1D Burgers' equation and Stoker's dam-break problem) to assess the long-term prediction accuracy, even outside the training domain (i.e., extrapolation). The most accurate model is then used to simulate a hypothetical dam break in a river with complex 2D bathymetry. The proposed CNN future-step predictor yields considerably more accurate forecasts than LSTM and TCN on the considered spatiotemporal problems.


Introduction
Efficient numerical simulations of complex dynamical systems are needed in order to seek solutions at different time or parameter instances, especially in fluid dynamics. These systems are typically described by a set of parameterized nonlinear partial differential equations (PDEs). Obtaining numerical solutions using a high-fidelity (finite element, finite volume, or finite difference type) computational solver may be extremely expensive, as it must create high-dimensional renderings of the solution to precisely resolve the spatio-temporal manifolds and inherent non-linearities. This approach thus becomes inefficient for applications such as optimization and uncertainty quantification, where numerous simulations are required to arrive at the desired solution. Reduced-order models (ROMs) are suitable substitutes for computationally expensive numerical solvers, as these methods generate a low-rank structure from the high-dimensional snapshots, which is then utilized to model the spatio-temporal dynamics of the PDE system. Among the various ROM techniques that have been developed, projection-based ROMs are employed most extensively. The method involves the generation of a reduced set of basis functions or modes such that their linear superposition effectively yields a low-rank approximation of the solutions. Proper Orthogonal Decomposition (POD) is the most popular method of the reduced-basis class. POD utilizes the singular value decomposition (SVD) to generate an empirical basis of dominant, orthonormal modes, yielding an optimal linear subspace onto which to project the system-governing PDEs [1,2]. Availability of the governing equations is necessary to employ intrusive ROM techniques such as the Galerkin projection [3] or the Petrov-Galerkin projection [4], which produce an interpretable ROM defined by high-energy or dominant modes. However, scenarios where the governing equations are unavailable require the application of data-driven methods, such as
non-intrusive ROMs (NIROMs) [5,6]. In a NIROM, the expansion coefficients for the reduced solution are obtained via interpolation on the reduced basis space spanned by the set of dominant modes. However, since the reduced dynamics generally belong to nonlinear matrix manifolds, a variety of interpolation and regression methods capable of enforcing the constraints characterizing those manifolds have been proposed. Among the methods most often employed are dynamic mode decomposition [7,8,9], radial basis function interpolation [10,11], and Gaussian process regression [12,13]. Recent advancements in machine learning (ML) [14] have given rise to approaches that effectively evaluate and expedite existing numerical models or solvers by splitting the computation into offline and online stages. In the offline stage, the ML model updates its weights or coefficients (training) to learn the system dynamics from the high-fidelity solutions obtained by the numerical solver, which requires computational power and time. In the online stage, the model uses the pre-computed, optimized weights to obtain the solution (prediction) for a new set of input instances, and does so almost instantly with minimal computational cost. Various data-driven ML-based frameworks have been proposed to model the propagation of system dynamics in latent space. Some of the most successful examples involve the use of deep neural networks (DNNs) [15], long short-term memory (LSTM) networks [16,17,18,19], neural ordinary differential equations (NODEs) [20,21,19], and temporal convolutional networks (TCNs) [22,23].
Significant work has been carried out recently on predicting solution instances outside the training domain for a variety of fluid problems with discontinuities, wave propagation, and advection-dominated flows. Liu et al. [24] presented a predictive data-assimilation framework based on the Ensemble Kalman Filter (EnKF) and a DDROM model, which uses an autoencoder network to compress the high-dimensional dynamics to a lower-dimensional space and then an LSTM to model the fluid dynamics in latent space; the model's capabilities were assessed on the 2D Burgers' equation and a flow-past-a-cylinder test case. Maulik et al. [25] proposed a convolutional autoencoder (CAE) for compression and a recurrent LSTM network for time evolution on the reduced space; the CAE-LSTM model reconstructed the sharp profile of the advecting Burgers' solution more accurately than the POD-Galerkin technique. Dutta et al. [18] utilized an advection-aware (AA) autoencoder network that learns nonlinear embeddings of the high-fidelity system snapshots using an arbitrary snapshot from the dataset, and then modeled the latent-space dynamics with an LSTM network to make predictions for the linear advection and Burgers' problems. Cheng et al. [26] used a POD-ANN model, in which they performed a priori dimension reduction on the high-fidelity dataset and parameterization with an artificial neural network (ANN) to solve the strongly non-linear Allen-Cahn equations and the cylinder-flow problem. Heaney et al. [27] proposed an AI-DDNIROM framework capable of making predictions for spatial domains significantly larger than the training domain, using a domain-decomposition approach, an autoencoder network for low-rank representation, and an adversarial network for prediction, applied to flow past a cylinder and slug-flow problems. Fatone et al.
[19] introduced a µt-POD-LSTM ROM framework capable of extrapolating over time windows around 15% beyond the training domain for the unsteady advection-diffusion and unsteady Navier-Stokes equations at new parameter instances. Xu et al. [23] proposed a multi-level framework comprising a convolutional autoencoder (CAE), a temporal CAE (TCAE), and a multilayer perceptron (MLP) for parameterization, together with a TCN network for auto-regressive future-state predictions, and evaluated the results on problems such as Sod's shock tube and transient ship airwakes. Wu et al. [22] developed a POD- and TCN-based neural network for making predictions on the viscous periodic flow past a cylinder. Abdedou et al. [28] proposed two CAE architectures to compress, in space and time, the high-dimensional snapshot matrices obtained from numerical solvers for the Burgers', Stoker's, and dam-break problems, and performed parameterization on the compressed latent space. Jacquier et al. [29] employed uncertainty quantification methods (Deep Ensembles and Variational-Inference-based Bayesian Neural Networks) on the POD-ANN order-reduction method to perform predictions within and outside of the training domain on problems such as the shallow-water equations for flood prediction, and generated probabilistic flooding maps aware of model uncertainty. Geneva et al. [30] presented a physics-constrained Bayesian auto-regressive CAE network that models non-linear dynamical systems (the Kuramoto-Sivashinsky equation, 1D Burgers', and 2D Burgers') without training data, using only the initial conditions; this reduces the computational cost tremendously and provides uncertainty quantification at each time-step.
The caveat that remains is long-term temporal extrapolation for fluid problems marked by sharp gradients and discontinuities. Our study explores forecasting architectures (LSTM, TCN, and CNN) to obtain accurate solutions for time-steps distant from the training domain on advection-dominated test cases. The high-dimensional input snapshot matrix is first compressed in space to obtain the reduced latent vectors before they are passed as a sequence to the forecasting models (Figure 1: training and validation method). Two types of architectures are first evaluated for spatial compression, an MLP autoencoder and a CAE, to identify the one that is more accurate in terms of reconstruction and preservation of the input information. A simple convolutional architecture is then proposed and shown to provide accurate forecasts. To evaluate the epistemic uncertainties in the solutions, the methodology of deep ensembles is adopted.
The subsequent sections of the paper are organized as follows. Section 2 describes the dataset structure along with the training and testing strategies, followed by a presentation of the autoencoders for spatial compression and the forecasting architectures. In Section 3, the models are tested on two numerical cases, the one-dimensional Burgers' and Stoker's equations, which are representative of advection-dominated flows. Finally, Section 4 presents a summary of the results obtained by the models and some concluding remarks.

Dataset
The dataset is comprised of T solution vectors (snapshots) v^i ∈ ℝ^{n_s} with n_s nodes, at time-steps i ∈ {1, 2, …, T}, obtained using a high-fidelity PDE solver. For the autoencoder models, the output is the reconstruction of the input; therefore the training and validation inputs and outputs are the snapshot vectors v^i. For the forecasting models (Figure 1), N samples are used for training; in each sample, the input is a sequence of n_t snapshots (lookback window = n_t), V = [v^{i−n_t+1}, …, v^{i−1}, v^i] ∈ ℝ^{n_s×n_t}, and the corresponding output is the vector at the time-step immediately after the sequence ends, v^{i+1} ∈ ℝ^{n_s}. For extrapolative testing (Figure 2), a sequence of n_t vectors from the start of the dataset, V = [v^1, …, v^{n_t−1}, v^{n_t}] ∈ ℝ^{n_s×n_t}, is fed to the model to produce the vectors at all subsequent time-steps, [v^{n_t+1}, v^{n_t+2}, …, v^T] ∈ ℝ^{n_s×(T−n_t)}, in an auto-regressive manner; i.e., first only a single subsequent snapshot v^{n_t+1} is predicted, which is then concatenated with the previous n_t − 1 vectors and passed to the forecasting model to produce the vector v^{n_t+2}. This process is repeated until the desired number of subsequent solution vectors is obtained.
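The auto-regressive testing loop described above can be sketched as follows; `model` is a placeholder for any trained future-step predictor that maps a lookback window to the next snapshot (a minimal illustration, not the authors' implementation):

```python
import numpy as np

def rollout(model, v_init, n_future):
    """Auto-regressive forecasting: predict one step, append it to the
    lookback window, drop the oldest snapshot, and repeat.

    model   : callable mapping an (n_s, n_t) window to the next (n_s,) snapshot
    v_init  : (n_s, n_t) array holding the first n_t high-fidelity snapshots
    n_future: number of time-steps to forecast
    """
    window = v_init.copy()
    preds = []
    for _ in range(n_future):
        v_next = model(window)                      # predict v^{i+1}
        preds.append(v_next)
        # slide the lookback window: drop v^{i-n_t+1}, append v^{i+1}
        window = np.concatenate([window[:, 1:], v_next[:, None]], axis=1)
    return np.stack(preds, axis=1)                  # (n_s, n_future)
```

Any forecasting error in `v_next` is fed back into the window, which is why errors accumulate in extrapolation.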

Non-intrusive reduced-order modelling
Non-intrusive ROMs (NIROMs) bypass the governing equations and utilize the full-order model solutions to develop a data-driven model, which compresses the full-order data (snapshots) into a reduced-order (latent) space. The most widely adopted way to perform this uses deep neural network architectures called autoencoders [31]. An autoencoder learns an approximation of the identity mapping χ: v^i → v^i_ae such that v^i ≈ v^i_ae, with χ: ℝ^{n_s} → ℝ^{n_s}, where n_s is the number of nodes in the solution vector v^i. This is accomplished using a two-part architecture. The first part of the autoencoder network is the encoder χ_e, which maps a high-dimensional input vector v^i to a low-dimensional latent vector z^i: z^i = χ_e(v^i; θ_e), with z^i ∈ ℝ^m (m ≪ n_s). The second part, called the decoder χ_d, maps the latent vector z^i to an approximation v^i_ae of the high-dimensional input vector v^i: v^i_ae = χ_d(z^i; θ_d). The combination of these two parts yields an autoencoder network (Figure 3) of the form χ: v^i → χ_d ∘ χ_e(v^i). The autoencoder model is trained by computing the optimal values of the parameters (θ_e, θ_d) that minimize the reconstruction error over all the training data [18]: (θ*_e, θ*_d) = argmin_{θ_e, θ_d} Σ_i L(v^i, v^i_ae), where L(v^i, v^i_ae) is a chosen measure of the discrepancy between v^i and its approximation v^i_ae. The restriction dim(z^i) = m ≪ n_s = dim(v^i) forces the autoencoder to learn the salient features of the input data via compression into a low-dimensional space and to then reconstruct the input, instead of directly learning the identity function. Autoencoder architectures are generally comprised of MLPs (called AEs) [18], convolutional neural network autoencoders (called CAEs) [25,23,28], or a combination of both. While small problems can be effectively modelled with an MLP architecture, problems involving data of high spatial complexity require CAEs for effective and accelerated spatial compression. The architecture of an MLP autoencoder, with two fully connected dense layers (hidden layers) in the encoder network and a mirrored decoder network, is shown in Figure 4. The convolutional autoencoder consists of two convolution layers, each followed by batch normalization, a swish activation, and an average pooling layer, as shown in Figure 5.
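As an illustration of the encoder half of such a CAE, the following NumPy sketch chains two conv → swish → average-pool blocks. It is a forward pass only: batch normalization is omitted for brevity, and the weights and layer sizes are arbitrary assumptions, not the paper's trained parameters.

```python
import numpy as np

def conv1d(x, w, b):
    """'Same' 1D convolution: x (C_in, n), w (C_out, C_in, k) with k odd, b (C_out,)."""
    c_out, c_in, k = w.shape
    p = (k - 1) // 2
    xp = np.pad(x, ((0, 0), (p, p)))        # zero padding keeps the length n
    n = x.shape[1]
    out = np.zeros((c_out, n))
    for o in range(c_out):
        for i in range(c_in):
            for j in range(k):
                out[o] += w[o, i, j] * xp[i, j:j + n]
        out[o] += b[o]
    return out

def swish(x):
    return x / (1.0 + np.exp(-x))           # x * sigmoid(x)

def avg_pool(x, s=2):
    n = x.shape[1] // s * s
    return x[:, :n].reshape(x.shape[0], -1, s).mean(axis=2)

def encode(v, params):
    """Two conv -> swish -> average-pool blocks; each halving of the spatial
    dimension compresses the snapshot toward the latent vector z^i."""
    z = v[None, :]                          # single input channel
    for w, b in params:
        z = avg_pool(swish(conv1d(z, w, b)))
    return z.ravel()                        # flattened latent vector
```

Because no dense layer is involved, each latent entry still corresponds to a local patch of the input, which is the property the paper later exploits for CNN forecasting.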

Forecasting Techniques
The dataset (Section 2.1), after compression by the encoder χ_e, produces N samples of the form Z = [z^{i−n_t+1}, …, z^{i−1}, z^i] ∈ ℝ^{m×n_t} with target z^{i+1} ∈ ℝ^m, which are used to train the following forecasting models.

Long Short-Term Memory (LSTM)
LSTM [32] is a special type of recurrent neural network (RNN) that is well-suited to regression tasks on time-series data. The main difference between the traditional RNN and the LSTM architecture is the capability of an LSTM memory cell to retain information over time, together with an internal gating mechanism that regulates the flow of information into and out of the memory cell [33]. The LSTM cell consists of three parts, known as gates, with specific functions. The first part, the forget gate, chooses whether the information from the previous step in the sequence is to be remembered or can be forgotten. The second part, the input gate, tries to learn new information from the current input to the cell. The third and final part, the output gate, passes the updated information from the current step to the next step in the sequence. The basic LSTM equations for an input vector v^i are:

forget gate: f^i = α_s(F_f(v^i, h^{i−1}))
input gate: g^i = α_s(F_g(v^i, h^{i−1})), c̃^i = α_t(F_c(v^i, h^{i−1}))
cell state: c^i = f^i ⊙ c^{i−1} + g^i ⊙ c̃^i
output gate: o^i = α_s(F_o(v^i, h^{i−1})), h^i = o^i ⊙ α_t(c^i)

Here, F refers to a linear transformation defined by matrix multiplications and a bias addition, F(v^i, h^{i−1}) = W v^i + U h^{i−1} + b, where W ∈ ℝ^{h×n_s} and U ∈ ℝ^{h×h} are matrices of layer weights (h is the number of neurons in the LSTM cell), b ∈ ℝ^h is a vector of bias values, and v^i ∈ ℝ^{n_s} is the input vector to the LSTM cell. Also, α_s and α_t denote the sigmoid and hyperbolic tangent activation functions, respectively, which are standard choices in an LSTM network, and x ⊙ y denotes the Hadamard product of two vectors x and y. The sequence of snapshot vectors over n_t time-steps, V = [v^{i−n_t+1}, …, v^{i−1}, v^i] ∈ ℝ^{n_s×n_t}, trains the LSTM network, with recurrence over time (Figure 6), to predict the subsequent vector v^{i+1}. The core concept of an LSTM network is the cell state c^i, which behaves as the "memory" of the network. It can either allow greater preservation of past information, reducing the issues of short-term memory, or suppress the influence of the past, depending on the actions of the various gates during training.
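The gate equations can be exercised directly; this NumPy sketch implements a single step of a standard LSTM cell (parameter shapes are illustrative assumptions, with four separate linear maps for the forget, input, candidate, and output paths):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(v, h_prev, c_prev, W, U, b):
    """One LSTM step for input v (n_s,), hidden h_prev (h,), cell c_prev (h,).
    W, U, b each hold the parameters of the four linear maps F of the form
    F(v, h) = W v + U h + b, for the forget, input, candidate, and output gates."""
    Wf, Wg, Wc, Wo = W
    Uf, Ug, Uc, Uo = U
    bf, bg, bc, bo = b
    f = sigmoid(Wf @ v + Uf @ h_prev + bf)        # forget gate
    g = sigmoid(Wg @ v + Ug @ h_prev + bg)        # input gate
    c_tilde = np.tanh(Wc @ v + Uc @ h_prev + bc)  # candidate cell state
    c = f * c_prev + g * c_tilde                  # new cell state (Hadamard products)
    o = sigmoid(Wo @ v + Uo @ h_prev + bo)        # output gate
    h = o * np.tanh(c)                            # new hidden state
    return h, c
```

Running the cell over the n_t snapshots of a lookback window, carrying (h, c) forward, yields the recurrence sketched in Figure 6.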

Temporal Convolution Network (TCN)
The TCN is based on two principles [34]: the network produces an output of the same length as the input, and there can be no leakage from the future into the past. To ensure the first principle is respected, the TCN uses a 1D fully-convolutional network (FCN) in which each hidden layer has the same length as the input layer, and zero padding of length (k − 1) is added to keep subsequent layers the same length as previous ones. To respect the second principle, the TCN uses causal convolutions (achieved by padding only the starting side of the input sequences), where the output at time i is convolved only with elements from time i and earlier in the previous layer (Figure 7). A TCN also makes use of dilated convolutions, which enable an exponentially large receptive field. For an input sequence V ∈ ℝ^{n_s×n_t} and a kernel K ∈ ℝ^k with learnable weights (k is the kernel size), the element O(s), with s ∈ {0, 1, …, n_t − k + 1}, produced by the dilated causal 1D convolution is:

O(s) = Σ_{j=0}^{k−1} K(j) · V(s − d·j),

where d is the dilation factor and k is the kernel size. When using dilated convolutions, d is increased exponentially with the depth of the network (e.g., d = 2^l at level l of the network), ensuring that some filter hits each input within a large effective history.
Within a residual block (Figure 8), the TCN has two layers of dilated causal convolution, each followed by weight normalization and a non-linearity, a leaky rectified linear unit (leaky ReLU). To account for different input and output widths in the addition operation, a 1D convolution (kernel size = 1, channels = n_s) is used to ensure that the element-wise addition operator (⊕) receives tensors of the same shape.
When convolving along the temporal axis, this (standard) TCN model uses information from all prior time-steps (due to the large receptive field) to evaluate the next time-step, as sketched in Figure 7. The model takes in a sequence of n_t vectors corresponding to a look-back window of size n_t, V ∈ ℝ^{n_s×n_t}; the filters convolve along the temporal axis for all n_s vector nodes, since the nodes are passed in as channels. However, the results produced by this model (Section 3) do not propagate beyond the training domain. Therefore, another model is proposed here, in which the dilated convolutions of the TCN convolve along the spatial axis and thus use information from the neighbouring nodes to determine the future value of each node. This model takes in the sequence of n_t vectors in a transposed manner, V^T ∈ ℝ^{n_t×n_s}, such that the n_t solution vectors lie on separate channels. This model produces significantly better results than the TCN on the temporal axis, but the causal padding and dilations employed have no significance when the convolution filter operates along the spatial axis. Another architecture for modeling the system dynamics, with 1D convolutions and without dilations or causal padding, is therefore proposed in the following section.
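The dilated causal convolution at the heart of the TCN can be sketched in NumPy as follows (left-only zero padding enforces causality; the kernel and dilation values in the example are arbitrary):

```python
import numpy as np

def causal_dilated_conv(v, kernel, d):
    """Dilated causal 1D convolution along the temporal axis.
    v      : (n_t,) input sequence
    kernel : (k,) filter weights
    d      : dilation factor
    Left-pads with (k-1)*d zeros so the output has the same length as the
    input and element s depends only on times s, s-d, ..., s-(k-1)d.
    """
    k = len(kernel)
    vp = np.concatenate([np.zeros((k - 1) * d), v])
    out = np.zeros_like(v, dtype=float)
    for s in range(len(v)):
        for j in range(k):
            # O(s) = sum_j K(j) * V(s - d*j); index shifted into the padded array
            out[s] += kernel[j] * vp[(k - 1) * d + s - d * j]
    return out
```

Stacking such layers with d = 1, 2, 4, … gives the exponentially growing receptive field described above.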

A Convolution Neural Network (CNN) for time forecasting
A convolutional layer convolves filters with trainable weights over the input vector v^i [31]. Such filters are commonly referred to as convolutional kernels. In a convolutional neural network, the inputs and outputs can have multiple channels. For a convolutional layer with n_in input channels and n_out output channels, the total number of convolutional kernels is n_k = n_in × n_out. Each kernel slides along the spatial direction, and the products of the kernel weights and the corresponding vector entries are summed to produce the output feature map. Zero padding of size (k − 1)/2 is added to both sides so that the output feature map maintains the spatial dimension n_s.
The CNN forecasting model takes in a sequence of vectors over n_t time-steps in a transposed manner as its input, V^T = [v^{i−n_t+1}, …, v^{i−1}, v^i]^T ∈ ℝ^{n_t×n_s}, so that the filters convolve along the spatial dimension of size n_s and the n_t vectors lie on separate channels, as shown in Figure 9. The CNN architecture (Figure 10) consists of X residual blocks (X is a hyperparameter), in which the input to each block, after transformation by a 1D convolution layer (kernel size = 1, channels = 1) to make the channel counts match, is added to the output of the block. A residual block consists of two convolution layers, each followed by a weight normalization and a leaky-ReLU activation layer.

Uncertainty Quantification using Deep Ensembles
Deep neural networks, when applied in their traditional form, only predict the mean values of the output and do not provide any information regarding the uncertainty in the prediction. Deep Ensembles address this issue by using an ensemble of variance-informed deep neural networks, each possessing a dual output [29,35,36].
In the context of the forecasting models employed with deep ensembles in this study, this means that the output size is twice the dimension m of the predicted latent vectors z^{i+1}, because the output contains both a mean value µ_z and a raw variance ρ_z. The raw variance is constrained to be positive via a softplus activation, which produces the output variance σ_z².
The predicted variance accounts for the noise or spread in the data utilized by the forecasting models, thereby becoming a measure of the epistemic uncertainty [37].
An optimizer (the Adam algorithm [38]) is used to update the model parameters (weights and biases), θ(w, b). To account for the epistemic uncertainty, M sets of θ(w, b) are randomly initialized, thereby creating M independent forecasting networks and introducing variability in the training step [36,39]. Epistemic uncertainty is associated with the model variance and its data-fitting capabilities, and it diminishes as the training data increase. This uncertainty dominates in our case, compared to aleatoric uncertainty, since the dataset is obtained from analytical solutions. The variance therefore behaves as an indicator of the likelihood that the model is making out-of-distribution predictions.
The predicted latent moments (µ_zθ, σ_zθ²) from each model create a probability mixture that can be approximated as a single Gaussian distribution, with the ensemble mean and variance obtained as:

µ_z* = (1/M) Σ_θ µ_zθ and σ_z*² = (1/M) Σ_θ (σ_zθ² + µ_zθ²) − µ_z*².
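These mixture moments are straightforward to compute; a minimal sketch, taking the per-model means and variances as arrays:

```python
import numpy as np

def ensemble_moments(mus, vars_):
    """Combine M Gaussian predictions into a single Gaussian (mixture moments).
    mus, vars_: (M, m) arrays of per-model means and variances.
    Returns the ensemble mean and variance, each of shape (m,)."""
    mu_star = mus.mean(axis=0)
    # E[z^2] - E[z]^2 over the mixture: average of (variance + mean^2) minus mu*^2
    var_star = (vars_ + mus**2).mean(axis=0) - mu_star**2
    return mu_star, var_star
```

Note that the ensemble variance includes both the average per-model variance and the spread of the per-model means, so disagreement between networks inflates the reported uncertainty.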

Predictions of the expanded solution vectors
During the online (testing) stage, the forecasting model predicts the latent vector z^i, along with a confidence interval, at all time-steps after the input sequence. To obtain the full physical solution, the decoder χ_d is used to reconstruct (expand) the obtained latent vector. Although the decoding operation can be applied directly to the predicted mean µ_z*, this does not hold for the predicted variance σ_z*², which does not transform linearly to the expanded space under the decoding operation. Since the latent vector is represented as a normal distribution ẑ^i ∼ N(µ_z*, σ_z*²), a sufficiently large sample from ẑ^i could be drawn and each draw reconstructed individually to produce full solution vectors v̂^i = χ_d(ẑ^i).
The unscented transform, as proposed by Julier et al. [40], is used instead to calculate the predicted mean µ_v* and variance σ_v*² of the full-order solution in the expanded space (v^i ∈ ℝ^{n_s}) from the predicted mean and variance of z^i, through the nonlinear transformation χ_d. A set of 2m + 1 points (known as sigma points), with sample mean µ_z* and sample variance σ_z*², is chosen, and the nonlinear function is applied to each point to yield a set of transformed points with sample mean µ_v* and sample variance σ_v*². The primary difference between this approach and the Monte-Carlo method is that the samples are not drawn at random but according to a specific, deterministic algorithm. The points are obtained using:

Z_0 = µ_z*, with W_0 = k/(m + k),
Z_i = µ_z* + (√((m + k)P_zz))_i, with W_i = 1/(2(m + k)), i = 1, …, m,
Z_{i+m} = µ_z* − (√((m + k)P_zz))_i, with W_{i+m} = 1/(2(m + k)), i = 1, …, m,

where k ∈ ℝ is a constant, P_zz ∈ ℝ^{m×m} is the covariance matrix formed by placing the elements of σ_z*² on the diagonal, (√((m + k)P_zz))_i is the i-th row or column of the matrix square root of (m + k)P_zz, and W_i is the weight associated with the i-th point. The transformation is then achieved via the following procedure: 1. Transform each sigma point (i ∈ {0, 1, …, 2m}) by the non-linear decoder function χ_d to obtain V_i = χ_d(Z_i). 2. Predict the mean of the full-order solution as the weighted average of the transformed points: µ_v* = Σ_i W_i V_i. 3. Predict the variance of the full-order solution from the diagonal elements of the covariance matrix P_vv, formed from the weighted outer products of the transformed points: P_vv = Σ_i W_i (V_i − µ_v*)(V_i − µ_v*)^T.
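The sigma-point procedure can be sketched as follows. This assumes a diagonal P_zz, uses a Cholesky factor as the matrix square root, and fixes k = 0 for simplicity; `decoder` stands in for the trained χ_d:

```python
import numpy as np

def unscented_transform(mu_z, var_z, decoder, k=0.0):
    """Propagate a diagonal-Gaussian latent prediction through a nonlinear
    decoder using 2m+1 deterministically chosen sigma points."""
    m = len(mu_z)
    Pzz = np.diag(var_z)
    S = np.linalg.cholesky((m + k) * Pzz)          # matrix square root
    # sigma points: the mean, then the mean +/- each column of the square root
    pts = [mu_z] + [mu_z + S[:, i] for i in range(m)] \
                 + [mu_z - S[:, i] for i in range(m)]
    W = np.full(2 * m + 1, 1.0 / (2 * (m + k)))
    W[0] = k / (m + k)
    V = np.array([decoder(p) for p in pts])        # transformed points
    mu_v = W @ V                                   # weighted mean
    diff = V - mu_v
    var_v = np.einsum('i,ij,ij->j', W, diff, diff) # diagonal of P_vv
    return mu_v, var_v
```

For a linear decoder the transform is exact, which provides a convenient sanity check before using it with a nonlinear one.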

Metrics
To evaluate the performance of the previous architectures, the following metrics are used. Mean Squared Error (L₂ norm): the average of the squared difference between the actual values v^i and the predicted values v̂^i over N samples. Mean Absolute Error (L₁ norm): the average of the absolute difference between the two vectors v^i and v̂^i over N samples. Relative L₂ norm error (referred to as the error): ‖v − v̂‖₂ / ‖v‖₂.
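Under one plausible reading of these definitions (element-wise averages over the stacked snapshots), the three metrics can be computed as:

```python
import numpy as np

def mse(v, v_hat):
    """Mean squared error: average squared difference over all entries."""
    return np.mean((v - v_hat) ** 2)

def mae(v, v_hat):
    """Mean absolute error: average absolute difference over all entries."""
    return np.mean(np.abs(v - v_hat))

def relative_l2(v, v_hat):
    """Relative L2-norm error between the stacked true and predicted snapshots."""
    return np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```

The relative L₂ error is the quantity reported in the extrapolation plots, since it is scale-free and comparable across test cases.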

Results and Discussion
The capability of the autoencoders (MLP-AE and CAE) to efficiently transform high-dimensional vectors to a low-dimensional space, and that of the forecasting models (LSTM, TCN, and CNN) to accurately model the system dynamics, were tested on advection-dominated flow problems (the 1D Burgers' and Stoker's problems).

1D Burgers' problem
The test case involves the one-dimensional Burgers' equation, a non-linear advection-diffusion PDE. The equation, along with the initial and Dirichlet boundary conditions, is given by

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x², x ∈ [0, L], t ∈ (0, T_max],
u(x, 0) = x / (1 + √(1/t₀) exp(Re x²/4)), u(0, t) = u(L, t) = 0,

where the length L = 1 m and the maximum time T_max = 2 s. The solutions of the above equations develop sharp gradients even from smooth initial conditions if the viscosity ν is sufficiently small, due to the advection-dominated behavior. The analytical solution of the problem is given by:

u(x, t) = (x/(t + 1)) / (1 + √((t + 1)/t₀) exp(Re x²/(4t + 4))),

where t₀ = exp(Re/8) and Re = 1/ν. The high-fidelity solution vectors are generated by directly evaluating the analytical solution over a uniformly discretized spatial domain containing 200 grid points (n_s = 200) at 250 uniform time-steps (T = 250) for two values of Re: 300 and 600. The solution vectors obtained are then used to train the autoencoder and forecasting models (Section 2.1). For the autoencoder training, 200 solution vectors are chosen at random time-steps, and the remaining 50 are used for validation. For the forecasting models, the training set is comprised of the first 150 compressed samples, each sample containing n_t consecutive solution vectors (i.e., lookback window = n_t), where n_t is a hyperparameter; the validation set is composed of the remaining samples. For testing, n_t latent vectors from the start of the dataset are fed to the forecasting model to predict the subsequent time-steps via auto-regression (Table 1).
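Assuming the standard closed form u(x,t) = (x/(t+1)) / (1 + √((t+1)/t₀)·exp(Re·x²/(4t+4))) with t₀ = exp(Re/8), the snapshot matrix can be generated directly on the grid (the function name and defaults are illustrative):

```python
import numpy as np

def burgers_snapshots(Re=300, n_s=200, T=250, L=1.0, t_max=2.0):
    """Evaluate the analytical advecting-shock solution of the 1D Burgers'
    equation on a uniform space-time grid, returning a (n_s, T) snapshot matrix."""
    x = np.linspace(0.0, L, n_s)
    t = np.linspace(0.0, t_max, T)
    t0 = np.exp(Re / 8.0)
    X, Tm = np.meshgrid(x, t, indexing='ij')        # X, Tm have shape (n_s, T)
    u = (X / (Tm + 1.0)) / (1.0 + np.sqrt((Tm + 1.0) / t0)
                            * np.exp(Re * X**2 / (4.0 * Tm + 4.0)))
    return x, t, u
```

Each column of `u` is one snapshot vector v^i; these columns are what the autoencoder compresses and the forecasting models advance in time.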

Autoencoders for spatial compression
Two types of autoencoder architectures (Section 2.2), MLP (referred to as AE) and convolutional (referred to as CAE), are proposed for the compression of the solution vectors v^i ∈ ℝ^{n_s} to latent vectors z^i ∈ ℝ^m by the encoder χ_e. Sequences formed from these latent vectors, [z^{i−n_t+1}, …, z^{i−1}, z^i] ∈ ℝ^{m×n_t}, are used to train the forecasting models LSTM, TCN, and CNN. The trained models are then used to forecast the latent vectors at the time-steps subsequent to the input sequence, [z^{n_t+1}, z^{n_t+2}, …, z^T] ∈ ℝ^{m×(T−n_t)}. The latent vectors are then reconstructed into solution vectors, [v^{n_t+1}, v^{n_t+2}, …, v^T] ∈ ℝ^{n_s×(T−n_t)}, using the decoder χ_d. The heat-map plots obtained by stacking these reconstructed solution vectors along the x-axis (spatial nodes n_s along y, time-steps along x) are illustrated for both autoencoder models in Appendix A (Table 11).
Both architectures, AE and CAE, are capable of efficiently compressing the solution vectors to latent vectors with few modes and faithful reconstruction/decompression. However, only CAE compression followed by CNN auto-regression produces accurate results in extrapolation. This is because the proposed CAE architecture is devoid of any dense layer (single layer of neurons), so that even during compression the local spatial information in the vector remains preserved. This consistency facilitates the modeling of the latent dynamics by the CNN model, as it convolves along the spatial axis of the input and utilizes information from the neighboring cells at the provided time-steps to predict nodal values at subsequent time-steps. The hyperparameters for the AE and CAE architectures are listed in Table 2, where the encoder layers denote the number of neurons in the two dense encoder layers of the AE and the number of channels in the convolution layers of the CAE. The decoders of both autoencoders are mirrored structures of their encoders.

LSTM model
When the LSTM (Section 2.3.1) is used as the future-step predictor, it takes in a sequence of n_t (lookback window) latent vectors (spatial dimension = m) obtained by compression from the encoder network (Z ∈ ℝ^{m×n_t}) to produce the latent vector for the next time-step (z^{i+1} ∈ ℝ^m). The LSTM model consists of multiple LSTM layers stacked together, each having a hidden dimension equal to the latent dimension of the solution vectors. The sets of hyperparameters considered for the LSTM network, for both Re = 300 and 600, are summarized in Table 3. The models are trained in batches of size 15, and the loss values for both training and validation converge within 3000 epochs. The model with the least validation loss has a lookback window of size 10 and a single LSTM layer with hidden dimension 50 for both Re = 300 and 600. The extrapolation (Figure 11) and error plots (Figure 12) obtained from these models show that the LSTM model accurately predicts the solution vectors for time-steps within the training domain (i ≤ 150), but the solution does not change for time-steps outside the training domain, so the relative error increases drastically, reaching 35% for Re = 300 and 50% for Re = 600.

TCN model
Similar to the LSTM, the TCN model (Section 2.3.2) takes in a sequence of n_t latent vectors obtained by compression from the encoder network (Z ∈ ℝ^{m×n_t}) to produce the latent vector for the next time-step (z^{i+1} ∈ ℝ^m), with the dilated convolutions operating on the temporal axis and the latent dimension passed as channels. The TCN model consists of either 2 or 3 TCN blocks, each with the same kernel size and number of channels, but with dilations increasing by a factor of 2 in subsequent blocks. The hyperparameters for the TCN network for both Re = 300 and 600 are summarized in Table 4.

The considered hyperparameter values include a sequence length (n_t) of 5, 10, or 20, and TCN block channels of [32,32], [64,64], or [32,32,32]. For Re = 300, the best model has 3 temporal blocks of 64 channels each, whereas for Re = 600 it has 2 TCN blocks with kernel size 3 and 64 channels each. The extrapolation (Figure 13) and error plots (Figure 14) obtained from these models indicate that the TCN model accurately predicts the solution vectors for time-steps within the training domain (i ≤ 150), but stops being accurate after the end of training, so that the error increases to 40% for Re = 300 and 50% for Re = 600. However, if the same model architecture operates on the input sequence such that the dilated 1D convolutions propagate along the spatial axis, with each solution vector on a separate channel, accurate forecasts are obtained.

CNN model
The proposed CNN model (Section 2.3.3) takes in a sequence of n_t latent vectors obtained by compression from the encoder network (Z^T ∈ ℝ^{n_t×m}) to produce the latent vector for the next time-step (z^{i+1} ∈ ℝ^m), with 1D convolutions operating on the spatial axis (the latent dimension) and the n_t latent vectors on separate channels. The CNN model consists of two residual blocks, each with the same kernel size and number of channels. The hyperparameters for the CNN network for both Re = 300 and 600 are summarized in Table 5.
Training and validation loss converges in 3000 epochs for a batch size of 15. The model with the least validation loss has a lookback window of size 10, and each of its blocks has a kernel size of 3 and 50 channels, for both Re = 300 and Re = 600. It is clear from the extrapolation (Figure 15) and error plots (Figure 16) that the CNN model accurately captures the latent dynamics and predicts solution vectors accurately for time steps beyond the training domain. The error values increase with time due to the accumulation of errors, since each subsequent time step is predicted auto-regressively, i.e., using previously-predicted time steps that contain slight errors. Still, the error reaches a mere 2.5% for Re = 300 and 3.5% for Re = 600, which is significantly less than that produced by the other models.
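The auto-regressive rollout itself is model-agnostic and can be sketched as below; the sliding-window update (drop the oldest vector, append the newest prediction) is the mechanism by which the small per-step errors mentioned above accumulate. The persistence predictor is a toy stand-in for any trained future-step model.

```python
import torch

def rollout(predictor, z_window, n_future):
    """Auto-regressively forecast n_future latent vectors. `predictor` maps a
    (batch, n_t, m) lookback window to the next (batch, m) latent vector."""
    preds = []
    window = z_window.clone()
    for _ in range(n_future):
        z_next = predictor(window)                              # (batch, m)
        preds.append(z_next)
        # slide the window: drop the oldest step, append the prediction
        window = torch.cat([window[:, 1:], z_next.unsqueeze(1)], dim=1)
    return torch.stack(preds, dim=1)                            # (batch, n_future, m)

persistence = lambda w: w[:, -1]        # toy predictor: repeat the last vector
future = rollout(persistence, torch.randn(2, 10, 50), n_future=5)
```

With the persistence predictor, every forecast step simply repeats the last observed vector, which mimics the "solution stops changing" failure mode observed for the LSTM and TCN during extrapolation.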

1D Stoker's Equation
Stoker's solution describes the wave propagation and rarefaction resulting from a one-dimensional dam break over a wet, flat, frictionless bottom. It is considered among the most challenging benchmark test cases due to its strong hyperbolic behavior and the discontinuity accompanying the propagating front wave resulting from the initial break. The dynamics are initiated by unequal water levels on the upstream and downstream sides of a dam located in the middle of the 100 m study domain.
The upstream water level is considered as an input random variable whose values are uniformly sampled within its plausible variability range h_up ∈ [8, 11] m, whereas the downstream water depth is kept constant at a deterministic value h_ds = 1 m. The analytical solution for the water level is a piecewise function of the axial position x and time (Table 6).
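For reference, Stoker's analytical depth profile can be evaluated numerically. The sketch below follows the classical piecewise form found in standard shallow-water benchmark references (e.g., the SWASHES collection) rather than the expression printed in the paper, so the exact formulation should be treated as an assumption; the middle-state celerity c_m is found by bisection.

```python
import math

G = 9.81  # gravitational acceleration [m/s^2]

def stoker_depth(x, t, h_up=10.0, h_ds=1.0, x0=50.0):
    """Water depth h(x, t) for a 1D dam break over a wet, flat, frictionless
    bed (classical Stoker solution; dam initially at x0)."""
    c0 = math.sqrt(G * h_up)                      # upstream celerity
    # middle-state celerity c_m: root of the shock jump condition
    f = lambda c: (-8.0 * G * h_ds * c**2 * (c0 - c)**2
                   + (c**2 - G * h_ds)**2 * (c**2 + G * h_ds))
    lo, hi = math.sqrt(G * h_ds), c0              # f(lo) < 0 < f(hi)
    for _ in range(100):                          # plain bisection
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    cm = 0.5 * (lo + hi)
    xa = x0 - c0 * t                              # tail of the rarefaction
    xb = x0 + (2.0 * c0 - 3.0 * cm) * t           # head of the rarefaction
    xc = x0 + 2.0 * cm**2 * (c0 - cm) / (cm**2 - G * h_ds) * t  # shock front
    if x <= xa:
        return h_up
    if x <= xb:                                   # rarefaction fan
        return (4.0 / (9.0 * G)) * (c0 - (x - x0) / (2.0 * t))**2
    if x <= xc:                                   # constant middle state
        return cm**2 / G
    return h_ds
```

The profile is continuous across the rarefaction fan and discontinuous at the shock front, which is the feature that makes this benchmark difficult for data-driven forecasting.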

Autoencoder for spatial compression
A methodology similar to that of the Burgers' test case (Section 3) is adopted for the training of the autoencoder models, AE and CAE, and the forecasting models, LSTM, TCN, and CNN, for the Stoker's problem. The heat-map plots obtained by stacking the predicted solution vectors along the x-axis (spatial nodes n_s along the y-axis, time steps along the x-axis) are illustrated for both autoencoder models in Appendix A (Table 12).
Both AE and CAE effectively transform the solution vectors to a reduced latent space, since they produce accurate reconstructions for vectors both within and outside of the autoencoder's training domain. But again, only the CAE-CNN model learns the latent dynamics accurately enough to predict solution vectors outside the training domain (extrapolation). The hyperparameters for the AE and CAE architectures are listed in Table 7.
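A minimal 1D convolutional autoencoder of this kind can be sketched as follows; the layer counts, channel widths, and the 256-node grid are illustrative assumptions, not the architecture reported in Table 7.

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """1D convolutional autoencoder for spatial compression: strided
    convolutions halve the resolution twice, a linear bottleneck yields the
    latent vector of dimension m, and the decoder mirrors the encoder."""
    def __init__(self, n_s=256, latent_dim=50):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # n_s -> n_s/2
            nn.Conv1d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> n_s/4
            nn.Flatten(),
            nn.Linear(32 * (n_s // 4), latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * (n_s // 4)),
            nn.Unflatten(1, (32, n_s // 4)),
            nn.ConvTranspose1d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, u):
        z = self.encoder(u)            # (batch, m) latent vector
        return self.decoder(z), z      # reconstruction and latent code

cae = CAE()
u_hat, z = cae(torch.randn(4, 1, 256))   # 4 snapshots on a 256-node grid
```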

LSTM
The LSTM model receives an input Z ∈ R^(m×n_t) and predicts z_(i+1) ∈ R^m (Section 2.3.1). The LSTM model has an architecture similar to that of the Burgers' case, with multiple LSTM layers having the latent dimension as their hidden dimension. The hyperparameters of the LSTM network for the Stoker's problem are summarized in Table 8.
The models are trained in batches of size 15, and the loss values for both training and validation converge in 2400 epochs. The model with the least validation loss has a lookback window of size 10 and 2 LSTM layers with hidden dimensions of 125 each. The extrapolation (Figure 17) and error plots (Figure 18) obtained from these models show that the LSTM model accurately estimates the solution vectors for time steps within the training domain (i ≤ 250), but fails outside the training domain, as the relative error reaches 25% by the 150th time step after the end of training.

TCN model
The TCN model also receives a sequence Z ∈ R^(m×n_t) and forecasts z_(i+1) ∈ R^m, with the dilated convolutions operating on the temporal axis and the latent dimension passed as the channel dimension. The model contains three temporal blocks (Section 2.3.2), with the same kernels and channels in each, and dilations that increase by a factor of two in subsequent blocks. The hyperparameters for the TCN network are listed in Table 9. Model training and validation loss reaches convergence by 1000 epochs when training and validation are performed in batches of size 15. The best model (with the least validation loss) accommodates a sequence of vectors with a lookback window of size 20 and 125 latent modes. Each temporal block contains kernels of size 3 and has 200 channels. The extrapolation (Figure 19) and error plots (Figure 20) obtained from these models indicate that, similar to the LSTM, the TCN also predicts the solution vectors with acceptable accuracy for time steps within the training domain (i ≤ 250), but the solution stops propagating in the extrapolative domain, so the error increases to 23%.

CNN model
The proposed CNN model takes the transposed vector sequence Z^T ∈ R^(n_t×m) to produce z_(i+1) ∈ R^m, with 1D convolutions operating on the spatial axis (the latent dimension) and the n_t latent vectors present on separate channels. The CNN model consists of three residual blocks (Section 2.3.3), each with the same kernel size and number of channels. The hyperparameters for the CNN model are summarized in Table 10. The training and validation loss converges at around 1000 epochs when batches of size 16 are used. The most accurate model (least validation loss) has a lookback window of 20 steps, with latent vectors of dimension 125 at every step. Each of the three blocks possesses a kernel of size 3 and 100 channels in its 1D convolution layers. The extrapolation (Figure 21) and error plots (Figure 22) indicate that the CNN model is capable of modelling the latent dynamics, since accurate forecasts are produced for time steps beyond the training domain. The error values increase over time, reaching 5%, which is remarkably lower than those of the earlier models.

Uncertainty Quantification using Deep Ensembles
The training-testing methodology described previously (Section 2.4) for the evaluation of variance in the predicted output is adopted for the most stable (i.e., similar results for different initializations) and most accurate (i.e., least relative error on extrapolation) forecasting model, namely the CNN future-step predictor with a CAE autoencoder. After the model is trained to minimize the NLL, the mean and variance of the latent vector, μ_z* and σ²_z*, are obtained from the ensemble. The extrapolation results obtained by the ensemble are shown in Figure 23 for the 1D Burgers' test case with Re = 300 and 600, and in Figure 25 for the 1D Stoker's test case.
The predicted mean for all time steps outside the training domain overlaps with the actual value, and the error reaches a maximum of 6% (Figure 24) for the Burgers' problem and 3.5% (Figure 26) for the Stoker's problem. The variance is largest at the spatial terminal ends of the predicted solution vectors, since these nodes lack neighbours on one side, which the 1D convolution filters need for information extraction and accurate future-step estimation. The variance remains almost constant throughout the prediction time, indicating the robustness of the proposed model.
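The deep-ensemble machinery referenced here can be sketched as follows, assuming each member outputs a mean and a log-variance for the next latent vector. The Gaussian NLL loss and the mixture moment-matching formulas are the standard ones from the deep-ensembles literature, and the toy members below are hypothetical stand-ins for trained networks.

```python
import torch

def gaussian_nll(mu, log_var, target):
    """Heteroscedastic Gaussian negative log-likelihood (up to a constant),
    the per-member training loss."""
    return 0.5 * (log_var + (target - mu) ** 2 / log_var.exp()).mean()

def ensemble_predict(members, z_seq):
    """Aggregate member (mu, log_var) predictions into mixture moments:
    mu* = mean of means; var* = mean(var_i + mu_i^2) - mu*^2."""
    outs = [f(z_seq) for f in members]
    mus = torch.stack([mu for mu, _ in outs])
    vars_ = torch.stack([lv.exp() for _, lv in outs])
    mu_star = mus.mean(0)
    var_star = (vars_ + mus ** 2).mean(0) - mu_star ** 2
    return mu_star, var_star

def make_member(bias):
    # hypothetical member: predicts the last vector plus a bias, unit variance
    return lambda z: (z[:, -1] + bias, torch.zeros_like(z[:, -1]))

members = [make_member(b) for b in (-0.1, 0.0, 0.1)]
mu_star, var_star = ensemble_predict(members, torch.randn(2, 10, 50))
```

Note that var* combines each member's own predicted variance with the spread between member means, so disagreement among the ensemble directly inflates the reported uncertainty.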

Conclusion
This study proposes a Convolutional Autoencoder (CAE) model for compression and a CNN future-step predictor that forecasts the vector solutions for the time steps subsequent to an input vector sequence. The approximation accuracy and time-extrapolation capabilities of the model are evaluated using two advection-dominated flow problems, the 1D Burgers' equation and the 1D Stoker's equation, which are characterized by sharp gradients and discontinuities, respectively. The models built especially for time-series forecasts, the LSTM and TCN models with propagation over time, produce acceptable results within the training domain, but the solution stops changing during extrapolation. However, when the convolutions operate along the spatial axis with the lookback latent vectors as separate channels, as in the proposed CNN model, accurate forecasts are obtained well beyond the training domain. Uncertainty quantification for the variance-informed CAE-CNN model is performed using deep ensembles, which produce accurate predictions with low variance for long-term extrapolation as well (errors less than 6%). In addition, since the CNN model uses convolutions (like the TCN [34]), training and evaluation can be performed in parallel for long input sequences, in contrast to the LSTM. Thus, a fast, accurate, and robust framework is provided for order reduction. The model architecture is flexible and can be extended to two- and three-dimensional spaces as well, by increasing the dimension of the convolutional filters.
It would be interesting to test whether the CAE-CNN architecture can adapt to problems where the solutions/data for the model are obtained using unstructured meshes. Future work could also focus on adapting the architecture to resolve real-life engineering problems.

Figure 2: Autoregressive testing method for forecasting models

Figure 7: 1D dilated filters convolving on the temporal dimension of vectors

Figure 9: 1D CNN filters convolving on the spatial dimension of vectors

Figure 12: Burgers' problem: L2 relative error of the auto-regressive predictions with increasing time for Re = 300 (a) and Re = 600 (b) for the LSTM model

Figure 16: Burgers' problem: L2 relative error of the auto-regressive predictions using the CNN model with increasing time for Re = 300 (a) and Re = 600 (b)

Figure 18: Stoker's problem: L2 relative error of the auto-regressive predictions with increasing time for the LSTM model

Figure 21: Stoker's problem: Extrapolative auto-regressive predictions using the CNN model for time-steps = 320, 370 and 420; the end of training is at time-step = 270

Figure 23: Burgers' problem: Extrapolative auto-regressive predictions by the deep ensemble CAE-CNN model for time-steps = 180, 200 and 220, for Re = 300 (a) and Re = 600 (b); the end of training is at time-step = 160

Figure 24: Burgers' problem: L2 relative error of the auto-regressive predictions with increasing time for Re = 300 (a) and Re = 600 (b) for a deep ensemble CAE-CNN model

Figure 25: Stoker's problem: Extrapolative auto-regressive predictions by a deep ensemble CAE-CNN model for time-steps = 320, 370 and 420; the end of training is at time-step = 270
Figure 26: Stoker's problem: L2 relative error of the auto-regressive predictions with increasing time for a deep ensemble CAE-CNN model

Table 11: Burgers' problem: auto-regressive forecasts for Re = 300 (top) and Re = 600 (bottom) from the forecasting models when latent vectors are obtained by AE and CAE
Table 12: Auto-regressive forecasts for the Stoker's problem from the forecasting models when latent vectors are obtained by AE and CAE

Table 2: Hyperparameters for the AE and CAE networks

Table 3: Hyperparameters for the LSTM network

Table 4: Hyperparameters for the TCN network

Table 5: Hyperparameters for the CNN network

Table 7: Hyperparameters for the AE and CAE networks

Table 8: Hyperparameters for the LSTM network

Table 9: Hyperparameters for the TCN network

Table 10: Hyperparameters for the CNN model