# Parameter estimation via conditional expectation: a Bayesian inversion

Hermann G. Matthies^{1}, Elmar Zander^{1}, Bojana V. Rosić^{1} and Alexander Litvinenko^{2}

**3**:24

https://doi.org/10.1186/s40323-016-0075-7

© The Author(s) 2016

**Received: **12 March 2016

**Accepted: **21 June 2016

**Published: **11 August 2016

## Abstract

When a mathematical or computational model is used to analyse some system, it is usual that some parameters resp. functions or fields in the model are not known, and hence uncertain. These parametric quantities are then identified by actual observations of the response of the real system. In a probabilistic setting, Bayes’s theory is the proper mathematical background for this identification process. The possibility of being able to compute a conditional expectation turns out to be crucial for this purpose. We show how this theoretical background can be used in an actual numerical procedure, and briefly discuss various numerical approximations.

## Background

The fitting of parameters resp. functions or fields—these will all, for the sake of brevity, be referred to as parameters—in a mathematical computational model is usually denoted as an inverse problem, in contrast to predicting the output or state resp. response of the system given certain inputs, which is called the forward problem. In the inverse problem, the response of the model is compared to the response of the system. The system may be a real world system, or just another computational model—usually a more complex one. One then tries in various ways to match the model response with the system response.

Typical deterministic procedures include such methods as minimising the mean square error (MMSE), leading to optimisation problems in the search for optimal parameters. As the inverse problem is typically ill-posed—the observations do not contain enough information to uniquely determine the parameters—some additional information has to be added to select a unique solution. In the deterministic setting one then typically invokes additional ad hoc procedures like Tikhonov regularisation [3, 4, 28, 29].

In a probabilistic setting (e.g. [10, 27] and references therein) the ill-posed problem becomes well-posed (e.g. [26]). This is achieved at a cost, though. The unknown parameters are considered as uncertain, and modelled as random variables (RVs). The information added is hence the *prior* probability distribution. This means on one hand that the result of the identification is a probability distribution, and not a single value, and on the other hand that the computational work may be increased substantially, as one has to deal with RVs. That the result is a probability distribution may be seen as additional information though, as it offers an assessment of the residual uncertainty after the identification procedure, something which is not readily available in the deterministic setting. The probabilistic setting thus can be seen as modelling our knowledge about a certain situation—the value of the parameters—in the language of probability theory, and using the observation to update our knowledge (i.e. the probabilistic description) by *conditioning* on the observation.

The key probabilistic background for this is Bayes’s theorem in the formulation of Laplace [10, 27]. It is well known that the Bayesian update is theoretically based on the notion of conditional expectation (CE) [1]. Here we take an approach which uses CE not only as a theoretical basis, but also as a basic computational tool. This may be seen as somewhat related to the “Bayes linear” approach [6, 13], which has a linear approximation of CE as its basis, as will be explained later.

In many cases, for example when tracking a dynamical system, the updates are performed sequentially step-by-step, and to perform the next step one needs not only a probability distribution, but a random variable which may be evolved through the state equation. Methods on how to transform the prior RV into one which is conditioned on the observation will be discussed as well [18]. This approach is very different from the frequently used one which refers to Bayes’s theorem in terms of densities and likelihood functions, and typically employs Markov chain Monte Carlo (MCMC) methods to sample from the posterior (see e.g. [9, 16, 24]).

## Mathematical set-up

### Data model

*q*), where we assume for simplicity again that \(\mathcal {Q}\) is some Hilbert space. \(A_{\mathcal {V}}\), \(\tilde{\upsilon }_0\), and \(\eta \) could all involve some noise, so that one may view Eq. (3) as an instance of a stochastic evolution equation. This is our model of the system generating the observed data, which we assume to be well-posed.

*q*, i.e. \(\hat{Y}:\mathcal {Q}\times \mathcal {V}\rightarrow \mathcal {Y}\), where we assume that \(\mathcal {Y}\) is a Hilbert space. To make things simple, assume additionally that we observe \(\hat{Y}(q;\tilde{\upsilon }(t))\) at regular time intervals \(t_n = n \cdot \mathrm {\Delta } t\), i.e. we observe \(y_{n}=\hat{Y}(q;\tilde{\upsilon }_n)\), where \(\tilde{\upsilon }_n := \tilde{\upsilon }(t_n)\). Denote the solution operator \(\Upsilon \) of Eq. (3) as

*v* is a random vector, and for each \(\tilde{\upsilon }\), \(S_{\mathcal {V}}(\tilde{\upsilon })\) is a bounded linear map to \(\mathcal {Y}\).

### Identification model

*q* as in Eq. (3), to be used for the identification, which we shall only write in its abstract form. Hence we assume that we can actually integrate Eq. (6) from \(t_n\) to \(t_{n+1}\) with its solution operator

*U*

*q*, and the identification may happen sequentially, i.e. our estimate of *q* will change from step *n* to step \(n+1\), we shall introduce an “extended” state vector \(x=(u,q)\in \mathcal {X}:=\mathcal {Q}\times \mathcal {U}\) and describe the change from *n* to \(n+1\) by

## Synopsis of Bayesian estimation

There are many accounts of this, and this synopsis is just for the convenience of the reader and to introduce notation. Otherwise we refer to e.g. [6, 10, 13, 27], and in particular for the rôle of conditional expectation (CE) to our work [18, 24].

The idea is that the observation \(\hat{y}\) from Eq. (5), which depends on the unknown parameters *q*, and the prediction \(y_{n}\) from Eq. (10), which in turn depends on the parameters *q* both directly and through the state \(x = (u(q),q)\) in Eq. (9), should be equal, and any difference should give an indication of what the “true” value of *q* is. The problem in general is—apart from the distracting errors *w* and *v*—that the mapping \(q \mapsto y=Y(q;u(q))\) is not invertible, i.e. *y* does not contain enough information to uniquely determine *q*, or there are many *q* which give a good fit for \(\hat{y}\). Therefore the *inverse* problem of determining *q* from observing \(\hat{y}\) is termed an *ill-posed* problem.

The situation is a bit comparable to Plato’s allegory of the cave, where Socrates compares the process of gaining knowledge with looking at the shadows of the real things. The observations \(\hat{y}\) are the “shadows” of the “real” things *q* and \(\tilde{\upsilon }\), and from observing the “shadows” \(\hat{y}\) we want to infer what “reality” is, in a way turning our heads towards it. We hence want to “free” ourselves from just observing the “shadows” and gain some understanding of “reality”.

One way to deal with this difficulty is to measure the difference between observed \(\hat{y}_n\) and predicted system output \(y_n\) and try to find parameters \(q_n\) such that this difference is minimised. Frequently it may happen that the parameters which realise the minimum are not unique. In case one wants a unique parameter, a choice has to be made, usually by demanding additionally that some norm or similar functional of the parameters is small as well, i.e. some regularity is enforced. This optimisation approach hence leads to regularisation procedures [3, 4, 28, 29].
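To illustrate this optimisation view, the following small sketch shows a Tikhonov-regularised least squares fit; the linear forward model and all numbers are made up purely for illustration and are not part of the computations reported later.

```python
import numpy as np

# Hypothetical linear forward model y = A q with fewer informative data than unknowns,
# so the plain least-squares problem is ill-posed; all numbers are illustrative.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 40))                      # forward operator (underdetermined)
q_true = rng.standard_normal(40)
y_hat = A @ q_true + 0.01 * rng.standard_normal(20)    # noisy observation

alpha = 1e-2                                           # regularisation parameter
# Tikhonov-regularised least squares: min_q ||y_hat - A q||^2 + alpha ||q||^2,
# solved via the regularised normal equations (A^T A + alpha I) q = A^T y_hat.
q_reg = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y_hat)

print("residual norm   :", np.linalg.norm(y_hat - A @ q_reg))
print("norm of estimate:", np.linalg.norm(q_reg))
```

The regularisation term enforces the additional demand that the norm of the parameters stays small, thereby selecting one of the many minimisers.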

Here we take the view that our lack of knowledge or uncertainty of the actual value of the parameters can be described in a *Bayesian* way through a probabilistic model [10, 27]. The unknown parameter *q* is then modelled as a random variable (RV)—also called the *prior* model—and additional information on the system through measurement or observation changes the probabilistic description to the so-called *posterior* model. The second approach is thus a method to update the probabilistic description in such a way as to take account of the additional information, and the updated probabilistic description *is* the parameter estimate, including a probabilistic description of the remaining uncertainty.

It is well-known that such a Bayesian update is in fact closely related to *conditional expectation* [1, 6, 10, 18, 24], and this will be the basis of the method presented. For these and other probabilistic notions see for example [22] and the references therein. As the Bayesian update may be numerically very demanding, we show computational procedures to accelerate this update through methods based on *functional approximation* or *spectral representation* of stochastic problems [17, 18]. These approximations are in the simplest case known as Wiener’s so-called *homogeneous* or *polynomial chaos* expansion, which uses polynomials in independent Gaussian RVs—the “chaos”—and which can also be used numerically in a Galerkin procedure [17, 18].

Although the Gauss-Markov theorem and its extensions [15] are well-known, as well as its connections to the Kalman filter [7, 11]—see also the recent Monte Carlo or *ensemble* version [5]—the connection to Bayes’s theorem is not often appreciated, and is sketched here. This turns out to be a linearised version of *conditional expectation*.

Since the parameters of the model to be estimated are uncertain, all relevant information may be obtained via their stochastic description. In order to extract information from the posterior, most estimates take the form of expectations w.r.t. the posterior, i.e. a conditional expectation (CE). These expectations—mathematically integrals, numerically to be evaluated by some quadrature rule—may be computed via asymptotic, deterministic, or sampling methods, typically by first computing the posterior density. As we will see, the posterior density does not always exist [23]. Here we follow our recent publications [18, 21, 24] and introduce a novel approach, namely computing the CE directly and not via the posterior density [18]. This way all relevant information from the conditioning may be computed directly. In addition, we want to change the prior, represented by a random variable (RV), into a new random variable which has the correct posterior distribution. We will discuss how this may be achieved, and what approximations one may employ in the computation.

*expectation* corresponding to \({\mathbb {P}}\) will be denoted by \(\mathbb {E}\left( \right) \), e.g.

*x*.

By modelling our lack of knowledge about *q* in a Bayesian way [6, 10, 27], i.e. replacing it with a random variable (RV), the problem becomes well-posed [26]. But of course one is looking now at the problem of finding a probability distribution that best fits the data; and one also obtains a probability distribution, not just *one* value *q*. Here we focus on the use of procedures similar to a linear Bayesian approach [6] in the framework of “white noise” analysis.

As formally *q* is a RV, so is the state \(x_n\) of Eq. (9), reflecting the uncertainty about the parameters and state of Eq. (3). From this it follows that the prediction of the measurement \(y_n\) in Eq. (10) is also a RV; i.e. we have a *probabilistic* description of the measurement.

### The theorem of Bayes and Laplace

Bayes’s original statement of the theorem which today bears his name covered only a very special case. The form which we know today is due to Laplace, and it is a statement about conditional probabilities. A good account of the history may be found in [19].

*x*’s on which we would like to gain some information, and \(\mathcal {M}_y\subset \mathcal {Y}\) is the information provided by the measurement. The term \({\mathbb {P}}(\mathcal {I}_x)\) is the so-called *prior*, it is what we know before the observation \(\mathcal {M}_y\). The quantity \({\mathbb {P}}(\mathcal {M}_y|\mathcal {I}_x)\) is the so-called *likelihood*, the conditional probability of \(\mathcal {M}_y\) assuming that \(\mathcal {I}_x\) is given. The term \({\mathbb {P}}(\mathcal {M}_y)\) is the so-called *evidence*, the probability of observing \(\mathcal {M}_y\) in the first place, which sometimes can be expanded with the *law of total probability*, allowing one to choose between different models of explanation. It is necessary to make the right hand side of Eq. (12) into a real probability—summing to unity—and hence the term \({\mathbb {P}}(\mathcal {I}_x|\mathcal {M}_y)\), the *posterior*, reflects our knowledge on \(\mathcal {I}_x\) *after* observing \(\mathcal {M}_y\). The quotient \({\mathbb {P}}(\mathcal {M}_y|\mathcal {I}_x)/{\mathbb {P}}(\mathcal {M}_y)\) is sometimes termed the *Bayes* factor, as it reflects the relative change in probability after observing \(\mathcal {M}_y\).

This statement Eq. (12) runs into problems if the set of observations \(\mathcal {M}_y\) has vanishing measure, \({\mathbb {P}}(\mathcal {M}_y)=0\), as is the case when we observe *continuous* random variables, and the theorem would have to be formulated in terms of *densities*, or more precisely probability density functions (pdfs). But the Bayes factor then has the indeterminate form 0 / 0, and some form of limiting procedure is needed. As a sign that this is not so simple—there are different and inequivalent forms of doing it—one may just point to the so-called *Borel-Kolmogorov* paradox. See [23] for a thorough discussion.

*y* and *x* have a *joint* pdf \(\pi _{y,x}(y,x)\). As *y* is essentially a function of *x*, this is a special case depending on conditions on the error term *v*. In this case Eq. (12) may be formulated as

*conditional* pdf, and the “evidence” \(Z_s\) (from German *Zustandssumme* (sum of states), a term used in physics) is a normalising factor such that the conditional pdf \(\pi _{x|y}(\cdot |y)\) integrates to unity

*likelihood density* \(\pi _{y|x}(y|x)\) and the *prior* pdf \(\pi _x(x)\)

*conditional expectation* (CE) \(\mathbb {E}\left( \cdot |\mathcal {M}_y\right) \) may be defined as an integral over that conditional measure resp. the conditional pdf. Thus classically, the conditional measure or pdf implies the conditional expectation:

*x*.

Please observe that the model for the RV representing the error \(v(\omega )\) determines the likelihood functions \({\mathbb {P}}(\mathcal {M}_y|\mathcal {I}_x)\) resp. the existence and form of the likelihood density \(\pi _{y|x}(\cdot |x)\). In computations, it is here that the computational model Eqs. (6) and (10) is needed to predict the measurement RV *y* given the state and parameters *x* as a RV.
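As a small illustration of this classical density-based formulation, before we pass on to conditional expectation, the following sketch computes a posterior pdf on a grid; the scalar model, the Gaussian prior and noise, and all numbers are purely illustrative assumptions.

```python
import numpy as np

# Minimal sketch of Bayes's theorem with densities for a scalar parameter x
# observed as y_hat = x + v with Gaussian noise v; all numbers are illustrative.
x_grid = np.linspace(-5.0, 5.0, 2001)
dx = x_grid[1] - x_grid[0]

prior = np.exp(-0.5 * x_grid**2) / np.sqrt(2 * np.pi)          # standard normal prior pdf
sigma_v = 0.5                                                   # noise standard deviation
y_hat = 1.3                                                     # the actual observation
likelihood = np.exp(-0.5 * ((y_hat - x_grid) / sigma_v) ** 2)   # unnormalised likelihood density

evidence = np.sum(likelihood * prior) * dx                      # normalising factor Z_s
posterior = likelihood * prior / evidence                       # conditional pdf pi_{x|y}(.|y_hat)

print("posterior mean:", np.sum(x_grid * posterior) * dx)
```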

Most computational approaches determine the pdfs, but we will later argue that it may be advantageous to work directly with RVs, and not with conditional probabilities or pdfs. To this end, the concept of conditional expectation (CE) and its relation to Bayes’s theorem is needed [1].

### Conditional expectation

To avoid the difficulties with conditional probabilities like in the Borel-Kolmogorov paradox alluded to in the “The theorem of Bayes and Laplace” section, *Kolmogorov* himself—when he was setting up the axioms of the mathematical theory of probability—turned the relation between conditional probability or pdf and conditional expectation around, and defined as a first and fundamental notion the *conditional expectation* [1, 23].

It has to be defined not with respect to measure-zero observations of a RV *y*, but w.r.t. sub-\(\sigma \)-algebras \(\mathfrak {B}\subset \mathfrak {A}\) of the underlying \(\sigma \)-algebra \(\mathfrak {A}\). The \(\sigma \)-algebra may be loosely seen as the collection of subsets of \(\varOmega \) on which we can make statements about their probability, and for fundamental mathematical reasons in many cases this is *not* the set of *all* subsets of \(\varOmega \). The sub-\(\sigma \)-algebra \(\mathfrak {B}\) may be seen as the sets on which we learn something through the observation.

*finite variance*, i.e. the Hilbert space

*closed* subspace, and hence has a well-defined continuous orthogonal projection \(P_\mathfrak {B}: \mathcal {S}\rightarrow \mathcal {S}_\mathfrak {B}\). The *conditional expectation* (CE) of a RV \(r\in \mathcal {S}\) w.r.t. a sub-\(\sigma \)-algebra \(\mathfrak {B}\) is then defined as that orthogonal projection

*unconditional* expectation \(\mathbb {E}\left( \right) \) is in this view just the CE w.r.t. the minimal \(\sigma \)-algebra \(\mathfrak {B}=\{\emptyset , \varOmega \}\). As the CE is an orthogonal projection, it minimises the squared error

*variational equation* or orthogonality relation

*Pythagoras’s* theorem

*minimum mean square error* (MMSE) estimator.
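The MMSE property may be illustrated numerically: in the following sketch the CE is approximated by a least squares fit of a polynomial in the observation to Monte Carlo samples, which is nothing but the orthogonal projection onto a small subspace of functions of *y*; the observation model and all numbers are illustrative assumptions.

```python
import numpy as np

# Sketch: the CE E(x|y) is the orthogonal projection of x onto functions of y,
# i.e. the minimum-mean-square-error estimator. Here it is approximated by a
# least-squares fit of a cubic polynomial phi(y) to Monte Carlo samples.
rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(0.0, 1.0, N)                  # prior RV
v = rng.normal(0.0, 0.5, N)                  # measurement noise
y = x**3 + v                                 # nonlinear observation y = h(x, v)

V = np.vander(y, 4, increasing=True)         # basis 1, y, y^2, y^3 (functions of y)
coeffs, *_ = np.linalg.lstsq(V, x, rcond=None)
phi = V @ coeffs                             # approximate conditional expectation phi(y)

print("MSE of prior mean:", np.mean((x - x.mean())**2))
print("MSE of phi(y)    :", np.mean((x - phi)**2))   # smaller: projection onto span{1,y,y^2,y^3}
```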

*conditional* probability, e.g. for \(A \subset \varOmega , A \in \mathfrak {B}\) by

*usual* characteristic function, sometimes also termed an indicator function. Thus if we know \({\mathbb {P}}(A | \mathfrak {B})\) for each \(A \in \mathfrak {B}\), we know the conditional probability. Hence having the CE \(\mathbb {E}\left( \cdot |\mathfrak {B}\right) \) allows one to know everything about the conditional probability; the conditional or posterior density is not needed. If the prior probability was the distribution of some RV *r*, we know that it is completely characterised by the *prior* characteristic function—in the sense of probability theory—\(\varphi _r(s) := \mathbb {E}\left( \exp (\mathrm {i}\,r s)\right) \). To get the *conditional* characteristic function \(\varphi _{r|\mathfrak {B}}(s) = \mathbb {E}\left( \exp (\mathrm {i}\,r s)|\mathfrak {B}\right) \), all one has to do is use the CE instead of the unconditional expectation. This then completely characterises the conditional distribution.

*y*, the sub-\(\sigma \)-algebra \(\mathfrak {B}\) will be the one generated by the *observation* \(y=h(x,v)\), i.e. \(\mathfrak {B}=\sigma (y)\); these are those subsets of \(\varOmega \) on which we may obtain *information* from the observation. According to the *Doob-Dynkin* lemma the subspace \(\mathcal {S}_{\sigma (y)}\) is given by

*y* is a RV. Once an observation has been made, i.e. we observe for the RV *y* the fixed value \(\hat{y}\in \mathcal {Y}\), then—for almost all \(\hat{y}\in \mathcal {Y}\)—\(\mathbb {E}\left( r|\hat{y}\right) \in {\mathbb {R}}\) is just a number—the *posterior expectation*, and \({\mathbb {P}}(A|\hat{y})=\mathbb {E}\left( \chi _A|\hat{y}\right) \) is the *posterior probability*. Often these are also termed conditional expectation and conditional probability, which leads to confusion. We therefore prefer the attribute *posterior* when the actual observation \(\hat{y}\) has been observed and inserted in the expressions. Additionally, from Eq. (18) one knows that for some function \(\phi _r\)—for each RV *r* it is a possibly different function—one has that

*y* or rather the posterior expectation, then the conditional and especially the posterior probabilities after the observation \(\hat{y}\) may as well be computed, regardless of whether joint pdfs exist or not. We take this as the starting point to Bayesian estimation.

*total* variance. Here \(\Vert \tilde{y}(\omega ) \Vert _{\mathcal {Y}}^2 = \langle \tilde{y}(\omega ), \tilde{y}(\omega ) \rangle _{\mathcal {Y}}\) is the norm squared on the deterministic component \(\mathcal {Y}\) with inner product \(\langle \cdot , \cdot \rangle _{\mathcal {Y}}\); and the total \(\mathrm {L}_2\)-norm of an elementary tensor \(y\otimes r\in \mathcal {Y}\otimes \mathcal {S}\) with \(y\in \mathcal {Y}\) and \(r\in \mathcal {S}\) can also be written as \(\Vert y\otimes r \Vert ^2 = \Vert y \Vert _{\mathcal {Y}}^2 \, \Vert r \Vert _{\mathcal {S}}^2\), where \(\langle r, r \rangle _{\mathcal {S}} = \Vert r \Vert _{\mathcal {S}}^2 := \mathbb {E}\left( |r|^2\right) \) is the usual inner product of scalar RVs.

## Constructing a posterior random variable

### Updating random variables

*posterior* expectation operator

But to then go on from \(t_{n+1}\) to \(t_{n+2}\) with the Eqs. (20) and (21), one needs a new RV \(x_{n+2}\) which has the posterior distribution described by the mappings \(\phi _\varPsi (\hat{y}_{n+1})\) in Eq. (23). Bayes’s theorem only specifies this probabilistic content. There are many RVs which have this posterior distribution, and we have to pick a particular representative to continue the computation. We will show a method which in the simplest case comes back to MMSE.

*assimilated* RV \(x_a = x_{n+1}\)—it has assimilated the new observation \(\hat{y}=\hat{y}_{n+1}\). Hence what we want is a new RV which is an *update* of the forecast RV \(x_f\)

*B* resp. a change given by the *innovation* map \(\varXi \). Such a transformation is often called a *filter*—the measurement \(\hat{y}\) is filtered to produce the update.

### Correcting the mean

*mean* \(\bar{x}_a = \mathbb {E}\left( x_a|\hat{y}\right) \), i.e. we take \(\varPsi (x)=x\) in Eq. (23). Remember that according to Eq. (15) \(\mathbb {E}\left( x_a|\sigma (y_f)\right) = \phi _{x_f}(y_f) =: \phi _x(y_f)\) is an orthogonal projection \(P_{\sigma (y_f)}(x_f)\) from \(\mathscr {X} = \mathcal {X}\otimes \mathcal {S}\) onto \(\mathscr {X}_\infty := \mathcal {X}\otimes \mathcal {S}_\infty \), where \(\mathcal {S}_\infty := \mathcal {S}_{\sigma (y)}=\mathrm {L}_2(\varOmega ,\sigma (y_f),{\mathbb {P}})\). Hence there is an orthogonal decomposition

*translation* of the RV \(x_f\), i.e. a very simple map *B* in Eq. (24). From Eq. (27) it follows that

*correct* posterior mean.
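A small sketch of this translation-type update, applied to samples of the forecast RV and with the conditional-mean map \(\phi \) approximated by an affine function (in which case it coincides with the linear filter of the GMKF section below), may look as follows; the scalar model and all numbers are illustrative assumptions.

```python
import numpy as np

# Sketch of the mean-correcting update of Eq. (28): x_a = x_f + (phi(y_hat) - phi(y_f)),
# a simple translation of the forecast RV x_f. Here phi approximates E(x_f | y_f) by an
# affine fit on samples; the model and all numbers are illustrative.
rng = np.random.default_rng(2)
N = 20_000
x_f = rng.normal(1.0, 2.0, N)                  # forecast RV (samples)
y_f = x_f + rng.normal(0.0, 0.5, N)            # predicted measurement RV
y_hat = 2.4                                    # actual observation

C = np.cov(x_f, y_f)                           # sample covariance matrix of (x_f, y_f)
k = C[0, 1] / C[1, 1]                          # slope of the affine conditional-mean map
a = x_f.mean() - k * y_f.mean()
phi = lambda y: a + k * y                      # affine approximation of E(x_f | y_f = y)

x_a = x_f + (phi(y_hat) - phi(y_f))            # assimilated RV: translated to the posterior mean
print("forecast mean:", x_f.mean(), " assimilated mean:", x_a.mean())
```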

### Correcting higher moments

*and* the correct posterior total variance

*B* in Eq. (24).

*and* the correct posterior covariance. Observe that this is just an affine transformation of the RV \(x_f\), i.e. still a fairly simple map *B* in Eq. (24).

By combining further transport maps [20] it seems possible to construct a RV \(x_a\) which has the desired posterior distribution to any accuracy. This is beyond the scope of the present paper; how to do it in the simplest way is ongoing work. For the following, we shall be content with the update Eq. (28) in the “Correcting the mean” section.

## The Gauss-Markov-Kalman filter (GMKF)

It has turned out that practical computations in the context of Bayesian estimation can be extremely demanding; see [19] for an account of the history of Bayesian theory, and the break-throughs required in computational procedures to make Bayesian estimation possible at all for practical purposes. This involves both the Monte Carlo (MC) method and the Markov chain Monte Carlo (MCMC) sampling procedure. One may already have gleaned this from the “Constructing a posterior random variable” section.

To arrive at computationally feasible procedures for the demanding models Eqs. (20) and (21), where MCMC methods are not feasible, approximations are necessary. This means in some way not using all the information, but having a simpler computation. Incidentally, this connects with the Gauss-Markov theorem [15] and the Kalman filter (KF) [7, 11]. These were initially stated and developed without any reference to Bayes’s theorem. The Monte Carlo (MC) computational implementation of this is the *ensemble* KF (EnKF) [5]. We will in contrast use a white noise or polynomial chaos approximation [18, 21, 24]. But the initial ideas leading to the abstract Gauss-Markov-Kalman filter (GMKF) are independent of any computational implementation and are presented first. It is in an abstract way just *orthogonal projection*, based on the update Eq. (28) in the “Correcting the mean” section.

### Building the filter

*best unbiased* filter, with \(\phi (\hat{y})\) a MMSE estimate. It is clear that the *stability* of the solution to Eq. (35) will depend on the contraction properties or otherwise of the map \(f - g \circ H \circ f = (I-g \circ H) \circ f\) as applied to \(x_n\), but that is not completely worked out yet and beyond the scope of this paper.

*Y*, resp. *h* or *H*, which evaluates *y*, is not necessarily linear in *x*, hence the optimal map \(\phi _x(y)\) is also not necessarily linear in *y*. In some sense it has to be the opposite of *Y*.

### The linear filter

The minimisation in Eq. (36) over all measurable maps is still a formidable task, and typically only feasible in an approximate way. One problem of course is that the space \(\mathscr {X}_{\infty }\) is in general infinite-dimensional. The standard Galerkin approach is then to approximate it by finite-dimensional subspaces, see [18] for a general description and analysis of the Galerkin convergence.

*affine* maps; they are certainly measurable. Note that \(\mathscr {X}_1\) is also an \(\mathcal {L}\)-invariant subspace of \(\mathscr {X}_{\infty }\subset \mathscr {X}\).

*best linear* filter, with the linear MMSE \(K(\hat{y})\). One may note that the constant term *a* in Eq. (39) drops out in the filter equation.

### The Gauss-Markov theorem and the Kalman filter

The optimisation described in Eq. (39) is a familiar one; it is easily solved, and the solution is given by an extension of the *Gauss-Markov* theorem [15]. The same idea of a linear MMSE is behind the *Kalman* filter [5–7, 11, 22]. In our context it reads

### Theorem 1

*x* and *y*, and \(\text{cov}(y)\) is the auto-covariance of *y*. In case \(\text{cov}(y)\) is *singular* or nearly singular, the *pseudo-inverse* can be taken instead of the inverse.

The operator \(K\in \mathscr {L}(\mathcal {Y},\mathcal {X})\) is also called the *Kalman* gain, and has the familiar form known from least squares projections. It is interesting to note that initially the connection between MMSE and Bayesian estimation was not seen [19], although it is one of the simplest approximations.
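In computations the gain is simply assembled from the two covariances, with the pseudo-inverse as a safeguard against a (nearly) singular \(\text{cov}(y)\); a minimal sketch follows, in which the covariance matrices are made up for illustration and would in practice come from samples or from PCE coefficients.

```python
import numpy as np

# Sketch of the Kalman gain of Theorem 1: K = cov(x, y) cov(y)^{-1}, with the
# pseudo-inverse used when cov(y) is singular or nearly so.
def kalman_gain(C_xy: np.ndarray, C_y: np.ndarray) -> np.ndarray:
    """Return K = C_xy @ pinv(C_y)."""
    return C_xy @ np.linalg.pinv(C_y)

# tiny illustrative example with made-up covariance matrices
C_xy = np.array([[0.8, 0.1],
                 [0.1, 0.5]])
C_y = np.array([[1.0, 0.2],
                [0.2, 0.3]])
print(kalman_gain(C_xy, C_y))
```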

**Gauss-Markov-Kalman** filter (GMKF). The original Kalman filter has Eq. (40) just for the means

### Theorem 2

*stochastic* discretisation is needed to be numerically implementable.

## Nonlinear filters

The derivation of nonlinear but polynomial filters is given in [18]. It has the advantage of showing the connection to the “Bayes linear” approach [6], to the Gauss-Markov theorem [15], and to the *Kalman* filter [11, 22]. Correcting higher moments of the posterior RV has been touched on in the “Correcting higher moments” section, and is not the topic here. Now the focus is on computing better than linear (see “The linear filter” section) approximations to the CE operator, which is the basic tool for the whole updating and identification process.

*x*, where \(\mathcal {R}\) is some Hilbert space, of which we want to compute the conditional expectation \(\mathbb {E}\left( \varPsi (x)|y\right) \). Denote by \(\mathcal {A}_k\) a finite part of \(\mathcal {A}\) of cardinality *k*, such that \(\mathcal {A}_k \subset \mathcal {A}_\ell \) for \(k<\ell \) and \(\bigcup _k \mathcal {A}_k =\mathcal {A}\), and set

*Galerkin*-ansatz, and the Galerkin orthogonality Eq. (37) can be used to determine these coefficients.

### Theorem 3

### Proof

The Galerkin Eq. (47) is a simple consequence of the Galerkin orthogonality Eq. (37). As the Gram matrix \(\varvec{G}_k\) and the identity \(I_{\mathcal {R}}\) on \(\mathcal {R}\) are positive definite, so is the tensor operator \((\varvec{G}_k \otimes I_{\mathcal {R}})\), with inverse \((\varvec{G}_k^{-1} \otimes I_{\mathcal {R}})\). \(\square \)

The block structure of the equations is clearly visible. Hence, to solve Eq. (47), one only has to deal with the ‘small’ matrix \(\varvec{G}_k\).

Observe that this allows one to compute the map in Eq. (19) or rather Eq. (23) to any desired accuracy. Then, using this tool, one may construct a new random variable which has the desired posterior expectations; as was started in the “Correcting the mean” and “Correcting higher moments” sections. This is then a truly nonlinear extension of the linear filters described in “The Gauss-Markov-Kalman filter (GMKF)” section, and one may expect better tracking properties than even for the best linear filters. This could for example allow for less frequent observations of a dynamical system.
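As a sketch of such a nonlinear (here polynomial) approximation of the CE map, the following code estimates the Gram matrix and right-hand side by Monte Carlo and solves the resulting small system, in the spirit of Eq. (47); the observation model, the monomial basis, and all numbers are illustrative assumptions.

```python
import numpy as np

# Sketch of a polynomial approximation of the CE map: phi(y) = sum_a phi_a * q_a(y)
# with monomial basis q_a(y) = y^a, a = 0..p. Galerkin orthogonality gives the
# normal equations G @ phi_coeffs = b, with G[a,b] = E(q_a(y) q_b(y)) the Gram
# matrix and b[a] = E(x q_a(y)), both estimated here by Monte Carlo.
rng = np.random.default_rng(3)
N, p = 100_000, 3
x = rng.normal(0.0, 1.0, N)
y = np.sinh(x) + rng.normal(0.0, 0.2, N)       # nonlinear observation (illustrative)

Q = np.vander(y, p + 1, increasing=True)       # columns q_0(y), ..., q_p(y)
G = (Q.T @ Q) / N                              # Gram matrix of the basis
b = (Q.T @ x) / N                              # right-hand side E(x q_a(y))
phi_coeffs = np.linalg.solve(G, b)
phi = Q @ phi_coeffs                           # cubic approximation of E(x|y)

phi_lin = Q[:, :2] @ np.linalg.solve(G[:2, :2], b[:2])   # affine (linear filter) approximation
print("MSE affine map:", np.mean((x - phi_lin)**2))
print("MSE cubic map :", np.mean((x - phi)**2))          # smaller: better approximation of the CE
```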

## Numerical realisation

This is only going to be a rough overview of possibilities for numerical realisation. Only the simplest case of the linear filter will be considered; all other approximations can be dealt with in an analogous manner. Essentially we will look at two different kinds of approximations, *sampling* and *functional* or *spectral* approximations.

### Sampling

*samples* of the RVs. Thus it is possible to take an *ensemble* of sampling points \(\omega _1,\dots ,\omega _N\) and require

*ensemble* Kalman filter, the EnKF [5]; the points \(\varvec{x}_f(\omega _\ell )\) and \(\varvec{x}_a(\omega _\ell )\) are sometimes also denoted as *particles*, and Eq. (49) is a simple version of a *particle filter*. In Eq. (49), \(\varvec{C}_{x_f y}=\text{cov}(x_f,y)\) and \(\varvec{C}_{y}=\text{cov}(y)\)

Some of the main work for the EnKF consists in obtaining good estimates of \(\varvec{C}_{x_f y}\) and \(\varvec{C}_{y}\), as they have to be computed from the samples. Further approximations are possible, for example such as *assuming* a particular form for \(\varvec{C}_{x_f y}\) and \(\varvec{C}_{y}\). This is the basis for methods like *kriging* and *3DVAR* resp. *4DVAR*, where one works with an approximate Kalman gain \(\varvec{\tilde{K}} \approx \varvec{K}\). For a recent account see [12].
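A minimal sketch of such a sample-based update, with the covariances estimated from the ensemble as in the EnKF, could look as follows; the two-dimensional model and all numbers are illustrative assumptions.

```python
import numpy as np

# Sketch of an ensemble (sample) version of the linear update, Eq. (49):
# x_a(omega_l) = x_f(omega_l) + K (y_hat - y_f(omega_l)), with C_{x_f y} and C_y
# estimated from the ensemble. The model below is illustrative only.
rng = np.random.default_rng(4)
N = 5_000
x_f = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 0.5]], size=N)  # forecast ensemble
H = np.array([[1.0, 0.0]])                         # we observe only the first component
y_f = x_f @ H.T + 0.2 * rng.standard_normal((N, 1))
y_hat = np.array([0.7])                            # the actual measurement

X = x_f - x_f.mean(axis=0)                         # centred anomalies
Y = y_f - y_f.mean(axis=0)
C_xy = X.T @ Y / (N - 1)                           # sample cross-covariance
C_y = Y.T @ Y / (N - 1)                            # sample auto-covariance
K = C_xy @ np.linalg.pinv(C_y)                     # sample Kalman gain

x_a = x_f + (y_hat - y_f) @ K.T                    # assimilated ensemble
print("forecast mean:", x_f.mean(axis=0), " assimilated mean:", x_a.mean(axis=0))
```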

### Functional approximation

Here we want to pursue a different tack, and want to discretise RVs not through their samples, but by *functional* resp. *spectral approximations* [14, 17, 30]. This means that all RVs, say \(\varvec{v}(\omega )\), are described as functions of *known* RVs \(\{\xi _1(\omega ),\dots ,\xi _\ell (\omega ),\dots \}\). Often, when for example stochastic processes or random fields are involved, one has to deal here with *infinitely* many RVs, which for an actual computation have to be truncated to a finite vector \(\varvec{\xi }(\omega )=[\xi _1(\omega ),\dots ,\xi _n(\omega )]\) of significant RVs. We shall assume that these have been chosen such as to be independent. As we want to approximate later \(\varvec{x}=[x_1,\dots ,x_n]\), we do not need more than *n* RVs \(\varvec{\xi }\).

One further chooses a finite set of linearly independent functions \(\{\psi _\alpha \}_{\alpha \in \mathcal {J}_M}\) of the variables \(\varvec{\xi }(\omega )\), where the index \(\alpha \) often is a *multi-index*, and the set \(\mathcal {J}_M\) is a finite set with cardinality (size) *M*. Many different systems of functions can be used, classical choices are [14, 17, 30] multivariate polynomials—leading to the *polynomial chaos expansion* (PCE), as well as trigonometric functions, kernel functions as in kriging, radial basis functions, sigmoidal functions as in artificial neural networks (ANNs), or functions derived from fuzzy sets. The particular choice is immaterial for the further development. But to obtain results which match the above theory as regards \(\mathcal {L}\)-invariant subspaces, we shall assume that the set \(\{\psi _\alpha \}_{\alpha \in \mathcal {J}_M}\) includes all the *linear* functions of \(\varvec{\xi }\). This is easy to achieve with polynomials, and w.r.t. kriging it corresponds to *universal* kriging. All other function systems can also be augmented by a linear trend.

*n* is the dimension of our problem, and if *n* is large, one faces a high-dimensional problem. It is here that low-rank tensor approximations [8] become practically important.

It is not too difficult to see that the linear filter, when applied to the spectral approximation, has exactly the same form as shown in Eq. (42). Hence the basic formula Eq. (42) looks formally the same in both cases: in the one case it is applied to samples or “particles”, in the other case to the functional approximation of RVs, i.e. to the coefficients in Eq. (50).

In both of the cases described here in the “Sampling” and “Functional approximation” sections, the question as how to compute the covariance matrices in Eq. (42) arises. In the EnKF in “Sampling” section they have to be computed from the samples [5], and in the case of functional resp. spectral approximations they can be computed from the coefficients in Eq. (50), see [21, 24].
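A minimal sketch of this coefficient-based update follows; it assumes an orthonormal PCE basis with \(\psi _0 \equiv 1\), so that the covariances are sums over the non-constant coefficients, and the coefficient arrays themselves are made up for illustration.

```python
import numpy as np

# Sketch: with an orthonormal PCE basis {psi_alpha}, psi_0 = 1, the covariances in
# Eq. (42) follow directly from the coefficient arrays, cov(x, y) = sum_{alpha != 0}
# x_alpha y_alpha^T, so the linear update can be applied coefficient-wise.
rng = np.random.default_rng(5)
n_x, n_y, M = 3, 2, 15                      # state dim, observation dim, number of PCE terms
X = rng.standard_normal((n_x, M))           # PCE coefficients x_alpha of the forecast state
Y = rng.standard_normal((n_y, M))           # PCE coefficients y_alpha of the predicted measurement
y_hat = rng.standard_normal(n_y)            # actual observation (deterministic)

C_xy = X[:, 1:] @ Y[:, 1:].T                # covariances from the non-constant coefficients
C_y = Y[:, 1:] @ Y[:, 1:].T
K = C_xy @ np.linalg.pinv(C_y)              # Kalman gain as in Eq. (42)

X_a = X - K @ Y                             # assimilated coefficients: subtract K y_f term-wise ...
X_a[:, 0] += K @ y_hat                      # ... and add K y_hat to the mean (constant) term
print("updated mean (coefficient of psi_0):", X_a[:, 0])
```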

In the sampling context, the samples or particles may be seen as \(\updelta \)-measures, and one generally obtains weak-\(*\) convergence of convex combinations of these \(\updelta \)-measures to the continuous limit as the number of particles increases. In the case of functional resp. spectral approximation one can bring the whole theory of Galerkin approximations to bear on the problem, and one may obtain convergence of the involved RVs in appropriate norms [18]. We leave this topic with this pointer to the literature, as it is too extensive to be discussed further here and hence beyond the scope of the present work.

## Examples

The first example is a dynamic system considered in [21]: the well-known Lorenz-84 chaotic model, a system of three nonlinear ordinary differential equations operating in the chaotic regime. This is an example along the lines of the description of Eqs. (3) and (5) in the “Data model” section. Remember that this was originally a model to describe the evolution of some amplitudes of a spherical harmonic expansion of variables describing world climate. As the original scaling of the variables has been kept, the time axis in Fig. 1 is in *days*. Every 10 days a noisy measurement is performed and the state description is updated. In between, the state description evolves according to the chaotic dynamics of the system. One may observe from Fig. 1 how the uncertainty—the width of the distribution as given by the quantile lines—shrinks every time a measurement is performed, and then increases again due to the chaotic and hence noisy dynamics. Of course, we did not really measure the world climate, but rather simulated the “truth” as well, i.e. a *virtual* experiment, like the others to follow. More details may be found in [21] and the references therein. All computations are performed in a functional approximation with polynomial chaos expansions as alluded to in the “Functional approximation” section, and the filter is linear according to Eq. (42).
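For orientation, a sketch of the Lorenz-84 forward model in its standard form follows; the parameter values are common textbook choices and are only assumed here, the original time scaling to days is not reproduced, and the sequential update itself is omitted.

```python
import numpy as np

# Sketch of the Lorenz-84 chaotic model in its standard form; parameters a=0.25,
# b=4, F=8, G=1 are common choices and only assumed to resemble the cited setup.
# In the sequential identification, an update such as Eq. (42) would be applied to
# the state description after each measurement interval, with the chaotic dynamics
# propagating the RVs in between.
def lorenz84(state, a=0.25, b=4.0, F=8.0, G=1.0):
    x, y, z = state
    return np.array([-y**2 - z**2 - a * x + a * F,
                     x * y - b * x * z - y + G,
                     b * x * y + x * z - z])

def rk4_step(f, state, dt):
    # one classical Runge-Kutta step for the forward (state) problem
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

state = np.array([1.0, 0.0, 0.0])
dt = 0.01
for _ in range(500):          # propagate the state between two measurements
    state = rk4_step(lorenz84, state, dt)
print("propagated state:", state)
```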

Shown are the pdfs produced by the linear filter according to Eq. (42), the linear polynomial chaos Bayesian update (Linear PCBU), a special form of Eq. (28); by an iterated linear filter using Newton iterations, the iterative LPCBU, i.e. an iterated version of Eq. (42); and by a filter using polynomials up to order two, the quadratic polynomial chaos Bayesian update (QPCBU). One may observe that due to the nonlinear observation, the differences between the linear filters and the quadratic one are already significant, the QPCBU yielding a better update.

As a first set of experiments we take the measurement operator to be linear in the state variable to be identified, i.e. we can observe the *whole* state directly. For the moment we consider updates after each day, whereas in Fig. 1 the updates were performed every 10 days. The update is done once with the linear Bayesian update (LBU), and again with a *quadratic* nonlinear BU (QBU). The results for the posterior pdfs are given in Fig. 3, where the linear update is dotted in blue and labelled *z*1, and the full red line is the quadratic QBU labelled *z*2; there is hardly any difference between the two except for the *z*-component of the state, most probably indicating that the LBU is already very accurate.

As a last example we follow [18] and take a strongly nonlinear and also non-smooth situation, namely elasto-plasticity with linear hardening and large deformations and a *Kirchhoff-St. Venant* elastic material law [24, 25]. This example is known as *Cook’s membrane*, and is shown in Fig. 5 with the undeformed mesh (initial), the deformed one obtained by computing with average values of the elasticity and plasticity material constants (deterministic), and finally the average result from a stochastic forward calculation of the probabilistic model (stochastic), which is described by a variational inequality [25].

The shear modulus *G*, a random field and not a deterministic value in this case, has to be identified, which is made more difficult by the non-smooth non-linearity. In Fig. 6 one may see the ‘true’ distribution at one point in the domain in an unbroken black line, with the mode—the maximum of the pdf—marked by a black cross on the abscissa, whereas the prior is shown in a dotted blue line. The pdf of the LBU is shown in an unbroken red line, with its mode marked by a red cross, and the pdf of the QBU is shown in a broken purple line with its mode marked by an asterisk. Again we see a difference between the LBU and the QBU. But here a curious thing happens: the mode of the LBU-posterior is actually closer to the mode of the ‘truth’ than the mode of the QBU-posterior. This means that somehow the QBU takes the prior more into account than the LBU, which is a kind of overshooting that has been observed on other occasions. On the other hand, the pdf of the QBU is narrower—has less uncertainty—than the pdf of the LBU.

## Conclusion

A general approach for state and parameter estimation has been presented in a Bayesian framework. The Bayesian approach is based here on the conditional expectation (CE) operator, and different approximations were discussed, where the linear approximation leads to a generalisation of the well-known Kalman filter (KF), and is here termed the Gauss-Markov-Kalman filter (GMKF), as it is based on the classical Gauss-Markov theorem. Based on the CE operator, various approximations to construct a RV with the proper posterior distribution were shown, where just correcting for the mean is certainly the simplest type of filter, and also the basis of the GMKF.

Actual numerical computations typically require a discretisation of both the spatial variables—something which is practically independent of the considerations here—and the stochastic variables. Sampling methods are the classical choice, but here the use of spectral resp. functional approximations is alluded to, and all computations in the examples shown are carried out with functional approximations.

## Declarations

### Authors’ contributions

HGM provided the ideas and wrote the draft. EZ and BVR helped improve the research idea; BVR and AL carried out the numerical implementation and computations and prepared the results. All authors read and approved the final manuscript.

### Acknowledgements

Partly supported by the Deutsche Forschungsgemeinschaft (DFG) through SFB 880.

Dedicated to Pierre Ladevèze on the occasion of his 70th birthday.

### Competing interests

The authors declare that they have no competing interests.

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

## References

- Bobrowski A. Functional analysis for probability and stochastic processes. Cambridge: Cambridge University Press; 2005.
- Bosq D. Linear processes in function spaces. Theory and applications. In: Lecture notes in statistics, vol. 149. Contains definition of strong or \(L\)-orthogonality for vector-valued random variables. Berlin: Springer; 2000.
- Engl HW, Groetsch CW. Inverse and ill-posed problems. New York: Academic Press; 1987.
- Engl HW, Hanke M, Neubauer A. Regularization of inverse problems. Dordrecht: Kluwer; 2000.
- Evensen G. Data assimilation—the ensemble Kalman filter. Berlin: Springer; 2009.
- Goldstein M, Wooff D. Bayes linear statistics—theory and methods. Wiley series in probability and statistics. Chichester: Wiley; 2007.
- Grewal MS, Andrews AP. Kalman filtering: theory and practice using MATLAB. Chichester: Wiley; 2008.
- Hackbusch W. Tensor spaces and numerical tensor calculus. Berlin: Springer; 2012.
- Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109. doi:10.1093/biomet/57.1.97.
- Jaynes ET. Probability theory, the logic of science. Cambridge: Cambridge University Press; 2003.
- Kálmán RE. A new approach to linear filtering and prediction problems. J Basic Eng. 1960;82:35–45.
- Kelly DTB, Law KJH, Stuart AM. Well-posedness and accuracy of the ensemble Kalman filter in discrete and continuous time. Nonlinearity. 2014;27:2579–603. doi:10.1088/0951-7715/27/10/2579.
- Kennedy MC, O’Hagan A. Bayesian calibration of computer models. J Royal Stat Soc Series B. 2001;63(3):425–64.
- Le Maître OP, Knio OM. Spectral methods for uncertainty quantification. Scientific computation. Berlin: Springer; 2010. doi:10.1007/978-90-481-3520-2.
- Luenberger DG. Optimization by vector space methods. Chichester: Wiley; 1969.
- Marzouk YM, Najm HN, Rahn LA. Stochastic spectral methods for efficient Bayesian solution of inverse problems. J Comput Phys. 2007;224(2):560–86. doi:10.1016/j.jcp.2006.10.010.
- Matthies HG. Uncertainty quantification with stochastic finite elements. In: Stein E, de Borst R, Hughes TJR, editors. Encyclopaedia of computational mechanics. Chichester: Wiley; 2007. doi:10.1002/0470091355.ecm071.
- Matthies HG, Zander E, Rosić BV, Litvinenko A, Pajonk O. Inverse problems in a Bayesian setting. arXiv:1511.00524 [math.PR]. 2015.
- McGrayne SB. The theory that would not die. New Haven: Yale University Press; 2011.
- Moselhy TA, Marzouk YM. Bayesian inference with optimal maps. J Comput Phys. 2012;231:7815–50. doi:10.1016/j.jcp.2012.07.022.
- Pajonk O, Rosić BV, Litvinenko A, Matthies HG. A deterministic filter for non-Gaussian Bayesian estimation—applications to dynamical system estimation with noisy measurements. Physica D Nonlinear Phenom. 2012;241:775–88. doi:10.1016/j.physd.2012.01.001.
- Papoulis A. Probability, random variables, and stochastic processes. 3rd ed. New York: McGraw-Hill; 1991.
- Rao MM. Conditional measures and applications. Boca Raton: CRC Press; 2005.
- Rosić BV, Kučerová A, Sýkora J, Pajonk O, Litvinenko A, Matthies HG. Parameter identification in a probabilistic setting. Eng Struct. 2013;50:179–96. doi:10.1016/j.engstruct.2012.12.029.
- Rosić BV, Matthies HG. Identification of properties of stochastic elastoplastic systems. In: Papadrakakis M, Stefanou G, Papadopoulos V, editors. Computational methods in stochastic dynamics. Berlin: Springer; 2013. p. 237–53. doi:10.1007/978-94-007-5134-7_14.
- Stuart AM. Inverse problems: a Bayesian perspective. Acta Numerica. 2010;19:451–559. doi:10.1017/S0962492910000061.
- Tarantola A. Inverse problem theory and methods for model parameter estimation. Philadelphia: SIAM; 2004.
- Tikhonov AN, Goncharsky AV, Stepanov VV, Yagola AG. Numerical methods for the solution of ill-posed problems. Berlin: Springer; 1995.
- Tikhonov AN, Arsenin VY. Solutions of ill-posed problems. Chichester: Wiley; 1977.
- Xiu D. Numerical methods for stochastic computations: a spectral method approach. Princeton: Princeton University Press; 2010.