Parameter Estimation via Conditional Expectation --- A Bayesian Inversion

When a mathematical or computational model is used to analyse some system, it is usual that some parameters resp.\ functions or fields in the model are not known, and hence uncertain. These parametric quantities are then identified by actual observations of the response of the real system. In a probabilistic setting, Bayes's theory is the proper mathematical background for this identification process. The possibility of being able to compute a conditional expectation turns out to be crucial for this purpose. We show how this theoretical background can be used in an actual numerical procedure, and shortly discuss various numerical approximations.


Introduction
The fitting of parameters resp. functions or fields -these will all be for the sake of brevity be referred to as parameters --in a mathematical computational model is usually denoted as an inverse problem, in contrast to predicting the output or state resp. response of the system given certain inputs, which is called the forward problem. In the inverse problem, the response of the model is compared to the response of the system. The system may be a real world system, or just another computational model -usually a more complex one. One then tries in various ways to match the model response with the system response. Typical deterministic procedures include such methods as minimising the mean square error (MMSE), leading to optimisation problems in the search of optimal parameters. As the inverse problem is typically ill-posed -the observations do not do contain enough information to uniquely determine the parameters -some additional oinformation has to be added to select a unique solution. In the deterministic setting on then typically invokes additional ad-hoc procedures like Tikhonov-regularisation [29,28,3,4].
In a probabilistic setting (e.g. [10,27] and references therein) the ill-posed problem becomes well-posed (e.g. [26]). This is achieved at a cost though. The unknown parameters are considered as uncertain, and modelled as random variables (RVs). The information added is hence the prior probability distribution. This means on one hand that the result of the identification is a probability distribution, and not a single value, and on the other hand the computational work may be increased substantially, as one has to deal with RVs. That the result is a probability distribution may be seen as additional information though, as it offers an assessment of the residual uncertainty after the identification procedure, something which is not readily available in the deterministic setting. The probabilistic setting thus can be seen as modelling our knowledge about a certain situation -the value of the parameters -in the language of probability theory, and using the observation to update our knowledge, (i.e. the probabilistic description) by conditioning on the observation.
The key probabilistic background for this is Bayes's theorem in the formulation of Laplace [10,27]. It is well known that the Bayesian update is theoretically based on the notion of conditional expectation (CE) [1]. Here we take an approach which takes CE not only as a theoretical basis, but also as a basic computational tool. This may be seen as somewhat related to the "Bayes linear" approach [6,13], which has a linear approximation of CE as its basis, as will be explained later.
In many cases, for example when tracking a dynamical system, the updates are performed sequentially step-by-step, and for the next step one needs not only a probability distribution in order to perform the next step, but a random variable which may be evolved through the state equation. Methods on how to transform the prior RV into the one which is conditioned on the observation will be discussed as well [18]. This approach is very different to the very frequently used one which refers to Bayes's theorem in terms of densities and likelihood functions, and typically employs Markov-chain Monte Carlo (MCMC) methods to sample from the posterior (see e.g. [9,16,24]).

Mathematical set-up
Let us start with an example to have a concrete idea of what the whole procedure is about. Imagine a system described by a diffusion equation, e.g. the diffusion of heat through a solid medium, or even the seepage of groundwater through porous rocks and soil: ∂υ ∂t (x, t) =υ(x, t) = ∇ · (κ(x,υ)∇υ(x, t)) + η(x, t), (1) υ(x, 0) =υ 0 (x) plus b.c.
Here x ∈ G is a spatial coordinate in the domain G ⊂ R n , t ∈ [0, T ] is the time,υ a scalar function describing the diffusing quantity, κ the (possibly non-linear) diffusion tensor, η external sources or sinks, and ∇ the Nabla operator. Additionally assume appropriate boundary conditions so that Eq. (1) is well-posed. Now, as often in such situations, imagine that we do not know the initial conditionsυ 0 in Eq. (2) precisely, nor the diffusion tensor κ, and maybe not even the driving source η, i.e. there is some uncertainty attached as to what their precise values are.

Data model
A more abstract setting which subsumes Eq. (1) is to viewυ(t) :=υ(·, t) as an element of a Hilbert-space (for the sake of simplicity) V. In the particular case of Eq. (1) one could take V = H 1 E (G), a closed subspace of the Sobolev space H 1 (G) incorporating the essential boundary conditions. Hence we may view Eq. (1) and Eq. (2) as an example of Here A V : Q × V → V is a possibly non-linear operator inυ ∈ V, and q ∈ Q are the parameters (like κ,υ 0 , or η, which more accurately would be described as functions of q), where we assume for simplicity again that Q is some Hilbert space. Both A V ,υ 0 , and η could involve some noise, so that one may view Eq. (3) as an instance of a stochastic evolution equation. This is our model of the system generating the observed data, which we assume to be well-posed.
Hence assume further that we may observe a functionŶ (q;υ(t)) of the stateυ(t) and the parameters q, i.e.Ŷ : Q × V → Y, where we assume that Y is a Hilbert space. To make things simple, assume additionally that we observeŶ (q;υ(t)) at regular time intervals t n = n · ∆t, i.e. we observe y n =Ŷ (q;υ n ), whereυ n :=υ(t n ). Denote the solution operator Υ of Eq. (3) asυ advancing the solution from t n to t n+1 . Hence we are observinĝ where some noise v n -inaccuracy of the observation -has been included, andĥ is an appropriate observation operator. A simple example is the often assumed additive noisê where v is a random vector, and for eachυ, S V (υ) is a bounded linear map to Y.

Identification model
Now that the model generating the data has been described, it is the appropriate point to introduce the identification model. Similarly as before in Eq. (3), we have a model which depends on the same parameters q as in Eq. (3), to be used for the identification, which we shall only write in its abstract from. Hence we assume that we can actually integrate Eq. (6) from t n to t n+1 with its solution operator U u n+1 = U (t n+1 , q, u n , t n , η).
Observe that the two spaces V in Eq. (3) and U in Eq. (6) are not the same, as in general we do not knowυ ∈ V, we only have observations y ∈ Y.
As later not only the state u ∈ U in Eq. (6) has to be identified, but also the parameters q, and the identification may happen sequentially, i.e. our estimate of q will change from step n to step n+1, we shall introduce an "extended" state vector x = (u, q) ∈ X := Q×U and describe the change from n to n + 1 by Let us explicitly introduce a noise w ∈ W to cover the stochastic contribution or modelling errors between Eq. (6) and Eq. (3), so that we set where w ∈ W is the random vector, and S W (x) ∈ L (W, X ) is for each x ∈ X a bounded linear map from W to X . To deal with the extended state, we shall define the identification or predicted observation operator as where the noise v n -the same as in Eq. (5), our model of the inaccuracy of the observation -has been included. A simple example with additive noise is where v ∈ V is the random vector, and S V (x) ∈ L (V, X ) is for each x ∈ X a bounded linear map from V to X . The mapping Y : Q×U → Y is similar to the mapŶ : Q×V → Y in the previous Subsection 2.1, it predicts the "true" observation without noise v n . Eq. (9) for the time evolution of the extended state and Eq. (10) for the observation are the basic building blocks for the identification.

Synopsis of Bayesian estimation
There are many accounts of this, and this synopsis is just for the convenience of the reader and to introduce notation. Otherwise we refer to e.g. [10,27,6,13], and in particular for the rôle of conditional expectation (CE) to our work [24,18]. The idea is that the observationŷ from Eq. (5) depends on the unknown parameters q, which ideally should equal y n from Eq. (10), which in turn both directly and through the state x = (u(q), q) in Eq. (9) depends also on the parameters q, should be equal, and any difference should give an indication on what the "true" value of q should be. The problem in general is -apart from the distracting errors w and v -that the mapping q → y = Y (q; u(q)) is in general not invertible, i.e. y does not contain information to uniquely determine q, or there are many q which give a good fit forŷ. Therefore the inverse problem of determining q from observingŷ is termed an ill-posed problem.
The situation is a bit comparable to Plato's allegory of the cave, where Socrates compares the process of gaining knowledge with looking at the shadows of the real things. The observationsŷ are the "shadows" of the "real" things q andυ, and from observing the "shadows"ŷ we want to infer what "reality" is, in a way turning our heads towards it. We hence want to "free" ourselves from just observing the "shadows" and gain some understanding of "reality".
One way to deal with this difficulty is to measure the difference between observedŷ n and predicted system output y n and try to find parameters q n such that this difference is minimised. Frequently it may happen that the parameters which realise the minimum are not unique. In case one wants a unique parameter, a choice has to be made, usually by demanding additionally that some norm or similar functional of the parameters is small as well, i.e. some regularity is enforced. This optimisation approach hence leads to regularisation procedures [29,28,3,4].
Here we take the view that our lack of knowledge or uncertainty of the actual value of the parameters can be described in a Bayesian way through a probabilistic model [10,27]. The unknown parameter q is then modelled as a random variable (RV)-also called the prior model-and additional information on the system through measurement or observation changes the probabilistic description to the so-called posterior model. The second approach is thus a method to update the probabilistic description in such a way as to take account of the additional information, and the updated probabilistic description is the parameter estimate, including a probabilistic description of the remaining uncertainty.
It is well-known that such a Bayesian update is in fact closely related to conditional expectation [10,1,6,24,18], and this will be the basis of the method presented. For these and other probabilistic notions see for example [22] and the references therein. As the Bayesian update may be numerically very demanding, we show computational procedures to accelerate this update through methods based on functional approximation or spectral representation of stochastic problems [17,18]. These approximations are in the simplest case known as Wiener's so-called homogeneous or polynomial chaos expansion, which are polynomials in independent Gaussian RVs-the "chaos"-and which can also be used numerically in a Galerkin procedure [17,18].
Although the Gauss-Markov theorem and its extensions [15] are well-known, as well as its connections to the Kalman filter [11,7]-see also the recent Monte Carlo or ensemble version [5]-the connection to Bayes's theorem is not often appreciated, and is sketched here. This turns out to be a linearised version of conditional expectation.
Since the parameters of the model to be estimated are uncertain, all relevant information may be obtained via their stochastic description. In order to extract information from the posterior, most estimates take the form of expectations w.r.t. the posterior, i.e. a conditional expectation (CE). These expectations-mathematically integrals, numerically to be evaluated by some quadrature rule-may be computed via asymptotic, deterministic, or sampling methods by typically computing first the posterior density. As we will see, the posterior density does not always exist [23]. Here we follow our recent publications [21,24,18] and introduce a novel approach, namely computing the CE directly and not via the posterior density [18]. This way all relevant information from the conditioning may be computed directly. In addition, we want to change the prior, represented by a random variable (RV), into a new random variable which has the correct posterior distribution. We will discuss how this may be achieved, and what approximations one may employ in the computation.
To be a bit more formal, assume that the uncertain parameters are given by x : Ω → X as a RV on a probability space (Ω, A, P), where the set of elementary events is Ω, A a σ-algebra of measurable events, and P a probability measure. The expectation corresponding to P will be denoted by E (), e.g.
for any measurable function Ψ of x.
Modelling our lack-of-knowledge about q in a Bayesian way [10,27,6] by replacing them with random variables (RVs), the problem becomes well-posed [26]. But of course one is looking now at the problem of finding a probability distribution that best fits the data; and one also obtains a probability distribution, not just one value q. Here we focus on the use of procedures similar to a linear Bayesian approach [6] in the framework of "white noise" analysis.
As formally q is a RV, so is the state x n of Eq. (9), reflecting the uncertainty about the parameters and state of Eq. (3). From this follows that also the prediction of the measurement y n Eq. (10) is a RV; i.e. we have a probabilistic description of the measurement.

The theorem of Bayes and Laplace
Bayes original statement of the theorem which today bears his name was only for a very special case. The form which we know today is due to Laplace, and it is a statement about conditional probabilities. A good account of the history may be found in [19].
Bayes's theorem is commonly accepted as a consistent way to incorporate new knowledge into a probabilistic description [10,27]. The elementary textbook statement of the theorem is about conditional probabilities where I x ⊂ X is some subset of possible x's on which we would like to gain some information, and M y ⊂ Y is the information provided by the measurement. The term P(I x ) is the so-called prior, it is what we know before the observation M y . The quantity P(M y |I x ) is the so-called likelihood, the conditional probability of M y assuming that I x is given. The term P(M y ) is the so called evidence, the probability of observing M y in the first place, which sometimes can be expanded with the law of total probability, allowing to choose between different models of explanation. It is necessary to make the right hand side of Eq. (12) into a real probability-summing to unity-and hence the term P(I x |M y ), the posterior reflects our knowledge on I x after observing M y . The quotient P(M y |I x )/P(M y ) is sometimes termed the Bayes factor, as it reflects the relative change in probability after observing M y . This statement Eq. (12) runs into problems if the set observations M y has vanishing measure, P(M y ) = 0, as is the case when we observe continuous random variables, and the theorem would have to be formulated in densities, or more precisely in probability density functions (pdfs). But the Bayes factor then has the indeterminate form 0/0, and some form of limiting procedure is needed. As a sign that this is not so simplethere are different and inequivalent forms of doing it-one may just point to the so-called Borel-Kolmogorov paradox. See [23] for a thorough discussion.
There is one special case where something resembling Eq. (12) may be achieved with pdfs, namely if y and x have a joint pdf π y,x (y, x). As y is essentially a function of x, this is a special case depending on conditions on the error term v. In this case Eq. (12) may be formulated as where π x|y (x|y) is the conditional pdf, and the "evidence" Z s (from German Zustandssumme (sum of states), a term used in physics) is a normalising factor such that the conditional pdf π x|y (·|y) integrates to unity The joint pdf may be split into the likelihood density π y|x (y|x) and the prior pdf π x (x) so that Eq. (13) has its familiar form ( [27] Ch. 1.5) These terms are in direct correspondence with those in Eq. (12) and carry the same names. Once one has the conditional measure P(·|M y ) or even a conditional pdf π x|y (·|y), the conditional expectation (CE) E (·|M y ) may be defined as an integral over that conditional measure resp. the conditional pdf. Thus classically, the conditional measure or pdf implies the conditional expectation: for any measurable function Ψ of x.
Please observe that the model for the RV representing the error v(ω) determines the likelihood functions P(M y |I x ) resp. the existence and form of the likelihood density π y|x (·|x). In computations, it is here that the computational model Eq. (6) and Eq. (10) is needed, to predict the measurement RV y given the state and parameters x as a RV.
Most computational approaches determine the pdfs, but we will later argue that it may be advantageous to work directly with RVs, and not with conditional probabilities or pdfs. To this end, the concept of conditional expectation (CE) and its relation to Bayes's theorem is needed [1].

Conditional expectation
To avoid the difficulties with conditional probabilities like in the Borel-Kolmogorov paradox alluded to in the previous Subsection 3.1, Kolmogorov himself-when he was setting up the axioms of the mathematical theory probability-turned the relation between conditional probability or pdf and conditional expectation around, and defined as a first and fundamental notion conditional expectation [1,23].
It has to be defined not with respect to measure-zero observations of a RV y, but w.r.t sub-σ-algebras B ⊂ A of the underlying σ-algebra A. The σ-algebra may be loosely seen as the collection of subsets of Ω on which we can make statements about their probability, and for fundamental mathematical reasons in many cases this is not the set of all subsets of Ω. The sub-σ-algebra B may be seen as the sets on which we learn something through the observation.
The simplest-although slightly restricted-way to define the conditional expectation [1] is to just consider RVs with finite variance, i.e. the Hilbert-space If B ⊂ A is a sub-σ-algebra, the space is a closed subspace, and hence has a well-defined continuous orthogonal projection P B : S → S B . The conditional expectation (CE) of a RV r ∈ S w.r.t. a sub-σ-algebra B is then defined as that orthogonal projection It can be shown [1] to coincide with the classical notion when that one is defined, and the unconditional expectation E () is in this view just the CE w.r.t. the minimal σ-algebra B = {∅, Ω}. As the CE is an orthogonal projection, it minimises the squared error from which one obtains the variational equation or orthogonality relation and one has a form of Pythagoras's theorem The CE is therefore a form of a minimum mean square error (MMSE) estimator.
Given the CE, one may completely characterise the conditional probability, e.g. for where χ A is the RV which is unity iff ω ∈ A and vanishes otherwise -the usual characteristic function, sometimes also termed an indicator function. Thus if we know P(A|B) for each A ∈ B, we know the conditional probability. Hence having the CE E (·|B) allows one to know everything about the conditional probability; the conditional or posterior density is not needed. If the prior probability was the distribution of some RV r, we know that it is completely characterised by the prior characteristic function -in the sense of probability theoryϕ r (s) := E (exp(irs)). To get the conditional characteristic function ϕ r|B (s) = E (exp(irs)|B), all one has to do is use the CE instead of the unconditional expectation. This then completely characterises the conditional distribution. In our case of an observation of a RV y, the sub-σ-algebra B will be the one generated by the observation y = h(x, v), i.e. B = σ(y), these are those subsets of Ω on which we may obtain information from the observation. According to the Doob-Dynkin lemma the subspace S σ(y) is given by i.e. functions of the observation. This means intuitively that anything we learn from an observation is a function of the observation, and the subspace S σ(y) ⊂ S is where the information from the measurement is lying.
Observe that the CE E (r|σ(y)) and conditional probability P(A|σ(y))-which we will abbreviate to E (r|y), and similarly P(A|σ(y)) = P(A|y)-is a RV, as y is a RV. Once an observation has been made, i.e. we observe for the RV y the fixed valueŷ ∈ Y, thenfor almost allŷ ∈ Y-E (r|ŷ) ∈ R is just a number-the posterior expectation, and P(A|ŷ) = E (χ A |ŷ) is the posterior probability. Often these are also termed conditional expectation and conditional probability, which leads to confusion. We therefore prefer the attribute posterior when the actual observationŷ has been observed and inserted in the expressions. Additionally, from Eq. (18) one knows that for some function φ r -for each RV r it is a possibly different function -one has that E (r|y) = φ r (y) and E (r|ŷ) = φ r (ŷ) (19) In relation to Bayes's theorem, one may conclude that if it is possible to compute the CE w.r.t. an observation y or rather the posterior expectation, then the conditional and especially the posterior probabilities after the observationŷ may as well be computed, regardless whether joint pdfs exist or not. We take this as the starting point to Bayesian estimation.
The conditional expectation has been formulated for scalar RVs, but it is clear that the notion carries through to vector-valued RVs in a straightforward manner, formally by seeing a-let us say-Y-valued RV as an element of the tensor Hilbert space the RVs in Y with finite total variance Here ỹ(ω) 2 Y = ỹ(ω),ỹ(ω) Y is the norm squared on the deterministic component Y with inner product ·, · Y ; and the total L 2 -norm of an elementary tensor y ⊗ r ∈ Y ⊗ S with y ∈ Y and r ∈ S can also be written as where r, r S = r 2 S := E (|r| 2 ) is the usual inner product of scalar RVs. The CE on Y is then formally given by E Y (·|B) := I Y ⊗ E (·|B), where I Y is the identity operator on Y. This means that for an elementary tensor y ⊗ r ∈ Y ⊗ S one has

The vector valued conditional expectation
is also an orthogonal projection, but in Y , for simplicity also denoted by E (·|B) = P B when there is no possibility of confusion.

Constructing a posterior random variable
We recall the equations governing our model Eq. (9) and Eq. (10), and interpret them now as equations acting on RVs, i.e. for ω ∈ Ω: where one may now see the mappings f : X × W → X and h : X × V → Y acting on the tensor Hilbert spaces of RVs with finite variance, e.g. Y := Y ⊗ S with the inner product as explained in Subsection 3.2; and similarly for X := X ⊗ S resp. W and V .

Updating random variables
We now focus on the step from time t n to t n+1 . Knowing the RV x n ∈ X , one predicts the new statex n+1 ∈ X and the measurement y n+1 ∈ Y . With the CE operator from the measurement prediction y n+1 in Eq. (21) and the actual observationŷ n+1 one may then compute the posterior expectation operator This has all the information about the posterior probability. But to then go on from t n+1 to t n+2 with the Eq. (20) and Eq. (21), one needs a new RV x n+2 which has the posterior distribution described by the mappings φ Ψ (ŷ n+1 ) in Eq. (23). Bayes's theorem only specifies this probabilistic content. There are many RVs which have this posterior distribution, and we have to pick a particular representative to continue the computation. We will show a method which in the simplest case comes back to MMSE.
Here it is proposed to construct this new RV x n+1 from the predictedx n+1 in Eq. (20) with a mapping, starting from very simple ones and getting ever more complex. For the sake of brevity of notation, the forecast RV will be called x f =x n+1 , and the forecast measurement y f = y n+1 , and we will denote the measurement just byŷ =ŷ n+1 . The RV x n+1 we want to construct will be called the assimilated RV x a = x n+1 -it has assimilated the new observationŷ =ŷ n+1 . Hence what we want is a new RV which is an update of the forecast RV x f with a Bayesian update map B resp. a change given by the innovation map Ξ. Such a transformation is often called a filter -the measurementŷ is filtered to produce the update.

Correcting the mean
We take first the task to give the new RV the correct posterior meanx a = E (x a |ŷ), i.e. we take Ψ (x) = x in Eq. (23). Remember that according to Eq.
Hence there is an orthogonal decomposition As P σ(y f ) = E (·|σ(y f )) is a projection, one sees from Eq. (26) that the second term has vanishing CE for any measurementŷ: One may view this also in the following way: From the measurement y a resp.ŷ we only learn something about the subspace X ∞ . Hence when the measurement comes, we change the decomposition Eq. (26) by only fixing the component φ x (y f ) ∈ X ∞ , and leaving the orthogonal rest unchanged: . (28) Observe that this is just a linear translation of the RV x f , i.e. a very simple map B in Eq. (24). From Eq. (27) follows that hence the RV x a,1 from Eq. (28) has the correct posterior mean.
Observe that according to Eq. (27) the term x ⊥ := (x f − φ x (y f )) in Eq. (28) is a zero mean RV, hence the covariance and total variance of x a,1 is given by var(x a,1 ) = E x ⊥ (ω) 2 X = tr(cov(x a,1 )).

Correcting higher moments
Here let us just describe two small additional steps: we take Ψ (x) = x − φ x (ŷ) 2 X in Eq. (23), and hence obtain the total posterior variance as Now it is easy to correct Eq. (28) to obtain a RV which has the correct posterior mean and the correct posterior total variance var(x a,t ) = var(x a ).
Observe that this is just a linear translation and partial scaling of the RV x f , i.e. still a very simple map B in Eq. (24). With more computational effort, one may choose Ψ (x) = (x − φ x (ŷ)) ⊗2 in Eq. (23), to obtain the covariance of x a : Instead of just scaling the RV as in Eq. (32), one may now take where B 1 is any operator "square root" that satisfies B 1 B 1 * = C 1 in Eq. (29), and similarly B a B a * = C a in Eq. (33). One possibility is the real square root -as C 1 and C a are positive definite - 1 , but computationally a Cholesky factor is usually cheaper. In any case, no matter which "square root" is chosen, the RV x a,2 in Eq. (34) has the correct posterior mean and the correct posterior covariance. Observe that this is just an affine transformation of the RV x f , i.e. still a fairly simple map B in Eq. (24).
By combining further transport maps [20] it seems possible to construct a RV x a which has the desired posterior distribution to any accuracy. This is beyond the scope of the present paper, and is ongoing work on how to do it in the simplest way. For the following, we shall be content with the update Eq. (28) in Subsection 4.2.

The Gauss-Markov-Kalman filter (GMKF)
It turned out that practical computations in the context of Bayesian estimation can be extremely demanding, see [19] for an account of the history of Bayesian theory, and the break-throughs required in computational procedures to make Bayesian estimation possible at all for practical purposes. This involves both the Monte Carlo (MC) method and the Markov chain Monte Carlo (MCMC) sampling procedure. One may have gleaned this also already from Section 4. To arrive at computationally feasible procedures for computationally demanding models Eq. (20) and Eq. (21), where MCMC methods are not feasible, approximations are necessary. This means in some way not using all information but having a simpler computation. Incidentally, this connects with the Gauss-Markov theorem [15] and the Kalman filter (KF) [11,7]. These were initially stated and developed without any reference to Bayes's theorem. The Monte Carlo (MC) computational implementation of this is the ensemble KF (EnKF) [5]. We will in contrast use a white noise or polynomial chaos approximation [21,24,18]. But the initial ideas leading to the abstract Gauss-Markov-Kalman filter (GMKF) are independent of any computational implementation and are presented first. It is in an abstract way just orthogonal projection, based on the update Eq. (28) in Subsection 4.2.

Building the filter
Recalling Eq. (20) and Eq. (21) together with Eq. (28), the algorithm for forecasting and assimilating with just the posterior mean looks likê ).
For simplicity of notation the argument ω will be suppressed. Also it will turn out that the mapping φ x representing the CE can in most cases only be computed approximately, so we want to look at update algorithms with a general map g : Y → X to possibly approximate φ x : where the first two equations have been inserted into the last. This is the filter equation for tracking and identifying the extended state of Eq. (20). One may observe that the normal evolution model Eq. (20) is corrected by the innovation term. This is the best unbiased filter, with φ(ŷ) a MMSE estimate. It is clear that the stability of the solution to Eq. (35) will depend on the contraction properties or otherwise of the map f −g•H •f = (I − g • H) • f as applied to x n , but that is not completely worked out yet and beyond the scope of this paper.
By combining the minimisation property Eq. (16) and the Doob-Dynkin lemma Eq. (18), we see that the map φ Ψ is defined by where ranges over all measurable maps : Y → X . As X σ(y) = X ∞ is L-closed [2,18], it is characterised similarly to Eq. (17), but by orthogonality in the L-invariant sense i.e. the RV (Ψ (x) − (y)) is orthogonal in the L-invariant sense to all RVs z ∈ X ∞ , which means its correlation operator vanishes. Although the CE E (x|y) = P σ(y) (x) is an orthogonal projection, as the measurement operator Y , resp. h or H, which evaluates y, is not necessarily linear in x, hence the optimal map φ x (y) is also not necessarily linear in y. In some sense it has to be the opposite of Y .

The linear filter
The minimisation in Eq. (36) over all measurable maps is still a formidable task, and typically only feasible in an approximate way. One problem of course is, that the space X ∞ is in general infinite-dimensional. The standard Galerkin approach is then to approximate it by finite-dimensional subspaces, see [18] for a general description and analysis of the Galerkin convergence.
Here we replace X ∞ by much smaller subspace; and we choose in some way the simplest possible one where the Φ are just affine maps; they are certainly measurable. Note that X 1 is also an L-invariant subspace of X ∞ ⊂ X . Note that also other, possibly larger, L-invariant subspaces of X ∞ can be used, but this seems to be smallest useful one. Now the minimisation Eq. (36) may be replaced by and the optimal affine map is defined by K ∈ L (Y, X ) and a ∈ X . Using this g(y) = K(y) + a, one disregards some information as X 1 ⊂ X ∞ is usually a true subspace -observe that the subspace represents the information we may learn from the measurement -but the computation is easier, and one arrives in lieu of Eq. (28) at This is the best linear filter, with the linear MMSE K(ŷ). One may note that the constant term a in Eq. (39) drops out in the filter equation.

The Gauss-Markov theorem and the Kalman filter
The optimisation described in Eq. (39) is a familiar one, it is easily solved, and the solution is given by an extension of the Gauss-Markov theorem [15]. The same idea of a linear MMSE is behind the Kalman filter [11,7,6,22,5]. In our context it reads Theorem 1. The solution to Eq. (39), minimising is given by K := cov(x, y)cov(y) −1 and a :=x − K(ȳ), where cov(x, y) is the covariance of x and y, and cov(y) is the auto-covariance of y. In case cov(y) is singular or nearly singular, the pseudo-inverse can be taken instead of the inverse.
The operator K ∈ L (Y, X ) is also called the Kalman gain, and has the familiar form known from least squares projections. It is interesting to note that initially the connection between MMSE and Bayesian estimation was not seen [19], although it is one of the simplest approximations.
The resulting filter Eq. (40) is therefore called the Gauss-Markov-Kalman filter (GMKF). The original Kalman filter has Eq. (40) just for the means It easy to compute that Theorem 2. The covariance operator corresponding to Eq. (29) cov(x a,1L ) of x a,1L is given by which is Kalman's formula for the covariance.
This shows that Eq. (40) is a true extension of the classical Kalman filter (KF). Rewriting Eq. (40) explicitly in less symbolic notation one may see that it is a relation between RVs, and hence some further stochastic discretisation is needed to be numerically implementable.

Nonlinear filters
The derivation of nonlinear but polynomial filters is given in [18]. It has the advantage of showing the connection to the "Bayes linear" approach [6], to the Gauss-Markov theorem [15], and to the Kalman filter [11] [22]. Correcting higher moments of the posterior RV has been touched on in Subsection 4.3, and is not the topic here. Now the focus is on computing better than linear (see Subsection 5.2) approximations to the CE operator, which is the basic tool for the whole updating and identification process. We follow [18] for a more general approach not limited to polynomials, and assume a set of linearly independent measurable functions, not necessarily orthonormal, where A is some countable index set. Galerkin convergence [18] will require that i.e. that B is a Hilbert basis of S ∞ . Let us consider a general function Ψ : X → R of x, where R is some Hilbert space, of which we want to compute the conditional expectation E (Ψ (x)|y). Denote by A k a finite part of A of cardinality k, such that A k ⊂ A for k < and k A k = A, and set where the finite dimensional and hence closed subspaces S k are given by Observe that the spaces R k from Eq. (44) are L-closed, see [18]. In practice, also a "spatial" discretisation of the spaces X resp. R has to be carried out; but this is a standard process and will be neglected here for the sake of brevity and clarity. For a RV Ψ (x) ∈ R = R ⊗ S we make the following 'ansatz' for the optimal map φ Ψ,k such that P R k (Ψ (x)) = φ Ψ,k (y): with as yet unknown coefficients v α ∈ R. This is a normal Galerkin-ansatz, and the Galerkin orthogonality Eq. (37) can be used to determine these coefficients. Take Z k := R A k with canonical basis {e α | α ∈ A k }, and let be the symmetric positive definite Gram matrix of the basis of S k ; also set v := Theorem 3. For any k ∈ N, the coefficients {v α } α∈A k of the optimal map φ Ψ,k in Eq. (46) are given by the unique solution of the Galerkin equation

It has the formal solution
Proof. The Galerkin Eq. (47) is a simple consequence of the Galerkin orthogonality Eq. (37). As the Gram matrix G k and the identity I R on R are positive definite, so is the tensor operator (G k ⊗ I R ), with inverse (G −1 k ⊗ I R ). The block structure of the equations is clearly visible. Hence, to solve Eq. (47), one only has to deal with the 'small' matrix G k .
The update corresponding to Eq. (35), using again Ψ (x) = x, one obtains a possibly nonlinear filter based on the basis B: In case that Y * ⊆ span{ψ α } α∈A k , i.e. the functions with indices in A k generate all the linear functions on Y, this is a true extension of the Kalman filter.
Observe that this allows one to compute the map in Eq. (19) or rather Eq. (23) to any desired accuracy. Then, using this tool, one may construct a new random variable which has the desired posterior expectations; as was started in Subsection 4.2 and Subsection 4.3. This is then a truly nonlinear extension of the linear filters described in Section 5, and one may expect better tracking properties than even for the best linear filters. This could for example allow for less frequent observations of a dynamical system.

Numerical realisation
This is only going to be a rough overview on possibilities of numerical realisations. Only the simplest case of the linear filter will be considered, all other approximations can be dealt with in an analogous manner. Essentially we will look at two different kind of approximations, sampling and functional or spectral approximations.

Sampling
As starting point take Eq. (42). As it is a relation between RVs, it certainly also holds for samples of the RVs. Thus it is possible to take an ensemble of sampling points ω 1 , . . . , ω N and require and this is the basis of the ensemble Kalman filter, the EnKF [5]; the points x f (ω ) and x a (ω ) are sometimes also denoted as particles, and Eq. (49) is a simple version of a particle filter. In Eq. (49), C x f y = cov(x f , y) and C y = cov(y) Some of the main work for the EnKF consists in obtaining good estimates of C x f y and C y , as they have to be computed from the samples. Further approximations are possible, for example such as assuming a particular form for C x f y and C y . This is the basis for methods like kriging and 3DVAR resp. 4DVAR, where one works with an approximate Kalman gainK ≈ K. For a recent account see [12].

Functional approximation
Here we want to pursue a different tack, and want to discretise RVs not through their samples, but by functional resp. spectral approximations [17,30,14]. This means that all RVs, say v(ω), are described as functions of known RVs {ξ 1 (ω), . . . , ξ (ω), . . . }. Often, when for example stochastic processes or random fields are involved, one has to deal here with infinitely many RVs, which for an actual computation have to be truncated to a finte vector ξ(ω) = [ξ 1 (ω), . . . , ξ n (ω)] of significant RVs. We shall assume that these have been chosen such as to be independent. As we want to approximate later x = [x 1 , . . . , x n ], we do not need more than n RVs ξ.
One further chooses a finite set of linearly independent functions {ψ α } α∈J M of the variables ξ(ω), where the index α often is a multi-index, and the set J M is a finite set with cardinality (size) M . Many different systems of functions can be used, classical choices are [17,30,14] multivariate polynomials -leading to the polynomial chaos expansion (PCE), as well as trigonometric functions, kernel functions as in kriging, radial basis functions, sigmoidal functions as in artificial neural networks (ANNs), or functions derived from fuzzy sets. The particular choice is immaterial for the further development. But to obtain results which match the above theory as regards L-invariant subspaces, we shall assume that the set {ψ α } α∈J M includes all the linear functions of ξ. This is easy to achieve with polynomials, and w.r.t kriging it corresponds to universal kriging. All other functions systems can also be augmented by a linear trend.
Thus a RV v(ω) would be replaced by a functional approximation The argument ω will be omitted from here on, as we transport the probability measure P on Ω to Ξ = Ξ 1 ×· · ·×Ξ n , the range of ξ, giving P ξ = P 1 ×· · ·×P n as a product measure, where P = (ξ ) * P is the distribution measure of the RV ξ , as the RVs ξ are independent. All computations from here on are performed on Ξ, typically some subset of R n . Hence n is the dimension of our problem, and if n is large, one faces a high-dimensional problem.
It is here that low-rank tensor approximations [8] become practically important.
It is not too difficult to see that the linear filter when applied to the spectral approximation has exactly the same form as shown in Eq. (42). Hence the basic formula Eq. (42) looks formally the same in both cases, once it is applied to samples or "particles", in the other case to the functional approximation of RVs, i.e. to the coefficients in Eq. (50).
In both of the cases described here in Subsection 7.1 and in this Subsection 7.2, the question as how to compute the covariance matrices in Eq. (42) arises. In the EnKF in Subsection 7.1 they have to be computed from the samples [5], and in the case of functional resp. spectral approximations they can be computed from the coefficients in Eq. (50), see [21,24].
In the sampling context, the samples or particles may be seen as δ-measures, and one generally obtains weak- * convergence of convex combinations of these δ-measures to the continuous limit as the number of particles increases. In the case of functional resp. spectral approximation one can bring the whole theory of Galerkin-approximations to bear on the problem, and one may obtain convergence of the involved RVs in appropriate norms [18]. We leave this topic with this pointer to the literature, as this is too extensive to be discussed any further and hence is beyond the scope of the present work.

Examples
The first example is a dynamic system considered in [21], it is the well-known Lorenz-84 chaotic model, a system of three nonlinear ordinary differential equations operating in the chaotic regime. This is an example along the description of Eq. (3) and Eq. (5) in Subsection 2.1. Remember that this was originally a model to describe the evolution of some amplitudes of a spherical harmonic expansion of variables describing world climate. As the original scaling of the variables has been kept, the time axis in Fig. 1 is in days. Every ten days a noisy measurement is performed and the state description is updated. In between the state description evolves according to the chaotic dynamic of the system. One  [21]. For the estimated state uncertainty the 50% (full line), ±25%, and ±45% quantiles are shown. may observe from Fig. 1 how the uncertainty-the width of the distribution as given by the quantile lines-shrinks every time a measurement is performed, and then increases again due to the chaotic and hence noisy dynamics. Of course, we did not really measure world climate, but rather simulated the "truth" as well, i.e. a virtual experiment, like the others to follow. More details may be found in [21] and the references therein. All computations are performed in a functional approximation with polynomial chaos expansions as alluded to in Subsection 7.2, and the filter is linear according to Eq. (42).
To introduce the nonlinear filter as sketched in Section 6, where for the nonlinear filter the functions in Eq. (46) included polynomials up to quadratic terms, one may look shortly at a very simplified example, identifying a value, where only the third power of the value plus a Gaussian error RV is observed. All updates follow Eq. (28), but the update map is computed with different accuracy. Shown are the pdfs produced by the linear filter according to Eq. (42) -Linear polynomial chaos Bayesian update (Linear PCBU) -a special form of Eq. (28), and using polynomials up to order two, the quadratic polynomial chaos Bayesian update (QPCBU). One may observe that due to the nonlinear observation, the differences between the linear filters and the quadratic one are already significant, the QPCBU yielding a better update.
We go back to the example shown in Fig. 1, but now consider only for one step a nonlinear filter like in Fig. 2, see [18]. As a first set of experiments we take the measurement operator to be linear in the state variable to be identified, i.e. we can observe the whole state directly. At the moment we consider updates after each day-whereas in Fig. 1 the updates were performed every 10 days. The update is done once with the linear Bayesian update (LBU), and again with a quadratic nonlinear BU (QBU). The results for the posterior pdfs are given in Fig. 3, where the linear update is dotted in blue and labelled z1, and the full red line is the quadratic QBU labelled z2; there is hardly any difference between the two except for the z-component of the state, most probably   [18] indicating that the LBU is already very accurate. Now the same experiment, but the measurement operator is cubic: These differences Figure 4: Lorenz-84 model, perturbed cubic observations of the state: Posterior for LBU and QBU after one update, from [18] in posterior pdfs after one update may be gleaned from Fig. 4, and they are indeed larger than in the linear case Fig. 3, due to the strongly nonlinear measurement operator, showing that the QBU may provide much more accurate tracking of the state, especially for non-linear observation operators.  Figure 6: Cook's membrane -large strain elasto-plasticity, perturbed linear observations of the deformation, LBU and QBU for the shear modulus, from [18] As a last example we follow [18] and take a strongly nonlinear and also non-smooth situation, namely elasto-plasticity with linear hardening and large deformations and a Kirchhoff-St. Venant elastic material law [24], [25]. This example is known as Cook's membrane, and is shown in Fig. 5 with the undeformed mesh (initial), the deformed one obtained by computing with average values of the elasticity and plasticity material constants (deterministic), and finally the average result from a stochastic forward calculation of the probabilistic model (stochastic), which is described by a variational inequality [25].
The shear modulus G, a random field and not a deterministic value in this case, has to be identified, which is made more difficult by the non-smooth non-linearity. In Fig. 6 one may see the 'true' distribution at one point in the domain in an unbroken black line, with the mode -the maximum of the pdf -marked by a black cross on the abscissa, whereas the prior is shown in a dotted blue line. The pdf of the LBU is shown in an unbroken red line, with its mode marked by a red cross, and the pdf of the QBU is shown in a broken purple line with its mode marked by an asterisk. Again we see a difference between the LBU and the QBU. But here a curious thing happens; the mode of the LBU-posterior is actually closer to the mode of the 'truth' than the mode of the QBU-posterior. This means that somehow the QBU takes the prior more into account than the LBU, which is a kind of overshooting which has been observed at other occasions. On the other hand the pdf of the QBU is narrower -has less uncertainty -than the pdf of the LBU.

Conclusion
A general approach for state and parameter estimation has been presented in a Bayesian framework. The Bayesian approach is based here on the conditional expectation (CE) operator, and different approximations were discussed, where the linear approximation leads to a generalisation of the well-known Kalman filter (KF), and is here termed the Gauss-Markov-Kalman filter (GMKF), as it is based on the classical Gauss-Markov theorem. Based on the CE operator, various approximations to construct a RV with the proper posterior distribution were shown, where just correcting for the mean is certainly the simplest type of filter, and also the basis of the GMKF.
Actual numerical computations typically require a discretisation of both the spatial variables -something which is practically independent of the considerations here -and the stochastic variables. Classical are sampling methods, but here the use of spectral resp. functional approximations is alluded to, and all computations in the examples shown are carried out with functional approximations.