The goal is to identify the quantity \(\textsf{\textit{Q}}\) defined on \(\varOmega \), conditioned on the observation \(y_m\), using Bayes’s rule. As some or all components of \(\textsf{\textit{Q}}\) may be required to be positive definite (as is often the case for material quantities), this constraint has to be taken into account. In our case all components of \(\textsf{\textit{Q}}\) have to fulfil that requirement. In most updating methods it is advantageous if the quantities to be identified are unconstrained. We explain how to achieve this for a scalar component \(Q\) of \(\textsf{\textit{Q}}\). The first step is to scale \(Q\) by a reference \(Q_0\) to obtain a dimensionless quantity, and to consider \(q := \log (Q/Q_0)\). As numerically \(\log (Q/Q_0) = \log Q - \log Q_0\), it is convenient to choose \(Q_0\) such that \(Q_0 = 1\) in the units used, so that numerically \(\log Q_0 = 0\) and \(q := \log Q\). Henceforth we assume that this has been done. The variable \(q\) is now unconstrained, a free variable on all of \(\mathbb {R}\). This procedure may be extended to positive tensors of any even order, but is only needed for scalars here. So instead of identifying the collection \(\textsf{\textit{Q}}\) directly, we identify the logarithms of its components, giving a collection \(\textsf{\textit{q}}\), written symbolically as \(\textsf{\textit{q}} = \log \textsf{\textit{Q}}\). In this way, whatever approximations or linear operations are performed computationally on the numerical representation of \(q(x,\omega )\), in the end \(\exp (q(x,\omega ))\) is always positive. This also gives the right kind of mean, the geometric mean, for positive quantities. The underlying reason is that the multiplicative group of positive real numbers (a commutative one-dimensional Lie group) is thereby put into correspondence with the additive group of reals, which also represents the one-dimensional tangent vector space at the group unit, the number one; this is the corresponding Lie algebra.
A positive quadratic form on the Lie algebra—in one dimension necessarily proportional to Euclidean distance squared—can thereby be carried to a Riemannian metric on the Lie group. A similar argument holds for positive tensors of any even order.
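As a minimal numerical sketch of this reparametrisation (with a hypothetical lognormally distributed scalar \(Q\), scaled so that \(Q_0 = 1\)), working with \(q = \log Q\) keeps every linear operation in \(\mathbb{R}\), guarantees positivity after the back-transform \(\exp(\cdot)\), and turns plain averaging into the geometric mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical positive material parameter Q (e.g. a modulus in units where Q0 = 1),
# sampled from a lognormal prior purely for illustration.
Q = rng.lognormal(mean=2.0, sigma=0.5, size=10_000)

# Transform to the unconstrained variable q = log(Q/Q0) with Q0 = 1.
q = np.log(Q)

# Any linear operation on q (averaging, a linear Bayesian update, ...) stays in R,
# and exp(.) maps the result back to a strictly positive quantity.
Q_geometric = np.exp(q.mean())        # geometric mean of Q
Q_arithmetic = Q.mean()               # arithmetic mean, for comparison

print(f"geometric mean  (exp of mean log): {Q_geometric:.3f}")
print(f"arithmetic mean (plain average)  : {Q_arithmetic:.3f}")
assert np.exp(q).min() > 0.0          # positivity is preserved by construction
```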
Therefore, instead of Problem 2, we consider its modified version:
Problem 3
Find a random variable \(\textsf{\textit{q}} := \log \textsf{\textit{Q}} : \varOmega \rightarrow \mathcal {Q}\) for \(\textsf{\textit{Q}} = \exp \textsf{\textit{q}}\) in Eq. (11), such that the re-interpreted predictions of Eq. (10) match those of Eq. (14) in a measurement sense.
Bayesian updating is in essence probabilistic conditioning, the foundation of which is the conditional expectation operator [4]. Here we are interested in the case where the conditioning occurs w.r.t. another random variable, namely \(y(\textsf{\textit{q}}(\omega ))\), which depends on the quantity \(\textsf{\textit{q}}\) to be updated. For any function \(\varphi :\mathcal {Q}\rightarrow \mathcal {F}\) of \(\textsf{\textit{q}}\) with finite variance, the conditional expectation is defined [4] by the projection onto a closed subspace \(\mathcal {C}_{\varphi }\subset \mathrm {L}_2\), which in simple terms is the \(\mathrm {L}_2\)-closure of all multivariate polynomials in the components of \(y\) with coefficients from the vector space \(\mathcal {F}\), i.e.
$$\begin{aligned} \mathcal {C}_{\varphi } := {{\,\mathrm{cl}\,}}\; \{ \textsf{\textit{p}}(y(\textsf{\textit{q}})) \mid \textsf{\textit{p}}(y(\textsf{\textit{q}})) = \sum _\alpha \textsf{\textit{f}}_\alpha V_\alpha (y(\textsf{\textit{q}})) \}, \end{aligned}$$
(15)
where \(\textsf{\textit{f}}_\alpha \in \mathcal {F}\) and the \(V_\alpha \) are real-valued multivariate polynomials in \(y=(y_1,\dots ,y_j,\dots )\), which means that for a multi-index \(\alpha =(\alpha _1,\alpha _2,\dots )\) the polynomial \(V_\alpha \) is of degree \(\alpha _j\) in the variable \(y_j\). It turns out [4] that \(\mathcal {C}_{\varphi }\) contains all measurable functions \(\textsf{\textit{g}}:\mathcal {Y}\rightarrow \mathcal {F}\) such that \(\textsf{\textit{g}}(y(\textsf{\textit{q}}(\omega )))\) is of finite variance.
Here we will only be interested in the function \(\varphi (\textsf{\textit{q}}) = \textsf{\textit{q}}\), i.e. the conditional mean of \(\textsf{\textit{q}}\). To compute it, one may use the variational characterisation and compute the minimal distance from the subspace \(\mathcal {C}:=\mathcal {C}_{\textsf{\textit{q}}}\) to the point \(\textsf{\textit{q}}\):
$$\begin{aligned} {\phi }(y(\textsf{\textit{q}})):=\mathbb {E}\left( \textsf{\textit{q}}\mid y\right) := P_{\mathcal {C}}\, \textsf{\textit{q}} = \underset{{\hat{{\phi }}}\in \mathcal {C}}{\arg \min } \; \mathbb {E}\left( \left\| \textsf{\textit{q}}-{\hat{{\phi }}}(y(\textsf{\textit{q}})) \right\| ^2\right) . \end{aligned}$$
(16)
In this section, the expectation operators are to be understood as acting only on the variables which describe the uncertainty in the estimation, i.e. in the notation of the “Abstract model problem” section only on the variables from \(\varOmega _u\). One may observe from Eq. (16) that \({\phi }\) is the best “inverse” of \(y(\textsf{\textit{q}})\) in a least-squares sense, the orthogonal projection \(P_{\mathcal {C}}\, \textsf{\textit{q}}\) of \(\textsf{\textit{q}}\) onto \(\mathcal {C}\). Following Eq. (16), one may decompose the random variable \(\textsf{\textit{q}}\) into the projected component \(\textsf{\textit{q}}_p\in \mathcal {C}\) and the orthogonal residual \(\textsf{\textit{q}}_r\in \mathcal {C}^\perp \), such that
$$\begin{aligned} \textsf{\textit{q}}=\textsf{\textit{q}}_p+\textsf{\textit{q}}_r=P_{\mathcal {C}}\, \textsf{\textit{q}}+ (I-P_{\mathcal {C}})\,\textsf{\textit{q}}= {\phi }(y(\textsf{\textit{q}})) + (\textsf{\textit{q}}-{\phi }(y(\textsf{\textit{q}}))) \end{aligned}$$
(17)
holds. Here, \(\textsf{\textit{q}}_p=P_{\mathcal {C}}\, \textsf{\textit{q}}=\mathbb {E}\left( \textsf{\textit{q}}\mid y\right) = {\phi }(y(\textsf{\textit{q}}))\) is the orthogonal projection onto the subspace \(\mathcal {C}\) of all random variables consistent with the data, whereas \(\textsf{\textit{q}}_r:=(I-P_{\mathcal {C}})\,\textsf{\textit{q}}\) is its orthogonal residual.
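The projection in Eqs. (16) and (17) can be illustrated with a small sample-based sketch (hypothetical scalar \(q\) and a hypothetical nonlinear observation operator): the conditional expectation is approximated by least-squares regression of samples of \(q\) on polynomial features of \(y\), and the residual is then numerically orthogonal to the approximating subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20_000

# Hypothetical prior samples of a scalar parameter q and a noisy nonlinear observation y(q).
q = rng.normal(1.0, 0.3, size=N)
y = np.exp(q) + rng.normal(0.0, 0.2, size=N)    # forward model plus measurement noise

# Projection onto the closure of polynomials in y: here a cubic polynomial ansatz.
V = np.vander(y, 4)                             # columns: y^3, y^2, y, 1
coef, *_ = np.linalg.lstsq(V, q, rcond=None)    # least squares = L2 projection onto span(V)
phi = lambda y_val: np.vander(np.atleast_1d(y_val), 4) @ coef

q_p = phi(y)          # projected part, approximately E(q | y)
q_r = q - q_p         # orthogonal residual

# Orthogonality check: the residual has (numerically) zero inner product with the features.
print(np.abs(V.T @ q_r / N).max())   # close to 0
```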
This can be used to build a filter (filtering the observation \(y_m\) together with the prior forecast \(\textsf{\textit{q}}_f\)) which is optimal in this least-squares sense [38, 39] and returns an assimilated random variable \(\textsf{\textit{q}}_a\) with the correct conditional expectation. The first term in the sum in Eq. (17) is taken to be the conditional expectation given the observation \(y_m\), i.e. \({\phi }(y_m) = \mathbb {E}\left( \textsf{\textit{q}}_f \mid y_m\right) \), whereas \((\textsf{\textit{q}}_f-{\phi }(y(\textsf{\textit{q}}_f)))\) is the residual component. Following this, Eq. (17) can be recast to obtain the update \(\textsf{\textit{q}}_a\) of the prior random variable \(\textsf{\textit{q}}_f\) as
$$\begin{aligned} \textsf{\textit{q}}_a=\textsf{\textit{q}}_f+{\phi }(y_m)-{\phi }(y_M(\textsf{\textit{q}}_f)), \end{aligned}$$
(18)
in which \(y_M(\textsf{\textit{q}}_f)\) (see Eq. (10)) is the random variable representing our prior prediction/forecast of the measurement data, and \(\textsf{\textit{q}}_a\) is the assimilated random variable. By recalling \({\phi }(y(\textsf{\textit{q}}_f)) = P_{\mathcal {C}}\, \textsf{\textit{q}}_f\), one sees immediately that \(\mathbb {E}\left( \textsf{\textit{q}}_a \mid y_m\right) = \mathbb {E}\left( \textsf{\textit{q}}_f \mid y_m\right) \), i.e. the assimilated random variable \(\textsf{\textit{q}}_a\) has the correct conditional expectation. As in engineering practice one is often not interested in estimating the full posterior measure, and the conditional expectation is its most important characterisation, we use this computationally simpler procedure.
Therefore, to estimate \(\textsf{\textit{q}}_a\) one requires only the map \({\phi }\) in Eq. (18). To make the determination of this map computationally feasible, and for the sake of simplicity, \({\phi }\) is approximated by an \(n\)-th order polynomial, i.e. the minimisation in Eq. (16) is carried out not over all measurable maps, but only over \(n\)-th order polynomials, so that the map \({\phi }\) in Eq. (18) becomes
$$\begin{aligned} {\phi }_n(y;\beta )=\sum _{\alpha } K^{(\alpha )} V_\alpha ( y) \end{aligned}$$
(19)
with characterising coefficients \(\beta =\{K^{(\alpha )}\}_\alpha \), \(K^{(\alpha )}\in \mathcal {Q}\), multi-indices \(\alpha :=(\alpha _1,\dots )\) with \(\forall j: 0\le \alpha _j \le n\), and multivariate polynomials \(V_\alpha \) as in Eq. (15). In the affine case, when \(n=1\) and \({\phi }_1(y;\beta )=Ky+b\) in Eq. (19), Eq. (18) reduces to the Gauss-Markov-Kalman filter [38, 39]:
$$\begin{aligned} \textsf{\textit{q}}_a=\textsf{\textit{q}}_f+K(y_m-y_M(\textsf{\textit{q}}_f)), \end{aligned}$$
(20)
a generalisation of the well-known Kalman filter.
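A minimal ensemble sketch of the affine update in Eq. (20), with hypothetical scalar quantities and empirical covariances, computes the gain as \(K = C_{q_f y}\,C_{yy}^{-1}\) and applies the update sample by sample:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50_000

# Hypothetical prior ensemble for the (log-)parameter q_f and its measurement forecast y_M(q_f).
q_f = rng.normal(0.0, 1.0, size=N)
y_M = 2.0 * q_f + rng.normal(0.0, 0.5, size=N)   # linear forward map plus noise
y_m = 1.2                                        # the actual (hypothetical) observation

# Gauss-Markov-Kalman gain from empirical covariances: K = C_{q y} / C_{y y}.
C = np.cov(q_f, y_M)
K = C[0, 1] / C[1, 1]

# Sample-wise update: q_a = q_f + K (y_m - y_M(q_f)), cf. Eq. (20).
q_a = q_f + K * (y_m - y_M)

print(f"prior mean     : {q_f.mean():+.3f}")
print(f"posterior mean : {q_a.mean():+.3f}")
print(f"posterior var  : {q_a.var():.3f}  (reduced from {q_f.var():.3f})")
```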
In order to estimate the macro-scale properties using Eq. (18), one requires both \(y_m\) and \(y_M\), preferably in functional approximation form. Note that \(y_M\) is the prediction of the measurement data on the macro-scale level, and is obtained by propagating the prior knowledge \(\textsf{\textit{q}}_f\) (here a spatially homogeneous quantity) through the macro-scale model. In this paper we use Bayesian regression (not related to the Bayesian updating) to estimate the functional approximation of \(y_M(q_f)\), as presented in the “Approximating the macro-scale response by Bayesian regression” section. On the other hand, \(y_m\) represents the response of the high-dimensional meso-scale model, in which the meso-scale properties \(\textsf{\textit{q}}_m\) are heterogeneous and uncertain. By modelling \(\textsf{\textit{q}}_m\) one may estimate \(y_m\) in a similar manner as \(y_M\), see the first scenario in Fig. 1. However, due to the high dimensionality of the input \(\textsf{\textit{q}}_m\) and the nonlinearity of the meso-scale model, a straightforward uncertainty quantification of \(y_m\) is often not computationally affordable: the estimate would require too many data points, as thousands of input parameters can easily be involved in the description of the meso-scale properties. Therefore, we use an unsupervised learning algorithm to reduce the stochastic dimension of the meso-scale measurement, as described in the “Approximation of the meso-scale observation by unsupervised learning” section.
Approximating the macro-scale response by Bayesian regression
The measurement prediction \(y_M\) is approximated by a surrogate model
$$\begin{aligned} {\hat{y}}_M=\phi _M(q_f;\beta ) \end{aligned}$$
(21)
in which \(\phi _M\) is usually taken to be a nonlinear map of \(q_f\in \mathrm {L}_2(\varOmega _u,\mathfrak {B}_u,\mathbb {P}_u)\) with coefficients \(\beta \). Furthermore, we assume that \(y_M\) is in general only known as a set of samples, and our goal is to match \({\hat{y}}_M\) with \(y_M\). Let \(x:=(q_f(\omega _i),y_M(\omega _i))_{i=1}^N\) be the full set of \(N\) data samples describing the forward propagation of \(q_f\) to \(y_M\) via the macro-scale model. In order to specify \({\hat{y}}_M\), the only unknowns are the coefficients of the map \(\phi _M\). Therefore, we infer \(\beta \) given the data \(x\) using Bayes’s rule
$$\begin{aligned} p(\beta |x)=\frac{p(x,\beta )}{\int p(x,\beta )\, \mathop {}\!\mathrm {d}\beta } \end{aligned}$$
(22)
In the general case the marginalisation in Eq. (22) can be expensive, and therefore in this paper we use variational Bayesian inference instead [31]. The idea is to introduce a family \({{\mathcal {D}}}:=\{g(\beta ):=g(\beta |\lambda ,w)\}\) of densities over \(\beta \), indexed by a set of free parameters \((w,\lambda )\), such that \({\hat{y}}_M \sim y_M\), and to optimise the parameter values by minimising the Kullback-Leibler divergence
$$\begin{aligned} g^*(\beta )=\underset{g(\beta )\in {{\mathcal {D}}}}{\arg \min \,} D_{KL}(g(\beta )||p(\beta |x))= \underset{g(\beta )\in {{\mathcal {D}}}}{\arg \min \,} \int g(\beta )\, \log \frac{g(\beta )}{p(\beta |x) }\,\mathop {}\!\mathrm {d}\beta . \end{aligned}$$
(23)
After a few derivation steps, as shown in [31], the previous minimisation problem reduces to
$$\begin{aligned} \beta ^*=\arg \max {\mathcal {L}}(g(\beta )):= {\mathbb {E}}_g(\log p(x,\beta ))-{\mathbb {E}}_g(\log g(\beta )) \end{aligned}$$
(24)
in which \({\mathcal {L}}(g)\) is the evidence lower bound (ELBO), or variational free energy. To obtain a closed-form solution for \(\beta ^*\), the usual practice is to assume that both the posterior \(p(\beta |x)\) and its approximation \(g(\beta )\) factorise in a mean-field sense, i.e.
$$\begin{aligned} p(\beta |x)=\prod p_i(\beta _i|x), \quad g(\beta )=\prod g_i(\beta _i) \end{aligned}$$
(25)
in which each factor \(p_i(\beta _i|x)\), \(g_i(\beta _i)\) is independent and belongs to an exponential family. Similarly, their complete conditionals given all other variables and observations are also assumed to belong to exponential families and to be independent. These assumptions lead to conjugacy relationships and to a closed-form solution of Eq. (24), as discussed in more detail in [31].
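As a small self-contained illustration of such conjugate mean-field updates (a textbook Gaussian model with unknown mean and precision and conjugate Normal-Gamma priors, not the model used later in this paper), the coordinate-ascent iteration alternates closed-form updates of the two factors:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.5, 0.7, size=200)      # hypothetical data with unknown mean and precision
N, xbar = x.size, x.mean()

# Conjugate priors: mu ~ N(mu0, (lam0*tau)^-1), tau ~ Gamma(a0, b0).
mu0, lam0, a0, b0 = 0.0, 1.0, 1e-3, 1e-3

# Mean-field factors q(mu) = N(mu_N, 1/lam_N), q(tau) = Gamma(a_N, b_N); initialise E[tau] = 1.
E_tau = 1.0
a_N = a0 + 0.5 * (N + 1)                # fixed by conjugacy
for _ in range(50):                     # coordinate ascent (CAVI)
    # update q(mu) given E[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # update q(tau) given the first two moments of q(mu)
    E_sq = np.sum((x - mu_N) ** 2 + 1.0 / lam_N)        # E_mu[sum_n (x_n - mu)^2]
    b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    E_tau = a_N / b_N

print(f"posterior mean of mu  : {mu_N:.3f}  (data mean {xbar:.3f})")
print(f"posterior mean of tau : {E_tau:.3f} (data generated with precision {1/0.49:.3f})")
```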
To approximate \(y_M(\omega )\), we take Eq. (21) to be described in the form of a polynomial chaos expansion (PCE) or generalised PCE (gPCE) [67]. In other words, \(y_M(\omega )\) and \(q_f(\omega )\) are taken to be functions of known RVs \(\{\theta _1(\omega ),\dots ,\theta _n(\omega ),\dots \}\). Often, for example when stochastic processes or random fields are involved, one has to deal with infinitely many RVs, which for an actual computation have to be truncated to a finite vector \(\varvec{\theta }(\omega )=[\theta _1(\omega ),\dots ,\theta _L(\omega )]\in \varTheta \cong \mathbb {R}^L\) of significant RVs. We shall assume that these have been chosen to be independent, and often even normalised Gaussian. The reason not to use \(q_f\) directly is that in the process of identification of \(q\) its components may turn out to be correlated, whereas \(\varvec{\theta }\) can stay independent as they are. Thus the RV \(\varvec{y}_M(\varvec{\theta })\) is replaced by a functional approximation
$$\begin{aligned} \varvec{y}_M(\varvec{\theta }) = \sum _{\alpha \in \mathcal {J}_Z} \varvec{y}_M^{(\alpha )} \varPsi _\alpha (\varvec{\theta }), \end{aligned}$$
(26)
and analogously \(\varvec{q}_f\) by
$$\begin{aligned} \varvec{q}_f(\varvec{\theta }) = \sum _{\alpha \in \mathcal {J}_Z} \varvec{q}_f^{(\alpha )} \varPsi _\alpha (\varvec{\theta }) \end{aligned}$$
(27)
in which \(\alpha =(\dots ,\alpha _k,\dots )\) is a multi-index, and \(\mathcal {J}_Z\) is a finite set of multi-indices with cardinality (size) \(Z\).
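A minimal sketch of evaluating such a PCE basis (assuming independent standard Gaussian \(\varvec{\theta}\) and a total-degree multi-index set, which is one common but not the only choice) assembles the matrix of basis values \(\varPsi _\alpha (\varvec{\theta }_j)\) used in Eq. (28) below; the helper names are illustrative only:

```python
import numpy as np
from itertools import product
from numpy.polynomial.hermite_e import hermeval

def total_degree_multiindices(L, p):
    """All multi-indices alpha in N_0^L with |alpha| <= p (a total-degree set J_Z)."""
    return [a for a in product(range(p + 1), repeat=L) if sum(a) <= p]

def psi_matrix(theta, multiindices):
    """Evaluate the probabilists' Hermite PCE basis Psi_alpha at samples theta (shape N x L)."""
    N, L = theta.shape
    Psi = np.ones((N, len(multiindices)))
    for k, alpha in enumerate(multiindices):
        for j, deg in enumerate(alpha):
            if deg > 0:
                coeffs = np.zeros(deg + 1)
                coeffs[deg] = 1.0                      # selects He_deg
                Psi[:, k] *= hermeval(theta[:, j], coeffs)
    return Psi

# Hypothetical use: L = 3 standard Gaussian RVs, total degree 2, N = 5 samples.
rng = np.random.default_rng(4)
J_Z = total_degree_multiindices(L=3, p=2)
theta = rng.standard_normal((5, 3))
Psi = psi_matrix(theta, J_Z)
print(len(J_Z), Psi.shape)    # cardinality Z and the (N x Z) matrix of basis values
```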
The coefficients \(\varvec{q}_f^{(\alpha )},\varvec{y}_M^{(\alpha )}\), e.g. \(\varvec{v}:=\{\varvec{y}_M^{(\alpha )}\}_{\alpha \in {\mathcal {J}}_Z}\), are estimated by maximising an ELBO analogous to the one in Eq. (24), using the variational relevance vector machine method [3]. Namely, the measurement forecast \(\varvec{y}_s:=\{\varvec{y}_M(\varvec{\theta }_j)\}_{j=1}^N\) can be rewritten in vector form as
$$\begin{aligned} \varvec{y}_s=\varvec{v}\,\varvec{\varPsi } \end{aligned}$$
(28)
in which \(\varvec{\varPsi }\) is the matrix of basis functions \(\varPsi _\alpha (\varvec{\theta })\) evaluated at the set of sample points \(\{\varvec{\theta }_j\}_{j=1}^N\). However, the expression in the previous equation is not complete, as the PCE in Eq. (26) is truncated; this implies the presence of a modelling error. Under a Gaussian assumption, the data can then be modelled as
$$\begin{aligned} p(\varvec{y}_s)\sim {\mathcal {N}}(\varvec{v}\,\varvec{\varPsi },\varsigma ^{-1}\varvec{I}) \end{aligned}$$
(29)
in which \(\varsigma \sim \varGamma (a_\varsigma ,b_\varsigma )\) denotes the noise precision parameter, here assumed to follow a Gamma distribution. The coefficients \(\varvec{v}\) are given a normal prior under an independence assumption:
$$\begin{aligned} p(\varvec{v}|\varvec{\zeta })\sim \prod _{i=0}^Z {\mathcal {N}}(0,\zeta _i^{-1}) \end{aligned}$$
(30)
in which \(Z\) denotes the cardinality of the PCE, and \(\varvec{\zeta }:=\{\zeta _i\}\) is a vector of hyper-parameters. To promote sparsity, the hyper-parameters are further assumed to follow Gamma distributions
$$\begin{aligned} p(\zeta _i) \sim \varGamma (a_{i},b_i) \end{aligned}$$
(31)
under an independence assumption. In this manner the posterior for \(\varvec{\beta }:=(\varvec{v},\varvec{\zeta },\varsigma )\), i.e. \(p(\varvec{\beta }|\varvec{y}_s)\), can be approximated by a variational mean-field form
$$\begin{aligned} g(\varvec{\beta })=g_v(\varvec{v})\,g_{\zeta }(\varvec{\zeta })\,g_{\varsigma }(\varsigma ), \end{aligned}$$
(32)
the factors of which are chosen to be of the same distribution type as the corresponding priors, for conjugacy reasons. Once this assumption is made, one may maximise the corresponding ELBO in order to estimate the parameter set.
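This sparse Bayesian regression step can be sketched with a generic mean-field ARD update (Gaussian likelihood, Gamma hyper-priors as above); this is a simplified stand-in for the variational relevance vector machine of [3], not its exact implementation:

```python
import numpy as np

def vb_ard_regression(Phi, y, n_iter=100, a0=1e-6, b0=1e-6, c0=1e-6, d0=1e-6):
    """Mean-field VB for y = Phi v + noise with ARD priors v_i ~ N(0, 1/zeta_i),
    zeta_i ~ Gamma(a0, b0), and noise precision varsigma ~ Gamma(c0, d0)."""
    N, Z = Phi.shape
    E_zeta = np.ones(Z)           # E[zeta_i]
    E_vs = 1.0                    # E[varsigma]
    for _ in range(n_iter):
        # q(v) = N(m, S): Gaussian with precision E[varsigma] Phi^T Phi + diag(E[zeta])
        S = np.linalg.inv(E_vs * Phi.T @ Phi + np.diag(E_zeta))
        m = E_vs * S @ Phi.T @ y
        # q(zeta_i) = Gamma(a0 + 1/2, b0 + (m_i^2 + S_ii)/2)
        E_zeta = (a0 + 0.5) / (b0 + 0.5 * (m**2 + np.diag(S)))
        # q(varsigma) = Gamma(c0 + N/2, d0 + (||y - Phi m||^2 + tr(Phi S Phi^T))/2)
        resid = y - Phi @ m
        E_vs = (c0 + 0.5 * N) / (d0 + 0.5 * (resid @ resid + np.trace(Phi @ S @ Phi.T)))
    return m, S, E_zeta, E_vs

# Hypothetical usage with a sparse ground truth: only 2 of 10 basis functions are active.
rng = np.random.default_rng(5)
Phi = rng.standard_normal((200, 10))
v_true = np.zeros(10)
v_true[[1, 7]] = [2.0, -1.0]
y = Phi @ v_true + 0.1 * rng.standard_normal(200)
m, S, E_zeta, E_vs = vb_ard_regression(Phi, y)
print(np.round(m, 2))     # large E[zeta_i] shrinks the inactive coefficients towards zero
```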
Finally, we have everything needed to describe the macro-scale response \(y_M\), and may therefore substitute this term into the filtering equation Eq. (18) to obtain:
$$\begin{aligned} \varvec{q}_a=\sum _{\alpha \in {\mathcal {J}}} \varvec{q}_f^{(\alpha )}\varPsi _\alpha (\varvec{\theta })+ \phi _n({y}_m)-\phi _n \left( \sum _{\alpha \in {\mathcal {J}}} \varvec{y}_M^{(\alpha )}\varPsi _\alpha (\varvec{\theta })\right) \end{aligned}$$
(33)
Note that Eq. (33) is not yet fully computationally operational: the RV \({y}_m\) still has to be put into a computationally accessible form. This is considered in the following “Approximation of the meso-scale observation by unsupervised learning” section.
Approximation of the meso-scale observation by unsupervised learning
Let the measurement \(y_m \) be approximated by
$$\begin{aligned} y_m=\phi _m(w,\eta ) \end{aligned}$$
(34)
in which \(\phi _m\) is an analytical function (e.g. a Gaussian mixture model, a neural network, etc.) parameterised by global variables/parameters \(w\) describing the whole data set, and by latent local/hidden variables \(\eta \) that describe each data point. An example is a generalised mixture model, in which the parameters \(w\) include the statistics of the individual components and the mixture weights, whereas the hidden variable \(\eta \) is the indicator variable describing the membership of data points in the mixture components. The goal is to estimate the pair \({\beta }:=(w,\eta )\) given the data \(y_d:=\{y_m({\hat{\omega }}_i)\}, i=1,\dots ,M\), with the help of Bayes’s rule. Note that we do not take the full set of input-output data \((q_m({\hat{\omega }}_i),y_m({\hat{\omega }}_i))\) with \(q_m\) defined on \((\varOmega _{\textsf{\textit{Q}}},\mathfrak {B}_{\textsf{\textit{Q}}},\mathbb {P}_{\textsf{\textit{Q}}})\), but only its incomplete version generated by the output \(y_d\) alone. Following this, the coefficients of \(y_m\) can be estimated as
$$\begin{aligned} p(\beta |y_d)=\frac{p(y_d,\beta )}{\int p(y_d,\beta ) \,\mathop {}\!\mathrm {d}\beta }. \end{aligned}$$
(35)
The previous equation is more general than Eq. (22), and hence includes the problem described in the “Approximating the macro-scale response by Bayesian regression” section as a special case. The main reason is that, in addition to the coefficients \(w\), we also need to estimate the argument \(\eta \) such that the functional approximation in Eq. (34) is minimally parametrised.
Following the theory in the “Approximating the macro-scale response by Bayesian regression” section, Eq. (35) is reformulated as a computationally simpler variational inference problem. In other words, we introduce a family of density functions \({{\mathcal {D}}}:=\{g(\beta ):=g(\beta |\lambda ,\varpi )\}\) over \(\beta \), indexed by a set of free parameters \((\varpi ,\lambda )\), that approximate the posterior density \(p(\beta |y_d)\), and then optimise the variational parameter values by minimising the Kullback-Leibler divergence between the approximation \(g(\beta )\) and the exact posterior \(p(\beta |y_d)\). Hence, following Eq. (24), we maximise the ELBO
$$\begin{aligned} {\mathcal {L}}(g)={\mathbb {E}}_{g(\beta )}(\log p(y_d,\beta )) -{\mathbb {E}}_{g(\beta )}(\log g(\beta )) \end{aligned}$$
(36)
using the mean-field factorisation assumption and conjugacy relationships. The optimisation problem attains a closed-form solution in which the lower bound is iteratively optimised: first with respect to the global parameters while keeping the local parameters fixed, and then with respect to the local parameters while the global parameters are held fixed. The algorithm can be improved by stochastic optimisation, in which a noisy estimate of the natural gradient is used instead of the exact one.
The mean-field factorisation as presented above is computationally simple, but not accurate. For example, one cannot assume independence between the stored energy and the dissipation coming from the same experiment. In other words, the correlation among the latent variables is not captured, and as a result the covariance of the measurement will be underestimated. To allow dependence in the factorisation, one may extend the mean-field approach via copula factorisations [63, 64]:
$$\begin{aligned} g(\beta )=c(F_1(\beta _1),\dots ,F_m(\beta _m),\chi )\prod _{i=1}^m g_i(\beta _i) \end{aligned}$$
(37)
in which \(c(F_1(\beta _1),\dots ,F_m(\beta _m),\chi )\) is a representative of a copula family, \(F_i(\beta _i)\) is the marginal cumulative distribution function of the random variable \(\beta _i\), and \(\chi \) is the set of parameters describing the copula family. Similarly, the \(g_i(\beta _i)\) are the independent marginal densities. According to Sklar’s theorem [56], any distribution type can be represented by a formulation as given in Eq. (37).
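As a small illustration of Eq. (37), a Gaussian copula (chosen here only as an example family) combined with arbitrary marginals via Sklar's theorem produces dependent samples with the prescribed marginal distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
N = 10_000

# Gaussian copula with correlation rho between two latent variables beta_1, beta_2.
rho = 0.7
R = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=R, size=N)   # correlated standard normals
u = stats.norm.cdf(z)                                         # copula sample, uniform marginals

# Sklar: attach arbitrary marginals, e.g. Gamma for beta_1 and lognormal for beta_2.
beta1 = stats.gamma(a=2.0, scale=1.5).ppf(u[:, 0])
beta2 = stats.lognorm(s=0.4).ppf(u[:, 1])

# The marginals are as prescribed, while the rank correlation reflects the copula.
print(f"Spearman rho: {stats.spearmanr(beta1, beta2)[0]:.3f}")
```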
Following Eq. (37), the goal is to find \(g(\beta )\) such that the Kullback-Leibler divergence to the exact posterior distribution is minimised. Note that if the true posterior is described by
$$\begin{aligned} p(\beta |y_d)=c_t(F_1(\beta _1),\dots ,F_m(\beta _m),\chi _t) \prod _{i=1}^m f_i(\beta _i), \end{aligned}$$
(38)
then the Kullback-Leibler divergence reads:
$$\begin{aligned} D_{KL}(g(\beta )\,||\, p(\beta |y_d))=D_{KL}(c\,||\,c_t)+\sum _{i=1}^m D_{KL}(g_i(\beta _i)\,||\,f_i(\beta _i)), \end{aligned}$$
(39)
and contains one additional term compared to the mean-field approximation. When the copula is uniform (the independence copula), the previous equation reduces to the mean-field one, and hence only the second term is minimised. On the other hand, if the mean-field factorisation is not a good assumption and the dependence relations are neglected, then the total approximation error will be dominated by the first term. To avoid this, the ELBO derived in Eq. (36) is modified to
$$\begin{aligned} {\mathcal {L}}(g)={\mathbb {E}}_{g(\beta )}(\log p(y_d,\beta ))- {\mathbb {E}}_{g(\beta )}(\log g(\beta ,\chi )) \end{aligned}$$
(40)
and is a function of the parameters of the latent variables \(\beta \), as well as of the copula parameters \(\chi \). Therefore, the algorithm applied here consists of iteratively finding the parameters of the mean-field approximation as well as those of the copula. The algorithm is adopted from [63], and is a black-box algorithm as it only depends on the likelihood \(p(y_d,\beta )\) and on the copula description in vine form. Note that when the copula is the uniform (independence) copula, the factorisation collapses to the mean-field one.
Once the copula dependence structure is found, the measurement data \(y_m\) are represented in a functional form (here taken as a generalised mixture model) as in Eq. (34), which is different from the polynomial chaos representation. In other words, the measurement is given in terms of dependent random variables, and not independent ones. Therefore, the dependence structure has to be mapped to an independent one. In the Gaussian copula case the Nataf transformation can be used, and otherwise the Rosenblatt transformation is applied. For high-dimensional copulas, such as a regular vine copula, [1] provides algorithms to compute the Rosenblatt transform and its inverse; a minimal sketch for the Gaussian-copula case is given after Eq. (43). The result of the transformation is a set of mutually independent and marginally uniformly distributed random variables, which can further be mapped to Gaussian or other standard random variables via their marginals [62]. Let the functional approximation of the measurement be given as
$$\begin{aligned} \varvec{y}_m (\varvec{\xi }) \approx \sum _{\alpha \in {\mathcal {J}}_m} \varvec{y}_m^{(\alpha )}G_\alpha (\varvec{\xi }) \end{aligned}$$
(41)
in which \({\mathcal {J}}_m\) is a multi-index set, and the \(G_\alpha (\varvec{\xi })\) are a set of functions (e.g. orthogonal polynomials) with the random variables \(\varvec{\xi }\) as arguments. With this, we have obtained the measurement \(y_m\) in a minimally parametrised functional approximation form, which can be plugged into Eq. (33) to obtain the final filter discretisation. By combining the random variables \(\varvec{\theta }\) and \(\varvec{\xi }\), one may rewrite Eq. (33) in the following form
$$\begin{aligned} \sum _{\alpha \in \mathcal {J}_a} \varvec{q}_a^{(\alpha )}H_\alpha (\varvec{\theta },\varvec{\xi }) =\sum _{\alpha \in \mathcal {J}} \varvec{q}_f^{(\alpha )}H_\alpha (\varvec{\theta },\varvec{\xi })+ \phi _n \left( \sum _{\alpha \in \mathcal {J}_m} \varvec{y}_m^{(\alpha )}H_\alpha (\varvec{\theta },\varvec{\xi })\right) - \phi _n \left( \sum _{\alpha \in \mathcal {J}} \varvec{y}_M^{(\alpha )}H_\alpha (\varvec{\theta },\varvec{\xi })\right) \end{aligned}$$
(42)
in which the \(H_\alpha \) are generalised polynomial chaos basis functions with the random variables \((\varvec{\theta },\varvec{\xi })\) as arguments. Note that the coefficients \(\varvec{q}_f^{(\alpha )}\), as well as \(\varvec{y}_M^{(\alpha )}\) and \(\varvec{y}_m^{(\alpha )}\), are sparse, as they depend only on \(\varvec{\theta }\) or \(\varvec{\xi }\), respectively. As \(\varvec{\theta }\) describes the a priori (epistemic) uncertainty, one may take the mathematical expectation of the previous equation w.r.t. \(\varvec{\theta }\) to obtain the natural (aleatoric) variability of the macro-scale parameters:
$$\begin{aligned}&\sum _{\alpha \in \mathcal {J}_m} \varvec{q}_a^{(\alpha )}G_\alpha (\varvec{\xi }) =\mathbb {E}_{\varvec{\theta }} \left( \sum _{\alpha \in \mathcal {J}_a} \varvec{q}_a^{(\alpha )}H_\alpha (\varvec{\theta },\varvec{\xi })\right) \nonumber \\&\quad =\phi _n \left( \sum _{\alpha \in \mathcal {J}_m} \varvec{y}_m^{(\alpha )}G_\alpha (\varvec{\xi })\right) + \mathbb {E}_{\varvec{\theta }}\left( \sum _{\alpha \in \mathcal {J}} \varvec{q}_f^{(\alpha )}\varPsi _\alpha (\varvec{\theta })- \phi _n\left( \sum _{\alpha \in \mathcal {J}} \varvec{y}_M^{(\alpha )} \varPsi _\alpha (\varvec{\theta })\right) \right) . \end{aligned}$$
(43)
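For the Gaussian-copula case mentioned above, a minimal sketch of the Nataf-type transformation (assuming the marginal CDFs and the copula correlation are already known, e.g. from the fitted model) maps the dependent data to the independent standard Gaussian variables \(\varvec{\xi}\) entering Eq. (41):

```python
import numpy as np
from scipy import stats

def nataf_to_independent(samples, marginal_cdfs, R):
    """Map dependent samples (N x d) with known marginal CDFs and Gaussian-copula
    correlation R to independent standard normal variables xi (N x d)."""
    U = np.column_stack([F(samples[:, i]) for i, F in enumerate(marginal_cdfs)])
    Z = stats.norm.ppf(U)                 # correlated standard normal scores (Gaussian copula)
    L = np.linalg.cholesky(R)
    Xi = np.linalg.solve(L, Z.T).T        # decorrelate: xi = L^{-1} z
    return Xi

# Hypothetical usage, reusing the Gaussian-copula construction from the previous sketch.
rng = np.random.default_rng(7)
rho = 0.7
R = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal([0.0, 0.0], R, size=5_000)
u = stats.norm.cdf(z)
data = np.column_stack([stats.gamma(a=2.0, scale=1.5).ppf(u[:, 0]),
                        stats.lognorm(s=0.4).ppf(u[:, 1])])

cdfs = [stats.gamma(a=2.0, scale=1.5).cdf, stats.lognorm(s=0.4).cdf]
xi = nataf_to_independent(data, cdfs, R)
print(np.round(np.corrcoef(xi.T), 3))     # approximately the identity matrix
```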
In general, the approximation of the meso-scale information as previously described can be cumbersome due to the high nonlinearity and time dependence of \(y_m\). Therefore, instead of approximating \(y_m\) in the form of Eq. (34), one may discretise \(y_m\) in a Monte Carlo sampling manner, such that Eq. (33) is rewritten, for all \({\hat{\omega }}_i,\; i=1,\dots ,M\), as
$$\begin{aligned} \varvec{q}_a^{(i)}(\varvec{\theta }):= \sum _{\alpha \in \mathcal {J}} \varvec{q}_i^{(\alpha )}\varPsi _\alpha (\varvec{\theta })=\sum _{\alpha \in \mathcal {J}} \varvec{q}_f^{(\alpha )}\varPsi _\alpha (\varvec{\theta })+ \phi _n(\varvec{y}_m({\hat{\omega }}_i))-\phi _n \left( \sum _{\alpha \in \mathcal {J}} \varvec{y}_M^{(\alpha )}\varPsi _\alpha (\varvec{\theta })\right) . \end{aligned}$$
(44)
In other words, we repeat the update formula \(M\) times, once for each instance of the measurement \(y_m\), and thus obtain \(M\) posteriors \(q_a^{(i)}, i=1,\dots ,M\), that depend only on the epistemic uncertainty embodied in \(\varvec{\theta }\). By averaging over \(\varvec{\theta }\) one obtains a set of samples:
$$\begin{aligned} \forall {\hat{\omega }}_i: \quad \bar{\varvec{q}}_i=\mathbb {E}_{\varvec{\theta }}(\varvec{q}_a^{(i)}(\varvec{\theta })),\quad i=1,\dots ,M, \end{aligned}$$
(45)
i.e. the data to be used for estimating a functional approximation of the macro-scale parameter \(q_M\), similarly to Eq. (34). To achieve this, we search for an approximation
$$\begin{aligned} q_M=\varphi _q(\varvec{w}_q,\eta _q) \end{aligned}$$
(46)
given the incomplete data set \(q_d:=(\bar{\varvec{q}}_i)_{i=1}^M\). Here, \(w_q\) and \(\eta _q\) have the same meaning as in Eq. (34), and can therefore be estimated using the same unsupervised algorithm as previously described. This approach is computationally more convenient, as the correlation structure between the material parameters is easier to learn than the one between the measurement data on the meso-scale.
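A minimal sketch of the sample-wise variant in Eqs. (44) and (45), here with the affine map of Eq. (20) standing in for \(\phi_n\) and hypothetical scalar surrogates, loops over the \(M\) meso-scale measurement samples and averages each update over \(\varvec{\theta}\):

```python
import numpy as np

rng = np.random.default_rng(8)
N_theta, M = 20_000, 200

# Prior (epistemic) ensemble in theta for the macro-scale (log-)parameter and its forecast y_M.
q_f = rng.normal(0.0, 1.0, size=N_theta)
y_M = 2.0 * q_f + rng.normal(0.0, 0.5, size=N_theta)

# Hypothetical meso-scale measurement samples y_m(omega_i), carrying the aleatoric variability.
y_m_samples = rng.normal(1.0, 0.8, size=M)

# Affine map phi_1(y) = K y with the Gauss-Markov-Kalman gain from the prior ensemble.
C = np.cov(q_f, y_M)
K = C[0, 1] / C[1, 1]

q_bar = np.empty(M)
for i, y_m in enumerate(y_m_samples):
    q_a_i = q_f + K * (y_m - y_M)      # update for the i-th measurement sample, cf. Eq. (44)
    q_bar[i] = q_a_i.mean()            # average over theta (epistemic part), cf. Eq. (45)

# q_bar now plays the role of the data set q_d used to learn the macro-scale distribution.
print(f"mean {q_bar.mean():.3f}, std {q_bar.std():.3f} over the M assimilated samples")
```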
For clarity, we recapitulate the upscaling procedures in Fig. 1. Fig. 1a shows the direct computational approach, in which Eq. (42) is used with \(y_m\) approximated in the same manner as \(y_M\), by the supervised Bayesian regression described in the “Approximating the macro-scale response by Bayesian regression” section. Due to its high computational footprint, this approach is not considered in this paper; for more information please see [47]. The upscaling approach presented in Eq. (42), in which \(y_m\) is approximated by Eq. (34) via an unsupervised learning algorithm, is depicted in Fig. 1b. Here one first uses the Bayesian unsupervised learning algorithm to learn the distribution of the meso-scale measurement, and then a Bayesian upscaling procedure to estimate the macro-scale parameters. Finally, the upscaling approach given by Eqs. (44) and (46) is shown in Fig. 1c. In this approach one first uses a Bayesian upscaling procedure to estimate the macro-scale parameters sample-wise, after which the Bayesian unsupervised learning algorithm is used to approximate the distribution of the macro-scale parameters. The choice of algorithm depends on the application and dimensionality, as well as on the nonlinearity of the meso-scale model.