Numerical example: estimation of error in CG approximation of a polyethylene chain
We describe in this section an application of the a posteriori error estimation described in the previous section, involving CG approximations of a well-known model of polyethylene. For the base “AA” model, we consider a united-atom model of a polyethylene chain containing 200 CH2 (methylene) monomers, meaning we have aggregated hydrogen and carbon atoms into an “AA” bead for simplicity. The united-atom coordinates \(r_{i}\) define the locations of each particle on an r-axis, and the displacement is denoted \(u_{i}(t)\). As an additional simplification, we assume that the interatomic potential is characterized by harmonic bonds of the form (1a), with parameter \(k_{l}=k=350\) kcal/mol, bond length l = 1.5 Å, and atomic mass m = 14.026 g/mol. Initially, each united atom is separated by the bond length l, the initial velocities are zero, and the system is assigned an initial displacement field \(u_{i}(0)=f(r_{i})\), where \(f(r_{i}) = 1.2 \text {e}^{0.1r_{i}(0)}\). Under these conditions, the AA system (5) reduces to
$$ \boldsymbol{m}\ddot{\boldsymbol{u}} + \boldsymbol{k}\boldsymbol{u} = \mathbf{0}, \qquad \boldsymbol{u}(0)=\mathbf{u}_{0},\; \dot{\boldsymbol{u}}(0) = \mathbf{v}_{0}, $$
(39)
where m and k are the mass matrix and the stiffness matrix of the united atom system and are of the form
$$ \boldsymbol{m} = m\left[ \begin{array}{ccc} 1 & & \\ & \ddots & \\ & & 1 \\ \end{array}\right], \quad \boldsymbol{k} = k\left[ \begin{array}{cccc} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{array}\right]. $$
(40)
A family of CG approximations of this model is obtained by aggregating the atoms into beads, with CG models distinguished by the number P of atoms per CG bead. The resulting CG system (8) is of the form
$$ \boldsymbol{M}\ddot{\boldsymbol{U}} + \boldsymbol{K}\boldsymbol{U} = \mathbf{0}, \qquad \boldsymbol{U}(0)=\mathbf{U}_{0},\; \dot{\boldsymbol{U}}(0) = \mathbf{V}_{0}, $$
(41)
with the mass of each bead set to M = Pm and the bond stiffness K = αk/P, \(\alpha \in \mathbb {R}^{+}\).
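To make the setup concrete, the matrices in (40) and their coarse-grained counterparts can be assembled directly. The following sketch is ours (not the authors' code); the default parameter values are the ones quoted in the text, and the function names are our own:

```python
import numpy as np

def aa_matrices(n, m=14.026, k=350.0):
    """AA matrices of (40): mass m*I and stiffness k*tridiag(-1, 2, -1)."""
    M = m * np.eye(n)
    K = k * (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
    return M, K

def cg_matrices(n, P, m=14.026, k=350.0, alpha=1.0):
    """CG matrices of (41): P atoms per bead, bead mass P*m, stiffness alpha*k/P."""
    N = n // P  # number of CG beads
    return aa_matrices(N, m=P * m, k=alpha * k / P)
```

For the chain in the text, `aa_matrices(200)` gives the 200-atom system and `cg_matrices(200, P)` its P-atoms-per-bead approximation.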
Upon solving (41) for the CG displacement trajectory U(t), we compute the residual trajectory
$$ \boldsymbol{\rho} = \boldsymbol{m}\ddot{\boldsymbol{U}}_{\text{CG}} + \boldsymbol{k}\boldsymbol{U}_{\text{CG}}, $$
(42)
where \(\boldsymbol{U}_{\text{CG}}(t)\) is the projection \(\Pi \boldsymbol{U}(t)\) of the CG solution onto the AA atom locations.
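The residual (42) can be evaluated numerically by substituting the CG trajectory, prolongated to the AA atom sites, into the AA operator. The following sketch uses a synthetic stand-in trajectory and finite-difference accelerations; all sizes and values are illustrative, not taken from the paper:

```python
import numpy as np

n, m, k = 8, 14.026, 350.0                                # illustrative sizes/units
K = k * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))  # AA stiffness matrix
t = np.linspace(0.0, 1.0, 2001)
# Stand-in for the projected CG trajectory U_CG(t); in practice this comes
# from solving (41) and projecting onto the AA atom locations.
U = np.outer(np.sin(np.arange(1, n + 1)), np.cos(3.0 * t))
Udd = np.gradient(np.gradient(U, t, axis=1), t, axis=1)   # finite-difference accelerations
rho = m * Udd + K @ U                                     # residual of the AA equation
```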
The bilinear and linear forms described in (19)–(21) reduce, in this case, to
$$\begin{array}{@{}rcl@{}} \mathcal{B} (\boldsymbol{u} ; \boldsymbol{v}) &=& \int_{0}^{\tau} \boldsymbol{v}^{T} \left(\boldsymbol{m}\ddot{\boldsymbol{u}} + \boldsymbol{k}\boldsymbol{u} \right) \text{d}t + \boldsymbol{v}^{T}(0) \boldsymbol{m} \dot{\boldsymbol{u}}(0) - \dot{\boldsymbol{v}}^{T}(0) \boldsymbol{m} \boldsymbol{u}(0), \end{array} $$
(43)
$$\begin{array}{@{}rcl@{}} \mathcal{F} (\boldsymbol{v}) &=& \boldsymbol{v}^{T}(0) \boldsymbol{m} \mathbf{v}_{0} - \dot{\boldsymbol{v}}^{T}(0) \boldsymbol{m} \mathbf{u}_{0}, \end{array} $$
(44)
and (25) yields
$$\begin{array}{@{}rcl@{}} \mathcal{B}' (\boldsymbol{u} ; \boldsymbol{v}, \boldsymbol{z}) &=& \int_{0}^{\tau} \left(\boldsymbol{m}\ddot{\boldsymbol{z}} + \boldsymbol{k}\boldsymbol{z} \right)^{T} \boldsymbol{v}\, \text{d}t\\ &&+ \left(\boldsymbol{m} \boldsymbol{z}(\tau)\right)^{T} \dot{\boldsymbol{v}}(\tau) - \left(\boldsymbol{m} \dot{\boldsymbol{z}}(\tau)\right)^{T} \boldsymbol{v}(\tau) \\ &=& \mathcal{Q}'(\boldsymbol{v}). \end{array} $$
(45)
As an example of a QoI, we take \(\mathcal{Q}_{\mathbf{r}}\) to be the locally-averaged displacement,
$$ \mathcal{Q}_{\mathbf{r}} = \int_{0}^{\tau} \zeta(t)\, \text{d}t; \;\; \zeta (t) = \frac{1}{N_{0}} \sum_{i \in \mathcal{N}} u_{i}(t); \;\; \mathcal{N} = \{i: x_{i} \leq \beta l,\; \beta \in \mathbb{R}^{+}\}, \;\; N_{0} = \text{card}\, \mathcal{N}, $$
(46)
for which the strong form of the dual problem is
$$ \boldsymbol{m}\ddot{\boldsymbol{z}} + \boldsymbol{k}\boldsymbol{z} = \boldsymbol{q}, \qquad \boldsymbol{m} \dot{\boldsymbol{z}}(\tau) = \boldsymbol{0}, \quad \boldsymbol{z}(\tau)=\boldsymbol{0}, $$
(47)
where \(N_{0}\) is the number of united atoms in the set \(\mathcal{N}\) and q is the vector defined such that \(\mathcal {Q}(\boldsymbol {u}) = \boldsymbol {q}^{T}\boldsymbol {u}\). Given the QoI (46), q is the n×1 vector,
$$ q_{i} = \begin{cases} 1 & \text{if } x_{i} \leq \beta l \\ 0 & \text{otherwise } \end{cases} \qquad i=1, \cdots, n. $$
(48)
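The weight vector (48) is a simple indicator over the initial atom positions. A minimal sketch, in which the value of β and the positions are illustrative choices of ours:

```python
import numpy as np

def qoi_vector(x, beta, l):
    """q_i = 1 if x_i <= beta*l, else 0, as in (48)."""
    return (x <= beta * l).astype(float)

l, beta = 1.5, 10.0
x = l * np.arange(200)      # united atoms spaced one bond length apart
q = qoi_vector(x, beta, l)  # selects the atoms entering the local average (46)
```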
The residual function \(\mathcal {R} (\cdot, \cdot)\) of (36) in this example is of the form
$$ \mathcal{R} (\boldsymbol{U}_{\text{CG}},\boldsymbol{z}^{n})= \int_{0}^{\tau} \eta_{t}(t)\, \text{d}t; \qquad \eta_{t}(t) = \sum^{n}_{i=1} {z}_{i}(t)\, {\rho}_{i}(t). $$
(49)
The estimated error in the QoI is then
$$ \mathcal{E}_{\text{est.}} = \mathcal{R} (\boldsymbol{U}_{\text{CG}},\boldsymbol{z}^{n}) \approx \mathcal{Q}_{\mathbf{r}} - \mathcal{Q}_{\mathbf{R}}, $$
(50)
where \(\mathcal{Q}_{\mathbf {R}} = \int _{0}^{\tau } \eta _{t}(t)\,\text {d}t\), and the exact error is
$$ \mathcal{E}_{\text{exact}} = \mathcal{E}_{\text{est.}} + \Delta, $$
(51)
Δ being the remainder in (34). Since the forms in (43)–(46) are linear in their respective arguments, the exact remainder Δ should be zero, but the error introduced by the numerical integration schemes employed generally leads to an additional numerical error \(\Delta_{\Delta t} \neq 0\). We employ a conventional Runge–Kutta algorithm here to integrate (39), (41), and (47).
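A classical fourth-order Runge–Kutta step for the first-order form of (39), \(\dot{\boldsymbol{u}} = \boldsymbol{v}\), \(\dot{\boldsymbol{v}} = -\boldsymbol{m}^{-1}\boldsymbol{k}\,\boldsymbol{u}\), can be sketched as follows. This is a generic RK4 routine of ours, not necessarily the authors' exact integrator:

```python
import numpy as np

def rk4_chain(u0, v0, Minv_K, dt, nsteps):
    """Classical RK4 for u'' = -Minv_K @ u, written as u' = v, v' = -Minv_K @ u."""
    def f(u, v):
        return v, -Minv_K @ u
    u, v = u0.astype(float), v0.astype(float)
    for _ in range(nsteps):
        k1u, k1v = f(u, v)
        k2u, k2v = f(u + 0.5 * dt * k1u, v + 0.5 * dt * k1v)
        k3u, k3v = f(u + 0.5 * dt * k2u, v + 0.5 * dt * k2v)
        k4u, k4v = f(u + dt * k3u, v + dt * k3v)
        u = u + dt * (k1u + 2 * k2u + 2 * k3u + k4u) / 6.0
        v = v + dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
    return u, v
```

The same routine serves (41) with \(\boldsymbol{M}^{-1}\boldsymbol{K}\); the dual problem (47) additionally carries the source q and is marched backward in time from its terminal conditions.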
The results of the coarse-grained model for the case P=4 are presented in Figure 2. Figure 2a shows the coarse-scale displacement U=U(t) at different times, obtained from the solution of (41) over the time domain t∈[0,τ]. The local residual of Figure 2b is then computed from (42). Figure 2c shows the solution z(t) at different times. It is observed that the adjoint solution propagates in time in the opposite direction to the primal solution, (47) being integrated backward in time.
It is known that, in general, the solution of the base model is not available. However, in order to show the effectiveness of the method presented here, the equations of motion for the united-atom system are also solved in this example. Given the solution of the united-atom model, u(t), the evolution of the exact ζ and the estimated \(\eta_{t}\) over time is shown in Figure 2d.
Numerical approximations to the exact error are compared with the estimated error for various CG approximations of the united-atom model in Figure 3a. The estimated error \(\mathcal {R} = \mathcal {R}(\boldsymbol {U}_{\text{CG}}(\boldsymbol {\theta }); \boldsymbol {z}^{n})\) for various values of P, with θ = αk/P, is indicated in Figure 3b for α=1. The computed estimated error (\(\mathcal {E}_{\text {est}}=\mathcal {R}\)) versus the parameter α is indicated in Figure 3c.
Maximum entropy principle for atomic systems
Among the features of the AA system that could qualify as quantities of interest, we consider a special measure of uncertainty content embodied in the so-called information entropy. In 1948, Shannon [6] introduced the concept of information entropy as a real-valued function H(p) of probability distributions (densities) p, a logical measure of the uncertainty content in p that satisfies four rather straightforward “common-sense” desiderata (see also [12] for full details). For a discrete pdf \(p=\{p_{1},p_{2},\ldots,p_{n}\}\), the entropy is defined by
$$ H(p) =  \sum_{i=1}^{n} p_{i} \log p_{i}, $$
(52)
and for a continuous density, \(p \in L^{2} (\mathbb {R})\), we write
$$ H(p) =  \int_{\mathbb{R}} p(y) \log p(y) \text{d}y. $$
(53)
Given two probability densities p and q with nonempty common support, the relative entropy between p and q is given by the Kullback–Leibler divergence,
$$\begin{array}{@{}rcl@{}} D_{KL}(p \,\|\, q) &=& \int_{\mathbb{R}} p(y) \log \frac{p(y)}{q(y)}\, \text{d}y \\ &=& H(p,q) - H(p), \end{array} $$
(54)
where \(H(p,q)\) \((= -\int _{\mathbb {R}} p \log q \,\text {d}y)\) is the cross entropy, and it is understood that \(0 \log \frac {0}{0}=0\) and \(0 \log \frac {0}{q}=0\).
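Discrete counterparts of (52) and (54) take only a few lines; the sketch below (ours, with our own function names) uses the convention \(0 \log 0 = 0\) stated above:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) of a discrete pdf, in nats, as in (52)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                       # convention: 0*log(0) = 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for discrete pdfs, as in (54)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0                       # convention: 0*log(0/q) = 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))
```

For example, `entropy([0.25]*4)` returns log 4, the maximum over four states, consistent with (55).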
Shannon’s principle of maximum entropy asserts that in the set of all possible probability distributions relevant to a random field, the correct probability p corresponds to the maximum entropy:
$$ H(p) = \underset{q \in \mathcal{P}}{\max} \; H(q). $$
(55)
Errors in information entropy
The connection with the statistical mechanics characterization of the AA and CG models can be established by choosing as a quantity of interest the infinite-time average of the phase function \(q(\mathbf{r}^{n}(t))\) over [0,∞). For this we invoke the ergodic hypothesis,
$$\begin{array}{@{}rcl@{}} Q_{\mathbf{r}} &=& {\lim}_{\tau \rightarrow \infty} \tau^{-1} \int_{t_{0}}^{t_{0}+\tau} q(\mathbf{r}(t)) \text{d}t \end{array} $$
(56)
$$\begin{array}{@{}rcl@{}} &=& \int_{\Gamma} \rho(\mathbf{r}^{n}) q(\mathbf{r}^{n}) \text{d} \mathbf{r}^{n} \end{array} $$
(57)
$$\begin{array}{@{}rcl@{}} &=& \langle q \rangle, \end{array} $$
(58)
\(\rho(\mathbf{r}^{n})\) being the distribution function for the ensemble under study and Γ the corresponding phase-space subdomain. The corresponding CG approximation is
$$\begin{array}{@{}rcl@{}} Q_{\mathbf{R}}(\boldsymbol{\theta}) &=& {\lim}_{\tau \rightarrow \infty} \tau^{-1} \int_{t_{0}}^{t_{0}+\tau} q(G(\mathbf{R}^{N}(t);\boldsymbol{\theta})) \text{d}t \end{array} $$
(59)
$$\begin{array}{@{}rcl@{}} &=& \int_{\Gamma} \rho(\mathbf{r}^{n}) q(G(\mathbf{R}^{N};\boldsymbol{\theta})) \text{d} \mathbf{r}^{n}, \end{array} $$
(60)
where the notation \(G(\mathbf{R}^{N};\boldsymbol{\theta})\) represents the relation (11). Setting
$$ q(\mathbf{r}^{n}) = \log \rho (\mathbf{r}^{n}), $$
(61)
gives immediately
$$ Q_{\mathbf{r}} - Q_{\mathbf{R}}(\boldsymbol{\theta}) = D_{KL} \left(\rho(\mathbf{r}^{n}) \,\|\, \rho (G(\mathbf{R}^{N};\boldsymbol{\theta}))\right), $$
(62)
where \(D_{KL}(\cdot \,\|\, \cdot)\) is the Kullback–Leibler divergence defined in (54). Thus, if \(\mathbf{z}^{n}\) is the equilibrium solution of (25) with \(Q'(\mathbf {r}^{n} ; \mathbf {v}^{n}) = \int _{0}^{\tau } \partial _{\alpha _{i}} q(\mathbf {r}^{n}(t)) v_{\alpha i} \text {d}t\), then
$$ D_{KL}(\rho(\mathbf{r}^{n}) \,\|\, \rho(G (\mathbf{R}^{N};\boldsymbol{\theta}))) \cong \mathcal{R}(\mathbf{R}^{N} (\boldsymbol{\theta}); \mathbf{z}^{n}). $$
(63)
The specification (60) of the CG approximation (with \(\rho(\mathbf{r}^{n})\) as opposed to \(\rho(\mathbf{R}^{N}(\boldsymbol{\theta}))\)) requires some explanation. In interpreting (60), one assumes the role of an observer who resides in the AA system and, instead of the true phase function \(q(\mathbf{r}^{n})\), observes a corrupted version for each choice of θ, constrained to reside only in microstates accessible to the CG model. This is also the interpretation of the residual described in (13) and (15). It is also noted that the estimate (63) is reminiscent of the minimum relative entropy method suggested by Shell [13].
A fundamental question arises at this point: given the estimates (36) or (63), is it possible to find a special parameter vector \(\boldsymbol{\theta}^{*}\) that makes the error \(\varepsilon(\boldsymbol{\theta}^{*})=0\)? This question is related to the so-called well-specification or misspecification of the CG model. We believe the answer to this question is generally “no.”
Model misspecification and statistical analysis
A fundamental concept in the mathematical statistics literature on parametric models is the notion of a well-specified model: one for which a special parameter vector \(\boldsymbol{\theta}^{*}\) exists that the model maps into the truth, i.e. the true observational data. If no such parameter exists, the model is said to be misspecified.
More generally, we consider a space \(\mathcal{Y}\) of physical observables (in our case, the values of appropriate observables sampled from the AA model) and a set \(\mathbb{M}(\mathcal{Y})\) of probability measures μ on \(\mathcal{Y}\). As always, a target quantity of interest \(Q:\mathbb {M} \rightarrow \mathbb {R}\) is identified (e.g. Q(μ)=μ[X≥a], X being a random variable and a a threshold value). We seek a particular measure \(\mu^{*}\) which yields the “true” value of the quantity of interest \(Q(\mu^{*})\). We wish to predict \(Q(\mu^{*})\) using a parametric model \(\mathcal {P}: \Theta \rightarrow \mathbb {M}(\mathcal {Y})\), Θ being the space of parameters. Again, if a \(\boldsymbol{\theta}^{*} \in \Theta\) exists such that \(\mathcal {P}(\boldsymbol {\theta }^{*})=\mu ^{*}\), the model is said to be well-specified; otherwise, if \(\mu ^{*} \notin \mathcal {P}(\Theta)\), the model is misspecified. See, e.g., Geyer [14], Kleijn and van der Vaart [7], Freedman [15], Nickl [16]. In the model discussed in Section ‘Preliminaries, conventions and notations’, we seek a parameter \(\boldsymbol{\theta}^{*}\) of the CG model such that \(\varepsilon(\boldsymbol{\theta}^{*})\) of (36) is zero, an unlikely possibility for most choices of Q.
To recast the issue of error estimation into a statistical setting, we presume that our goal is to determine (predict) a probability distribution of a random variable, an observable q in the AA system, using a CG model \(\mathcal{P}\), given a set \(y_{1},y_{2},\cdots,y_{n}\) of iid (independent, identically distributed) random variables representing samples \(y_{i}=q(\omega_{i})\) (\(\omega _{i} = \mathbf {r}_{i}^{n}\) is meant to denote a particular point in phase space). We denote by \(\pi(y_{i} \,|\, \boldsymbol{\theta})\) the conditional probability density p of the distance between the random data \(y_{i}\) and the parameter-to-observation map \(d_{i}(\boldsymbol{\theta})\),
$$ p(y_{i} - d_{i}(\boldsymbol{\theta})) = \pi (y_{i} \,|\, \boldsymbol{\theta}); \quad i = 1, 2, \cdots, n, $$
(64)
where \(\pi(y_{i} \,|\, \boldsymbol{\theta})\) is the ith component of the likelihood function. The joint density of the data vector \(\mathbf{y}^{n}=(y_{1},y_{2},\cdots,y_{n})\) is then
$$ \pi_{n}(y_{1}, y_{2}, \cdots, y_{n} \,|\, \boldsymbol{\theta}) = \pi (\mathbf{y} \,|\, \boldsymbol{\theta}) = \prod_{i=1}^{n} \pi (y_{i} \,|\, \boldsymbol{\theta}). $$
(65)
The log-likelihood function is
$$ L_{n}(\boldsymbol{\theta}) = \log \pi(\mathbf{y} \,|\, \boldsymbol{\theta}) = \sum_{i=1}^{n} \log \pi(y_{i} \,|\, \boldsymbol{\theta}). $$
(66)
Let π(θ) be any prior probability density on the parameters θ (computed, for instance, using the maximum entropy method of Jaynes [12], as described for CG models in [3]); then the posterior density satisfies
$$ \pi_{n}(\boldsymbol{\theta} \,|\, \mathbf{y}) = \pi (y_{1}, y_{2}, \cdots, y_{n} \,|\, \boldsymbol{\theta})\, \pi (\boldsymbol{\theta}) / Z(\mathbf{y}), $$
(67)
where \(Z(\mathbf{y}) = \int _{\Theta } \pi (\mathbf {y} \,|\, \boldsymbol {\theta }) \pi (\boldsymbol {\theta }) \text {d}\boldsymbol {\theta }\) is the model evidence.
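A minimal sketch of (64)–(67) under an additive Gaussian noise model, \(y_{i} = d_{i}(\boldsymbol{\theta}) + e_{i}\) with \(e_{i} \sim \mathcal{N}(0,\sigma^{2})\), so that \(\pi(y_{i}\,|\,\boldsymbol{\theta})\) is a normal density evaluated at the distance \(y_{i} - d_{i}(\boldsymbol{\theta})\). The noise model, the map `d`, and σ are placeholders of ours, not specified by the text:

```python
import numpy as np

def log_likelihood(theta, y, d, sigma=1.0):
    """L_n(theta) of (66) for iid Gaussian residuals y_i - d_i(theta)."""
    r = y - d(theta)
    return -0.5 * np.sum(r**2) / sigma**2 - r.size * np.log(sigma * np.sqrt(2.0 * np.pi))

def log_posterior(theta, y, d, log_prior, sigma=1.0):
    """Unnormalized log of the posterior (67); the evidence Z(y) is a
    theta-independent constant and is omitted."""
    return log_likelihood(theta, y, d, sigma) + log_prior(theta)
```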
The following definitions and theorems follow from these relations:

- The Maximum Likelihood Estimate (MLE) is the parameter \(\hat {\boldsymbol {\theta }}^{n}\) that maximizes \(L_{n}(\boldsymbol{\theta})\):
$$ \hat{\boldsymbol{\theta}}^{n} = \underset{\boldsymbol{\theta} \in \Theta}{\text{argmax}} \; L_{n}(\boldsymbol{\theta}). $$
(68)
- The Maximum A Posteriori (MAP) estimate is the parameter \(\tilde {\boldsymbol {\theta }}^{n}\) that maximizes the posterior pdf:
$$ \tilde{\boldsymbol{\theta}}^{n} = \underset{\boldsymbol{\theta} \in \Theta}{\text{argmax}} \; \pi_{n}(\boldsymbol{\theta} \,|\, \mathbf{y}). $$
(69)
- The Bayesian Central Limit Theorem for well-specified models under commonly satisfied smoothness assumptions (also called the Bernstein–von Mises Theorem [7,16,17]) asserts that
$$ \pi_{n}(\boldsymbol{\theta} \,|\, \mathbf{y}) \overset{\mathcal{P}}{\rightarrow} \mathcal{N}(\boldsymbol{\theta}^{*}, \mathbf{I}^{-1}(\boldsymbol{\theta}^{*})), $$
(70)
where the convergence is convergence in probability, \(\mathcal {N}(\boldsymbol {\mu }, \boldsymbol {\Sigma })\) denotes a normal distribution with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\), \(\boldsymbol{\theta}^{*}\) is the limit of the (generalized) MLE, and \(\mathbf{I}(\boldsymbol{\theta})\) is the Fisher information matrix,
$$ I_{ij}(\boldsymbol{\theta}) = -\sum^{n}_{k=1} \left[ \frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} \log \pi({y}_{k} \,|\, \boldsymbol{\theta}) \right]_{\boldsymbol{\theta} = \boldsymbol{\theta}^{*}}. $$
(71)
- Given a set of parametric models, \(\mathcal {M} = \{ \mathcal {P}_{1} (\boldsymbol {\theta }_{1}), \mathcal {P}_{2}(\boldsymbol {\theta }_{2}), \cdots, \mathcal {P}_{m} (\boldsymbol {\theta }_{m}) \}\), the posterior plausibility of model j is defined through the application of Bayesian arguments by (see [3,18])
$$ \rho_{j} = \pi(\mathcal{P}_{j} \,|\, \mathbf{y}, \mathcal{M}) = \frac{\left[ \int_{\Theta_{j}} \pi (\mathbf{y} \,|\, \boldsymbol{\theta}_{j}, \mathcal{P}_{j}, \mathcal{M})\, \pi (\boldsymbol{\theta}_{j} \,|\, \mathcal{P}_{j}, \mathcal{M})\, \text{d} \boldsymbol{\theta}_{j} \right] \pi (\mathcal{P}_{j} \,|\, \mathcal{M})} {\pi(\mathbf{y} \,|\, \mathcal{M})} $$
(72)
with \(\sum _{j=1}^{m} \rho _{j} = 1\); the largest \(\rho_{j} \in [0,1]\) corresponds to the most plausible model for the data \(\mathbf {y} \in \mathcal {Y}\).
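The plausibilities (72) can be sketched by brute-force quadrature for a one-dimensional parameter, assuming equal prior model probabilities. The likelihoods, parameter grid, and prior below are illustrative inventions of ours, not taken from the paper:

```python
import numpy as np

def evidence(y, loglike, theta_grid, prior_pdf):
    """pi(y | P_j, M): integral of likelihood * parameter prior over Theta,
    approximated with a simple Riemann sum on a uniform grid."""
    like = np.exp(np.array([loglike(th, y) for th in theta_grid]))
    dtheta = theta_grid[1] - theta_grid[0]
    return np.sum(like * prior_pdf(theta_grid)) * dtheta

def plausibilities(y, loglikes, theta_grid, prior_pdf):
    """Normalized posterior plausibilities rho_j of (72), summing to one."""
    ev = np.array([evidence(y, ll, theta_grid, prior_pdf) for ll in loglikes])
    return ev / ev.sum()
```

With the prior odds equal to one, the model with the larger evidence receives the larger \(\rho_{j}\).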
Finally, we come to the case of misspecified parametric models, in which \(\mu ^{*} \notin \mathcal {P}(\Theta)\); i.e. no parameter \(\boldsymbol{\theta}^{*}\) exists such that the truth \(\mu ^{*} = \mathcal {P}(\boldsymbol {\theta }^{*})\) is attained. This situation, we believe, is by far the most common encountered in the use of CG models.
We remark that in the (rare?) case of a well-specified CG model, the following holds: for any continuous functional \(Q: \Theta \rightarrow \mathbb {R}\), if Θ is compact, if \(\boldsymbol{\theta}^{*}\) is the unique minimizer of Q, and if
$$ \underset{\boldsymbol{\theta} \in \Theta}{\text{sup}} \; \left| Q_{n}(\boldsymbol{\theta}; y_{1}, y_{2}, \cdots, y_{n}) - Q(\boldsymbol{\theta}) \right| \overset{\mathcal{P}}{\rightarrow} 0, $$
(73)
as n→∞, then the sequence
$$ \hat{\boldsymbol{\theta}}^{n} = \underset{\boldsymbol{\theta} \in \Theta}{\text{argmin}} \; Q_{n} (\boldsymbol{\theta}; y_{1}, y_{2}, \cdots, y_{n}) $$
(74)
converges to \(\boldsymbol{\theta}^{*}\) in probability as n→∞. This is proved in Nickl [16]. In particular, under mild assumptions on the smoothness of the log-likelihood \(L_{n}(\boldsymbol{\theta})\),
$$ Q(\boldsymbol{\theta}) - Q(\boldsymbol{\theta}^{*}) = D_{KL} (\pi(\cdot \,|\, \boldsymbol{\theta}) \;\|\; \pi(\cdot \,|\, \boldsymbol{\theta}^{*})), $$
(75)
\(D_{KL}(\cdot \,\|\, \cdot)\) being the Kullback–Leibler divergence defined in (54). By Jensen’s inequality (see, e.g. [16]), \(Q(\boldsymbol{\theta}^{*}) \leq Q(\boldsymbol{\theta})\) for all \(\boldsymbol{\theta} \in \Theta\); i.e. \(\boldsymbol{\theta}^{*}\) is the minimizer of Q.
The asymptotic results for the finite misspecified case are summed up in the powerful result of Kleijn and van der Vaart [7,19]: let g(y) denote the probability density associated with the true distribution \(\mu^{*}\). Then the posterior density \(\pi_{n}(\boldsymbol{\theta} \,|\, \mathbf{y})\) converges in probability to a normal distribution,
$$ \pi_{n}(\boldsymbol{\theta} \,|\, \mathbf{y}) \overset{\mathcal{P}}{\rightarrow} \mathcal{N} (\boldsymbol{\theta}^{\dagger}, \mathbf{V}^{-1}(\boldsymbol{\theta}^{\dagger})), $$
(76)
where
$$ \mathbf{V}_{ij}(\boldsymbol{\theta}) = \mathbb{E}_{g} \left[ \frac{\partial^{2}}{\partial \theta_{i} \partial \theta_{j}} D_{KL} \left(g \;\|\; \pi(\cdot \,|\, \boldsymbol{\theta}) \right) \right]_{\boldsymbol{\theta} = \boldsymbol{\theta}^{\dagger}}. $$
(77)
Thus, the best approximation to g in \(\mathcal {P}(\Theta)\) is the model with the parameter
$$ \boldsymbol{\theta}^{\dagger} = \underset{\boldsymbol{\theta} \in \Theta}{\text{argmin}} \; D_{KL} \left(g \,\|\, \pi (\cdot \,|\, \boldsymbol{\theta}, \mathcal{P}, \mathcal{M}) \right), $$
(78)
\(\mathcal{M}\) being the class of parametric models to which \(\mathcal{P}\) belongs.
It is easily shown that \(\boldsymbol{\theta}^{\dagger}\) is a maximum likelihood estimate, i.e. it maximizes the expected value of the log-likelihood relative to the true density g:
$$\begin{array}{@{}rcl@{}} \boldsymbol{\theta}^{\dagger} & = & \underset{\Theta}{\text{argmin}} \left[ \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log g(\textbf{y}) \; d\textbf{y} - \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}) \; d\textbf{y} \right] \\ & = & \underset{\Theta}{\text{argmin}} \left[ - \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}) \; d\textbf{y} \right] \\ & = & \underset{\Theta}{\text{argmax}} \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}) \; d \textbf{y} \\ & = & \underset{\Theta}{\text{argmax}}\; \mathbb{E}_{g} \left[ \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}) \right], \end{array} $$
(79)
where the negative self-entropy \(\int_{\mathcal{Y}^{n}} g \log g \; d\textbf {y}\) was eliminated since it does not depend on θ and therefore does not affect the optimization.
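The characterization (79) is easy to demonstrate numerically with a deliberately misspecified model: fitting a unit-variance Gaussian family to data whose true density g is a two-component mixture. The maximizer of \(\mathbb{E}_{g}[\log \pi(y\,|\,\theta)]\) is then the mean of g. All distributions and values here are illustrative choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples from the "true" density g: an equal-weight Gaussian mixture.
y = np.concatenate([rng.normal(-2.0, 1.0, 50_000),
                    rng.normal(4.0, 1.0, 50_000)])

thetas = np.linspace(-5.0, 7.0, 481)
# Monte Carlo estimate of E_g[log N(y; theta, 1)], up to an additive constant.
expected_loglik = np.array([-0.5 * np.mean((y - th) ** 2) for th in thetas])
theta_dagger = thetas[np.argmax(expected_loglik)]   # lands near the mixture mean
```

No single Gaussian reproduces g, but \(\boldsymbol{\theta}^{\dagger}\) is still well defined as the KL-closest member of the family, exactly as in (78).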
Plausibility and \(D_{KL}\) theory
Let us now suppose that we have two misspecified models, \(\mathcal {P}_{1}\) and \(\mathcal {P}_{2}\). We may compare these models in the Bayesian setting through the concept of model plausibility: if \(\mathcal {P}_{1}\) is more plausible than \(\mathcal {P}_{2}\), then \(\rho_{1} > \rho_{2}\). In the maximum likelihood setting, the model that yields a probability measure closer to \(\mu^{*}\) is considered the “better” model. That is, if
$$ D_{KL}(g \,\|\, \pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})) < D_{KL}(g \,\|\, \pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})), $$
(80)
it can be said that model \(\mathcal {P}_{1}\) is better than model \(\mathcal {P}_{2}\). The theorems presented here define the relationship between these two notions of model comparison.
However, Bayesian and frequentist methods fundamentally differ in the way they view the model parameters. Bayesian methods consider parameters to be stochastic, characterized by probability density functions, while frequentist approaches seek a single, deterministic parameter value. To bridge this gap in methodology, we note that treating the parameters as a deterministic vector, for example \(\boldsymbol{\theta}_{0}\), is akin to assigning them delta functions as their posterior probability distributions, which result from delta prior distributions. In this case, the model evidence is given by
$$ \pi(\textbf{y} \,|\, \mathcal{P}_{i}, \mathcal{M}) = \int_{\Theta} \pi(\textbf{y} \,|\, \boldsymbol{\theta}, \mathcal{P}_{i}, \mathcal{M})\, \delta(\boldsymbol{\theta} - \boldsymbol{\theta}_{0}) \; d\boldsymbol{\theta} = \pi(\textbf{y} \,|\, \boldsymbol{\theta}_{0}, \mathcal{P}_{i}, \mathcal{M}). $$
(81)
In particular, if we consider the optimal parameter \(\boldsymbol {\theta }^{\dagger }_{i}\) for model \(\mathcal {P}_{i}\), then \(\pi (\textbf {y} \,|\, \mathcal {P}_{i}, \mathcal {M}) = \pi (\textbf {y} \,|\, \boldsymbol {\theta }_{i}^{\dagger }, \mathcal {P}_{i}, \mathcal {M})\). We can then take the ratio of posterior model plausibilities,
$$ \frac{\rho_{1}}{\rho_{2}} = \frac{\pi(\textbf{y} \,|\, \mathcal{P}_{1}, \mathcal{M})\, \pi(\mathcal{P}_{1} \,|\, \mathcal{M})}{\pi(\textbf{y} \,|\, \mathcal{P}_{2}, \mathcal{M})\, \pi(\mathcal{P}_{2} \,|\, \mathcal{M})} = \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})\, \pi(\mathcal{P}_{1} \,|\, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})\, \pi(\mathcal{P}_{2} \,|\, \mathcal{M})} = \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})}\, O_{12}, $$
(82)
where \(O_{12} = \pi (\mathcal {P}_{1} \,|\, \mathcal {M})/\pi (\mathcal {P}_{2} \,|\, \mathcal {M})\) is the prior odds ratio, often assumed to be one. With these assumptions in force, we present the following theorems.
Theorem 2.
Let (82) hold. If \(\mathcal {P}_{1}\) is more plausible than \(\mathcal {P}_{2}\) and \(O_{12} \leq 1\), then (80) holds.
Proof.
If \(\mathcal {P}_{1}\) is more plausible than \(\mathcal {P}_{2}\),
$$ 1 < \frac{\rho_{1}}{\rho_{2}} = \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})}\, O_{12} \leq \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})}. $$
(83)
Equivalently, the reciprocal of the far right-hand side is less than one, so
$$ \log \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})} < 0. $$
(84)
Since g(y) is a probability density, it is nonnegative (and positive on a set of positive measure). Thus
$$ g(\textbf{y}) \log \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})} \leq 0 \Rightarrow \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M})}{\pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M})} \; d\textbf{y} < 0. $$
(85)
This can be expanded into
$$ \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M}) \; d\textbf{y} - \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M}) \; d\textbf{y} < 0, $$
(86)
which means
$$ - \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}_{1}^{\dagger}, \mathcal{P}_{1}, \mathcal{M}) \; d\textbf{y} < - \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \pi(\textbf{y} \,|\, \boldsymbol{\theta}_{2}^{\dagger}, \mathcal{P}_{2}, \mathcal{M}) \; d\textbf{y}. $$
(87)
By adding the quantity \(\int _{\mathcal {Y}^{n}} g \log g \; d\textbf {y}\) to both sides, the desired result (80) immediately follows.
This theorem demonstrates that if model \(\mathcal {P}_{1}\) is “better” than model \(\mathcal {P}_{2}\) in the Bayesian sense, it is also a “better” deterministic model in the sense of (80). However, the reverse implication requires much stronger conditions. The assertion (80) can be equivalently written as
$$ \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})} \; d \textbf{y} < 0. $$
(88)
For this inequality to hold, the relationship
$$ \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})} < 1 $$
(89)
need not be true at every point \(\textbf {y} \in \mathcal {Y}^{n}\).
One perhaps naive way to proceed is to invoke the Mean Value Theorem: if \(\left| \mathcal {Y}^{n} \right| < \infty \) and under suitable smoothness conditions, there exists some \(\bar {\textbf {y}} \in \mathcal {Y}^{n}\) such that
$$ \int_{\mathcal{Y}^{n}} g(\textbf{y}) \log \frac{\pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\textbf{y} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})} \; d \textbf{y} = \left|\mathcal{Y}^{n}\right| g(\bar{\textbf{y}}) \log \frac{\pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})}. $$
(90)
Then, combining (88) and (90) yields
$$ \left|\mathcal{Y}^{n}\right| g(\bar{\textbf{y}}) \log \frac{\pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})} < 0. $$
(91)
Since \(\left| \mathcal {Y}^{n} \right| > 0\) and \(g(\bar{\textbf{y}})>0\),
$$ \log \frac{\pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})} < 0 \Rightarrow \frac{\pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}{ \pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})} < 1 \Rightarrow \frac{\pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})}{ \pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})} > 1. $$
(92)
If \(O_{12} \geq 1\),
$$ \frac{\pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{1}, \mathcal{P}_{1}, \mathcal{M})}{ \pi(\bar{\textbf{y}} \,|\, \boldsymbol{\theta}^{\dagger}_{2}, \mathcal{P}_{2}, \mathcal{M})}\, O_{12} > 1 \Rightarrow \frac{\rho_{1}}{\rho_{2}} > 1. $$
(93)
Thus \(\mathcal {P}_{1}\) is more plausible than \(\mathcal {P}_{2}\) for given data \(\bar {\textbf {y}}\).
In summary, we have:
Theorem 3.
If \(D_{KL}(g \,\|\, \pi (\textbf {y} \,|\, \boldsymbol {\theta }^{\dagger }_{1}, \mathcal {P}_{1}, \mathcal {M})) < D_{KL}(g \,\|\, \pi (\textbf {y} \,|\, \boldsymbol {\theta }^{\dagger }_{2}, \mathcal {P}_{2},\mathcal {M}))\), if \(\left| \mathcal {Y}^{n} \right| < \infty \), and if (90) holds, then there exists a \(\bar {\textbf {y}} \in \mathcal {Y}^{n}\) such that \(\mathcal {P}_{1}\) is more plausible than \(\mathcal {P}_{2}\), given that \(O_{12} \geq 1\).