Supervised statistical learning
We consider a model that returns a real-valued variable \(Y=u(X)\). An approximation v of the function u, also referred to as metamodel, can be obtained by minimizing the risk
$$\begin{aligned} {\mathcal {R}}(v) = \int _{\mathcal {X}}\ell (u(x),v(x)) \text {d} p(x) \end{aligned}$$
over a model class \({\mathcal {M}}\). The loss function \(\ell \) measures a distance between the observation u(x) and the prediction v(x). In the case of the least squares method, it is chosen as \(\ell (y,v(x))=(y-v(x))^2\).
Let \(S=\{(x^k,y^k):1\le k \le N\}\) be a sample of n realizations of (X, Y). In practice, the approximation is constructed by minimizing the empirical risk
$$\begin{aligned} {\mathcal {R}}_S(v)=\frac{1}{N}\sum _{k=1}^N \ell (y^k,v(x^k)), \end{aligned}$$
(10)
taken as an estimator of the risk. A regularization term R can be used for stability reasons when the training sample is small. An approximation \(\tilde{u}\) of u is then solution of
$$\begin{aligned} \min _{v\in {\mathcal {M}}} \frac{1}{N} \sum _{k=1}^N\left( y^k-v(x^k)\right) ^2+\lambda R(v), \end{aligned}$$
(11)
with the regularization parameter \(\lambda \ge 0\), chosen or computed. The accuracy of the metamodel \(\tilde{u}\) is estimated using an estimator of the \(L^2\) error. In practice, the number of numerical experiments is too small to sacrifice part of it for the error estimation. The error is thus estimated using a k-fold cross validation estimator and more specifically the leave-one-out cross validation estimator [4] which can be easily evaluated by constructing one single metamodel [13]. Cross validation estimators can be used for model selection.
In the following, we present a method to determine \(\tilde{u}(x) \) in a sequence of model classes
$$\begin{aligned} {\mathcal {M}}(m)=\left\{ v(x)=h(g_1(x), \ldots , g_m(x),\xi );\, h\in {\mathcal {M}}_h,\, g_i \in {\mathcal {M}}_g \text { for }i=1,\ldots , m\right\} \end{aligned}$$
(12)
with \({\mathcal {M}}_h\) a linear or multilinear model class and \(g_i\in {\mathcal {M}}_g\) a linear model class. We consider here multilinear models and more specifically the tensor subset \({\mathcal {T}}^T_r\) in order to handle the possibly high dimensional \(m+d\) problem. We briefly recall the learning algorithm in a tensor subset [12] and then present the automatic computation of the new variables \(z_i=g_i(x)\), \(i = 1,\ldots ,m\).
Learning with tensor formats
Let \(z \in {\mathbb {R}}^M\), an approximation of u in a tensor subset (2) can be obtained by minimizing the empirical least squares risk:
$$\begin{aligned} \min _{\mathbf {a}^1,\ldots , \mathbf {a}^L} \frac{1}{N} \sum _{k=1}^N \left( y^k-\Psi (z^k)(\mathbf {a}_1,\ldots ,\mathbf {a}_L)\right) ^2+\sum _{i=1}^L\lambda _i R_i(\mathbf {a}_i) \end{aligned}$$
(13)
where \(\lambda _i R_i(\mathbf {a}_i)\) are regularization functions. Problem (13) is solved using an alternating minimization algorithm which consists in successively solving an optimization problem on \(\mathbf {a}^j\)
$$\begin{aligned} \min _{\mathbf {a}_j} \frac{1}{N} \sum _{k=1}^N\left( y^k-\Psi (z^k)(\ldots ,\mathbf {a}_j,\ldots )\right) ^2 +\lambda _j R_j(\mathbf {a}_j) \end{aligned}$$
(14)
for fixed parameters \(\mathbf {a}_i\), \(i\ne j\). Introducing the linear map \(\Psi ^j(z)(\mathbf {a}_j)=\Psi (z)(\mathbf {a}_1,\ldots ,\mathbf {a}_L)\), problem (14) yields the following learning problem with a linear model
$$\begin{aligned} \min _{\mathbf {a}_j} \frac{1}{N} \sum _{k=1}^N\left( y^k-\Psi ^j(z^k)(\mathbf {a}_j)\right) ^2 +\lambda _j R_j(\mathbf {a}_j). \end{aligned}$$
(15)
If \(R_j(\mathbf {a}_j)=\left\| {\mathbf {a}_j}\right\| _1\), with \(\left\| {\mathbf {v}}\right\| _1=\sum _{i=1}^{\# v}\left| {v_i}\right| \) the \(\ell _1\)-norm, problem (15) is a convex optimization problem known as Lasso [14] or basis pursuit [15]. The \(\ell _1\)-norm is a sparsity inducing regularization function that yields a solution \(\mathbf {a}_j\) of (15) that may have coefficients equal to zero. The Lasso is solved using the modified least angle regression algorithm (LARS in [16]).
The algorithm to solve Problem (13) is described in [11] for the canonical tensor format, and in [12] for the tree-based tensor format, which is a special case of rank-structured tensor where T is a dimension partition tree. Adaptive algorithms are proposed to automatically select the tuple yielding a good convergence of the approximation with respect to its complexity. For tree-based tensor formats, at an iteration i, given an approximation \(v^i\) of u with T-rank \((r^i_\alpha )_{\alpha \in T}\), the strategy consists in estimating and studying the truncation error \(\min _{\mathrm{rank}_\alpha (v) \le r^i_\alpha } {\mathcal {R}}(v) - {\mathcal {R}}(u)\) for different \(\alpha \) in T, and choosing to increase the ranks \(r^i_\alpha \) associated with the indices \(\alpha \) yielding the highest errors. The algorithm and more details in the tree-based tensor format case can be found in [12, 17].
Learning method with automatic definition of new variables
We now present the method used to automatically search an adapted parametrization of the problem by looking for favored directions in the space of the \(d+1\) input variables. It consists in writing the approximation under the form (5) where \(g_n( x)\) can be represented with a parametrized linear map
$$\begin{aligned} g_n(x)=\mathbf {w}_n^\intercal \varvec{\varphi }(x)=\sum _{j=1}^{p}w_{n,j}\varphi _{j}(x) \end{aligned}$$
(16)
with \(\mathbf {w}_{n}=(w_{n,1},\ldots , w_{n,p})^\intercal \in {\mathbb {R}}^{p}\) the vector of parameters of the representation of \(g_n\) on an orthonormal functional basis \(\{\varphi _{j}\}_{j=1}^p\) of \({\mathcal {H}}\), and h a M-dimensional function in the model class of rank structured formats \({\mathcal {T}}_r^C\) or \({\mathcal {T}}_r^T\) that can be represented with a parametrized multilinear map
$$\begin{aligned} h(z) =\Psi (z)(\mathbf {a}^{1},\ldots ,\mathbf {a}^{L}) \end{aligned}$$
(17)
with parameters \(\mathbf {a}^l\), \(l = 1,\ldots ,L\). The new set of variables \(z=(z_1,\ldots ,z_m,\xi )\) is such that \(z_n=g_n(x)\), \(n=1,\ldots ,m\).
The method is a Projection Pursuit like method [8] that is generalized to a larger class of models for h than just additive models. It is shown in [18] that under reasonable conditions, for \(h\in {\mathcal {T}}^T_r\) and \(g\in {\mathcal {H}}\) we have \(h\circ g \in L^2_p({\mathcal {X}})\). The approximation \(\tilde{u} \) of the form (5) is thus parametrized as follows:
$$\begin{aligned} \tilde{u}(x)=\Psi \left( \mathbf {w}_1^\intercal \varvec{\varphi }(x),\ldots ,\mathbf {w}_m^\intercal \varvec{\varphi }(x),\xi \right) (\mathbf {a}^{1},\ldots ,\mathbf {a}^{L}). \end{aligned}$$
(18)
Let \((z_1,\ldots ,z_{m-1})\) be an initial set of variables. A new variable \(z_m=g_m(x)\) is introduced using Algorithm 1.
The parameters \(\mathbf {a}_{l}\), \(l = 1,\ldots ,L\), and \(\mathbf {w}_n\), \(n = 1,\ldots ,m\), solve the minimization problem (11) over the model class \({\mathcal {M}}(m)\):
$$\begin{aligned} \min _{\{\mathbf {a}_{l}\}_{l=1}^L, \{\mathbf {w}_n\}_{n=1}^m} \frac{1}{N} \sum _{k=1}^N\left( y^k-\Psi \left( \mathbf {w}_1^\intercal \varvec{\varphi }(x^k), \ldots ,\mathbf {w}_m^\intercal \varvec{\varphi }(x^k),\xi ^k\right) (\mathbf {a}^{1}, \ldots ,\mathbf {a}^{L})\right) ^2+ \sum _{i=1}^L\lambda _i R_i(\mathbf {a}_i), \end{aligned}$$
(19)
with \(x^k = (\xi ^k,t^k)\). The solution of this problem is found by alternatively solving
$$\begin{aligned} \min _{\mathbf {a}_{1},\ldots , \mathbf {a}_L} \frac{1}{N} \sum _{k=1}^N\left( y^k-\Psi \left( \mathbf {w}_1^\intercal \varvec{\varphi }(x^k),\ldots ,\mathbf {w}_m^\intercal \varvec{\varphi }(x^k),\xi ^k\right) (\mathbf {a}^{1},\ldots ,\mathbf {a}^{L})\right) ^2 +\sum _{i=1}^L\lambda _i R_i(\mathbf {a}_i) \end{aligned}$$
(20)
for fixed \((\mathbf {w}_1,\ldots ,\mathbf {w}_m)\) using a learning algorithm with rank adaptation [11, 12] and
$$\begin{aligned} \min _{\mathbf {w}_1\in {\mathbb {R}}^{p},\ldots , \mathbf {w}_m\in {\mathbb {R}}^{p}} \frac{1}{N} \sum _{k=1}^N\left( y^k-\Psi \left( \mathbf {w}_1^\intercal \varvec{\varphi } (x^k),\ldots ,\mathbf {w}_m^\intercal \varvec{\varphi }(x^k),\xi ^k\right) (\mathbf {a}^{1},\ldots ,\mathbf {a}^{L})\right) ^2 \end{aligned}$$
(21)
for fixed \((\mathbf {a}^{1},\ldots ,\mathbf {a}^{L})\). The optimization problem (21) is a nonlinear least squares problem that is solved with a Gauss-Newton algorithm. The overall algorithm is presented in Algorithm 2.
In step 5 of algorithm 2, the parametrization of the model is to be selected in a collection of parametrizations. In this paper, we consider the rank structured function \({\mathcal {T}}^T_r\) where T is a dimension partition tree over \(\left\{ 1,\dots ,m\right\} \), which corresponds to the model class of functions in tree-based tensor format [19], a particular class of tensor networks [20]. The new node associated with the new variable \(z_i=g_i(x)\) is added at the top of the tree. The representation of a function v in \({\mathcal {T}}^T_r({\mathcal {H}})\) requires the storage of a number of parameters that depends both on the collection T and on the associated T-rank r, as well as on the dimensions of the functional spaces in each dimension. To reduce the number of coefficients that need to be computed during the learning process of a tensor v, one can then represent \(v \in {\mathcal {T}}^T_r({\mathcal {H}})\) for different collections T and associated T-ranks r, choosing the ones yielding the smallest storage complexity. Furthermore, this adaptation can prove useful when dealing with changes of variables, as introduced in the previous subsection, because it can remove the difficulty of how to add a variable: no matter what the initial ordering of the variables is, an adaptation procedure may be able to find an optimal one yielding a smaller storage complexity. When T is a dimension partition tree, a stochastic algorithm is presented in [12] for trying to reduce the storage complexity of a tree-based tensor format at a given accuracy. This adaptation is not considered in the paper.
Change of variables for periodic functions
In this section we present a generalization of the Fourier series where one does not need a structured sample (e.g. a grid) in the variable t. Indeed when the function to approximate is known to be periodic with respect to t, the periodicity of the approximation can be forced. It is done on the one hand by introducing a functional basis \(\varphi \) such that \(g_n=\mathbf {w}_n^\intercal \varvec{\varphi }\) in (16) can be identified to (6) by choosing
$$\begin{aligned} \varphi _j(x)=\varphi _j^{(\xi )}(\xi ) t \end{aligned}$$
(22)
where \(\{\varphi _j^{(\xi )}\}\) is a d-dimensional tensorized orthogonal basis of \({\mathcal {H}}_{1}\otimes \dots \otimes {\mathcal {H}}_{d}\). The circular frequencies in (6) are then expressed as:
$$\begin{aligned} \omega _n(\xi )=\sum _{j=1}^{p}w_{n,j} \varphi ^{(\xi )}_{j}(\xi ) =\mathbf {w}_n^\intercal \varvec{\varphi }^{(\xi )}(\xi ). \end{aligned}$$
(23)
On the other hand, we choose bases of trigonometric functions \(\{\psi ^n_i\}_{i=1}^{P_n}\) for the representation of h in the dimensions associated with the new variables \(z_n\), \(n = 1,\ldots ,m\).
Let \(\textsf {T}_{max}\) be the maximal width of the observation interval in the dimension t. Supposing this interval is large enough to include the largest periods of the periodic functions that can be learned, the guarantee for the approximation not to have larger periods is to constrain the circular frequencies in (6) such that:
$$\begin{aligned} \omega _n \ge \frac{2 \pi }{\textsf {T}_{max}}, \, n=1,\ldots ,m. \end{aligned}$$
(24)
This constraint is imposed for all values taken by \(\xi \) in S. Using expression (23), it is recast under the form
$$\begin{aligned} -A \mathbf {w}\le -B \end{aligned}$$
(25)
where \(\mathbf {w}=(\mathbf {w}_1, \ldots ,\mathbf {w}_m)\in {\mathbb {R}}^{p\times m}\), \(A\in {\mathbb {R}}^{N\times p}\) is the array of evaluations of \(\varphi ^{(\xi )}_j(\xi )\) for the values \((x^k_{1},\ldots , x^k_{d})_{k=1}^N\) of \(\xi \) in the training set S:
$$\begin{aligned} A=\left[ \begin{matrix} \varphi _{1}^{(\xi )}(x_{1}^1,\ldots , x_{d}^1)&{} \ldots &{} \varphi _{p}^{(\xi )}(x_1^1,\ldots , x_{d}^1)\\ \vdots &{}\ddots &{}\vdots \\ \varphi _{1}^{(\xi )}(x_{1}^N,\ldots , x_{d}^N)&{} \ldots &{} \varphi _{p}^{(\xi )}(x_{1}^N,\ldots , x_{d}^N) \end{matrix}\right] , \end{aligned}$$
(26)
and \(B\in {\mathbb {R}}^{N\times m}\) is a full array with values \(-{2\pi }/{\textsf {T}_{max}}\). The optimization problem (21) for the computation of the parameters \(\mathbf {w}\) is replaced with the constrained optimization problem
$$\begin{aligned} \min _{\begin{array}{c} \mathbf {w}_1\in {\mathbb {R}}^{p},\ldots , \mathbf {w}_m\in {\mathbb {R}}^{p} \\ {-A \mathbf {w}\le -B} \end{array}} \frac{1}{N} \sum _{k=1}^N\left( y^k-\Psi \left( \mathbf {w}_1\varphi (x^k),\ldots ,\mathbf {w}_m\varphi (x^k),\xi ^k\right) (\mathbf {a}^{1},\ldots ,\mathbf {a}^{L})\right) ^2 \end{aligned}$$
(27)
which is solved with a NonLinear Programming (NLP) method.