Most real optimization problems are defined over a mixed search space where the variables are both discrete and continuous. In engineering applications, the objective function is typically calculated with a numerically costly black-box simulation. General mixed and costly optimization problems are therefore of great practical interest, yet their resolution remains largely an open scientific question. In this article, costly mixed problems are approached through Gaussian processes where the discrete variables are relaxed into continuous latent variables. The continuous space is more easily explored by classical Bayesian optimization techniques than a mixed space would be. Discrete variables are recovered either after the continuous optimization, or simultaneously with an additional continuous-discrete compatibility constraint that is handled with augmented Lagrangians. Several possible implementations of such Bayesian mixed optimizers are compared. In particular, the reformulation of the problem with continuous latent variables is put in competition with searches working directly in the mixed space. Among the algorithms involving latent variables and an augmented Lagrangian, particular attention is devoted to the Lagrange multipliers, for which a local and a global estimation technique are studied. The comparisons are based on the repeated optimization of three analytical functions and a beam design problem.

Introduction

A key task in engineering design is to find an optimal configuration from a very large set of alternatives. When the performance of the candidate solutions is measured through a realistic simulation, the numerical cost of the procedure becomes a bottleneck. The optimization of computationally expensive simulators is a topic widely studied in the literature Thi et al. [26].

In this work, we focus on Bayesian optimization (BO), which is particularly suitable for solving such problems Frazier [9]. Bayesian optimization is a sequential design strategy that requires a data-driven mathematical model or metamodel that provides predictions along with their uncertainty Bartz-Beielstein et al. [2]. The metamodel replaces some of the calls to the expensive simulation and is a key ingredient to the optimization of costly functions. An acquisition criterion Wilson et al. [29] aggregates the spatial predictions and uncertainties. The metamodel is trained from a reduced set of simulation data and the acquisition criterion is maximized to propose new configurations to be simulated at the next iteration. When the acquisition criterion is the expected improvement (EI), as first introduced in Mockus et al. [18], the BO algorithm is often called EGO (Efficient Global Optimization, Jones et al. [12]). EGO is currently a state-of-the-art approach to medium size, continuous and costly optimization problems, both from an empirical Le Riche and Picheny [14] and a theoretical point of view Vazquez and Bect [27].

However, in realistic settings, some of the decision variables are categorical. In structural design for example, the type of material, the number of components, and the choice between alternative technologies lead to discrete variables with no obvious distance between them. The combination of continuous and categorical variables defines a mixed optimization problem.

In non-costly cases, mixed optimization problems can be approached by Mixed-Integer NonLinear Programming Belotti et al. [4] (when the discrete variables are integers), by sampling based techniques such as evolutionary optimization Cao et al. [6], Emmerich et al. [8], Ocenasek and Schwarz [20] or by alternating mixed programming Audet and Dennis Jr [1].

When the objective function is costly, mixed optimization problems remain challenging and a topic for research. It is customary to replace some of the calls to the original objective function by calls to a (meta)model of it. Bartz-Beielstein and Zaefferer [3] provide an overview of metamodels that have been or can be used in optimization when the variables are continuous or discrete. Bayesian optimization methods, which rely on metamodels to save computations, have already been extended to mixed problems. This was made possible by the realization that GP kernels (covariance functions) in mixed variables can be created by composing continuous and discrete kernels. The acquisition function is defined over the same space as the objective function. Therefore, maximizing the acquisition function is also a mixed variables problem.

To the best of our knowledge, the first EGO-like algorithm for mixed variables was proposed in Hutter et al. [11]. In that article, the mixed kernel is a product of continuous and discrete Gaussian kernels, and random forests constitute an alternative choice of mixed metamodel. More precisely, the discrete kernel is a Gaussian of the integer or Hamming (also known as Gower) distance for ordinal or nominal variables, respectively. In Hutter et al. [11], the expected improvement is first optimized with a multi-start local search for both the continuous and discrete variables (thus a neighborhood for the discrete variables is defined), which is then complemented by a random search. This work was continued with the REMBO method in Wang et al. [28], where a random linear embedding is introduced to tackle high-dimensional problems. Discrete variables were relaxed into continuous variables thanks to a mapping function. The optimization of the acquisition function was made with a combination of the DIRECT and CMA-ES continuous global optimizers. Both Hutter et al. [11] and Wang et al. [28] were motivated by applications to the automatic configuration of algorithms. The goal of reaching very high dimensions (millions) probably forced the authors to use isotropic kernels as a way to keep the number of hyper-parameters low (only one length-scale for all dimensions).

A Bayesian mixed optimizer is presented in Pelamatti et al. [21]. The GP kernels are products of continuous and discrete kernels. Different discrete kernels are compared, namely the homo- and hetero-scedastic hypersphere decomposition and the compound symmetric kernels. The optimization of the acquisition function is performed with a genetic algorithm in mixed variables. A similar BO with mixed kernel is described in Zuniga and Sinoquet [33], but the expected improvement is optimized with the mixed version of the MADS algorithm Audet and Dennis Jr [1] and the neighborhood of the categorical variables is defined through a probabilistic model.

Random forests can replace the kriging model in BO with mixed inputs as they natively have a measure of prediction uncertainty. Such an implementation, first done in Hutter et al. [11], is part of the mlrMBO R package Bischl et al. [5], in conjunction with several acquisition criteria that can be optimized with a “focus-search” algorithm. The focus-search algorithm hierarchically samples the search space of the chosen acquisition criterion.

Recent developments in metamodels involving mixed variables show that it is possible to map the categorical variables into quantitative non-observed latent variables that are then considered as continuous Zhang et al. [31]. Whenever it is possible to write a model of the studied system, quantitative latent variables exist that describe the effects of the categorical variables. Typically, there are more latent variables than categorical variables. The existence of continuous latent variables can sometimes be established from the physics of the considered phenomena, e.g. in material science Zhang et al. [32]. In structural mechanics for example, if the categorical variable describes the shape and the material of an element loaded in flexion, its bending moment of inertia is a candidate latent variable. Latent variables can emulate the properties of the original categorical variables, in particular within the metamodel, and open the way to reasoning with continuous quantities: the kernels of the Gaussian processes can be taken as continuous, and gradients and neighborhoods are naturally defined during the optimization. On the contrary, categorical variables and their inherent lack of distance definition are the cause of complications in the kernel definition and in the optimization.

This article presents a new Bayesian optimization algorithm for mixed variables called LV-EGO (for Latent Variable EGO). Our contribution with respect to Zhang et al. [32] is that the continuity of the latent variables is also taken advantage of during the optimization of the acquisition criterion. This implies that categorical variables must be recovered from the continuous latent variables proposed by the optimizer, which creates a new “pre-image” problem.

“Problem statement and background” introduces the problem and the principles of Bayesian optimization. In “EGO with latent variables”, several variants of LV-EGO are described. They differ in the handling of the relationship between the categorical and the latent variables: the “vanilla” LV-EGO just recovers categorical variables after the optimization while augmented Lagrangian versions account for the link during the optimization through constraints. “Description of the numerical experiments” presents a set of benchmarks comparing our method to other state-of-the-art techniques. One of the benchmarks is a beam design problem and gives the opportunity to discuss the interpretation of the latent variables. Finally, “Conclusions and perspectives” offers conclusions and perspectives to this work.

Problem statement and background

We consider the problem of minimizing a function y(x, u) depending on a vector of continuous variables \(x = (x_1, \dots , x_{n_{c}})\) and a vector of discrete variables \(u = (u_1, \dots , u_{n_{d}})\), where each \(u_i\) has \(m_{i}\) levels encoded \(1, \dots , m_{i}\). We denote \({\mathcal {X}} \) the domain of definition for the continuous inputs, typically, after rescaling, the hypercubic domain \([0, 1]^{n_{c}}\). Similarly, we denote \({\mathcal {U}} = \prod _{j=1}^{n_{d}} \{1, \dots , m_j\}\) the domain of definition for the discrete inputs. \({\mathcal {X}} \times {\mathcal {U}} \) is the set of the mixed optimization variables.

We focus on costly functions, meaning that each evaluation of y is time-consuming, and we aim at minimizing y with a tiny budget of evaluations. In this context, minimizing directly y is hardly possible. An alternative is to use Bayesian optimization (BO). In BO approaches, there are two main ingredients: a Gaussian process (GP) serving as a fast proxy, often called metamodel, built from the current learning set, and a sampling criterion, often called acquisition criterion, used to update the learning set with a new data point computed with y. A famous acquisition criterion is the expected improvement (EI). In that case, the BO approach is often called Efficient Global Optimization (EGO) algorithm.

To be more precise, let \(({{X}},{{U}}) = \{(x,u)^{(1)}, \dots , (x,u)^{(t)} \} \in ({\mathcal {X}} \times {\mathcal {U}})^t\) be a design of experiments (DoE), and \(y_i = y(x ^{(i)},u ^{(i)})\) be the corresponding function evaluations (\(i=1, \dots , t\)). Let \(y_{\min } = \min (y_1, \dots , y_t)\) be the current minimum. Let us now assume that y is a particular realization of a GP Y defined on \({\mathcal {X}} \times {\mathcal {U}} \). In that case, the EI criterion is defined by

$$\mathrm {EI}^{(t)}(x,u) = \mathrm {E}\left[ \max \left( 0,\, y_{\min } - Y^t(x,u)\right) \right] ,$$

where \(Y^t\) is the conditional GP knowing the observations:

$$Y^t(x,u) = \left[ \, Y(x,u) \mid Y(x ^{(i)},u ^{(i)}) = y_i,\ i=1, \dots , t \, \right] .$$

Notice that \(\mathrm {EI}^{(t)}(x,u)\) is large when there is a good chance that \(Y^t(x,u)\) is smaller than \(y_{\min }\). This may occur when exploiting promising areas, i.e. when \(\mathrm {E}[Y^t(x,u)]\) is close to \(y_{\min }\), or when exploring unvisited areas, i.e. when the variance of \(Y^t(x,u)\) is large compared to \((\mathrm {E}[Y^t(x,u)]-y_{\min })^2\). The idea of EGO is to evaluate y at a new point maximizing the EI criterion until a stopping criterion is reached. See Algorithm 1 for a synthetic description of the EGO algorithm when the stopping criterion is a maximum number of evaluations of y, noted \(\textit{budget}\). A maximum budget is the logical stopping criterion in our context of costly optimization. Other stopping conditions are possible in the form of lower bounds on the acquisition criteria (expected improvement, knowledge gradient Frazier [9], ...), i.e., minimal measures of progress below which the search should stop. In line 9, the solution returned by the algorithm is the best point of the last DoE, \(({{X}},{{U}})\).
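The trade-off between exploitation and exploration described above can be made concrete with the closed form that the expected improvement takes when the prediction at a point is Gaussian. The sketch below is only illustrative: the function name `expected_improvement` and the numerical values are assumptions, and the predictive mean and standard deviation would come from the conditional GP in practice.

```python
import math

def expected_improvement(mean, std, y_min):
    """Closed-form EI for minimization under a Gaussian predictive distribution."""
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(0.0, y_min - mean)
    z = (y_min - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (y_min - mean) * cdf + std * pdf

# Exploitation: predicted mean below the current minimum, small uncertainty.
ei_exploit = expected_improvement(mean=0.8, std=0.05, y_min=1.0)
# Exploration: mean above the current minimum, but large uncertainty.
ei_explore = expected_improvement(mean=1.5, std=1.0, y_min=1.0)
# Neither: mean above the current minimum and small uncertainty.
ei_poor = expected_improvement(mean=1.5, std=0.05, y_min=1.0)
```

Both the exploitation and the exploration points receive a much larger EI than the third one, which is the behaviour the criterion is designed to have.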

This EGO algorithm has been intensively studied for minimizing nonlinear functions that are expensive to evaluate in the case \({\mathcal {U}} =\emptyset \), i.e. when all input variables are continuous (see Le Riche and Picheny [14] for numerical illustrations of its efficiency). The application of this algorithm in the presence of categorical variables is much less documented (see e.g. Pelamatti et al. [21], Zuniga and Sinoquet [33]), which can be explained by two main difficulties. The first one is the estimation of covariance kernels on mixed spaces. Indeed, multi-dimensional covariance functions are often built by combining one-dimensional ones. Therefore, covariance functions on mixed spaces can be obtained by combining covariance functions on \({\mathcal {X}} \) and \({\mathcal {U}} \):

$$k\big( (x,u),(x',u')\big) = k^x_1(x_1,x'_1) * \cdots * k^x_{n_c}(x_{n_c},x'_{n_c}) * k^u_1(u_1,u'_1) * \cdots * k^u_{n_d}(u_{n_d},u'_{n_d}), \tag{1.1}$$

where \(k^x_1,\ldots ,k^x_{n_c},k_1^u,\ldots ,k_{n_d}^u\) are covariance functions and \(*\) is an operation that preserves positive definiteness, such as sum or product. If we focus on the single categorical variable \(u_j\) with levels \(1, \dots , m_{j}\), we can identify the covariance function \(k_j^u\) with an \((m_j\times m_{j})\)-dimensional positive semidefinite matrix \({\mathbf {T}}\), such that for all \(1\le k,\ell \le m_{j}\),

$$k_j^u(k,\ell ) = {\mathbf {T}}_{k,\ell }. \tag{1.2}$$

This means that \(\sum _{j=1}^{n_d}m_{j}(m_{j}+1)/2\) coefficients need to be estimated to determine a covariance on \({\mathcal {U}} \) in the general case. That number can be large when the \(m_j\) are large, which very often makes this estimation very difficult in practice. Furthermore, the associated estimation problem is often harder than the box-constrained one met with continuous variables. Indeed it is either constrained by the positive definiteness of \({\mathbf {T}}\), which is non-linear, or defined on a manifold if \({\mathbf {T}}\) is parameterized in spherical coordinates. We refer to Roustant et al. [25] for more details and other parsimonious representations of \(k_j^u\), which can reduce but not totally fix these issues. The second reason that can explain the small number of direct applications of the EGO algorithm to mixed spaces is the difficult maximization of the expected improvement, i.e. the search for the new input points at which to call the function y, which are solutions of:

$$(x ^{t+1},u ^{t+1}) \in \mathop {\mathrm {arg\,max}}_{(x,u) \in {\mathcal {X}} \times {\mathcal {U}}} \mathrm {EI}^{(t)}(x,u). \tag{1.3}$$

Indeed, classical optimization algorithms on continuous spaces usually try to exploit information related to the gradient of the function to be maximized, as well as notions of proximity in the space of the inputs. However, these two notions are difficult to exploit when dealing with categorical inputs, i.e. without any a priori ordering between the input instances. To circumvent this difficulty, a naive approach of resolution would consist in no longer considering a single maximization problem on \({\mathcal {X}} \times {\mathcal {U}} \), but the resolution in parallel of \(\prod _{j=1}^{n_d}m_{j}\) maximization problems on \({\mathcal {X}} \), i.e. one problem per combination of instances of the categorical inputs u. Such an approach is not tractable when the number of optimization problems to be solved becomes large, which has motivated the definition of heuristics, such as evolutionary algorithms Li et al. [15], Cao et al. [6], Lin et al. [16], which seek to concentrate the searches only on the interesting instances of u. These approaches still rely on a large number of calls to the function to be optimized, and their convergence is not always easy to quantify.
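To see why the naive approach does not scale, the following sketch enumerates every combination of categorical levels and solves one continuous problem per combination. It is an illustration under stated assumptions: `naive_mixed_maximize` and `toy_acq` are hypothetical names, the toy acquisition is not an expected improvement, and plain random sampling stands in for any continuous optimizer.

```python
import itertools
import random

def naive_mixed_maximize(acq, levels, n_cont, n_samples=200, seed=0):
    """Maximize acq(x, u) by solving one continuous problem per combination of u.

    Each continuous problem is solved by random sampling in [0, 1]^n_cont,
    a stand-in for any continuous optimizer.
    """
    rng = random.Random(seed)
    best = (float("-inf"), None, None)
    n_problems = 0
    for u in itertools.product(*[range(1, m + 1) for m in levels]):
        n_problems += 1
        for _ in range(n_samples):
            x = [rng.random() for _ in range(n_cont)]
            value = acq(x, u)
            if value > best[0]:
                best = (value, x, u)
    return best, n_problems

# Hypothetical toy acquisition with two categorical variables (3 and 4 levels).
def toy_acq(x, u):
    return -(x[0] - 0.5) ** 2 - (u[0] - 2) ** 2 - (u[1] - 3) ** 2

(best_value, best_x, best_u), n_problems = naive_mixed_maximize(
    toy_acq, levels=[3, 4], n_cont=1
)
```

Here only 3 x 4 = 12 continuous problems are solved, but the count is the product of the numbers of levels and therefore grows exponentially with the number of categorical variables.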

Because mixed optimization problems are difficult, an alternative approach is proposed in the rest of this paper. It is based on the possibility to relax the discrete variables into continuous latent variables, therefore benefiting from the more efficient search mechanisms that exist in continuous spaces (e.g. gradients).

EGO with latent variables

Latent variables

For an easier handling of categorical inputs, it was proposed in Zhang et al. [31] to replace each categorical input \(u_j\) by a vector of \(q_j\ge 1\) continuous inputs with values in \({\mathbb {R}}^{q_j}\), noted \(\ell _j\). To give an intuition of the underlying idea in the automotive domain, a category of lubricant may be determined by physical continuous features such as boiling temperature, viscosity, etc., that act as latent variables. In structural mechanics, the shape of a load carrying structure, which is categorical, has underlying continuous flexural and membrane moments that drive its behavior. This amounts to associating to the Gaussian process (GP) Y a new GP \({\widetilde{Y}}\), such that for each instance u of the categorical inputs there exists a particular value of \(\ell {:}{=}(\ell _1,\ldots ,\ell _{n_{d}})\in {\mathcal {L}} \subset {\mathbb {R}}^{q_1}\times \cdots \times {\mathbb {R}}^{q_{n_d}}\), which is called latent variable, allowing us to write:

$$Y(x,u) = {\widetilde{Y}}(x,\ell ). \tag{2.1}$$

An important point is that the values of \(\ell \) are unobserved and therefore \({\widetilde{Y}}\) is unknown. Nevertheless, in order to replace the EI maximization problem on \({\mathcal {X}} \times {\mathcal {U}} \) by a new optimization problem on \({\mathcal {X}} \times {\mathcal {L}} \), a precise knowledge of \({\widetilde{Y}}\) is not necessary. Indeed, assuming that kernels for mixed inputs are built by combining 1-dimensional ones as in (1.1), it is sufficient to identify, for each variable \(u_j\), a mapping \(\phi _j\) from \(\{ 1,\ldots ,m_{j}\}\) to \({\mathbb {R}}^{q_j}\) such that

$$k_j^u(u_j,u_j') = k_j\big( \phi _j(u_j),\phi _j(u_j')\big) , \tag{2.2}$$

where \(k_j\) is a continuous kernel on \({\mathbb {R}}^{q_j} \times {\mathbb {R}}^{q_j}\). Thus, it is not so much the values of \(\phi _j(u_j)\) that are important, but their relative positions in \({\mathbb {R}}^{q_j}\), in order to allow a reasonable reconstruction of the dependency structure between Y(x, u) and \(Y(x',u')\).

According to the work of Zhang et al. [31], it appears that interesting mappings can be obtained by likelihood maximization and that relatively small values of \(q_j\) can give a satisfying reconstruction. Following their recommendations, \(q_j\) can be chosen equal to 1 if \(m_{j}\le 3\) and to 2 otherwise, which will be the values chosen in the rest of this paper. We denote by \(n_{\ell }=\sum _{j=1}^{n_d}q_j\) the total number of latent variables. Following Roustant et al. [25], the continuous kernel \(k_j\) associated to the latent variables was chosen as the dot product kernel \(k_j(t, t') = \langle t, t' \rangle \). The corresponding covariance matrix is then low-rank, and it provided better performance than the Gaussian kernel in the examples considered in the latter reference.
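The choices above can be illustrated with a few lines of code: the rule for \(q_j\), the total number of latent variables, and the categorical covariance matrix induced by a dot product kernel on the latent images. This is a sketch only. The level counts and the latent coordinates in `phi` are hypothetical (in the paper they are estimated by likelihood maximization), and `det3` is a small helper added here to exhibit the rank deficiency.

```python
def latent_dim(m):
    """Rule quoted in the text: q = 1 if the variable has at most 3 levels, else 2."""
    return 1 if m <= 3 else 2

def dot_kernel_matrix(phi):
    """Categorical covariance induced by latent images: T[k][l] = <phi_k, phi_l>."""
    return [[sum(a * b for a, b in zip(pk, pl)) for pl in phi] for pk in phi]

def det3(a):
    """Determinant of a 3x3 matrix (used only to illustrate rank deficiency)."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

levels = [3, 4, 5]                                   # hypothetical m_j values
n_latent = sum(latent_dim(m) for m in levels)        # total number of latent variables
n_general = sum(m * (m + 1) // 2 for m in levels)    # coefficients of unstructured T matrices
# Raw count of latent coordinates; identifiability constraints may reduce it.
n_mapping = sum(m * latent_dim(m) for m in levels)

# Hypothetical latent images of a 4-level variable in R^2.
phi = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
T = dot_kernel_matrix(phi)
# Any 3x3 principal block of T is singular because the images live in R^2.
sub = [row[1:] for row in T[1:]]
sub_det = det3(sub)
```

With these hypothetical level counts, the latent parametrization involves fewer numbers than the unstructured covariance matrices, and the resulting matrix has rank at most \(q_j\), which is the low-rank property mentioned above.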

This new parametrization leads us to the following adaptation of the EI maximization problem defined by Eq. (1.3), which we name the acquisition problem as it allows us to acquire a new point to evaluate:

$$\max _{x \in {\mathcal {X}},\ \ell \in {\mathcal {L}}} \widetilde{\mathrm {EI}}^{(t)}(x,\ell ) \quad \text {such that} \quad \ell \in \phi ^{(t)}({\mathcal {U}}). \tag{2.3}$$

Here, \(\widetilde{\mathrm {EI}}^{(t)}\) is the expected improvement associated with GP \({\widetilde{Y}}\) at iteration t, \(\phi ^{(t)}\) is the vector-valued mapping from \(\prod _{j=1}^{n_d}\{1,\ldots ,m_j\}\) to \({\mathbb {R}}^{q_1}\times \cdots \times {\mathbb {R}}^{q_{n_d}}\) at iteration t, and the constraint on the values of \(\ell \) is driven by the fact that the values of the latent variables at the new point have to remain compatible with the current mapping functions.

We follow two paths to solve this acquisition problem. In the vanilla LV-EGO approach, which is described next, the EI maximization and the latent-discrete compatibility constraint are addressed one after the other. Alternatively, with the augmented Lagrangian approaches, which will be described in “LV-EGO algorithms with Augmented Lagrangian”, the full constrained optimization problem is treated.

The vanilla LV-EGO algorithm

At each iteration, the vanilla LV-EGO algorithm first maximizes EI in a relaxed, fully continuous, formulation where the discrete variables are replaced by relaxed continuous latent variables. Then, a pre-image problem is solved where EI is maximized over the discrete variables only, the continuous variables being fixed at their value of the relaxed problem. The LV-EGO methodology is summarized in Algorithm 2.

The main difference with the generic Bayesian Algorithm 1 is the new discrete pre-image problem in line 6. Notice that the pre-image is formulated in terms of the EI objective, as opposed to a more arbitrary criterion such as the distance between the latent variables and the image of a discrete level. Solving the pre-image in terms of the iterative figure of merit, the expected improvement, is meant to provide a gain in efficiency with respect to a pre-image minimizing a Euclidean distance between the map of a discrete level and the latent variables. In the particular situation where the latent variable \(\ell ^{t+1}\) coincides with the image of a discrete level, both approaches yield the same result since \(\ell ^{t+1}\) is a maximizer of EI (see line 5 of Algorithm 2). In terms of implementation, the EI maximization (line 5) is done with the COBYLA algorithm, a gradient-free non-linear optimization technique Powell [23]. Since COBYLA is a local optimizer and the EI is a multimodal function, the maximization is repeated from randomly chosen initial points and the best result is kept. The number of restarts, 10, is larger than the maximum dimension of the test cases studied in this article and than the default (3) of the kergp package. An exhaustive search is carried out for the EI maximization of the pre-image problem (line 6).
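The two-step acquisition of the vanilla algorithm (relaxed continuous maximization, then exhaustive pre-image at fixed continuous variables) can be sketched as follows. Several elements are assumptions made for the illustration: a multi-start random local search replaces the COBYLA restarts, `toy_acq` is a smooth stand-in for the expected improvement, the latent domain is taken as [0, 1], and `phi` is a hypothetical one-dimensional latent mapping for a 3-level variable.

```python
import random

def relaxed_maximize(acq, n_cont, n_lat, n_starts=10, n_steps=200, step=0.1, seed=0):
    """Multi-start random local search on [0, 1]^(n_cont + n_lat).

    Stand-in for the restarted COBYLA runs; returns the best (x, latent) found.
    """
    rng = random.Random(seed)
    dim = n_cont + n_lat
    best_z, best_val = None, float("-inf")
    for _ in range(n_starts):
        z = [rng.random() for _ in range(dim)]
        val = acq(z[:n_cont], z[n_cont:])
        for _ in range(n_steps):
            cand = [min(1.0, max(0.0, zi + rng.uniform(-step, step))) for zi in z]
            cand_val = acq(cand[:n_cont], cand[n_cont:])
            if cand_val > val:          # keep only improving moves
                z, val = cand, cand_val
        if val > best_val:
            best_z, best_val = z, val
    return best_z[:n_cont], best_z[n_cont:]

def preimage(acq, x, phi):
    """Exhaustive pre-image: the level whose latent image maximizes acq at fixed x."""
    return max(phi, key=lambda u: acq(x, phi[u]))

# Hypothetical stand-ins for the acquisition criterion and the latent mapping.
def toy_acq(x, lat):
    return -(x[0] - 0.3) ** 2 - (lat[0] - 0.8) ** 2

phi = {1: [0.0], 2: [0.5], 3: [1.0]}

x_new, lat_new = relaxed_maximize(toy_acq, n_cont=1, n_lat=1)   # relaxed continuous step
u_new = preimage(toy_acq, x_new, phi)                            # discrete recovery step
```

In this toy case the relaxed optimum has a latent value of about 0.8, which is not the image of any level, and the pre-image step returns the level whose image gives the best acquisition value at the fixed \(x\).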

A comparison of the numerical complexities of the vanilla LV-EGO (Algorithm 2) and the generic EGO (Algorithm 1) shows that the cost of the latent variables is limited. Let us consider that the discrete space can be searched essentially by enumeration in \({\mathcal {O}}({{\,\mathrm{card}\,}}{\mathcal {U}}) = {\mathcal {O}}(\prod _{i=1}^{n_{d}} m_{i})\) operations (where \(m_{i}\) is the number of levels per discrete variable) while a continuous space can be searched more efficiently in linear time. At each iteration, the Bayesian algorithms of this paper have three steps: first a GP is learned, then an acquisition criterion (EI for now and an augmented Lagrangian later) is maximized and finally a pre-image problem is solved. In the vanilla LV-EGO algorithm, these steps take place at lines 4, 5 and 6 of Algorithm 2, respectively. Table 1 summarizes the number of operations per step. The number of operations for learning the GPs is proportional to the cube of the number of points evaluated (t) because of the inversions of the covariance matrices, times the number of (continuous) parameters of the GP for the likelihood maximization.

The two other steps, the acquisition and the pre-image, imply predictions by the GP in \(t^2\) operations times a number of operations that depends on the specific algorithm. Comparing in Table 1 the column of the generic EGO with that of the vanilla LV-EGO, and assuming that \(m_{i}=m\) for all \(i\) to keep the discussion simple, it can be seen that learning the latent variables induces a slight extra cost. When \(q=2\), which is our default here, this extra cost is \(n_{d}\times m \times t^3\) operations. Setting \(q=1\) would not add any cost to the learning. An advantage, which comes from the sequential resolution of the mixed problem, occurs in the maximization of the acquisition criterion when \(n_{c}+ q\times n_{d}\times m < m^{n_{d}} \times n_{c}\), at the cost of an additional pre-image problem to solve. Thus, once the latent variables are estimated, LV-EGO will be faster than a mixed EGO if \(m^{n_{d}} + n_{c}+ q\times m \times n_{d}< m^{n_{d}} \times n_{c}\), which happens frequently (take for example \(n_{c}=4,n_{d}=2, m=10, q=2\), for which \(144 < 400\)).
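The inequality above is easy to check numerically. The short sketch below evaluates both sides for the example quoted in the text; the function name `acquisition_costs` and the split into three terms are only a reading of the inequality, with the common \(t^2\) prediction factor left out.

```python
def acquisition_costs(n_c, n_d, m, q):
    """Operation counts, up to the common t^2 prediction factor."""
    mixed = m ** n_d * n_c            # mixed-space search: one continuous search per level combination
    latent_acq = n_c + q * n_d * m    # LV-EGO: relaxed continuous acquisition
    preimage = m ** n_d               # LV-EGO: exhaustive discrete pre-image
    return mixed, latent_acq, preimage

mixed, latent_acq, preimage = acquisition_costs(n_c=4, n_d=2, m=10, q=2)
lv_total = latent_acq + preimage      # total LV-EGO count after the GP is learned
```

For this example the latent-variable formulation needs 144 units against 400 for the mixed search, and the gap widens quickly as \(n_c\) or \(n_d\) grows.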

LV-EGO algorithms with Augmented Lagrangian

A possible pitfall of the vanilla LV-EGO detailed in Algorithm 2 is that the link between the discrete variables \(u\) and their relaxed continuous counterparts \(\ell \) is lost when maximizing the EI in line 5. Recovering it during the discrete pre-image problem, where \(x \) is fixed to a value that is optimal in the relaxed formulation but possibly non-optimal with respect to the mixed problem (1.3), may yield a sub-optimal solution. For this reason, we now propose LV-EGO algorithms that account for the discreteness constraint during the optimization using augmented Lagrangians.

In that prospect, notice that problem (2.3) can be approximated as an optimization problem with an inequality constraint:

(2.4)

where \(\epsilon \) is a small positive relaxation constant and \(\Vert \cdot \Vert \) the Euclidean norm. In this reformulation, called relaxed acquisition problem, notice the \(\log \) scaling of the EI which does not change the solution but improves the conditioning of the problem. Two values of \(\epsilon \) will be discussed in the sequel: \(\epsilon =0\), in which case the constraint becomes an equality constraint, \(g^{(t)}(\ell )=0\), and \(\epsilon >0\) but small, which corresponds to a relaxation of the equality. In the sequel, \(\epsilon \) is normalized with respect to the size of the vector of latent variables and set to \(\epsilon =0.01\).

The constrained optimization problem (2.4) is solved through an augmented Lagrangian approach Minoux [17], Nocedal and Wright [19]. The augmented Lagrangian is that of Rockafellar [24] which, specified for Problem (2.4), is,

When \(\epsilon =0\), the constraint \(g^{(t)}(\ell ) \le 0\) becomes an equality constraint, \(g^{(t)}(\ell )=0\). In this case, the augmented Lagrangian connected to that of Rockafellar is that of Hestenes [10] and takes the form

Complementary explanations about the augmented Lagrangians are given in the Appendix.
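For readers who want to experiment, the two kinds of augmented Lagrangians mentioned above can be written in a few lines for a single constraint. The forms below are common textbook versions with a \(\rho /2\) penalty scaling; this scaling is an assumption, and the expressions used in the paper may differ by constant factors. The names `al_inequality` and `al_equality` are illustrative, and `f`, `g`, `h` are the objective and constraint values at one point.

```python
def al_inequality(f, g, lam, rho):
    """Augmented Lagrangian for one constraint g <= 0 (textbook form, rho/2 scaling)."""
    shifted = max(0.0, lam + rho * g)
    return f + (shifted ** 2 - lam ** 2) / (2.0 * rho)

def al_equality(f, h, lam, rho):
    """Augmented Lagrangian for one constraint h = 0 (textbook form, rho/2 scaling)."""
    return f + lam * h + 0.5 * rho * h ** 2

# Violated constraint (g > 0): linear and quadratic penalty terms are both active.
active = al_inequality(f=1.0, g=0.5, lam=2.0, rho=4.0)
# Largely satisfied constraint (g << 0): the correction saturates at -lam^2 / (2 rho).
inactive = al_inequality(f=1.0, g=-10.0, lam=2.0, rho=4.0)
```

When the constraint is active, the inequality form coincides with the equality form, which is one way to see the connection between the two Lagrangians.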

Augmented Lagrangians require specifying the values of the Lagrange multiplier, \(\lambda \), and of the penalty parameter, \(\rho \). The general principle to fix them is to calculate the generalized Lagrange multiplier with a dual formulation Minoux [17]: the dual function \(D ^{(t)}\) is maximized with respect to the multiplier \(\lambda \) while the penalty parameter \(\rho \) should take the smallest value that allows one to find feasible solutions,

There are two logics to solve Problem (2.7), both of which have been investigated in this study. Following an idea presented in Le Riche and Guyon [13] for classical Lagrangians, we first propose to approximate the dual function \(D ()\) as the lower front of the augmented Lagrangians of a finite set of calculated points. The approximated dual is

where \(({{X}} ',{{{L}}} ')\) is a DoE that should not be mistaken for \(({{X}},{{U}})\), the DoE of the original expensive problem. \((\lambda _t,\rho _t,x ^t,\ell ^t)\) comes from solving Problem (2.7) with minimizations over the finite set \(({{X}} ',{{{L}}} ')\) instead of the initial \({\mathcal {X}} \times {\mathcal {L}} \). Since the functions in Problem (2.4) are not costly, \(({{X}} ',{{{L}}} ')\) can be quite large. This approach is called global dual as a global approximation to the dual function is built and maximized. It applies to very general functions, e.g., non-differentiable functions. Another advantage of this approach is to allow large changes in the dual space. Figure 10 provides an illustration of the approximated dual function and the effect of \(\rho \) on the dual problem. The sketch is done for an inequality constraint, yet it also stands with marginal changes for an equality (cf. the Appendix and the caption to the Figure). Under the non-restrictive hypothesis that there is a \(\rho \) beyond which the solution to the primal problem (2.4) maximizes the dual function, maximizing the dual function preserves the global aspect of the search. However, the accuracy of the obtained \((\lambda _t,\rho _t)\)’s will depend on the DoE. Because there is only one constraint in the current problem and evaluating it does not require calling the costly function, the maximization on \(\lambda \) and \(\rho \) is done by enumeration on a \(100 \times 20\) grid and \(({{X}} ',{{{L}}} ')\) is a 100-point LHS sample.
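The global dual idea can be sketched on a one-dimensional toy problem: the dual function is approximated by the lower front of the augmented Lagrangians over a finite set of points, and \((\lambda ,\rho )\) is selected by enumeration on a grid. Everything specific here is an assumption made for illustration: the toy objective and constraint, the regular grid of points that replaces the LHS sample, the small grids for \(\lambda \) and \(\rho \), the \(\rho /2\) scaling of the augmented Lagrangian, and the rule "keep the smallest \(\rho \) whose dual maximizer yields a feasible point", which is only one possible reading of the principle stated above.

```python
def aug_lagrangian(f, g, lam, rho):
    """Textbook augmented Lagrangian for one constraint g <= 0 (rho/2 scaling)."""
    shifted = max(0.0, lam + rho * g)
    return f + (shifted ** 2 - lam ** 2) / (2.0 * rho)

def global_dual(points, f, g, lambdas, rhos):
    """Grid search on (lambda, rho) over a dual approximated from a finite set of points."""
    for rho in sorted(rhos):                      # try the smallest penalty first
        best = None
        for lam in lambdas:
            # Approximated dual: lower front of the augmented Lagrangians over the points.
            z_min = min(points, key=lambda z: aug_lagrangian(f(z), g(z), lam, rho))
            dual = aug_lagrangian(f(z_min), g(z_min), lam, rho)
            if best is None or dual > best[0]:
                best = (dual, lam, z_min)
        if g(best[2]) <= 0.0:                     # first rho whose dual maximizer is feasible
            return best[1], rho, best[2]
    return best[1], rho, best[2]                  # fallback: largest rho tried

# Toy problem: minimize (z - 0.2)^2 subject to 0.6 - z <= 0 on [0, 1].
points = [i / 100 for i in range(101)]
f = lambda z: (z - 0.2) ** 2
g = lambda z: 0.6 - z
lambdas = [0.1 * i for i in range(21)]
rhos = [0.1, 1.0, 10.0]

lam_star, rho_star, z_star = global_dual(points, f, g, lambdas, rhos)
```

On this convex toy problem the smallest penalty already suffices: the selected multiplier is the one satisfying the stationarity condition at the constrained optimum \(z = 0.6\).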

The other path to updating the multipliers is to progressively change them based on the minimizers of the augmented Lagrangian at the current step. This updating can be seen as a step in the dual space, which makes it general, although it is usually justified by analogy with the Karush-Kuhn-Tucker optimality conditions Nocedal and Wright [19], which add unnecessary conditions (like differentiability), cf. the Appendix. Let \((x ^t,\ell ^t)\) be a solution to

The update scheme based on Eqs. (2.10) and (2.11) is called local dual as a local step in the dual \((\lambda ,\rho )\) space is taken.
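As a rough illustration of what a local step in the dual space looks like, the sketch below applies the classical method-of-multipliers update to a single constraint. It is not a transcription of Eqs. (2.10) and (2.11): the \(\rho /2\) penalty scaling, the `growth` factor and the rule for increasing \(\rho \) are assumptions taken from standard textbook practice, and the function name `local_dual_update` is hypothetical.

```python
def local_dual_update(lam, rho, g_value, growth=2.0, equality=False):
    """One step in the dual space from the constraint value at the current minimizer."""
    new_lam = lam + rho * g_value
    if not equality:
        # Multipliers of inequality constraints stay non-negative.
        new_lam = max(0.0, new_lam)
    violated = (g_value != 0.0) if equality else (g_value > 0.0)
    # Increase the penalty only when the constraint is not satisfied.
    new_rho = rho * growth if violated else rho
    return new_lam, new_rho

# Violated inequality: the multiplier and the penalty both increase.
step_violated = local_dual_update(lam=1.0, rho=2.0, g_value=0.5)
# Satisfied inequality: the multiplier decreases (clipped at 0), the penalty is kept.
step_satisfied = local_dual_update(lam=1.0, rho=2.0, g_value=-1.0)
```

Such an update only uses information at the current minimizer, which is why it moves more cautiously in the \((\lambda ,\rho )\) space than the grid-based global scheme.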

Algorithm 3 gathers all these changes and is called ALV-EGO. The essential difference between this ALV-EGO algorithm and the vanilla counterpart (Algorithm 2) is that the EI maximization step is constrained so that the link between the discrete variables and the relaxed latent variables (hence the continuous \(x \)) is not lost and left to the pre-image step. The coupling between the continuous and the discrete variables is better accounted for. However, a pre-image step (line 7) is still necessary to fully recover a discrete solution in cases when the constraint is relaxed (\(\epsilon >0\)). In ALV-EGO like in the vanilla LV-EGO, there are \(q=2\) continuous latent variables per discrete variable.

The global and local dual schemes are further detailed in Algorithms 4 and 5. The continuous minimizations of the Augmented Lagrangians once the Lagrange multipliers are set are always done with 10 random restarts of the COBYLA algorithm Powell [23]. They occur in Algorithm 4, line 4 and Algorithm 5 line 5. To allow comparisons, this implementation is identical to the EI maximization of the vanilla LV-EGO (step 5 of Algorithm 2).

While the local update of \(\lambda \) and \(\rho \) might seem less robust, it is the most common implementation and it might be sufficient for the constrained EI maximization. Indeed, between two iterations, the EI changes only locally around the current iterate. Provided that the latent mapping functions do not change too much, a local update of \(\lambda \) and \(\rho \) seems appropriate. The numerical complexity of the ALV-EGO-g and -l algorithms is essentially the same as that of the vanilla LV-EGO, cf. Table 1. The global dual scheme has a slight extra cost because of the search for the Lagrange multiplier and penalty parameter, which requires \(N_{\text {DoE}}'\) extra GP predictions.

In total, four variants of ALV-EGO are considered, ALV-EGO-ge or -gi or -le or -li where g stands for global, l for local, e for equality (\(\epsilon =0\)) and i for inequality (\(\epsilon >0\)).

Description of the numerical experiments

Algorithms tested

The various algorithms tested are summarized in Table 2, which provides their names, the type of formulation for the mixed variables, the type of metamodel, the acquisition criterion and the technique to optimize the acquisition criterion. The two possible formulations for the mixed variables are either by searching in a mixed space (MS) or by a formulation in latent variables (LV). All Gaussian processes (GPs) are built with the kergp package Deville et al. [7]. The meaning of the acronyms is: LV-EGO, Latent Variables EGO; LV-RFO, Latent Variables Random Forest Optimization; ALV-EGO-ge/-gi/-le/-li, Augmented Lagrangian Latent Variables global/local dual scheme with equality/inequality pre-image constraints; MS-RFO, Mixed Space search with Random Forest Optimization; MS-ES, Mixed Space search with Evolution Strategy; MS-MKES, Mixed Space search with Mixed Kriging metamodel and Evolution Strategy.

The different algorithms will be tested on the suite of test problems described hereafter.

Test cases

There are 3 analytical test cases and a beam bending problem. The analytical test cases have all been designed by discretizing some of the variables of classical multimodal continuous test functions. The following notation is introduced to describe the discretization: if the continuous variable \(x _i\) is discretized with \(u _j\) that takes values in \(\{1,\ldots ,m_{j}\}\), then \(u _j(k)=\beta \) means \(x _i=\beta \) when \(u _j = k\), \(\beta \) a scalar, \(1\le k \le m_{j}\).

Test case 1: discretized Branin function. We modified the 2 dimensional Branin-Hoo function whose expression is

$$\begin{aligned} y(x_1,x_2)&= ({x'}_2 - b {x'}_1^2 + c {x'}_1 - r)^2 + s(1 - t) \cos {({x'}_1)} + s, \\ x'&= {x'}^{\text {min}} + ({x'}^{\text {max}}-{x'}^{\text {min}}) \times x \end{aligned}$$

where \(b = 5/(4\pi ^2), c = 5/\pi , r = 6, s = 10, t = 1/(8\pi )\), \({x'}^{\text {min}}=[-5;0]\) and \({x'}^{\text {max}}=[10;15]\). The function is partly discretized by keeping \(x_1\) continuous in [0; 1] and making \(x_2\) discrete with 4 levels \(\{u (1) = 0; u (2) = 0.333; u (3)= 0.666; u (4) = 1\}\). The discretized Branin, which was already used in Zhang et al. [32], has several local minima as shown in Figure 1a.

The global optimum is located at \((x _1^\star ,u ^\star ) = (0.182;u (3))\) with \(y(x _1^\star ,u ^\star )=2.791\).
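The discretization notation can be illustrated with a minimal Python sketch of the discretized Branin function. This is not the authors' implementation: the names `branin_scaled`, `branin_discretized` and `LEVELS` are hypothetical, and the constants are simply copied from the expression above.

```python
import math

# Constants as given in the text for the rescaled Branin-Hoo function.
B = 5.0 / (4.0 * math.pi**2)
C = 5.0 / math.pi
R = 6.0
S = 10.0
T = 1.0 / (8.0 * math.pi)
X_MIN = (-5.0, 0.0)
X_MAX = (10.0, 15.0)

# Levels u(1), ..., u(4) that replace the normalized variable x_2.
LEVELS = (0.0, 0.333, 0.666, 1.0)


def branin_scaled(x1, x2):
    """Branin-Hoo function of the normalized inputs x1, x2 in [0, 1]."""
    xp1 = X_MIN[0] + (X_MAX[0] - X_MIN[0]) * x1
    xp2 = X_MIN[1] + (X_MAX[1] - X_MIN[1]) * x2
    return (xp2 - B * xp1**2 + C * xp1 - R) ** 2 + S * (1.0 - T) * math.cos(xp1) + S


def branin_discretized(x1, level):
    """Mixed version: x1 is continuous in [0, 1], level k in {1, ..., 4} sets x2 = u(k)."""
    return branin_scaled(x1, LEVELS[level - 1])
```

A call such as `branin_discretized(0.5, 3)` evaluates the function at a continuous value of \(x_1\) and at the third discrete level.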

Test case 2: discretized Goldstein function. As a second test case, the continuous Goldstein function

is partly discretized by replacing \(x _2\) by \(u \) with 5 levels \(\{u (1) = 0; u (2) = 1/4; u (3)= 1/2; u (4) = 3/4 ; u (5) = 1\}\). The discretized Goldstein, which has also been studied in Zhang et al. [32], is drawn in Figure 1b. It has several local optima. The global optimum is located at \((x _1^\star ,u ^\star ) = (0.5; u (2))\) with \(y(x _1^\star ,u ^\star )=3\).
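The following Python sketch illustrates this test case under explicit assumptions: it uses the standard Goldstein-Price expression on the usual domain \([-2,2]^2\), normalized inputs in [0, 1], and five equally spaced levels 0, 1/4, 1/2, 3/4, 1. The names `goldstein_price`, `goldstein_discretized` and `LEVELS` are hypothetical, and the article's exact scaling may differ.

```python
# Assumed equally spaced levels u(1), ..., u(5) for the normalized variable x_2.
LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)


def goldstein_price(z1, z2):
    """Standard Goldstein-Price function on [-2, 2]^2 (assumed form)."""
    a = 1.0 + (z1 + z2 + 1.0) ** 2 * (
        19.0 - 14.0 * z1 + 3.0 * z1**2 - 14.0 * z2 + 6.0 * z1 * z2 + 3.0 * z2**2
    )
    b = 30.0 + (2.0 * z1 - 3.0 * z2) ** 2 * (
        18.0 - 32.0 * z1 + 12.0 * z1**2 + 48.0 * z2 - 36.0 * z1 * z2 + 27.0 * z2**2
    )
    return a * b


def goldstein_discretized(x1, level):
    """Mixed version: x1 continuous in [0, 1], level k in {1, ..., 5} sets x2 = u(k)."""
    z1 = -2.0 + 4.0 * x1
    z2 = -2.0 + 4.0 * LEVELS[level - 1]
    return goldstein_price(z1, z2)
```

With these assumptions, `goldstein_discretized(0.5, 2)` returns 3, which is the known global minimum of the Goldstein-Price function and matches the optimum value reported above.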

Test case 3: discretized Hartmann function. Two variables are discretized in the six-dimensional Hartmann function,

The variables \(x_5\) and \(x_6\) are discretized with 5 and 4 levels respectively such that \(\{u _1(1) = 0.350; u _1(2) = 0.257; u _1(3)= 0.477; u _1(4) = 0.312; u _1(5) = 0.657\}\) and \(\{u _2(1) = 0.150; u _2(2) = 0.657; u _2(3)= 0.512; u _2(4) = 0.741\}\). Again, there are multiple local minima and the global optimum is located at \((x ^\star ,u ^\star ) = (0.202;0.150;0.477;0.275;u _1(4);u _2(2))\) with \(y(x ^\star ,u ^\star )=-3.322\).

Euler-Bernoulli beam bending problem. This test case corresponds to a horizontal beam that is clamped at one end and subject to a vertical force at the other end. If the beam is sufficiently long compared to the dimensions of its cross section, and if it operates within its linear elastic range, the final beam deflection y (to be minimized) is expressed as

where \(P = 600 N\) is the vertical load, \(E= 600 GP\!a\) is the Young’s modulus, \(L \in [10,20]\) is the horizontal length of the beam, \(S \in [1,2]\) is the cross-section area and \({\tilde{I}} = I/S^2 \in \{{\tilde{I}}(1), {\tilde{I}}(2),\dots , {\tilde{I}}(12)\}\) is the normalized moment of inertia that can explicitly be derived for a given catalog of beam profiles. The 12 levels of the normalized moment of inertia are

Here \(\alpha \) is the weight balancing the two effects in the objective function. It is chosen as \(\alpha =60\) so that y has several local minima and only one global minimum. This global solution is \((x _1^\star ,x _2^\star ,u _1^\star ) = (0; 0.43; {\tilde{I}}(3))\) with output \(y^\star = 1.287385\times 10^3\).

Experiments setup and metrics

The optimization of each pair of algorithm and test case is repeated 50 times from different initial DoEs. The DoEs are generated by minimax Latin Hypercube Sampling. The size of the DoEs is \(N_{\text {DoE}} = 4 \times n_{c}\times n_{d}\times \text {max}(m_{i})\) and the total budget is \(N_{\text {DoE}} \) + 50 evaluations of the true objective function. Recall that the true objective function is supposed to be computationally intensive, although it is not in these experiments so that runs can be repeated. The evolution strategies are stopped after \(N_{\text {DoE}} + 50\) evaluations of the true function, like the other algorithms.

The internal local optimizer, COBYLA, is restarted 5 times during the likelihood maximization and 10 times during the maximization of the acquisition criterion. The focus-search algorithm has a sample size of 1000 with 5 boundary reduction iterations and 3 multi-starts, for a total of 3000 calls to the acquisition criterion.

A summary of the dimensions involved in the different examples is given in Table 3.

Results and discussion

The results are provided with 4 main metrics. The performance of an algorithm is classically described by the median objective function over the 50 repeated runs, calculated at each iteration. The associated measure of dispersion of the performance is the interquartile range over the repetitions as a function of the iteration. To discriminate methods that are rapid but provide rough solutions from the ones that take more time but yield better solutions, the two other metrics are based on the definition of targets. For each test case, a target is a given quantile of all the objective function values found by all the algorithms throughout all the repetitions. A 10% target is difficult, while a 50% target is the median performance. The third metric is the iteration number at which the median objective function of a given algorithm reaches a given target. The fourth metric is the success rate (given a target), which is the percentage of the runs that do better than the target. The metrics associated with the quantile targets have the advantage that they are normalized with respect to the test cases: thanks to the quantiles, the definitions of an easy, a median or a hard target stand across the different functions to optimize. The target-based metrics will later be averaged over the different test cases.
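The four metrics can be sketched in Python as follows. This is a minimal illustration and not the code used for the experiments: all function names are hypothetical, the objective values are assumed to be stored as an array with one row per run and one column per iteration, and the pool of values from which a quantile target is computed is left to the caller.

```python
import numpy as np


def best_so_far(history):
    """Running minimum of the objective; history has shape (n_runs, n_iterations)."""
    return np.minimum.accumulate(np.asarray(history, dtype=float), axis=1)


def median_and_iqr(best):
    """Median and interquartile range over the runs, at each iteration."""
    q25, q50, q75 = np.percentile(best, [25, 50, 75], axis=0)
    return q50, q75 - q25


def quantile_target(pooled_values, level):
    """Target defined as a quantile (e.g. 0.10) of the pooled objective values."""
    return float(np.quantile(np.asarray(pooled_values, dtype=float), level))


def iteration_reaching_target(best, target):
    """First iteration at which the median over the runs reaches the target, or None."""
    hits = np.nonzero(np.median(best, axis=0) <= target)[0]
    return int(hits[0]) if hits.size else None


def success_rate(best, target):
    """Percentage of runs whose final best value reaches the target."""
    return 100.0 * float(np.mean(best[:, -1] <= target))
```

Because the third metric is computed on the median run while the fourth counts individual runs, a target can have a non-zero success rate even though `iteration_reaching_target` returns `None` for the same algorithm.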

Let us now review the performances of the algorithms on each test case.

Analytical test functions

Branin function Figure 2 presents the results for the Branin function with the four metrics. On the top left plot, showing the median value of the objective function, it is clear that the two methods that rely on the random forest metamodel (MS-RFO and LV-RFO) are outperformed by all other methods. This indicates that, whether in the mixed or in the latent-augmented space, random forests do not represent the Branin function sufficiently well in comparison to Gaussian processes. Looking at Fig. 2b, it is observed that the fast methods typically have the lowest spread in performance and vice versa. This is expected as non-converging runs may yield a wide range of performances. All methods involving the discrete constraint (i.e., the augmented Lagrangians) managed to improve over the LV-EGO performance, and including a mixed metamodel significantly improved the success rate and the median solution of the evolution strategy.

Regarding the success rate in Fig. 2d, the methods MS-MKES, LV-EGO, ALV-EGO-li, -le, -ge and -gi were the most prominent, the latter being capable of reaching success rates of about \(20\%\) for a \(10\%\) target. Notice that all these methods contain Gaussian processes. Indeed, the Branin function is easy to represent by a GP whether continuous or mixed. In the same vein, MS-MKES, which differs from MS-ES by the use of a GP, clearly benefits from that metamodel.

All ALV- methods, which account for the discrete constraint, obtained the best median performances. ALV-EGO-ge in particular found all targets, in the median sense, earlier than the other algorithms as can be seen from Fig. 2c.

A last comment is necessary regarding the bottom of Fig. 2: the plot on the left describes the median performance (in terms of targets reached) while the right plot counts the success rate at reaching a target over all runs. Therefore, some targets are reached on the right by some of the runs of a given algorithm, while they are never attained on the left by the median of the same algorithm. This comment stands across all test cases.

Goldstein function The experiments done with the Goldstein test function are summed up in Fig. 3. As with the Branin function, the algorithms relying on random forests (LV-RFO and MS-RFO) both showed poor performance (top left plot). The associated high constant interquartile range (top right) is that of the best points in the initial designs, which remains unchanged since no better point is found by these algorithms.

Considering the success rates for all targets (bottom plots), it is seen that accounting for the discreteness through a constraint (which is the distinctive feature of ALV- methods) is useful with the Goldstein function: as with Branin, ALV-EGO-gi is the best performer, but the other ALV- methods follow and outperform LV-EGO. All ALV- strategies almost reach the absolute target of percentile \(25\%\) with a rate of \(25\%\) or higher. The comparison of the plots in Fig. 3c, d also shows that, behind the ALV- methods, LV-EGO has a good median performance (cf. Fig. 3c) but more of the MS-MKES searches manage to find difficult targets (the 25% and 10% quantiles).

Hartmann function Results on the Hartmann function which has 4 continuous and 2 discrete variables, with a total of 9 discrete levels, will be impacted by the sensitivity of the algorithms to an increase in dimension. These results are reported in Fig. 4.

LV-EGO stands out as the best method with respect to all criteria for Hartmann. The next two best methods are LV-RFO and ALV-EGO-gi, followed by MS-RFO and ALV-EGO-ge. This time, LV-RFO and MS-RFO, which both rely on random forests, belong to the efficient methods: random forests gain in relative performance with respect to the GPs when the dimension and the size of the initial DoE increase. For Hartmann, LV-EGO consistently outperforms the ALV- implementations. Keeping the coupling between discrete and latent variables during the optimization seems less crucial, and even somewhat detrimental, in the Hartmann case. We think that this is due to the very tight budget (50 iterations after the initial DoE), which does not allow the convergence of the optimizers, as can be seen in Fig. 4a where the global optimum is not reached. Because the optimum is not really found, constraints on discreteness are superfluous and their handling through the pre-image problem is sufficient. As in the other test cases, MS-ES was slower than the other methods.

Beam bending application

Optimization results Figure 5 summarizes the 4 comparison metrics of all 9 algorithms in the bent beam test case. The ranking of the algorithms is similar to that obtained with the Branin and Goldstein functions. LV-EGO has the best convergence both in terms of median speed (cf. plots of the left column) and accuracy (bottom right plot). ALV-EGO-gi is the second most efficient method followed by ALV-EGO-ge. Again, the algorithms that resort to random forests, LV-RFO and MS-RFO, are the slowest and least accurate. They share this poor performance with MS-ES.

Latent variables in the beam application The beam subject to a bending load is a test case in which the latent variables can be interpreted. Indeed, the normalized moment of inertia, \({{\tilde{I}}}\), is a candidate latent variable once it is allowed to take continuous values as it determines, with the continuous cross-section S and the length L, the output (the penalized beam deflection) y in Eq. (3.5). The levels of \({{\tilde{I}}}\) (given in Eq. (3.2)) correspond to 3 increasingly hollow profiles of 4 shapes, as illustrated in Fig. 6. Because a relaxed \({{\tilde{I}}}\) is a possible latent variable, it is expected that the latent variables learned from the data will be grouped in the same way as \({{\tilde{I}}}\). Looking at \({{\tilde{I}}}\) values and at Fig. 6, we thus expect, in the image space defined by latent variables, three groups of levels: those corresponding to solid forms (levels \(\{1, 4, 7, 10\}\)), medium-hollow forms (levels \(\{2, 5, 8, 11\}\)) and hollow forms (levels \(\{3, 6, 9, 12\}\)).

For the sake of interpretation, we select 1 run that found the global optimum with the Vanilla LV-EGO algorithm. In Fig. 7, we represent in a color scale the estimated correlation matrix corresponding to the categorical kernel of Eq. (2.2), at iterations [1; 26; 49; 50]. At the beginning of the optimization, at iteration 1, we can see a block-structure which corresponds quite well to the three groups of forms described above. This structure becomes less clear for the next iterations of the LV-EGO algorithm. This may be explained by the fact that the algorithm creates an unbalanced design, with more points in the promising areas according to the optimizers, so that all levels are no longer properly represented.
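To make the expected block structure concrete, the sketch below builds a correlation matrix from hypothetical latent coordinates in which the 12 levels form three tight clusters. The kernel of Eq. (2.2) is not reproduced here: a Gaussian kernel on the latent coordinates is assumed, in the spirit of latent variable kernels, and the coordinates and the name `latent_correlation` are purely illustrative.

```python
import numpy as np


def latent_correlation(latent_points):
    """Correlation exp(-squared distance) between the latent images of the levels."""
    z = np.asarray(latent_points, dtype=float)
    diff = z[:, None, :] - z[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1))


# Hypothetical 2D latent coordinates for levels 1..12: level k belongs to
# group (k - 1) % 3 (0: solid, 1: medium-hollow, 2: hollow forms).
latent = np.array(
    [[1.5 * ((k - 1) % 3), 0.05 * ((k - 1) // 3)] for k in range(1, 13)]
)
corr = latent_correlation(latent)

# Reorder the levels group by group so that the blocks appear on the diagonal.
order = [k - 1 for g in range(3) for k in range(1, 13) if (k - 1) % 3 == g]
corr_blocks = corr[np.ix_(order, order)]
```

In `corr_blocks`, the three 4-by-4 diagonal blocks contain correlations close to 1 while the off-diagonal blocks are close to 0, which is the kind of pattern described for iteration 1.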

Summary and discussion

The results of all the previous test cases which are measured through targets can be averaged. For example, the success rate of an algorithm at 25% difficulty is the average of the rates for the 25% quantiles of all test cases. The average results are presented in Fig. 8.

The three leading algorithms out of the 9 tested are ALV-EGO-gi, -ge and LV-EGO. Among them, LV-EGO is slightly better at locating difficult targets (10% quantile) while ALV-EGO-gi (closely followed by ALV-EGO-ge) is more robust at locating 50% targets, as can be seen from the median success plot in Fig. 8a. All three algorithms have in common the use of latent variables. In particular, these algorithms outperformed MS-MKES, which benefits from a Gaussian process but works only in the mixed space, i.e., MS-MKES does not involve latent variables. This shows that latent variables are useful to speed up a Bayesian search for mixed problems.

No clear advantage, on average, was found for accounting for the discrete nature of the variables through constraints: LV-EGO, which ignores the link between latent variables and the discrete variables until the pre-image problem, is competitive with the best of the augmented Lagrangian ALV-EGO algorithms. We hypothesize that the constraint on latent variables, by creating disconnected feasibility islands around the latent images of the discrete levels, makes the optimization of the acquisition criterion almost as difficult to solve as it originally was in the mixed space, therefore not allowing to fully benefit from the continuity of the \({\mathcal {X}} \times {\mathcal {L}} \) space.

In our tests, the global updating of the Lagrange multipliers was always preferable to the local counterparts, ALV-EGO-gi and -ge eclipsing ALV-EGO-li and -le.

The ALV-EGO-gi approach, where the discrete constraint is relaxed and turned into an inequality (Eq. (2.4)), works better on average than ALV-EGO-ge where the constraint is an equality. This illustrates the positive effect of the relaxation \(\epsilon \), which softens the phenomenon we mentioned above where the feasible domain is broken into disconnected regions.

MS-ES is consistently less efficient than the other algorithms. This was expected, because there is no metamodel to save calls to the function. Furthermore, the sampling is done in the mixed space. The optimizers based on random forests also have rather poor average performances, with the exception of the 6 dimensional Hartmann function. We believe the random forests need a sufficiently large initial DoE (which happened with a higher dimension) to fruitfully guide the search.

As a final comment, we discuss the necessity of re-estimating the latent variables at each iteration. The estimation of the latent variables has an important numerical cost of about \(qt^3 \sum _{i=1}^{n_{d}} m_{i}\) operations at each iteration t (cf. Table 1). It was repeated at each iteration in the algorithms with latent variables considered so far. In the experiment reported in Figure 9, a version of the LV-EGO algorithm is considered where the latent variables are estimated once only, with the initial DoE, yielding the NR-LV-EGO algorithm (for Non Repeated estimation of the latent variables).

As can be seen in Fig. 9 when comparing LV-EGO with NR-LV-EGO, the re-estimation of the latent variables at each iteration, as implemented in the LV-EGO algorithm and its ALV-EGO variants, considerably improves its performance. An accompanying result is the visualization of the correlation matrix of the discrete variable provided in Fig. 7, where one notices that the correlation (hence the latent variables) evolves in time. Our experiments indicate that this evolution is beneficial to the optimization efficiency.

Conclusions and perspectives

This work has investigated five Bayesian optimization approaches to small and medium size mixed problems that hinge on latent variables. They differ in the way the coupling between the discrete variables and their relaxed counterparts, the latent variables, is implemented.

Algorithms involving latent variables were compared to other algorithms directly working in the mixed space and were found to consistently outperform them. LV-EGO and ALV-EGO-gi were more efficient (in terms of calls to the true objective function) than MS-MKES, which also benefits from a Gaussian process. These first results show that latent variables provide a flexible way to handle mixed problems of moderate size, i.e., with at most about 10 variables and 10 discrete levels in total.

Accounting for the discrete nature of some variables through a constraint during the relaxed optimization with augmented Lagrangians was not clearly found to further increase the performance of the search, as LV-EGO was on par with, and sometimes even outperformed, the ALV versions of the algorithms. It was also observed that expressing the discreteness as an inequality constraint by adding a tolerance was a better option than expressing it as an equality. The global updating strategy of the Lagrange multipliers, which to the best of our knowledge is original, improved over the more common local updating schemes. Finally, the random forest metamodels did not do as well as the Gaussian processes, whether in their continuous or mixed forms, within the Bayesian optimization algorithm.

Our study needs to be completed in three ways. To fully leverage the continuous latent space, the gradient of the acquisition function should be analytically calculated and used to guide its maximization. The implementation we proposed creates more latent variables than there are discrete levels, which limits its application to about 10 levels. This limitation can be overcome with under-parameterized kernels based on groups Roustant et al. [25] or warping techniques Deville et al. [7]. Mixed Bayesian optimization through latent variables would also gain in credibility if the convergence results of EGO were generalized to it.

Complements on the augmented Lagrangians

Case of an equality constraint

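Let us first consider an optimization problem with an equality constraint,

$$\begin{aligned} \min _{x \in {\mathcal {X}}} f(x) \quad \text {such that } h(x) = 0 . \qquad \text {(A.1)} \end{aligned}$$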

At this point, f() and h() are very general functions on a d-dimensional general set \({\mathcal {X}}\). We only require that \({\mathcal {X}}\) is not empty, that f() and h() are bounded, and that there is at least one solution to (A.1), \(x^\star \in {\mathcal {X}}\), which can be attained. f() and h() are not necessarily continuous, a fortiori not necessarily differentiable. With respect to the main body of the article, the notations are simplified in this Section: \({\mathcal {X}}\) stands for the Cartesian product of \({\mathcal {X}}\) and \({\mathcal {L}} \), f(x) generalizes \(-\log (1+EI)\) and h(x) corresponds to \(g^{(t)}(\ell )\) when \(\epsilon =0\). Note that \(g^{(t)}()\), being made of the minimum distance to a discrete set of points (cf. Eq. (2.4)), is not differentiable. \(g^{(t)}()\) is the only constraint in the article. This appendix considers one constraint too, but all the results given readily generalize to many constraints by replacing the products by vector scalar products.

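The problem can be equivalently reformulated by adding a quadratic penalty to the objective,

$$\begin{aligned} \min _{x \in {\mathcal {X}}} f(x) + \frac{1}{2}\rho h^2(x) \quad \text {such that } h(x) = 0 , \qquad \text {(A.2)} \end{aligned}$$

where \(\rho \ge 0\) is a penalty parameter. The two above formulations have the same solution \(x^\star \) and the same value of optimal objective function since \(x^\star \) is feasible, \(h(x^\star )=0\), therefore \(f(x^\star ) = f(x^\star ) + \frac{1}{2}\rho h^2(x^\star )\). However, as proved in Minoux [17] and sketched in Fig. 10, there is always a positive lower bound on the penalty parameters, \(\rho \ge \rho ^\star \ge 0\), such that Problem (A.2) can be equivalently solved through the dual formulation,

$$\begin{aligned} \max _{\lambda ,\rho } D (\lambda ,\rho ) \quad \text {where } D (\lambda ,\rho ) = \min _{x \in {\mathcal {X}}} L_A(x;\lambda ,\rho ) \text { and } L_A(x;\lambda ,\rho ) = f(x) + \lambda h(x) + \frac{1}{2}\rho h^2(x) . \qquad \text {(A.3)} \end{aligned}$$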

In this way, the augmented Lagrangian of Hestenes [10] is the classical Lagrangian of the penalized problem (A.2). We write \(\lambda ^\star ,\rho ^\star \) a solution to (A.3). \(D (\lambda ,\rho )\) is the lower front of all augmented Lagrangians for varying x at a given \(\lambda ,\rho \). The “global dual” update of \((\lambda ,\rho )\) comes from the resolution of (A.3) where the set \({\mathcal {X}}\) is approximated by the finite subset of samples \({{X}} \).

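Let \(x(\lambda ,\rho )\) denote a minimizer of the augmented Lagrangian over \({\mathcal {X}}\), i.e., a solution at given multiplier and penalty parameter. The function \(D (\lambda ,\rho )\) is concave in \(\lambda \) and \(\rho \) and \(h(x(\lambda ,\rho ))\) is a subgradient with respect to \(\lambda \) Minoux [17]. This is at the root of the updating strategies that we called “local dual” earlier and which consist of a subgradient step in the dual space, \(\lambda \leftarrow \lambda + \alpha h(x(\lambda ,\rho ))\), where \(\alpha \) is a positive step factor (A.5).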

More specific update strategies such as those given in Nocedal and Wright [19], Picheny et al. [22] stem from the Karush, Kuhn and Tucker (KKT) optimality conditions and require the additional assumptions that \({\mathcal {X}} \subset {\mathbb {R}}^d\) and that f() and h() are differentiable. At \(x^\star \), since \(h(x^\star )=0\) and \(\lambda ^{KKT}\) being the KKT multiplier (see Note 1), one has

The updates (A.5) and (A.8) have the same form. (A.8) is more restrictive since the KKT conditions must apply, but its step size is known.
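The difference between the local and global dual updates can be illustrated on a toy problem with a minimal Python sketch. It is not the authors' Algorithms 4 and 5: the objective `f`, the constraint `h` (a distance to the discrete set {1, 2}, hence non-negative), the sample set and the candidate grids are all hypothetical, and the local update uses the step factor \(\alpha = \rho \) as in the KKT-based variant.

```python
import numpy as np


def f(x):
    """Toy objective (hypothetical)."""
    return (x - 1.4) ** 2


def h(x):
    """Toy non-negative constraint: distance to the discrete set {1, 2}."""
    return np.minimum(np.abs(x - 1.0), np.abs(x - 2.0))


def augmented_lagrangian(x, lam, rho):
    """Objective plus linear and quadratic penalties on the constraint."""
    return f(x) + lam * h(x) + 0.5 * rho * h(x) ** 2


def primal_solution(samples, lam, rho):
    """x(lambda, rho): minimizer of the augmented Lagrangian over the samples."""
    return samples[np.argmin(augmented_lagrangian(samples, lam, rho))]


def local_dual(samples, lam=0.0, rho=1.0, n_iter=50):
    """Repeated subgradient steps lambda <- lambda + alpha * h(x), with alpha = rho."""
    for _ in range(n_iter):
        x = primal_solution(samples, lam, rho)
        lam = lam + rho * h(x)
    return primal_solution(samples, lam, rho), lam


def global_dual(samples, lam_grid, rho_grid):
    """Maximize the dual function D(lambda, rho) over a grid of candidates."""
    best = None
    for rho in rho_grid:
        for lam in lam_grid:
            d = np.min(augmented_lagrangian(samples, lam, rho))
            if best is None or d > best[0]:
                best = (d, lam, rho)
    _, lam, rho = best
    return primal_solution(samples, lam, rho), lam, rho


# The continuous set is approximated by a finite set of samples.
samples = np.linspace(0.0, 3.0, 301)
x_local, lam_local = local_dual(samples)
x_global, lam_global, rho_global = global_dual(
    samples, np.linspace(0.0, 2.0, 41), [0.1, 1.0, 10.0]
)
```

On this toy problem both schemes end at the feasible point x = 1. The local scheme gets there through a sequence of multiplier increments, whereas the global scheme selects a suitable pair of multiplier and penalty parameter in a single pass over the candidates, at the price of one dual evaluation per candidate.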

The equality constraint of the article (Eq. (2.4) with \(\epsilon =0\)) is a minimum over distances. It has the additional feature that it is always positive or null, \(\forall x \in {\mathcal {X}}~,~h(x) \ge 0\). Because of this, if h is locally differentiable around \(x^\star \), \(\nabla h(x^\star ) = 0\) since h has a minimum at \(x^\star \). The constraint qualification condition is not satisfied (\(\nabla h(x^\star )\) does not span a non-empty set) and the KKT conditions do not apply. Another consequence is that the optimal Lagrange multiplier must be positive and the search for \(\lambda \) can be written \(\max _{\lambda \ge 0} D (\lambda ,\rho )\) in Problem (A.3), as in Problem (2.7).

Proof

Assume \(\rho \) is large enough for Problem (A.2) to have a saddle point at its optimum, \(f(x^\star ) \le f(x) + \rho /2 h^2(x) + \lambda ^\star h(x)~,~ \forall x\) where \(\lambda ^\star \) is the optimum Lagrange multiplier. Since the optimization problem has an active constraint, there is a point \(x^I\) that is infeasible, \(h(x^I)>0\), and has a better objective function than the feasible solution (otherwise the constraint is useless), \(f(x^I)+\frac{\rho }{2}h^2(x^I) \le f(x^\star )\). If the optimum Lagrange multiplier is negative, \(\lambda ^\star < 0\), \(f(x^I)+\frac{\rho }{2}h^2(x^I)+\lambda ^\star h(x^I) < f(x^\star )\) which contradicts the fact that \(x^\star \) is a solution to the dual problem. \(\square \)

Inequality constraint

When \(\epsilon >0\), Problem (2.4) has an inequality constraint which we rewrite here more simply,

The considerations on augmented Lagrangians made above for equality constraints readily extend to inequality constraints by introducing a slack variable,

which is equivalent to the expression of Rockafellar with the 2 cases given in Eq. (2.5) (recall \(-\log (1+EI)\) is f(x)).

The update equations for \(\lambda \) are the same as those for the equality case where the slack variable \(s^2\) takes its optimal value. On the one hand, it is possible to solve the approximated dual problem as in (2.8). On the other hand, a step along a subgradient in the dual space can be taken,

where \(\alpha \) is again a positive step factor. It has the same form as Eq. (2.10). The update (2.10) is fully recovered from the KKT conditions as above for equalities, (A.8),

Equations (A.14) and (A.15) are the same, but in the latter the step factor \(\alpha \) is known and equal to \(\rho \), which comes at the additional expense of the KKT validity conditions.
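The equivalence between the slack-variable formulation and the two-case expression can be checked numerically with the short Python sketch below. It assumes the standard Rockafellar form for a constraint \(g(x) \le 0\) with \(\lambda \ge 0\) and \(\rho > 0\), which is taken to match Eq. (2.5); the function names are hypothetical and the objective f(x) is left out since it is identical in both expressions.

```python
def rockafellar_term(g, lam, rho):
    """Constraint part of the augmented Lagrangian for g(x) <= 0, two-case form."""
    if g >= -lam / rho:
        return lam * g + 0.5 * rho * g**2
    return -(lam**2) / (2.0 * rho)


def slack_term(g, lam, rho):
    """Same quantity, obtained by minimizing over the slack s^2 >= 0."""
    s2 = max(0.0, -g - lam / rho)  # optimal value of the slack variable
    h = g + s2  # equality constraint h = g + s^2
    return lam * h + 0.5 * rho * h**2


def update_multiplier(g, lam, rho):
    """Local update with step factor rho, projected on lambda >= 0."""
    return max(0.0, lam + rho * g)
```

When the constraint is satisfied with a margin larger than \(\lambda /\rho \), the optimal slack is positive and the multiplier update returns 0, so the constraint no longer influences the augmented Lagrangian.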

Availability of data and materials

The source code of this work will be made available upon request to the corresponding author.

Notes

The Lagrange multiplier that maximizes the dual function is equal to the KKT multiplier only when the functions are differentiable, the constraint qualification conditions apply, and there is a saddle point, i.e., \(\min _x \max _{\lambda } L_A(x;\lambda ,\rho ) = \max _{\lambda } \min _x L_A(x;\lambda ,\rho )\).

Abbreviations

ALV: Augmented Lagrangian Latent Variable

DoE: Design of Experiment

\(\textit{D}(),{{\widehat{D}}()}\): dual and approximate dual functions

MLE: Maximum Likelihood Estimation

\(\epsilon \): relaxation constant for the discreteness constraint

Bischl B, Richter J, Bossek J, Horn D, Thomas J, Lang M. mlrMBO: A modular framework for model-based optimization of expensive black-box functions, 2018.

Cao YJ, Jiang L, Wu QH. An evolutionary programming approach to mixed-variable optimization problems. Appl Math Model. 2000;24(12):931–42.

Emmerich M, Zhang A, Li R, Flesch I, Lucas PJ. Mixed-integer Bayesian optimization utilizing a-priori knowledge on parameter dependences. J Phys Chem A. 2008;65–72.

Frazier PI. A tutorial on Bayesian optimization. arXiv e-prints, arXiv:1807.02811, July 2018.

Hestenes MR. Multiplier and gradient methods. J Optim Theory Appl. 1969;4(5):303–20.

Hutter F, Hoos HH, Leyton-Brown K. Sequential model-based optimization for general algorithm configuration. In: International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer, 2011.

Jones DR, Schonlau M, Welch WJ. Efficient global optimization of expensive black-box functions. J Glob Optim. 1998;13(4):455–92. https://doi.org/10.1023/A:1008306431147 (ISSN 1573-2916).

Le Riche R, Guyon F. Dual evolutionary optimization. Lecture Notes in Computer Science, (2310): 281–294. Selected papers of the 5th Int. Evolution Artificielle Conf; 2002.

Le Riche R, Picheny V. Revisiting Bayesian optimization in the light of the COCO benchmark. Struct MultiDiscip Optim. 2021. to appear.

Li R, Emmerich MTM, Eggermont J, Bäck T, Schütz M, Dijkstra J, Reiber JHC. Mixed integer evolution strategies for parameter optimization. Evol Comput. 2013;21(1):29–64.

Nocedal J, Wright SJ. Numerical optimization. Springer series in operations research. Springer, New York, 2nd edn, 2006. ISBN 978-0-387-30303-1. OCLC: ocm68629100.

Ocenasek J, Schwarz J. Estimation of distribution algorithm for mixed continuous-discrete optimization problems. In: 2nd Euro-International Symposium on Computational Intelligence. pp. 227–232. IOS Press Kosice, Slovakia, 2002.

Pelamatti J, Brevault L, Balesdent M, Talbi E-G, Guerin Y. Efficient global optimization of constrained mixed variable problems. J Glob Optim. 2019;73(3):583–613. https://doi.org/10.1007/s10898-018-0715-1 (ISSN 0925-5001, 1573-2916).

Picheny V, Gramacy RB, Wild S, Le Digabel S. Bayesian optimization under mixed constraints with a slack-variable augmented Lagrangian. In: Lee D, Sugiyama M, Luxburg U, Guyon I, Garnett R, editors. Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/31839b036f63806cba3f47b93af8ccb5-Paper.pdf.

Powell MJD. A direct search optimization method that models the objective and constraint functions by linear interpolation, pp. 51–67. Netherlands, Dordrecht: Springer; 1994. ISBN 978-94-015-8330-5. https://doi.org/10.1007/978-94-015-8330-5_4.

Roustant O, Padonou E, Deville Y, Clément A, Perrin G, Giorla J, Wynn H. Group kernels for Gaussian process metamodels with categorical inputs. SIAM/ASA J Uncertainty Quantif. 2020;8(2):775–806. https://doi.org/10.1137/18M1209386.

Thi HAL, Le HM, Dinh TP. Optimization of complex systems: theory, models, algorithms and applications. In: Advances in intelligent systems and computing. Springer International Publishing, 2019. ISBN 9783030218034. URL https://books.google.fr/books?id=R46dDwAAQBAJ.

Wang Z, Hutter F, Zoghi M, Matheson D, de Freitas N. Bayesian optimization in a billion dimensions via random embeddings. J Artif Intell Res. 2016;55:361–87.

Wilson JT, Hutter F, Deisenroth MP. Maximizing acquisition functions for Bayesian optimization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 9906–9917, Red Hook, NY, USA, 2018. Curran Associates Inc.

Zhang Y, Tao S, Chen W, Apley DW. A latent variable approach to Gaussian Process modeling with qualitative and quantitative factors. Technometrics, 2019;1–12. ISSN 0040-1706, 1537-2723. https://doi.org/10.1080/00401706.2019.1638834.

Zuniga MM, Sinoquet D. Global optimization for mixed categorical-continuous variables based on Gaussian process models with a randomized categorical space exploration step. INFOR Inform Syst Opera Res. 2020;58(2):310–41. https://doi.org/10.1080/03155986.2020.1730677.

The kernels with latent variables were developed jointly by OR, GP and JC-R. The Bayesian optimization formulation was developed jointly by RLR, JC-R, OR, GP and CD. The augmented Lagrangian schemes were developed jointly by RLR and JC-R. The test cases were proposed by GP, CD, AG and JC-R. JC-R did the code implementation. All authors reviewed the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cuesta Ramirez, J., Le Riche, R., Roustant, O. et al. A comparison of mixed-variables Bayesian optimization approaches. Adv. Model. and Simul. in Eng. Sci. 9, 6 (2022). https://doi.org/10.1186/s40323-022-00218-8