A comparison of mixed-variables Bayesian optimization approaches

Most real optimization problems are defined over a mixed search space where the variables are both discrete and continuous. In engineering applications, the objective function is typically calculated with a numerically costly black-box simulation.General mixed and costly optimization problems are therefore of a great practical interest, yet their resolution remains in a large part an open scientific question. In this article, costly mixed problems are approached through Gaussian processes where the discrete variables are relaxed into continuous latent variables. The continuous space is more easily harvested by classical Bayesian optimization techniques than a mixed space would. Discrete variables are recovered either subsequently to the continuous optimization, or simultaneously with an additional continuous-discrete compatibility constraint that is handled with augmented Lagrangians. Several possible implementations of such Bayesian mixed optimizers are compared. In particular, the reformulation of the problem with continuous latent variables is put in competition with searches working directly in the mixed space. Among the algorithms involving latent variables and an augmented Lagrangian, a particular attention is devoted to the Lagrange multipliers for which a local and a global estimation techniques are studied. The comparisons are based on the repeated optimization of three analytical functions and a beam design problem.


Introduction
A key task in engineering design is to find an optimal configuration from a very large set of alternatives. When the performance of the candidate solutions is measured through a realistic simulation, the numerical cost of the procedure becomes a bottleneck. The optimization of computationally expensive simulators is a topic widely studied in the literature Thi et al. [26].
In this work, we focus on Bayesian optimization (BO), which is particularly suitable for solving such problems Frazier [9]. Bayesian optimization is a sequential design strategy that requires a data-driven mathematical model or metamodel that provides predictions along with their uncertainty Bartz-Beielstein et al. [2]. The metamodel replaces some of the calls to the expensive simulation and is a key ingredient to the optimization of costly functions. An acquisition criterion Wilson et al. [29] aggregates the spatial predictions and uncertainties. The metamodel is trained from a reduced set of simulation data and the acquisition criterion is maximized to propose new configurations to be simulated at the next iteration. When the acquisition criterion is the expected improvement (EI), as first introduced in Mockus et al. [18], the BO algorithm is often called EGO (Efficient Global Optimization, Jones et al. [12]). EGO is currently a state-of-the-art approach to medium size, continuous and costly optimization problems, both from an empirical Le Riche and Picheny [14] and a theoretical point of view Vazquez and Bect [27].
However, in realistic settings, some of decision variables are categorical. In structural design for example, the type of material, the number of components, the choice between alternative technologies lead to discrete variables with no obvious distance between them. The combination of continuous and categorical variables is called a mixed optimization problem.
In non-costly cases, mixed optimization problems can be approached by Mixed-Integer NonLinear Programming Belotti et al. [4] (when the discrete variables are integers), by sampling based techniques such as evolutionary optimization Cao et al. [6], Emmerich et al. [8], Ocenasek and Schwarz [20] or by alternating mixed programming Audet and Dennis Jr [1].
When the objective function is costly, mixed optimization problems remain challenging and a topic for research. It is customary to replace some of the calls to the original objective function by calls to a (meta)model of it. Bartz-Beielstein and Zaefferer [3] provide an overview of metamodels that have or can be used in optimization when the variables are continuous or discrete. Bayesian optimization methods, which rely on metamodels to save computations, have already been extended to mixed problems. It was made possible by the realization that GP kernels (covariance functions) in mixed variables can be created by composing continuous and discrete kernels. The acquisition function is defined over the same space as the objective function. Therefore maximizing the acquisition function is also a mixed variables problem.
To the best of our knowledge, the first EGO-like algorithm for mixed variables has been proposed in Hutter et al. [11]. In this article, the mixed kernel is a product of continuous and discrete Gaussian kernels, and random forests constitute an alternative choice of mixed metamodel. More precisely, the discrete kernel is a Gaussian of integer or hamming (also known as Gower) distance for ordinal or nominal variables, respectively. In Hutter et al. [11], the expected improvement is first optimized with a multi-start local search for both the continuous and discrete variables (thus a neighborhood for the discrete variables is defined) which is then complemented by a random search. This work was continued with the REMBO method in Wang et al. [28], where a random linear embedding is introduced to tackle high-dimensional problems. Discrete variables were relaxed into continuous variables thanks to a mapping function. The optimization of the acquisition function was made with a combination of the DIRECT and CMA-ES continuous global optimizers. Both Hutter et al. [11] and Wang et al. [28] have been motivated by applications to the automatic configuration of algorithms. The goal of reaching very high dimensions (millions) probably forced the authors to use isotropic kernels as a way to keep the number of hyper-parameters low (only one length-scale for all dimensions).
A Bayesian mixed optimizer is presented in Pelamatti et al. [21]. The GP kernels are products of continuous and discrete kernels. Different discrete kernels are compared, namely the homo-and hetero-scedastic hypersphere decomposition and the compound symmetric kernels. The optimization of the acquisition function is performed with a genetic algorithm in mixed variables. A similar BO with mixed kernel is described in Zuniga and Sinoquet [33], but the expected improvement is optimized with the mixed version of the MADS algorithm Audet and Dennis Jr [1] and the neighborhood of the categorical variables is defined through a probabilistic model.
Random forests can replace the kriging model in BO with mixed inputs as they natively have a measure of prediction uncertainty. Such an implementation, first done in Hutter et al. [11], is part of the mlrMBO R package Bischl et al. [5], in conjunction with several acquisition criteria that can be optimized with a "focus-search" algorithm. The focus-search algorithm hierarchically samples the search space of the chosen acquisition criterion.
Recent developments in metamodels involving mixed variables show that it is possible to map the categorical variables into quantitative non-observed latent variables that are then considered as continuous Zhang et al. [31]. Whenever it is possible to write a model of the studied system, quantitative latent variables exist that describe the effects of the categorical variables. Typically, there are more latent variables than categorical variables. The existence of continuous latent variables can sometimes be established from the physics of the considered phenomena, e.g. in material science Zhang et al. [32]. In structural mechanics for example, if the categorical variable describes the shape and the material of an element load in flexion, its bending moment of inertia is a candidate latent variable. Latent variables can emulate the properties of the original categorical variables, in particular within the metamodel, and open the way to reasonings with continous quantities: the kernels of the Gaussian processes can be taken as continuous, gradients and neighborhoods are naturally defined during the optimization. On the contrary, categorical variables and their inherent lack of distance definition is the cause of complications in the kernel definition and in the optimization.
This article presents a new Bayesian optimization algorithm for mixed variables called LV-EGO (for Latent Variable EGO). Our contribution with respect to Zhang et al. [32] is that the continuity of the latent variables is also taken advantage of during the optimization of the acquisition criterion. This implies that categorical variables must be recovered from the continuous latent variables proposed by the optimizer, which creates a new "pre-image" problem.
"Problem statement and background" introduces the problem and the principles of Bayesian optimization. In "EGO with latent variables", several variants of LV-EGO are described. They differ in the handling of the relationship between the categorical and the latent variables: the "vanilla" LV-EGO just recovers categorical variables after the optimization while augmented Lagrangian versions account for the link during the optimization through constraints. "Description of the numerical experiments" presents a set of benchmarks comparing our method to other state-of-the-art techniques. One of the benchmarks is a beam design problem and gives the opportunity to discuss the interpretation of the latent variables. Finally, "Conclusions and perspectives" offers conclusions and perspectives to this work.

Problem statement and background
We consider the problem of minimizing a function y(x, u) depending on a vector of continuous variables x = (x 1 , . . . , x n c ) and a vector of discrete variables u = (u 1 , . . . , u n d ), where each u i has m i levels encoded 1, . . . , m i . We denote X the domain of definition for the continuous inputs, typically, after rescaling, the hypercubic domain [0, 1] n c . Similarly, we denote U = n d j=1 {1, . . . , m j } the domain of definition for the discrete inputs. X × U is the set of the mixed optimization variables. We focus on costly functions, meaning that each evaluation of y is time-consuming, and we aim at minimizing y with a tiny budget of evaluations. In this context, minimizing directly y is hardly possible. An alternative is to use Bayesian optimization (BO). In BO approaches, there are two main ingredients: a Gaussian process (GP) serving as a fast proxy, often called metamodel, built from the current learning set, and a sampling criterion, often called acquisition criterion, used to update the learning set with a new data point computed with y. A famous acquisition criterion is the expected improvement (EI). In that case, the BO approach is often called Efficient Global Optimization (EGO) algorithm.
To be more precise, let (X, U ) = {(x, u) (1) , . . . , (x, u) (t) } ∈ (X × U) t be a design of experiments (DoE), and y i = y(x (i) , u (i) ) be the corresponding function evaluations (i = 1, . . . , t). Let y min = min(y 1 , . . . , y t ) be the current minimum. Let us now assume that y is a particular realization of the GP Y defined on X × U. In that case, the EI criterion is defined by where Y t is the conditional GP knowing the observations: Notice that EI(x, u) is large when exploiting interesting area, that is to say when there is a good chance that Y t (x, u) is smaller than y min . This may occur when E[Y t (x, u)] is close to y min , or when exploring unvisited areas, i.e. when the variance of Y t (x, u) is large compared to (E[Y t (x, u)] − y min ) 2 . The idea of EGO is to evaluate y at a new point maximizing the EI criterion until a stopping criterion is reached. See Algorithm 1 for a synthetic description of the EGO algorithm when the stopping criterion is a maximum number of evaluations of y, noted budget. A maximum budget is the logical stopping criterion in our context of costly optimization. Other stopping conditions are possible in the form of lower bounds on the acquisition criteria (expected improvement, knowledge gradient Frazier [9],…) i.e., minimal measures of progress below which the search should stop. In line 9, the solution returned by the algorithm is the best point of the last DoE, (X, U ). Algorithm 1 EGO algorithm on a mixed space 1: Generate the initial DoE of size N DoE , (X, U ), and calculate Y = (y 1 , . . . , y N DoE ), t ← N DoE . 2: while t ≤ budget do 3: Estimate the GP Y t from the learning set formed by (X, U ) and Y . 4: Look for the current minimum y min and maximize (x, u) → EI(x, u) on X × U: (x t+1 , u t+1 ) ∈ argmax (x,u)∈X ×U EI(x, u). 5: Evaluate y at (x t+1 , u t+1 ), y t+1 = y(x t+1 , u t+1 ). 6: Update the learning set: (X, U ) ← (X, U ) ∪ (x t+1 , u t+1 ), Y ← Y ∪ {y t+1 }. 7: t ← t + 1 8: end while 9: (x , u ) = arg min (x,u)∈(X,U) y(x, u), y = y(x , u ) 10: return (x , u , y ) This EGO algorithm has been intensively studied to minimize nonlinear functions that are expensive to be evaluated in the case U = ∅, i.e. when all input variables are continuous (see Le Riche and Picheny [14] for numerical illustrations of its efficiency). The application of this algorithm in the presence of categorical variables is much less documented (see e.g. Pelamatti et al. [21], Zuniga and Sinoquet [33]), which can be explained by two main difficulties. The first one is related to the difficult estimation of covariance kernels on mixed spaces. Indeed, multi-dimensional covariance functions are often built by combination of one-dimensional ones. Therefore, covariance functions on mixed spaces can be obtained by combining covariance functions on X and U: where k x 1 , . . . , k x n c , k u 1 , . . . , k u n d are covariance functions and * is an operation that preserves positive definiteness, such as sum or product. If we focus on the single categorical variable u j with levels 1, . . . , m j , we can identify the covariance function k u j to a (m j × m j )dimensional positive semidefinite matrix T, such that for all 1 ≤ k, ≤ m j , This means that n d j=1 m j (m j + 1)/2 coefficients need to be estimated to determine a covariance on U in the general case. That number can be large when m is large, which very often makes this estimation very difficult in practice. Furthermore, the optimization problem is often harder than the box-constrained one met with continuous variables. Indeed it is either constrained by the positive definiteness of T, which is non-linear, or defined on a manifold if T is parameterized in spherical coordinates. We refer to Roustant et al. [25] for more details and other parsimonious representations of k u j , which can reduce but not totally fix these issues. The second reason that can explain the few number of direct applications of EGO algorithm on mixed space is related to the difficult maximization of the expected improvement, i.e. the search of the new input points where to call the function y, which are solutions of: Indeed, classical optimization algorithms on continuous spaces usually try to exploit information related to the gradient of the function to be maximized, as well as notions of proximity in the space of the inputs. However, these two notions are difficult to exploit when dealing with categorical inputs, i.e. without any a priori ordering between the input instances. To circumvent this difficulty, a naive approach of resolution would consist in no longer considering a single maximization problem on X × U, but the resolution in parallel of n d j=1 m j maximization problems on X , i.e. one problem per combination of instances of the categorical inputs u. Such an approach is not tractable when the number of optimization problems to be solved becomes large, which has motivated the definition of heuristics, such as evolutionary algorithms Li et al. [15], Cao et al. [6], Lin et al. [16], which seek to concentrate the searches only on the interesting instances of u. These approaches still rely on a large number of calls to the function to be optimized, and their convergence is not always easy to quantify.
Because mixed optimization problems are difficult, an alternative approach is proposed in the rest of this paper. It is based on the possibility to relax the discrete variables into con-tinuous latent variables, therefore benefiting from the more efficient search mechanisms that exist in continuous spaces (e.g. gradients).

Latent variables
For an easier handling of categorical inputs, it was proposed in Zhang et al. [31] to replace each categorical input u j by a vector of q j ≥ 1 continuous inputs with values in R q j , noted j . To give an intuition of the underlying idea in the automotive domain, a category of lubricant may be determined by physical continuous features such as boiling temperature, viscosity, etc that act as latent variables. In structural mechanics, the shape of a load carrying structure, which is categorical, has underlying continuous flexural and membrane moments that drive its behavior. This amounts to associating to the Gaussian process (GP) Y a new GP Y , such that for each instance u of the categorical inputs there exists a particular value of :=( 1 , . . . , n d ) ∈ L ⊂ R q 1 ×· · ·×R q n d , which is called latent variable, allowing us to write: (2.1) An important point is that the values of are unobserved and therefore Y is unknown. Nevertheless, in order to replace the EI maximization problem on X × U by a new optimization problem on X × L, a precise knowledge of Y is not necessary. Indeed, assuming that kernels for mixed inputs are built by combining 1-dimensional ones as in (1.1), it is sufficient to identify the mappings φ j from {1, . . . , m j } to R q j to each variable u j such that where k j is a continuous kernel on R q j × R q j . Thus, it is not so much the values of φ j (u j ) that are important, but their relative positions in R q j in order to allow a reasonable reconstruction of the dependency structure between Y (x, u) and Y (x , u ). According to the works achieved in Zhang et al. [31], it appears that interesting mappings can be obtained by likelihood maximization and that relatively small values of q j can give a satisfying reconstruction. Following their recommendations, q j can be chosen equal to 1 if m j ≤ 3 and to 2 otherwise, which will be the values chosen in the rest of this paper. We denote by n = n d j=1 q j the total number of latent variables. Following Roustant et al. [25], the continuous kernel k j associated to the latent variables was chosen as the dot product kernel k j (t, t ) = t, t . The corresponding covariance matrix is then low-rank, and provided better performances than the Gaussian kernel in the examples considered in the latter reference.
This new parametrization leads us to the following adaptation of the EI maximization problem defined by Eq. (1.3), which we name acquisition problem as it allows to acquire a new point to evaluate: Here, EI (t) (x, ) is the expected improvement associated with GP Y at iteration t, We follow two paths to solve this acquisition problem. In the vanilla LV-EGO approach, which will be described soon, the EI maximization and the latent-discrete compatibility constraint are addressed one after each other. Alternatively, with the augmented Lagrangian approaches, which will be described in "LV-EGO algorithms with Augmented Lagrangian", the full constrained optimization problem is treated.

The vanilla LV-EGO algorithm
At each iteration, the vanilla LV-EGO algorithm first maximizes EI in a relaxed, fully continuous, formulation where the discrete variables are replaced by relaxed continuous latent variables. Then, a pre-image problem is solved where EI is maximized over the discrete variables only, the continuous variables being fixed at their value of the relaxed problem. The LV-EGO methodology is summarized in Algorithm 2. 4: Estimate the latent variable mappings φ (t) and the parameters of the continuous GP Y . 5: Perform one EGO iteration in the relaxed continuous space :

Algorithm 2 Vanilla LV-EGO with mixed inputs
The main difference with the generic Bayesian algorithm 1 is the new discrete pre-image problem in line 6. Notice that the pre-image is formulated in terms of the EI objective, as opposed to a more arbitrary distance like t+1 − φ (t) (u) . Solving the pre-image in terms of the iterative figure of merit, the expected improvement, is meant to provide a gain in efficiency with respect to a pre-image minimizing an Euclidean distance between the map of a discrete level and the latent variables. In the particular situation where the latent variable coincides with the image of a discrete level, t+1 = φ (t) (u t+1 ), both approaches yield the same result since t+1 is a maximizer of EI (see line 5 of Algorithm 2). In terms of implementation, the EI maximization (line 5) is done with the COBYLA algorithm, a gradient free non-linear optimization technique Powell [23]. Since COBYLA is a local optimizer and the EI is a multimodal function, the maximization is repeated (10 times, which is more than the maximum dimension of the test cases studied in this article and more than the default-3-of the kergp package) from randomly chosen initial points and the best result is kept. An exhaustive search is carried out for the EI maximization of the pre-image problem (line 6).
A comparison of the numerical complexities of the vanilla LV-EGO (Algorithm 2) and the generic EGO (Algorithm 1) shows that the cost of the latent variables is limited. Let us consider that the discrete space can be searched essentially by enumeration in where m i is the number of levels per discrete variable) while a continuous space can be searched more efficiently in linear time. At each iteration, the Bayesian algorithms of this paper have three steps: first a GP is learned, then an acquisition criterion (EI for now and an augmented Lagrangian later) is maximized and finally a preimage problem is solved. In the vanilla LV-EGO algorithm, these steps take place at lines 4, 5 and 6 of Algorithm 2, respectively. Table 1 summarizes the number of operations per step. The number of operations for learning the GPs is proportional to the cube of the number of points evaluated (t) because of the inversions of the covariance matrices, times the number of (continuous) parameters of the GP for the likelihood maximization.
The two other steps, the acquisition and the pre-image, imply predictions by the GP in t 2 operations times a number of operations that depends on the specific algorithm. Comparing in Table 1 the column of the generic EGO with that of the vanilla LV-EGO, and assuming that for all i m i = m to keep the discussion simple, it can be seen that the latent variables induce a slight extra cost to be learnt. When q = 2, which is our default here, this extra cost is n d × m i × t 3 operations. Setting q = 1 would not add any cost to the learning. An advantage, which comes from the sequential resolution of the mixed problem, occurs in the maximization of the acquisition criterion when n c + q × n d × m < m n d × n c , at the cost of an additional pre-image problem to solve. Thus, LV-EGO will be faster than a mixed EGO once the latent variables are estimated if m n d + n c + q × m × n d < m n d × n c , which happens frequently (take for example n c = 4, n d = 2, m = 10, q = 2).

LV-EGO algorithms with Augmented Lagrangian
A possible pitfall of the vanilla LV-EGO detailed in Algorithm is that the link between the discrete variables u and their relaxed continuous counterparts is lost when maximizing EI (t) (x, ) in line 5. Recovering it during the discrete pre-image problem where x is fixed to a value optimal in the relaxed formulation but possibly non-optimal with respect to the mixed problem (1.3) may yield a sub-optimal solution. For this reason, we now propose LV-EGO algorithms that account for the discreteness constraint during the optimization using augmented Lagrangians.
In that prospect, notice that problem (2.3) can be approximated as an optimization problem with an inequality constraint: where is a small positive relaxation constant and · the Euclidean norm. In this reformulation, called relaxed acquisition problem, notice the log scaling of the EI which does not change the solution but improves the conditioning of the problem. Two values of will be discussed in the sequel, = 0 in which case the constraint becomes an equality constraint, min u∈U − φ (t) (u) = 0, and > 0 but small which corresponds to a relaxation of the equality. In the sequel, is normalized with respect to the size of the vector of latent variables and set to = 0.01.
The constrained optimization problem (2.4) is solved through an augmented Lagrangian approach Minoux, [17], Nocedal and Wright [19]. The augmented Lagrangian is that of Rockafellar [24] which, specified for Problem (2.4), is, (2.5) Table 1 Numerical complexities of the algorithms compared at each iteration (for a given t) When = 0, the constraint g (t) ( ) ≤ 0 becomes an equality constraint, g (t) ( ) = 0. In this case, the augmented Lagrangian connected to that of Rockaffelar is that of Hestenes [10] and takes the form Complementary explanations about the augmented Lagrangians are given in Appendix .
Augmented Lagrangians require to specify the values of the Lagrange multiplier, λ, and of the penalty parameter, ρ. The general principle to fix them is to calculate the generalized Lagrange multiplier with a dual formulation Minoux [17]: the dual function D (t) is maximized with respect to the multiplier λ while the penalty parameter ρ should take the smallest value that allows one to find feasible solutions, There are two logics to solve Problem (2.7), both of which have been investigated in this study. Following an idea presented in Le Riche and Guyon [13] for classical Lagrangians, we first propose to approximate the dual function D() as the lower front of the augmented Lagrangians of a finite set of calculated points. The approximated dual is where (X , L ) is a DoE that should not be mistaken for (X, U ), the DoE of the original expensive problem. (λ t , ρ t , x t , t ) comes from solving Problem (2.7) with minimizations over the finite set (X , L ) instead of the initial X × L. The functions in Problem (2.4) are not costly, (X , L ) can be quite large. This approach is called global dual as a global approximation to the dual function is built and maximized. It applies to very general functions, e.g., non differentiable functions. Another advantage of this approach is to allow large changes in the dual space. Figure 10 provides an illustration of the approximated dual function and the effect of ρ on the dual problem. The sketch is done for an inequality constraint, yet it also stands with marginal changes for an equality (cf. Appendix and the caption to the Figure). Under the non-restrictive hypothesis that there is a ρ beyond which the solution to the primal problem (2.4) maximizes the dual function, maximizing the dual function preserves the global aspect of the search. However, the accuracy of the obtained (λ t , ρ t )'s will depend on the DoE. Because there is only one constraint in the current problem and evaluating it does not require calling the costly function, the maximization on λ and ρ is done by enumeration on a 100 × 20 grid and (X , L ) is a 100 LHS sample. The other path to updating the multiplier is to progressively change them based on the minimizers of the augmented Lagrangian at the current step. This updating can be seen as a step in the dual space which makes it general, although it is usually proved by analogy with the Karush Kuhn and Tucker optimality conditions Nocedal and Wright [19] which add unnecessary conditions (like differentiability), cf. Appendix . Let (x t , t ) be a solution to min x, ∈X ×L⊂R nc +n The update formula reads As in Picheny et al. [22], the penalty parameter ρ is simply increased if the constraint is not satisfied, The update scheme based on Eqs. (2.10) and (2.11) is called local dual as a local step in the dual (λ, ρ) space is taken.
ALV-EGO-g variant: with the global dual scheme, cf. Algorithm 4 ALV-EGO-l variant: with the local dual scheme, cf. Algorithm 5 7: recover the discrete pre-image component u t+1 as: u t+1 = arg max u∈U EI (t) (x t+1 , φ (t) (u)) 8: update DoE: add (x t+1 , u t+1 ) and its costly evaluation y(x t+1 , u t+1 ) to the DoE (X, U ). 9: t ← t + 1 10: end while 11: return (x , u ) = arg min (X,U) y(x, u) Algorithm 3 gathers all these changes and is called ALV-EGO. The essential difference between this ALV-EGO algorithm and the vanilla counterpart (Algorithm 2) is that the EI maximization step is constrained so that the link between the discrete variables and the relaxed latent variables (hence the continuous x) is not lost and left to the pre-image step. The coupling between the continuous and the discrete variables is better accounted for. However, a pre-image step (line 7) is still necessary to fully recover a discrete solution in cases when the constraint is relaxed ( > 0). In ALV-EGO like in the vanilla LV-EGO, there are q = 2 continuous latent variable per discrete variable. The global and local dual schemes are further detailed in Algorithms 4 and 5. The continuous minimizations of the Augmented Lagrangians once the Lagrange multipliers are set are always done with 10 random restarts of the COBYLA algorithm Powell [23]. They occur in Algorithm 4, line 4 and Algorithm 5 line 5. To allow comparisons, this implementation is identical to the EI maximization of the vanilla LV-EGO (step 5 of Algorithm 2).  A (x, ; λ t , ρ t ) 6: return x t+1 , t+1 While the local update of λ and ρ might seem less robust, it is the most common implementation and it might be sufficient for the constrained EI maximization. Indeed, between two iterations, the EI changes only locally around the current iterate. Providing the latent mapping functions do not change too much, a local update of λ and ρ seems appropriate. The numerical complexity of the ALV-EGO-g and -l algorithms is essentially the same as that of the vanilla LV-EGO, cf. Table 1. The global dual scheme has a slight extra-cost because of the search for the Lagrange multiplier and penalty parameter that require N DoE extra GP predictions.
Eventually, four variants of ALV-EGO are considered, ALV-EGO-ge or -gi or -le or -li where g stands for global, l for local, e for equality ( = 0) and i for inequality ( > 0).

Algorithms tested
The various algorithms tested are summarized in the Table 2 which provides their names, the type of formulation for the mixed variables, the type of metamodel, the acquisition criterion and the technique to optimize the acquisition criterion. The two possible formulations for the mixed variables are either by searching in a mixed space (MS) or by a formulation in latent variables (LV). All Gaussian processes (GPs) are built with the kerpg  The different algorithms will be tested on the suite of test problems described hereafter.

Test cases
There are 3 analytical test cases and a beam bending problem. The analytical test cases have all been designed by discretizing some of the variables of classical multimodal continuous test functions. The following notation is introduced to describe the discretization: if the continuous variable x i is discretized with u j that takes values in {1, . . . , m j }, then u j (k) = β means x i = β when u j = k, β a scalar, 1 ≤ k ≤ m j . Test case 1: discretized Branin function. We modified the 2 dimensional Branin-Hoo function whose expression is . The discretized Branin, which was already used in Zhang et al. [32], has several local minima as shown in Figure 1a. The global optimum is located at (x 1 , u ) = (0.182; u(3)) with y(x 1 , u ) = 2.791. Test case 2: discretized Goldstein function. As a second test case, the continuous Goldstein function is partly discretized by replacing x 2 by u with 5 levels {u(1) = 0; u(2) = 1/2; u(3) = 1/2; u(4) = 3/4; u(5) = 1}. The discretized Goldstein, which has also been studied in Zhang et al. [32], is drawn in Figure 1b. It has several local optima. The global optimum is located at (x 1 , u ) = (0.5; u(2)) with y(x 1 , u ) = 3. Test case 3: discretized Hartman function. Two variables are discretized in the 6 dimensional Hartman function, We are interested in finding the best compromise between a minimization of the vertical deflection and the total weight, as expressed in the objective where L = 10 + 10 × x 1 , S = 1 + x 2 , u 1 =Ĩ , (3.4) and (x 1 , Here α is the weight balancing the two effects in the objective function. It is chosen as α = 60 so that y has several local minima and only one global minimum. This global solution is (x 1 , x 2 , u 1 ) = (0; 0.43;Ĩ(3)) with output y = 1.287385 × 10 3 .

Experiments setup and metrics
The optimization of each pair of algorithm and test case are repeated 50 times from different initial DoEs. The DoEs are generated by minimax Latin Hypercube Sampling. The size of the DoEs is N DoE = 4 × n c × n d × max(m i ) and a budget of N DoE + 50 evaluations of the true objective function. Remember that the true objective function is supposed to be computationally intensive although it is not in these experiments so that runs can be repeated. The evolution strategies are stopped after N DoE + 50 evaluations of the true function, like the other algorithms. The internal local optimizer, COBYLA, is restarted 5 times during the likelihood maximization and 10 times during the maximization of the acquisition criterion. The focussearch algorithm has a sample size of 1000 with 5 boundary reduction iterations and 3 multi-starts, for a total of 3000 calls to the acquisition criterion.
A summary of the dimensions involved in the different examples is given in Table 3.

Results and discussion
The results are provided with 4 main metrics. The performance of an algorithm is classically described by the median objective function over the 50 repeated runs, calculated at each iteration. The associated measure of dispersion of the performance is the interquartile over the repetitions as a function of the iteration. To discriminate between methods that are rapid but provide rough solutions from the ones that take more time but yield better solutions, the two other metrics are based on the definition of targets. For each test case, a target is a given quantile of all the objectives functions found by all the algorithms throughout all the repetitions. A 10% target is difficult, while a 50% target is the median performance. The third metric is the iteration number at which the median objective function of a given algorithm reaches a given target. The fourth metric is the success rate (given a target), which is the percentage of the runs that do better than the target. The metrics associated to the quantile targets have the advantage that they are normalized with respect to the test cases: thanks to the quantiles, the definitions of an easy, a median or a hard target stands accross the different functions to optimize. The target-based metrics will later be averaged over the different test cases. Let us now review the performances of the algorithms on each test case.

Analytical test functions
Branin function Figure 2 presents the results for the Branin function with the four metrics. On the top left plot, showing the median value for the objective function, it is clear that the two methods that rely on the random forest metamodel (MS-RFO and LV-RFO) are overtaken by all other methods. This indicates that, whether in the mixed or in the latent-augmented space, random forests do not represent sufficiently well the Branin function in comparison to Gaussian processes. Looking at Fig. 2b, it is observed that the fast methods typically have the lowest spread in performance and vice versa. This is expected as non converging runs may yield a wide range of performances. All methods involving the discrete constraint (i.e., the augmented Lagrangians) managed to improve over the LV-EGO performance; and including a mixed metamodel increased significantly the success rate and the median solution for the evolutionary strategy.
Regarding the success rate on Fig. 2d, the methods MS-MKES, LV-EGO, ALV-EGO-li, -le, -ge and -gi were the most prominent, the latter being capable to reach success rates of about 20% for a 10% target. Notice that all these methods contain Gaussian processes. Indeed, the Branin function is easy to represent by a GP whether continuous or mixed. In the same vein, MS-MKES which differs from MS-ES by the use of a GP, clearly benefits from that metamodel. All ALV-methods, which account for the discrete constraint, obtained the best median performances. ALV-EGO-ge in particular found all targets, in the median sense, earlier than the other algorithms as can be seen from Fig. 2c.
A last comment is necessary regarding the bottom of Fig. 2: the plot on the left describes the median performance (in terms of targets reached) while the right plot counts the success rate at reaching a target over all runs. Therefore, some targets are reached on the right by some of the runs of a given algorithm, while they are never atteined on the left by the median of the same algorithm. This comment stands accross all test cases. Goldstein function The experiments done with the Goldstein test function are summed up in Fig. 3. Like with the Branin function, algorithms relying on random forests (LV-RFO and MS-RFO) showed both poor performance (top left plot). The associated high constant interquartile (top right) is that of the best points in the initial designs, which remains unchanged since no better point is found by these algorithms.
Considering the success rates for all targets (bottom plots), it is seen that accounting for the discreteness through a constraint (which is the distinctive feature of ALV-methods) is useful with the Goldstein function: like with Branin, ALV-EGO-gi is the best performer, but the other ALV-follow and outperform LV-EGO. All ALV-strategies almost reach the absolute target of percentile 25% with a rate of 25% or higher. The comparison of the plots Fig. 3c, d also shows that, behind the ALV-methods, LV-EGO has a good median performance (cf. Fig. 3c) but more of the MS-MKES searches manage to find difficult targets (the 25% and 10% quantiles).  Hartmann function Results on the Hartmann function which has 4 continuous and 2 discrete variables, with a total of 9 discrete levels, will be impacted by the sensitivity of the algorithms to an increase in dimension. These results are reported in Fig. 4.
LV-EGO stands out as the best method with respect to all criteria for Hartmann. The next two best methods are LV-RFO and ALV-EGO-gi, followed by MS-RFO and ALV- EGO-ge. This time, LV-RFO and MS-RFO, which both rely on random forests, belong to the efficient methods: random forests gain in relative performance with respect to the GPs when the dimension and the size of the initial DoE increase. For Hartmann, LV-EGO consistently outperforms the ALV-implementations. The importance of keeping the coupling between discrete and latent variables during the optimization seems less crucial, and even somewhat detrimental, in the Hartmann case. We think that this is due to the very tight budget (50 iterations after the initial DoE) which does not allow the convergence of the optimizers, as can be seen in the Plot 4a where the global optimum is not reached. Because the optimum is not really found, constraints on discreteness are superfluous and their handling through the pre-image problem is sufficient. As in the other test cases, MS-ES was slower than the other methods.

Beam bending application
Optimization results Figure 5 summarizes the 4 comparison metrics of all 9 algorithms in the bended beam test case. The ranking of the algorithms is similar to that obtained with the Branin and Goldstein functions. LV-EGO has the best convergence both in terms of median speed (cf. plots of the left column) and accuracy (bottom right plot). ALV-EGO-gi is the second most efficient method followed by ALV-EGO-ge. Again, the algorithms that resort to random forests, LV-RFO and MS-RFO, are the slowest and most inaccurate. They share this counter-performance with MS-ES. Latent variables in the beam application The beam subject to a bending load is a test case that allows to interprete the latent variables. Indeed, the normalized moment of inertia,Ĩ, is a candidate latent variable once it is allowed to take continuous values as it determines, with the continuous cross-section S and the length L, the output (the penalized beam deflection) y in Eq. (3.5). The levels ofĨ (given in Eq. (3.2)) correspond to 3 increasingly hollow profiles of 4 shapes, as illustrated in Fig. 6. Because a relaxedĨ is a possible latent variable, it is expected that the latent variables φ (t) learned from the data will be grouped in the same way asĨ. Looking atĨ values and at Fig. 6, we thus expect, in the image space defined by latent variables, three groups of levels: those corresponding to solid forms (levels {1, 4, 7, 10}), medium-hollow forms (levels {2, 5, 8, 11}) and hollow forms (levels {3, 6, 9, 12}).
For the sake of interpretation, we select 1 run that found the global optimum with the Vanilla LV-EGO algorithm. In Fig. 7, we represent in a color scale the estimated correlation matrix corresponding to the categorical kernel of Eq. (2.2), at iterations [1; 26; 49; 50]. At the beginning of the optimization, at iteration 1, we can see a block-structure which corresponds quite well to the three groups of forms described above. This structure becomes less clear for the next iterations of the LV-EGO algorithm. This may be explained by the fact that the algorithm creates an unbalanced design, with more points in the promising areas according to the optimizers, so that all levels are no longer properly represented.

Summary and discussion
The results of all the previous test cases which are measured through targets can be averaged. For example, the success rate of an algorithm at 25% difficulty is the average of the rates for the 25% quantiles of all test cases. The average results are presented in Fig. 8.
The three leading algorithms out of the 9 tested are ALV-EGO-gi, -ge and LV-EGO. Among them, LV-EGO is slightly better at locating difficult targets (10% quantile) while ALV-EGO-gi (closely followed by ALV-EGO-ge) is more robust at locating 50% targets as can be seen from the median success plot in Fig. 8a. All three algorithms have in common to use latent variables. In particular, these algorithms outperformed MS-MKES which benefits from a Gaussian process but works only in the mixed space, i.e., MS-MKES does not imply latent variables. This shows that latent variables are useful to speed up a Bayesian search for mixed problems.
No clear advantage, on the average, was found for accounting for the discrete nature of the variables through constraints: LV-EGO, which ignores the link between latent variables and the discrete variables until the pre-image problem, is competitive with the best of the augmented Lagrangian ALV-EGO algorithms. We hypothesize that the constraint on latent variables, by creating disconnected feasibility islands around φ (t) (u), u ∈ U, makes the optimization of the acquisition criterion almost as difficult to solve as it originally was in the mixed space, therefore not allowing to fully benefit from the continuity of the X × L space.
In our tests, the global updating of the Lagrange multipliers was always preferable to the local counterparts, ALV-EGO-gi and -ge eclipsing ALV-EGO-li and -le. The ALV-EGO-gi approach, where the discrete constraint is relaxed and turned into an inequality (Eq. (2.4)), works better on the average than ALV-EGO-ge where the constraint is an equality. This illustrates the positive effect of the relaxation , that softens the phenomenon we mentionned above where the feasible domain is broken into disconnected regions.
MS-ES is consistently less efficient than the other algorithms. It was expected, because there is no metamodel to save calls to the function. Furthermore, the sampling is done in the mixed space. The optimizers based on random forests have also rather poor average performances, to the exception of the 6 dimensional Hartmann function. We believe the random forests need a sufficiently large initial DoE (which happened with a higher dimension) to fruitfully guide the search.
As a final comment, we discuss the necessity of re-estimating the latent variables at each iteration. The estimation of the latent variables has an important numerical cost of about qt 3 n d i=1 m i operations at each iteration t (cf. Table 1). It was repeated at each iteration in the algorithms with latent variables considered so far. In the experiment reported in Figure 9, a version of the LV-EGO algorithm is considered where the latent variables are estimated once only, with the initial DoE, yielding the NR-LV-EGO algorithm (for Non Repeated estimation of φ()).
As can be seen in Fig. 9 when comparing LV-EGO with NR-LV-EGO, the re-estimation of the latent variables at each iteration, as implemented in the LV-EGO algorithm and its ALV-EGO variants, considerably improves its performance. An accompanying result is the visualization of the correlation matrix of the discrete variable provided in Fig. 7, where one notices that the correlation (hence the latent variables) evolves in time. Our experiments indicate that this evolution is beneficial to the optimization efficiency.

Conclusions and perspectives
This work has investigated five Bayesian optimization approaches to small and medium size mixed problems that hinged on latent variables. They differed in the way the coupling between the discrete variables and their relaxed pendants, the latent variables, is implemented.
Algorithms involving latent variables were compared to other algorithms directly working in the mixed space and were found to consistently outperform them. LV-EGO and ALV-EGO-gi were more efficient (in terms of calls to the true objective function) than MS-MKES which also benefits from the Gaussian process. These first results show that Accounting for the discrete nature of some variables through a constraint during the relaxed optimization with augmented Lagrangians was not clearly found to further increase the performance of the search as LV-EGO competed equally and even sometimes outperformed the ALV versions of the algorithms. It was also observed that expressing the discreteness as an inequality constraint by adding a tolerance was a better option than expressing it as an equality. The global updating strategy of the Lagrange multipliers, which to the best of our knowledge is original, improved over the more common local updating schemes. Finally, the random forests metamodels did not do as well as the Gaussian processes, whether in their continuous or mixed forms, within the Bayesian optimization algorithm.
Our study needs to be completed in three ways. To fully leverage on the continuous latent space, the gradient of the acquisition function should be analytically calculated and used to guide its maximization. The implementation we proposed creates more latent variables than there are discrete levels, which limits its application to about 10 levels. This limitation can be overcome with under-parameterized kernels based on groups Roustant et al. [25] or warping techniques Deville et al. ? [7]. Mixed Bayesian optimization through latent variables would also gain in credibility if the convergence results of EGO were generalized to it.

Case of an equality constraint
Let us first consider an optimization problem with an equality constraint, At this point, f () and h() are very general functions on a d-dimensional general set X . We only require that X is not empty, that f () and h() are bounded, and that there is at least one solution to (A.1), x ∈ X , which can be attained. f () and h() are not necessarily continuous, a fortiori not necessarily differentiable. With respect to the main body of the article, the notations are simplified in this Section: X stands for the cartesian product of X and L, f (x) generalizes − log(1 + EI (t) (x, )) and h(x) corresponds to g (t) ( ) when = 0. Note that g (t) (), being made of the minimum distance to a discrete set of points (cf. Eq. (2.4)), is not differentiable. g (t) () is the only constraint in the article. This appendix considers one constraint too, but all the results given readily generalize to many constraints by replacing the products by vector scalar products. Problem (A.1) can be equivalently reformulated as where ρ ≥ 0 is a penalty parameter. The two above formulations have the same solution x and the same value of optimal objective function since x is feasible, h(x ) = 0, therefore f (x ) = f (x ) + 1 2 ρh 2 (x ). However, as proved in Minoux [17] and sketched in Fig. 10, there is always a positive lower bound on the penalty parameters, ρ ≥ ρ ≥ 0, such that Problem (A.2) can be equivalently solved through the dual formulation, In this way, the augmented Lagrangian of Hestenes [10] is the classical Lagrangian of the penalized problem (A.2). We write λ , ρ a solution to (A.3). D(λ, ρ) is the lower front of all augmented Lagrangians for varying x at a given λ, ρ. The "global dual" update of (λ, ρ) comes from the resolution of (A.3) where the set X is approximated by the finite subset of samples X. Let us denote a solution at given multiplier and penalty parameter. The function D(λ, ρ) is concave in λ and ρ and h(x(λ, ρ)) is a subgradient with respect to λ Minoux [17]. This is at the root of updating strategies that we called "local dual" earlier and which consist in a gradient step in the dual space, where α > 0 is a step size factor.
More specific update strategies such as those given in Nocedal and Wright [19], Picheny et al. [22] stem from the Karush Kuhn and Tucker (KKT) optimality conditions and require the additional assumption that X ∈ R d and f () and h() are differentiable. At x , since h(x ) = 0 and λ KKT being the KKT multiplier 1 , one has At iteration t, the necessary conditions for x t = x(λ t , ρ t ) to be the minimum of L A (; λ t , ρ t ) are Comparing Eqs. (A.6) and (A.7), x t can be driven to x if The updates (A.5) and (A.8) have the same form, (A.8) is more restrictive since the KKT conditions must apply but the step size is known. The equality constraint of the article (Eq. (2.4) with = 0) is a minimum over distances. It has the additional feature that it is always positive or null, ∀x ∈ X , h(x) ≥ 0. Because of this, if h is locally differentiable around x , ∇h(x ) = 0 since h has a minimum at x . The constraint qualification condition is not satisfied (∇h(x ) does not span a nonempty set) and the KKT conditions do not apply. Another consequence is that the optimal Lagrange multiplier must be positive and the search for λ can be written max λ≥0 D(λ, ρ) in Problem (A.3), as in Problem (2.7).
Proof Assume ρ is large enough for Problem (A.2) to have a saddle point at its optimum, f (x ) ≤ f (x) + ρ/2h 2 (x) + λ h(x) , ∀x where λ is the optimum Lagrange multiplier. Since the optimization problem has an active constraint, there is a point x I that is infeasible, h(x I ) > 0, and has a better objective function than the feasible solution (otherwise the constraint is useless), f (x I ) + ρ 2 h 2 (x I ) ≤ f (x ). If the optimum Lagrange multiplier is negative, λ < 0, f (x I ) + ρ 2 h 2 (x I ) + λ h(x I ) < f (x ) which contradicts the fact that x is a solution to the dual problem.

Inequality constraint
When > 0, Problem (2.4) has an inequality constraint which we rewrite here more simply, min x∈X f (x) such that g(x) ≤ 0 (A.9) The considerations on augmented Lagragian done above for equality constraints readily extend to inequality constraints by introducing a slack variable, Fig. 10 Sketch of Rockafellar's augmented Lagrangian for ρ ≈ 0 in blue and ρ > 0 in red. x 1 is infeasible, x 2 feasible (and g(x 2 ) < −λ/ρ) and x is an optimum with g(x ) = 0. The black highlighted curves are the approximation to the dual function, D(λ) for X = {x 1 , x 2 , x }, for ρ ≈ 0 and ρ > 0. There is no saddle point and a duality gap with the blue set of curves in that x / ∈ arg min x L A (x; λ , ρ ≈ 0) and D(λ ) = min x L A (x; λ , ρ ≈ 0) < L A (x ; λ , ρ ≈ 0), i.e., minimizing the augmented Lagrangian does not lead to the result of the problem. However, by increasing ρ, it is visible that the y-intercept of the infeasible points increase so that one always reaches a state where x = arg min x L A (x; λ , ρ) as in the red set of curves. A similar illustration can be done with the augmented Lagrangian with equality constraint: f (x) + ρ/2h 2 (x) is the y-intercept and h(x) is the slope of the augmented Lagrangian associated to x. The main difference is that all points contribute linearly in terms of λ to L A (x; λ, ρ) min x,s∈X ×R f (x) such that g(x) + s 2 = 0 (A.10) and the expression for the augmented Lagrangian (A.3) becomes The minimization of L A () on the slack variable s can be done analytically: Since s 2 needs to be positive, all cases are summed up in Reinjecting the expression of s 2 into the augmented Lagrangian yields L A (x; λ, ρ) = f (x) + 1 2ρ (max(0, λ + ρg(x))) 2 − λ 2 (A. 13) which is equivalent to the expression of Rockafellar with the 2 cases given in Eq. (2.5) (recall − log(1 + EI) is f (x)).
The update equations for λ are the same as those for the equality case where the slack variable s 2 takes its optimal value. On the one hand, it is possible to solve the approximated dual problem as in (2.8). On the other hand, a step along a subgradient in the dual space can be taken, where α is again a positive step factor. It has the same form as Eq. (2.10). The update (2.10) is fully recovered from the KKT conditions as above for equalities, (A.8), Equations (A.14) and (A.15) are the same but in the latest the step factor α is known and equal to ρ, which comes at the additional expense of the KKT validity conditions.