GLOBAL OPTIMIZATION OF FINITE MIXTURE MODELS

ABSTRACT

The Expectation-Maximization (EM) algorithm is a popular and convenient tool for the estimation of Gaussian mixture models and their natural extension, model-based clustering. However, while the algorithm is convenient to implement and numerically very stable, it only produces solutions that are locally optimal. Thus, EM may not achieve the globally optimal solution in Gaussian mixture analysis problems, which can have a large number of local optima. This dissertation introduces several new algorithms designed to produce globally optimal solutions for Gaussian mixture models. The building blocks for these algorithms are methods from the operations research literature, namely the Cross-Entropy (CE) method and Model Reference Adaptive Search (MRAS).

The new algorithms we propose must efficiently simulate positive definite covariance matrices of the Gaussian mixture components. We propose several new solutions to this problem. One solution is to blend the updating procedure of CE and MRAS with the principles of Expectation-Maximization updating for the covariance matrices, leading to two new algorithms, CE-EM and MRAS-EM. We also propose two additional algorithms, CE-CD and MRAS-CD, which rely on the Cholesky decomposition to construct the random covariance matrices. Numerical experiments illustrate the effectiveness of the proposed algorithms in finding global optima where the classical EM fails to do so. We find that although a single run of the new algorithms may be slower than EM, they have the potential to produce significantly better global solutions to the model-based clustering problem. We also show that the global optimum matters in the sense that it leads to significantly better clustering results.
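As a rough illustration of the Cholesky-based construction (a minimal sketch only, not the exact simulation scheme used by CE-CD and MRAS-CD; the function name random_covariance and the choice of Gaussian and log-normal entries are ours for illustration), a covariance matrix that is positive definite by construction can be generated as follows:

    # Sketch: simulate a random covariance matrix that is positive definite by
    # construction, by sampling a lower-triangular Cholesky factor with strictly
    # positive diagonal entries. (Illustrative only, not the dissertation's scheme.)
    import numpy as np

    def random_covariance(d, rng=None):
        """Draw a d x d positive definite covariance matrix via its Cholesky factor."""
        rng = np.random.default_rng(rng)
        L = np.tril(rng.normal(size=(d, d)))             # lower-triangular Gaussian entries
        np.fill_diagonal(L, np.exp(rng.normal(size=d)))  # positive diagonal ensures full rank
        return L @ L.T                                   # Sigma = L L^T is positive definite

    Sigma = random_covariance(3, rng=0)
    print(np.linalg.eigvalsh(Sigma))  # all eigenvalues are strictly positive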

Furthermore, we provide a theoretical proof of global convergence to the optimal solution of the likelihood function of Gaussian mixtures for one of the algorithms, namely MRAS-CD. This offers support that the algorithm is not merely an ad-hoc heuristic, but is systematically designed to produce global solutions to Gaussian mixture models. Finally, we investigate the fitness landscape of Gaussian mixture models and give evidence for why this is a difficult global optimization problem. We discuss different metrics that can be used to evaluate the difficulty of global optimization problems, and then apply them in the context of Gaussian mixture models.


Background

A mixture model is a statistical model where the probability density function is a convex sum of multiple density functions. Mixture models provide a flexible and powerful mathematical approach to modeling many natural phenomena in a wide range of fields (McLachlan and Peel, 2000). One particularly convenient attribute of mixture models is that they provide a natural framework for clustering data, where the data are assumed to originate from a mixture of probability distributions, and the cluster memberships of the data points are unknown. Mixture models are highly popular and widely applied in many fields, including biology, genetics, economics, engineering, and marketing. Mixture models also form the basis of many modern supervised and unsupervised classification methods such as neural networks or mixtures of experts.
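For concreteness, a K-component Gaussian mixture has density (written here in standard notation; Chapter 2 fixes the notation formally)

    f(x) = \sum_{k=1}^{K} \pi_k \, \phi(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

where \phi(\cdot \mid \mu_k, \Sigma_k) denotes the Gaussian density with mean vector \mu_k and covariance matrix \Sigma_k, and the mixing weights \pi_k form the convex combination referred to above.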

The primary application of mixture models in this dissertation is clustering data. In practice, mixture models are an extremely common clustering tool, used for many different purposes. For example, in biological sequence analysis, clustering is used to group DNA sequences with similar properties. In data mining, researchers use cluster analysis to partition data items into related subsets based on their quantifiable attributes. In the social sciences, clustering may be used to recognize communities within social networks of people. These examples represent only a small portion of the many applications of clustering via mixture models to real-world data.

In mixture analysis, the goal is to estimate the parameters of the underlying mixture distributions by maximizing the likelihood function of the mixture density with respect to the observed data. One of the most popular methods for obtaining the maximum likelihood estimate is the Expectation-Maximization (EM) algorithm. The EM algorithm has gained popularity in mixture analysis primarily because of its many convenient properties. One of these properties is that it guarantees an increase in the likelihood function in every iteration (Dempster et al., 1977). Moreover, because the algorithm operates on the log-scale, the EM updates are analytically simple and numerically stable for distributions that belong to the exponential family, such as the Gaussian. However, one drawback of EM is that it is only a local optimization method; that is, it converges to a local optimum of the likelihood function (Wu, 1983). This is a problem because with increasing data complexity (e.g., higher dimensionality of the data and/or an increasing number of clusters), the number of local optima in the mixture likelihood increases. Furthermore, the EM algorithm is a deterministic method; i.e., it converges to the same stationary point when initiated from the same starting value. Thus, depending on its starting value, the EM algorithm can get stuck in a sub-optimal solution, one that may be far from the global (and true) solution. The mathematical details of mixture models and the EM algorithm are given in Chapter 2 of this dissertation.
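For reference, the textbook E- and M-step updates for a Gaussian mixture take the following form (a schematic Python sketch; the function name em_step is ours for illustration, and the formal derivation appears in Chapter 2):

    # One EM iteration for a K-component Gaussian mixture (standard textbook updates).
    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, weights, means, covs):
        n, K = X.shape[0], len(weights)
        # E-step: posterior membership probabilities (responsibilities)
        resp = np.column_stack([
            weights[k] * multivariate_normal.pdf(X, means[k], covs[k]) for k in range(K)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: closed-form updates that never decrease the likelihood
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        covs = [((resp[:, k, None] * (X - means[k])).T @ (X - means[k])) / nk[k]
                for k in range(K)]
        return weights, means, covs

Iterating em_step from a fixed starting value always converges to the same stationary point, which is precisely the determinism discussed above.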

There exist many modifications of the EM algorithm that address shortcomings or limitations of the basic EM formulation. For instance, Booth and Hobert (1999) propose solutions to overcome intractable E-steps (see also Levine and Casella, 2001; Levine and Fan, 2003; Jank, 2004; Caffo et al., 2003). On the other hand, Meng and Rubin (1993) suggest ways to overcome complicated M-steps (see also Meng, 1994; Liu and Rubin, 1994). The EM algorithm is also known to converge only at a linear rate; ways to accelerate convergence have been proposed in Louis (1982), Jamshidian and Jennrich (1993), and Jamshidian and Jennrich (1997). Yet, to date, very few modifications have addressed global optimization qualities of the EM paradigm.

There have been relatively few attempts at systematically addressing the shortcomings of EM in the mixture model context. Perhaps the most common approach in practice is to simply re-run EM from multiple (e.g., randomly chosen) starting values, and then select the parameter value that provides the best solution obtained from all runs (see Biernacki et al., 2003). In addition to being computationally burdensome, especially when the parameter space is large, this approach is somewhat ad-hoc. More systematic approaches involve using stochastic versions of the EM algorithm such as the Monte Carlo EM (MCEM) algorithm (Wei and Tanner, 1990). Alternative approaches rely on producing ergodic Markov chains that exhaustively explore every point in the parameter space (see e.g., Diebolt and Robert, 1990; Cao and West, 1996; Celeux and Govaert, 1992). Another approach that has been proposed recently is to use methodology from the global optimization literature. In that context, Jank (2006a) proposes a Genetic Algorithm version of the MCEM algorithm to overcome local solutions in the mixture likelihood (see also Jank, 2006b; Tu et al., 2006). Along the same lines, Ueda and Nakano (1998) propose a deterministic annealing EM (DAEM) designed to overcome the local maxima problem associated with EM.
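To make the multi-start heuristic concrete, a minimal sketch of re-running EM from random initializations and retaining the best fit might look as follows (using scikit-learn's GaussianMixture purely for illustration; the synthetic data and the number of restarts are arbitrary choices):

    # Multi-start heuristic: re-run EM from random starting values and keep the
    # fit with the highest log-likelihood. (Illustrative sketch only.)
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, size=(200, 2)), rng.normal(3, 1, size=(200, 2))])

    best = max(
        (GaussianMixture(n_components=2, init_params="random", random_state=s).fit(X)
         for s in range(20)),
        key=lambda gm: gm.score(X),  # mean log-likelihood of each fitted model
    )
    print(best.score(X), best.means_)

Each restart is an independent run of local EM, which is exactly why this strategy becomes computationally burdensome as the parameter space grows.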

Numerous additional clustering methods, beyond extensions of EM, have been developed in recent years. Mangiameli et al. (1996) compare the self-organizing map (SOM) neural network with other hierarchical clustering methods. Milligan (1981) gives a computational study of many algorithms for cluster analysis, including the well-known Ward's minimum variance hierarchical procedure (Ward, Jr., 1963). However, many of these clustering procedures do not incorporate ideas from the theory of global optimization.

Two methods from the operations research literature that are designed to attain globally optimal solutions to general multi-extremal continuous optimization problems are the Cross-Entropy (CE) method (De Boer et al., 2005) and Model Reference Adaptive Search (MRAS) (Hu et al., 2007). The CE method iteratively generates candidate solutions from a parametric sampling distribution. The candidates are all scored according to an objective function, and the highest-scoring candidates are used to update the parameters of the sampling distribution. These parameters are updated by taking a convex combination of the sampling parameters from the previous iteration and sample statistics of the top candidate solutions. In this way, the properties of the best candidates in each iteration are retained. MRAS is similar to CE in that it also solves continuous optimization problems by generating candidate solutions in each iteration. However, the primary difference is that MRAS utilizes a different procedure for updating its sampling parameters, leading to a more general framework in which the theoretical convergence of a particular instantiated algorithm can be rigorously proved (Hu et al., 2007). In this dissertation, we propose methods that apply the CE and MRAS updating principles to the global optimization of Gaussian mixture models.
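To illustrate the CE updating principle described above, the following is a minimal sketch of the CE method for a generic continuous maximization problem (the Gaussian sampling distribution, elite fraction, and smoothing weight alpha are illustrative choices, not the settings used later in this dissertation):

    # Minimal Cross-Entropy method sketch for maximizing a continuous objective,
    # illustrating the smoothed (convex-combination) parameter update.
    import numpy as np

    def cross_entropy_maximize(f, mu, sigma, n_samples=100, elite_frac=0.1,
                               alpha=0.7, n_iter=50, rng=None):
        rng = np.random.default_rng(rng)
        n_elite = max(1, int(elite_frac * n_samples))
        for _ in range(n_iter):
            # generate candidate solutions from the current sampling distribution
            X = rng.normal(mu, sigma, size=(n_samples, len(mu)))
            # keep the highest-scoring candidates (the "elite" samples)
            elite = X[np.argsort([f(x) for x in X])[-n_elite:]]
            # smoothed update: convex combination of old parameters and elite statistics
            mu = alpha * elite.mean(axis=0) + (1 - alpha) * mu
            sigma = alpha * elite.std(axis=0) + (1 - alpha) * sigma
        return mu

    # Example: the sampling distribution concentrates near the maximizer of -||x||^2.
    print(cross_entropy_maximize(lambda x: -np.sum(x**2), mu=np.ones(3) * 5.0,
                                 sigma=np.ones(3) * 3.0, rng=1))

In the mixture setting, the candidate solutions are the mixture parameters themselves and the objective function is the mixture log-likelihood; generating valid (positive definite) covariance candidates is exactly the difficulty that motivates the CE-EM, MRAS-EM, CE-CD, and MRAS-CD variants introduced above.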