Constrained Optimization for a Subset of the Gaussian Parsimonious
Clustering Models

Abstract

The expectation-maximization (EM) algorithm is an iterative method for finding maximum likelihood estimates when data are incomplete or are treated as being incomplete. The EM algorithm and its variants are commonly used for parameter estimation in applications of mixture models for clustering and classification. This despite the fact that even the Gaussian mixture model likelihood surface contains many local maxima and is singularity riddled. Previous work has focused on circumventing this problem by constraining the smallest eigenvalue of the component covariance matrices. In this paper, we consider constraining the smallest eigenvalue, the largest eigenvalue, and both the smallest and largest within the family setting. Specifically, a subset of the GPCM family is considered for model-based clustering, where we use a re-parameterized version of the famous eigenvalue decomposition of the component covariance matrices. Our approach is illustrated using various experiments with simulated and real data.

The expectation-maximization (EM) algorithm (Dempster
et al., 1977) is an iterative procedure for finding maximum likelihood estimates when data are incomplete or treated as such. Although the EM algorithm is commonly attributed to Dempster
et al. (1977), Titterington
et al. (1985, Section 4.3.2) point out that similar treatments had previously been employed by Baum
et al. (1970), Orchard and
Woodbury (1972), and Sundberg (1974). The EM algorithm involves the iteration of two steps until convergence is attained. In the expectation step (E-step) the expected value of the complete-data log-likelihood is computed and then, in the maximization step (M-step), this expected value is maximized with respect to the model parameters. Here, ‘complete-data’ refers to the missing plus observed data.

In this paper, we are concerned with the application of the EM algorithm to Gaussian mixture models with applications in clustering and classification. The likelihood surface for Gaussian mixture models is known to be unbounded and the presence of local maxima is extremely common; one may go so far as to argue that the surface is singularity riddled (Titterington
et al., 1985).
When using the EM algorithm to fit Gaussian mixture models, problems like convergence to a spurious local maxima tend to arise when one fitted component has much smaller variance than the others (cf. Biernacki, 2004); an illustrative example of this phenomenon is given by Ingrassia and
Rocci (2007), who we will follow by referring to such fitted components as ‘degenerate’. The behaviour of the EM algorithm near a degenerate solution has been studied by Biernacki and
Chrétien (2003) and Ingrassia and
Rocci (2007), who tackle the problem by constraining the value of the smallest eigenvalue of the component covariance matrices. In this paper, we consider constraining the smallest eigenvalue, the largest eigenvalue, and both the smallest and largest eigenvalues. While this approach is applicable to Gaussian mixture models in general, we impose these constraints while maintaining a parsimonious covariance structure. We focus on a subset of the famous parsimonious Gaussian clustering models (GPCM) family of mixture models of Celeux and
Govaert (1995), cf. Section 2.1.

This paper illustrates the benefit of constraining the range (minimum and maximum) of eigenvalues in two applications. In what we call ‘dynamic initialization’, we begin with at a random starting value and impose stringent constraints on the eigenvalues and these constraints are slowly lifted during the first k iterations of the EM algorithm. Dynamic initialization maintains the monotonicity property of the EM algorithm while reducing the risk of converging to a degeneracy. We also show that in most cases, dynamic initialization increases the changes of converging to a solution with higher log-likelihood than when compared to using the ‘standard’ (with no dynamic initialization) EM algorithm with the same starting values.
The other application of constraining the range of eigenvalues is a more direct one. We use a constraint on the range of eigenvalues to fit the relevant GPCM models to two well known data sets within the model-based clustering literature. For each data set, we find solutions that are an improvement over the GPCM models; however, we must choosing constraints a priori (cf. Section 5).

The remainder of the paper is laid out as follows. In Section 2.1, the GPCM family of models is introduced. Then parameter estimation while constraining the largest and/or smallest eigenvalue is discussed and our methodology for constraining both is described (Section 3). Our approach is illustrated via several experiments using real and simulated data (Section 4) and the paper concludes with discussion and suggestions for future work (Section 5).

2.1 The GPCM and MCLUST families

Suppose that we observe np-dimensional data vectors x1,…,xn and that each must be assigned to one of G clusters. The Gaussian mixture model-based approach has become popular for such problems. When mixture models are used for clustering in this way, the locution ‘model-based clustering’ is used. The Gaussian model-based clustering likelihood is

L(\boldmathϑ∣x1,…,xn)=n∏i=1G∑g=1πgϕ(xi∣\boldmathμg,Σg),

where πg>0, such that ∑Gg=1πg=1, are mixing proportions and ϕ(xi∣\boldmathμg,Σg) is the density of a multivariate Gaussian random variable with mean \boldmathμg and covariance matrix Σg.
Extensive details on finite mixture models and their applications are given by Titterington
et al. (1985), McLachlan and
Basford (1988), McLachlan and
Peel (2000), and Frühwirth-Schnatter (2006).

Unless p is small relative to n, model fitting issues arise with these models because of the large number of covariance parameters; there are p(p+1)/2 parameters for each component covariance matrix Σg. Celeux and
Govaert (1995) introduce parsimony into these Gaussian mixture models by proposing and giving estimation algorithms for fourteen different eigen-decompositions of the covariance matrix (Table 1). These decompositions have the form Σg=λgDgAgD′g, where Dg is the matrix of eigenvectors, Ag is a diagonal matrix with entries proportional to the eigenvalues, and λg is the associated constant of proportionality. The resulting 14 models are called Gaussian parsimonious clustering models (GPCMs). Fraley and
Raftery (1998) implemented ten of these fourteen models, based on the algorithms given in Celeux and
Govaert (1995), as the popular mclust package for the R software (R Core Team, 2013). Such has been the popularity of the MCLUST package (Fraley
et al., 2012) that only ten of the fourteen GPCMs are routinely employed. Browne and
McNicholas (2013b) implemented all fourteen models in the mixture package for the R software, using the algorithms given in Celeux and
Govaert (1995) and Browne and
McNicholas (2013a).

Mod.

Volume

Shape

Orient.

Σg

Free covariance parameters

EII

Equal

Spherical

–

λI

1

VII

Variable

Spherical

–

λgI

G

EEI

Equal

Equal

Axis-Aligned

λA

p

VEI

Variable

Equal

Axis-Aligned

λgA

p+G−1

EVI

Equal

Variable

Axis-Aligned

λAg

pG−G+1

VVI

Variable

Variable

Axis-Aligned

λgAg

pG

EEE

Equal

Equal

Equal

λDAD′

p(p+1)/2

EEV

Equal

Equal

Variable

λDgAD′g

Gp(p+1)/2−(G−1)p

VEV

Variable

Equal

Variable

λgDgAD′g

Gp(p+1)/2−(G−1)(p−1)

VVV

Variable

Variable

Variable

λgDgAgD′g

Gp(p+1)/2

EVE

Equal

Variable

Equal

λDAgD′

p(p+1)/2+(G−1)(p−1)

VVE

Variable

Variable

Equal

λgDAgD′

p(p+1)/2+(G−1)p

VEE

Variable

Equal

Equal

λgDAD′

p(p+1)/2+(G−1)

EVV

Equal

Variable

Variable

λDgAgD′g

Gp(p+1)/2−(G−1)

Table 1: Nomenclature, covariance structure, and number of free covariance parameters for each member of the GPCM family; all models are available within mixture whereas the last four are not included within mclust.

In this paper, we consider a reduced set of GPCMs, that we refer to as the rGPCM family
(cf. Table 2). We are, in effect, reparameterizing the GPCM covariance structure to two parameters by writing Bg=λgAg and B=λA, where Bg and B are unconstrained matrices of eigenvalues. Not including λg as a parameter ties together component volume and shape in terms of whether they are constrained, i.e., Bg=λgAg and B=λA, while allowing component orientation Dg to vary separately. One may argue that the paramterization Σg=DgBD′g is attractive because it has a ‘natural’ eigenvalue interpretation; however, it inherently has less flexibility than the GPCM models (i.e., Σg=λDgAD′g) and this loss of flexibility needs to be considered in context with the benefits of our constrained eigenvalue decomposition.

Model

Volume/Shape

Orientation

Σg

Free covariance parameters

1I

Equal

-

λI

1

GI

Variable

-

λgI

G

EI

Equal

Axis-Aligned

λA

p

VI

Variable

Axis-Aligned

λgAg

pG

EE

Equal

Equal

λDAD′

p(p+1)/2

EV

Equal

Variable

λDgAD′g

Gp(p+1)/2−(G−1)p

VV

Variable

Variable

λgDgAgD′g

Gp(p+1)/2

VE

Variable

Equal

λgDAgD′

p(p+1)/2+(G−1)p

Table 2: Nomenclature, covariance structure, and number of free covariance parameters for each member of the rGPCM family; equivalent models are available within mixture but no equivalent for the last model is available within mclust.

We develop an EM algorithm to fit the rGPCM models while imposing constraints on the eigenvalues. In addition, we investigate algorithms that slowly lift the constraints on the eigenvalues either from below, above, or both. Ingrassia (2004) shows that having lower and upper bounds on the eigenvalues does not destroy the monotonicity property of the EM algorithm.
Furthermore, keeping the eigenvalues from going below a threshold prevents degeneracy of the log-likelihood (Ingrassia and
Rocci, 2011). However, algorithms that only have dynamic constraints from below prolong degeneracy. We show that having dynamic constraints from above and below reduces the risk of degeneracy and yields higher log-likelihood values at convergence.

2.2 Parameter Estimation and Model Selection

Parameter estimation for each member of the GPCM family is carried out using an EM algorithm. The EM algorithm is used to obtain maximum likelihood estimates when data are incomplete or are taken to be incomplete. In mixture model-based clustering applications, the missing data are the component membership labels, which we denote by z1,…,zn, where zig=1 if observation i is in component g and zig=0 otherwise. These missing data together with the observed data x1,…,xn are known as the complete-data, and the E-step of the EM algorithm involves computation of the expected value of the complete-data log-likelihood. For Gaussian mixture model-based clustering, the complete-data log-likelihood is

The M-step involves maximizing the expected value of Equation 1 with respect to the model parameters. The E- and M-steps are iterated until convergence. Details on the EM algorithm parameter estimates for the GPCMs are given by Celeux and
Govaert (1995) and Fraley and
Raftery (1999).

One feature of these EM algorithms is the importance of starting values: mclust utilizes a Gaussian model-based agglomerative hierarchical clustering procedure to obtain starting values (cf. Murtagh and
Raftery, 1984; Banfield and
Raftery, 1993). The mixture package allows the user freedom in selecting starting values, with k-means clustering (Hartigan and
Wong, 1979) results being the default.
The clusterings for a given model arise as the maximum a posteriori (MAP) expected values (i.e., probabilities) of the Zig. To compute these MAP values, we compute the expected values

with the parameter estimates ^\boldmathϑ taking the converged values. Then, MAP{^zjg}=1 if maxg{^zjg} occurs at component g and MAP{^zjg}=0 otherwise. Note that the expected values ^zig are computed in each E-step, which is why we emphasize that in computation of the MAP classifications ^zig depends on the parameter values at convergence.

The Bayesian information criterion (BIC; Schwarz, 1978) is used to select the number of components and the covariance structure (i.e., the model). Although it is by far the most popular model selection criterion for mixture model-based clustering, the regularity conditions for the asymptotic approximation used in the development of the BIC are not generally satisfied by mixture models (cf. Keribin, 1998, 2000). There is, however, plenty of practical evidence to support its use in mixture model selection (e.g., Dasgupta and
Raftery, 1998; Fraley and
Raftery, 2002) and we use it for the analyses herein. The BIC is given by −2l(x,^\boldmathϑ)+mlogn,
where m is the number of free parameters, l(x,^\boldmathϑ) is the maximized log-likelihood, and ^\boldmathϑ is the maximum likelihood estimate of ϑ. Dasgupta and
Raftery (1998) proposed using the BIC for mixture model selection, where the model with the lowest BIC is selected.

3.1 Constrained Covariance Updates

Our constrained EM algorithm is an alternating conditional maximization algorithm (Meng and van
Dyk, 1997), where the matrix of eigenvalues Bg for each group is maximized conditional on Dg and then vice versa. Note, that Bg or Dg can be equal or varying across groups, depending on the model (cf. Table 2). These conditional updates can be repeated m times or until a convergence criteria is achieved. Ingrassia and
Rocci (2007, 2011) give an algorithm for our VV model when the smallest eigenvalue is constrained and show that these constraints maintain the monotonicity property. Herein, we introduce and illustrate parameter estimation with constraints on both the smallest and largest eigenvalues.

Let [a,b] be the range of allowable eigenvalues and let Sg be the sample covariance matrix for group g, i.e., Sg=(1/ng)∑ni=1^zig(xi−%
\boldmathμg)(xi−\boldmathμg)′.
Consider unconstrained Bg. From Celeux and
Govaert (1995), we have

v(t+1)g=diag{D(t)gSgD(t)g},

where v(t+1)g=(v(t)g1,…,v(t)gp) and the superscripts in parentheses denote iteration number. If we let bg1,…,bgp be the diagonal elements of B(t+1)g, then the constrained EM uses the updates

b(t+1)gk=min{b,max(v(t+1)gk,a)}.

Now, suppose we set Bg=B. Then,

v(t+1)=diag{G∑gπ(t+1)gD(t)gSgD(t)g},

where v(t+1)=(v(t+1)1,…,v(t+1)p).
If we let b1,…,bp be the diagonal elements of B(t+1), then the constrained EM algorithm sets

b(t+1)gk=min{b,max(v(t+1)k,a)}.

Consider unconstrained Dg and
let Sg=PgQgP′g be the eigen-decomposition of Sg. Then we set
D(t+1)g=Pg.

Finally, consider Dg=D. This update can be carried out using Flury’s method (see Flury, 1984; Celeux and
Govaert, 1995, for details).

3.2 Dynamic Initialization

We run the EM algorithm for each member of the rGPCM family as described in Section 3.1, but for the first k iterations we use a sequence of k constraints S={(a1,b1),…,(ak,bk)}. We could use such a set S; however, we instead simplify and use a sequence v={0,…,1}, where v is some sequence from 0 to 1. We also use the mapping

(ai,bi)=(a(vi),b(vi))=β(1−vi,1−log(1−vi)),

(3)

where β>0.
These equations are set up so that v1=0 and vk=1 implies that (a1,b1)=(β,β) and (ak,bk)=(0,∞). In this paper, we have set β=1 because we scale the data in our clustering applications (Section 4).

4.1 Performance Assessment

Although our examples are all genuine clustering problems, i.e., no knowledge of labels are used, the labels are known in each case; therefore, we can asses the performance of our algorithms for the rGPCM family. We use classification tables and adjusted Rand indices (ARI; Hubert and
Arabie, 1985) to summarize classification accuracy. The ARI corrects the Rand index (Rand, 1971) for chance agreement. An ARI value of 1 indicates perfect class agreement and a value of 0 would be expected under random classification.

4.2 Simulation Study 1

The first data set consists of n=200 observations generated from a four-dimensional two-component (n1=100 and n2=100) mixture of multivariate t-distributions with scale matrix Σg=λDgAD′g=DgBD′g and 5 degrees of freedom. As we would expect, the resulting clusters are clearly heavy tailed with several outlying points (Figure 1). Using the output from k-means clustering as the initialization for z1,…,zn, we run our algorithm for G=1,…,6. The eigenvalues are constrained to be within the smallest and largest eigenvalues of the sample covariance matrix of the data.

The mixture package is used to fit the GPCM models to facilitate comparison with the rGPCM family. The chosen rGPCM model is a two-component EV model that gives perfect classification (ARI=1), whereas the selected GPCM model is a four-component VEV (Σg=λgDgAD′g) model with ARI=0.75 (Table 3). In the absence of a constraint on eigenvalues, the chosen GPCM model has additional components with relatively high variance to accommodate the heavier tails (Figure 2). From Table 3, it is clear that with appropriate merging of components, the classification performance of the best GPCM model is very close to that of the best rGPCM model. BIC values for all rGPCM models are given in Appendix A.

True ∖ Estimated

rGPCM

GPCM

1

2

1

2

3

4

1

100

92

8

2

100

1

21

78

Table 3: Classifications for the best rGPCM and GPCM models, respectively, for simulation study 1.Figure 2: Predicted classifications using the best GPCM model for simulation study 1.

4.3 Simulation Study 2

The second data set consists of n=200 observations generated from a three-dimensional two-component Gaussian mixture model with Σg=λDAD′=DBD′, n1=100, and n2=100. As shown in Figure a, the components are very well separated in these simulated data.

(a) Without noise.

(b) With 5% uniform noise.

Figure 3:
Simulated data from simulation study 2.

Again, we run our algorithms for the rGPCM models for G=1,…,6 with the eigenvalues constrained to lie within the smallest and largest eigenvalues of the sample covariance matrix. For the same data, we also run the GPCM models for G=1,…,6. Both algorithms selected a G=2 component EE model and give perfect classifications. BIC values for the rGPCM models are given in Appendix A.

We then added 5% uniform noise to the data set (n3=10 noise observations, see Figure b) and repeated the above analyses for the rGPCM and the GPCM families, respectively. The selected rGPCM model is a two-component EE model that absorbed the noisey observations into the two Gaussian components (Table 4). On the other hand, the chosen GPCM model is a eight-component EEE model (Σg=λgDgAD′g), where four of the components contain just one or two of the noisy points and the other two components contain the true Gaussian components along with one and three noisy points, respectively (Table 4). Again, BIC values for the rGPCM models are given in Appendix A.

True ∖ Estimated

rGPCM

GPCM

1

2

1

2

3

4

5

6

1

100

100

2

100

100

3

4

6

1

2

2

1

1

3

Table 4: Classifications for the best rGPCM and GPCM models, respectively, for simulation study 2.

We have illustrated the effect of only a very small proportion of outliers on the EM algorithms used for parameter estimation for the GPCM family versus the application of our algorithms for the rGPCM family. Please note that we are not proffering the rGPCM solution as being ideal in this context; rather, we are suggesting that our parameter estimation approach led to rGPCM results that are preferable to those obtained using the more traditional parameter estimation approach for the GPCM family. Effective methods for clustering noisy data include trimmed clustering (e.g., García-Escudero et al., 2008, 2010) and mixtures of contaminated distributions (Punzo and
McNicholas, 2013).

4.4 Two Well-Known Data Sets

Forina
et al. (1986) recorded 28 chemical and physical properties of three types of wine (Barolo, Grignolino, Barbera) from the Piedmont region of Italy. A subset of 13 of these variables is available in the gclus package (Hurley, 2004) for R.
The leptograpsus crabs data set can be found in the MASS package (Venables and
Ripley, 1999) in R. These data contain five physical measurements on two different colours of crab, further separated into gender. MCLUST is known to do poorly on these data; Raftery and
Dean (2006) used these data to illustrate the superiority of their variable selection technique over MCLUST.

4.5 Illustrating Convergence From Random Starting Values

For each data set, we generate 50 random starting points. We run the four types of EM algorithm until convergence for G=2,…,6 components. Specifically, we run the EM algorithm in four circumstances: no constraints, lower constraints, upper constraints, and both upper and lower (range) constraints on the eigenvalues. For each dynamic initialization, we use an equidistant sequence of length 25 from 0 to 1.
For each run, we noted which algorithms achieved the highest converged log-likelihood value for a particular starting value. This is because all four algorithms could, and sometimes did, converge to the same solution.

The results are given Tables 11 to 12 in Appendix B. By inspection of these tables, the value of imposing eigenvalue constraints is clear. Specifically, the model most often converges to the ‘best’ value of the log-likelihood is very rarely from an unconstrained EM algorithm. Furthermore, the unconstrained EM algorithm yields far more degenerate solutions that its constrained counterparts.

4.6 Constrained Eigenvalues: A Comparison With The GPCM Family

For each data set, we compare results for the rGPCM models using the constrained eigenvalue approach to parameter estimation to results for the GPCM models with the traditional EM algorithm approach to parameter estimation.
When estimating parameters for the rGPCM family, we constrain eigenvalues to be within the smallest and largest eigenvalues of the sample covariance matrix of each data set, i.e., [0.1033,4.7057] for the wine data and [0.0017,4.7888] for the crabs data. BIC values for the rGPCM models for all data sets are given in Appendix A.

For the wine data, the best rGPCM model is a G=3 component VE model with an ARI of 0.96 (Table 5). The best GPCM model is a G=3 component model with an ARI of 0.90.

1

2

3

Barolo

58

1

Grignolino

1

70

Barbera

48

Table 5: Classification table associated with the best GPCM model for the wine data.

For the crabs data, the best rGPCM model is a G=4 component EV model with an ARI of 0.80 (Table 6). For the crabs data, the best GPCM model is a G=9 component model with an ARI of 0.50.

1

2

3

4

Blue & Male

38

12

Blue & Female

50

Orange & Male

50

Orange & Female

5

45

Table 6: Classification table for the best rGPCM model for the crabs data.

In this paper, we introduced a constrained eigenvalue parameter estimation procedure for the eight of the parsimonious Gaussian clustering models of Celeux and
Govaert (1995). For convenience, we have referred to this subset of models as the rGPCM family. Please note that when we discuss the performance of the rGPCM family herein, we are referring to the performance of those models with our constrained eigenvalue parameter estimation procedure. We are not suggesting that the rGPCM models are in any sense better than the other GPCM models when the same parameter estimation methods are used.

We illustrated our approach through extensive simulation studies and two real data applications. In one application, we studied dynamic initialization, where we begin with random starting values and impose stringent constraints on the eigenvalues which are slowly lifted during the first 25 iterations of the EM algorithm. This approach is shown to maintain the monotonicity of the EM algorithm while reducing the risk of converging to a degeneracy.
In another application, we constrained the range of eigenvalues and fit the rGPCM models to two well known data sets. In most cases, we find solutions that are an improvement over the famous GPCM models; however, we require constraints to be chosen a priori. Constraining the eigenvalues in this way can be viewed as a form of regularization or as placing a uniform prior on the eigenvalues. Future work will involve studying different approaches to estimating the range of allowable eigenvalues.

The fact that the rGPCM models outperformed the GPCM models on both simulation studies and real data sets shows that the eigenvalue constraints we use can lead to improved classification performance. Furthermore, if one follows the approach of only running the rGPCM models for which our eigenvalue constraints can be used, this is tantamount to merging the volume (λ or λg) and shape (A or Ag) parameters from the famous eigen-decomposition used in the GPCM family. Therefore, the importance of having separate volume and shape parameters deserves further consideration. Furthermore, even if it can be useful in some scenarios, the value of including the component volume as a separate parameter has to be judged in context with the fact that including it prevents application of the constrained eigenvalue approach to parameter estimation.