One of my goals for 2016 is to improve my ability to understand different statistical/machine learning problems. I have an educational background in economics, so I have spent a good deal of time studying and using linear modeling in it’s various forms. However, I have spent little time with the various classification techniques. Gaussian mixture modeling has several advantages as a good place to start. It is fairly simple, introduces the concept of expectation-maximization, and belongs to a family of algorithms all with the same form.

The gaussian mixture model (GMM) is a modeling technique that uses a probability distribution to estimate the likelihood of a given point in a continuous set. For the GMM, we assume that our classes bear the markings of a normally distributed density function. When the two classes are clearly defined, the guassian distribution works well as an estimate for class-conditional probabilties.

In practice, it is not usually a great idea to implement your own learning algorithm. But, the exercise can be useful as a learning tool. I will be implementing my own Gaussian Mixture Model to show how it works and see if I am able to get results that mimic the Mclust packge. By the end of this article, I hope in the very least to provide you with good resources for learning the GMM on your own.

### Linear Disciminant Analysis

Before moving directly into the main model I want to note that if you are familiar with linear discriminant anlaysis (LDA), then you may have already seen the formula for the GMM. LDA is a supervised learning technique where the class priors are known and the class means and covariances can be estimated using training data. LDA works much like the gaussian mixture model by estimating an *a posterior* probability (Elements of Statistical Learning 107-08).

LDA takes the form: \[r_{ic} = \frac{\pi_c N(x_i: \mu_c, \Sigma_c)}{\sum_{c'}{\pi_{c'}N(x_i: \mu_{c'},\Sigma_{c'})}}\] where \(\pi_c\) is a prior for each class estimated from our training data and \(N(x_i:\mu_c,\Sigma_c)\) is the density function given a mean, \(\mu\), and covariance matrix, \(\Sigma_c\). The denominator is the summation of each class prior multiplied by the probability density which effectively acts as a normalizing constant so that \(\sum_c{r_{ic}}=1\). When we compute a class-conditional probability for each observation, we are given \(c\) number of probabilities that the observation \(x_i\) belongs to the class \(c\). We choose the class which maximizes that probability (Elements, 107).

### Expectation Maximization

There are times, however, when the class for each observation is unknown and we wish to estimate them. When this is the case, we can use the gaussian mixture model and the *Expectation-Maximization algorithm (EM)*.

The EM algorithm is a two step process. First is the *E-step* where the expectation is calculated. For the Gaussian Mixture Model, we use the same form of bayes theorm to compute expectation as we did with LDA. The equation we end up using is the same:

\[r_{ic} = \frac{\pi_c N(x_i: \mu_c, \Sigma_c)}{\sum_{c'}{\pi_{c'}N(x_i: \mu_{c'},\Sigma_{c'})}}\] In this form of bayes theorem, \(\pi_c\) is a vector of mixing components for the gaussian density where \(\sum{\pi_c = 1}\). \(N(x_i: \mu_c \Sigma_c)\) is notation for our probability density function where we compute the probability of \(x_i\) given \((\mu, \Sigma)\). The denominator is the sum of the priors multiplied by the gaussian probabilities. An \(r_i\) will be computed for each row-vector, \(x_i\), and with each mixing component, \(\pi_c\). Once a probability for each class has been computed, choose the most likely class.

### The Multivariate Guassian Distribution

First, we need to define \(N(\mu, \sigma)\) which is our gaussian density function. The normal distribution can take on a univariate or multivariate form. For the example below, I will use the multivariate form:

\[N(\mu_k, \Sigma_k) = \frac{e^{-\frac{1}{2}(x_i-\mu_k)^T \Sigma^{-1} (x_i-\mu_k) }}{\sqrt{|2\pi\Sigma|}}\]

The following R code uses the same equation to calculate the multivariate probability density.

```
# Multivariate Normal PDF Function Given a matrix x, mu (mean), and
# sigma (covariance), we can calculate the probability density for each
# row using the apply function. The function returns a column vector of
# probabilities.
mvpdf <- function(x, mu, sigma) {
if (det(sigma) == 0) {
warning("Determinant is equal to 0.")
}
apply(x, 1, function(x) exp(-(1/2) * (t(x) - mu) %*% MASS::ginv(sigma) %*%
t(t(x) - mu))/sqrt(det(2 * pi * sigma)))
}
```

With this function, we can estimate the probability that a point lies within the a distribution with paramaters \(\mu\) and \(\Sigma\). Gaussian mixture models work well when the class densities are clearly separated and well defined. So, as long as the classes do not overlap and the data is truly distributed normally, we can find a good estimate.

### Getting Initial Parameters

Gaussian mixture models assume that each latent class has a different set of means and covariances. However, since each class is unknown we must begin by intializing these parameters and itteratively updating. Initialization methods are an important step in mixture modeling which can greatly affect the consistency and accuracy of your results. Thus, it’s worth briefly mentioning a few techniques.

One approach is to randomly sample \(k\) number of means and covariances within the range of our data; however, since the EM algorithm is a “hill climbing” (i.e. maximization) algorithm, randomly choosing your starting points can alter performance. Given certain random values, my custom algorithm created for this article would often fail to produce an estimate and R would begin to return *NA* values after several iterations. The R package *mclust* addresses the issue by selecting initial means and covariances through the application of hierarchical clustering – an unsupervised clustering technique which iteratively collects points/groups together until the desired number of clusters is found. After classifying each point to a cluster, we can initialize our mixture, mean, and covariance parameters within those groups (Melnykov, Maitra, 2010).

### Maximization Step

Once the e-step has been completed, we need to maximize our results. Listed below is each equation we use during the maximization step: \[mc = \sum_i{r_{ic}}\] \[\pi_c=\frac{m_c}{m}\] \[u_c = \frac{1}{m_c} \sum_i{r_{ic}x_i}\] \[\Sigma_c = \frac{1}{m_c} \sum_i{r_{ic}(x_i - \mu_c)^T(x_i-\mu_c)}\] Now for an explanation of what is happening here. Our first task it to update the mixing components (i.e. prior probabilities). For each point \(x_i\), we will have calculated an \(r_{ic}\) which will give us one column vector per class. Our first equation for \(mc\) tells us the responsibility assigned to each class. Since \(\sum{m_c}=N\) where \(N\) is equal to the number of rows in our dataset, we can update the proportions using the second equation for \(\pi_c\) updating the mixing components. \(\mu_c\) and \(\Sigma_c\) are updates to our gaussian density parameters where we calculate the mean and covariances normally, but use weights \(r_{ic}\) and \(\frac{1}{m_c}\) to weight each point in our mean and covariance estimation.

When the maximization step is complete, we return to the beginning and repeat the process until a maximum is found. When this happens, the class means and covariances will not be greatly altered with each iteration.

### Code and Example

```
# Plot our dataset.
plot(iris[, 1:4], col = iris$Species, pch = 18, main = "Fisher's Iris Dataset")
```

If you have used R long enough, you’re probably familiar with fisher’s iris dataset. There are three species of flowers being measured with four continuous variables. In this dataset we have a good mix of variables that explain the differences between the classes (Petal Length and Petal Width) and those which show the classes are mixed (Sepal Length and Sepal Width).

Now let’s assume that we do not know which class each \(x_i\) belongs. Gaussian Mixture Modeling can help us determine each distinct species of flower.

Let’s start by intializing the parameters. The code below borrows from the mclust package by using it’s hierarchical clustering technique to help create better estimates for our means.

```
# Mclust comes with a method of hierarchical clustering. We'll
# initialize 3 different classes.
initialk <- mclust::hc(data = iris, modelName = "EII")
initialk <- mclust::hclass(initialk, 3)
# First split by class and calculate column-means for each class.
mu <- split(iris[, 1:4], initialk)
mu <- t(sapply(mu, colMeans))
# Covariance Matrix for each initial class.
cov <- list(diag(4), diag(4), diag(4))
# Mixing Components
a <- runif(3)
a <- a/sum(a)
```

Next we’ll use the equations defined above to calculate expectation.

```
# Calculate PDF with class means and covariances.
z <- cbind(mvpdf(x = iris[, 1:4], mu = mu[1, ], sigma = cov[[1]]), mvpdf(x = iris[,
1:4], mu = mu[2, ], sigma = cov[[2]]), mvpdf(x = iris[, 1:4], mu = mu[3,
], sigma = cov[[3]]))
# Expectation Step for each class.
r <- cbind((a[1] * z[, 1])/rowSums(t((t(z) * a))), (a[2] * z[, 2])/rowSums(t((t(z) *
a))), (a[3] * z[, 3])/rowSums(t((t(z) * a))))
# Choose the highest rowwise probability
eK <- factor(apply(r, 1, which.max))
```

Now let’s begin to update

```
# Total Responsibility
mc <- colSums(r)
# Update Mixing Components.
a <- mc/NROW(iris)
# Update our Means
mu <- rbind(colSums(iris[, 1:4] * r[, 1]) * 1/mc[1], colSums(iris[, 1:4] *
r[, 2]) * 1/mc[2], colSums(iris[, 1:4] * r[, 3]) * 1/mc[3])
# Update Covariance matrix.
cov[[1]] <- t(r[, 1] * t(apply(iris[, 1:4], 1, function(x) x - mu[1, ]))) %*%
(r[, 1] * t(apply(iris[, 1:4], 1, function(x) x - mu[1, ]))) * 1/mc[1]
cov[[2]] <- t(r[, 2] * t(apply(iris[, 1:4], 1, function(x) x - mu[2, ]))) %*%
(r[, 2] * t(apply(iris[, 1:4], 1, function(x) x - mu[2, ]))) * 1/mc[2]
cov[[3]] <- t(r[, 3] * t(apply(iris[, 1:4], 1, function(x) x - mu[3, ]))) %*%
(r[, 3] * t(apply(iris[, 1:4], 1, function(x) x - mu[3, ]))) * 1/mc[3]
```

### Log-Likelihood Maximization

During each iteration, each mean is updated in a way that maximizes the log-likelihood. Each iteration will increase the sum of the log-likelihood. The log-likelihood function experiences diminishing returns for each iteration of the E-M algorithm. Convergence occurs as the marginal utility of the function approaches zero (i.e. the change from another itteration is small). If we calculate the log-likelihood for each iteration, we can estimate the most cost effective approach for estimating our mixtures. The log-likelihood is defined as:

\[log N(\mu, \Sigma) = log \sum_c{\pi_c*N(\mu_c,\Sigma_c)}\] In other words, we need to estimate the probability density of each gaussian, combine them into one and take the log across all the points. The sum of the logs will be our likelihood estimate. The Mclust package takes the log-likelihood estimate and calculates the Bayesian Information Criterion (BIC) as a metric for goodness of fit. The BIC equation is defined as:

\[BIC=-2ln \hat{L} + k * ln(n)\]

For our particular example using iris, we compute: \[ -2 * loglike + 14 * log(150) \] where \(k\) is the number of free parameters in our model and \(n\) is the number in our sample size. We can compute a number of different models with \(k\) number of classes and compare the BIC. The best model is one that maximizes that metric. Below is a quick review of the code used to calculate BIC.

```
# Compute the sum of the mixture densities, take the log, and add the
# column vector.
loglik <- sum(log(apply(t(t(z) * a), 1, sum)))
# BIC is calculated using the equation
bic <- -2 * loglik + 14 * log(NROW(iris))
# After every iteration we can plot.
par(mfrow = c(1, 2))
plot(bic[, c(1, 2)], type = "l", lwd = 2, col = "red", main = "BIC", xlab = "Iterations",
ylab = "Score")
plot(bic[, c(1, 3)], type = "l", lwd = 2, col = "blue", main = "Log-Likelihood",
xlab = "Iterations", ylab = "Score")
```

### Putting it all together

The best way to see the EM algorithm/gaussian mixture model in action is to visualize each iteration. During the beginning, updates make a difference and cause several of the classes to chage. As the itterations continue, the EM algorithm experiences diminishing marginal returns till the change after another itteration is negligible.

### Comparison with the Mclust package and LDA

Simply using the package Mclust provides you with a richer and more rigorous analysis but, hopefully you have a better reference for what the algorithm is doing. Training the model with Mclust is easy and only requires two parameters: the data and the number of clusters.

```
# Load the package
library(mclust)
# Select 4 continuous variables and look for three distinct groups.
mcl.model <- Mclust(iris[, 1:4], 3)
# Plot our results.
plot(mcl.model, what = "classification", main = "Mclust Classification")
```

Let’s compare the results of Mclust to our custom algorithm first and then look how well Mclust discovered the class Species.

#### Mclust v.s. Custom Gaussian Mixture Model

Class 1 | Class 2 | Class 3 |
---|---|---|

50 | 0 | 0 |

0 | 45 | 0 |

0 | 0 | 55 |

A confusion matrix shows us that the two models performed equally well with no differences in classification.

#### Modeled Class v.s. Actual Class

setosa | versicolor | virginica |
---|---|---|

50 | 0 | 0 |

0 | 45 | 0 |

0 | 5 | 50 |

When we look at how our model identifies the three species of flower, we can see that it does fairly well in estimating the parameters correctly. As the matrix shows, only five were misclassificed in our sample of 150.

### Resources

I hope that’s enough to pique your interest in these types of models. There’s a lot more that could be explained, but what has been written here is, I hope, a good overview of the whole process. If you’re interested in more of the technical details, I would highly recommend: