Suppose we are beginning a hike through a region known for its mountain lions. At the beginning of our trek, we have access to a map with our trails, as well as the coordinates of every recent mountain lion sighting. Our goal is to choose the trail where we are least likely to find a mountain lion.
How might we approach this problem? In some cases the choice may be trivial, such as when sightings are clearly localized around one path. However, this will not always be the case. For example, suppose one path is much longer than another, and the shorter path goes directly through a cluster of sightings. While it's best to avoid walking directly through a cluster of sightings, a longer path also presents more opportunities to meet these creatures. How can we decide which path to choose in these less obvious situations?
What if we could approximate the relative likelihood of seeing a mountain lion at any point on our map? With that information we could take the sum of these relative likelihoods at each point along a path. This would give us a total probability of seeing a mountain lion for a certain path. Then we could simply compare these total probabilities for each path to see which one is best!
But how is it possible to transform a set of sightings into a probability of seeing a mountain lion at any point in our map? This can be accomplished with Gaussian mixture models.
What is a Gaussian Mixture Model?
Gaussian distributions are also known as normal distributions, or bell curves. This kind of distribution has two important parameters: mean and variance. The mean describes the point around which the largest values of the function are centered, and the variance describes how spread out the values of the function are. Use the below graphing tool to experiment with the parameters.
You have likely seen this kind of distribution before; however, you may not have seen a multivariate Gaussian distribution. A multivariate Gaussian distribution has many similarities to its one dimensional counterpart. It has a mean around which the highest values of the function are centered, as well as a covariance matrix that describes how spread out the values of the function are.
Multidimensional Gaussian distributions also have covariance between dimensions. This means the spread of the function may be circular, elongated along one axis, or oriented diagonally. The flexibility of a multivariate Gaussian distribution's mean and covariance parameters makes it useful for our problem. For example, the sightings of one particular mountain lion on a map may well follow a distribution like this.
More formally, a multivariate Gaussian distribution in d dimensions has the following probability density function at any point x. [1]
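Written out in the notation of [1], for a point x in d dimensions with mean μ and covariance matrix Σ, this density is:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$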
If the sightings of one mountain lion follow a multivariate Gaussian distribution, perhaps a mixture of these distributions can describe the sighting activity of several. A mixture of Gaussians is a weighted sum of two or more individual Gaussian distributions. This is a Gaussian mixture model. Weighting each individual distribution is necessary because the total probability of a distribution must still add up to 100%, which requires the weights to sum to one. Using different weights for individual distributions increases the flexibility of our mixture. Observe a Gaussian mixture model in the one dimensional case:
As well as a multivariate Gaussian mixture model:
A multivariate Gaussian mixture model has the probability density function below. [1]
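Written out, for a mixture of K individual distributions with weights π_k, means μ_k, and covariance matrices Σ_k (in the notation of [1]), this is:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1$$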
All Gaussian mixture models are density estimators, meaning they describe how relatively likely it is to observe a data point at any given location. We will use a multivariate Gaussian mixture model to build a good density estimator of the likelihood of seeing a mountain lion at any point on our map.
Evaluating the Fit of a Gaussian Mixture Model
The fit of a Gaussian mixture model can be measured in many ways, but a popular choice is the log likelihood. The log likelihood is simply the log of the product of the probabilities of each data point in a dataset, given our distribution parameters. For Gaussian mixture models the log likelihood expression can be written as below. [1]
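For a data set X = {x_1, ..., x_N}, this takes the form:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln\!\left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$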
Holding the number of individual distributions constant, a mixture with a higher log likelihood than another is usually a better representation of the data. However, there are important exceptions, such as when one individual distribution of a Gaussian mixture model centers around an outlier and has a vanishingly small covariance matrix.
Increasing the number of individual normal distributions within a mixture generally increases the log likelihood until the number of distributions reaches the number of data points. This is because each distribution becomes responsible for less and less of the data, until it represents only a single data point. So if we simply optimize with respect to the number of individual distributions, we learn nothing about the distribution underlying the data. Instead, there is typically a point where adding individual distributions yields diminishing returns in the log likelihood. Choosing the value where the diminishing returns begin typically gives a reasonable number of individual distributions to represent the data. This shows that a Gaussian mixture model with more individual distributions is not necessarily a better representation of a data set.
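As a rough sketch of this model selection process (assuming scikit-learn is available; the synthetic array X below is only a stand-in for real sighting coordinates), one could fit mixtures with an increasing number of components and look for where the log likelihood levels off:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the real sightings: a synthetic (n, 2) array of map coordinates.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], size=(80, 2)),
               rng.normal(loc=[5, 5], size=(80, 2))])

for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    # score() is the average log likelihood per sample; multiply by len(X) for the total.
    total_log_likelihood = gmm.score(X) * len(X)
    print(k, round(total_log_likelihood, 1))
# Pick the k where the total log likelihood stops improving much -- the "elbow".
```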
Check your Understanding
The following questions contain visualizations of Gaussian mixture model distributions and data in a plane. Given what you've just learned about Gaussian mixture models, answer the following questions:
Each point on the grid below represents one data point in a data set. Which density estimator fits this data set better?
Left
Right
Each point on the grid below represents one data point in a data set. Which density estimator fits this data set better?
Left
Right
Each point on the grid below represents one data point in a data set. Which density estimator fits this data set better?
Left
Right
Finding the Probability of a Sighting Along a Path: Line Integrals Over a Gaussian Mixture Model
Now that we understand the kind of distribution our mountain lion sightings will follow, we can find the probability of seeing a mountain lion on a given path. We deduced earlier that if we sum all of the probabilities along one path, we will know the total probability of seeing a mountain lion along that path. This can be accomplished with what is called a line integral, which sums the values of a multidimensional function along a curve. Our multidimensional function is our multivariate Gaussian mixture model. Calculating an exact line integral over this kind of distribution is difficult. However, we can construct an approximation by splitting our line and the area underneath it into trapezoids. This is more generally known as a Riemann sum approximation. Below is a visualization of an approximate line integral over a Gaussian mixture, where the area of the orange cross section represents our desired probability:
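A minimal sketch of this approximation, assuming SciPy is available and that the mixture's weights, means, and covariance matrices have already been fitted (the helper names here are hypothetical, not part of any library):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(points, weights, means, covs):
    """Evaluate the Gaussian mixture density at each row of `points`."""
    return sum(w * multivariate_normal.pdf(points, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

def path_likelihood(waypoints, weights, means, covs, samples_per_segment=50):
    """Trapezoidal approximation of the line integral of the mixture density
    along a path given as an (n, 2) array of waypoints."""
    # Sample points densely along each straight segment of the path.
    pts = np.concatenate([np.linspace(a, b, samples_per_segment)
                          for a, b in zip(waypoints[:-1], waypoints[1:])])
    density = mixture_density(pts, weights, means, covs)
    # Cumulative arc length along the sampled points.
    arc = np.concatenate([[0.0],
                          np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))])
    # Trapezoid rule: average adjacent densities times each segment length.
    return np.sum((density[1:] + density[:-1]) / 2 * np.diff(arc))
```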
Now that we know what quantity we need given our Gaussian mixture model, all that remains is to find the Gaussian mixture model that best fits our data set.
Why Gaussian Mixture Models Cannot Be Learned Directly
As we are only given the locations of mountain lion sightings, we must find a Gaussian mixture model that best represents these sightings in order to build our density estimator. A typical method of learning the parameters of a distribution is maximum likelihood estimation (MLE). Many machine learning problems can be solved by directly maximizing the log likelihood expression. The log likelihood is useful because the logarithm is a monotonic function, so it preserves the locations of the likelihood's local and global maxima and minima. Additionally, the log of a product reduces to a sum of logs, which leads to a drastically simpler expression when taking the derivative with respect to one of a model's parameters. In many cases a closed form solution is possible. Answer the following question to observe why this is not the case for Gaussian mixture models.
In order to directly solve for a Gaussian mixture model in a one dimensional case with two clusters, below is just one of the derivatives whose roots we must find. Which terms will appear in the resulting expression after the derivative operator is applied?
The weights of both clusters
The means and variances of each cluster
The means, variances, and weights of each cluster
In the case of directly optimizing Gaussian mixture models, the log likelihood expression contains a summation over every parameter inside a logarithm. The optimization therefore leads not only to a system of nonlinear equations, but to one with no closed form solution. [1] This means we cannot build the best density estimator as some direct operation on our data set, nor can we know the best Gaussian mixture model that predicts the probability of seeing a mountain lion over our whole map. This motivates the use of an algorithm that can optimize our distribution parameters instead of a single mathematical expression.
The Expectation Maximization Algorithm
We have shown why it is impossible to directly optimize Gaussian mixture models with MLE; however, we can use the Expectation Maximization (EM) algorithm to solve this problem. The EM algorithm is a form of coordinate descent, meaning that we update some of a problem's parameters while holding the others constant. In general, the EM algorithm holds the parameters of a target distribution constant while updating a latent variable, then vice versa. The EM algorithm is guaranteed to reach a local solution after repeatedly taking turns optimizing the latent variables and the other parameters.
In order to take advantage of the EM algorithm, we must introduce our latent variable. Suppose we have a Gaussian mixture model with three individual Gaussian distributions as the density estimator of the likelihood of seeing a mountain lion at each point on the map. Our latent variable will be the likelihood that each individual sighting was generated by each individual distribution. For example, one particular sighting may have a 20% likelihood of coming from the first individual normal distribution, 30% from the second, and 50% from the third. These probabilities will always add up to 100%.
More formally, this latent variable is called the soft assignment. The expression below represents the soft assignment associated with each data point, x.
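In the notation of [1], the soft assignment (or responsibility) of the k-th individual distribution for data point x_n is:

$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$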
To fit the EM algorithm to the problem of optimizing Gaussian mixture models, it is important to realize that given the current distribution parameters of our Gaussian mixture model (the means, covariance matrices, and weights associated with each distribution), it is trivial to update the soft assignment for each data point. This is known as the expectation step.
Of all our distribution parameters and latent variables (the means, covariance matrices, weights, and soft assignments), what is updated during the expectation step?
The means, covariance matrices, and soft assignments of each point
The weights and the soft assignments of each point
The soft assignments of each point
We must also realize that it is trivial to update our Gaussian mixture model's distribution parameters given our soft assignments. This is called the maximization step. Our distribution parameters are updated according to the formulas below:
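With $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$ denoting the effective number of points assigned to the k-th distribution, the updates in [1] are:

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\top}$$

$$\pi_k^{\text{new}} = \frac{N_k}{N}$$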
Of all our distribution parameters and latent variables (the means, covariance matrices, weights, and soft assignments), what is updated during the maximization step?
The means, covariance matrices, and weights
The means, covariance matrices, and the soft assignments of each point
The soft assignments of each point
As mentioned above, the process of alternating between the expectation and maximization steps continues until a local solution is reached. A local solution is not guaranteed to be a good solution, and it is therefore important to run the EM algorithm with many different initial distribution parameters. Different initialization strategies for Gaussian mixture models will be discussed later in this article. [1]
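To make the loop concrete, here is a minimal from-scratch sketch of one EM run in Python (assuming NumPy and SciPy; the initialization here is a simple random choice of data points as means, not one of the more careful strategies discussed later):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, seed=0):
    """Fit a k-component Gaussian mixture to X (an (n, d) array) with EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]   # random data points as means
    covs = np.array([np.eye(d)] * k)                  # identity covariance matrices
    weights = np.full(k, 1.0 / k)                     # equal weights

    for _ in range(n_iters):
        # Expectation step: soft assignment of every point to every distribution.
        resp = np.column_stack([
            w * multivariate_normal.pdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)
        ])
        resp /= resp.sum(axis=1, keepdims=True)

        # Maximization step: re-estimate weights, means, and covariances.
        Nk = resp.sum(axis=0)
        weights = Nk / n
        means = (resp.T @ X) / Nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j]

    return weights, means, covs
```

In practice one would run this several times from different random initializations and keep the run with the highest log likelihood, since each run only reaches a local solution.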
Experiment with the below visualization of the EM algorithm to ensure you understand. Be sure to step forward, backward, and reinitialize the visualization to see the effect of different initializations and data sets.
Solving for the Best Path
We now understand everything required to find the path where we are least likely to find mountain lions. To do so we can use the following process.
Use the EM algorithm to learn a Gaussian mixture model, which acts as a density estimator of our data set (positions of mountain lion sightings).
Run the EM algorithm with multiple initializations to ensure we have a density estimator that is representative of our dataset.
Approximate the line integral along each path using a Riemann sum approximation. This is our likelihood of seeing a mountain lion on this path.
Compare the likelihood of seeing a mountain lion on each path (a short sketch tying these steps together follows this list).
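The sketch below ties the steps together, reusing the hypothetical path_likelihood helper from earlier and scikit-learn's GaussianMixture with several random restarts; the sighting coordinates, path waypoints, and component count are placeholder assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical inputs: sighting coordinates and candidate paths as waypoint arrays.
sightings = np.random.default_rng(0).normal(loc=[2, 2], size=(150, 2))
paths = {
    "blue":   np.array([[0.0, 0.0], [1.0, 3.0], [4.0, 4.0]]),
    "orange": np.array([[0.0, 0.0], [3.0, 0.5], [4.0, 4.0]]),
}

# Steps 1 and 2: fit the density estimator with several random restarts.
gmm = GaussianMixture(n_components=3, n_init=10, init_params="random",
                      random_state=0).fit(sightings)

# Steps 3 and 4: integrate the density along each path (path_likelihood is the
# hypothetical helper sketched earlier) and pick the path with the lowest total.
scores = {name: path_likelihood(w, gmm.weights_, gmm.means_, gmm.covariances_)
          for name, w in paths.items()}
best_path = min(scores, key=scores.get)
print(scores, best_path)
```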
The following is a visualization of the line integrals over a probability density estimator of mountain lion sightings. Which path minimizes the probability of seeing a mountain lion?
The blue path
The orange path
The pink path
Further Study
Gaussian Mixture Models as a Clustering Algorithm
In our problem, we used Gaussian mixture models as a density estimator. However, our soft assignment latent variable in the EM algorithm also reflects the partial assignment of each data point to each individual distribution; that is, for each point, the probability that it came from a particular individual distribution. Soft assignments can be turned into hard assignments by choosing, for each data point, the index with the highest probability. This can be useful because not all clustering algorithms take covariance into account when forming clusters. K-means, for example, does not take covariance into account when clustering data.
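As a small illustration (continuing the hypothetical scikit-learn model from the previous sketch), turning soft assignments into hard cluster labels is a one-line argmax:

```python
soft = gmm.predict_proba(sightings)   # (n, k) matrix of soft assignments; each row sums to 1
labels = soft.argmax(axis=1)          # hard assignment: index of the most likely distribution
# Equivalent shortcut: labels = gmm.predict(sightings)
```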
Which clustering was likely generated by K-means?
Left
Right
Singularities
Without safeguarding, it is possible for a Gaussian mixture model to contain a singularity. This happens when one of the normal distributions centers on a single data point and the entries of its covariance matrix shrink toward zero. Even though this can increase the likelihood of the data set, choosing one distribution to center on one outlier is likely not representative of the data. To mitigate this, you can ensure the diagonal entries of each covariance matrix never decrease below some minimum value during the maximization step.
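In the from-scratch sketch above, one simple hedge (the floor value 1e-6 is just an assumption, not a universal rule) is to add a small constant to the diagonal of each covariance update; scikit-learn's GaussianMixture does something similar through its reg_covar parameter:

```python
# Inside the maximization step of the earlier sketch: add a small floor to the diagonal.
covs[j] = (resp[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(d)
```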
Initialization Strategies
Different initialization choices can have a great effect on the resulting Gaussian mixture, since the EM algorithm is only guaranteed to find a local solution. The covariance matrix and weight of each normal distribution can be initialized to the identity matrix and an equal weighting, or to pseudorandom values. For the means, it is possible to choose random points within the range of the data set or random points from the data set itself. It is also possible to choose one existing point, then weight the likelihood of choosing each successive point by its distance to the points already chosen (farther points being more likely), and continue this cycle until all means are initialized. This method of generating means will likely give better results than the others, as the means are unlikely to be initialized near each other.
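A minimal sketch of that last strategy (essentially the k-means++ seeding idea, assuming X is a NumPy array of data points):

```python
import numpy as np

def init_means(X, k, seed=0):
    """Pick k initial means from X, favoring points far from those already chosen."""
    rng = np.random.default_rng(seed)
    means = [X[rng.integers(len(X))]]          # first mean: a uniformly random data point
    for _ in range(k - 1):
        # Squared distance from each point to its nearest already-chosen mean.
        d2 = np.min([np.sum((X - m) ** 2, axis=1) for m in means], axis=0)
        probs = d2 / d2.sum()                   # farther points are more likely to be picked
        means.append(X[rng.choice(len(X), p=probs)])
    return np.array(means)
```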
Acknowledgements
Works Cited
[1] C. M. Bishop, Pattern recognition and machine learning. Springer, 2016.
Authorship
A site by Stuart Allen, in partial fulfillment of the requirements for the degree of Honors Baccalaureate of Science in Computer Science (Honors Scholar).