17.7. Expectation Maximization

The Expectation-Maximization (EM) method [26] is a maximum-likelihood-based estimation paradigm. It requires an explicit probabilistic model of the mixed dataset. The algorithm estimates the model parameters and the segmentation of the data in the Maximum-Likelihood (ML) sense.

We assume that the samples $y_s$ are drawn from multiple “component” distributions and each component distribution is centered around a mean. Let there be $K$ such component distributions. We introduce a latent (hidden) discrete random variable $z \in \{1, \dots, K\}$ associated with the random variable $y$ such that $z_s = k$ if $y_s$ is drawn from the $k$-th component distribution. The random vector $(y, z) \in \mathbb{R}^M \times \{1, \dots, K\}$ completely describes the event that a point $y$ is drawn from the component indexed by the value of $z$.

We assume that $z$ is subject to a multinomial (marginal) distribution, i.e.,

$$p(z = k) = \pi_k \geq 0, \qquad \pi_1 + \dots + \pi_K = 1.$$

Each component distribution can then be modeled as a conditional (continuous) distribution $f(y \mid z)$. If each of the components is a multivariate normal distribution, then we have $f(y \mid z = k) \sim \mathcal{N}(\mu_k, \Sigma_k)$, where $\mu_k$ is the mean and $\Sigma_k$ is the covariance matrix of the $k$-th component distribution. The parameter set for this model is then $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, which is unknown in general and needs to be estimated from the dataset $Y$.
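
As a concrete (purely illustrative) instance of this generative model, the following Python sketch draws samples from a two-component Gaussian mixture: a latent label $z_s$ is drawn according to the weights $\pi_k$, and then $y_s$ is drawn from the corresponding component. All parameter values and variable names here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters theta = {pi_k, mu_k, Sigma_k} for K = 2 components in R^2.
pis = np.array([0.4, 0.6])                       # mixing weights, sum to 1
mus = np.array([[0.0, 0.0], [4.0, 4.0]])         # component means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2)])  # component covariances

S = 500                                          # number of samples
# Draw the latent labels z_s from the multinomial distribution p(z = k) = pi_k.
z = rng.choice(len(pis), size=S, p=pis)
# Draw each y_s from the component selected by z_s.
Y = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
```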

With $(y, z)$ being the complete random vector, the marginal PDF of $y$ given $\theta$ is given by

$$f(y \mid \theta) = \sum_{k=1}^{K} f(y \mid z = k, \theta)\, p(z = k \mid \theta) = \sum_{k=1}^{K} \pi_k\, f(y \mid z = k, \theta).$$

The log-likelihood function for the dataset

$$Y = \{ y_s \}_{s=1}^{S}$$

is given by

$$l(Y; \theta) = \sum_{s=1}^{S} \ln f(y_s \mid \theta).$$

An ML estimate of the parameters, namely $\hat{\theta}_{\mathrm{ML}}$, is obtained by maximizing $l(Y; \theta)$ over the parameter space. The statistic $l(Y; \theta)$ is called the incomplete log-likelihood function since it is marginalized over $z$. It is very difficult to maximize directly. The EM method provides an alternative means of maximizing $l(Y; \theta)$ by utilizing the latent r.v. $z$.
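
For the Gaussian case, the incomplete log-likelihood can at least be evaluated numerically by marginalizing over $z$ as above. The helper below is a small sketch that assumes the illustrative variables (`Y`, `pis`, `mus`, `Sigmas`) from the previous snippet and uses SciPy's multivariate normal density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def incomplete_log_likelihood(Y, pis, mus, Sigmas):
    """l(Y; theta) = sum_s ln sum_k pi_k f(y_s | z_s = k, theta)."""
    # Densities f(y_s | z_s = k, theta) for every sample and component: shape (S, K).
    comp = np.column_stack([
        multivariate_normal.pdf(Y, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pis))
    ])
    # Marginalize over the latent variable, then sum the log-densities over the samples.
    return np.sum(np.log(comp @ pis))
```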

We start by noting that

$$f(y \mid \theta)\, p(z \mid y, \theta) = f(y, z \mid \theta), \qquad \sum_{k=1}^{K} p(z = k \mid y, \theta) = 1.$$

Thus, $l(Y; \theta)$ can be rewritten as

$$\begin{aligned}
l(Y; \theta) &= \sum_{s=1}^{S} \sum_{k=1}^{K} p(z_s = k \mid y_s, \theta) \ln \frac{f(y_s, z_s = k \mid \theta)}{p(z_s = k \mid y_s, \theta)} \\
&= \sum_{s, k} p(z_s = k \mid y_s, \theta) \ln f(y_s, z_s = k \mid \theta) - \sum_{s, k} p(z_s = k \mid y_s, \theta) \ln p(z_s = k \mid y_s, \theta).
\end{aligned}$$

Here we used the two identities above inside the logarithm. The first term is the expected complete log-likelihood function and the second term is the conditional entropy of $z_s$ given $y_s$ and $\theta$.

Let us introduce the auxiliary variables $w_{sk}(\theta) = p(z_s = k \mid y_s, \theta)$. Each $w_{sk}$ represents the expected membership of $y_s$ in the $k$-th cluster. Collecting the $w_{sk}$ in a matrix $W(\theta)$, we can write the two terms above as

$$\tilde{l}(Y; \theta, W) = \sum_{s=1}^{S} \sum_{k=1}^{K} w_{sk} \ln f(y_s, z_s = k \mid \theta),$$

$$h(z \mid y; W) = -\sum_{s=1}^{S} \sum_{k=1}^{K} w_{sk} \ln w_{sk}.$$

Then, we have

$$l(Y; \theta, W) = \tilde{l}(Y; \theta, W) + h(z \mid y; W),$$

where we have written $l$ as a function of both $\theta$ and $W$.
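
This decomposition can be verified numerically. The sketch below, continuing with the illustrative mixture variables from the earlier snippets, computes $w_{sk}$ from its definition, evaluates $\tilde{l}$ and $h$, and checks that their sum equals the directly computed incomplete log-likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Joint densities f(y_s, z_s = k | theta) = pi_k f(y_s | z_s = k, theta), shape (S, K).
joint = np.column_stack([
    pis[k] * multivariate_normal.pdf(Y, mean=mus[k], cov=Sigmas[k])
    for k in range(len(pis))
])
W = joint / joint.sum(axis=1, keepdims=True)   # w_sk = p(z_s = k | y_s, theta)

l_tilde = np.sum(W * np.log(joint))            # expected complete log-likelihood
entropy = -np.sum(W * np.log(W))               # conditional entropy h(z | y; W)
l_direct = np.sum(np.log(joint.sum(axis=1)))   # incomplete log-likelihood l(Y; theta)
assert np.isclose(l_tilde + entropy, l_direct)
```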

An iterative maximization approach can be introduced as follows:

  1. Maximize $l(Y; \theta, W)$ w.r.t. $W$, keeping $\theta$ constant.

  2. Maximize $l(Y; \theta, W)$ w.r.t. $\theta$, keeping $W$ constant.

  3. Repeat the previous two steps until convergence.

This is essentially the EM algorithm. Step 1 is known as the E-step and step 2 is known as the M-step. In the E-step, we estimate the expected membership of each sample in each component distribution. In the M-step, we maximize the expected complete log-likelihood function, since the conditional entropy term does not depend on $\theta$.

Using a Lagrange multiplier, we can show that the optimal $\hat{w}_{sk}$ in the E-step is given by

$$\hat{w}_{sk} = \frac{\pi_k\, f(y_s \mid z_s = k, \theta)}{\sum_{l=1}^{K} \pi_l\, f(y_s \mid z_s = l, \theta)}.$$
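
In code, the E-step is a direct transcription of this formula. The sketch below assumes Gaussian components and reuses the illustrative variable names from the earlier snippets.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(Y, pis, mus, Sigmas):
    """Return the matrix W with entries w_sk = p(z_s = k | y_s, theta)."""
    # Numerators pi_k f(y_s | z_s = k, theta), one column per component.
    num = np.column_stack([
        pis[k] * multivariate_normal.pdf(Y, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pis))
    ])
    # Normalize each row by the denominator sum_l pi_l f(y_s | z_s = l, theta).
    return num / num.sum(axis=1, keepdims=True)
```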

A closed-form solution for the M-step depends on the particular choice of the component distributions. We provide a closed-form solution for the special case where each of the components is an isotropic normal distribution $\mathcal{N}(\mu_k, \sigma_k^2 I)$:

$$\hat{\mu}_k = \frac{\sum_{s=1}^{S} w_{sk}\, y_s}{\sum_{s=1}^{S} w_{sk}}, \qquad \hat{\sigma}_k^2 = \frac{\sum_{s=1}^{S} w_{sk}\, \| y_s - \hat{\mu}_k \|_2^2}{M \sum_{s=1}^{S} w_{sk}}, \qquad \hat{\pi}_k = \frac{\sum_{s=1}^{S} w_{sk}}{S}.$$
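
A minimal sketch of the corresponding M-step for the isotropic case, together with the overall EM loop, is given below. It reuses the `e_step` helper from the previous snippet; the random initialization of $W$ and the fixed number of iterations are illustrative choices rather than part of the derivation.

```python
import numpy as np

def m_step(Y, W):
    """Update (pis, mus, sigma2s) for isotropic Gaussian components N(mu_k, sigma_k^2 I)."""
    S, M = Y.shape
    Nk = W.sum(axis=0)                      # effective number of points per component
    mus = (W.T @ Y) / Nk[:, None]           # mu_k = sum_s w_sk y_s / sum_s w_sk
    # sigma_k^2 = sum_s w_sk ||y_s - mu_k||^2 / (M sum_s w_sk)
    sq = np.array([np.sum(W[:, k] * np.sum((Y - mus[k]) ** 2, axis=1))
                   for k in range(len(Nk))])
    sigma2s = sq / (M * Nk)
    pis = Nk / S                            # pi_k = sum_s w_sk / S
    return pis, mus, sigma2s

def em(Y, K, n_iter=100, seed=0):
    """Alternate E and M steps, starting from random soft assignments."""
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(K), size=Y.shape[0])   # random responsibilities to start
    for _ in range(n_iter):
        pis, mus, sigma2s = m_step(Y, W)
        Sigmas = np.array([s2 * np.eye(Y.shape[1]) for s2 in sigma2s])
        W = e_step(Y, pis, mus, Sigmas)
    return pis, mus, sigma2s, W
```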

In K-means, each $y_s$ is hard-assigned to a specific cluster. In EM, we have a soft assignment given by $w_{sk}$.
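
If a hard, K-means-style labeling is needed, it can be recovered from the soft assignments by picking, for each sample, the component with the largest responsibility:

```python
labels = np.argmax(W, axis=1)   # hard assignment: the most likely component for each y_s
```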

The EM method works well for a hybrid dataset consisting of a mixture of component distributions. Yet its applicability is limited. We need a good idea of the number of components beforehand. Further, for a Gaussian Mixture Model (GMM), it fails to work if the variance in some of the directions is arbitrarily small [82]. For example, a subspace-like distribution is one where the data has large variance within a subspace but almost zero variance orthogonal to it. The EM method tends to fail with subspace-like distributions.