# 17.7. Expectation Maximization

The Expectation-Maximization (EM) method [26] is a maximum-likelihood-based estimation paradigm. It requires an explicit probabilistic model of the mixed dataset. The algorithm estimates the model parameters and the segmentation of the data in the Maximum-Likelihood (ML) sense.

We assume that the samples \(\by_s\) are drawn from multiple “component” distributions, each centered around its own mean. Let there be \(K\) such component distributions. We introduce a latent (hidden) discrete random variable \(z \in \{1, \dots, K\}\) associated with the random variable \(\by\) such that \(z_s = k\) if \(\by_s\) is drawn from the \(k\)-th component distribution. The random vector \((\by, z) \in \RR^M \times \{1, \dots, K\}\) completely describes the event that a point \(\by\) is drawn from the component indexed by the value of \(z\).

We assume that \(z\) follows a multinomial (marginal) distribution, i.e.,

\[
p(z = k) = \pi_k, \quad \pi_k \geq 0, \quad \sum_{k=1}^K \pi_k = 1.
\]

Each component distribution can then be modeled as a conditional (continuous) distribution \(f(\by | z)\). If each of the components is a multivariate normal distribution, then we have \(f(\by | z = k) \sim \NNN(\mu_k, \Sigma_k)\) where \(\mu_k\) is the mean and \(\Sigma_k\) is the covariance matrix of the \(k\)-th component distribution. The parameter set for this model is then \(\theta = \{\pi_k, \mu_k, \Sigma_k \}_{k=1}^K\), which is unknown in general and needs to be estimated from the dataset \(\bY\).

With \((\by, z)\) being the complete random vector, the marginal PDF of \(\by\) given \(\theta\) is given by

\[
f(\by ; \theta) = \sum_{k=1}^K f(\by | z = k) p(z = k) = \sum_{k=1}^K \pi_k f(\by | z = k).
\]

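Numerically, this marginal PDF is a weighted sum of component densities. The sketch below (our own NumPy helpers, not from the text; it assumes isotropic Gaussian components \(\NNN(\mu_k, \sigma_k^2 I)\)) makes the computation concrete:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Density of the isotropic normal N(mu, sigma2 * I) at point y."""
    M = y.shape[0]
    norm = (2.0 * np.pi * sigma2) ** (-M / 2.0)
    return norm * np.exp(-np.sum((y - mu) ** 2) / (2.0 * sigma2))

def mixture_pdf(y, pis, mus, sigma2s):
    """Marginal density f(y; theta) = sum_k pi_k * f(y | z = k)."""
    return sum(pi * gaussian_pdf(y, mu, s2)
               for pi, mu, s2 in zip(pis, mus, sigma2s))
```

As a sanity check, mixing two identical components with weights summing to one reproduces the single component density.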
The log-likelihood function for the dataset

\[
\bY = \{ \by_1, \by_2, \dots, \by_S \}
\]

is given by

\[
l (\bY; \theta) = \sum_{s=1}^S \ln f(\by_s ; \theta)
= \sum_{s=1}^S \ln \left ( \sum_{k=1}^K \pi_k f(\by_s | z_s = k) \right ).
\]

An ML estimate of the parameters, namely \(\hat{\theta}_{\ML}\)
is obtained by maximizing \(l (\bY; \theta)\) over the
parameter space.
The statistic \(l (\bY; \theta)\) is called the
*incomplete log-likelihood function*
since it is marginalized over \(z\).
It is very difficult to compute and maximize directly.
The EM method provides an alternate
means of maximizing \(l (\bY; \theta)\) by
utilizing the latent r.v. \(z\).

We start by noting that, for any value of \(z_s\),

\[
\ln f(\by_s; \theta) = \ln f(\by_s, z_s; \theta) - \ln p(z_s | \by_s, \theta).
\]

Taking the expectation of both sides w.r.t. the posterior \(p(z_s | \by_s, \theta)\) leaves the left-hand side unchanged.

Thus, \(l (\bY; \theta)\) can be rewritten as

\[
l (\bY; \theta) = \sum_{s=1}^S \sum_{k=1}^K p(z_s = k | \by_s, \theta) \ln f(\by_s, z_s = k; \theta)
- \sum_{s=1}^S \sum_{k=1}^K p(z_s = k | \by_s, \theta) \ln p(z_s = k | \by_s, \theta).
\]

The first term is the
*expected complete log-likelihood function*
and the second term is the
*conditional entropy* of \(z_s\) given \(\by_s\)
and \(\theta\).

Let us introduce the auxiliary variables \(w_{s k} (\theta) = p(z_s = k | \by_s , \theta)\); \(w_{s k}\) represents the expected membership of \(\by_s\) in the \(k\)-th cluster. We collect the entries \(w_{sk}\) in a matrix

\[
\bW (\theta) = [w_{sk}] \in \RR^{S \times K}.
\]

Then, we have

\[
l (\bY; \theta, \bW) = \sum_{s=1}^S \sum_{k=1}^K w_{sk} \ln f(\by_s, z_s = k; \theta)
- \sum_{s=1}^S \sum_{k=1}^K w_{sk} \ln w_{sk},
\]

where we have written \(l\) as a function of both \(\theta\) and \(\bW\).

An iterative maximization approach can be introduced as follows:

1. Maximize \(l(\bY; \theta, \bW)\) w.r.t. \(\bW\) keeping \(\theta\) constant.
2. Maximize \(l(\bY; \theta, \bW)\) w.r.t. \(\theta\) keeping \(\bW\) constant.
3. Repeat the previous two steps till convergence.

This is essentially the EM algorithm. Step 1 is known
as the *E-step* and step 2 is known as the *M-step*.
In the E-step, we estimate the expected membership
of each sample in each component distribution.
In the M-step, we maximize the
expected complete log-likelihood
function, as the conditional entropy term
doesn’t depend on \(\theta\).

Using a Lagrange multiplier (to enforce \(\sum_{k=1}^K w_{sk} = 1\) for each \(s\)), we can show that the optimal \(\hat{w}_{s k}\) in the E-step is given by

\[
\hat{w}_{s k} = \frac{\pi_k f(\by_s | z_s = k)}{\sum_{l=1}^K \pi_l f(\by_s | z_s = l)}.
\]

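A vectorized sketch of this E-step for isotropic Gaussian components (`e_step` and its argument names are our own illustrative choices, not from the text):

```python
import numpy as np

def e_step(Y, pis, mus, sigma2s):
    """Responsibilities w[s, k] proportional to pi_k * f(y_s | z_s = k).

    Y: (S, M) data matrix; pis: (K,) mixing weights;
    mus: (K, M) component means; sigma2s: (K,) isotropic variances.
    """
    S, M = Y.shape
    K = len(pis)
    W = np.empty((S, K))
    for k in range(K):
        diff = Y - mus[k]                      # (S, M) deviations from mean k
        W[:, k] = pis[k] * (2.0 * np.pi * sigma2s[k]) ** (-M / 2.0) \
            * np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma2s[k]))
    W /= W.sum(axis=1, keepdims=True)          # normalize each row to sum to 1
    return W
```

The row normalization implements the denominator \(\sum_l \pi_l f(\by_s | z_s = l)\), so each row of `W` is a valid probability vector.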
A closed form solution for the M-step depends on the particular choice of the component distributions. For the special case when each of the components is an isotropic normal distribution (\(\NNN(\mu_k, \sigma_k^2 I)\)), keeping \(\bW\) fixed and maximizing over \(\theta\) gives

\[
\pi_k = \frac{1}{S} \sum_{s=1}^S \hat{w}_{sk}, \quad
\mu_k = \frac{\sum_{s=1}^S \hat{w}_{sk} \by_s}{\sum_{s=1}^S \hat{w}_{sk}}, \quad
\sigma_k^2 = \frac{\sum_{s=1}^S \hat{w}_{sk} \| \by_s - \mu_k \|_2^2}{M \sum_{s=1}^S \hat{w}_{sk}}.
\]
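A sketch of the closed-form M-step for the isotropic case (again our own NumPy helper; `m_step` consumes a responsibility matrix `W` produced by an E-step):

```python
import numpy as np

def m_step(Y, W):
    """Closed-form M-step for an isotropic Gaussian mixture.

    Y: (S, M) data matrix; W: (S, K) responsibility matrix.
    Returns updated (pis, mus, sigma2s).
    """
    S, M = Y.shape
    Nk = W.sum(axis=0)                          # effective size of each cluster
    pis = Nk / S                                # mixing weights
    mus = (W.T @ Y) / Nk[:, None]               # responsibility-weighted means
    K = len(Nk)
    sigma2s = np.array([
        (W[:, k] * np.sum((Y - mus[k]) ** 2, axis=1)).sum() / (M * Nk[k])
        for k in range(K)
    ])                                          # weighted isotropic variances
    return pis, mus, sigma2s
```

With a hard (one-hot) responsibility matrix, these updates reduce to the familiar per-cluster sample proportions, means, and variances.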

In \(K\)-means, each \(\by_s\) gets hard-assigned to a specific cluster. In EM, we have a soft assignment given by \(w_{s k}\).
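To see the difference, consider a point exactly midway between two equally weighted, equal-variance components: \(K\)-means must break the tie with a hard choice, while the EM responsibilities split evenly. A small self-contained sketch (illustrative names, not from the text):

```python
import numpy as np

def responsibilities(y, pis, mus, sigma2):
    """Soft memberships w_k for one point y under isotropic components."""
    M = y.shape[0]
    dens = np.array([
        pi * (2.0 * np.pi * sigma2) ** (-M / 2.0)
        * np.exp(-np.sum((y - mu) ** 2) / (2.0 * sigma2))
        for pi, mu in zip(pis, mus)
    ])
    return dens / dens.sum()

# A point equidistant from both means gets a 50/50 soft assignment.
mus = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
w = responsibilities(np.array([0.0, 0.0]), [0.5, 0.5], mus, 1.0)
```

A point sitting on one of the means, by contrast, receives most of its membership from that component.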

The EM method works well for a hybrid dataset consisting of a mixture of component distributions. Yet, its applicability is limited. We need a good idea of the number of components beforehand. Further, for a Gaussian Mixture Model (GMM), it fails to work if the variance in some of the directions is arbitrarily small [82]. For example, a subspace-like distribution is one where the data has large variance within a subspace but almost zero variance orthogonal to the subspace. The EM method tends to fail with subspace-like distributions.