23. Subspace Clustering#
High dimensional data sets are now pervasive in various signal processing applications. For example, high resolution surveillance cameras are now commonplace and generate millions of images continually. A major factor in the success of current generation signal processing algorithms is the fact that, even though these data sets are high dimensional, their intrinsic dimension is often much smaller than the dimension of the ambient space.

One resorts to inferring (or learning) a quantitative model of the data from the data set itself.
In the absence of training data, the problem of modeling falls into the category of unsupervised learning. There are two common viewpoints of data modeling. A statistical viewpoint assumes that the data points are random samples from a probability distribution; statistical models try to learn this distribution from the data set. In contrast, a geometrical viewpoint assumes that the data points belong to a geometrical object (a smooth manifold or a topological space); a geometrical model attempts to learn the shape of the object to which the data points belong. Examples of statistical modeling include maximum likelihood, maximum a posteriori estimation, Bayesian models, etc. An example of a geometrical model is Principal Component Analysis (PCA), which assumes that the data are drawn from a low dimensional subspace of the high dimensional ambient space. PCA is simple to implement and has found tremendous success in different fields, e.g., pattern recognition, data compression, image processing, computer vision, etc. 1.
The assumption that all the data points in a data set are drawn from a single model is, however, often too restrictive. In practice, it frequently turns out that if we group or segment the data set into multiple disjoint subsets, then each subset can be modeled well by a simpler primitive model.
Consider the problem of vanishing point detection in computer vision. Under perspective projection, a group of parallel lines pass through a common point in the image plane which is known as the vanishing point for the group. For a typical scene consisting of multiple sets of parallel lines, the problem of detecting all vanishing points in the image plane from the set of edge segments (identified in the image) can be transformed into one of clustering points (obtained from the edge segments) into multiple 2D subspaces of $\mathbb{R}^3$.
In the motion segmentation problem, an image sequence consisting of multiple moving objects is segmented so that each segment consists of the motion of only one object. This is a fundamental problem in applications such as motion capture, vision based navigation, target tracking and surveillance. We first track the trajectories of feature points (from all objects) over the image sequence. It has been shown (see the section on motion segmentation) that the trajectories of the feature points of a single rigidly moving object form a low dimensional subspace. Thus, the motion segmentation problem can be solved by segmenting the feature point trajectories belonging to different objects and then estimating the motion of each object from its corresponding trajectories.
In a face clustering problem, we have a collection of unlabeled images of different faces taken under varying illumination conditions. Our goal is to cluster the images so that the images of the same face fall into one group. For a Lambertian object, it has been shown that the set of images taken under different lighting conditions forms a cone in the image space. This cone can be well approximated by a low-dimensional subspace [5, 49]. The images of the face of each person form one low dimensional subspace, and the face clustering problem reduces to clustering the collection of images into multiple subspaces.

Fig. 23.1 A sample of faces from the Extended Yale Faces dataset B [40]. It contains 16128 images of 28 human subjects under 9 poses and 64 illumination conditions.#
As the examples above suggest, a typical hybrid model for a mixed data set consists of multiple primitive models where each primitive is a (low dimensional) subspace. The data set is modeled as being sampled from a collection or arrangement 2 of low dimensional subspaces.
An example of a statistical hybrid model is a Gaussian Mixture Model (GMM) where one assumes that the sample points are drawn independently from a mixture of Gaussian distributions. A way of estimating such a mixture model is the expectation maximization (EM) method.
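As an illustration, the following sketch (mine, not from the text, assuming NumPy and scikit-learn are available) fits a Gaussian mixture model to synthetic two-component data using scikit-learn's EM-based `GaussianMixture` estimator; the data, component count, and variable names are chosen purely for demonstration.

```python
# A minimal sketch: estimating a GMM by expectation maximization with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussian components (illustrative only).
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.7, size=(200, 2)),
])

# GaussianMixture estimates the mixture parameters via the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)          # hard segmentation of the samples
print(gmm.means_)                # estimated component means
print(np.bincount(labels))       # number of samples assigned to each component
```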
The fundamental difficulty in the estimation of hybrid models is the “chicken-and-egg” relationship between data segmentation and model estimation. If the data segmentation were known, one could easily fit a primitive model to each subset. Alternatively, if the constituent primitive models were known, one could easily segment the data by choosing the best model for each data point. An iterative approach starts with an initial (hopefully good) guess of primitive models or data segments. It then alternates between estimating the models for each segment and segmenting the data based on the current primitive models until the solution converges. In contrast, a global algorithm can perform the segmentation and primitive modeling simultaneously. In the sequel, we will look at a variety of algorithms for solving the subspace clustering problem.
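To make the alternation concrete, here is a minimal K-subspaces style sketch (my own illustration, not an algorithm defined in this section): it alternates between fitting a low dimensional basis to each current segment via the SVD and reassigning every point to the subspace with the smallest residual. The function name, the data layout (one point per column), and the fixed dimension `d` are assumptions made for the example.

```python
import numpy as np

def k_subspaces(Y, K, d, n_iter=50, seed=0):
    """Alternate between segmentation and per-cluster subspace estimation.

    Y : (M, S) data matrix with one point per column.
    K : number of subspaces, d : assumed dimension of each subspace.
    """
    rng = np.random.default_rng(seed)
    M, S = Y.shape
    labels = rng.integers(0, K, size=S)          # initial random segmentation
    for _ in range(n_iter):
        bases = []
        for k in range(K):
            Yk = Y[:, labels == k]
            if Yk.shape[1] == 0:                 # keep empty clusters alive
                Yk = Y[:, rng.integers(0, S, size=d + 1)]
            U, _, _ = np.linalg.svd(Yk, full_matrices=False)
            bases.append(U[:, :d])               # d-dimensional basis estimate
        # Residual of each point w.r.t. each subspace (distance to its projection).
        residuals = np.stack([
            np.linalg.norm(Y - B @ (B.T @ Y), axis=0) for B in bases
        ])
        new_labels = residuals.argmin(axis=0)
        if np.array_equal(new_labels, labels):   # segmentation stopped changing
            break
        labels = new_labels
    return labels, bases
```

On data sampled from a union of linear subspaces, this alternation typically recovers the segmentation up to a relabeling of the clusters, although, like EM, it can get stuck in poor local minima depending on the initialization.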
23.1. Notation#
First some general notation for vectors and matrices.
For a vector $v \in \mathbb{R}^n$, its support is denoted by $\operatorname{supp}(v)$ and is defined as $\operatorname{supp}(v) = \{ i : v_i \neq 0 \}$. $|v|$ denotes the vector obtained by taking the absolute values of the entries in $v$. $\mathbf{1}$ denotes a vector whose every entry is $1$. $\| v \|_p$ denotes the $\ell_p$ norm of $v$. $\| v \|_0$ denotes the $\ell_0$-“norm” of $v$.

Let $A$ be any real $m \times n$ matrix. $a_{i j}$ is the element at the $i$-th row and $j$-th column of $A$. $a_j$ with $1 \leq j \leq n$ denotes the $j$-th column vector of $A$. $\underline{a}^i$ with $1 \leq i \leq m$ denotes the $i$-th row vector of $A$. $a_j(i)$ is the $i$-th entry in $a_j$. $\underline{a}^i(j)$ is the $j$-th entry in $\underline{a}^i$. $A_{\Lambda}$ denotes the submatrix of $A$ consisting of the columns indexed by $\Lambda$. $A^{\Lambda}$ denotes the submatrix of $A$ consisting of the rows indexed by $\Lambda$. $|A|$ denotes the matrix consisting of the absolute values of the entries in $A$. $\operatorname{rowsupp}(A)$ denotes the index set of the non-zero rows of $A$. Clearly, $\operatorname{rowsupp}(A) \subseteq \{ 1, \dots, m \}$. $\| A \|_{\text{row},0}$ denotes the number of non-zero rows of $A$. Clearly, $\| A \|_{\text{row},0} = |\operatorname{rowsupp}(A)| \leq m$. We note that while $\| A \|_{\text{row},0}$ is not a norm, its behavior is similar to the $\ell_0$-“norm” for vectors defined as $\| v \|_0 = |\operatorname{supp}(v)|$.

We use $f_X(x)$ and $F_X(x)$ to denote the PDF and CDF of a continuous random variable $X$. We use $p_X(x)$ to denote the PMF of a discrete random variable $X$. We use $\mathbb{P}(E)$ to denote the probability of an event $E$.
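For readers who prefer to see the notation operationally, here is a small NumPy illustration (mine, not part of the text) of the vector support, the $\ell_0$-“norm”, column and row extraction, and the row support of a matrix; all variable names are arbitrary.

```python
import numpy as np

v = np.array([0.0, -3.0, 0.0, 2.0])
support = np.flatnonzero(v)          # supp(v): indices of non-zero entries -> [1, 3]
l0 = support.size                    # l0-"norm": number of non-zero entries -> 2
abs_v = np.abs(v)                    # |v|: entrywise absolute values

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
col_1 = A[:, 1]                      # a column of A
row_0 = A[0, :]                      # a row of A
row_support = np.flatnonzero(np.any(A != 0, axis=1))   # indices of non-zero rows -> [0, 2]
num_nonzero_rows = row_support.size                    # number of non-zero rows -> 2
print(support, l0, abs_v, row_support, num_nonzero_rows)
```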
23.1.1. Problem Formulation#
The data set can be modeled as a set of data points lying in a union of low dimensional linear or affine subspaces in a Euclidean space $\mathbb{R}^M$, where $M$ denotes the dimension of the ambient space.

Let the data set be $\{ y_s \}_{s=1}^{S} \subset \mathbb{R}^M$, drawn from the union of subspaces under consideration. $S$ is the total number of data points being analyzed simultaneously. We put the data points together in a data matrix as

$$Y = \begin{bmatrix} y_1 & y_2 & \dots & y_S \end{bmatrix}.$$

The data matrix $Y$ is known to us. Note that in statistics texts, data samples are usually placed in the rows of the data matrix; we are placing data samples in the columns of the data matrix.
We will slightly abuse the notation and let $Y$ denote the set of data points $\{ y_1, \dots, y_S \}$ also. We will use the terms data points and vectors interchangeably.
Let the vectors be drawn from a set of $K$ (linear or affine) subspaces. The number of subspaces may not be known in advance. The subspaces are indexed by a variable $k$ with $1 \leq k \leq K$. The $k$-th subspace is denoted by $\mathcal{U}_k$. Let the (linear or affine) dimension of the $k$-th subspace be $D_k$ with $D_k \leq D$. Here $D$ is an upper bound on the dimension of the individual subspaces. We may or may not know $D$. We assume that none of the subspaces is contained in another.
A pair of subspaces may not intersect (e.g. parallel lines or planes), may have a trivial intersection (lines passing through the origin), or a non-trivial intersection (two planes intersecting at a line).
The collection of subspaces may also be independent or disjoint (see Linear Algebra).
The vectors in $Y$ can be grouped (or segmented or clustered) into submatrices $Y_1, Y_2, \dots, Y_K$ such that all vectors in $Y_k$ lie in subspace $\mathcal{U}_k$. Thus, we can write

$$Y \Gamma = \begin{bmatrix} Y_1 & Y_2 & \dots & Y_K \end{bmatrix},$$

where $\Gamma$ is an unknown permutation matrix placing each vector in the right subspace. This segmentation is straightforward if the (affine) subspaces do not intersect or the subspaces intersect trivially at one point (e.g. any pair of linear subspaces passes through the origin).
Let there be $S_k$ vectors in $Y_k$ with $S = S_1 + S_2 + \dots + S_K$. We may not have any prior information about the number of points in the individual subspaces.
We do typically require that there are enough vectors drawn from each subspace so that they can span the corresponding subspace.
This requirement may vary for individual subspace clustering algorithms.
For example, for linear subspaces, sparse representation based algorithms require that whenever a vector is removed from $Y_k$, the remaining set of vectors still spans $\mathcal{U}_k$. This guarantees that every vector in $Y_k$ can be represented in terms of the other vectors in $Y_k$. The minimum $S_k$ for which this is possible is $S_k = D_k + 1$ when the data points from each subspace are in general position (i.e. every subset of $D_k$ points in $Y_k$ spans $\mathcal{U}_k$).
Let $Q_k \in \mathbb{R}^{M \times D_k}$ be an orthonormal basis for subspace $\mathcal{U}_k$. Then, the subspaces can be described as

$$\mathcal{U}_k = \{ y \in \mathbb{R}^M : y = \mu_k + Q_k \alpha \}, \quad 1 \leq k \leq K,$$

where $\mu_k \in \mathbb{R}^M$ is an offset vector and $\alpha \in \mathbb{R}^{D_k}$ is the representation of $y - \mu_k$ in the basis $Q_k$. For linear subspaces, $\mu_k = 0$.
We will abuse $Y_k$ to also denote the set of vectors from the $k$-th subspace.
The basic objective of subspace clustering algorithms is to obtain a clustering or segmentation of the vectors in $Y$ into $Y_1, \dots, Y_K$. This involves finding out the number of subspaces/clusters $K$ and placing each vector $y_s$ in its cluster correctly.
Alternatively, if we can identify the permutation $\Gamma$ and the numbers $S_1, \dots, S_K$ correctly, we have solved the clustering problem. Since the clusters fall into different subspaces, as part of subspace clustering, we may also identify the dimensions $D_k$ of the individual subspaces, the bases $Q_k$, and the offset vectors $\mu_k$ in the case of affine subspaces. These quantities emerge due to modeling the clustering problem as a subspace clustering problem.
However, they are not essential outputs of the subspace clustering algorithms.
Some subspace clustering algorithms may not calculate them, yet they are useful in the analysis of the algorithms.
See Data Clustering for a quick review of data clustering terminology.
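To make the formulation concrete, the following sketch (an illustration of mine, with all dimensions and names chosen arbitrarily) generates a synthetic data matrix $Y$ whose columns are drawn from a union of $K$ random linear subspaces of $\mathbb{R}^M$, together with the ground-truth labels.

```python
import numpy as np

def union_of_subspaces(M=20, dims=(3, 3, 4), points_per_subspace=50, seed=0):
    """Sample points from K random linear subspaces of R^M.

    Returns the M x S data matrix Y and the S ground-truth labels.
    """
    rng = np.random.default_rng(seed)
    columns, labels = [], []
    for k, d in enumerate(dims):
        # Random orthonormal basis Q_k of a d-dimensional subspace of R^M.
        Q, _ = np.linalg.qr(rng.standard_normal((M, d)))
        coeffs = rng.standard_normal((d, points_per_subspace))
        columns.append(Q @ coeffs)               # points y = Q_k alpha
        labels.extend([k] * points_per_subspace)
    Y = np.hstack(columns)
    return Y, np.array(labels)

Y, labels = union_of_subspaces()
print(Y.shape, np.bincount(labels))              # (20, 150) [50 50 50]
```

Such a matrix, with its columns randomly permuted, is the typical input to the clustering algorithms discussed later; the labels serve only to evaluate a recovered segmentation.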
23.1.2. Noisy Case#
We also consider clustering of data points which are contaminated with noise.
The data points do not perfectly lie in a subspace but can be approximated as a sum of a component which lies perfectly in a subspace and a noise component.
Let $y_s = \bar{y}_s + e_s$ be the $s$-th vector that is obtained by corrupting an error free vector $\bar{y}_s$ (which perfectly lies in a low dimensional subspace) with a noise vector $e_s \in \mathbb{R}^M$. The clustering problem remains the same. Our goal would be to characterize the behavior of the clustering algorithms in the presence of noise at different levels.
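A minimal sketch of this noise model follows (my own illustration; `Y_clean` is only a stand-in for a clean union-of-subspaces matrix such as the one constructed above, and the noise level is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)
M, S, sigma = 20, 150, 0.05                     # ambient dim, sample count, noise level
# Stand-in for a clean data matrix whose columns lie in low dimensional subspaces.
Y_clean = rng.standard_normal((M, 5)) @ rng.standard_normal((5, S))
E = sigma * rng.standard_normal((M, S))         # noise vectors e_s as columns
Y_noisy = Y_clean + E                           # y_s = y_bar_s + e_s
print(np.linalg.norm(E, axis=0).mean())         # average per-column noise norm
```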
23.2. Linear Algebra#
This section reviews some useful concepts from linear algebra relevant for the chapter.
A collection of linear subspaces $\{ \mathcal{U}_1, \dots, \mathcal{U}_K \}$ is called independent if the dimension of their direct sum equals the sum of their individual dimensions, i.e. $\dim(\mathcal{U}_1 \oplus \dots \oplus \mathcal{U}_K) = \sum_{k=1}^K \dim(\mathcal{U}_k)$.
A collection of linear subspaces is called disjoint if they are pairwise independent [33]. In other words, every pair of subspaces intersects only at the origin.
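These two notions can be checked numerically from spanning matrices via ranks: the subspaces are independent when the dimension of their sum equals the sum of their dimensions, and disjoint when the same holds for every pair. The sketch below (my own, assuming NumPy) does exactly that.

```python
import numpy as np

def dim_of_sum(bases):
    """Dimension of U_1 + ... + U_K given matrices whose columns span each U_k."""
    return np.linalg.matrix_rank(np.hstack(bases))

def are_independent(bases):
    return dim_of_sum(bases) == sum(np.linalg.matrix_rank(B) for B in bases)

def are_disjoint(bases):
    # Disjoint = pairwise independent (every pair intersects only at the origin).
    K = len(bases)
    return all(are_independent([bases[i], bases[j]])
               for i in range(K) for j in range(i + 1, K))

# Three coordinate axes in R^3: independent (hence also disjoint).
e1, e2, e3 = np.eye(3)[:, [0]], np.eye(3)[:, [1]], np.eye(3)[:, [2]]
print(are_independent([e1, e2, e3]), are_disjoint([e1, e2, e3]))   # True True

# Two planes in R^3 sharing a line: neither independent nor disjoint.
P1 = np.eye(3)[:, [0, 1]]            # x-y plane
P2 = np.eye(3)[:, [0, 2]]            # x-z plane
print(are_independent([P1, P2]), are_disjoint([P1, P2]))           # False False
```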
23.2.1. Affine Subspaces#
For a detailed introduction to affine concepts, see [54].
For a vector $v \in \mathbb{R}^N$, the function $f_v$ defined by $f_v(x) = x + v$ is a translation of $\mathbb{R}^N$ by $v$. The image of any set $S$ under $f_v$ is the $v$-translate of $S$. A translation of space is a one to one isometry of $\mathbb{R}^N$ onto $\mathbb{R}^N$.
A translate of a $d$-dimensional linear subspace of $\mathbb{R}^N$ is a $d$-dimensional flat, or simply a $d$-flat, in $\mathbb{R}^N$. Flats of dimension $1$, $2$, and $N - 1$ are also called lines, planes, and hyperplanes, respectively.
Flats are also known as affine subspaces.
Every $d$-flat in $\mathbb{R}^N$ is congruent to the Euclidean space $\mathbb{R}^d$.
Flats are closed sets.
Affine combinations
An affine combination of the vectors $v_1, \dots, v_m$ is a linear combination in which the sum of the coefficients is $1$. Thus, $b = \sum_{i=1}^m \alpha_i v_i$ is an affine combination of $v_1, \dots, v_m$ if $\sum_{i=1}^m \alpha_i = 1$.
The set of affine combinations of a set of vectors $\{ v_1, \dots, v_m \}$ is their affine span.
A finite set of vectors $\{ v_1, \dots, v_m \}$ is called affine independent if the only zero-sum linear combination representing the null vector is the null combination, i.e. $\sum_{i=1}^m \alpha_i v_i = 0$ and $\sum_{i=1}^m \alpha_i = 0$ imply $\alpha_1 = \dots = \alpha_m = 0$.
Otherwise, the set is affinely dependent.
A finite set of two or more vectors is affine independent if and only if none of them is an affine combination of the others.
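A convenient computational test (an illustration of mine, not from the text) uses the fact that $\{ v_1, \dots, v_m \}$ is affinely independent exactly when the augmented vectors $(v_i, 1)$ are linearly independent; the sketch below, assuming NumPy, applies this rank test.

```python
import numpy as np

def is_affine_independent(points):
    """Return True if the given points (rows) are affinely independent.

    Uses the equivalence: {v_i} affinely independent  <=>
    the augmented vectors (v_i, 1) are linearly independent.
    """
    V = np.asarray(points, dtype=float)
    augmented = np.hstack([V, np.ones((V.shape[0], 1))])  # append a 1 to each point
    return np.linalg.matrix_rank(augmented) == V.shape[0]

# Three non-collinear points in the plane are affinely independent ...
print(is_affine_independent([[0, 0], [1, 0], [0, 1]]))   # True
# ... while three collinear points are affinely dependent.
print(is_affine_independent([[0, 0], [1, 1], [2, 2]]))   # False
```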
Vectors vs. Points
We often use capital letters to denote points and bold small letters to denote vectors.
The origin is referred to by the letter $O$. An $N$-tuple $(x_1, x_2, \dots, x_N)$ is used to refer to a point $X$ in $\mathbb{R}^N$ as well as to a vector $x$ from the origin $O$ to $X$ in $\mathbb{R}^N$. In basic linear algebra, the terms vector and point are used interchangeably.
While discussing geometrical concepts (affine or convex sets etc.), it is useful to distinguish between vectors and points.
When the terms “dependent” and “independent” are used without qualification to points, they refer to affine dependence/independence.
When used for vectors, they mean linear dependence/independence.
The span of $d + 1$ (affine) independent points is a $d$-flat, and it is the unique $d$-flat that contains all of them.
Every $d$-flat contains $d + 1$ (affine) independent points.
Each set of $d + 1$ independent points in a $d$-flat forms an affine basis for the flat.
Each point of a $d$-flat is represented by one and only one affine combination of a given affine basis for the flat.
The coefficients of this affine combination are the affine coordinates of the point in the given affine basis of the $d$-flat.
A $d$-flat is contained in a linear subspace of dimension $d + 1$.
This can be easily obtained by choosing an affine basis for the flat and constructing its linear span.
Affine functions
A function $f$ defined on a vector space $V$ is an affine function or affine transformation or affine mapping if it maps every affine combination of vectors in $V$ onto the same affine combination of their images.
If $f$ is real valued, then $f$ is an affine functional.
A property which is invariant under an affine mapping is called an affine invariant.
The image of a flat under an affine function is a flat.
Every affine function differs from a linear function by a translation.
Affine functionals
A functional $f$ is an affine functional if and only if there exist a unique vector $a \in \mathbb{R}^N$ and a unique real number $b$ such that $f(x) = \langle x, a \rangle + b$.
Affine functionals are continuous.
If $a \neq 0$, then the linear functional $f_L(x) = \langle x, a \rangle$ and the affine functional $f(x) = \langle x, a \rangle + b$ map bounded sets onto bounded sets, neighborhoods onto neighborhoods, balls onto balls and open sets onto open sets.
23.2.2. Hyperplanes and Halfspaces#
Corresponding to a hyperplane $H$ in $\mathbb{R}^N$ (an $(N-1)$-flat), there exist a non-null vector $a$ and a real number $b$ such that $H$ is the graph of the equation $\langle x, a \rangle = b$. The vector $a$ is orthogonal to $y - x$ for all $x, y \in H$. All non-null vectors with this property are normal to the hyperplane. The directions of $a$ and $-a$ are called the opposite normal directions of $H$.
Conversely, the graph of $\langle x, a \rangle = b$ with $a \neq 0$ is a hyperplane for which $a$ is a normal vector. If $\langle x, a_1 \rangle = b_1$ and $\langle x, a_2 \rangle = b_2$, with $a_1, a_2 \neq 0$, are both representations of a hyperplane $H$, then there exists a real non-zero number $\lambda$ such that $a_2 = \lambda a_1$ and $b_2 = \lambda b_1$. We can always find a unit norm normal vector for $H$.

Each point $P$ in space has a unique foot (nearest point) in a hyperplane $H$. The distance of the point $P$ with position vector $p$ from a hyperplane $H : \langle x, a \rangle = b$ is given by

$$d(P, H) = \frac{| \langle p, a \rangle - b |}{\| a \|_2}.$$

The position vector of the foot is given by

$$p' = p + \frac{b - \langle p, a \rangle}{\| a \|_2^2} a.$$

Hyperplanes $H_1$ and $H_2$ are parallel if they do not intersect.
This occurs if and only if they have a common normal direction.
They are then different level sets of the same linear functional.
If $H_1 : \langle x, a \rangle = b_1$ and $H_2 : \langle x, a \rangle = b_2$ are parallel hyperplanes, then the distance between the two hyperplanes is given by

$$d(H_1, H_2) = \frac{| b_1 - b_2 |}{\| a \|_2}.$$
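The following sketch (my own illustration, assuming NumPy) evaluates these formulas numerically for a hyperplane $\langle x, a \rangle = b$: it computes the distance of a point from the hyperplane and the foot of the perpendicular.

```python
import numpy as np

def point_to_hyperplane(p, a, b):
    """Distance of point p from the hyperplane <x, a> = b and its foot."""
    p = np.asarray(p, dtype=float)
    a = np.asarray(a, dtype=float)
    residual = p @ a - b                      # signed value of <p, a> - b
    distance = abs(residual) / np.linalg.norm(a)
    foot = p - residual / (a @ a) * a         # orthogonal projection onto the hyperplane
    return distance, foot

# Plane x + y + z = 1 in R^3 and the point (1, 1, 1).
d, foot = point_to_hyperplane([1, 1, 1], [1, 1, 1], 1.0)
print(d)      # 2 / sqrt(3) ~ 1.1547
print(foot)   # [1/3, 1/3, 1/3]
```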
Half-spaces
If $\langle x, a \rangle = b$, $a \neq 0$, is a hyperplane $H$, then the graphs of $\langle x, a \rangle > b$ and $\langle x, a \rangle < b$ are the opposite sides or opposite open half spaces of $H$. The graphs of $\langle x, a \rangle \geq b$ and $\langle x, a \rangle \leq b$ are the opposite closed half spaces of $H$. $H$ is the face of the four half spaces.
Corresponding to a hyperplane $H$, there exists a unique pair of sets that are the opposite sides of $H$.
Open half spaces are open sets and closed half spaces are closed sets.
If $P_1$ and $P_2$ belong to the opposite sides of a hyperplane $H$, then there exists a unique point of $H$ that lies between $P_1$ and $P_2$.
23.2.3. General Position#
A general position for a set of points or other geometric objects is a notion of genericity.
It means the general case situation as opposed to more special and coincidental cases.
For example, generically, two lines in a plane intersect in a single point.
The special cases are when the two lines are either parallel or coincident.
Three points in a plane in general are not collinear.
If they are, then it is a degenerate case.
A set of $N$ or more points in $\mathbb{R}^N$ is said to be in general position if every subset of $N$ points is linearly independent. In general, a set of $d + 1$ or more points in a $d$-flat is said to be in general linear position if no hyperplane contains more than $d$ of the points.
23.3. Matrix Factorizations#
23.3.1. Singular Value Decomposition#
A non-negative real value $\sigma$ is a singular value for a matrix $A \in \mathbb{R}^{m \times n}$ if and only if there exist unit length vectors $u \in \mathbb{R}^m$ and $v \in \mathbb{R}^n$ such that $A v = \sigma u$ and $A^T u = \sigma v$. The vectors $u$ and $v$ are called the left singular and right singular vectors for $\sigma$ respectively.
For every $A \in \mathbb{R}^{m \times n}$ with $k = \min(m, n)$, there exist two orthogonal matrices $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ and a sequence of real numbers $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_k \geq 0$ such that $U^T A V = \Sigma$, where $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k) \in \mathbb{R}^{m \times n}$ (extra columns or rows are filled with zeros). The decomposition of $A$ given by $A = U \Sigma V^T$ is called the singular value decomposition of $A$. The first $k$ columns of $U$ and $V$ are the left and right singular vectors of $A$ corresponding to the singular values $\sigma_1, \dots, \sigma_k$.

The rank of $A$ is equal to the number of non-zero singular values, which equals the rank of $\Sigma$. The eigenvalues of the positive semi-definite matrices $A^T A$ and $A A^T$ are given by $\sigma_1^2, \dots, \sigma_k^2$ (the remaining eigenvalues being $0$). Specifically, $A^T A = V \Sigma^T \Sigma V^T$ and $A A^T = U \Sigma \Sigma^T U^T$. We can rewrite $A = \sum_{i=1}^k \sigma_i u_i v_i^T$. $\sigma_1 u_1 v_1^T$ is the best rank-1 approximation of $A$ in the Frobenius norm sense. The $2$-norm (spectral norm) of $A$ is given by its largest singular value: $\| A \|_2 = \sigma_1$.

The Moore-Penrose pseudo-inverse of $\Sigma$ is easily obtained by taking the transpose of $\Sigma$ and inverting the non-zero singular values. Further, $A^{\dagger} = V \Sigma^{\dagger} U^T$. The non-zero singular values of $A^{\dagger}$ are just the reciprocals of the non-zero singular values of $A$.
Geometrically, the singular values of $A$ are precisely the lengths of the semi-axes of the hyper-ellipsoid $E$ defined by

$$E = \{ A x : \| x \|_2 = 1 \}$$

(i.e. the image of the unit sphere under $A$). Thus, if $A$ is a data matrix, then the SVD of $A$ is strongly connected with the principal component analysis of $A$.
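A short NumPy sketch (illustrative, not part of the original text) verifies several of these facts on a random matrix: the reconstruction $A = U \Sigma V^T$, the rank, the spectral norm, and the pseudo-inverse built from the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 8))   # 5 x 8 matrix of rank 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # s holds singular values, descending
Sigma = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))              # A = U Sigma V^T
print(np.sum(s > 1e-10))                           # numerical rank (3)
print(np.isclose(np.linalg.norm(A, 2), s[0]))      # 2-norm equals the largest singular value

# Pseudo-inverse from the SVD: invert the non-zero singular values.
s_inv = np.where(s > 1e-10, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))      # matches NumPy's pinv
```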
23.4. Principal Angles#
If $\mathcal{U}$ and $\mathcal{W}$ are two linear subspaces of $\mathbb{R}^N$, then the smallest principal angle between them, denoted by $\theta$, is defined as [12]

$$\cos \theta = \max_{u \in \mathcal{U}, \, w \in \mathcal{W}} \frac{\langle u, w \rangle}{\| u \|_2 \| w \|_2}.$$

In other words, we try to find unit norm vectors in the two spaces which are maximally aligned with each other.
The angle between them is the smallest principal angle.
Note that $0 \leq \theta \leq \pi / 2$ ($\cos \theta$ as defined above is always non-negative).
If we have $A$ and $B$ as matrices whose column spans are the subspaces $\mathcal{U}$ and $\mathcal{W}$ respectively, then in order to find the principal angles, we construct orthonormal bases $Q_A$ and $Q_B$ for the two subspaces.
We then compute the inner product matrix $G = Q_A^T Q_B$.
The SVD of $G$ gives the principal angles.
In particular, the smallest principal angle is given by $\cos \theta = \sigma_1$, the largest singular value of $G$.
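The following sketch (an illustration under the assumptions above, using NumPy) computes the principal angles between two subspaces given by spanning matrices: orthonormal bases are obtained by QR, and the singular values of $Q_A^T Q_B$ are the cosines of the principal angles.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between span(A) and span(B).

    A and B hold spanning vectors of the two subspaces in their columns.
    """
    Qa, _ = np.linalg.qr(A)                 # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)                 # orthonormal basis of span(B)
    sigma = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    sigma = np.clip(sigma, -1.0, 1.0)       # guard against round-off outside [-1, 1]
    return np.arccos(sigma)                 # smallest angle corresponds to sigma_1

# Two planes in R^3 sharing the x-axis: angles are 0 and 90 degrees.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # x-y plane
B = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])   # x-z plane
print(np.degrees(principal_angles(A, B)))            # -> [ 0. 90.]
```

For production use, a library routine such as `scipy.linalg.subspace_angles` implements the same computation with more numerical care.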
- 1
PCA can also be viewed as a statistical model. When the data points are independent samples drawn from a Gaussian distribution, the geometric formulation of PCA coincides with its statistical formulation.
- 2
We use the terms arrangement and union interchangeably. For more discussion, see Algebraic Geometry.