23.5. Motion Segmentation#
The theory of structure from motion and motion segmentation has evolved over a set of papers [14, 24, 39, 53, 66, 73, 74]. In this section, we review the essential ideas from this series of work.
A typical image sequence (from a single camera shot) may contain multiple objects moving independently of each other. In the simplest model, we can assume that images in a sequence are views of a single moving object observed by a stationary camera or a stationary object observed by a moving camera. Only rigid motions are considered. In either case, the object is moving with respect to the camera.
The structure from motion problem focuses on recovering the (3D) shape and motion information of the moving object. In the general case, there are multiple objects moving independently. Thus, we also need to perform a motion segmentation such that motions of different objects can be separated and (either after or simultaneously) shape and motion of each object can be inferred.
This problem is typically solved in two stages. In the first stage, a frame-to-frame correspondence problem is solved to identify a set of feature points whose coordinates can be tracked as they move from one position to another over the sequence. This yields a set of trajectories for these points over the frames in the video. If there is a single moving object, or the scene is static and the observer is moving, then all the feature points belong to the same object. Otherwise, the feature points must be clustered according to the objects they belong to. In the second stage, the trajectories are analyzed to group the feature points into separate objects and to recover the shape and motion of each object. In this section we assume that the feature trajectories have been obtained by an appropriate method; our focus is to identify the moving objects and to obtain the shape and motion information for each object from the trajectories.
23.5.1. Modeling Structure from Motion for Single Object#
We start with the simple model of a static camera and a moving object. All feature point trajectories belong to the moving object. Our objective is to demonstrate that the subspace spanned by feature trajectories of a single moving object is a low dimensional subspace.
Let the image sequence consist of $F$ frames, and suppose that $P$ feature points have been tracked across all the frames. Let $x_{fp} \in \mathbb{R}^2$ denote the image coordinates of the $p$-th point in the $f$-th frame. The trajectory of the $p$-th point is obtained by stacking its coordinates over the frames:

$$
z_p = \begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{Fp} \end{bmatrix} \in \mathbb{R}^{2F}.
$$

Putting together the feature trajectory vectors of all $P$ points as columns, we form the matrix

$$
X = \begin{bmatrix} z_1 & z_2 & \dots & z_P \end{bmatrix} \in \mathbb{R}^{2F \times P}.
$$

This is the data matrix under consideration from which the shape and motion of the object need to be inferred.
We need two coordinate systems. We use the camera coordinate system as the world coordinate system, with the origin at the optical center and the $z$-axis along the optical axis. We attach a second coordinate system to the moving object, with its origin at the centroid of the feature points. Let $y_p \in \mathbb{R}^3$ denote the coordinates of the $p$-th feature point in the object coordinate system. The collection $\{ y_p \}_{p=1}^P$ represents the shape of the object (w.r.t. its centroid).

Let us choose an orthonormal basis in the object coordinate system. Let $R_f \in \mathrm{SO}(3)$ and $t_f \in \mathbb{R}^3$ denote the rotation and translation of the object relative to the camera in the $f$-th frame. The world coordinates of the $p$-th point in the $f$-th frame are then $R_f y_p + t_f$.

Assuming orthographic projection and letting

$$
\Pi = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
$$

denote the projection that retains the first two coordinates, the image coordinates become

$$
x_{fp} = \Pi ( R_f y_p + t_f )
= \begin{bmatrix} \Pi R_f & \Pi t_f \end{bmatrix}
\begin{bmatrix} y_p \\ 1 \end{bmatrix}
= \Lambda_f \begin{bmatrix} y_p \\ 1 \end{bmatrix},
$$

where $\Lambda_f = \begin{bmatrix} \Pi R_f & \Pi t_f \end{bmatrix} \in \mathbb{R}^{2 \times 4}$ captures the pose of the object in the $f$-th frame.

If we write

$$
M = \begin{bmatrix} \Lambda_1 \\ \vdots \\ \Lambda_F \end{bmatrix} \in \mathbb{R}^{2F \times 4}
\quad \text{and} \quad
S = \begin{bmatrix} y_1 & \dots & y_P \\ 1 & \dots & 1 \end{bmatrix}^T \in \mathbb{R}^{P \times 4},
$$

we rewrite this as

$$
X = M S^T,
$$

where $M$ is the *motion matrix* and $S$ is the *structure matrix*. Since $M$ has only 4 columns, $\mathrm{rank}(X) \leq 4$. In other words, the feature trajectories of a single rigidly moving object span a subspace of $\mathbb{R}^{2F}$ of dimension at most 4.
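The rank bound can be checked numerically. The following sketch (with a hypothetical rotation and translation schedule chosen purely for illustration) synthesizes trajectories of a rigid object under orthographic projection and verifies that the resulting data matrix has rank at most 4.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 30, 100                          # frames, feature points

# Shape: 3D points in the object coordinate system, centroid at the origin
Y = rng.standard_normal((3, P))
Y -= Y.mean(axis=1, keepdims=True)

Pi = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])        # orthographic projection

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

rows = []
for f in range(F):
    # hypothetical smooth rigid motion over the sequence
    R = rot_z(0.10 * f) @ rot_x(0.07 * f)
    t = np.array([0.5 * f, -0.2 * f, 1.0])
    rows.append(Pi @ (R @ Y + t[:, None]))   # 2 x P image coordinates

X = np.vstack(rows)                     # 2F x P data matrix
print(np.linalg.matrix_rank(X))         # bounded by 4 although X is 60 x 100
```

Even though the data matrix has 100 columns of dimension 60, the rigid-motion model confines them to a 4-dimensional subspace.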
23.5.2. Solving the Structure From Motion Problem#
We digress a bit to understand how to perform the factorization of $X$ into the motion matrix $M$ and the structure matrix $S$.

Since $\mathrm{rank}(X) \leq 4$, we can compute a rank-4 (truncated) singular value decomposition $X = U \Sigma V^T$ with $U \in \mathbb{R}^{2F \times 4}$, $\Sigma \in \mathbb{R}^{4 \times 4}$, and $V \in \mathbb{R}^{P \times 4}$, and define

$$
\hat{M} = U \Sigma^{1/2}, \quad \hat{S} = V \Sigma^{1/2}
$$

so that $X = \hat{M} \hat{S}^T$.

Matrices $\hat{M}$ and $\hat{S}$ have the right dimensions, but there is no unique factorization of $X$ into motion and structure. If $X = \hat{M} \hat{S}^T$ is one solution, then for any invertible matrix $A \in \mathbb{R}^{4 \times 4}$, the pair

$$
M = \hat{M} A, \quad S = \hat{S} A^{-T}
$$

is also a possible solution since

$$
(\hat{M} A)(\hat{S} A^{-T})^T = \hat{M} A A^{-1} \hat{S}^T = \hat{M} \hat{S}^T = X,
$$

where $A^{-T}$ denotes the transpose of $A^{-1}$. The matrix $A$ is pinned down (up to a rigid transformation) by requiring that $M = \hat{M} A$ have the structure of a true motion matrix, i.e., that each block $\Lambda_f$ have the form $\begin{bmatrix} \Pi R_f & \Pi t_f \end{bmatrix}$. This yields the two families of constraints below.
**Rotational constraints**

Recall that $\Lambda_f = \begin{bmatrix} \Pi R_f & \Pi t_f \end{bmatrix}$, so the first three columns of $\Lambda_f$ are $\Pi R_f$, i.e., the first two rows of the rotation matrix $R_f$. Since the rows of a rotation matrix are orthonormal, we require

$$
(\Pi R_f)(\Pi R_f)^T = I_2, \quad f = 1, \dots, F,
$$

where $I_2$ is the $2 \times 2$ identity matrix. Written in terms of $\hat{M} A$, these conditions become quadratic constraints on the entries of $A$, one set per frame.
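As a quick sanity check of the constraint itself: the first two rows of any rotation matrix are orthonormal, so $(\Pi R_f)(\Pi R_f)^T$ is the identity. The rotation below is an arbitrary example.

```python
import numpy as np

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

Pi = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])

R = rot_z(0.3)              # an arbitrary rotation (about the z-axis here)
PiR = Pi @ R                # first two rows of the rotation matrix
# (Pi R)(Pi R)^T = I_2: the two retained rows are orthonormal
print(np.allclose(PiR @ PiR.T, np.eye(2)))
```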
**Translational constraints**

Recall that the image of the centroid of a set of points under an isometry (rigid motion) is the centroid of the images of the points under the same isometry. Since the origin of the object coordinate system sits at the centroid of the feature points, we have $\frac{1}{P} \sum_{p=1}^P y_p = 0$. The homogeneous coordinates of the centroid in the object coordinate system are therefore $\begin{bmatrix} 0 & 0 & 0 & 1 \end{bmatrix}^T$.

Putting this back into $x_{fp} = \Lambda_f \begin{bmatrix} y_p \\ 1 \end{bmatrix}$, we obtain

$$
\frac{1}{P} \sum_{p=1}^P x_{fp} = \Lambda_f \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \Pi t_f.
$$

In matrix form, the fourth column of $M$ equals $\frac{1}{P} X \mathbf{1}_P$, the mean of the columns of $X$. A least squares solution for the fourth column of $A$ is therefore obtained from the overdetermined linear system $\hat{M} a_4 = \frac{1}{P} X \mathbf{1}_P$.
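The translational constraint can be verified numerically: under the centroid convention, the per-frame mean of the image points reproduces the projected translations exactly, and the least squares system for the translation column is consistent. The motion schedule below is a hypothetical example, as before.

```python
import numpy as np

rng = np.random.default_rng(2)
F, P = 30, 100
Y = rng.standard_normal((3, P))
Y -= Y.mean(axis=1, keepdims=True)          # object origin at the centroid
Pi = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

rows, trans = [], []
for f in range(F):
    R = rot_z(0.10 * f) @ rot_x(0.07 * f)   # hypothetical rigid motion
    t = np.array([0.5 * f, -0.2 * f, 1.0])
    rows.append(Pi @ (R @ Y + t[:, None]))
    trans.append(Pi @ t)
X = np.vstack(rows)                         # 2F x P

# Mean of the image points in each frame equals Pi @ t_f
mean_traj = X.mean(axis=1)                  # length-2F vector
print(np.allclose(mean_traj, np.concatenate(trans)))

# Least squares recovery of the translation column of the motion matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)
M_hat = U[:, :4] * np.sqrt(s[:4])
a4, *_ = np.linalg.lstsq(M_hat, mean_traj, rcond=None)
print(np.allclose(M_hat @ a4, mean_traj))   # system is consistent
```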
23.5.3. Modeling Motion for Multiple Objects#
The generalization of the modeling of motion of one object to multiple objects is straightforward. Let there be $K$ objects moving independently, let $X_k \in \mathbb{R}^{2F \times P_k}$ collect the feature trajectories of the $k$-th object, and let $P = \sum_{k=1}^K P_k$. Since the assignment of trajectories to objects is unknown a priori, the measured data matrix is

$$
X = \begin{bmatrix} X_1 & X_2 & \dots & X_K \end{bmatrix} \Gamma,
$$

where $\Gamma \in \mathbb{R}^{P \times P}$ is an unknown permutation matrix.

Clearly, each submatrix $X_k$ has rank at most 4: the trajectories of the $k$-th object span a subspace of $\mathbb{R}^{2F}$ of dimension at most 4.

Let us dig a bit deeper to see how the motion shape factorization identity changes for the multi-object formulation. Each data submatrix factors as $X_k = M_k S_k^T$ with motion matrix $M_k \in \mathbb{R}^{2F \times 4}$ and structure matrix $S_k \in \mathbb{R}^{P_k \times 4}$.

If we further denote

$$
M = \begin{bmatrix} M_1 & M_2 & \dots & M_K \end{bmatrix} \in \mathbb{R}^{2F \times 4K}
\quad \text{and} \quad
S = \begin{bmatrix} S_1 & & \\ & \ddots & \\ & & S_K \end{bmatrix} \in \mathbb{R}^{P \times 4K},
$$

then we obtain a factorization similar to the single object case given by

$$
X = M S^T \Gamma.
$$

Thus, when the segmentation of the trajectories (equivalently, the permutation $\Gamma$) is known, $\mathrm{rank}(X) \leq 4K$ and the shape and motion of each object can be recovered by applying the single-object factorization to each $X_k$ separately. When the segmentation is unknown, the trajectories lie in a union of $K$ subspaces of $\mathbb{R}^{2F}$, each of dimension at most 4, and recovering the segmentation is a subspace clustering problem.
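The multi-object rank bound can be illustrated with two stand-in objects (random rank-4 trajectory matrices in place of real tracked data): even after an unknown column permutation mixes the trajectories, the combined data matrix has rank at most $4K$.

```python
import numpy as np

rng = np.random.default_rng(3)
F = 30
# Two independently moving "objects": stand-in rank-4 trajectory matrices
X1 = rng.standard_normal((2 * F, 4)) @ rng.standard_normal((4, 60))
X2 = rng.standard_normal((2 * F, 4)) @ rng.standard_normal((4, 40))

# An unknown permutation of the columns mixes the objects' trajectories
perm = rng.permutation(100)
X = np.hstack([X1, X2])[:, perm]

print(np.linalg.matrix_rank(X))   # bounded by 4K = 8
```

The permutation does not change the column space, so the rank bound, and the union-of-subspaces structure of the columns, survives the mixing.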
23.5.4. Limitations#
Our discussion so far has established that the feature trajectories of each moving object span a subspace of dimension at most 4. There are a number of reasons why this is only approximately valid in practice: perspective distortion of the camera, tracking errors, and pixel quantization. Thus, a subspace clustering algorithm should allow for the presence of noise or corruption in the data in real-life applications.
Our realization of an object is a set of feature points undergoing the same rotation and translation over a sequence of images. Notions of locality, color, connectivity, etc. play no role in this definition. It is possible that two visually distinct objects undergo the same rotation and translation within a given image sequence. For the purposes of inferring an object from its motion, these two visually distinct objects are treated as one.