18.5. Sparse and Redundant Representations#
18.5.1. Dictionaries#
(Dictionary)
A dictionary for \(\CC^N\) is a finite collection \(\bDDD\) of unit-norm vectors which span the whole space.
The elements of a dictionary are called atoms and they are denoted by \(\phi_{\omega}\) where \(\omega\) is drawn from an index set \(\Omega\).
The dictionary is written as

\[
\bDDD = \{ \phi_{\omega} : \omega \in \Omega \}
\]

where

\[
\| \phi_{\omega} \|_2 = 1 \quad \Forall \omega \in \Omega
\]

and every signal \(\bx \in \CC^N\) can be expressed as

\[
\bx = \sum_{\omega \in \Omega} b_{\omega} \phi_{\omega}.
\]

We use the letter \(D\) to denote the number of elements in the dictionary; i.e.,

\[
D = | \Omega |.
\]
This definition is adapted from [77].
The indices may have an interpretation, such as the time-frequency or time-scale localization of an atom, or they may simply be labels without any underlying meaning.
Note that the dictionary need not provide a unique representation for any vector \(\bx \in \CC^N\), but it provides at least one representation for each \(\bx \in \CC^N\).
When \(D=N\) we have a set of \(N\) unit-norm vectors which span the whole of \(\CC^N\); thus we have a basis (not necessarily an orthonormal basis). A dictionary cannot have \(D < N\), since fewer than \(N\) vectors cannot span \(\CC^N\). The more interesting case is when \(D > N\).
18.5.2. Redundant Dictionaries and Sparse Signals#
With \(D > N\), there are clearly more atoms than necessary to provide a representation of every signal \(\bx \in \CC^N\). Such a dictionary can therefore provide multiple representations of the same vector \(\bx\). We call such dictionaries redundant dictionaries or over-complete dictionaries.
In contrast, a basis with \(D=N\) is called a complete dictionary. A special class of signals consists of those which have a sparse representation in a given dictionary \(\bDDD\).
(\((\bDDD,K)\)-sparse signals)
A signal \(\bx \in \CC^N\) is called \((\bDDD,K)\)-sparse if it can be expressed as a linear combination of at most \(K\) atoms from the dictionary \(\bDDD\).
Normally, for sparse signals, we have \(K \ll D\). It is usually expected that \(K \ll N\) also holds.
Let \(\Lambda \subset \Omega\) be a subset of indices with \(|\Lambda|=K\).
Let \(\bx\) be any signal in \(\CC^N\) such that \(\bx\) can be expressed as

\[
\bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}.
\]

Note that this is not the only possible representation of \(\bx\) in \(\bDDD\); it is just one of them. Its special feature is that it is \(K\)-sparse; i.e., at most \(K\) atoms from the dictionary are used.
Now there are \(\binom{D}{K}\) ways in which we can choose a set of \(K\) atoms from the dictionary \(\bDDD\).
Thus the set of \((\bDDD,K)\)-sparse signals is given by

\[
\Sigma_{(\bDDD,K)} = \left \{ \bx \in \CC^N : \bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \text{ for some } \Lambda \subset \Omega \text{ with } | \Lambda | = K \right \}.
\]
This set \(\Sigma_{(\bDDD,K)}\) is dependent on the chosen dictionary \(\bDDD\). In the sequel, we will simply refer to it as \(\Sigma_K\).
(\(K\)-sparse signals for standard basis)
For the special case where \(\bDDD\) is the standard basis of \(\CC^N\),

\[
\Sigma_K = \{ \bx \in \CC^N : \| \bx \|_0 \leq K \};
\]

i.e., the set of signals which have \(K\) or fewer non-zero entries.
(\(K\)-sparse signals for orthonormal basis)
In contrast, if we choose an orthonormal basis \(\Psi\) such that every \(\bx \in \CC^N\) can be expressed as

\[
\bx = \Psi \bb,
\]

then with the dictionary \(\bDDD = \Psi\), the set of \(K\)-sparse signals is given by

\[
\Sigma_K = \{ \bx = \Psi \bb : \| \bb \|_0 \leq K \}.
\]
We also note that for a specific choice of \(\Lambda \subseteq \Omega\) with \(|\Lambda| = K\), the set of vectors

\[
\left \{ \bx \in \CC^N : \bx = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right \}
\]

forms a subspace of \(\CC^N\).
So the dictionary \(\bDDD\) gives rise to \(\binom{D}{K}\) such \(K\)-sparse subspaces, and the \(K\)-sparse signals lie in their union.
18.5.3. Sparse Approximation Problem#
In the sparse approximation problem, we attempt to approximate a given signal \(\bx \in \CC^N\) as a linear combination of \(K\) atoms from the dictionary \(\bDDD\), where \(K \ll N\) and typically \(N \ll D\); i.e., the number of atoms in the dictionary \(\bDDD\) is typically much larger than the ambient signal space dimension \(N\).
Naturally, we wish to obtain a best possible sparse representation of \(\bx\) over the atoms \(\phi_{\omega} \in \bDDD\) which minimizes the approximation error.
Let \(\Lambda\) denote the index set of atoms which are used to create a \(K\)-sparse representation of \(\bx\) where \(\Lambda \subset \Omega\) with \(|\Lambda| = K\).
Let \(\bx_{\Lambda}\) denote an approximation of \(\bx\) over the set of atoms indexed by \(\Lambda\).
Then we can write \(\bx_{\Lambda}\) as

\[
\bx_{\Lambda} = \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda}.
\]

We collect the complex-valued coefficients \(b_{\lambda}\) in the sum into a vector \(\bb_{\Lambda} \in \CC^K\). The approximation error is given by

\[
e = \left \| \bx - \bx_{\Lambda} \right \|_2.
\]
Clearly we would like to minimize the approximation error over all possible choices of \(K\) atoms and corresponding set of coefficients \(\bb_{\Lambda}\).
Thus the sparse approximation problem can be cast as the minimization problem

\[
\underset{\substack{\Lambda \subset \Omega \\ | \Lambda | = K}}{\text{minimize}} \;\; \underset{\bb_{\Lambda} \in \CC^K}{\text{minimize}} \;\; \left \| \bx - \sum_{\lambda \in \Lambda} b_{\lambda} \phi_{\lambda} \right \|_2. \tag{18.18}
\]

If we choose a particular \(\Lambda\), then the inner minimization problem becomes a straightforward least squares problem. But there are \(\binom{D}{K}\) possible choices of \(\Lambda\), and solving the inner least squares problem for each of them becomes prohibitively expensive.
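To make this combinatorial cost concrete, here is a minimal NumPy sketch (the function name and demo values are our own, not from the text) that solves the inner least squares problem for every candidate support. It is usable only for tiny \(D\) and \(K\), which is exactly the point.

```python
import itertools

import numpy as np


def sparse_approx_brute_force(Phi, x, K):
    """Exhaustive (D, K)-SPARSE approximation: one least squares
    problem per size-K support, binom(D, K) subproblems in total."""
    N, D = Phi.shape
    best_err, best_support, best_b = np.inf, None, None
    for support in itertools.combinations(range(D), K):
        cols = Phi[:, list(support)]                  # N x K submatrix
        b, *_ = np.linalg.lstsq(cols, x, rcond=None)  # inner least squares
        err = np.linalg.norm(x - cols @ b)
        if err < best_err:
            best_err, best_support, best_b = err, support, b
    return best_support, best_b, best_err


# Tiny demo: random unit-norm dictionary with D > N.
rng = np.random.default_rng(0)
N, D, K = 8, 16, 2
Phi = rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=0)            # normalize the atoms
x = Phi[:, [3, 11]] @ np.array([1.0, -2.0])   # an exactly 2-sparse signal
support, b, err = sparse_approx_brute_force(Phi, x, K)
print(support, err)                           # expect (3, 11) with err ~ 0
```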
We reemphasize here that in this formulation we are using a fixed dictionary \(\bDDD\) while the vector \(\bx \in \CC^N\) is arbitrary.
This problem is known as the \((\bDDD, K)\)-SPARSE approximation problem.
A related problem is the \((\bDDD, K)\)-EXACT-SPARSE problem, where it is known a priori that \(\bx\) is a linear combination of at most \(K\) atoms from the given dictionary \(\bDDD\); i.e., \(\bx\) is a \(K\)-sparse signal as defined above for the dictionary \(\bDDD\).
This formulation simplifies the minimization problem (18.18), since it is known a priori that an approximation error of \(0\) can be achieved for \(K\)-sparse signals. The problem reduces to finding, among the \(\binom{D}{K}\) possible \(K\)-sparse subspaces, those which provide a \(K\)-sparse representation of \(\bx\), and choosing one of them. It is imperative to note that even the \(K\)-sparse representation need not be unique.
Clearly the EXACT-SPARSE problem is simpler than the SPARSE approximation problem. Thus if the EXACT-SPARSE problem is NP-hard, then so is the harder SPARSE approximation problem. It is expected that solving the EXACT-SPARSE problem will provide insights into solving the SPARSE problem.
In Theorem 18.8 we identified conditions under which a sparse representation for a given vector \(\bx\) in a two-ortho-basis is unique. It would be useful to obtain similar conditions for general dictionaries; such conditions would help us guarantee the uniqueness of the EXACT-SPARSE problem.
18.5.4. Synthesis and Analysis#
The atoms of a dictionary \(\bDDD\) can be organized into an \(N \times D\) matrix as follows:

\[
\Phi = \begin{bmatrix} \phi_{\omega_1} & \phi_{\omega_2} & \dots & \phi_{\omega_D} \end{bmatrix}
\]

where \(\Omega = \{\omega_1, \omega_2, \dots, \omega_D\}\) is the index set for the atoms of \(\bDDD\). We recall that \(\phi_{\omega} \in \CC^N\); hence each atom has a column vector representation in the standard basis for \(\CC^N\). The order of columns doesn't matter as long as it remains fixed once chosen.
Thus in matrix terminology a representation of \(\bx \in \CC^N\) in the dictionary can be written as

\[
\bx = \Phi \bb
\]

where \(\bb \in \CC^D\) is a vector of coefficients to produce a superposition \(\bx\) from the atoms of dictionary \(\bDDD\). Clearly with \(D > N\), \(\bb\) is not unique. Rather, for every vector \(\bz \in \NullSpace(\Phi)\), we have:

\[
\bx = \Phi (\bb + \bz).
\]
(Synthesis matrix)
The matrix \(\Phi\) is called a synthesis matrix since \(\bx\) is synthesized from the columns of \(\Phi\) with the coefficient vector \(\bb\).
We can also view the synthesis matrix \(\Phi\) as a linear operator from \(\CC^D\) to \(\CC^N\).
There is another way to look at \(\bx\) through \(\Phi\).
(Analysis matrix)
The conjugate transpose \(\Phi^H\) of the synthesis matrix \(\Phi\) is called the analysis matrix. It maps a given vector \(\bx \in \CC^N\) to a list of inner products with the dictionary:

\[
\bc = \Phi^H \bx
\]

where \(\bc \in \CC^D\) with entries \(c_k = \phi_{\omega_k}^H \bx\).
Note that in general \(\bx \neq \Phi (\Phi^H \bx)\) unless \(\bDDD\) is an orthonormal basis.
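The following NumPy sketch (all names are our own) illustrates the synthesis and analysis views, the non-uniqueness of coefficients when \(D > N\), and the fact that analysis followed by synthesis does not reproduce \(\bx\) for an overcomplete dictionary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 4, 10

# Random complex dictionary: columns are unit-norm atoms (N x D).
Phi = rng.standard_normal((N, D)) + 1j * rng.standard_normal((N, D))
Phi /= np.linalg.norm(Phi, axis=0)

b = rng.standard_normal(D)        # a coefficient vector in C^D
x = Phi @ b                       # synthesis: C^D -> C^N
c = Phi.conj().T @ x              # analysis:  C^N -> C^D

# For an overcomplete dictionary, x != Phi(Phi^H x) in general.
print(np.allclose(Phi @ c, x))    # False (almost surely)

# Coefficients are not unique: add any vector from NullSpace(Phi).
_, _, Vh = np.linalg.svd(Phi)
z = Vh[N:].conj().T @ rng.standard_normal(D - N)  # null-space vector
print(np.allclose(Phi @ (b + z), x))              # True
```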
(\((\bDDD, K)\)-EXACT-SPARSE problem)
With the help of the synthesis matrix \(\Phi\), the \((\bDDD, K)\)-EXACT-SPARSE problem can now be written as

\[
\begin{aligned}
& \underset{\bb \in \CC^D}{\text{minimize}} & & \| \bb \|_0 \\
& \text{subject to } & & \bx = \Phi \bb \\
& \text{and } & & \| \bb \|_0 \leq K.
\end{aligned}
\]

If \(\bx \notin \Sigma_K\), then the EXACT-SPARSE problem is infeasible. Otherwise, we are looking for the sparsest possible solution.
(\((\bDDD, K)\)-SPARSE approximation problem)
With the help of the synthesis matrix \(\Phi\), the \((\bDDD, K)\)-SPARSE approximation problem can now be written as

\[
\begin{aligned}
& \underset{\bb \in \CC^D}{\text{minimize}} & & \| \bx - \Phi \bb \|_2 \\
& \text{subject to } & & \| \bb \|_0 \leq K.
\end{aligned}
\]

This problem can be visualized as a projection of \(\bx\) onto the set \(\Sigma_K\); hence, it always has a solution.
18.5.5. P-Norms#
This section lists some simple and useful results on relationships between different \(\ell_p\) norms. We also discuss some interesting properties of the \(\ell_1\) norm specifically.
(Complex sign vector)
Let \(\bv \in \CC^N\). Let the entries in \(\bv\) be represented as

\[
v_k = r_k e^{i \theta_k} \quad \Forall 1 \leq k \leq N
\]

where \(r_k = | v_k |\), with the convention that \(\theta_k = 0\) whenever \(r_k = 0\).
The sign vector for \(\bv\), denoted by \(\sgn(\bv)\), is defined as

\[
\sgn(\bv) = \begin{bmatrix} \sgn(v_1) & \dots & \sgn(v_N) \end{bmatrix}^T
\]

where

\[
\sgn(v_k) = \begin{cases} e^{i \theta_k} & \text{if } r_k \neq 0; \\ 0 & \text{if } r_k = 0. \end{cases}
\]
(\(\ell_1\) norm as product of vector with its sign)
For any \(\bv \in \CC^N\):

\[
\| \bv \|_1 = \sgn(\bv)^H \bv.
\]

Proof. This follows from:

\[
\sgn(\bv)^H \bv = \sum_{k=1}^{N} \overline{\sgn(v_k)} v_k = \sum_{k : r_k \neq 0} e^{-i \theta_k} r_k e^{i \theta_k} = \sum_{k=1}^{N} r_k = \| \bv \|_1.
\]

Note that whenever \(v_k = 0\), the corresponding \(0\) entry in \(\sgn(\bv)\) has no effect on the sum.
(Equivalence of \(\ell_1\) and \(\ell_2\) norms)
Suppose \(\bv \in \CC^N\). Then

\[
\| \bv \|_2 \leq \| \bv \|_1 \leq \sqrt{N} \| \bv \|_2.
\]

Proof. For the lower bound, we go as follows:

\[
\| \bv \|_1^2 = \left ( \sum_{k=1}^{N} | v_k | \right )^2 \geq \sum_{k=1}^{N} | v_k |^2 = \| \bv \|_2^2.
\]

This gives us

\[
\| \bv \|_2 \leq \| \bv \|_1.
\]

We can write the \(\ell_1\) norm as

\[
\| \bv \|_1 = \sgn(\bv)^H \bv.
\]

By the Cauchy-Schwarz inequality we have

\[
\sgn(\bv)^H \bv \leq \| \sgn(\bv) \|_2 \| \bv \|_2.
\]

Since \(\sgn(\bv)\) can have at most \(N\) non-zero values, each with magnitude 1,

\[
\| \sgn(\bv) \|_2 \leq \sqrt{N}.
\]

Thus, we get

\[
\| \bv \|_1 \leq \sqrt{N} \| \bv \|_2.
\]
(Equivalence of \(\ell_2\) and \(\ell_{\infty}\) norms)
Let \(\bv \in \CC^N\). Then

\[
\| \bv \|_{\infty} \leq \| \bv \|_2 \leq \sqrt{N} \| \bv \|_{\infty}.
\]

Proof. This follows from:

\[
\| \bv \|_{\infty}^2 = \max_{k} | v_k |^2 \leq \sum_{k=1}^{N} | v_k |^2 = \| \bv \|_2^2 \leq N \max_{k} | v_k |^2 = N \| \bv \|_{\infty}^2.
\]

Thus

\[
\| \bv \|_{\infty} \leq \| \bv \|_2 \leq \sqrt{N} \| \bv \|_{\infty}.
\]
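Both norm equivalences, together with the sign identity above, are easy to sanity-check numerically. A small sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 100
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)

l1, l2, linf = (np.linalg.norm(v, ord=p) for p in (1, 2, np.inf))
sgn = v / np.abs(v)   # safe here: Gaussian entries are nonzero a.s.

assert np.isclose(sgn.conj() @ v, l1)   # ||v||_1 = sgn(v)^H v
assert l2 <= l1 <= np.sqrt(N) * l2      # l1 - l2 equivalence
assert linf <= l2 <= np.sqrt(N) * linf  # l2 - l_inf equivalence
```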
(Relationship between \(p\)-norms)
Let \(\bv \in \CC^N\) and let \(1 \leq p \leq q \leq \infty\). Then

\[
\| \bv \|_q \leq \| \bv \|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \| \bv \|_q.
\]
Proof. TBD
Let \(\bone \in \CC^N\) be the vector of all ones; i.e., \(\bone = (1, \dots, 1)\). Let \(\bv \in \CC^N\) be an arbitrary vector, and let \(| \bv |\) denote the vector of absolute values of entries in \(\bv\); i.e., \(|v|_i = |v_i| \Forall 1 \leq i \leq N\). Then

\[
\bone^T | \bv | = \bone^H | \bv | = \| \bv \|_1.
\]

Proof. This follows from:

\[
\bone^T | \bv | = \sum_{i=1}^{N} | v_i | = \| \bv \|_1.
\]

Finally, since \(\bone\) consists only of real entries, its transpose and Hermitian transpose are the same.
Let \(\OneMat \in \CC^{N \times N}\) be the square matrix of all ones, and let \(\bv \in \CC^N\) be an arbitrary vector. Then

\[
\OneMat | \bv | = \| \bv \|_1 \bone.
\]

Proof. We know that

\[
\OneMat = \bone \bone^T.
\]

Thus,

\[
\OneMat | \bv | = \bone \bone^T | \bv | = \bone \| \bv \|_1 = \| \bv \|_1 \bone.
\]

We used the fact that \(\| \bv \|_1 = \bone^T | \bv |\).
(An upper bound on the \(k\)-th largest value)
The \(k\)-th largest (magnitude) entry in a vector \(\bx \in \CC^N\), denoted by \(x_{(k)}\), obeys

\[
| x_{(k)} | \leq \frac{\| \bx \|_1}{k}.
\]

Proof. Let \(n_1, n_2, \dots, n_N\) be a permutation of \(\{ 1, 2, \dots, N \}\) such that

\[
| x_{n_1} | \geq | x_{n_2} | \geq \dots \geq | x_{n_N} |.
\]

Thus, the \(k\)-th largest entry in \(\bx\) is \(x_{(k)} = x_{n_k}\). Each of the first \(k\) entries in this ordering has magnitude at least \(| x_{n_k} |\); hence

\[
k | x_{n_k} | \leq \sum_{j=1}^{k} | x_{n_j} |.
\]

Obviously

\[
\sum_{j=1}^{k} | x_{n_j} | \leq \sum_{j=1}^{N} | x_{n_j} | = \| \bx \|_1.
\]

Thus

\[
| x_{(k)} | = | x_{n_k} | \leq \frac{\| \bx \|_1}{k}.
\]
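A quick numerical check of this bound (sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(50)

# The k-th largest magnitude is x_sorted[k-1] after a descending sort.
x_sorted = np.sort(np.abs(x))[::-1]
k = np.arange(1, x.size + 1)
assert np.all(x_sorted <= np.linalg.norm(x, 1) / k)
```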
18.5.6. Sparse Signals#
In this subsection we explore some useful properties of \(\Sigma_K\), the set of \(K\)-sparse signals in the standard basis for \(\CC^N\).
We recall that

\[
\Sigma_K = \{ \bx \in \CC^N : \| \bx \|_0 \leq K \}.
\]

We established before that this set is a union of \(\binom{N}{K}\) subspaces of \(\CC^N\), each of which is constructed by an index set \(\Lambda \subset \{1, \dots, N \}\) with \(| \Lambda | = K\) choosing \(K\) specific dimensions of \(\CC^N\).
We first present some results which connect the \(\ell_1\), \(\ell_2\) and \(\ell_{\infty}\) norms of vectors in \(\Sigma_K\).
(Relation between norms of sparse vectors)
Suppose \(\bu \in \Sigma_K\). Then

\[
\frac{\| \bu \|_1}{\sqrt{K}} \leq \| \bu \|_2 \leq \sqrt{K} \| \bu \|_{\infty}.
\]

Proof. Due to Theorem 18.10, we can write the \(\ell_1\) norm as

\[
\| \bu \|_1 = \sgn(\bu)^H \bu.
\]

By the Cauchy-Schwarz inequality we have

\[
\sgn(\bu)^H \bu \leq \| \sgn(\bu) \|_2 \| \bu \|_2.
\]

Since \(\bu \in \Sigma_K\), \(\sgn(\bu)\) can have at most \(K\) non-zero values, each with magnitude 1. Thus, we have

\[
\| \sgn(\bu) \|_2 \leq \sqrt{K}.
\]

Thus we get the lower bound

\[
\| \bu \|_1 \leq \sqrt{K} \| \bu \|_2 \iff \frac{\| \bu \|_1}{\sqrt{K}} \leq \| \bu \|_2.
\]

Now \(| u_i | \leq \max(| u_i |) = \| \bu \|_{\infty}\). So we have

\[
\| \bu \|_2^2 = \sum_{i : u_i \neq 0} | u_i |^2 \leq K \| \bu \|_{\infty}^2
\]

since there are only \(K\) non-zero terms in the expansion of \(\| \bu \|_2^2\). This establishes the upper bound:

\[
\| \bu \|_2 \leq \sqrt{K} \| \bu \|_{\infty}.
\]
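A sketch verifying these bounds on a random \(K\)-sparse vector:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 100, 8

# Build a random K-sparse vector in C^N.
u = np.zeros(N, dtype=complex)
support = rng.choice(N, size=K, replace=False)
u[support] = rng.standard_normal(K) + 1j * rng.standard_normal(K)

l1, l2, linf = (np.linalg.norm(u, ord=p) for p in (1, 2, np.inf))
assert l1 / np.sqrt(K) <= l2 <= np.sqrt(K) * linf
```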
18.5.7. Compressible Signals#
In this subsection, we first look at some general results and definitions related to \(K\)-term approximations of arbitrary signals \(\bx \in \CC^N\). We then define the notion of a compressible signal and study properties related to it.
18.5.7.1. K-term Approximation of General Signals#
(Restriction of a signal)
Let \(\bx \in \CC^N\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let

\[
T = \{ t_1, t_2, \dots, t_{|T|} \}
\]

such that

\[
t_1 < t_2 < \dots < t_{|T|}.
\]

Let \(\bx_T \in \CC^{|T|}\) be defined as

\[
(\bx_T)_i = x_{t_i} \quad \Forall 1 \leq i \leq |T|.
\]

Then \(\bx_T\) is a restriction of the signal \(\bx\) on the index set \(T\).
Alternatively, let \(\bx_T \in \CC^N\) be defined as

\[
(\bx_T)_i = \begin{cases} x_i & \text{if } i \in T; \\ 0 & \text{otherwise}. \end{cases}
\]

In other words, \(\bx_T \in \CC^N\) keeps the entries in \(\bx\) indexed by \(T\) while setting all other entries to 0. Then we say that \(\bx_T\) is obtained by masking \(\bx\) with \(T\).
As an abuse of notation, we will use either of the two definitions whenever we refer to \(\bx_T\). The definition being used should be obvious from the context.
(Restrictions on index sets)
Let

\[
\bx = (-1, 5, 8, 0, 0, -3, 0, 0, 0, 0) \in \CC^{10}
\]

and let \(T = \{ 1, 3, 6, 8 \}\). Then the masked version is

\[
\bx_T = (-1, 0, 8, 0, 0, -3, 0, 0, 0, 0) \in \CC^{10}.
\]

Since \(|T| = 4\), sometimes we will also write the restricted version

\[
\bx_T = (-1, 8, -3, 0) \in \CC^4.
\]
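In NumPy terms (0-based indices; values as in the example above), the restriction is fancy indexing while the mask is a scatter into zeros. A sketch:

```python
import numpy as np

x = np.array([-1, 5, 8, 0, 0, -3, 0, 0, 0, 0], dtype=float)
T = [0, 2, 5, 7]                # 0-based version of T = {1, 3, 6, 8}

x_restricted = x[T]             # restriction: a vector in C^|T|
x_masked = np.zeros_like(x)     # mask: a vector in C^N
x_masked[T] = x[T]

print(x_restricted)             # [-1.  8. -3.  0.]
print(x_masked)                 # [-1.  0.  8.  0.  0. -3.  0.  0.  0.  0.]
```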
(\(K\)-term signal approximation)
Let \(\bx \in \CC^N\) be an arbitrary signal. Consider any index set \(T \subset \{1, \dots, N \}\) with \(|T| = K\). Then \(\bx_T\) is a \(K\)-term approximation of \(\bx\).
Clearly for any \(\bx \in \CC^N\) there are \(\binom{N}{K}\) possible \(K\)-term approximations of \(\bx\).
(\(K\)-term approximation)
Let

\[
\bx = (-1, 5, 8, 0, 0, -3, 0, 0, 0, 0) \in \CC^{10}.
\]

Let \(T = \{ 1, 6 \}\). Then

\[
\bx_T = (-1, 0, 0, 0, 0, -3, 0, 0, 0, 0)
\]

is a \(2\)-term approximation of \(\bx\).
If we choose \(T = \{7, 8, 9, 10\}\), the corresponding \(4\)-term approximation of \(\bx\) is

\[
\bx_T = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0).
\]
(\(K\)-largest entries approximation)
Let \(\bx \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(\bx\) such that

\[
| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.
\]

In case of ties, the order is resolved lexicographically; i.e., if \(|x_i| = |x_j|\) and \(i < j\), then \(i\) appears first in the sequence \(\{ \lambda_k \}\).
Consider the index set \(\Lambda_K = \{ \lambda_1, \lambda_2, \dots, \lambda_K\}\). The masking of \(\bx\) with \(\Lambda_K\), given by \(\bx_{\Lambda_K}\), keeps the \(K\) largest entries of \(\bx\) while setting all other entries to 0. This is known as the \(K\)-largest entries approximation of \(\bx\).
This signal is denoted henceforth as \(\bx|_K\); i.e.,

\[
\bx|_K = \bx_{\Lambda_K}
\]

where \(\Lambda_K\) is the index set corresponding to the \(K\) largest entries in \(\bx\) (magnitude wise).
(Largest entries approximation)
Let

\[
\bx = (-1, 5, 8, 0, 0, -3, 0, 0, 0, 0).
\]

Then

\[
\bx|_1 = (0, 0, 8, 0, 0, 0, 0, 0, 0, 0),
\]

\[
\bx|_2 = (0, 5, 8, 0, 0, 0, 0, 0, 0, 0),
\]

\[
\bx|_3 = (0, 5, 8, 0, 0, -3, 0, 0, 0, 0),
\]

\[
\bx|_4 = (-1, 5, 8, 0, 0, -3, 0, 0, 0, 0) = \bx.
\]

All further \(K\)-largest entries approximations are the same as \(\bx\).
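Computing \(\bx|_K\) amounts to a stable descending sort on magnitudes; a minimal sketch (the helper name is ours):

```python
import numpy as np


def largest_entries_approx(x, K):
    """Return x|_K: keep the K largest-magnitude entries, zero the rest.

    A stable sort on negated magnitudes reproduces the lexicographic
    tie-breaking convention used in the text."""
    idx = np.argsort(-np.abs(x), kind="stable")[:K]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out


x = np.array([-1, 5, 8, 0, 0, -3, 0, 0, 0, 0], dtype=float)
print(largest_entries_approx(x, 2))  # [0. 5. 8. 0. 0. 0. 0. 0. 0. 0.]
```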
A pertinent question at this point is: which \(K\)-term approximation of \(\bx\) is the best one? Certainly, in order to compare two approximations, we need some criterion. Let us choose the \(\ell_p\) norm as the criterion. The next result identifies the best \(K\)-term approximation in the \(\ell_p\) norm sense.
(Best \(K\)-term approximation for \(\ell_p\) norms)
Let \(\bx \in \CC^N\). Let the best \(K\)-term approximation of \(\bx\) be obtained by the following optimization program:

\[
\underset{T \subset \{1, \dots, N\}, \; |T| = K}{\text{maximize}} \;\; \| \bx_T \|_p \tag{18.25}
\]

where \(p \in [1, \infty]\). (For finite \(p\) and the mask version of \(\bx_T\), minimizing the error \(\| \bx - \bx_T \|_p\) is equivalent to maximizing \(\| \bx_T \|_p\), since \(\| \bx - \bx_T \|_p^p = \| \bx \|_p^p - \| \bx_T \|_p^p\).)
Let an optimal solution of this optimization problem be denoted by \(\bx_{T^*}\). Then

\[
\| \bx_{T^*} \|_p = \| \bx|_K \|_p;
\]

i.e., the \(K\)-largest entries approximation of \(\bx\) is an optimal solution to (18.25).
Proof. For \(p=\infty\), the result is obvious. In the following, we focus on \(p \in [1, \infty)\).
We note that maximizing \(\| \bx_T \|_p\) is equivalent to maximizing \(\| \bx_T \|_p^p\).
Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(\bx\) such that

\[
| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.
\]

Further, let \(\{ \omega_1, \dots, \omega_N\}\) be any permutation of \(\{1, \dots, N \}\). Clearly

\[
\sum_{k=1}^{K} | x_{\omega_k} |^p \leq \sum_{k=1}^{K} | x_{\lambda_k} |^p = \| \bx|_K \|_p^p.
\]

Thus if \(T^*\) corresponds to an optimal solution of (18.25), then

\[
\| \bx_{T^*} \|_p^p = \| \bx|_K \|_p^p.
\]

Thus \(\bx|_K\) is an optimal solution to (18.25).
This result establishes that whenever we look for a best \(K\)-term approximation of \(\bx\) under any \(\ell_p\) norm, all we have to do is pick the \(K\) largest entries (by magnitude) in \(\bx\).
(Restriction of a matrix)
Let \(\Phi \in \CC^{M \times N}\). Let \(T \subset \{ 1, 2, \dots, N\}\) be any index set. Further let

\[
T = \{ t_1, t_2, \dots, t_{|T|} \}
\]

such that

\[
t_1 < t_2 < \dots < t_{|T|}.
\]

Let \(\Phi_T \in \CC^{M \times |T|}\) be defined as

\[
\Phi_T = \begin{bmatrix} \phi_{t_1} & \phi_{t_2} & \dots & \phi_{t_{|T|}} \end{bmatrix}
\]

where \(\phi_i\) denotes the \(i\)-th column of \(\Phi\). Then \(\Phi_T\) is a restriction of the matrix \(\Phi\) on the index set \(T\).
Alternatively, let \(\Phi_T \in \CC^{M \times N}\) be defined by its columns:

\[
(\Phi_T)_i = \begin{cases} \phi_i & \text{if } i \in T; \\ \bzero & \text{otherwise}. \end{cases}
\]

In other words, \(\Phi_T \in \CC^{M \times N}\) keeps the columns of \(\Phi\) indexed by \(T\) while setting all other columns to \(\bzero\). Then we say that \(\Phi_T\) is obtained by masking \(\Phi\) with \(T\).
As an abuse of notation, we will use either of the two definitions whenever we refer to \(\Phi_T\). The definition being used should be obvious from the context.
Let \(\supp(\bx) = \Lambda\). Then

\[
\Phi \bx = \Phi_{\Lambda} \bx_{\Lambda}.
\]

Proof. This follows from:

\[
\Phi \bx = \sum_{i=1}^{N} x_i \phi_i = \sum_{i \in \Lambda} x_i \phi_i = \Phi_{\Lambda} \bx_{\Lambda}.
\]

The result remains valid whether we use the restriction or the mask version of the \(\bx_{\Lambda}\) notation, as long as the same version is used for both \(\Phi\) and \(\bx\).
Let \(S\) and \(T\) be two disjoint index sets such that for some \(\bx \in \CC^N\)

\[
\bx = \bx_S + \bx_T
\]

using the mask version of the \(\bx_{\Lambda}\) notation. Then the following holds:

\[
\Phi \bx = \Phi_S \bx_S + \Phi_T \bx_T.
\]

Proof. Straightforward application of Theorem 18.19:

\[
\Phi \bx = \Phi (\bx_S + \bx_T) = \Phi \bx_S + \Phi \bx_T = \Phi_S \bx_S + \Phi_T \bx_T.
\]
Let \(T\) be any index set. Let \(\Phi \in \CC^{M \times N}\) and \(\by \in \CC^M\). Then

\[
(\Phi^H \by)_T = \Phi_T^H \by.
\]

Proof. Note that

\[
\Phi^H \by = \begin{bmatrix} \phi_1^H \by & \phi_2^H \by & \dots & \phi_N^H \by \end{bmatrix}^T.
\]

Now let \(T = \{ t_1, t_2, \dots, t_{|T|} \}\). Then

\[
(\Phi^H \by)_T = \begin{bmatrix} \phi_{t_1}^H \by & \dots & \phi_{t_{|T|}}^H \by \end{bmatrix}^T = \Phi_T^H \by.
\]

The result remains valid whether we use the restriction or the mask version of the \(\Phi_T\) notation.
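These restriction identities are easy to verify numerically for real matrices (where \(\Phi^H = \Phi^T\)); a sketch:

```python
import numpy as np

rng = np.random.default_rng(11)
M, N = 6, 12
Phi = rng.standard_normal((M, N))
y = rng.standard_normal(M)

Lam = [1, 4, 9]                 # support of a 3-sparse x (0-based)
x = np.zeros(N)
x[Lam] = rng.standard_normal(3)

# Phi x = Phi_Lam x_Lam (restriction version).
assert np.allclose(Phi @ x, Phi[:, Lam] @ x[Lam])
# (Phi^H y)_T = Phi_T^H y, with Phi^H = Phi^T for real Phi.
assert np.allclose((Phi.T @ y)[Lam], Phi[:, Lam].T @ y)
```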
18.5.7.2. Compressible Signals#
We will now define the notion of a compressible signal in terms of the decay rate of the magnitudes of its entries when sorted in descending order.
(Compressible signal)
Let \(\bx \in \CC^N\) be an arbitrary signal. Let \(\lambda_1, \dots, \lambda_N\) be indices of entries in \(\bx\) such that

\[
| x_{\lambda_1} | \geq | x_{\lambda_2} | \geq \dots \geq | x_{\lambda_N} |.
\]

In case of ties, the order is resolved lexicographically; i.e., if \(|x_i| = |x_j|\) and \(i < j\), then \(i\) appears first in the sequence \(\{ \lambda_k \}\). Define

\[
\widehat{x}_k = x_{\lambda_k} \quad \Forall 1 \leq k \leq N. \tag{18.28}
\]

The signal \(\bx\) is called \(p\)-compressible with magnitude \(R\) if there exists \(p \in (0, 1]\) such that

\[
| \widehat{x}_k | \leq R \cdot k^{-\frac{1}{p}} \quad \Forall 1 \leq k \leq N. \tag{18.29}
\]
(\(1\)-compressible signals)
Let \(\bx\) be \(p\)-compressible with \(p=1\). Then

\[
\| \bx \|_1 \leq R (1 + \ln N).
\]

Proof. Recalling \(\widehat{x}\) from (18.28), it is straightforward to see that

\[
\| \bx \|_1 = \sum_{k=1}^{N} | \widehat{x}_k |
\]

since the \(\ell_1\) norm doesn't depend on the ordering of entries in \(\bx\).
Now, since \(\bx\) is \(1\)-compressible, from (18.29) we have

\[
| \widehat{x}_k | \leq R k^{-1}.
\]

This gives us

\[
\| \bx \|_1 \leq \sum_{k=1}^{N} R k^{-1} = R \sum_{k=1}^{N} \frac{1}{k}.
\]

The sum on the R.H.S. is the \(N\)-th Harmonic number (sum of reciprocals of the first \(N\) natural numbers). A simple upper bound on Harmonic numbers is

\[
H_N \leq 1 + \ln N.
\]

This completes the proof.
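A sketch checking this bound on the worst-case \(1\)-compressible signal, with \(|\widehat{x}_k| = R/k\) exactly:

```python
import numpy as np

N, R = 1000, 5.0
k = np.arange(1, N + 1)
x = R / k                        # worst case allowed by (18.29) for p = 1

assert np.linalg.norm(x, 1) <= R * (1 + np.log(N))
print(np.linalg.norm(x, 1), R * (1 + np.log(N)))  # ~37.43 <= ~39.54
```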
We now demonstrate how a compressible signal is well approximated by a sparse signal.
(Sparse approximation of compressible signals)
Let \(\bx\) be a \(p\)-compressible signal with \(p \in (0, 1)\), and let \(\bx|_K\) be its best \(K\)-term approximation. Then the \(\ell_1\) norm of the approximation error satisfies

\[
\| \bx - \bx|_K \|_1 \leq C_p R K^{1 - \frac{1}{p}}
\]

with

\[
C_p = \left ( \frac{1}{p} - 1 \right )^{-1}.
\]

Moreover, the \(\ell_2\) norm of the approximation error satisfies

\[
\| \bx - \bx|_K \|_2 \leq D_p R K^{\frac{1}{2} - \frac{1}{p}}
\]

with

\[
D_p = \left ( \frac{2}{p} - 1 \right )^{-\frac{1}{2}}.
\]
Proof. Expanding the \(\ell_1\) approximation error:

\[
\| \bx - \bx|_K \|_1 = \sum_{k=K+1}^{N} | \widehat{x}_k | \leq R \sum_{k=K+1}^{N} k^{-\frac{1}{p}}.
\]

We now bound the R.H.S. sum with an integral:

\[
\sum_{k=K+1}^{N} k^{-\frac{1}{p}} \leq \int_{K}^{\infty} t^{-\frac{1}{p}} d t = \left [ \frac{t^{1 - \frac{1}{p}}}{1 - \frac{1}{p}} \right ]_{K}^{\infty}.
\]

Now, since \(1 - \frac{1}{p} < 0\) for \(p \in (0, 1)\),

\[
\left [ \frac{t^{1 - \frac{1}{p}}}{1 - \frac{1}{p}} \right ]_{K}^{\infty} = \left ( \frac{1}{p} - 1 \right )^{-1} K^{1 - \frac{1}{p}} = C_p K^{1 - \frac{1}{p}}.
\]

We can similarly show the result for the \(\ell_2\) norm.
(Sparse approximation for \(\frac{1}{2}\)-compressible signals)
Let \(p = \frac{1}{2}\). Then

\[
C_p = \left ( \frac{1}{p} - 1 \right )^{-1} = 1 \quad \text{and} \quad D_p = \left ( \frac{2}{p} - 1 \right )^{-\frac{1}{2}} = \frac{1}{\sqrt{3}}.
\]

Hence

\[
\| \bx - \bx|_K \|_1 \leq R K^{-1}
\]

and

\[
\| \bx - \bx|_K \|_2 \leq \frac{R}{\sqrt{3}} K^{-\frac{3}{2}}.
\]

Both \(\ell_1\) and \(\ell_2\) approximation error bounds decrease polynomially as \(K\) increases, with the \(\ell_2\) error decaying faster (as \(K^{-3/2}\) versus \(K^{-1}\)).
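A sketch comparing the two decay rates on a synthetic \(\frac{1}{2}\)-compressible signal with \(|\widehat{x}_k| = R k^{-2}\):

```python
import numpy as np

N, R, p = 10_000, 1.0, 0.5
k = np.arange(1, N + 1)
x = R * k ** (-1.0 / p)          # already sorted in decreasing order

for K in (10, 100, 1000):
    tail = x[K:]                 # entries discarded by x|_K
    err1 = np.linalg.norm(tail, 1)
    err2 = np.linalg.norm(tail, 2)
    print(K, err1 <= R / K, err2 <= R / np.sqrt(3) * K ** -1.5)
    # Both bounds hold; err2 shrinks faster than err1 as K grows.
```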