20.3. Orthogonal Matching Pursuit#

Orthogonal Matching Pursuit is a greedy algorithm for computing sparse approximations of a signal in a suitable dictionary. The goal of this algorithm is to select a small number of atoms from the dictionary which can provide a representation for the signal with low error. The algorithm is iterative in structure. In each step we greedily select one new atom from the dictionary which can maximally reduce the representation error. Once we have enough atoms selected such that the representation error is below an acceptable bound, we stop. The algorithm is presented in Algorithm 20.1.

Algorithm 20.1 (Orthogonal matching pursuit for sparse approximation)

Inputs:

  • Signal dictionary $D \in \mathbb{C}^{N \times D}$ with $\mathrm{spark}(D) > 2K$

  • Threshold $\epsilon_0$

  • Signal $x \in \mathbb{C}^N$

Outputs:

  • $K$-sparse approximate representation $a \in \Sigma_K \subseteq \mathbb{C}^D$ satisfying $\| x - D a \|_2 \leq \epsilon_0$

  • $S$, the support of the sparse solution identified by the algorithm

Initialization:

  1. $k \leftarrow 0$ # Iteration counter

  2. $a^0 \leftarrow 0$ # Solution vector $a \in \mathbb{C}^D$

  3. $r^0 \leftarrow x - D a^0 = x$ # Residual $r \in \mathbb{C}^N$

  4. $S^0 \leftarrow \emptyset$ # Solution support $S = \mathrm{supp}(a)$

Algorithm:

  1. If $\| r^k \|_2 \leq \epsilon_0$, break.

  2. $k \leftarrow k + 1$ # Next iteration

  3. $z \leftarrow D^H r^{k-1}$ # Sweep

  4. Update support

    1. Find $j_0$ that maximizes $|z_j|$ over $j \notin S^{k-1}$

    2. $S^k \leftarrow S^{k-1} \cup \{ j_0 \}$

  5. Update provisional solution

    $$a^k \leftarrow \underset{a}{\text{minimize}} \; \| D a - x \|_2^2 \text{ subject to } \mathrm{supp}(a) = S^k.$$

  6. $r^k \leftarrow x - D a^k = x - D_{S^k} a^k_{S^k}$ # Update residual

  7. Go to step 1.
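The listing below is a minimal NumPy sketch of Algorithm 20.1. The function name `omp` and its keyword arguments are our own choices (not part of the text); the least squares subproblem in step 5 is solved with `np.linalg.lstsq`.

```python
import numpy as np

def omp(D, x, eps0, max_iters=None):
    """Orthogonal matching pursuit (a sketch of Algorithm 20.1).

    D    : (N, M) dictionary whose columns are unit norm atoms
    x    : (N,) signal to be approximated
    eps0 : stop once the residual satisfies ||x - D a||_2 <= eps0
    """
    N, M = D.shape
    if max_iters is None:
        max_iters = M
    a = np.zeros(M, dtype=D.dtype)           # provisional solution a^k
    r = x.copy()                              # residual r^0 = x
    S = []                                    # support S^0 = empty set
    for _ in range(max_iters):
        if np.linalg.norm(r) <= eps0:         # step 1: stopping criterion
            break
        z = D.conj().T @ r                    # step 3: sweep
        z[S] = 0                              # ignore atoms already selected
        j0 = int(np.argmax(np.abs(z)))        # step 4: best matching atom
        S.append(j0)
        D_S = D[:, S]                         # step 5: least squares over the support
        a_S, *_ = np.linalg.lstsq(D_S, x, rcond=None)
        a = np.zeros(M, dtype=D.dtype)
        a[S] = a_S
        r = x - D_S @ a_S                     # step 6: update residual
    return a, S
```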

20.3.1. The Algorithm#

  • We start with the initial estimate of the solution $a = 0$.

  • We also maintain the support of $a$; i.e., the set of indices at which $a$ is non-zero.

  • We start with an empty support.

  • In the $k$-th iteration we attempt to reduce the difference between the actual signal $x$ and the approximation based on the current solution $a^{k-1}$, given by $r^{k-1} = x - D a^{k-1}$.

  • We do this by adding to the support a new index $j_0$, corresponding to the column $d_{j_0}$ which most closely matches our current residual.

  • We include this index in the support of $a$ and estimate the new solution vector $a^k$.

  • We then compute the new residual $r^k$.

  • We stop when the residual magnitude falls below the threshold $\epsilon_0$ defined by us.

Each iteration of the algorithm consists of the following stages:

  1. [Sweep] We try to find the atom from the dictionary that best matches the current residual.

    1. The best matching atom is selected using the least squares error principle.

    2. For each column $d_j$ in our dictionary, we measure the projection of the residual from the previous iteration $r^{k-1}$ onto the column.

    3. We compute the magnitude of the error between the projection and the residual.

    4. The square of the minimum error for $d_j$ is given by:

      $$\epsilon^2(j) = \| r^{k-1} \|_2^2 - | d_j^H r^{k-1} |^2.$$

    5. We note that minimizing $\epsilon(j)$ is equivalent to maximizing the magnitude of the inner product of $d_j$ with $r^{k-1}$.

  2. [Update support] Ignoring the columns which have already been included in the support, we pick the column which most closely resembles the residual of the previous iteration; i.e., for which the magnitude of error is minimum. We include the index of this column, $j_0$, in the support set $S^k$.

  3. [Update provisional solution]

    1. In this step we find the solution minimizing $\| D a - x \|_2$ over the support $S^k$ as our next candidate solution vector.

    2. By keeping $a_i = 0$ for $i \notin S^k$ we are leaving the corresponding columns $d_i$ out of our calculations.

    3. Thus we pick only the columns of $D$ specified by $S^k$.

    4. Let us call this subdictionary $D_{S^k}$.

    5. The size of this subdictionary is $N \times | S^k | = N \times k$.

    6. Let us call the corresponding subvector $a_{S^k}$.

      Suppose $D = 4$, so that $D = [\, d_1 \; d_2 \; d_3 \; d_4 \,]$. Let $S^k = \{ 1, 4 \}$. Then $D_{S^k} = [\, d_1 \; d_4 \,]$ and $a_{S^k} = (a_1, a_4)$.

    7. Our minimization problem then reduces to minimizing $\| D_{S^k} a_{S^k} - x \|_2$.

    8. We use the standard least squares estimate for the coefficients of $a_{S^k}$ over these indices.

    9. We put $a_{S^k}$ back into the full vector to obtain our new solution estimate $a^k$.

      In the running example, after obtaining the values $a_1$ and $a_4$, we will have $a^k = (a_1, 0, 0, a_4)$.

    10. The solution to this minimization problem is given by

      $$D_{S^k}^H ( D_{S^k} a_{S^k} - x ) = 0 \implies a_{S^k} = ( D_{S^k}^H D_{S^k} )^{-1} D_{S^k}^H x.$$

    11. Interestingly we note that $r^k = x - D a^k = x - D_{S^k} a_{S^k}$, thus

      $$D_{S^k}^H r^k = 0,$$

      which means that the columns in $D_{S^k}$, i.e. the columns that are part of the support $S^k$, are necessarily orthogonal to the residual $r^k$.

    12. This implies that these columns will not be considered in the coming iterations for extending the support.

    13. This orthogonality is the reason behind the name Orthogonal Matching Pursuit (see the numerical check after this list).

  4. [Update residual] We finally update the residual vector to $r^k$ based on the new solution estimate.
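The orthogonality property in items 11–13 is easy to check numerically. The sketch below assumes the `omp` function from the earlier listing; the dictionary and signal here are randomly generated stand-ins, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 20))
D /= np.linalg.norm(D, axis=0)       # normalize columns to unit norm
x = rng.standard_normal(10)

a, S = omp(D, x, eps0=1e-6, max_iters=5)
r = x - D @ a
# The atoms already in the support are orthogonal to the residual,
# so they can never be selected again in later sweeps.
print(np.abs(D[:, S].conj().T @ r))  # all entries ~ 0 (up to round-off)
```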

Example 20.1 ((D,K)-EXACT-SPARSE reconstruction with OMP)

Let us consider a dictionary of size $10 \times 20$. Thus $N = 10$ and $D = 20$. The dictionary $D = [\, D_a \; D_b \,]$ consists of two $10 \times 10$ blocks $D_a$ and $D_b$ whose entries are all of the form $\pm \frac{1}{\sqrt{10}}$, with a fixed sign pattern that gives the rank and spark stated below.

  1. You may verify that each column is unit norm.

  2. It is known that $\mathrm{rank}(D) = 10$ and $\mathrm{spark}(D) = 6$.

  3. Thus if a signal $x$ has a $2$-sparse representation in $D$ then the representation is necessarily unique.

  4. We now consider a signal $x$ given by

    $$x = (4.74342, \; 4.74342, \; 1.58114, \; 4.74342, \; 1.58114, \; 1.58114, \; 4.74342, \; 1.58114, \; 4.74342, \; 4.74342).$$

    For saving space, we have written it as an $n$-tuple. You should treat it as a column vector of size $10 \times 1$.

  5. It is known that the vector has a two sparse representation in D.

  6. Let us go through the steps of OMP and see how it works.

  7. In step 0, $r^0 = x$, $a^0 = 0$, and $S^0 = \emptyset$.

  8. We now compute the absolute values of the inner products of $r^0$ with each of the columns. They are given by

    $$(4, \; 4, \; 4, \; 7, \; 3, \; 1, \; 11, \; 1, \; 2, \; 1, \; 2, \; 1, \; 7, \; 0, \; 2, \; 4, \; 0, \; 2, \; 1, \; 3).$$
  9. We quickly note that the maximum occurs at index 7 with value 11.

  10. We modify our support to $S^1 = \{ 7 \}$.

  11. We now solve the least squares problem

    $$\underset{a_7}{\text{minimize}} \; \| x - d_7 a_7 \|_2^2.$$

  12. The solution gives us $a_7 = 11.00$.

  13. Thus we get

    $$a^1 = (0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0).$$

    Again, note that to save space we have presented $a^1$ as a tuple. You should consider it as a $20 \times 1$ column vector.

  14. This leaves us the residual as

    $$r^1 = x - D a^1 = (1.26491, \; 1.26491, \; 1.89737, \; 1.26491, \; 1.89737, \; 1.89737, \; 1.26491, \; 1.89737, \; 1.26491, \; 1.26491).$$

  15. We can cross check that the residual is indeed orthogonal to the columns already selected, for

    $$\langle r^1, d_7 \rangle = 0.$$
  16. Next we compute the inner products of $r^1$ with all the columns in $D$ and take absolute values.

  17. They are given by

    $$(0.4, \; 0.4, \; 0.4, \; 0.4, \; 0.8, \; 1.2, \; 0, \; 1.2, \; 2, \; 1.2, \; 2.4, \; 3.2, \; 4.8, \; 0, \; 2, \; 0.4, \; 0, \; 2, \; 1.2, \; 0.8).$$
  18. We quickly note that the maximum occurs at index 13 with value 4.8.

  19. We modify our support to $S^2 = \{ 7, 13 \}$.

  20. We now solve the least squares problem

    $$\underset{a_7, a_{13}}{\text{minimize}} \; \left\| x - [\, d_7 \; d_{13} \,] \begin{bmatrix} a_7 \\ a_{13} \end{bmatrix} \right\|_2^2.$$

  21. This gives us $a_7 = 10$ and $a_{13} = 5$.

  22. Thus we get

    $$a^2 = (0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0).$$

  23. Finally the residual we get at step 2 is

    $$r^2 = x - D a^2 = 10^{-14} \times (0, \; 0, \; 0.111022, \; 0, \; 0.111022, \; 0.111022, \; 0, \; 0.111022, \; 0, \; 0).$$
  24. The magnitude of the residual is very small.

  25. We conclude that our OMP algorithm has converged and we have been able to recover the exact 2-sparse representation of $x$ in $D$.
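A run of this kind is easy to reproduce with the `omp` sketch from earlier. Since the exact sign pattern of $D_a$ and $D_b$ is not reproduced above, the snippet below substitutes a randomly drawn $\pm 1/\sqrt{10}$ dictionary of the same size, so the intermediate numbers will differ from the example; with high probability OMP still recovers the 2-sparse representation exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, K = 10, 20, 2

# Stand-in for the dictionary of Example 20.1: same size and 1/sqrt(10)
# scaling, but a random sign pattern rather than the one used in the text.
D = rng.choice([-1.0, 1.0], size=(N, M)) / np.sqrt(N)

a_true = np.zeros(M)
a_true[[6, 12]] = [10.0, 5.0]       # a 2-sparse representation (atoms 7 and 13)
x = D @ a_true

a_hat, S = omp(D, x, eps0=1e-10)
print(sorted(S))                     # typically [6, 12]
print(np.allclose(a_hat, a_true))    # True when the recovery is exact
```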

20.3.2. Exact Recovery Conditions#

In this subsection, following [77], we will look closely at some conditions under which OMP is guaranteed to recover the solution of the (D,K)-EXACT-SPARSE problem.

  1. It is given that $x = D a$ where $a$ contains at most $K$ non-zero entries.

  2. Both the support and the entries of $a$ are known for the purposes of the analysis and can be used to verify the correctness of OMP. Note that $a$ itself won't be given to OMP.

  3. Let $\Lambda_{opt} = \mathrm{supp}(a)$; i.e., the set of indices at which the optimal representation $a$ has non-zero entries.

  4. Then we can write

    $$x = \sum_{i \in \Lambda_{opt}} a_i d_i.$$
  5. From the dictionary $D$ we can extract an $N \times K$ matrix $D_{opt}$ whose columns are indexed by $\Lambda_{opt}$:

    $$D_{opt} \triangleq [\, d_{\lambda_1} \; \cdots \; d_{\lambda_K} \,]$$

    where $\lambda_i \in \Lambda_{opt}$.

  6. Thus we can also write

    $$x = D_{opt} a_{opt}$$

    where $a_{opt} \in \mathbb{C}^K$ is a vector of $K$ complex entries.

  7. The columns of $D_{opt}$ are linearly independent: since $\mathrm{spark}(D) > 2K$, any $K$ columns of $D$ are linearly independent. Hence $D_{opt}$ has full column rank.

  8. We define another matrix $H_{opt}$ whose columns are the remaining $D - K$ columns of $D$.

  9. Thus $H_{opt}$ consists of the atoms or columns which do not participate in the optimum representation of $x$.

  10. OMP starts with an empty support.

  11. At every step, it picks one column from $D$ and adds it to the support of the approximation.

  12. If we can ensure that it never selects any column from $H_{opt}$, we will be guaranteed that the correct $K$-sparse representation is recovered.

  13. We will use mathematical induction and assume that OMP has succeeded in its first $k$ steps and has chosen $k$ columns from $D_{opt}$ so far.

  14. At this point it is left with the residual $r^k$.

  15. In the $(k+1)$-th iteration, we compute the inner product of $r^k$ with all columns in $D$ and choose the column whose inner product has the largest magnitude.

  16. We note that the maximum absolute inner product of $r^k$ with any of the columns in $H_{opt}$ is given by

    $$\| H_{opt}^H r^k \|_{\infty}.$$

  17. Correspondingly, the maximum absolute inner product of $r^k$ with any of the columns in $D_{opt}$ is given by

    $$\| D_{opt}^H r^k \|_{\infty}.$$

  18. Since we have shown that $r^k$ is orthogonal to the columns already chosen, those columns do not contribute to this term.

  19. In order to make sure that none of the columns in $H_{opt}$ is selected, we need

    $$\| H_{opt}^H r^k \|_{\infty} < \| D_{opt}^H r^k \|_{\infty}.$$

Definition 20.3 (Greedy selection ratio)

We define a ratio

(20.23)
$$\rho(r^k) \triangleq \frac{\| H_{opt}^H r^k \|_{\infty}}{\| D_{opt}^H r^k \|_{\infty}}.$$

This ratio is known as the greedy selection ratio.

  1. We can see that as long as $\rho(r^k) < 1$, OMP will make the right decision at the $(k+1)$-th stage.

  2. If $\rho(r^k) = 1$ then there is no guarantee that OMP will make the right decision.

  3. We will pessimistically assume that OMP makes the wrong decision in such situations.

We note that this definition of $\rho(r^k)$ looks very similar to the matrix $p$-norms defined in Definition 4.146. It is suggested to review the properties of matrix $p$-norms at this point.
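For a concrete feel for the ratio, the sketch below evaluates $\rho$ directly from its definition. It reuses the stand-in dictionary `D` and signal `x` from the sketch after Example 20.1, and takes the optimal support to be the (zero-based) indices `[6, 12]` used there.

```python
import numpy as np

def greedy_selection_ratio(D, opt_support, r):
    """rho(r) from (20.23): largest |<h, r>| over atoms outside the optimal
    support divided by the largest |<d, r>| over atoms inside it."""
    mask = np.zeros(D.shape[1], dtype=bool)
    mask[list(opt_support)] = True
    num = np.linalg.norm(D[:, ~mask].conj().T @ r, np.inf)
    den = np.linalg.norm(D[:, mask].conj().T @ r, np.inf)
    return num / den

# At the start of OMP the residual is r^0 = x; a value below 1 means the
# first atom selected comes from the optimal support.
print(greedy_selection_ratio(D, [6, 12], x))
```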

We now present a condition which guarantees that ρ(rk)<1 is always satisfied.

Theorem 20.17 (A sufficient condition for exact recovery using OMP)

A sufficient condition for Orthogonal Matching Pursuit to resolve $x$ completely in $K$ steps is that

(20.24)
$$\max_{h} \| D_{opt}^{\dagger} h \|_1 < 1,$$

where $h$ ranges over the columns of $H_{opt}$.

Moreover, Orthogonal Matching Pursuit is a correct algorithm for the (D,K)-EXACT-SPARSE problem whenever the condition holds for every superposition of K atoms from D.

Proof.

  1. In (20.24), $D_{opt}^{\dagger}$ is the pseudo-inverse of $D_{opt}$:

    $$D_{opt}^{\dagger} = (D_{opt}^H D_{opt})^{-1} D_{opt}^H.$$
  2. What we need to show is that if (20.24) holds, then $\rho(r^k)$ is always less than 1.

  3. We note that the projection operator onto the column span of $D_{opt}$ is given by

    $$D_{opt} (D_{opt}^H D_{opt})^{-1} D_{opt}^H = (D_{opt}^{\dagger})^H D_{opt}^H.$$
  4. We also note that, by assumption, $x \in \mathcal{C}(D_{opt})$, and the approximation at the $k$-th step satisfies $x^k = D a^k \in \mathcal{C}(D_{opt})$ since the $k$ atoms selected so far all belong to $D_{opt}$.

  5. Hence $r^k = x - x^k$ also belongs to $\mathcal{C}(D_{opt})$.

  6. Thus

    $$r^k = (D_{opt}^{\dagger})^H D_{opt}^H r^k;$$

    i.e., applying the projection operator for $\mathcal{C}(D_{opt})$ to $r^k$ doesn't change it.

  7. Using this we can rewrite $\rho(r^k)$ as

    $$\rho(r^k) = \frac{\| H_{opt}^H r^k \|_{\infty}}{\| D_{opt}^H r^k \|_{\infty}} = \frac{\| H_{opt}^H (D_{opt}^{\dagger})^H D_{opt}^H r^k \|_{\infty}}{\| D_{opt}^H r^k \|_{\infty}}.$$

  8. We see the term $D_{opt}^H r^k$ appearing in both the numerator and the denominator.

  9. Now consider the matrix $H_{opt}^H (D_{opt}^{\dagger})^H$ and recall the definition of the matrix $\infty$-norm from Definition 4.146:

    $$\| A \|_{\infty} = \max_{x \neq 0} \frac{\| A x \|_{\infty}}{\| x \|_{\infty}} \implies \| A x \|_{\infty} \leq \| A \|_{\infty} \| x \|_{\infty} \quad \forall \; x \neq 0.$$

  10. Thus

    $$\| H_{opt}^H (D_{opt}^{\dagger})^H \|_{\infty} \geq \frac{\| H_{opt}^H (D_{opt}^{\dagger})^H D_{opt}^H r^k \|_{\infty}}{\| D_{opt}^H r^k \|_{\infty}},$$

    which gives us

    $$\rho(r^k) \leq \| H_{opt}^H (D_{opt}^{\dagger})^H \|_{\infty} = \| (D_{opt}^{\dagger} H_{opt})^H \|_{\infty}.$$
  11. We recall that $\| A \|_{\infty}$ is the max row sum norm while $\| A \|_1$ is the max column sum norm.

  12. Hence

    $$\| A \|_{\infty} = \| A^T \|_1 = \| A^H \|_1,$$

    which means

    $$\| (D_{opt}^{\dagger} H_{opt})^H \|_{\infty} = \| D_{opt}^{\dagger} H_{opt} \|_1.$$

  13. Thus we have:

    $$\rho(r^k) \leq \| D_{opt}^{\dagger} H_{opt} \|_1.$$

  14. Now the columns of $D_{opt}^{\dagger} H_{opt}$ are nothing but $D_{opt}^{\dagger} h$ where $h$ ranges over the columns of $H_{opt}$.

  15. Thus, in terms of the max column sum norm,

    $$\rho(r^k) \leq \max_{h} \| D_{opt}^{\dagger} h \|_1.$$
  16. Thus, assuming that OMP has made $k$ correct decisions and $r^k$ lies in $\mathcal{C}(D_{opt})$, we have $\rho(r^k) < 1$ whenever

    $$\max_{h} \| D_{opt}^{\dagger} h \|_1 < 1.$$

  17. The initial residual is $r^0 = x$, which lies in the column space of $D_{opt}$ by assumption.

  18. By the argument above, OMP will select an optimal column in each step.

  19. Since the residual is always orthogonal to the columns already selected, OMP will never select the same column twice.

  20. Thus in $K$ steps it will retrieve all $K$ atoms which comprise $x$.
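When the support is known (for instance, in a simulation), the quantity in (20.24) is easy to evaluate directly. A minimal sketch, assuming the stand-in dictionary `D` and the support `[6, 12]` from the earlier examples:

```python
import numpy as np

def exact_recovery_coefficient(D, opt_support):
    """max_h || pinv(D_opt) h ||_1 over the atoms h outside the support.
    A value below 1 is the sufficient condition (20.24) for OMP to
    recover every signal supported on opt_support."""
    mask = np.zeros(D.shape[1], dtype=bool)
    mask[list(opt_support)] = True
    P = np.linalg.pinv(D[:, mask]) @ D[:, ~mask]   # D_opt^dagger H_opt
    return np.max(np.sum(np.abs(P), axis=0))       # max column sum norm

print(exact_recovery_coefficient(D, [6, 12]))      # < 1 means exact recovery
```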

20.3.3. Babel Function Estimates#

There is a small problem with Theorem 20.17. Since we don't know the support of $a$ a priori, it is not possible to verify that

$$\max_{h} \| D_{opt}^{\dagger} h \|_1 < 1$$

holds. Verifying it for every sub-matrix of $K$ columns of $D$ is computationally prohibitive.

It turns out that the Babel function (recall Definition 18.22) can be used to relax the sufficient condition in a manner that makes it computationally tractable. We now show how the Babel function guarantees that the exact recovery condition for OMP holds.

Theorem 20.18 (Babel function based guarantee for exact recovery using OMP)

Suppose that $\mu_1$ is the Babel function for a dictionary $D$. The exact recovery condition holds whenever

(20.25)
$$\mu_1(K-1) + \mu_1(K) < 1.$$

Thus, Orthogonal Matching Pursuit is a correct algorithm for the (D,K)-EXACT-SPARSE problem whenever (20.25) holds.

In other words, for sufficiently small K for which (20.25) holds, OMP will recover any arbitrary superposition of K atoms from D.

Proof.

  1. We can write

    $$\max_h \| D_{opt}^{\dagger} h \|_1 = \max_h \| (D_{opt}^H D_{opt})^{-1} D_{opt}^H h \|_1.$$

  2. We recall from Lemma 4.90 that the operator norm is subordinate; i.e.,

    $$\| A x \|_1 \leq \| A \|_1 \| x \|_1.$$

  3. Thus with $A = (D_{opt}^H D_{opt})^{-1}$ we have

    $$\| (D_{opt}^H D_{opt})^{-1} D_{opt}^H h \|_1 \leq \| (D_{opt}^H D_{opt})^{-1} \|_1 \| D_{opt}^H h \|_1.$$

  4. With this we have

    $$\max_h \| D_{opt}^{\dagger} h \|_1 \leq \| (D_{opt}^H D_{opt})^{-1} \|_1 \max_h \| D_{opt}^H h \|_1.$$
  5. Now let us look at $\| D_{opt}^H h \|_1$ closely.

  6. There are $K$ columns in $D_{opt}$.

  7. For each column we compute its inner product with $h$, and then take the sum of the absolute values of these inner products.

  8. Also recall the definition of the Babel function:

    $$\mu_1(K) = \max_{| \Lambda | = K} \; \max_{h \notin \Lambda} \; \sum_{\lambda \in \Lambda} | \langle h, d_{\lambda} \rangle |.$$

  9. Clearly

    $$\max_h \| D_{opt}^H h \|_1 = \max_h \sum_{\lambda_i \in \Lambda_{opt}} | \langle h, d_{\lambda_i} \rangle | \leq \mu_1(K).$$
  10. We also need to provide a bound on $\| (D_{opt}^H D_{opt})^{-1} \|_1$, which requires more work.

  11. First note that since all columns in $D$ are unit norm, the diagonal of $D_{opt}^H D_{opt}$ contains unit entries.

  12. Thus we can write

    $$D_{opt}^H D_{opt} = I_K + A$$

    where $A$ contains the off-diagonal terms of $D_{opt}^H D_{opt}$.

  13. Looking carefully, each column of $A$ lists the inner products between one atom of $D_{opt}$ and the remaining $K - 1$ atoms.

  14. By the definition of the Babel function,

    $$\| A \|_1 = \max_{1 \leq k \leq K} \sum_{j \neq k} | \langle d_{\lambda_k}, d_{\lambda_j} \rangle | \leq \mu_1(K-1).$$
  15. Whenever $\| A \|_1 < 1$, the Neumann series $\sum_k (-A)^k$ converges to the inverse $(I_K + A)^{-1}$.

  16. Thus we have

    $$\| (D_{opt}^H D_{opt})^{-1} \|_1 = \| (I_K + A)^{-1} \|_1 = \left\| \sum_{k=0}^{\infty} (-A)^k \right\|_1 \leq \sum_{k=0}^{\infty} \| A \|_1^k = \frac{1}{1 - \| A \|_1} \leq \frac{1}{1 - \mu_1(K-1)}.$$

  17. Putting things together we get

    $$\max_h \| D_{opt}^{\dagger} h \|_1 \leq \frac{\mu_1(K)}{1 - \mu_1(K-1)}.$$

  18. Thus, whenever $\mu_1(K-1) + \mu_1(K) < 1$ holds, we have

    $$\max_h \| D_{opt}^{\dagger} h \|_1 < 1.$$
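Unlike (20.24), the Babel function depends only on the dictionary, so condition (20.25) can be checked once for a given $D$ and $K$. Below is a minimal sketch following the standard definition of $\mu_1$, reusing the stand-in dictionary `D` from the earlier examples; note that (20.25) is only a sufficient condition, so it can easily fail for a dictionary on which OMP nevertheless succeeds.

```python
import numpy as np

def babel(D, K):
    """Babel function mu_1(K) for a dictionary with unit norm columns:
    for each atom, sum the K largest |inner products| with the other
    atoms, then take the maximum over atoms."""
    G = np.abs(D.conj().T @ D)
    np.fill_diagonal(G, 0)                    # drop the <d, d> = 1 terms
    G_desc = -np.sort(-G, axis=1)             # each row sorted in descending order
    return np.max(np.sum(G_desc[:, :K], axis=1))

K = 2
print(babel(D, K - 1) + babel(D, K) < 1)      # condition (20.25)
```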

20.3.4. Sparse Approximation Conditions#

We now remove the assumption that $x$ is $K$-sparse in $D$. This is the situation for most real-life signals, as they are usually not exactly sparse.

In this subsection we will look at conditions under which OMP can successfully solve the (D,K)-SPARSE approximation problem as described in Definition 18.9.

  1. Let x be an arbitrary signal.

  2. Suppose that $a_{opt}$ is an optimal $K$-term approximation representation of $x$; i.e., $a_{opt}$ is a solution to (18.20) and the optimal $K$-term approximation of $x$ is given by

    $$x_{opt} = D a_{opt}.$$

    We note that $a_{opt}$ may not be unique.

  3. Let $\Lambda_{opt}$ be the support of $a_{opt}$, which identifies the atoms in $D$ that participate in the $K$-term approximation of $x$.

  4. Let $D_{opt}$ be the subdictionary consisting of the columns of $D$ indexed by $\Lambda_{opt}$.

  5. We assume that the columns of $D_{opt}$ are linearly independent. This is easily established: if any atom in this set were linearly dependent on the other atoms, we could find a sparser solution with the same approximation error, which would contradict the fact that $a_{opt}$ is optimal.

  6. Again, let $H_{opt}$ be the matrix of the $(D - K)$ columns which are not indexed by $\Lambda_{opt}$.

  7. We note that if $\Lambda_{opt}$ is identified, then finding $a_{opt}$ is a straightforward least squares problem.

We now present a condition under which Orthogonal Matching Pursuit is able to recover the optimal atoms.

Theorem 20.19

Assume that $\mu_1(K) < \frac{1}{2}$, and suppose that at the $k$-th iteration, the support $S^k$ for $a^k$ consists only of atoms from an optimal $K$-term approximation of the signal $x$. At step $(k+1)$, Orthogonal Matching Pursuit will recover another atom indexed by $\Lambda_{opt}$ whenever

(20.26)
$$\| x - D a^k \|_2 > \sqrt{1 + \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}} \; \| x - D a_{opt} \|_2.$$

A few remarks are in order.

  1. $\| x - D a^k \|_2$ is the approximation error norm at the $k$-th iteration.

  2. $\| x - D a_{opt} \|_2$ is the optimum $K$-term approximation error.

  3. The theorem says that OMP makes absolute progress whenever the current approximation error is larger than the optimum error by the factor given above.

  4. As a result of this theorem, we note that every optimal $K$-term approximation of $x$ contains the same kernel of atoms.

  5. The optimum error is independent of the choice of atoms in the $K$-term approximation (since it is optimum).

  6. The initial error is also independent of the choice of atoms (since the initial support is empty).

  7. OMP always selects the same set of atoms by design.

Proof.

  1. Let us assume that after $k$ steps, OMP has recovered an approximation $x^k$ given by

    $$x^k = D a^k = D_{S^k} a^k_{S^k}$$

    where $S^k = \mathrm{supp}(a^k)$ chooses $k$ columns from $D$, all of which belong to $D_{opt}$.

  2. Let the residual at the $k$-th stage be

    $$r^k = x - x^k = x - D_{S^k} a^k_{S^k}.$$

  3. Recalling from the previous section, a sufficient condition for recovering another optimal atom is

    $$\rho(r^k) = \frac{\| H_{opt}^H r^k \|_{\infty}}{\| D_{opt}^H r^k \|_{\infty}} < 1.$$
  4. One difference from the previous section is that $r^k$ need not belong to $\mathcal{C}(D_{opt})$.

  5. We can write

    $$r^k = x - x^k = (x - x_{opt}) + (x_{opt} - x^k).$$
  6. Note that $(x - x_{opt})$ is nothing but the residual left by the optimal $K$-term approximation.

  7. We also note that, since a least squares residual is always orthogonal to the columns used in the approximation, and $x_{opt}$ is the best approximation of $x$ over the atoms indexed by $\Lambda_{opt}$, we have

    $$D_{opt}^H (x - x_{opt}) = 0.$$
  8. We will now use these expressions to simplify $\rho(r^k)$:

    $$\begin{aligned}
    \rho(r^k) &= \frac{\| H_{opt}^H r^k \|_{\infty}}{\| D_{opt}^H r^k \|_{\infty}}
    = \frac{\| H_{opt}^H (x - x_{opt}) + H_{opt}^H (x_{opt} - x^k) \|_{\infty}}{\| D_{opt}^H (x - x_{opt}) + D_{opt}^H (x_{opt} - x^k) \|_{\infty}}\\
    &= \frac{\| H_{opt}^H (x - x_{opt}) + H_{opt}^H (x_{opt} - x^k) \|_{\infty}}{\| D_{opt}^H (x_{opt} - x^k) \|_{\infty}}\\
    &\leq \frac{\| H_{opt}^H (x - x_{opt}) \|_{\infty}}{\| D_{opt}^H (x_{opt} - x^k) \|_{\infty}} + \frac{\| H_{opt}^H (x_{opt} - x^k) \|_{\infty}}{\| D_{opt}^H (x_{opt} - x^k) \|_{\infty}}.
    \end{aligned}$$

  9. We now define two new terms

    $$\rho_{err}(r^k) \triangleq \frac{\| H_{opt}^H (x - x_{opt}) \|_{\infty}}{\| D_{opt}^H (x_{opt} - x^k) \|_{\infty}}$$

    and

    $$\rho_{opt}(r^k) \triangleq \frac{\| H_{opt}^H (x_{opt} - x^k) \|_{\infty}}{\| D_{opt}^H (x_{opt} - x^k) \|_{\infty}}.$$

  10. With these we have

    (20.27)
    $$\rho(r^k) \leq \rho_{opt}(r^k) + \rho_{err}(r^k).$$
  11. Now $x_{opt}$ has an exact $K$-term representation in $D$ given by $a_{opt}$.

  12. Hence $\rho_{opt}(r^k)$ is nothing but $\rho(r^k)$ for the corresponding EXACT-SPARSE problem (with $x_{opt}$ playing the role of the signal).

  13. From the proof of Theorem 20.18 we recall

    $$\rho_{opt}(r^k) \leq \frac{\mu_1(K)}{1 - \mu_1(K-1)} \leq \frac{\mu_1(K)}{1 - \mu_1(K)}$$

    since

    $$\mu_1(K-1) \leq \mu_1(K) \implies 1 - \mu_1(K-1) \geq 1 - \mu_1(K).$$
  14. The remaining problem is $\rho_{err}(r^k)$.

  15. Let us look at its numerator and denominator one by one.

  16. $\| H_{opt}^H (x - x_{opt}) \|_{\infty}$ is the maximum absolute inner product between any column in $H_{opt}$ and $(x - x_{opt})$.

  17. We can write

    $$\| H_{opt}^H (x - x_{opt}) \|_{\infty} = \max_h | h^H (x - x_{opt}) | \leq \max_h \| h \|_2 \| x - x_{opt} \|_2 = \| x - x_{opt} \|_2$$

    since all columns in $D$ are unit norm. In between we used the Cauchy-Schwarz inequality.

  18. Now look at the denominator $\| D_{opt}^H (x_{opt} - x^k) \|_{\infty}$, where $(x_{opt} - x^k) \in \mathbb{C}^N$ and $D_{opt} \in \mathbb{C}^{N \times K}$.

  19. Thus

    $$D_{opt}^H (x_{opt} - x^k) \in \mathbb{C}^K.$$

  20. Now for every $v \in \mathbb{C}^K$ we have

    $$\| v \|_2 \leq \sqrt{K} \, \| v \|_{\infty}.$$

  21. Hence

    $$\| D_{opt}^H (x_{opt} - x^k) \|_{\infty} \geq K^{-1/2} \| D_{opt}^H (x_{opt} - x^k) \|_2.$$

  22. Since $D_{opt}$ has full column rank, its singular values are non-zero.

  23. Thus

    $$\| D_{opt}^H (x_{opt} - x^k) \|_2 \geq \sigma_{\min}(D_{opt}) \, \| x_{opt} - x^k \|_2$$

    because $(x_{opt} - x^k) \in \mathcal{C}(D_{opt})$.

  24. From Corollary 18.2 we have

    $$\sigma_{\min}(D_{opt}) \geq \sqrt{1 - \mu_1(K-1)} \geq \sqrt{1 - \mu_1(K)}.$$

  25. Combining these observations we get

    $$\rho_{err}(r^k) \leq \frac{\sqrt{K} \, \| x - x_{opt} \|_2}{\sqrt{1 - \mu_1(K)} \, \| x_{opt} - x^k \|_2}.$$
  26. Now from (20.27), $\rho(r^k) < 1$ whenever $\rho_{opt}(r^k) + \rho_{err}(r^k) < 1$.

  27. Thus a sufficient condition for $\rho(r^k) < 1$ can be written as

    $$\frac{\mu_1(K)}{1 - \mu_1(K)} + \frac{\sqrt{K} \, \| x - x_{opt} \|_2}{\sqrt{1 - \mu_1(K)} \, \| x_{opt} - x^k \|_2} < 1.$$

  28. We need to simplify this expression a bit. Multiplying both sides by $(1 - \mu_1(K))$ we get

    $$\begin{aligned}
    & \mu_1(K) + \sqrt{K (1 - \mu_1(K))} \, \frac{\| x - x_{opt} \|_2}{\| x_{opt} - x^k \|_2} < 1 - \mu_1(K)\\
    \iff & \sqrt{K (1 - \mu_1(K))} \, \frac{\| x - x_{opt} \|_2}{\| x_{opt} - x^k \|_2} < 1 - 2 \mu_1(K)\\
    \iff & \| x_{opt} - x^k \|_2 > \frac{\sqrt{K (1 - \mu_1(K))}}{1 - 2 \mu_1(K)} \, \| x - x_{opt} \|_2.
    \end{aligned}$$

    We assumed $\mu_1(K) < \frac{1}{2}$, thus $1 - 2\mu_1(K) > 0$, which validates the steps above.

  29. We recall that $(x - x_{opt}) \perp \mathcal{C}(D_{opt})$ while $(x_{opt} - x^k) \in \mathcal{C}(D_{opt})$; thus $(x - x_{opt})$ and $(x_{opt} - x^k)$ are orthogonal to each other.

  30. Thus by applying the Pythagorean theorem we have

    $$\| x - x^k \|_2^2 = \| x - x_{opt} \|_2^2 + \| x_{opt} - x^k \|_2^2.$$

  31. Thus the sufficient condition becomes

    $$\| x - x^k \|_2^2 > \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2} \| x - x_{opt} \|_2^2 + \| x - x_{opt} \|_2^2.$$

  32. This gives us the sufficient condition

    (20.28)
    $$\| x - x^k \|_2 > \sqrt{1 + \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}} \; \| x - x_{opt} \|_2.$$
  33. In other words, whenever (20.28) holds true, we have $\rho(r^k) < 1$, which leads to OMP making a correct choice and choosing an atom from the optimal set.

  34. Substituting $x^k = D a^k$ and $x_{opt} = D a_{opt}$ we get back (20.26), which is the desired result.

Theorem 20.19 establishes that as long as (20.26) holds for each of the steps from 1 to $K$, OMP will recover a $K$-term optimum approximation $x_{opt}$. If $x \in \mathbb{C}^N$ is completely arbitrary, then (20.26) may not hold for all $K$ iterations. In this situation, a natural question is: what is the worst $K$-term approximation error that OMP will incur if (20.26) doesn't hold true all the way?

This is answered in the following corollary of Theorem 20.19.

Corollary 20.2 (An estimate for the worst case K-term approximation error by OMP)

Assume that $\mu_1(K) < \frac{1}{2}$ and let $x \in \mathbb{C}^N$ be a completely arbitrary signal. Orthogonal Matching Pursuit produces a $K$-term approximation $x^K$ which satisfies

(20.29)
$$\| x - x^K \|_2 \leq \sqrt{1 + C(D, K)} \; \| x - x_{opt} \|_2$$

where $x_{opt}$ is the optimum $K$-term approximation of $x$ in the dictionary $D$ (i.e. $x_{opt} = D a_{opt}$ where $a_{opt}$ is an optimal solution of (18.20)). $C(D, K)$ is a constant depending upon the dictionary $D$ and the desired sparsity level $K$. An estimate of $C(D, K)$ is given by

$$C(D, K) \leq \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}.$$

Proof.

  1. Suppose that OMP runs fine for the first $p$ steps, where $p < K$.

  2. Thus (20.26) keeps holding for the first $p$ steps.

  3. We now assume that (20.26) breaks down at step $p + 1$, so that OMP is no longer guaranteed to make an optimal choice of column from $D_{opt}$.

  4. Thus at step $p + 1$ we have

    $$\| x - x^p \|_2 \leq \sqrt{1 + \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}} \; \| x - x_{opt} \|_2.$$
  5. Any further iterations of OMP will only reduce the error further (although not in an optimal way).

  6. This gives us

    $$\| x - x^K \|_2 \leq \| x - x^p \|_2 \leq \sqrt{1 + \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}} \; \| x - x_{opt} \|_2.$$

  7. Choosing

    $$C(D, K) = \frac{K (1 - \mu_1(K))}{(1 - 2 \mu_1(K))^2}$$

    we can rewrite this as

    $$\| x - x^K \|_2 \leq \sqrt{1 + C(D, K)} \; \| x - x_{opt} \|_2.$$

This is a very useful result. It establishes that even if OMP is not able to recover the optimum $K$-term representation of $x$, it always constructs an approximation whose error lies within a constant factor of the optimum approximation error, where the constant factor is $\sqrt{1 + C(D, K)}$.

  1. If the optimum approximation error $\| x - x_{opt} \|_2$ is small, then $\| x - x^K \|_2$ will also not be too large.

  2. If $\| x - x_{opt} \|_2$ is moderate, then OMP may inflate the approximation error to a noticeably higher value. But in that case, sparse approximation is probably not the right tool for representing the signal over this dictionary.
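As a rough numerical illustration of the size of this factor, here is a small sketch with an assumed value of $\mu_1(K)$; the value $0.1$ below is hypothetical, not computed from any particular dictionary.

```python
import numpy as np

def worst_case_factor(mu1_K, K):
    """sqrt(1 + C(D, K)) from Corollary 20.2; requires mu1_K < 1/2."""
    C = K * (1 - mu1_K) / (1 - 2 * mu1_K) ** 2
    return np.sqrt(1 + C)

# With mu_1(K) = 0.1 and K = 10, OMP's K-term approximation error is at
# most about 3.9 times the optimal K-term approximation error.
print(worst_case_factor(0.1, 10))
```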