22.1. Introduction#

When designing a dictionary for a particular application, we have several options [70]. On the one hand, we can go through the long list of analytically constructed or tunable dictionaries and select one that suits the application at hand. On the other hand, we can take a number of real example signals from our application and construct a dictionary that is optimized for them.

Dictionary learning (DL) [30, 75] is a process which attempts to solve the problem of constructing a dictionary directly from a set of example signals. The atoms of a learnt dictionary come from the underlying empirical data in the training set of example signals for the specific application. While analytically constructed dictionaries are typically meant only for specific applications, the learning method allows one to construct dictionaries for any family of signals which are amenable to the sparse and redundant representations model. This flexibility comes at a cost. Learnt dictionaries have to be held explicitly and completely in memory as they lack any structure. Thus they do not provide efficient implementations of the analysis ($D^H x$) and synthesis ($D a$) operators, and using them in applications leads to higher computational costs.


We start by formalizing the notation for Dictionary Learning (DL).

  1. We consider a set of $S$ example signals, each of length $N$, put together in a signal matrix $X \in \mathbb{C}^{N \times S}$.

  2. Consider a dictionary $D \in \mathbb{C}^{N \times D}$.

  3. Let $a_i \in \mathbb{C}^{D}$ be the sparsest possible representation of $x_i$ in $D$ with

     $$x_i = D a_i + e_i.$$

  4. We put all $a_i$ together in a matrix $A \in \mathbb{C}^{D \times S}$.

  5. Then we have

     $$X = D A + E$$

     where $E \in \mathbb{C}^{N \times S}$ represents the approximation error.

  6. We are looking for the best dictionary from the set of possible dictionaries such that we can get sparse representations of the $x_i$ with low approximation error.

  7. We can quantify the notion of a good approximation by putting an upper bound on the norm of the approximation error as $\| e_i \|_2 \leq \epsilon$.

  8. Combining these ideas, we introduce the notion of a sparse signal model, denoted $\mathcal{M}_{D, K, \epsilon}$, which consists of a dictionary $D$ providing $K$-sparse representations for a class of signals (from which the example signals are drawn) with an upper bound $\epsilon$ on the approximation error.

  9. The DL problem tries to learn the best model $\mathcal{M}$ based on the example signals $X$; a small numerical sketch of this signal model is given after this list.
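
As an illustration (not part of the formal development), the following sketch generates synthetic, real-valued data following the model $X = D A + E$ with $K$-sparse representations; all sizes are hypothetical.

```python
# A minimal synthetic illustration of X = D A + E (hypothetical sizes, real-valued).
import numpy as np

N, n_atoms, S, K = 64, 128, 500, 5           # signal length, atoms, examples, sparsity
rng = np.random.default_rng(0)

D = rng.standard_normal((N, n_atoms))        # random dictionary ...
D /= np.linalg.norm(D, axis=0)               # ... with unit-norm atoms (columns)

A = np.zeros((n_atoms, S))                   # K-sparse representations a_i as columns
for i in range(S):
    support = rng.choice(n_atoms, size=K, replace=False)
    A[support, i] = rng.standard_normal(K)

E = 0.01 * rng.standard_normal((N, S))       # small approximation error
X = D @ A + E                                # example signals
```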


  1. We can formulate the DL problem as an optimization problem as follows:

     $$\underset{D, \{a_i\}_{i=1}^S}{\text{minimize}} \; \sum_{i=1}^S \| a_i \|_0 \quad \text{subject to} \quad \| x_i - D a_i \|_2 \leq \epsilon, \quad i = 1, \dots, S. \tag{22.1}$$

  2. In this version, we are trying to minimize the total sparsity of the $a_i$ while using the upper bound on the approximation error as the optimization constraint.

  3. We are not enforcing the sparsity constraint $\| a_i \|_0 \leq K$ for $1 \leq i \leq S$.

  4. Alternatively, we can also write:

     $$\underset{D, \{a_i\}_{i=1}^S}{\text{minimize}} \; \sum_{i=1}^S \| x_i - D a_i \|_2^2 \quad \text{subject to} \quad \| a_i \|_0 \leq K, \quad i = 1, \dots, S. \tag{22.2}$$

  5. In this version, we are trying to minimize the approximation error while keeping $K$-sparsity as the optimization constraint.

  6. We are not enforcing the constraint $\| e_i \|_2 \leq \epsilon$; a small snippet after this list evaluates both formulations for a candidate pair $(D, A)$.

  7. Unfortunately, no computationally tractable algorithm is known that solves these optimization problems exactly.

  8. An alternative is to consider a heuristic iterative approach presented in Algorithm 22.1.
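
Before moving to the iterative approach, the following sanity check (not a solver) evaluates the objectives and constraints of (22.1) and (22.2) for a candidate pair, reusing `X`, `D`, `A`, and `K` from the earlier sketch; `epsilon` is a hypothetical bound.

```python
# Evaluate the objectives and constraints of (22.1) and (22.2) for a candidate (D, A).
import numpy as np

residual = X - D @ A                                   # E = X - D A
per_signal_error = np.linalg.norm(residual, axis=0)    # ||x_i - D a_i||_2 for each i
per_signal_sparsity = np.count_nonzero(A, axis=0)      # ||a_i||_0 for each i

epsilon = 0.2                                          # hypothetical error bound
print("(22.1) objective:", per_signal_sparsity.sum())
print("(22.1) constraints met:", bool(np.all(per_signal_error <= epsilon)))
print("(22.2) objective:", float(np.sum(per_signal_error ** 2)))
print("(22.2) constraints met:", bool(np.all(per_signal_sparsity <= K)))
```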

Algorithm 22.1 (Dictionary learning: iterative approach)

Initialization:

  1. Choose an initial dictionary $D_0$.

  2. $k \leftarrow 0$  # iteration counter.

Algorithm:

  1. If the halting criterion is met: break.

  2. $k \leftarrow k + 1$.

  3. Obtain representations $a_i$ for each $x_i$ with the given $D_{k-1}$ and sparsity level $K$ using some sparse approximation algorithm.

  4. Obtain a new dictionary $D_k$ from the signals $x_i$ and their representations $a_i$.

Some possible halting criteria are:

  • Stop after a fixed number of iterations.

  • Stop when $\| e_i \|_2 \leq \epsilon$ for every example.

  • Stop when approximation errors stop improving.

For initialization of the algorithm, we have a few options; a code sketch of the complete iterative scheme follows this list.

  • We can start with a randomly generated dictionary.

  • We can select $D$ examples from $X$ and put them together as our starting dictionary.

  • We can start with some analytical dictionary suitable for the application domain.
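
A minimal sketch of Algorithm 22.1 (not a reference implementation) is shown below, using scikit-learn's Orthogonal Matching Pursuit for the sparse coding step and a fixed iteration count as the halting criterion; `dict_update` is an assumed caller-supplied function implementing the dictionary update step (e.g. the MOD or K-SVD updates sketched in the following subsections).

```python
# Alternating dictionary learning loop: sparse coding step + dictionary update step.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def learn_dictionary(X, D0, K, dict_update, n_iter=20):
    D = D0 / np.linalg.norm(D0, axis=0)       # start from unit-norm atoms
    A = None
    for _ in range(n_iter):                   # fixed number of iterations as halting criterion
        # Sparse coding: K-sparse representation a_i of every example x_i in D.
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
        omp.fit(D, X)                         # the dictionary plays the role of the design matrix
        A = omp.coef_.T                       # shape (n_atoms, n_examples)
        # Dictionary update: compute the next D from the signals and their representations.
        D = dict_update(X, D, A)
        D = D / np.linalg.norm(D, axis=0)     # keep atoms unit norm
    return D, A
```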

22.1.1. Dictionary Learning Methods#

Two popular algorithms implementing this approach are MOD (Method of Optimal Directions) and K-SVD.

22.1.1.1. Method Of Optimal Directions#

In MOD [34], dictionary update is formulated as

$$D_k = \underset{D}{\arg\min} \; \| X - D A_k \|_F^2.$$

A straightforward least squares solution is given by

$$D_k = X A_k^{\dagger},$$

where $A_k^{\dagger}$ denotes the Moore-Penrose pseudo-inverse of $A_k$.
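
In NumPy, a sketch of this update (matching the `dict_update` interface assumed in the loop above; `mod_update` is a hypothetical helper name) could look as follows.

```python
# MOD dictionary update: least squares solution D_k = X A_k^+ via the pseudo-inverse.
import numpy as np

def mod_update(X, D, A):                     # D (the previous dictionary) is unused by MOD
    D_new = X @ np.linalg.pinv(A)            # D_k = X A_k^+
    return D_new / np.linalg.norm(D_new, axis=0)   # re-normalize atoms
```

With the loop sketched earlier, a call might look like `D, A = learn_dictionary(X, D0, K=5, dict_update=mod_update)`.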

22.1.1.2. K-SVD#

K-SVD [1] is slightly different. Rather than recomputing the whole dictionary at once by solving the LS problem, it updates the atoms of $D$ one by one.

  1. Let $j$ be the index of the atom $d_j$ being updated.

  2. Consider the error

     $$E_j = X - \sum_{k \neq j} d_k b_k^T.$$

  3. $b_k^T$ refers to the $k$-th row of $A$; i.e., the entries for atom $d_k$ in every example.

  4. $E_j$ is the approximation error when the atom $d_j$ (and the corresponding entries in $A$) has been dropped from consideration.

  5. We can see that

     $$X - D A = X - \sum_{k=1}^{D} d_k b_k^T = X - \sum_{k \neq j} d_k b_k^T - d_j b_j^T = E_j - d_j b_j^T.$$

  6. Hence

     $$\| X - D A \|_F^2 = \| E_j - d_j b_j^T \|_F^2.$$
  7. On the L.H.S. we have the total approximation error.

  8. On the R.H.S. we have the same error expressed in terms of $E_j$ and $d_j$, where $E_j$ does not depend on $d_j$.

  9. An optimal $d_j$ is one which minimizes the R.H.S.

  10. This can easily be obtained by a rank-1 approximation of $E_j$.

  11. The rank-1 approximation gives us both $d_j$ as well as the entries in the $j$-th row of $A$ given by $b_j^T$. There is a small catch though.

  12. In general the rank-1 approximation can lead to a dense $b_j$, meaning the atom $d_j$ would get used in many signal representations. We wish to avoid that.

  13. For this, we identify the representations $a_i$ in which the atom $d_j$ appears, and let them be indexed by $\Gamma$.

  14. We then restrict $E_j$ to the signals indexed by $\Gamma$ (denoted $E_{j, \Gamma}$) to avoid a dense $b_j$.

  15. A rank-1 approximation can easily be obtained using the singular value decomposition.

  16. We perform the SVD $E_{j, \Gamma} = U \Sigma V^H$.

  17. We then pick $u_1$, $\sigma_1$, $v_1$.

  18. $u_1$ is the new update for $d_j$.

  19. Further, $\sigma_1 v_1$ is the update for $b_{j, \Gamma}$.

  20. The rest of $b_j$ is left with zero entries.

  21. We repeat the process for each of the atoms in $D$ to obtain the next update of $D$; a code sketch of this atom update is given below.
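
A minimal NumPy sketch of a single atom update (following steps 1 through 21 above) is given below; `ksvd_atom_update` and `ksvd_update` are hypothetical helper names, and `A` is modified in place so that the updated $b_{j, \Gamma}$ is retained.

```python
# One K-SVD atom update: rank-1 SVD approximation of the restricted error E_{j,Gamma}.
import numpy as np

def ksvd_atom_update(X, D, A, j):
    gamma = np.flatnonzero(A[j, :])          # indices of representations using atom d_j
    if gamma.size == 0:
        return D, A                          # atom unused: nothing to update here
    # E_j restricted to Gamma: approximation error without the contribution of atom j.
    E_j = X[:, gamma] - D @ A[:, gamma] + np.outer(D[:, j], A[j, gamma])
    U, s, Vh = np.linalg.svd(E_j, full_matrices=False)
    D[:, j] = U[:, 0]                        # d_j <- u_1
    A[j, gamma] = s[0] * Vh[0, :]            # b_{j, Gamma} <- sigma_1 * v_1
    return D, A

def ksvd_update(X, D, A):
    # Sweep over all atoms to obtain the next dictionary (fits the dict_update interface).
    for j in range(D.shape[1]):
        D, A = ksvd_atom_update(X, D, A, j)
    return D
```

Unlike MOD, K-SVD also refreshes the nonzero entries of the corresponding row of $A$ during the dictionary update.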