Karl Skretting, University of Stavanger.
Contents on this page:  Relevant papers and links to other pages: 



Dictionary Learning is a topic in the Signal Processing area,
the dictionary is usually used for Sparse Representation or Approximation of signals.
A dictionary is a collection of atoms, here the atoms are real column vectors of length N
.
A finite dictionary of K
atoms can be represented
as a matrix D
of size NxK
.
In a Sparse Representation a vector x
is represented or approximated as a linear
combination of some few of the dictionary atoms.
The approximation xa
can be written as
xa = D w
where w
is a vector containg the coefficients and
most of the entries in w
are zero.
Dictionary Learning is the problem of finding a dictionary such that the approximations of
many vectors, the training set, are as good as possible given a sparseness criterion on the coefficients,
i.e. allowing only a small number of nonzero coefficients for each approximation.
This page describes some experiments done on Dictionary Learning. The complete theory of dictionary learning is not told here, only a brief overview (of some parts) is given in section 3. and some links to relevant papers are included on the upper right part of this page. Section 4 presents the results of the experiments used in the RLSDLA paper, and section 6 also includes the files needed to redo the experiments.
In Matlab version 2012a Matching Pursuit algorithms are included in the wavelet toolbox, see Wavelet Toolbox User Guide.
Let the dictionary D
be represented as a real matrix of size
NxK
and K>N
.
Given a test vector x
of size Nx1
,
it can be approximated by a linear combination of dictionary atoms,
the column vectors of D
are often called atoms in a sparse approximation context.
Denoting the atoms as d_{1}
, d_{2}
, ... , d_{K}
,
one example of a sparse approximation is
xa = w(1)d_{1} + w(4)d_{4} + w(7)d_{7} + w(9)d_{9}, 
x = xa + r 
(2.1) 
where four of the atoms have been used here. r
is the representation error.
The elements w(i)
(where i
is 1, 4, 7 and 9 in this example)
are the coefficients in the representation.
Collecting the coefficients into a vector w
the approximation xa
can be written as:
xa = D w,
and the representation error can be written as
: r = x  xa = x  D w.
If most of the entries in w
are zero this is a sparse representation,
the number of nonzero coefficients is s
and the sparsness factor is s/N
.
The problem of finding w
is the sparse approximation problem, a common way to write this problem is
(2.2) 
As γ increases the solution is getting more dense.
The problem with p=0
is NPhard.
Good, but not necessarily optimal, solutions can be found by matching pursuit algorithms,
for example the order recursive matching pursuit (ORMP) algorithm.
The problem with p=1
is easier, the LARS algorithm is effective for solving this problem.
Both ORMP and LARS find w
in a greedy way, starting with an all zero vector in w
,
which is the solution when γ is close to zero, then the algorithms
add one and one vector based on some rules given by the algorithm.
For the LARS algorithm this corresponds to all the solutions to Eq. 2.2 above
with p=1
and γ gradually increases.
For both ORMP and LARS there must be a stopping criterium,
the algorithm returns when the stopping criterion is reached.
This can be that the 0norm (number of nonzeros) the 1norm (sum of absulute values) or
that the error (sum of squared errors) have reached a predefined limit.
We should note that both LARS and ORMP implementations often are more effective when a fixed dictionary
D
can be used to find the solutions for several signal vectors at once.
The function sparseapprox.m is a general sparse approximation function which contains several methods, and also acts as interface to external functions. The available methods can be devided into four groups:
mpv2
java package (by K. Skretting) is available then sparseapprox.m can be used to access
some of the matching pursuit variants implemented there:
javaMP, javaOMP, javaORMP and javaPS, the
last one is partial search as described in the
Partial Search paper.
This paper also describes the details of and difference between OMP and ORMP.
A test of the sparseapprox.m function is done by ex300.m.
The dictionary and the data in the matfiles included in the table at the bottom of this
page is used and sparse approximations are found using several methods avaliable from sparseapprox.m.
The results are presented in a table, parts of that table is shown here.
Time is given in seconds (for the 4000 sparse approximations done), SNR is signal to noise ratio,
w0 is average 0norm for the coefficient, i.e. the first term in Eq. 2.2 with p=0
,
w1 is average 1norm for the coefficient, i.e. the first term in Eq. 2.2 with p=1
,
and r2 is average 2norm for the errors, i.e. the second term in Eq. 2.2.
Here pinv, linprog and FOCUSS are followed by thresholding
and the coefficients are adjusted by an orthogonal projection onto the column space of the seleceted columns.
We can note that there is a pattern in the matching pursuit algorithms, as the error is
decreased, from MP to OMP to ORMP to PS(10) to PS(250), the 1norm of the coefficients is increased.
Method  Time  SNR  w0  w1  r2 

pinv  0.59  4.88  4  14.59  6.29 
linprog  58.99  9.94  4  16.43  3.75 
FOCUSS  29.68  16.95  3.9  14.62  1.79 
mexLasso  0.07  11.90  3.96  10.28  3.13 
javaMP  0.49  16.89  4  16.19  1.81 
OMP  8.12  17.86  4  16.86  1.61 
javaOMP  0.56  17.86  4  16.86  1.61 
ORMP  8.53  18.27  4  18.12  1.54 
javaORMP  0.30  18.27  4  18.12  1.54 
mexOMP  0.05  18.27  4  18.12  1.54 
Partial Search (10)  0.55  18.79  4  20.21  1.45 
Partial Search (250)  3.95  19.00  4  23.17  1.42 
A common setup for the dictionary learning problem starts with access to a training set,
a collection of training vectors, each of lenght N
.
This training set may be finite, then the training vectors are usually collected
as columns in a matrix X
of size NxL
, or it may be infinite.
For the finite case the aim of dictionary learning is to find both a dictionary, D
of size NxK
, and a corresponding coefficient matrix W
of size KxL
such that the
representation error, R=XDW
, is minimized and W
fulfil some sparseness criterion.
The dictionary learning problem can be formulated as an optimization problem with respect to W
and D
, with γ and p
are as in Eq. 2.2 we may put it as
(3.1) 
The Matlab function dlfun.m,
attached in the end of this page, can be used for MOD, KSVD, ODL or RLSDLA.
It is entirely coded in Matlab, except that the sparse approximation, done by sparseapprox.m
,
may call java or mex functions if this is specified.
The help text of dlfun.m
gives some examples for how to use it.
Note that this function is slow as sparse approximation is done for only one vector each time,
an improvement (in speed) would be to do all the sparse approximations in one function call for
MOD and KSVD, or to use a minibatch approach like LSDLA for ODL and RLSDLA.
However, processing one training vector at a time may give better control of the algorithm
and gives more precise information in error messages and exception handling.
Method of Optimized Directions (MOD), or iterative least squares dictionary learning algorithms (ILSDLA)
as the family of MODalgoritms (many variants exist) may be denoted, can be used for a finite learning set,
X
of size NxL
,
and with sparseness defined by either 0norm (number of nonzero coefficient is limited)
or 1norm (sum of absolute values of coefficient is limited), i.e. p=0
or p=1
.
The optimization problem of Eq. 3.1 is divided into two (or three) parts iteratively solved
D
fixed find W
,
this gives L
independent problems as in Eq. 2.2, and
W
fixed find D
using the least squares solution:
D = (XW^{T})(WW^{T})^{1} = BA^{1}
.
It is convenient to define the matrices B=XW^{T}
and A=WW^{T}
.
D
, i.e. scale each column vector (atom) of D
to unit norm.
Part 3 may be skipped if D
is almost normalized and the l_{0}
sparseness is used.
Part 1 is computationally most demanding, and with this setup there will be a lot of computational effort
between each dictionary update, slowing down the learning process. This is especially true with large training sets.
Some minor adjustments can be done to improve the learning speed for large (and also medium sized) training sets, this variant can be denoted as the largeMOD variant.
The training set is (randomly) divided into M
(equal sized) subsets, the subsets are denoted X_{m}
for m=1
to M
.
Now, for each iteration we only use one, some few, (or perhaps almost all) of the subsets of training vectors in part 1 above.
Which subsets to use may be randomly chosen, or the subsets which has been unused for the longest time can be selected.
Anyway, finding new coefficients for only a part of the total number of training vectors reduces the time used.
In part 2 the matrices A
and B
are calculated by adding products of training vector
submatrices and coefficient submatrices, i.e.
A = Σ_{m} W_{m}W_{m}^{T} , 
B = Σ_{m} X_{m}W_{m}^{T} 
and  D = B A^{1} 
(3.2) 
The sum can be taken over the same subsets that were used in part 1 of the same iteration. An alternative, that is especially useful in the end when the dictionary do not change very much from one iteration to the next, is to take the sum in part 2, Eq. 3.2, over all subsets in the training set even if only some of the coefficients were updated in part 1 for this iteration.
An alternative, here denoted LSDLA, which may be easier to implement, includes a forgetting factor (0 ≤ λ_{i} ≤ 1
) in the algorithm.
This opens the possibility of letting the training set be infinitely large.
The subset of training vectors to use in iteration i
is here indexed by subscript i
to make the connection between subset and iteration clearer.
X_{i}
can be a new subset from the training set or it can reuse training vectors processed before. D_{i1}
is used solving the sparse approximation problem,
finding the corresponding coefficients W_{i}
.
The following dictionary update step can be formulated as:
A_{i} = λ_{i} A_{i1}
+ W_{i}W_{i}^{T} , 
B_{i} = λ_{i} B_{i1}
+ X_{i}W_{i}^{T} 
and  D_{i} = B_{i} A_{i}^{1} 
(3.3) 
Eq. 3.3 is indeed very flexible. If the used set of training vectors always is the complete finite
training set, X_{i} = X
, and λ_{i} = 0
it reduces
to the original MOD equation. On the other end, having each training set as just one training vector,
X_{i} = x_{i}
, and λ_{i} = 1
,
makes the equation mathematical equivalent to the RLSDLA (section 3.4) without a forgetting factor and almost equivalent to the original ODL (section 3.3) formulation.
The challenge when using only one training vector in each iteration is the often computationally demaning
calculation of the dictionary in each step. Both ODL and RLSDLA have alternative formulations to the
least squares solution, D_{i} = B_{i} A_{i}^{1}
.
Smaller subsets and the forgetting factor close to one makes the equation quite similar to the minibatch approach of ODL, in fact it may be considered as a minibatch extension of RLSDLA.
Matlab implementation:
The code below is from ex210.m, see more of the dictionary learning context in that file.
This is the very basic straighforward MOD implementation to the left.
To the right is how my java implementation of MOD can be used.
ILSDLA/MOD (matlab)  ILSDLA/MOD (java) 
for it = 1:noIt W = sparseapprox(X, D, 'mexOMP', 'tnz',s); D = (X*W')/(W*W'); D = dictnormalize(D); end 
jD0 = mpv2.SimpleMatrix(D); jDicLea = mpv2.DictionaryLearning(jD0, 1); jDicLea.setORMP(int32(s), 1e6, 1e6); jDicLea.ilsdla( X(:), noIt ); jD = jDicLea.getDictionary(); D = reshape(jD.getAll(), N, K); 
KSVD, for details see the KSVD paper by Aharon et al., is also an iterative method like MOD. The two steps in each main iteration are:
D
fixed find W
,
this gives L
independent problems as in Eq. 2.2, and
W
fixed and find D
and W
using SVD decompositions.
Part 3 is not needed as the SVD make sure that the dictionary atoms are normalized.
We should note that using KSVD with the 1norm in sparse approximation, i.e. LARS,
makes no sense since the KSVD dictionary update step conserves the 0norm but
modifies the 1norm of the coefficients, and thus the actual coefficients will be neither l_{0}
nor l_{1}
sparse.
Matlab implementation:
The code below is from ex210.m, see more of the dictionary learning context in that file.
KSVD is mainly as described in the KSVD paper by Aharon. The Approximate KSVD is detailed described in
the technical report,
"Efficient Implementation of the KSVD Algorithm using Batch Orthogonal Matching Pursuit." by Ron Rubinstein et al,
CS Technion, April 2008. The essential Matlab code is quite simple and compact and shown below:
Standard KSVD  Approximate KSVD 
for it = 1:noIt W = sparseapprox(X, D, 'mexOMP', 'tnz',s); R = X  D*W; for k=1:K I = find(W(k,:)); Ri = R(:,I) + D(:,k)*W(k,I); [U,S,V] = svds(Ri,1,'L'); % U is normalized D(:,k) = U; W(k,I) = S*V'; R(:,I) = Ri  D(:,k)*W(k,I); end end 
for it = 1:noIt W = sparseapprox(X, D, 'mexOMP', 'tnz',s); R = X  D*W; for k=1:K I = find(W(k,:)); Ri = R(:,I) + D(:,k)*W(k,I); dk = Ri * W(k,I)'; dk = dk/sqrt(dk'*dk); % normalize D(:,k) = dk; W(k,I) = dk'*Ri; R(:,I) = Ri  D(:,k)*W(k,I); end end 
Online dictionary learning (ODL) can be explained from Eq. 3.3.
A single training vector x_{i}
or a minibatch
X_{i}
is processed in each iteration.
The corresponding coefficients, w_{i}
or W_{i}
,
are found and the A_{i}
and B_{i}
matrices
are updated according to Eq. 3.3 (originally with λ_{i} = 1
),
which in the single training vector case become
A_{i} = A_{i1} + w_{i}w_{i}^{T}
and B_{i} = B_{i1} + x_{i}w_{i}^{T}
.
In ODL the computationally demanding calculation of D_{i}
is replaced by doing one iteration
of the blockcoordinate descent with warm restart algorithm. A little bit simplified, each column (atom) of the
dictionary is updated as
d_{j} ← d_{j} + (b_{j}
 D a_{j})/a_{j}(j) 
for  j = 1, 2, ..., K 
(3.4) 
where d_{j}
, b_{j}
, a_{j}
are columns of the
D
, B_{i}
, and A_{i}
matrices
and D
here is the column by column transition of
the dictionary from D_{i1}
to D_{i}
.
Note that if Eq. 3.4 is repeated the dictionary will converge to the least squares
solution, D_{i} = B_{i} A_{i}^{1}
,
often after just some few iterations.
For details see the ODL paper by Mairal et al.
The recursive least squares dictionary learning algorithm RLSDLA can, like ODL, be found by processing just one new training vector x_{i}
in each iteration of Eq. 3.3.
The current dictionary D_{i1}
is used to find the corresponding
coefficients w_{i}
. The main improvement in RLSDLA, compared to LSDLA,
is that instead of calculating the least squares solution as in Eq. 3.3 in each step
the matrix inversion lemma (Woodbury matrix identity) can be used to update
C_{i} = A_{i}^{1}
and D_{i}
directly without
using neither the A_{i}
nor the B_{i}
matrices.
This gives the following simple updating rules:
C_{i} = A_{i}^{1} = (C_{i1} / λ_{i})  α u u^{T} 
and  D_{i} = D_{i} + α r_{i} u^{T} 
(3.5) 
where u = (C_{i1} / λ_{i}) w_{i}
and
α = 1 / (1 + w_{i}^{T} u)
, and
r_{i} = x_{i}  D_{i1} w_{i}
is the representation error.
Note that in these steps neither matrix inversion nor matrix by matrix multiplication is needed.
The real advantage of RLSDLA compared to MOD and KSVD comes with the flexibility introduced by the forgetting factor λ. The SearchThenConverge scheme is particular favourable, and the idea is to forget more quickly in the beginning then forget less as learning proceeds and we get more confidence in the quality of the dictionary. This can be done by starting with λ slightly smaller than one and slowly increasing λ towards 1 as learning progress. The update scheme and λ should be chosen so that the initial dictionary is soon forgotten and convergence is obtained in a reasonable amount of iterations. The effect of different values for λ and the SearchThenConverge scheme is illustrated in section 4 below.
Matlab implementation:
The code below is from ex210.m, see more of the dictionary learning context in that file.
To the left we just call another function, i.e. dictlearn_mb.m,
which do the work. The only issue here is to set its parameters in an appropriate way. You should note that
the variable n below use 500=25+50+125+300 to set the number of training vectors to be used in learning
to the same as the number of vectors in the training set L times the number of iterations to do noIt.
The function dictlearn_mb.m returns the results in a struct, the dictionary is field "D".
To the right is how the java implementation of RLSDLA can be used, here the work is done
by jDicLea.rlsdla(...). The java implementation is more flexible when it comes to decide how
the forgetting factor should increase towards 1, below a quadratic scheme is used. The experiments in the
RLSDLA paper showed that this flexibility is not needed, the choice used in the minibatch function will do.
RLSDLA minibatch  RLSDLA (java) 
n = L*noIt/500; mb = [n,25; n,50; n,125; n,300]; MBopt = struct('K',K, 'samet','mexomp', ... 'saopt',struct('tnz',s, 'verbose',0), ... 'minibatch', mb, 'lam0', 0.99, 'lam1', 0.95, ... 'PropertiesToCheck', {{}}, 'checkrate', L, ... 'verbose',0 ); Ds = dictlearn_mb('X',X, MBopt); D = Ds.D; 
jD0 = mpv2.SimpleMatrix(D); jDicLea = mpv2.DictionaryLearning(jD0, 1); jDicLea.setORMP(int32(s), 1e6, 1e6); jDicLea.setLambda( 'Q', 0.99, 1.0, 0.95*(noIt*L) ); jDicLea.rlsdla( X(:), noIt ); jD = jDicLea.getDictionary(); D = reshape(jD.getAll(), N, K); 
The experiments are made as Matlab mfiles that run in the workspace, not as functions.
The training set used for all experiments on the AR signal is stored as a matrix X
in dataXforAR1.mat, dimension of X
is 16x4000
.
This set is also used as the test set. The goal of dictionary learning here is to represent this particular
data set in a good way, not to make a dictionary for a general AR(1) signal.
The target (and test) sparseness factor is 1/4, which gives s=4
nonzero coefficients in each sparse representation.
The first experiment, ex111.m, compare three methods to each other. Only 200 iterations are done, to get better results thousands of iterations should be done as in ex112.m but those results are not shown here. RLSDLA is used with λ set to one.
Above is results for ex111.m. We see the results are not very good for the RLSDLA method. KSVD and ILSDLA (MOD) perform very similar to each other.
In the next experiment ex121.m RLSDLA is used with different fixed values of λ.
Above is results for ex121.m. We see that the RLSDLA method here performs much better. For appropriate values of λ RLSDLA is better than KSVD and ILSDLA (MOD).
In the third experiment, ex131.m and ex132.m we will see that RLSDLA can do even better when an adaptive scheme for λ is used.
Above is results for ex132.m,
results for ex131.m are not shown here.
We see that the RLSDLA method here performs even better.
The convergence of the algorithm can be targeted to a given number of iterations.
The variable a
in the experiment gives when λ is increased to its final value 1,
it is given as a fraction of the total number of iterations.
In this experiment we want to compare the different dictionary learning algorithms
regarding recovering of a known dictionary. A true dictionary of size 20x50
is made in each trial. A set of training data is made by using the true dictionary,
each training vector is a sparse combination of some, here 5, vectors from the
true dictionary and random noise. The number of training vectors can be chosen
and the level of noise can be given. Then a dictionary is trained from the training set
using one of the three methods, ILSDLA (MOD), KSVD, and RLSDLA. The trained dictionary
is then compared to the true dictionary, this is done by pairing the dictionaries atoms
in the best possible way and using the angle between a true atom and a trained atom as
a distance measure. The number of identified atoms is then found by comparing
these angles to a limit, β_{lim}.
Above is results for ex211.m.
We see that the RLSDLA method here performs better than ILSDLA (MOD) and KSVD which both perform quite similar.
The most remarkable difference is that RLSDLA is able to identify all atoms for a
quite small angle β_{lim}. ILSDLA (MOD) and KSVD always have some
atoms that are far away from the true atom. Having β_{lim}=8.11
degrees
corresponds to the case where the inner product of the two compared atoms is equal to 0.99,
each atom has 2norm equal to 1.
This section describes how to learn the dictionaries used in the image compression examples in our ICASSP 2011 paper. How the learned dictionaries were used is described in the image compression example in section 7 on the IC tools page where also the 8 used dictionaries are available, those dictionaries were designed by the mfiles presented here. An important point in our ICASSP 2011 paper is that sparse approximation using a dictionary can be done in the coefficient domain, and that this improves compression performance in much the same way as using the 9/7 wavelet transform performs better than using the block dased discrete cosine transform (DCT). Thus, the learning must be done in the coefficient domain and the training vectors be in the coefficient domain.
A Matlab function, getXfrom8images.m,
was made to generate the training vectors from 8 training images (shown below).
The function is actually quite flexible and can be used to generate training vectors from
all images that can be read by the Matlab imread
function.
Training vectors can be made in the pixel domain or in several transform domains, most important the 9/7 wavelet domain.
The function needs mycol2im.m from ICTools,
which must be in the current directory or available on the Matlab search path.
The getXfrom8images.m
file should be updated so that its default image catalog matches a catalog
on your computer where the 8 bmpimages above, i.e. elaine.bmp, lake.bmp, man.bmp, crowd.bmp,
couple.bmp, woman1.bmp, woman2.bmp, baboon.bmp
, are all stored.
These images, as well as lena.bmp
, are available in a zipfile,
USCimages_bmp.zip
.
Having access to the set of training vectors we could start learning dictionaries using the
dlfun.m
briefly presented in the beginning of section 3 here. But the function is slow.
Here we have made four Matlab functions more appropriated, in the sence that they are more
directed towards the image compression problem at hand and faster than dlfun.m
.
The four files are denoted as ex31?.m
, where ?
can be 1 for ILSDLA (MOD) learning,
2 for RLSDLA learning, 3 for KSVD learning and 5 for a special separable MOD learning.
The three first methods are briefly explained in section 3 above, and the separable case is
described in section 3.6 in my PhD thesis.
The four functions are all used in a very similar way, like:
% res = ex31?(transform, K, targetPSNR, noIT, testLena, verbose); % % where the arguments are: % transform may be: 'none', 'dct', 'lot', 'elt', 'db79', 'm79' % see myim2col.m (in ICTools) for more % K number of vectors in dictionary % for the separable case use [K1, K2] instead of K % targetPSNR target Peak Signal to Noise Ratio, should be >= 30 % noIT number of iterations through the training set % testLena 0 or 1, default 0. (imageapprox.m is used) % verbose 0 or 1, default 0. %
The returned variable res
will be a struct
with the leared dictionary as one field.
It will also be stored in a matfile in the current catalog.
The 8 dictionaries learned for the image compression example in section 7 of the ICTools page
were learned by these four implementations.
The mfile ex31prop.m
can be used to display some properties of these dictionaries,
and other dictionaries learned by the four mfiles ex31?.m
.
For example, the results of r=ex31prop('ex31*.mat', 'useLaTeX',true);
is:
ex31prop display properties for 8 dictionary files. \hline Dictionary filename & iter & tPSNR & trans. & A & B & betamin & betaavg \\ \hline ex311Feb031626.mat & 1000 & 38.0 & none & 0.09 & 73.73 & 7.84 & 43.43 \\ ex311Jan241424.mat & 1000 & 38.0 & m79 & 0.14 & 57.20 & 12.03 & 46.37 \\ ex312Jan191930.mat & 800 & 38.0 & m79 & 0.58 & 30.54 & 16.18 & 54.05 \\ ex312Jan282133.mat & 500 & 38.0 & none & 0.47 & 30.96 & 16.36 & 53.86 \\ ex313Feb011544.mat & 1000 & 38.0 & none & 0.23 & 63.58 & 4.71 & 45.91 \\ ex313Jan251720.mat & 1000 & 38.0 & m79 & 0.30 & 54.68 & 9.03 & 48.01 \\ ex315Feb021040.mat & 1000 & 38.0 & none & 0.92 & 27.57 & 19.62 & 35.67 \\ ex315Jan211232.mat & 500 & 38.0 & m79 & 0.70 & 20.38 & 30.36 & 41.16 \\ \hline
During learning the four mfiles ex31?.m
saves some information to show the progress of learning.
Since the target PSNR is fixed, some of the first iterations are used to find an appropriate limit for the error
which is used as a stopping criterion in sparse approximation. After som few iterations the achieved
PSNR is quite stable during learning and as the dictionary further improves the average number of nonzero
coefficients for each training vector slowly decreases. This can be plotted, for example:
r1 = load('ex311Feb031626.mat'); r1.tabPSNR(1:10)' % display the first PSNR values (on the next lines) ans = 35.8615 38.2784 37.8627 38.0553 37.9764 38.0124 37.9938 37.9985 38.0014 38.0004 plot(r1.tabIT(5:end), r1.tabNNZ(5:end)/r1.L); title('Average number of nonzero coefficients, including DC, for training vectors.'); xlabel(['Iteration number, dictionary: ',r1.ResultFile]);
The result off the commands above is plotted below.
This section starts with a brief presentation of some dictionary properties. Many of these properties are used in the experiments done in the SPIE paper, they are used to assess learned dictionaries and to compare dictionaries to each other. The larger part of this section describes two of the three experiments done in the SPIE paper. The main purpose of these experiments were to compare different learning algorithms to each other, including the effect of using either the l_{0}norm (ORMP) or l_{1}norm (LARS) in vector selction, for sparse approximation of patches from natural images. The Matlab files and how to use them are briefly explained. Some results are presented, including plots of some of the learned dictionaries.
A thorough presentation of many dictionary properties is given in
Elad's book: "Sparse and Redundant Representations ...".
Also the SPIE 2011 paper gives an overview of some dictionary properties, mainly those that are relatively easy to compute.
It also suggests some new properties, i.e. μ_{mse}, μ_{avg},
β_{mse}, and β_{avg}, and proposes a measure for the
distance between two dictionaries: β(D,D')
.
The table below lists some properties for a normalized dictionary, i.e. each atom (column vector) has unit 2norm.
The dictionary is represented as a matrix D
of size NxK
.
For more details on each property see the SPIE paper.
Property  Definition (comment) 

σ_{1}, σ_{2}, ... σ_{N}  The singular values of D , where we assume
σ_{1} ≥ σ_{2} ≥ ... σ_{N} 
λ_{1}, λ_{2}, ... λ_{N}  The eigenvalues of the frame operator S=DD^{T} ,
we have λ_{i} = σ_{i}^{2}
and Σ_{i} λ_{i} = K 
A, B  Lower and upper frame bound, A = λ_{N} and B = λ_{1} 
μ, β_{min}  Coherence and angle between the two dictionary atoms closest to each other, coherence is cosine of the angle. For a normalized dictionary the coherence is the inner product of the two closest atoms. 
μ_{avg}, β_{avg}  Average of cosine (angle) for all atoms to its closest neighbor. 
μ_{mse}, β_{mse}  Average, in mean square error sense, of cosine (angle) for all pairs of atoms in the dictionary, including an atom to itself. These properties can be computed from the eigenvalues. 
μ_{gap}, β_{gap}  Cosine (angle) for the most difficult vector x to represent by one atom
and its closest neighbor in the dictionary. This property is generally hard to compute. 
δ(D) 
The decay factor used in Elad's book,
we have δ(D) = μ_{gap}^{2} 
β(D,D') 
Distance between two dictionaries as the average angle between each atom (in both dictionaries) to its closest atom in the other dictionary. 
SRC 
The Sparse Representation Capability is the average number of atoms used in a sparse approximation of a set of test vectors where the representation error is below a given limit for each test vector. 
We will here present the two first of the three experiments presented in the SPIE paper. A general description of the experiments and the main result are given in the paper. This webpage mainly presents the Matlab files needed to redo the experiments, but also some results not given in the paper are shown. In all experiments training vectors are generated from 8 by 8 patches from natural images in the Berkeley image segmentation set. The files were copied into a catalog on the local computer, and a Matlab function, getXfromBSDS300.m, was made to access the data in an easy way. Note that you have to change some few lines in this mfile (near line 175) to match the catalog where you have copied the data on your local computer. Matlab example:
par = struct('imCat','train', 'no',1000, 'col',false, 'subtractMean',true, 'norm1lim',10); X = getXfromBSDS300(par, 'v',1);
The following table shows some example dictionaries.
Dictionary, size is 64x256  Atoms  matfile  A  B  μ_{mse}  μ_{avg}  μ  SRC 

1. Random Gaussian  dict_rand.mat 
1.12  9.10  0.1405  0.3709  0.5803  14.86  
2. Patches from training images  dict_data.mat 
0.07  49.87  0.2860  0.6768  0.9629  11.20  
3. RLSDLA learned dictionary (from ex.1)  ex414_Jun250624.mat 
0.13  17.89  0.1596  0.5205  0.9401  8.37  
4. Orthogonal bases  dict_orth.mat 
4  4  0.1250  0.1250  0.1250  10.87  
5. Separable dictionary with sine elements  dict_sine.mat 
1.00  6.66  0.1350  0.4971  0.4999  9.89  
6. Dictionary with sine elements and corners  dict_ahoc.mat 
1.47  6.15  0.1318  0.5026  0.6252  9.14 
The dictionaries listed in the table are available as matfiles, links are given in third column. Some dictionary atoms are shown in the second column of the table, all atoms should be shown when you click on these. The six rightmost columns are properties found for each dictionary. The fourth dictionary is the concatenation of 4 maximally separated orthogonal bases, it is made by the get64x256.m function. The dictionary in the table includes the identity basis, and the SRC measure is 10.87. If we rotate alle atoms such that the 2DDCT basis is in the dictionary (instead of the identity basis) then we get a dictionary with the same properties but which will achiecve better SRC measure, SRC = 10.38. Using only the dct basis will give SRC = 13.55. The fifth dictionary is made by the get64x256sine.m function which needs the getNxKsine.m function. The last dictionary is an adhoc dictionary with 168 separable sine atoms, and the rest of the atoms are 6x6 2Ddct basis functions multiplied by a windowfunction and located in the corner of each patch.
For the dictionaries in experiment 1 and 2 it was convenient to make a special function to
find and return the dictionary properties, mainly because the special SRC
measure is also wanted.
To find SRC
a specific set of test data should be available and stored in a matfile in current directory,
the file used here is testDLdataA.mat (7.5 MB) and contains 25600 test vectors
generated from the Berkeley image segmentation test set. The function should also plot the development of
some of the properties during learning, when available these are plotted in figure 1 to 3,
and display the dictionary atoms in yet another figure, here figure 4.
The result is a 'quickanddirty' mfile, testDLplot.m.
This function needs a struct containg the dictionary and possible also learning history as fields,
alternatively the name of a matfile containing this struct with the name res, as the first input argument.
For most dictionaries the more clean and tidy functions dictprop.m
and dictshow.m should be used.
Anyway, you may try the following code:
v = testDLplot('ex414_Jun250624.mat', 1); % or clear all; load dict_rand.mat % or dict_data.mat, dict_orth.mat res.D = D; % since D is stored, but a struct is needed in testDLplot v = testDLplot(res, 1); % or clear all; res.D = get64x256('dct'); % a 'better' alternative to the fourth dictionary v = testDLplot(res, 1);
In experiment 1 the RLSDLA algorithm was used and different parameters were used, see SPIE paper. To make similar dictionaries you may try the code in ex414.m, note that this is not a function but simply a collection of Matlab commands. ex414.m uses the javafunction DictionaryLearning.rlsdla(..) which unfortunately only uses the sparse approximation methods implemented in the same package. Thus to use RLSDLA with LARS, i.e. mexLasso in SPAMS, the function denoted testDL_RLS.m is needed. Both these mfiles store the dictionary as a struct in a matfile. Examples of use are:
clear all; ex414; % all defaults % or clear all; dataSet='A'; batches=100; dataLength=20000; lambda0=0.996; targetPSNR=38; ex414; v = testDLplot(res, 1); % or param = struct('vecSel','mexLasso', ... 'mainIt',100, ... 'dataLength',20000, ... 'dataSet','A', ... 'dataBatch',[20*ones(1,20),50*ones(1,50),100], ... 'targetPSNR', 38, ... 'lambda0', 0.998 ); res = testDL_RLS( param ); v = testDLplot( res, 1 );
Varying targetPSNR and lambda0 it should be possible to make dictionaries similar to the ones described in the SPIE paper, and present the results in a similar way. We observed that the resulting dictionary properties mainly depended on the selected parameters. Learning several dictionaries with the same parameters, but different initial dictionary and different training vectors, all had almost the same properties. This is a good property of the learning algorithm, as it seems to capture the statistical properties of the training set rather than depending on the specific data used. We also noted an essential difference between dictionaries learnd by LARS (l_{1}sparsity) and dictionaries learnd by ORMP (l_{0}sparsity), see results presented in the SPIE paper.
In experiment 2 we again refer to the SPIE paper for a more complete description of the experiment and a presenteation of the results. This section presents the Matlab files and some additional results. First we review the main results as presented in Table 3 in the SPIE paper.
The 8 different (variants of) dictionary learning algorithms used in this experiments are implemented by 4 mfiles. The MOD dictionaries, case 1 and 3, were generated by testDL_MOD.m. The KSVD dictionaries, case 2, were generated by testDL_KSVD.m. These two functions assume that there is available a fixed huge set (1 million vectors) of training vectors in 50 files (20000 training vectors in each) in a catalog on the local computer. But if these files are not available the functions will make a new set of training data in each iteration. The advantage with a fixed, but huge, training set is that learning has a fixed training set, which is an assumtion made when the algorithms are derived. But even with this huge training set, in each iteration only some few (1 to 10 randomly chosen) subsets of the 50 available subsets are used, and this way the 'fixed training set in each iteration' assumtion is violated. The dictionaries for the first three cases can be made by the Matlab commands below, the results can be displayed by testDLplot.m function.
p1 = struct('vecSel','mexOMP', 'targetPSNR',38, 'mainIt',1000); p2 = struct('vecSel','mexOMP', 'targetPSNR',38, 'mainIt',500); p3 = struct('vecSel','mexLasso', 'targetPSNR',35, 'mainIt',400); % more parameters can be used to do logging in each iteration during learning pExtra = struct('useTestData',true, 'verbose', 1, ... 'avg_w0',1, 'avg_w1',1, 'avg_r2',1, 'avg_rr',1 ); res1 = testDL_MOD(p1); % or res1 = testDL_MOD(p1,pExtra); res2 = testDL_KSVD(p2); % or res2 = testDL_KSVD(p2,pExtra); res3 = testDL_MOD(p3); % or res3 = testDL_MOD(p3,pExtra);
The dictionaries in case 4 were made by the ex414.m mfile used in experiment 1, but with different parameters. The last 4 cases were all made by the ex413.m mfile, this is a function and may be called with several parameters to select which variant of ODL or LSDLA to run. The two different methods were implemented by the same function since the implementation should ensure that the two methods, ODL and LSDLA, can be run under exactly the same conditions. It should be possible for both methods to use excactly the same scheme for the forgetting factor and for the minibatch approach, the only difference is the few code lines where the actual dictionary update is done. ODL uses the equation 3.4 in section 3.3 above, while LSDLA uses the equation 3.3 in section 3.1 above. The LSDLA scheme is actually how we would implement a minibatch variant of RLSDLA. We see that the true RLSDLA implemetation in case 4 and the minibatch implementation in case 5 give dictionaries with the same properties. The dictionaries for the last five cases can be made by the Matlab commands below, the results can be displayed by testDLplot.m function.
clear all; batches=250; dataLength=20000; lambda0=0.996; targetPSNR=40; ex414; res4 = res; % total number of vectors processed is: sum(mb(:,1).*mb(:,2)) mb = [2000,50; 5000,100; 10000,200; 5000,250; 2300,500]; p5 = struct('vecSel','mexOMP', 'lambda0',0.996, 'targetPSNR',40, ... 'ODLvariant',5, 'minibatch',mb); res5 = ex413(p5); p6 = p5; p6.ODLvariant = 2; res6 = ex413(p6); p7 = struct('vecSel','mexLasso', 'lambda0',0.998, 'targetPSNR',35, ... 'ODLvariant',5, 'minibatch',mb); res7 = ex413(p7); p8 = p7; p8.ODLvariant = 2; res8 = ex413(p8);
The distance between two dictionaries, β(D,D')
in Table 3 in SPIE paper,
shows that dictionaries learned by the same method are in fact quite different from another,
even though their properties are almost the same.
The variance (or standard deviation) is relativly small for all values in Table 3, also for
the β(D,D')
measure. That means that if two dictionaries are learnt by the
same method, and the same parameter set, the expected distance between them will be expected to
be as given in Table 3, standard deviation will be less than 1 degree.
We can also measure the distance between dictionaries from different cases and present the
results in a table, the column in Table 3 in the SPIE paper is the diagonal in this table.
Case  1  2  3  4  5  6  7  8 

1  33.40  35.86  39.90  35.62  35.67  35.38  40.29  36.68 
2  .  37.07  42.49  38.45  38.49  38.29  42.93  38.87 
3  .  .  35.90  40.24  40.29  40.04  36.80  42.58 
4  .  .  .  35.57  35.60  35.36  40.39  39.39 
5  .  .  .  .  35.75  35.44  40.47  39.56 
6  .  .  .  .  .  34.81  40.20  39.11 
7  .  .  .  .  .  .  36.85  43.17 
8  .  .  .  .  .  .  .  28.57 
We note that the thre cases 4, 5 and 6 are almost equal to each other, especially 4 and 5.
Also case 3 and case 7 are quite equal to each other, these numbers ar shown bold in the table.
This table also hows that dictionaries for case 8 are quite special, they are different from the other cases
and also more clustered, i.e. more equal to each other in the β(D,D')
measure.
For experiment 3 we just refer to the SPIE paper.
The experiments are done in Matlab supported by Java.
The Java package mpv2
is a collection of all the Java classes used here.
mpv2
stands for Matrix Package version 2 (or Matching Pursuit version 2).
Note that mpv2
is compiled for Java 1.5 and do not work with older Java Virtual Machines (JVM),
upgrading JVM in Matlab is possible, see
Mathworks support.
mpv2
has classes for Matching Pursuit based vector selection algorithms and for Dictionary Learning Algorithms.
The ILSDLA (MOD) and RLSDLA methods are implemented in this package.
As the classes for MP and DLA use matrices extensively, effective Java
implementations of some matrix operations were also needed. This was done by
including the classes from the JAMAmatrix package
and wrapping an abstract matrix class
AllMatrices
around the implementations of different kind of matrices, including the JAMAmatrix class.
The mpv2
package works well,
and I think that parts of the package are well designed and easy to use.
But the mpv2
package must still be regarded to be incomplete and under development.
To use the mpv2
Java package from Matlab you need to copy the classfiles
from mpv2class.zip
into a catalog available from a Java Path in Matlab.
For example you may copy the unzipped classfiles into the catalog F:/matlab/javaclasses/mpv2/
,
and make the catalog F:/matlab/javaclasses/
available in Matlab as a
DYNAMIC JAVA PATH
(see Matlab javaclasspath
).
Documentation is in mpv2
.
Java files, i.e. source code, for most classes in the mpv2
package are in
mpv2java.zip
.
I consider the left out classes, DictionaryLearning
and MatchingPursuit
,
to be of less general interest.
If you want to run these experiments you should make a catalog in Matlab where you put all Matlabfiles
from the table below. To test that the files are properly installed you start Matlab and go to
the catalog where the files are installed. In the Matlab command window type
help lambdafun
to display some simple help text, then go on and type
lambdafun('col');
to run the function and make a plot.
The file java_access
check that the access to the mpv2
Java package is ok.
You should probably edit this mfile to match the catalog where mpv2
is installed.
When Java access is ok you may run bestAR1
. This file uses the ORMP algorithm (javaORMP)
to do a sparse representation of the AR(1) training set using a predesigned dictionary.
matfiles  Text 

bestDforAR1.mat  The matfile which store the best dictionary for the AR(1) data. 
dataXforAR1.mat  The matfile which store the training, and test, data set. Size is 16x4000. 
dict_ahoc.mat, dict_data.mat dict_orth.mat, dict_rand.mat dict_sine.mat 
5 example dictionaries used in section 5.2 above. 
ex414_Jun250624.mat  An example dictionary made by ex414.m, see section 5.2 above. 
testDLdataA.mat  The matfile with data used for the SRCmeasure in section 5.2 above. Size is 64x25600. 
mfiles  Text 
bestAR1.m  Load and test the best (found so far and using ORMP with s=4) dictionary for the special set of AR(1) data, i.e. bestDforAR1.mat . 
datamake.m  Generate a random data set using a given dictionary. 
dictdiff.m  Return a measure for the distance between two dictionaries. 
dictlearn_mb.m  Learn a dictionary by a minibatch RLSDLA variant. 
dictmake.m  Make a random dictionary, iid Gaussian or uniform entries. 
dictnormalize.m  Normalize and arrange the vectors of a dictionary (frame). 
dictprop.m  Return properties for a normalized dictionary (frame). 
dictshow.m  Show atoms, basis images, for a dictionary or a transform. Needs mycol2im.m from ICTools. 
dlfun.m  A slow Matlab only Dictionary Learning function, can be used for MOD, KSVD, ODL or RLS. 
ex111.m, ex112.m  mfiles for first experiment presented in section 4.1 here. 
ex121.m  mfile for second experiment presented in section 4.1 here. 
ex131.m, ex132.m  mfiles for third experiment presented in section 4.1 here. 
ex210.m, ex211.m  mfiles for generating data and displaying results for experiment presented in section 4.2 here. ex210.m is extended with more learning alternatives April 2013. 
ex300.m  mfile that tests sparseapprox.m, using the matfiles above. 
ex311.m, ex312.m ex313.m, ex315.m 
Dictionary learning for images, section 5.1 above. The metods are ILSDLA (MOD), RLSDLA, KSVD and separable MOD. 
ex31prop.m  Display some properties for dictionaries stored in ex31*.mat files. 
ex413.m, ex414.m  Dictionary learning for (BSDS) images, section 5.2 above. 
getNxKsine.m  Get a NxK dictionary (frame) with sine elements. 
get8x21haar.m  Get an 8x21 regular Haar dictionary (frame), used in ex315.m . 
get8x21sine.m  Get an 8x21 dictionary (frame) with sine elements, used in ex315.m . 
get16x144.m  Get a 16x144 tight dictionary (frame). 
get64x256.m, get64xK.m  Get a 64x256, or a 64xK, tight dictionary (frame). 
get64x256sine.m  Get a 64x256, i.e. (8x8)x(16x16), separable dictionary with sine elements. 
getXfrom8images.m  Get (training) vectors generated from 8 example images. 
getXfromBSDS300.m  Get (training) vectors generated from images copied from the Berkeley image segmentation set. 
lambdafun.m  Can be used to set the forgetting factor λ in RLSDLA. 
testDL_MOD.m, testDL_RLS.m testDL_KSVD.m, testplot.m 
Dictionary learning for (BSDS) images, section 5.2 above, and a special varint for plotting (showing) dictionary properties. 
more mfiles  Text 
assignmentoptimal.m  Compute optimal assignment by Munkres algorithm (by Markus Buehren). 
java_access.m  Check, and try to establish, access to mpv2 javapackage. Some statements in this file should be changed to reflect the catalog structure on your computer, and perhaps also use a mfile like below to return the relevant path. 
my_matlab_path.m  Return the relevant path for (some of) my matlab catalogs. 
sparseapprox.m  Returns coefficients in a sparse approximation of each column in a matrix X . 
zipfiles  Text 
mpv2class.zip 
The classfiles for the mpv2 Java package. 
mpv2java.zip 
The javafiles for most of the classes in the mpv2 Java package. 
USCimages_bmp.zip 
Some images stored as bmp files, once copied from the
University of Southern CaliforniaSIPI image database. The directory where these are stored should be written into the getXfrom8images.m file. 
Last update: July 2, 2014. Karl Skretting  Visit my home page. 