7.9. Subgaussian Distributions#

In this section we review subgaussian distributions and matrices drawn from subgaussian distributions.

Examples of subgaussian distributions include

  • Gaussian distribution

  • Rademacher distribution taking values $\pm 1$

  • Any zero mean distribution with a bounded support

Definition 7.33

A random variable $X$ is called subgaussian if there exists a constant $c > 0$ such that

$$
M_X(t) = \mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{c^2 t^2}{2}\right) \tag{7.2}
$$

holds for all $t \in \mathbb{R}$. We use the notation $X \sim \mathrm{Sub}(c^2)$ to denote that $X$ satisfies the constraint (7.2). We also say that $X$ is $c$-subgaussian.

$\mathbb{E}[\exp(Xt)]$ is the moment generating function of $X$.

$\exp\left(\frac{c^2 t^2}{2}\right)$ is the moment generating function of a Gaussian random variable with variance $c^2$.

The definition means that for a subgaussian variable $X$, its M.G.F. is bounded by the M.G.F. of a Gaussian random variable $\mathcal{N}(0, c^2)$.

Example 7.16 (Gaussian r.v. as subgaussian r.v.)

Consider a zero-mean Gaussian random variable $X \sim \mathcal{N}(0, \sigma^2)$ with variance $\sigma^2$. Then

$$
\mathbb{E}[\exp(Xt)] = \exp\left(\frac{\sigma^2 t^2}{2}\right).
$$

Putting $c = \sigma$ we see that (7.2) is satisfied. Hence $X \sim \mathrm{Sub}(\sigma^2)$ is a subgaussian r.v.; i.e., $X$ is $\sigma$-subgaussian.

Example 7.17 (Rademacher distribution)

Consider $X$ with

$$
P_X(x) = \frac{1}{2}\delta(x - 1) + \frac{1}{2}\delta(x + 1),
$$

i.e. $X$ takes the value $1$ with probability $0.5$ and the value $-1$ with probability $0.5$.

Then

$$
\mathbb{E}[\exp(Xt)] = \frac{1}{2}\exp(t) + \frac{1}{2}\exp(-t) = \cosh t \leq \exp\left(\frac{t^2}{2}\right).
$$

Thus $X \sim \mathrm{Sub}(1)$, or $X$ is 1-subgaussian.
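As a quick sanity check, here is a minimal numerical sketch (not part of the original argument; it assumes NumPy is available) verifying on a grid of $t$ values that the Rademacher M.G.F. $\cosh t$ never exceeds the Gaussian bound $\exp(t^2/2)$:

```python
import numpy as np

# Check that the Rademacher MGF cosh(t) stays below the Gaussian MGF exp(t^2 / 2).
t = np.linspace(-5, 5, 1001)
mgf = np.cosh(t)
bound = np.exp(t**2 / 2)

print(np.max(mgf / bound))  # stays <= 1; the maximum value 1 is attained at t = 0
```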

Example 7.18 (Uniform distribution)

Consider $X$ as uniformly distributed over the interval $[-a, a]$ for some $a > 0$, i.e.

$$
f_X(x) = \begin{cases} \frac{1}{2a} & -a \leq x \leq a \\ 0 & \text{otherwise.} \end{cases}
$$

Then

$$
\mathbb{E}[\exp(Xt)] = \frac{1}{2a}\int_{-a}^{a} \exp(xt)\, dx = \frac{1}{2at}\left[e^{at} - e^{-at}\right] = \sum_{n=0}^{\infty} \frac{(at)^{2n}}{(2n+1)!}.
$$

But $(2n+1)! \geq n!\, 2^n$. Hence we have

$$
\sum_{n=0}^{\infty} \frac{(at)^{2n}}{(2n+1)!} \leq \sum_{n=0}^{\infty} \frac{(at)^{2n}}{n!\, 2^n} = \sum_{n=0}^{\infty} \frac{(a^2 t^2 / 2)^n}{n!} = \exp\left(\frac{a^2 t^2}{2}\right).
$$

Thus

$$
\mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{a^2 t^2}{2}\right).
$$

Hence $X \sim \mathrm{Sub}(a^2)$, or $X$ is $a$-subgaussian.

Example 7.19 (Random variable with bounded support)

Consider $X$ as a zero-mean, bounded random variable, i.e.

$$
\mathbb{P}(|X| \leq B) = 1
$$

for some $B \in \mathbb{R}^+$ and

$$
\mathbb{E}(X) = 0.
$$

Then the following upper bound holds:

$$
\mathbb{E}[\exp(Xt)] = \int_{-B}^{B} \exp(xt)\, f_X(x)\, dx \leq \exp\left(\frac{B^2 t^2}{2}\right).
$$

This result can be proven with a convexity argument (it is a special case of Hoeffding's lemma). Thus $X \sim \mathrm{Sub}(B^2)$, or $X$ is $B$-subgaussian.
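The bound can also be checked numerically for a specific bounded distribution. The sketch below is illustrative only; the particular distribution is a hypothetical choice and NumPy is assumed. It uses a zero-mean discrete variable supported on $[-1, 1]$, so $B = 1$:

```python
import numpy as np

# A zero-mean discrete variable supported on [-1, 1] (B = 1), chosen for illustration:
# values (-1, 0, 0.5) with probabilities (0.2, 0.4, 0.4).
values = np.array([-1.0, 0.0, 0.5])
probs = np.array([0.2, 0.4, 0.4])
assert abs(values @ probs) < 1e-12  # verify the zero-mean assumption

B = 1.0
t = np.linspace(-10, 10, 2001)
mgf = (probs[:, None] * np.exp(values[:, None] * t)).sum(axis=0)
bound = np.exp(B**2 * t**2 / 2)

print(np.max(mgf / bound))  # <= 1, consistent with X ~ Sub(B^2)
```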

There are some useful properties of subgaussian random variables.

Lemma 7.7 (Mean and variance of subgaussian random variables)

If $X \sim \mathrm{Sub}(c^2)$ then

$$
\mathbb{E}(X) = 0
$$

and

$$
\mathbb{E}(X^2) \leq c^2.
$$

Thus subgaussian random variables are always zero-mean. Their variance is always bounded by the variance of the bounding Gaussian distribution.

Proof. We proceed as follows:

  1. Note that

    $$
    \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n) = \mathbb{E}\left(\sum_{n=0}^{\infty} \frac{(Xt)^n}{n!}\right) = \mathbb{E}(\exp(Xt)).
    $$

  2. But since $X \sim \mathrm{Sub}(c^2)$, hence

    $$
    \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n) \leq \exp\left(\frac{c^2 t^2}{2}\right) = \sum_{n=0}^{\infty} \frac{c^{2n} t^{2n}}{2^n\, n!}.
    $$

  3. Restating,

    $$
    \mathbb{E}(X)\, t + \mathbb{E}(X^2) \frac{t^2}{2!} \leq \frac{c^2 t^2}{2} + o(t^2) \quad \text{as } t \to 0.
    $$

  4. Dividing throughout by $t > 0$ and letting $t \to 0$ we get $\mathbb{E}(X) \leq 0$.

  5. Dividing throughout by $t < 0$ and letting $t \to 0$ we get $\mathbb{E}(X) \geq 0$.

  6. Thus $\mathbb{E}(X) = 0$. So $\mathrm{Var}(X) = \mathbb{E}(X^2)$.

  7. Now we are left with

    $$
    \mathbb{E}(X^2) \frac{t^2}{2!} \leq \frac{c^2 t^2}{2} + o(t^2) \quad \text{as } t \to 0.
    $$

  8. Dividing throughout by $t^2$ and letting $t \to 0$ we get $\mathrm{Var}(X) \leq c^2$.

Subgaussian variables have a linear structure.

Theorem 7.24 (Linearity of subgaussian variables)

If $X \sim \mathrm{Sub}(c^2)$, i.e. $X$ is $c$-subgaussian, then for any $\alpha \in \mathbb{R}$, the r.v. $\alpha X$ is $|\alpha| c$-subgaussian.

If $X_1, X_2$ are r.v.s such that $X_i$ is $c_i$-subgaussian, then $X_1 + X_2$ is $(c_1 + c_2)$-subgaussian.

Proof. Scalar multiplication:

  1. Let $X$ be $c$-subgaussian.

  2. Then

    $$
    \mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{c^2 t^2}{2}\right).
    $$

  3. Now for $\alpha \neq 0$, we have

    $$
    \mathbb{E}[\exp(\alpha X t)] \leq \exp\left(\frac{\alpha^2 c^2 t^2}{2}\right) = \exp\left(\frac{(|\alpha| c)^2 t^2}{2}\right).
    $$

  4. Hence $\alpha X$ is $|\alpha| c$-subgaussian.

Addition:

  1. Consider $X_1$ as $c_1$-subgaussian and $X_2$ as $c_2$-subgaussian.

  2. Thus

    $$
    \mathbb{E}(\exp(X_i t)) \leq \exp\left(\frac{c_i^2 t^2}{2}\right).
    $$

  3. Let $p, q > 1$ be two numbers s.t. $\frac{1}{p} + \frac{1}{q} = 1$.

  4. Using Hölder's inequality, we have

    $$
    \begin{aligned}
    \mathbb{E}(\exp((X_1 + X_2)t))
    &\leq \left[\mathbb{E}\left(\exp(X_1 t)^p\right)\right]^{\frac{1}{p}} \left[\mathbb{E}\left(\exp(X_2 t)^q\right)\right]^{\frac{1}{q}}
    = \left[\mathbb{E}(\exp(p X_1 t))\right]^{\frac{1}{p}} \left[\mathbb{E}(\exp(q X_2 t))\right]^{\frac{1}{q}}\\
    &\leq \left[\exp\left(\frac{(p c_1)^2 t^2}{2}\right)\right]^{\frac{1}{p}} \left[\exp\left(\frac{(q c_2)^2 t^2}{2}\right)\right]^{\frac{1}{q}}
    = \exp\left(\frac{t^2}{2}\left(p c_1^2 + q c_2^2\right)\right)
    = \exp\left(\frac{t^2}{2}\left(p c_1^2 + \frac{p}{p-1} c_2^2\right)\right).
    \end{aligned}
    $$

  5. Since this is valid for any $p > 1$, we can minimize the r.h.s. over $p > 1$.

  6. It suffices to minimize the term

    $$
    r = p c_1^2 + \frac{p}{p-1} c_2^2.
    $$

  7. We have

    $$
    \frac{\partial r}{\partial p} = c_1^2 - \frac{1}{(p-1)^2} c_2^2.
    $$

  8. Equating it to 0 gives us

    $$
    p - 1 = \frac{c_2}{c_1} \implies p = \frac{c_1 + c_2}{c_1} \implies \frac{p}{p-1} = \frac{c_1 + c_2}{c_2}.
    $$

  9. Taking the second derivative, we can verify that this is indeed a minimum.

  10. Thus

    $$
    r_{\min} = (c_1 + c_2)^2.
    $$

  11. Hence we have the result

    $$
    \mathbb{E}(\exp((X_1 + X_2)t)) \leq \exp\left(\frac{(c_1 + c_2)^2 t^2}{2}\right).
    $$

  12. Thus $X_1 + X_2$ is $(c_1 + c_2)$-subgaussian.

If $X_1$ and $X_2$ are independent, then $X_1 + X_2$ is $\sqrt{c_1^2 + c_2^2}$-subgaussian.
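The two bounds can be compared numerically. The following sketch is illustrative and not part of the original text (NumPy assumed); it takes $X_1, X_2$ to be independent Rademacher variables, each 1-subgaussian, so that the M.G.F. of $X_1 + X_2$ is $\cosh^2 t$:

```python
import numpy as np

# X1, X2: independent Rademacher variables, each 1-subgaussian.
# By independence, E[exp((X1 + X2) t)] = cosh(t)^2.
t = np.linspace(-5, 5, 1001)
mgf_sum = np.cosh(t) ** 2
bound_indep = np.exp((1**2 + 1**2) * t**2 / 2)  # sqrt(c1^2 + c2^2)-subgaussian bound
bound_general = np.exp((1 + 1)**2 * t**2 / 2)   # (c1 + c2)-subgaussian bound

print(np.max(mgf_sum / bound_indep))    # <= 1
print(np.max(mgf_sum / bound_general))  # <= 1, and far smaller for large |t|
```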

If $X$ is $c$-subgaussian then naturally $X$ is $d$-subgaussian for any $d \geq c$. A natural question arises: what is the minimum value of $c$ such that $X$ is $c$-subgaussian?

Definition 7.34 (Subgaussian moment)

For a centered random variable $X$, the subgaussian moment of $X$, denoted by $\sigma(X)$, is defined as

$$
\sigma(X) = \inf\left\{ c \geq 0 \;\middle|\; \mathbb{E}(\exp(Xt)) \leq \exp\left(\frac{c^2 t^2}{2}\right) \;\; \forall t \in \mathbb{R} \right\}.
$$

$X$ is subgaussian if and only if $\sigma(X)$ is finite.

We can also show that $\sigma(\cdot)$ is a norm on the space of subgaussian random variables, and that this normed space is complete.

For a centered Gaussian r.v. $X \sim \mathcal{N}(0, \sigma^2)$, the subgaussian moment coincides with the standard deviation: $\sigma(X) = \sigma$.
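For a variable whose M.G.F. is known, the subgaussian moment can be computed as $\sigma(X) = \sup_{t \neq 0} \sqrt{2 \ln \mathbb{E}[\exp(Xt)] / t^2}$. The sketch below (illustrative only; NumPy assumed) estimates this supremum numerically for a Rademacher variable and recovers $\sigma(X) = 1$:

```python
import numpy as np

# sigma(X) = sup_t sqrt(2 * ln E[exp(Xt)] / t^2); for Rademacher, E[exp(Xt)] = cosh(t).
t = np.linspace(1e-3, 20, 200_000)
log_mgf = np.log(np.cosh(t))
sigma_x = np.sqrt(np.max(2 * log_mgf / t**2))

print(sigma_x)  # approximately 1; the supremum is approached as t -> 0
```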

Sometimes it is useful to consider a more restrictive class of subgaussian random variables.

Definition 7.35 (Strictly subgaussian distribution)

A random variable $X$ is called strictly subgaussian if $X \sim \mathrm{Sub}(\sigma^2)$ where $\sigma^2 = \mathbb{E}(X^2)$, i.e. the inequality

$$
\mathbb{E}(\exp(Xt)) \leq \exp\left(\frac{\sigma^2 t^2}{2}\right)
$$

holds true for all $t \in \mathbb{R}$.

We will denote strictly subgaussian variables by $X \sim \mathrm{SSub}(\sigma^2)$.

Example 7.20 (Gaussian distribution)

If $X \sim \mathcal{N}(0, \sigma^2)$ then $X \sim \mathrm{SSub}(\sigma^2)$.

7.9.1. Characterization#

We quickly review Markov's inequality, which will help us establish the results in this subsection.

Let $X$ be a non-negative random variable and let $t > 0$. Then

$$
\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}(X)}{t}.
$$

Theorem 7.25

For a centered random variable X, the following statements are equivalent:

  1. Moment generating function condition: there exists some $c > 0$ such that

    $$
    \mathbb{E}[\exp(Xt)] \leq \exp\left(\frac{c^2 t^2}{2}\right) \quad \forall t \in \mathbb{R}.
    $$

  2. Subgaussian tail estimate: there exists $a > 0$ such that

    $$
    \mathbb{P}(|X| \geq \lambda) \leq 2\exp(-a\lambda^2) \quad \forall \lambda > 0.
    $$

  3. $\psi_2$-condition: there exists some $b > 0$ such that

    $$
    \mathbb{E}[\exp(b X^2)] \leq 2.
    $$

Proof. (1) $\Rightarrow$ (2)

  1. Using Markov's inequality, for any $t > 0$ we have

    $$
    \mathbb{P}(X \geq \lambda) = \mathbb{P}(tX \geq t\lambda) = \mathbb{P}\left(e^{tX} \geq e^{t\lambda}\right) \leq \frac{\mathbb{E}\left(e^{tX}\right)}{e^{t\lambda}} \leq \exp\left(-t\lambda + \frac{c^2 t^2}{2}\right).
    $$

  2. Since this is valid for every $t > 0$, it must also hold for the value of $t$ minimizing the r.h.s.

  3. The minimum value is obtained for $t = \frac{\lambda}{c^2}$.

  4. Thus we get

    $$
    \mathbb{P}(X \geq \lambda) \leq \exp\left(-\frac{\lambda^2}{2 c^2}\right).
    $$

  5. Since $X$ is $c$-subgaussian, $-X$ is also $c$-subgaussian.

  6. Hence

    $$
    \mathbb{P}(X \leq -\lambda) = \mathbb{P}(-X \geq \lambda) \leq \exp\left(-\frac{\lambda^2}{2 c^2}\right).
    $$

  7. Thus

    $$
    \mathbb{P}(|X| \geq \lambda) = \mathbb{P}(X \geq \lambda) + \mathbb{P}(X \leq -\lambda) \leq 2\exp\left(-\frac{\lambda^2}{2 c^2}\right).
    $$

  8. Thus we can choose $a = \frac{1}{2 c^2}$ to complete the proof.
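For a concrete illustration of the tail estimate (not part of the original proof; it uses Python's standard math module), take $X$ to be a standard Gaussian, which is 1-subgaussian, so the bound reads $\mathbb{P}(|X| \geq \lambda) \leq 2\exp(-\lambda^2/2)$:

```python
import math

# Exact two-sided Gaussian tail erfc(lam / sqrt(2)) versus the bound 2 exp(-lam^2 / 2).
for lam in [0.5, 1.0, 2.0, 3.0, 4.0]:
    tail = math.erfc(lam / math.sqrt(2))
    bound = 2 * math.exp(-lam**2 / 2)
    print(f"lambda={lam:.1f}  P(|X| >= lambda)={tail:.3e}  bound={bound:.3e}")
```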

(2) $\Rightarrow$ (3)

TODO PROVE THIS

$$
\mathbb{E}(\exp(b X^2)) = 1 + \int_0^{\infty} 2bt \exp(b t^2)\, \mathbb{P}(|X| > t)\, dt
$$

(3) $\Rightarrow$ (1)

TODO PROVE THIS

7.9.2. More Properties#

We also have the following result on the exponential moment of a subgaussian random variable.

Lemma 7.8

Suppose $X \sim \mathrm{Sub}(c^2)$. Then

$$
\mathbb{E}\left[\exp\left(\frac{\lambda X^2}{2 c^2}\right)\right] \leq \frac{1}{\sqrt{1 - \lambda}}
$$

for any $\lambda \in [0, 1)$.

Proof. We are given that

$$
\mathbb{E}(\exp(Xt)) \leq \exp\left(\frac{c^2 t^2}{2}\right)
\iff
\int_{-\infty}^{\infty} \exp(tx)\, f_X(x)\, dx \leq \exp\left(\frac{c^2 t^2}{2}\right) \quad \forall t \in \mathbb{R}.
$$

Multiplying on both sides with $\exp\left(-\frac{c^2 t^2}{2\lambda}\right)$:

$$
\int_{-\infty}^{\infty} \exp\left(tx - \frac{c^2 t^2}{2\lambda}\right) f_X(x)\, dx
\leq \exp\left(-\frac{c^2 t^2}{2}\,\frac{1 - \lambda}{\lambda}\right)
= \exp\left(-\frac{t^2}{2}\,\frac{c^2 (1 - \lambda)}{\lambda}\right).
$$

Integrating on both sides w.r.t. $t$ we get:

$$
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \exp\left(tx - \frac{c^2 t^2}{2\lambda}\right) f_X(x)\, dx\, dt
\leq \int_{-\infty}^{\infty} \exp\left(-\frac{t^2}{2}\,\frac{c^2 (1 - \lambda)}{\lambda}\right) dt,
$$

which, after evaluating the Gaussian integrals over $t$, reduces to:

$$
\frac{\sqrt{2\pi\lambda}}{c} \int_{-\infty}^{\infty} \exp\left(\frac{\lambda x^2}{2 c^2}\right) f_X(x)\, dx
\leq \frac{\sqrt{2\pi\lambda}}{c}\,\frac{1}{\sqrt{1 - \lambda}}
\iff
\mathbb{E}\left(\exp\left(\frac{\lambda X^2}{2 c^2}\right)\right) \leq \frac{1}{\sqrt{1 - \lambda}},
$$

which completes the proof.
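In the Gaussian case the lemma is tight: for $X \sim \mathcal{N}(0, \sigma^2)$ with $c = \sigma$, $\mathbb{E}[\exp(\lambda X^2 / (2c^2))]$ equals $1/\sqrt{1-\lambda}$ exactly. The sketch below (illustrative only; NumPy assumed) checks this by numerical integration against the Gaussian density:

```python
import numpy as np

# Numerically integrate exp(lam * x^2 / (2 sigma^2)) against the N(0, sigma^2) density.
sigma = 2.0
x = np.linspace(-80, 80, 400_001)
dx = x[1] - x[0]
pdf = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

for lam in [0.1, 0.5, 0.9]:
    moment = np.sum(np.exp(lam * x**2 / (2 * sigma**2)) * pdf) * dx
    print(f"lambda={lam}: integral={moment:.4f}  bound={1 / np.sqrt(1 - lam):.4f}")
```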

7.9.3. Subgaussian Random Vectors#

The linearity property of subgaussian r.v.s can also be extended to random vectors. This is stated more formally in the following result.

Theorem 7.26

Suppose that $X = [X_1, X_2, \dots, X_N]$, where each $X_i$ is i.i.d. with $X_i \sim \mathrm{Sub}(c^2)$. Then, for any $\alpha \in \mathbb{R}^N$, $\langle X, \alpha \rangle \sim \mathrm{Sub}(c^2 \|\alpha\|_2^2)$. Similarly, if each $X_i \sim \mathrm{SSub}(\sigma^2)$, then for any $\alpha \in \mathbb{R}^N$, $\langle X, \alpha \rangle \sim \mathrm{SSub}(\sigma^2 \|\alpha\|_2^2)$.
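A quick numerical check of this statement (illustrative, not part of the original text; NumPy assumed, and the weight vector below is an arbitrary hypothetical choice) uses Rademacher entries, for which the M.G.F. of $\langle X, \alpha \rangle$ factorizes as $\prod_i \cosh(\alpha_i t)$:

```python
import numpy as np

# Rademacher entries are Sub(1); <X, alpha> should then be Sub(||alpha||_2^2).
rng = np.random.default_rng(42)
alpha = rng.normal(size=8)              # arbitrary weight vector (hypothetical)
t = np.linspace(-5, 5, 1001)

mgf = np.prod([np.cosh(a * t) for a in alpha], axis=0)
bound = np.exp(np.sum(alpha**2) * t**2 / 2)

print(np.max(mgf / bound))  # <= 1
```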

Norm of a subgaussian random vector

  1. Let $X$ be a random vector where each $X_i$ is i.i.d. with $X_i \sim \mathrm{Sub}(c^2)$ and $\sigma^2 = \mathbb{E}(X_i^2)$.

  2. Consider the $\ell_2$ norm $\|X\|_2$. It is a random variable in its own right.

  3. It would be useful to understand the average behavior of the norm.

  4. Suppose $N = 1$. Then $\|X\|_2 = |X_1|$.

  5. Also $\|X\|_2^2 = X_1^2$. Thus $\mathbb{E}(\|X\|_2^2) = \sigma^2$.

  6. It looks like $\mathbb{E}(\|X\|_2^2)$ should be connected with $\sigma^2$ in general.

  7. The norm can increase or decrease compared to its average value.

  8. A ratio-based measure between the actual value and the average value would be useful.

  9. What is the probability that the norm increases beyond a given factor?

  10. What is the probability that the norm reduces beyond a given factor?

These bounds are stated formally in the following theorem.

Theorem 7.27

Suppose that $X = [X_1, X_2, \dots, X_N]$, where each $X_i$ is i.i.d. with $X_i \sim \mathrm{Sub}(c^2)$ and $\sigma^2 = \mathbb{E}(X_i^2)$. Then

$$
\mathbb{E}(\|X\|_2^2) = N \sigma^2. \tag{7.3}
$$

Moreover, for any $\alpha \in (0, 1)$ and for any $\beta \in \left[\frac{c^2}{\sigma^2}, \beta_{\max}\right]$, there exists a constant $\kappa \geq 4$ depending only on $\beta_{\max}$ and the ratio $\frac{\sigma^2}{c^2}$ such that

$$
\mathbb{P}(\|X\|_2^2 \leq \alpha N \sigma^2) \leq \exp\left(-\frac{N (1 - \alpha)^2}{\kappa}\right) \tag{7.4}
$$

and

$$
\mathbb{P}(\|X\|_2^2 \geq \beta N \sigma^2) \leq \exp\left(-\frac{N (\beta - 1)^2}{\kappa}\right). \tag{7.5}
$$
  • The first equation gives the average value of the square of the norm.

  • The second inequality gives an upper bound on the probability that the squared norm falls below its average by a factor $\alpha < 1$.

  • The third inequality gives an upper bound on the probability that the squared norm exceeds its average by a factor $\beta > 1$.

  • Note that if the $X_i$ are strictly subgaussian, then $c = \sigma$. Hence $\beta \in (1, \beta_{\max})$.

Proof. Since the $X_i$ are i.i.d., by linearity of expectation we have

$$
\mathbb{E}[\|X\|_2^2] = \mathbb{E}\left[\sum_{i=1}^{N} X_i^2\right] = \sum_{i=1}^{N} \mathbb{E}[X_i^2] = N\sigma^2.
$$

This proves the first part.

Now let us look at (7.5).

By applying Markov's inequality, for any $\lambda > 0$ we have:

$$
\mathbb{P}(\|X\|_2^2 \geq \beta N \sigma^2)
= \mathbb{P}\left(\exp(\lambda \|X\|_2^2) \geq \exp(\lambda \beta N \sigma^2)\right)
\leq \frac{\mathbb{E}\left(\exp(\lambda \|X\|_2^2)\right)}{\exp(\lambda \beta N \sigma^2)}
= \frac{\prod_{i=1}^{N} \mathbb{E}\left(\exp(\lambda X_i^2)\right)}{\exp(\lambda \beta N \sigma^2)}.
$$

Since $X_i$ is $c$-subgaussian, from Lemma 7.8 we have (for $0 < \lambda < \frac{1}{2 c^2}$)

$$
\mathbb{E}\left(\exp(\lambda X_i^2)\right) = \mathbb{E}\left(\exp\left(\frac{2 c^2 \lambda\, X_i^2}{2 c^2}\right)\right) \leq \frac{1}{\sqrt{1 - 2 c^2 \lambda}}.
$$

Thus:

$$
\prod_{i=1}^{N} \mathbb{E}\left(\exp(\lambda X_i^2)\right) \leq \left(\frac{1}{1 - 2 c^2 \lambda}\right)^{\frac{N}{2}}.
$$

Putting it back we get:

$$
\mathbb{P}(\|X\|_2^2 \geq \beta N \sigma^2) \leq \left(\frac{\exp(-2\lambda\beta\sigma^2)}{1 - 2 c^2 \lambda}\right)^{\frac{N}{2}}.
$$

Since the above is valid for any admissible $\lambda$, i.e. $0 < \lambda < \frac{1}{2 c^2}$, we can minimize the r.h.s. over $\lambda$ by setting the derivative w.r.t. $\lambda$ to 0.

Thus we get the optimum $\lambda$ as:

$$
\lambda = \frac{\beta\sigma^2 - c^2}{2 c^2 \beta \sigma^2}.
$$

Plugging this back we get:

$$
\mathbb{P}(\|X\|_2^2 \geq \beta N \sigma^2) \leq \left(\frac{\beta\sigma^2}{c^2} \exp\left(1 - \frac{\beta\sigma^2}{c^2}\right)\right)^{\frac{N}{2}}.
$$

Similarly proceeding for (7.4) we get

$$
\mathbb{P}(\|X\|_2^2 \leq \alpha N \sigma^2) \leq \left(\frac{\alpha\sigma^2}{c^2} \exp\left(1 - \frac{\alpha\sigma^2}{c^2}\right)\right)^{\frac{N}{2}}.
$$

We need to simplify these expressions. Consider the function

$$
f(\gamma) = \frac{2(\gamma - 1)^2}{(\gamma - 1) - \ln\gamma}, \quad \gamma > 0.
$$

By differentiating twice, we can show that this is a strictly increasing function. Let $\gamma \in (0, \gamma_{\max}]$. Define

$$
\kappa = \max\left(4, \frac{2(\gamma_{\max} - 1)^2}{(\gamma_{\max} - 1) - \ln\gamma_{\max}}\right).
$$

Clearly

$$
\kappa \geq \frac{2(\gamma - 1)^2}{(\gamma - 1) - \ln\gamma} \quad \forall \gamma \in (0, \gamma_{\max}].
$$

Which gives us:

$$
\ln(\gamma) \leq (\gamma - 1) - \frac{2(\gamma - 1)^2}{\kappa}.
$$

Hence by exponentiating on both sides we get:

$$
\gamma \leq \exp\left[(\gamma - 1) - \frac{2(\gamma - 1)^2}{\kappa}\right].
$$

By slight manipulation we get:

$$
\gamma \exp(1 - \gamma) \leq \exp\left[-\frac{2(1 - \gamma)^2}{\kappa}\right].
$$

We now choose

$$
\gamma = \frac{\alpha\sigma^2}{c^2}.
$$

Substituting we get:

$$
\mathbb{P}(\|X\|_2^2 \leq \alpha N \sigma^2) \leq \left(\gamma \exp(1 - \gamma)\right)^{\frac{N}{2}} \leq \exp\left[-\frac{N (1 - \gamma)^2}{\kappa}\right].
$$

Finally,

$$
c \geq \sigma \implies \frac{\sigma^2}{c^2} \leq 1 \implies \gamma \leq \alpha \implies 1 - \gamma \geq 1 - \alpha.
$$

Thus we get

$$
\mathbb{P}(\|X\|_2^2 \leq \alpha N \sigma^2) \leq \exp\left[-\frac{N (1 - \alpha)^2}{\kappa}\right].
$$

Similarly, choosing $\gamma = \frac{\beta\sigma^2}{c^2}$ proves the other bound.

We can now map $\gamma_{\max}$ to some $\beta_{\max}$ by:

$$
\gamma_{\max} = \frac{\beta_{\max}\sigma^2}{c^2}.
$$

This result tells us that, given a vector with entries drawn from a subgaussian distribution, we can expect its squared norm $\|X\|_2^2$ to concentrate around its expected value $N\sigma^2$.
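The concentration phenomenon is easy to observe empirically. The Monte Carlo sketch below (illustrative only, not part of the original text; NumPy assumed) uses i.i.d. $\mathcal{N}(0, 1)$ entries, for which $\sigma = c = 1$:

```python
import numpy as np

# Squared norms of 20,000 random vectors with N = 200 i.i.d. N(0, 1) entries.
rng = np.random.default_rng(0)
N, trials = 200, 20_000
sq_norms = (rng.normal(size=(trials, N)) ** 2).sum(axis=1)

print("mean of ||X||_2^2 / N:", sq_norms.mean() / N)   # close to sigma^2 = 1
for beta in [1.1, 1.2, 1.5]:
    print(f"P(||X||_2^2 >= {beta} N) ~ {np.mean(sq_norms >= beta * N):.4f}")
for alpha in [0.9, 0.8, 0.5]:
    print(f"P(||X||_2^2 <= {alpha} N) ~ {np.mean(sq_norms <= alpha * N):.4f}")
```

Both tail probabilities decay rapidly as the factors move away from 1, in line with (7.4) and (7.5).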