# 9.18. Smoothness#

Primary references for this section are [7].

## 9.18.1. L-Smooth Functions#

Definition 9.80 (L-smooth functions)

For some $$L \geq 0$$, a function $$f : \VV \to \RERL$$ is called $$L$$-smooth over a set $$D \subseteq \VV$$ if it is differentiable over $$D$$ and satisfies

$\| \nabla f(\bx) - \nabla f(\by)\|_* \leq L \| \bx - \by \| \Forall \bx, \by \in D.$

The constant $$L$$ is called the smoothness parameter.

• Since $$f$$ is differentiable over $$D$$, hence $$D \subseteq \interior \dom f$$.

• If $$f$$ is $$L$$-smooth over the entire $$\VV$$, we simply say that $$f$$ is $$L$$-smooth.

• $$L$$-smooth functions are also known as functions with Lipschitz gradient with constant $$L$$.

• The class of functions which are $$L$$-smooth over a set $$D$$ is denoted by $$C_L^{1,1}(D)$$.

• When $$D = \VV$$, then the class is simply denoted as $$C_L^{1,1}$$.

• The class of functions which are $$L$$-smooth for some $$L \geq 0$$ (but $$L$$ may not be known), is denoted by $$C^{1,1}$$.

• By definition, if a function is $$L_1$$-smooth, then it is $$L_2$$-smooth for every $$L_2 \geq L_1$$.

• Thus, it is often useful to identify the smallest possible value of $$L$$ for which a function is $$L$$-smooth.

Example 9.82 (Zero smoothness of affine functions)

Let $$\bb \in \VV^*$$ and $$c \in \RR$$. Let $$f : \VV \to \RR$$ be given by:

$f(\bx) = \langle \bx, \bb \rangle + c.$

Then, $$f$$ is $$0$$-smooth.

To see this, we note that $$\nabla f(\bx) = \bb$$. Consequently,

$\| \nabla f(\bx) - \nabla f(\by) \|_* = \| \bb - \bb \|_* = \| \bzero \|_* = 0 \leq 0 \| \bx - \by \|.$

Theorem 9.263 (Smoothness of quadratic functions)

Let $$\bA \in \SS^n$$, $$\bb \in \RR^n$$ and $$c \in \RR$$. Assume that $$\RR^n$$ is endowed with $$\ell_p$$-norm for some $$1 \leq p \leq \infty$$. Let $$f : \RR^n \to \RR$$ be given by:

$f(\bx) = \frac{1}{2} \bx^T \bA \bx + \bb^T \bx + c.$

Then, $$f$$ is $$L$$-smooth with $$L = \| \bA \|_{p, q}$$ where $$q \in [1,\infty]$$ is the conjugate exponent satisfying

$\frac{1}{p} + \frac{1}{q} = 1$

and $$\| \bA \|_{p, q}$$ is the induced norm given by

$\| \bA \|_{p, q} = \sup \{ \| \bA \bx \|_q \ST \| \bx \|_p \leq 1 \}.$

Proof. We note that the dual norm of $$\ell_p$$ norm is the $$\ell_q$$ norm. Now,

$\begin{split} & \| \nabla f(\bx) - \nabla f(\by)\|_* \\ &= \| \nabla f(\bx) - \nabla f(\by)\|_q \\ &= \| \bA \bx + \bb - \bA \by - \bb \|_q \\ &= \| \bA \bx - \bA \by \|_q \\ &\leq \| \bA \|_{p, q} \| \bx - \by \|_p. \end{split}$

Thus, $$f$$ is $$\| \bA \|_{p, q}$$ smooth.

We next show that $$\| \bA \|_{p, q}$$ is the smallest smoothness parameter for $$f$$.

1. Assume that $$f$$ is $$L$$ smooth for some $$L$$.

2. By definition of $$\| \bA \|_{p, q}$$, there exists a vector $$\tilde{\bx}$$ such that $$\| \tilde{\bx} \|_p = 1$$ and

$\| \bA \tilde{\bx} \|_q = \| \bA \|_{p, q} \| \tilde{\bx} \|_p = \| \bA \|_{p, q}.$
3. Then,

$\begin{split} \| \bA \|_{p, q} &= \| \bA \tilde{\bx} \|_q\\ &= \| \bA \tilde{\bx} + \bb - \bA \bzero - \bb \|_q\\ &= \| \nabla f(\tilde{\bx}) - \nabla f(\bzero) \|_q \\ &\leq L \| \tilde{\bx} - \bzero \|_p = L. \end{split}$
4. Thus, $$\| \bA \|_{p, q} \leq L$$.

Thus, $$\| \bA \|_{p, q}$$ is indeed the smallest smoothness parameter for $$L$$.

### 9.18.1.1. Descent Lemma#

Theorem 9.264 (Descent lemma)

Let $$f : \VV \to \RERL$$ be $$L$$-smooth for some $$L \geq 0$$ over some convex set $$D$$. Then for any $$\bx, \by \in D$$,

$f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L}{2} \| \bx - \by \|^2.$

Proof. we proceed as follows:

1. By the fundamental theorem of calculus

$f(\by) - f(\bx) = \int_0^1 \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) \rangle dt.$
2. By adding and subtracting $$\langle \by - \bx, \nabla f(\bx) \rangle$$, we get:

$f(\by) - f(\bx) = \langle \by - \bx, \nabla f(\bx) \rangle + \int_0^1 \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \rangle dt.$
3. This gives us

$\begin{split} & | f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle | \\ &= \left | \int_0^1 \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \rangle dt \right | \\ &\leq \int_0^1 | \langle \by - \bx, \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \rangle | dt \\ &\leq \int_0^1 \| \by - \bx \| \| \nabla f(\bx + t(\by - \bx)) - \nabla f(\bx) \|_* dt & \text{ (a) } \\ &\leq \int_0^1 \| \by - \bx \| tL \| \by - \bx \| dt & \text { (b) } \\ &= \int_0^1 tL \| \by - \bx \|^2 dt \\ &= L \| \by - \bx \|^2 \int_0^1 t dt \\ &= \frac{L}{2} \| \by - \bx \|^2. \end{split}$
• (a) is an application of Generalized Cauchy Schwartz inequality (Theorem 4.108}).

• (b) is the application of $$L$$-smoothness of $$f$$ (Definition 9.80).

4. Thus,

$\begin{split} & | f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle | \leq \frac{L}{2} \| \by - \bx \|^2 \\ & \implies f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle \leq \frac{L}{2} \| \by - \bx \|^2 \\ & \implies f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L}{2} \| \by - \bx \|^2. \end{split}$

### 9.18.1.2. Characterization of $$L$$-Smooth Functions#

Theorem 9.265 (Characterization of $$L$$-smooth functions)

Let $$f : \VV \to \RR$$ be convex and differentiable over $$\VV$$. Let $$L > 0$$. The following claims are equivalent:

1. $$f$$ is $$L$$-smooth.

2. $$f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L}{2} \| \bx - \by \|^2 \Forall \bx, \by \in \VV$$.

3. $$f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{1}{2L} \| \nabla f (\bx) - \nabla f(\by) \|_*^2 \Forall \bx, \by \in \VV$$.

4. $$\langle \bx - \by, \nabla f (\bx) - \nabla f(\by) \rangle \geq \frac{1}{L} \| \nabla f (\bx) - \nabla f(\by) \|_*^2 \Forall \bx, \by \in \VV$$.

5. $$f(t \bx + (1-t) \by) \geq t f(\bx) + (1-t) f(\by) - \frac{L}{2} t (1 - t) \| \bx - \by \|^2 \Forall \bx, \by \in \VV, t \in [0, 1]$$.

Proof. (1) $$\implies$$ (2). This is a direct implication of the descent lemma (Theorem 9.264).

(2) $$\implies$$ (3)

1. We are given that (2) is satisfied.

2. If $$\nabla f(\bx) = \nabla f(\by)$$, then the inequality is trivial due to the convexity of $$f$$. Hence, we consider the case where $$\nabla f(\bx) \neq \nabla f(\by)$$.

3. Fix a $$\bx \in \VV$$.

4. Consider a function $$g_{\bx} : \VV \to \RR$$ given by

$g_{\bx}(\by) = f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle.$
5. Then,

$\nabla g_{\bx}(\by) = \nabla f(\by) - \nabla f(\bx).$
6. By hypothesis in property (2), for any $$\bz \in \VV$$

$f(\bz) \leq f(\by) + \langle \bz - \by, \nabla f(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2.$
7. Now,

$\begin{split} & g_{\bx}(\bz) \\ &= f(\bz) - f(\bx) - \langle \bz - \bx, \nabla f(\bx) \rangle \\ &\leq f(\by) + \langle \bz - \by, \nabla f(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2 - f(\bx) - \langle \bz - \bx, \nabla f(\bx) \rangle\\ &= f(\by) - f(\bx) - \langle \bz - \bx, \nabla f(\bx) \rangle + \langle \bz - \by, \nabla f(\bx) \rangle - \langle \bz - \by, \nabla f(\bx) \rangle\\ &+ \langle \bz - \by, \nabla f(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2\\ &= f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle + \langle \bz - \by, \nabla f(\by) - \nabla f(\bx) \rangle + \frac{L}{2} \| \bz - \by \|^2\\ &= g_{\bx}(\by) + \langle \bz - \by, \nabla g_{\bx}(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2. \end{split}$
8. Thus, $$g_{\bx}$$ also satisfies the inequality in property (2).

9. We note in particular that $$\nabla g_{\bx} (\bx) = \nabla f(\bx) - \nabla f(\bx) = \bzero$$.

10. Since $$g_{\bx}$$ is convex, hence $$\bx$$ is the global minimizer of $$g_{\bx}$$.

11. In other words,

$g_{\bx}(\bx) \leq g_{\bx}(\bz) \Forall \bz \in \VV.$
12. We can also see that $$g_{\bx}(\bx) = f(\bx) - f(\bx) - \langle \bx - \bx, \nabla f(\bx) \rangle = 0$$.

13. Let $$\by \in \VV$$

14. Let $$\bv \in \VV$$ be the unit norm vector satisfying $$\| \nabla g_{\bx}(\by) \|_* = \langle \bv , \nabla g_{\bx}(\by) \rangle$$.

15. Choose

$\bz = \by - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \bv.$
16. Then,

$0 = g_{\bx}(\bx) \leq g_{\bx}(\bz) = g_{\bx}\left (\by - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \bv \right ).$
17. Using property (2) on $$g_{\bx}(\bz)$$, we get

$\begin{split} 0 &\leq g_{\bx}(\bz) \\ &\leq g_{\bx}(\by) + \langle \bz - \by, \nabla g_{\bx}(\by) \rangle + \frac{L}{2} \| \bz - \by \|^2 \\ &= g_{\bx}(\by) - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \langle \bv, \nabla g_{\bx}(\by) \rangle + \frac{L}{2} \left \| \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \bv \right \|^2 \\ &= g_{\bx}(\by) - \frac{\| \nabla g_{\bx}(\by) \|_*}{L} \| \nabla g_{\bx}(\by) \|_* + \frac{1}{2 L} \| \nabla g_{\bx}(\by) \|_*^2 \\ &= g_{\bx}(\by) - \frac{1}{2 L} \| \nabla g_{\bx}(\by) \|_*^2 \\ &= f(\by) - f(\bx) - \langle \by - \bx, \nabla f(\bx) \rangle - \frac{1}{2 L} \| \nabla f(\by) - \nabla f(\bx) \|_*^2. \end{split}$
18. Simplifying this, we get

$f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{1}{2 L} \| \nabla f(\by) - \nabla f(\bx) \|_*^2$

as desired.

(3) $$\implies$$ (4)

1. For $$\bx, \by$$, the property (3) gives us:

$f(\by) \geq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{1}{2 L} \| \nabla f(\by) - \nabla f(\bx) \|_*^2.$
2. For $$\by, \bx$$, the property (3) gives us:

$f(\bx) \geq f(\by) + \langle \bx - \by, \nabla f(\by) \rangle + \frac{1}{2 L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2.$
3. Adding the two inequalities and canceling the term $$f(\bx) + f(\by)$$ gives us

$0 \geq \langle \bx - \by, \nabla f(\by) - f(\bx) \rangle + \frac{1}{L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2.$
4. Rearranging, we get

$\langle \bx - \by, \nabla f(\bx) - f(\by) \rangle \geq \frac{1}{L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2$

as desired.

(4) $$\implies$$ (1)

1. When $$\nabla f(\bx) = \nabla f(\by)$$, then the Lipschitz condition in (1) is trivial. Hence, we consider the case where $$\nabla f(\bx) \neq \nabla f(\by)$$.

2. By generalized Cauchy Schwartz inequality (Theorem 4.108)

$\langle \bx - \by, \nabla f(\bx) - f(\by) \rangle \leq \| \bx - \by \| \| f(\bx) - f(\by) \|_*.$
3. Thus, combining with hypothesis (4), we obtain

$\frac{1}{L} \| \nabla f(\bx) - \nabla f(\by) \|_*^2 \leq \| \bx - \by \| \| f(\bx) - f(\by) \|_*.$
4. Since $$\nabla f(\bx) \neq \nabla f(\by)$$, hence $$\| f(\bx) - f(\by) \|_* > 0$$.

5. Canceling it from both sides, we get

$\| \nabla f(\bx) - \nabla f(\by) \|_* \leq L \| \bx - \by \|$

as desired.

We have shown so far that (1), (2), (3) and (4) are equivalent statements. We are left with showing that (5) is equivalent to the other statements.

(2) $$\implies$$ (5)

1. Pick $$\bx, \by \in \VV$$ and $$t \in [0,1]$$.

2. Let $$\bz = t \bx + (1-t) \by$$.

3. By hypothesis (2),

$\begin{split} & f(\bx) \leq f(\bz) + \langle \bx - \bz, \nabla f(\bz) \rangle + \frac{L}{2} \| \bx - \bz \|^2;\\ & f(\by) \leq f(\bz) + \langle \by - \bz, \nabla f(\bz) \rangle + \frac{L}{2} \| \by - \bz \|^2. \end{split}$
4. Note that $$\bx - \bz = (1-t) (\bx - \by)$$ and $$\by - \bz = t (\by - \bx)$$.

5. Thus, the previous two inequalities are same as

$\begin{split} & f(\bx) \leq f(\bz) + (1-t)\langle \bx - \by, \nabla f(\bz) \rangle + \frac{L (1-t)^2}{2} \| \bx - \by \|^2;\\ & f(\by) \leq f(\bz) + t\langle \by - \bx, \nabla f(\bz) \rangle + \frac{L t^2}{2} \| \bx - \by \|^2. \end{split}$
6. Multiplying the first inequality by $$t$$, the second by $$(1-t)$$ and adding, we get

$t f(\bx) + (1-t) f(\by) \leq f(\bz) + \frac{L t(1-t)}{2} \| \bx - \by \|^2.$
7. Rearranging, we get

$f(t \bx + (1-t) \by) = f(\bz) \geq t f(\bx) + (1-t) f(\by) - \frac{L }{2} t(1-t) \| \bx - \by \|^2.$

(5) $$\implies$$ (2)

1. Pick $$\bx, \by \in \VV$$ and $$t \in (0,1)$$.

2. By hypothesis in inequality (5)

$f(t \bx + (1-t) \by) \geq t f(\bx) + (1-t) f(\by) - \frac{L }{2} t(1-t) \| \bx - \by \|^2.$
3. Rearranging the terms, we obtain

$\begin{split} & (1-t) f(\by) \leq f(t \bx + (1-t) \by) - t f(\bx) + \frac{L }{2} t(1-t) \| \bx - \by \|^2 \\ &\iff (1-t) f(\by) \leq f(t \bx + (1-t) \by) - f(\bx) + (1 - t) f(\bx) + \frac{L }{2} t(1-t) \| \bx - \by \|^2 \\ &\iff f(\by) \leq f(\bx) + \frac{f(t \bx + (1-t) \by) - f(\bx)}{1-t} + \frac{L }{2} t \| \bx - \by \|^2. \end{split}$

Division by $$(1-t)$$ is fine since $$(1-t) \in (0, 1)$$.

4. Recalling the definition of directional derivative (Definition 9.70):

$\begin{split} &\lim_{t \to 1^-} \frac{f(t \bx + (1-t) \by) - f(\bx)}{1-t} \\ &= \lim_{s \to 0^+} \frac{f( (1-s) \bx + s \by) - f(\bx)}{s} \\ &= \lim_{s \to 0^+} \frac{f( \bx + s (\by - \bx) ) - f(\bx)}{s}\\ &= f'(\bx; \by - \bx). \end{split}$
5. Since the previous inequality is valid for every $$t \in (0,1)$$, taking the limit to $$t \to 1^-$$ on the R.H.S., we obtain

$f(\by) \leq f(\bx) + f'(\bx; \by - \bx) + \frac{L }{2} \| \bx - \by \|^2.$
6. Recall from Theorem 9.203 that $$f'(\bx; \by - \bx) = \langle \by - \bx, \nabla f(\bx) \rangle$$.

7. Thus, we get:

$f(\by) \leq f(\bx) + \langle \by - \bx, \nabla f(\bx) \rangle + \frac{L }{2} \| \bx - \by \|^2.$

as desired.

### 9.18.1.3. Second Order Characterization#

We now restrict our attention to the vector space $$\RR^n$$ equipped with an $$\ell_p$$ norm with $$p \geq 1$$.

Theorem 9.266 ($$L$$-smoothness and the boundedness of the Hessian)

Let $$f : \RR^n \to \RR$$ be a twice continuously differentiable function over $$\RR^n$$. Then, for any $$L \geq 0$$, the following claims are equivalent:

1. $$f$$ is $$L$$-smooth w.r.t. the $$\ell_p$$-norm ($$p \in [1, \infty]$$).

2. $$\| \nabla^2 f(\bx)\|_{p, q} \leq L$$ for any $$\bx \in \RR^n$$ where $$q \geq 1$$ satisfies $$\frac{1}{p} + \frac{1}{q} = 1$$.

Proof. (2) $$\implies$$ (1)

1. We are given that $$\| \nabla^2 f(\bx)\|_{p, q} \leq L$$ for any $$\bx \in \RR^n$$.

2. By the fundamental theorem of calculus

$\begin{split} \nabla f(\by) - \nabla f(\bx) &= \int_0^1 \nabla^2 f(\bx + t(\by - \bx)) (\by - \bx) dt \\ &= \left (\int_0^1 \nabla^2 f(\bx + t(\by - \bx)) dt \right ) (\by - \bx). \end{split}$
3. Taking the (dual)-norm on both sides

$\begin{split} \| \nabla f(\by) - \nabla f(\bx) \|_q &= \left \| \left ( \int_0^1 \nabla^2 f(\bx + t(\by - \bx)) dt \right ) (\by - \bx) \right \|_q \\ &\leq \left \| \int_0^1 \nabla^2 f(\bx + t(\by - \bx)) dt \right \|_{p, q} \| \by - \bx \|_p \\ &\leq \left ( \int_0^1 \| \nabla^2 f(\bx + t(\by - \bx)) \|_{p, q} dt \right ) \| \by - \bx \|_p \\ &\leq \left ( \int_0^1 L dt \right ) \| \by - \bx \|_p \\ &= L \| \by - \bx \|_p. \end{split}$
4. Thus, $$\| \nabla f(\by) - \nabla f(\bx) \|_q \leq L \| \by - \bx \|_p$$ as desired.

(1) $$\implies$$ (2)

1. We are given that $$f$$ is $$L$$ smooth with $$\ell_p$$ norm.

2. By fundamental theorem of calculus, for any $$\bd \in \RR^n$$ and $$s > 0$$,

$\nabla f(\bx + s \bd) - \nabla f(\bx) = \int_0^s \nabla^2 f(\bx + t \bd) \bd dt.$
3. Taking $$q$$ norm on both sides

$\left \| \left ( \int_0^s \nabla^2 f(\bx + t \bd) dt \right ) \bd \right \|_q = \| \nabla f(\bx + s \bd) - \nabla f(\bx) \|_q \leq L \|\bx + s \bd - \bx \|_p = s L \| \bd \|_p.$
4. Dividing by $$s$$ on both sides and taking the limit $$s \to 0^+$$, we get

$\| \nabla^2 f(\bx) \bd \|_q \leq L \| \bd \|_p.$
5. Since this is valid for every $$\bd \in \RR^n$$, hence

$\| \nabla^2 f(\bx) \|_{p,q} \leq L.$

Corollary 9.27 ($$L$$-smoothness and largest eigenvalue of Hessian)

Let $$f : \RR^n \to \RR$$ be a twice continuously differentiable convex function over $$\RR^n$$. Then $$f$$ is $$L$$-smooth w.r.t. $$\ell_2$$-norm if and only if

$\lambda_{\max}( \nabla^2 f(\bx)) \leq L \Forall \bx \in \RR^n.$

Proof. Since $$f$$ is convex, hence it follows that $$\nabla^2 f(\bx) \succeq \ZERO$$ for every $$\bx$$. Thus,

$\|\nabla^2 f(\bx) \|_{2,2} = \sqrt{\lambda_{\max}( \nabla^2 f(\bx)^2 )} = \lambda_{\max}( \nabla^2 f(\bx)).$

From Theorem 9.266, $$f$$ is $$L$$-smooth is equivalent to the condition that

$\lambda_{\max}( \nabla^2 f(\bx)) = \|\nabla^2 f(\bx) \|_{2,2} \leq L.$

## 9.18.2. Strong Convexity#

Definition 9.81 (Strong convexity)

A function $$f : \VV \to \RERL$$ is called $$\sigma$$-strongly convex for $$\sigma > 0$$ if $$\dom f$$ is convex and the following holds for any $$\bx, \by \in \dom f$$ and $$t \in [0,1]$$:

(9.10)#$f(t \bx + (1 - t)\by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2.$

### 9.18.2.1. Strong Convexity $$\implies$$ Convexity#

Strongly convex functions are convex. In fact, we have a stronger result available.

Theorem 9.267 (Strong convexity and convexity)

Assume that the ambient space $$\VV$$ is Euclidean.

A function $$f : \VV \to \RERL$$ is $$\sigma$$-strongly convex if and only if the function $$f(\cdot) - \frac{\sigma}{2} \| \cdot \|^2$$ is convex.

Proof. Let us define a function $$g : \VV \to \RERL$$ as

$g(\bx) = f(\bx) = \frac{\sigma}{2} \| \bx \|^2.$

We need to show that $$f$$ is $$\sigma$$-strongly convex if and only if $$g$$ is convex.

1. We first note that $$\dom g = \dom f$$.

2. Thus, $$\dom g$$ is convex if and only if $$\dom f$$ is convex.

3. Now, $$g$$ is convex if and only if $$\dom g = \dom f$$ is convex and for any $$\bx, \by \in \dom f$$ and $$t \in (0, 1)$$

$g(t \bx + (1-t) \by) \leq t g(\bx) + (1-t) g(\by).$
4. Now,

$\begin{split} & g(t \bx + (1-t) \by) \leq t g(\bx) + (1-t) g(\by) \\ \iff & f(t \bx + (1-t) \by) - \frac{\sigma}{2} \| t \bx + (1-t) \by \|^2 \\ & \leq t f(\bx) + (1-t) f(\by) - \frac{\sigma}{2} [t \| \bx \|^2 + (1-t) \| \by \|^2] \\ \iff & f(t \bx + (1-t) \by) \leq t f(\bx) + (1-t) f(\by) \\ & + \frac{\sigma}{2} [ \| t \bx + (1-t) \by \|^2 - t \| \bx \|^2 - (1-t) \| \by \|^2]. \end{split}$
5. Since the norm is Euclidean, hence

$\begin{split} & \| t \bx + (1-t) \by \|^2 - t \| \bx \|^2 - (1-t) \| \by \|^2 \\ &= \langle t \bx + (1-t) \by, t \bx + (1-t) \by \rangle - t \| \bx \|^2 - (1-t) \| \by \|^2 \\ &= t^2 \| \bx \|^2 + (1-t)^2 \| \by \|^2 + 2 t (1-t)\langle \bx, \by \rangle - t \| \bx \|^2 - (1-t) \| \by \|^2 \\ &= - t(1-t) \| \bx \|^2 - t(1-t) \| \by \|^2 + 2 t (1-t)\langle \bx, \by \rangle \\ &= - t(1-t) \left ( \| \bx \|^2 + \| \by \|^2 - 2 \langle \bx, \by \rangle \right ) \\ &= - t(1-t) \| \bx - \by \|^2. \end{split}$
6. Thus, the convexity inequality for $$g$$ is equivalent to

$f(t \bx + (1-t) \by) \leq t f(\bx) + (1-t) f(\by) - \frac{\sigma}{2}t(1-t) \| \bx - \by \|^2$

which is nothing but the $$\sigma$$-strong convexity condition of $$f$$.

Theorem 9.268 (Strong convexity of quadratic functions)

Let $$\bA \in \SS^n$$, $$\bb \in \RR^n$$ and $$c \in \RR$$. Let $$f : \RR^n \to \RR$$ be given by:

$f(\bx) = \frac{1}{2} \bx^T \bA \bx + \bb^T \bx + c.$

Then $$f$$ is $$\sigma$$-strongly convex if and only if $$\bA$$ is positive definite and $$\sigma \leq \lambda_{\min}(\bA)$$.

Proof. Due to Theorem 9.267, $$f$$ is strongly convex with $$\sigma > 0$$ if and only if $$g(\bx) = f(\bx) - \frac{\sigma}{2} \| \bx \|^2$$ is convex.

1. We note that

$\begin{split} g(\bx) &= \frac{1}{2} \bx^T \bA \bx + \bb^T \bx + c - \frac{\sigma}{2} \| \bx \|^2\\ &= \frac{1}{2} \bx^T (\bA - \sigma \bI) \bx + \bb^T \bx + c. \end{split}$
2. As shown in Example 9.40, $$g$$ is convex if and only if $$\bA - \sigma \bI \succeq \ZERO$$.

3. This is equivalent to $$\sigma \leq \lambda_{\min}(\bA)$$.

### 9.18.2.3. Coerciveness#

Theorem 9.269 (Strong convexity and coerciveness)

Assume that the ambient space $$\VV$$ is Euclidean. Assume that $$f: \VV \to \RERL$$ is a Fréchet-differentiable function. If $$f$$ is $$\sigma$$-strongly convex, then it is coercive.

Proof. We proceed as follows.

1. Define

$g(\bx) = f(\bx) - \frac{\sigma}{2} \| \bx \|^2.$
2. By Theorem 9.267, $$g$$ is convex.

3. Since $$f$$ is differentiable, hence $$g$$ is also differentiable.

4. Specifically, $$\nabla g(\bx) = \nabla f(\bx) - \sigma \bx$$.

5. Fix some $$\bx \in \interior \dom f$$.

6. Then $$\partial g(\bx) = \{ \nabla g(\bx) \}$$.

7. By subgradient inequality, for any $$\by \in \VV$$,

$g(\by) \geq g(\bx) + \langle \by - \bx, \nabla g(\bx) \rangle.$
8. Expanding $$g$$ and $$\nabla g$$:

$f(\by) - \frac{\sigma}{2} \| \by \|^2 \geq f(\bx) - \frac{\sigma}{2} \| \bx \|^2 + \langle \by - \bx, \nabla f(\bx) - \sigma \bx \rangle.$
9. Let $$\bv = f(\bx) - \sigma \bx$$.

10. Rearranging terms

$f(\by) \geq \frac{\sigma}{2} \| \by \|^2 + \langle \by, \bv \rangle + K_{\bx}$

where $$K_{\bx} = f(\bx) - \frac{\sigma}{2} \| \bx \|^2 - \langle \bx, \bv \rangle$$.

11. We note that the term $$K_{\bx}$$ depends solely on $$\bx$$ which is fixed. Hence $$K_{\bx}$$ is a fixed quantity.

12. By Cauchy-Schwarz inequality

$\langle \by, \bv \rangle \geq - \| \bv \| \| \by \|.$
13. Hence

$f(\by) \geq \frac{\sigma}{2} \| \by \|^2 - \| \bv \| \| \by \| + K_{\bx}.$
14. It is easy to see that, the R.H.S. goes to $$\infty$$ as $$\| \by \| \to \infty$$.

15. Hence $$f$$ is coercive.

### 9.18.2.4. Sum Rule#

Theorem 9.270 (Sum of strongly convex and convex functions)

Let $$f$$ be $$\sigma$$-strongly convex and $$g$$ be convex. Then $$f+g$$ is $$\sigma$$-strongly convex.

Proof. Since both $$f$$ and $$g$$ are convex, hence their domains are convex. Hence, $$\dom (f + g) = \dom f \cap \dom g$$ is also convex.

We further need to show that $$f+g$$ satisfies (9.10).

1. Let $$\bx, \by \in \VV$$ and $$t \in (0,1)$$.

2. Since $$f$$ is $$\sigma$$-strongly convex, hence

$f(t \bx + (1 - t)\by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2.$
3. Since $$g$$ is convex, hence

$g(t \bx + (1 - t)\by) \leq t g(\bx) + (1-t)g(\by).$
4. Then,

$\begin{split} & (f + g)(t \bx + (1 - t)\by) \\ &= f(t \bx + (1 - t)\by) + g((t \bx + (1 - t)\by)) \\ &\leq f(t \bx + (1 - t)\by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2 + t g(\bx) + (1-t)g(\by)\\ &= t (f + g) (\bx) + (1-t) (f + g)(\by) - \frac{\sigma}{2} t (1 - t) \| \bx - \by \|^2. \end{split}$
5. Thus, $$f+g$$ is also $$\sigma$$-strongly convex.

Example 9.83 (Strong convexity of $$\frac{1}{2}| \cdot |^2 + I_C$$)

Let $$\VV$$ be a Euclidean space.

1. The function $$\frac{1}{2}\| \bx \|^2$$ is 1-strongly convex due to Theorem 9.268.

2. Let $$C$$ be a convex set.

3. Then, the indicator function $$I_C$$ is convex.

4. Due to Theorem 9.270, the function

$g(\bx) = \frac{1}{2}\| \bx \|^2 + I_C(\bx)$

is also 1-strongly convex.

### 9.18.2.5. First Order Characterization#

Recall that $$\dom (\partial f)$$ denotes the set of points at which $$f$$ is subdifferentiable.

Theorem 9.271 (First order characterization of strong convexity)

Let $$f: \VV \to \RERL$$ be a proper closed and convex function. For a given $$\sigma > 0$$, the following statements are equivalent.

1. $$f$$ is $$\sigma$$-strongly convex.

2. For every $$\bx \in \dom (\partial f)$$, $$\by \in \dom f$$ and $$\bg \in \partial f(\bx)$$, the following holds true

$f(\by) \geq f(\bx) + \langle \by - \bx, \bg \rangle + \frac{\sigma}{2} \| \by - \bx \|^2.$
3. For any $$\bx, \by \in \dom (\partial f)$$ and $$\bg_{\bx} \in \partial f(\bx)$$, $$\bg_{\by} \in \partial f(\bx)$$, the following holds true:

$\langle \bx - \by, \bg_{\bx} - \bg_{\by} \rangle \geq \sigma \| \bx - \by \|^2.$

Proof. We shall prove the equivalence of these statements in the following order. $$(2) \implies (1)$$, $$(1) \implies (3)$$, $$(3) \implies (2)$$.

(2) $$\implies$$ (1)

1. We assume that (2) is true.

2. Let $$\bx, \by \in \dom f$$ and $$t \in (0,1)$$.

3. We need to show that (9.10) holds for $$f$$.

4. Since $$\dom f$$ is convex, its relative interior is not empty (see Theorem 9.142).

5. Let $$\bz \in \relint \dom f$$.

6. Choose some $$\alpha \in (0, 1]$$.

7. Let $$\tilde{\bx} = (1-\alpha) \bx + \alpha \bz$$.

8. By the line segment property (Theorem 9.143), $$\tilde{\bx} \in \relint \dom f$$.

9. Let $$\bx_t = t \tilde{\bx} + (1-t)\by$$.

10. Again, by the line segment property, $$\bx_t \in \relint \dom f$$.

11. Since $$f$$ is a proper convex function, hence the subdifferential of $$f$$ at relative interior points is nonempty (Theorem 9.217).

12. Thus, $$\partial f(\bx_t) \neq \EmptySet$$ and $$\bx_t \in \dom (\partial f)$$.

13. Take some $$\bg \in \partial f(\bx_t)$$.

14. By hypothesis (2)

$f(\tilde{\bx}) \geq f(\bx_t) + \langle \tilde{\bx} - \bx_t, \bg \rangle + \frac{\sigma}{2} \| \tilde{\bx} - \bx_t \|^2.$
15. Substituting $$\bx_t = t \tilde{\bx} + (1-t)\by$$, we have $$\tilde{\bx} - \bx_t = (1-t) (\tilde{\bx} - \by)$$. Thus,

$f(\tilde{\bx}) \geq f(\bx_t) + (1-t)\langle \tilde{\bx} - \by, \bg \rangle + \frac{\sigma (1-t)^2}{2} \| \tilde{\bx} - \by \|^2.$
16. Similarly, by hypothesis (2)

$f(\by) \geq f(\bx_t) + \langle \by - \bx_t, \bg \rangle + \frac{\sigma}{2} \| \by - \bx_t \|^2.$
17. $$\by - \bx_t = \by - t \tilde{\bx} - (1-t)\by = t (\by - \tilde{\bx})$$.

18. This gives us,

$f(\by) \geq f(\bx_t) + t \langle \by - \tilde{\bx}, \bg \rangle + \frac{\sigma t^2}{2} \| \by - \tilde{\bx} \|^2.$
19. Multiplying the first inequality by $$t$$ and the second one by $$(1-t)$$ and adding them together, we get

$t f(\tilde{\bx}) + (1-t)f(\by) \geq f(\bx_t) + \frac{\sigma t(1-t)}{2} \| \tilde{\bx} - \by \|^2.$
20. Thus,

$f(t \tilde{\bx} + (1-t)\by) = f(\bx_t) \leq t f(\tilde{\bx}) + (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|\tilde{\bx} - \by \|^2.$
21. Expanding $$\tilde{\bx}$$,

$\begin{split} t \tilde{\bx} + (1-t)\by &= t ((1-\alpha) \bx + \alpha \bz) + (1-t)\by\\ &= t(1-\alpha) \bx + (1-t) \by + t \alpha \bz. \end{split}$
22. Define $$g_1(\alpha) = f(t \tilde{\bx} + (1-t)\by) = f(t(1-\alpha) \bx + (1-t) \by + t \alpha \bz)$$.

23. Define $$g_2(\alpha) = f(\tilde{\bx}) = f((1-\alpha) \bx + \alpha \bz)$$.

24. Substituting these into the previous inequality, we obtain

$g_1(\alpha) \leq t g_2(\alpha)+ (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|(1-\alpha) \bx + \alpha \bz - \by \|^2.$
25. The functions $$g_1$$ and $$g_2$$ are one dimensional, proper, closed and convex functions.

26. By Theorem 9.173, both $$g_1$$ and $$g_2$$ are continuous on their domain.

27. Therefore, taking the limit $$\alpha \to 0^+$$, it follows that

$g_1(0) \leq t g_2(0) + (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|\bx - \by \|^2.$
28. Now $$g_1(0) = f(t\bx + (1-t) \by)$$ and $$g_2(0) = f(\bx)$$.

29. Thus,

$f(t\bx + (1-t) \by) \leq t f(\bx) + (1-t)f(\by) - \frac{\sigma t(1-t)}{2} \|\bx - \by \|^2.$
30. This establishes that $$f$$ is indeed $$\sigma$$-strongly convex.

(1) $$\implies$$ (3)

1. We are given that $$f$$ is $$\sigma$$-strongly convex.

2. Let $$\bx, \by \in \dom (\partial f)$$.

3. Pick any $$\bg_{\bx} \in \partial f(\bx)$$ and $$\bg_{\by} \in \partial f(\by)$$.

4. Let $$t \in [0, 1)$$ and denote $$\bx_t = t \bx + (1-t) \by$$.

5. By the hypothesis

$f(\bx_t) \leq t f(\bx) + (1-t) f(\by) - \frac{\sigma t (1-t) }{2} \| \bx - \by \|^2.$
6. This is same as

$f(\bx_t) - f(\bx) \leq (1-t) [f(\by) - f(\bx)] - \frac{\sigma t (1-t) }{2} \| \bx - \by \|^2.$
7. We can see that $$(1-t) \in (0, 1]$$.

8. Dividing both sides of inequality by $$(1-t)$$, we obtain

$\frac{f(\bx_t) - f(\bx)}{1-t} \leq f(\by) - f(\bx) - \frac{\sigma t }{2} \| \bx - \by \|^2.$
9. Since $$\bg_{\bx} \in \partial f(\bx)$$, hence by subgradient inequality

$f(\bx_t) \geq f(\bx) + \langle \bx_t - \bx, \bg_{\bx} \rangle.$
10. We can rewrite this as

$\frac{f(\bx_t) - f(\bx)}{1-t} \geq \frac{\langle \bx_t - \bx, \bg_{\bx} \rangle}{1-t}.$
11. Note that $$\bx_t - \bx = (1-t)(\by - \bx)$$.

12. Thus,

$\frac{f(\bx_t) - f(\bx)}{1-t} \geq \langle \by - \bx, \bg_{\bx} \rangle.$
13. Thus,

$\langle \by - \bx, \bg_{\bx} \rangle \leq f(\by) - f(\bx) - \frac{\sigma t }{2} \| \bx - \by \|^2.$
14. This inequality holds for every $$t \in [0, 1)$$.

15. Taking the limit $$t \to 1^-$$, we obtain

$\langle \by - \bx, \bg_{\bx} \rangle \leq f(\by) - f(\bx) - \frac{\sigma}{2} \| \bx - \by \|^2.$
16. An identical reasoning by switching the roles of $$\bx$$ and $$\by$$, gives us

$\langle \bx - \by, \bg_{\by} \rangle \leq f(\bx) - f(\by) - \frac{\sigma}{2} \| \by - \bx \|^2.$
17. Adding these two inequalities gives us

$\langle \bx - \by, \bg_{\by} - \bg_{\bx} \rangle \leq - \sigma \| \bx - \by \|^2.$
18. Multiplying both sides by $$-1$$ (and switching the inequality accordingly), we get

$\langle \bx - \by, \bg_{\bx} - \bg_{\by} \rangle \geq \sigma \| \bx - \by \|^2$

as desired.

(3) $$\implies$$ (2)

1. We are given that (3) is satisfied.

2. Let $$\bx \in \dom (\partial f)$$, $$\by \in \dom f$$ and $$\bg \in \partial f(\bx)$$.

3. Pick any $$\bz \in \relint \dom f$$.

4. Pick some $$\alpha \in (0, 1)$$.

5. Define $$\tilde{\by} = (1 - \alpha) \by + \alpha \bz$$.

6. By line segment property $$\tilde{\by} \in \relint \dom f$$.

7. Define $$\bx_t = (1-t) \bx + t \tilde{\by}$$.

8. Consider the 1D function

$\varphi(t) = f(\bx_t), \Forall t \in [0, 1].$
9. Pick any $$t \in (0, 1)$$.

10. Then, by line segment principle $$\bx_t \in \relint \dom f$$.

11. Due to (Theorem 9.217), $$\partial f(\bx_t) \neq \EmptySet$$ and $$\bx_t \in \dom (\partial f)$$.

12. Take some $$\bg_t \in \partial f(\bx_t)$$.

$f(\bz) \geq f(\bx_t) + \langle \bz - \bx_t, \bg_t \rangle \Forall \bz \in \VV.$
14. In particular, for $$\bx_s = (1-s) \bx + s \tilde{\by}$$, we have

$\begin{split} & f(\bx_s) \geq f(\bx_t) + \langle (1-s) \bx + s \tilde{\by} - (1-t) \bx - t \tilde{\by}, \bg_t \rangle \\ & \implies \varphi(s) \geq \varphi(t) + \langle (s-t) (\tilde{\by} - \bx), \bg_t \rangle \\ &\implies \varphi(s) \geq \varphi(t) + (s-t) \langle \tilde{\by} - \bx, \bg_t \rangle. \end{split}$
15. Since this is valid for every $$s$$, hence $$\langle \tilde{\by} - \bx, \bg_t \rangle \in \partial \varphi(t)$$.

16. Applying the mean value theorem (Theorem 9.234)

$f(\tilde{\by}) - f(\bx) = \varphi(1) - \varphi(0) = \int_0^1 \langle \tilde{\by} - \bx, \bg_t \rangle dt.$
17. Since $$\bg \in \partial f(\bx)$$ and $$\bg_t \in \partial f(\bx_t)$$, hence applying the hypothesis (3), we get

$\langle \bx_t - \bx, \bg_t - \bg \rangle \geq \sigma \| \bx_t - \bx \|^2.$
18. But $$\bx_t - \bx = t (\tilde{\by} - \bx)$$.

19. Hence

$t \langle \tilde{\by} - \bx, \bg_t - \bg \rangle \geq \sigma t^2 \| \tilde{\by} - \bx \|^2.$
20. This simplifies to

$\langle \tilde{\by} - \bx, \bg_t \rangle \geq \langle \tilde{\by} - \bx, \bg \rangle + \sigma t \| \tilde{\by} - \bx \|^2.$

Canceling $$t$$ on both sides doesn’t change the sign of inequality since $$t > 0$$.

21. Applying the inequality to the integral above

$f(\tilde{\by}) - f(\bx) \geq \int_0^1 \left [ \langle \tilde{\by} - \bx, \bg \rangle + \sigma t \| \tilde{\by} - \bx \|^2 \right ] d t.$
22. Integrating, we get

$f(\tilde{\by}) - f(\bx) \geq \langle \tilde{\by} - \bx, \bg \rangle + \frac{\sigma}{2}\| \tilde{\by} - \bx \|^2.$
23. Expanding for $$\tilde{\by}$$ for any $$\alpha \in (0,1)$$, we have

$f((1 - \alpha) \by + \alpha \bz) \geq f(\bx) + \langle (1 - \alpha) \by + \alpha \bz - \bx, \bg \rangle + \frac{\sigma}{2}\| (1 - \alpha) \by + \alpha \bz - \bx \|^2.$
24. The 1D function $$g(\alpha) = f((1 - \alpha) \by + \alpha \bz)$$ is continuous again due to Theorem 9.173.

25. Taking the limit $$\alpha \to 0^+$$ on both sides, we obtain

$f(\by) \geq f(\bx) + \langle \by - \bx, \bg \rangle + \frac{\sigma}{2}\| \by - \bx \|^2$

which is the desired result.

### 9.18.2.6. Minimization#

Theorem 9.272 (Existence and uniqueness of a a minimizer of closed strongly convex function)

Let $$f: \VV \to \RERL$$ be a proper, closed and $$\sigma$$-strongly convex function with $$\sigma > 0$$. Then,

1. $$f$$ has a unique minimizer $$\ba \in \dom f$$ such that $$f(\bx) > f(\ba)$$ for every $$\bx \in \dom f$$ and $$\bx \neq \ba$$.

2. The increase in the value of $$f$$ w.r.t. its minimum satisfies

$f(\bx) - f(\ba) \geq \frac{\sigma}{2} \| \bx - \ba \|^2$

where $$\ba \in \dom f$$ is the unique minimizer of $$f$$.

Proof. (1) Existence of the minimizer

1. Since $$f$$ is proper and convex, hence $$\dom f$$ is nonempty and convex.

2. Since $$\dom f$$ is nonempty and convex, hence its relative interior is nonempty (Theorem 9.142).

3. Pick $$\by \in \relint \dom f$$.

4. By Theorem 9.214, $$\partial f(\by)$$ is nonempty.

5. Pick some $$\bg \in \partial f(\by)$$.

6. Then, by property 2 of Theorem 9.271,

$f(\bx) \geq f(\by) + \langle \bx - \by, \bg \rangle + \frac{\sigma}{2} \| \bx - \by \|^2$

holds true for every $$\bx \in \VV$$.

7. Let $$\| \cdot \|_2 \triangleq \sqrt{\langle \cdot, \cdot \rangle}$$ denote the Euclidean norm associated with the inner product of the space $$\VV$$. This might be different from the endowed norm $$\| \cdot \|$$.

8. Since all norms in a finite dimensional space are equivalent, hence, there exists a constant $$C > 0$$ such that

$\| \bz \| \geq \sqrt{C} \| \bz \|_2$

for every $$\bz \in \VV$$.

9. Therefore,

$f(\bx) \geq f(\by) + \langle \bx - \by, \bg \rangle + \frac{\sigma C}{2} \| \bx - \by \|_2^2 \Forall \bx \in \VV.$
10. This in turn is same as

$f(\bx) \geq f(\by) - \frac{1}{2 C \sigma} \| \bg \|_2^2 + \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2 \Forall \bx \in \VV.$
11. Let $$S_t$$ denote the sublevel set $$\{ \bx \ST f(\bx) \leq t \}$$.

12. Consider the sublevel set $$S_{f(\by)}$$.

13. Let $$\bx \in S_{f(\by)}$$.

14. Then, $$f(\bx) = f(\by) - r$$ for some $$r \geq 0$$.

15. But then

$f(\by) - r \geq f(\by) - \frac{1}{2 C \sigma} \| \bg \|_2^2 + \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2.$
16. This simplifies to

$r \leq \frac{1}{2 C \sigma} \| \bg \|_2^2 - \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2.$
17. Since $$r$$ must be nonnegative, hence the R.H.S. must be nonnegative also.

18. Thus, we require that

$\frac{1}{2 C \sigma} \| \bg \|_2^2 \geq \frac{C \sigma}{2} \left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2^2.$
19. This simplifies to

$\left \| \bx - \left (\by - \frac{1}{C \sigma} \bg \right ) \right \|_2 \leq \frac{1}{C \sigma} \| \bg \|_2.$
20. In other words, $$\bx$$ must belong to an $$\ell_2$$ closed ball given by

$B_{\| \cdot \|_2}\left [ \by - \frac{1}{C \sigma} \bg, \frac{1}{C \sigma} \| \bg \|_2 \right ].$
21. Since this is valid for every $$\bx \in S_{f(\by)}$$, hence

$S_{f(\by)} \subseteq B_{\| \cdot \|_2}\left [ \by - \frac{1}{C \sigma} \bg, \frac{1}{C \sigma} \| \bg \|_2 \right ].$
22. Since $$f$$ is closed, hence all its sublevel sets are closed.

23. since $$S_{f(\by)}$$ is contained in a ball, hence $$S_{f(\by)}$$ is bounded.

24. Thus, $$S_{f(\by)}$$ is closed and bounded.

25. Since $$\VV$$ is finite dimensional, hence $$S_{f(\by)}$$ is compact.

26. $$S_{f(\by)}$$ is also nonempty since $$\by \in S_{f(\by)}$$.

27. Thus, the problem of minimizing $$f$$ over $$\dom f$$ reduces to the problem of minimizing $$f$$ over the nonempty compact set $$S_{f(\by)}$$.

28. Since $$f$$ is closed, it is also lower semicontinuous.

29. By Theorem 3.121, $$f$$ attains a minimum on $$S_{f(\by)}$$ at some point $$\ba \in S_{f(\by)}$$.

30. Thus, we have established the existence of a minimizer of $$f$$ at some $$\ba \in S_{f(\by)} \subseteq \dom f$$.

(1) Uniqueness of the minimizer

1. To show the uniqueness, for contradiction, assume that $$\bu$$ and $$\bv$$ are two different minimizers of $$f$$ with $$f(\bu) = f(\bv) = p^*$$, the optimal value.

2. Let $$\bw = \frac{1}{2} \bu + \frac{1}{2} \bv$$.

3. We must have $$f(\bw) \geq p^*$$.

4. By strong convexity of $$f$$,

$f(\bw) \leq \frac{1}{2} f(\bu) + \frac{1}{2} f(\bv) - \frac{\sigma}{2}\frac{1}{2}\frac{1}{2} \| \bu - \bv \|^2 = p^* - \frac{\sigma}{8}\| \bu - \bv \|^2.$
5. If $$\bu \neq \bv$$, then $$f(\bw) < p^*$$; a contradiction.

6. Hence, the minimizer must be unique.

(2) Increase in value of $$f$$

1. Let $$\ba$$ be the unique minimizer of $$f$$.

2. By Fermat’s optimality condition $$\bzero \in \partial f(\ba)$$.

3. Since $$f$$ is $$\sigma$$-strongly convex, hence by property (2) in the Theorem 9.271,

$f(\bx) - f(\ba) \geq \langle \bx - \ba, \bzero \rangle + \frac{\sigma}{2} \| \bx - \ba \|^2 = \frac{\sigma}{2} \| \bx - \ba \|^2$

holds true for any $$\bx \in \dom f$$.

## 9.18.3. Smoothness and Strong Convexity#

### 9.18.3.1. The Conjugate Correspondence Theorem#

The idea of smoothness and strong convexity is connected. Roughly speaking, a function is strongly convex if and only if its conjugate is smooth.

Theorem 9.273 (Conjugate correspondence theorem)

Let $$\sigma > 0$$. Then

1. If $$f : \VV \to \RR$$ is a $$\frac{1}{\sigma}$$-smooth convex function, then $$f^*$$ is $$\sigma$$-strongly convex w.r.t. the dual norm $$\| \cdot \|_*$$.

2. If $$f: \VV \to \RERL$$ is a proper, closed $$\sigma$$-strongly convex function, then $$f^* : \VV^* \to \RR$$ is $$\frac{1}{\sigma}$$-smooth.

Proof. (1) Smooth convex to strongly convex conjugate

1. We are given that $$f: \VV \to \RR$$ is a $$\frac{1}{\sigma}$$-smooth convex function.

2. Due to Theorem 9.239, $$f^*$$ is closed and convex.

3. Since $$f$$ is proper and convex, hence due to Theorem 9.240, $$f^*$$ is proper.

4. Thus $$f^*$$ is a proper, closed and convex function.

5. Pick any $$\by_1, \by_2 \in \dom (\partial f^*)$$.

6. Let $$\bv_1 \in \partial f^*(\by_1)$$ and $$\bv_2 \in \partial f^*(\by_2)$$.

7. Since $$f$$ is proper and convex, hence by conjugate subgradient theorem (Theorem 9.246)

$\by_1 \in \partial f(\bv_1) \text{ and } \by_2 \in \partial f(\bv_2).$
8. Since $$f$$ is smooth, hence it is differentiable. Hence due to Theorem 9.220,

$\by_1 = \nabla f(\bv_1) \text{ and } \by_2 = \nabla f(\bv_2).$
9. Following characterization of smoothness (Theorem 9.265), by its property 4,

$\langle \bv_1 - \bv_2, \by_1 - \by_2 \rangle \geq \sigma \| \by_1 - \by_2 \|^2_*.$
10. Since the last inequality holds for any $$\by_1, \by_2 \in \dom (\partial f^*)$$ and any $$\bv_1 \in \partial f^*(\by_1), \bv_2 \in \partial f^*(\bv_2)$$, hence following the first order characterization of strong convexity in Theorem 9.271, $$f^*$$ is a $$\sigma$$-strongly convex function.

(2) Strongly convex to smooth conjugate

1. We are given that $$f$$ is proper, closed and $$\sigma$$-strongly convex.

2. Pick any $$\by \in \VV^*$$.

3. The conjugate is given by

$f^*(\by) = \sup_{\bx \in \VV} \{ \langle \bx, \by \rangle - f(\by) \}.$
4. Define $$g(\bx) = f(\bx) - \langle \bx, \by \rangle$$.

5. We can see that

$f^*(\by) = - \inf_{\bx \in \VV} g(\bx).$
6. Due to the sum rule (Theorem 9.270), $$g$$ is $$\sigma$$-strongly convex.

7. Due to Theorem 9.272, $$g$$ has a unique minimizer.

8. Hence $$f^*(\by)$$ is finite.

9. Since this is valid for any $$\by \in \VV^*$$, hence $$\dom f^* = \VV^*$$.

10. This justifies the signature for $$f^*$$ as $$f^* : \VV^* \to \RR$$ being real valued.

11. Let’s continue with any $$\by$$.

12. Since $$\dom f^* = \VV^*$$, hence $$\by \in \interior \dom f^*$$.

13. Now, by the second formulation of conjugate subgradient theorem (Corollary 9.26),

$\partial f^*(\by) = \argmax_{\bx \in \VV} \{ \langle \bx, \by \rangle - f(\bx) \}.$
14. We can see that

$\partial f^*(\by) = - \argmin_{\bx \in \VV} g(\bx).$
15. Since $$g$$ has a unique minimizer, hence $$\partial f^*(\by)$$ is a singleton.

16. Due to Theorem 9.220, $$f^*$$ is differentiable at $$\by$$.

17. Since $$\by$$ is arbitrary, hence $$f^*$$ is differentiable over entire $$\VV^*$$.

18. We now pickup two points $$\by_1, \by_2 \in \VV^*$$ and denote $$\bv_1 = \nabla f^*(\by_1), \bv_2 = \nabla f^*(\by_2)$$.

19. By conjugate subgradient theorem (Theorem 9.246), this is equivalent to $$\by_1 \in \partial f(\bv_1)$$ and $$\by_2 \in \partial f(\bv_2)$$.

20. Following the first order characterization of strong convexity in Theorem 9.271,

$\langle \bv_1 - \bv_2, \by_1 - \by_2 \rangle \geq \sigma \| \bv_1 - \bv_2 \|^2.$
21. In other words

$\langle \nabla f^*(\by_1) - \nabla f^*(\by_2), \by_1 - \by_2 \rangle \geq \sigma \| \nabla f^*(\by_1) - \nabla f^*(\by_2) \|^2.$
22. By generalized Cauchy Schwartz inequality (Theorem 4.108)

$\langle \nabla f^*(\by_1) - \nabla f^*(\by_2), \by_1 - \by_2 \rangle \leq \| \nabla f^*(\by_1) - \nabla f^*(\by_2) \| \| \by_1 - \by_2 \|_*.$
23. Thus the previous inequality simplifies to

$\| \nabla f^*(\by_1) - \nabla f^*(\by_2) \| \leq \frac{1}{\sigma}\| \by_1 - \by_2 \|_*.$
24. This establishes that $$f^*$$ is $$\frac{1}{\sigma}$$-smooth.

## 9.18.4. Examples#

Example 9.84 (Smoothness of $$\sqrt{1 + | \cdot |_2^2}$$)

Let $$f : \RR^n \to \RR$$ be given by

$f(\bx) = \sqrt{1 + \| \bx \|_2^2}.$

$$f$$ is 1-smooth w.r.t. the $$\ell_2$$-norm.

1. Note that for any $$\bx \in \RR^n$$, the gradient is given by

$\nabla f(\bx) = \frac{\bx}{ \sqrt{1 + \| \bx \|_2^2}}.$
2. The Hessian is given by

$\nabla^f (\bx)= \frac{\bI}{ \sqrt{1 + \| \bx \|_2^2}} - \frac{\bx \bx^T}{ (1 + \| \bx \|_2^2)^{\frac{3}{2}}} \preceq \frac{\bI}{ \sqrt{1 + \| \bx \|_2^2}} \preceq \bI.$
3. Therefore, $$\lambda_{\max}( \nabla^2 f(\bx)) \leq 1$$ for every $$\bx \in \RR^n$$.

4. Hence, by Corollary 9.27, $$f$$ is 1-smooth w.r.t. the $$\ell_2$$-norm.

### 9.18.4.1. Log-Sum-Exp#

Example 9.85 (Smoothness of log-sum-exp)

Consider the log-sum-exp function $$f : \RR^n \to \RR$$ given by

$f(\bx) = \ln \left ( \sum_{i=1}^n e^{x_i}\right ).$

$$f$$ is 1-smooth w.r.t. $$\ell_2$$ and $$\ell_{\infty}$$ norms.

Smoothness w.r.t. $$\ell_2$$ norm

1. The partial derivatives of $$f$$ are

$\frac{\partial f}{\partial x_i} (\bx) = \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k} }.$
2. The second order partial derivatives are

$\begin{split} \frac{\partial^2 f}{\partial x_i \partial x_j} (\bx) = \begin{cases} - \frac{e^{x_i} e^{x_j}}{\left (\sum_{k=1}^n e^{x_k} \right )^2}, & i \neq j; \\ - \frac{e^{x_i} e^{x_i}}{\left (\sum_{k=1}^n e^{x_k} \right )^2} + \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k}}, & i = j. \end{cases} \end{split}$
3. The Hessian can be written as

$\nabla^2 f(\bx) = \diag (\bw) - \bw \bw^T$

where $$w_i = \frac{e^{x_i}}{\sum_{k=1}^n e^{x_k}}$$.

4. We can now see that

$\nabla^2 f(\bx) = \diag (\bw) - \bw \bw^T \preceq \diag (\bw) \preceq \bI.$
5. Hence $$\lambda_{\max}( \nabla^2 f(\bx)) \leq 1$$ for every $$\bx \in \RR^n$$.

6. Hence, by Corollary 9.27, $$f$$ is 1-smooth w.r.t. the $$\ell_2$$-norm.

Smoothness w.r.t. $$\ell_{\infty}$$ norm

1. We first show that for any $$\bv \in \VV$$

$\bv^T \nabla^2 f(\bx) \bv \leq \| \bv \|_{\infty}^2.$
2. To see this, we expand the L.H.S. as

$\begin{split} \bv^T \nabla^2 f(\bx) \bv &= \bv^T (\diag (\bw) - \bw \bw^T) \bv \\ &= \bv^T \diag (\bw) \bv - (\bw^T \bv)^2 \\ &\leq \bv^T \diag (\bw) \bv\\ &= \sum_{i=1}^n w_i v_i^2 \\ &\leq \| \bv \|_{\infty}^2 \sum_{i=1}^n w_i \\ &= \| \bv \|_{\infty}^2. \end{split}$
3. Since $$f$$ is twice differentiable over $$\RR^n$$, hence by linear approximation theorem (Theorem 5.5), for any $$\bx, \by \in \RR^n$$, there exists $$\bz \in [\bx, \by]$$ such that

$f(\by) - f(\bx) = \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bz) (\by - \bx).$
4. Let $$\bv = \by - \bx$$.

5. Then from above,

$(\by - \bx)^T \nabla^2 f(\bz) (\by - \bx) \leq \| \bv \|_{\infty}^2.$
6. Putting this back in the approximation, we have

$f(\by) \leq f (\bx) + \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} \| \by - \bx \|_{\infty}^2.$
7. Following characterization of smoothness (Theorem 9.265), $$f$$ is indeed 1-smooth w.r.t. the $$\ell_{\infty}$$-norm.

### 9.18.4.2. Negative Entropy#

Example 9.86 (Strong convexity of negative entropy over the unit simplex)

Let $$f : \RR^n \to \RERL$$ be given by:

$\begin{split} f(\bx) \triangleq \begin{cases} \sum_{i=1}^n x_i \ln x_i & \bx \in \Delta_n\\ \infty & \text{ otherwise } \end{cases}. \end{split}$

$$f$$ is 1-strongly convex for both $$\ell_1$$ and $$\ell_2$$ norms.

1. By Theorem 9.257, its conjugate is given by

$f^*(\by) = \ln \left ( \sum_{j=1}^n e^{y_j} \right )$

which is the log sum exp function.

2. By Example 9.85, the log-sum-exp function is 1-smooth w.r.t. both $$\ell_2$$ and $$\ell_{\infty}$$ norms.

3. Hence by conjugate correspondence theorem Theorem 9.273, $$f$$ is 1-strongly convex for both $$\ell_1$$ and $$\ell_2$$ norms.