11.1. Basic Subgradient Method#

The theory of subgradients for convex functions is discussed in Subgradients. This section discusses basic subgradient method for optimization. Primary references for this section are [15].

Recall that a vector \(\bg\) is a subgradient of \(f\) at a point \(\bx\) if

\[ f(\by) \geq f(\bx) + \langle \by - \bx, \bg \rangle \Forall \by \in \RR^n. \]

11.1.1. Unconstrained Minimization#

We start with a simple case of minimizing a convex function \(f: \RR^n \to \RR\).

The optimal value shall be denoted by \(f^*\).
We shall assume that the optimal value is finite (i.e., \(f^* > -\infty\)).
The set of optimal minimizers shall be denoted by \(X^*\).
Since \(f\) attains its optimal value, hence \(X^*\) is nonempty.
A global minimizer will be denoted by \(\bx^*\).

The basic subgradient method uses the following iteration.

(11.1)#\[\bx^{k+1} = \bx^k - t_k \bg^k.\]

In this iteration

\(\bx^k\) is the current (\(k\)-th) iterate.
\(\bg^k\) is any subgradient of \(f\) at \(\bx^k\). We have \(\bg^k \in \partial f(\bx^k)\).
\(t_k > 0\) is the step size for the \(k\)-th iteration in the negative subgradient direction \(- \bg^k\).
\(\bx^{k+1}\) is the new (\(k+1\)-th) iterate.
\(\| t_k \bg^k \|_2\) denotes the step-length.

Thus, in each iteration of the subgradient method, we take a step in the direction of a negative subgradient.

11.1.1.1. Descent Directions#

Observation 11.1 (Negative subgradient need not be a descent direction)

Recall from Definition 10.22 that a descent direction of \(f\) at \(\bx\) is a direction along which the directional derivative is negative; i.e., \(f'(\bx; \bd) < 0\).
If \(\bd\) is a descent direction at \(\bx\), then there exists a step size \(t > 0\) such that

\[ f(\bx + t \bd) < f(\bx). \]
If \(f\) is a differentiable function and \(\bd = - \nabla f(\bx)\), then

\[ f'(\bx; - \nabla f(\bx)) = - \| \nabla f(\bx) \|^2 \leq 0. \]
Hence, the negative gradient is always a descent direction if the gradient is nonzero.
However, for a nondifferentiable function \(f\), a negative subgradient may not be a descent direction.
Recall from max formula (Theorem 9.219) that

\[ f'(\bx;\bd) = \sup \{ \langle \bd, \bg \rangle \ST \bg \in \partial f(\bx) \}. \]
Let \(\tilde{\bg}\) be the chosen subgradient at an iteration of the subgradient method.
Then

\[ f'(\bx; -\tilde{\bg}) = \sup_{\bg \in \partial f(\bx)} \{ \langle - \tilde{\bg}, \bg \rangle\} = - \inf_{\bg \in \partial f(\bx)} \{ \langle \tilde{\bg}, \bg \rangle\}. \]
It is possible that the quantity \(\langle \tilde{\bg}, \bg \rangle\) is negative for some other subgradient \(\bg\) at \(\bx\).
Hence the directional derivative along \(-\tilde{\bg}\) may be positive.
This argument shows that a negative subgradient is not necessarily a descent direction.

11.1.1.2. Tracking of the Best Estimate#

Since the negative subgradient \(-\bg^k\) may not be a descent direction, hence it is quite likely that \(f(\bx^{k+1}) > f(\bx^k)\).
Even if the selected \(-\bg^k\) is a descent direction at \(\bx^k\), it is possible that the step size can be such that \(f(\bx^{k+1}) > f(\bx^k)\).
This happens since typically there is no local line search method invoked to select an appropriate step size in a subgradient method.
Since the negative subgradient itself may not be a descent direction, hence running a local line search will be futile.
The alternative is to track the best estimate of the optimal value so far using the following rule

\[ f_{\best}^k = \min \{f_{\best}^{k-1}, f(\bx^k) \}. \]
If the best estimate has decreased, we also update the estimate of the local minimizer as \(\bx_{\best}\) to \(\bx^{k+1}\).
It is easy to see that

\[ f_{\best}^k = \min \{f(\bx^1), \dots, f(\bx^k) \}. \]
We can see that the sequence \(\{ f_{\best}^k \}\) is a nonincreasing sequence.
Hence it has a limit.
The success of the subgradient method depends on whether the limit equals the optimal value.

11.1.1.3. Step Size Selection#

In a gradient descent method, one typically uses a line search method to select the step size. Hence, it depends on current iterate and the function values in its neighborhood.

In a subgradient method, the step size selection is different. Typically, the step sizes (or step lengths) are determined beforehand and they don’t depend on the data computed during the execution of the algorithm. Following are some of the common methods for step size selection.

Constant step size. \(t_k = t\) for some positive constant \(t > 0\) which is independent of \(k\).
Constant step length.

\[ t_k = \frac{c}{\| \bg^k \|_2} \]

where \(c > 0\) is a predefined step length. Note that with this choice of step size, we have:

\[ \| \bx^{k + 1} - \bx^k \|_2 = \| t_k \bg^k \|_2 = c. \]
Square summable but not summable. We choose step sizes that satisfy the following constraints:

\[ t_k > 0 \Forall k, \quad \sum_{k=1}^{\infty} t_k^2 < \infty, \sum_{k=1}^{\infty} t_k = \infty. \]

As an example, fix some \(a > 0\) and \(b \geq 0\) and let

\[ t_k = \frac{a}{b + k}. \]
Nonsummable diminishing. We choose step sizes that satisfy the following constraints:

\[ t_k > 0 \Forall k, \quad \lim_{k \to \infty} t_k = 0, \quad \sum_{k=1}^{\infty} t_k = \infty. \]

As and example, fix some \(a > 0\) and let

\[ t_k = \frac{a}{\sqrt{k}}. \]
Nonsummable diminishing step lengths: We choose step sizes as \(t_k = \frac{c_k}{\| \bg^k \|_2}\) where

\[ c_k > 0, \quad \lim_{k \to \infty} c_k = 0,\quad \sum_{k=1}^{\infty} c_k = \infty. \]

11.1.1.4. The Subgradient Method#

We now present the overall template for the subgradient method.

Algorithm 11.1 (The subgradient method)

Inputs

\(f\) : function to be minimized
\(\bx^1\): initial guess for the minimizer
\(t_1\): initial step size

Outputs

\(f_{\best}^1\): Best estimate of minimum value of \(f\)
\(\bx_{\best}^1\) the best estimate of the minimizer

Initialization

\(f_{\best} = f(\bx^1)\).
\(\bx_{\best} = \bx^1\).

General iteration: for \(k=1,2,\dots\), execute the following steps

Select a subgradient \(\bg^k\) from \(\partial f(\bx^k)\).
Select a step size: \(t_k\).
Update minimizer estimate: \(\bx^{k+1} = \bx^k - t_k \bg^k\).
\(k = k + 1\).
Compute \(f(\bx^k)\).
if \(f_{\best}^{k-1} > f(\bx^k)\):
1. Update best estimate of optimal value: \(f_{\best}^k = f(\bx^k)\).
2. Update best estimate of minimizer: \(\bx_{\best}^k = \bx^k\).
Otherwise, retain current values
1. \(f_{\best}^k = f_{\best}^{k-1}\).
2. \(\bx_{\best}^k = \bx_{\best}^{k-1}\).
If stopping criterion is met, then STOP.

For a concrete implementation, we shall need the following:

A way to compute \(f(\bx)\) at every \(\bx\).
A way to pick a subgradient \(\bg \in \partial f(\bx)\) at every \(\bx\).
A step size selection strategy.
A stopping criterion.

Note that there is no need to specify the complete subdifferential at every \(\bx\). The stopping criterion will be developed in the following as part of the convergence analysis of the algorithm.

11.1.2. Convergence Analysis#

Our goal is to show that \(f_{\best}^k \downarrow f^*\). Towards this, we will establish a suboptimality bound on the estimate of the optimal value after \(k\) iterations.

We shall make the following assumptions in the analysis.

\(f\) is a real valued convex function.
The optimal value \(f^*\) is finite and the minimizer \(\bx^*\) exists.
The norm of the subgradients is bounded. i.e., there exists \(G > 0\) such that

\[ \| \bg \|_2 \leq G \Forall \bg \in \partial f(\bx) \Forall \bx. \]

Recall from Theorem 9.232 that if a convex function is Lipschitz continuous, then its subgradients are bounded.
A number \(R\) that satisfies

\[ R \geq \| \bx^* - \bx^1 \|_2 \]

for some \(\bx^* \in X^*\) is known beforehand. \(R\) can be interpreted as an upper bound on the distance of the initial point to the set of optimal minimizers.

\[ R \geq \text{dist}(\bx^1, X^*). \]

Theorem 11.1 (Suboptimality bound for subgradient method)

Let \(\| \bx^1 - \bx^* \|_2 \leq R\).

After \(k\) iterations of the subgradient method, we have

(11.2)#\[f_{\best}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \| \bg_i \|_2^2 }{2 \sum_{i=1}^k t_i }.\]

If the subgradients of \(f\) are bounded by \(\| \bg \| \leq G\) for all \(\bg \in \partial f(\bx)\) for every \(\bx\), then this simplifies to:

(11.3)#\[f_{\best}^k - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i }.\]

Proof. Let \(\bx^*\) be a minimizer of \(f\). From the subgradient inequality, we have

\[ f^* = f(\bx^*) \geq f(\bx^k) + \langle \bx^* - \bx^k, \bg^k \rangle. \]

Hence

\[ \langle \bx^* - \bx^k, \bg^k \rangle \leq f^* - f(\bx^k). \]

We first develop an upper bound on the distance between the \(k+1\)-th iterate and a minimizer.

\[\begin{split} \| \bx^{k+1} - \bx^* \|_2^2 & = \| \bx^k - t_k \bg^k - \bx^* \|_2^2 \\ & = \| (\bx^k - \bx^*) - t_k \bg^k \|_2^2 \\ & = \| \bx^k - \bx^* \|_2^2 - 2 t_k \langle \bx^k - \bx^*, \bg^k \rangle + t_k^2 \| \bg^k \|_2^2 \\ & \leq \| \bx^k - \bx^* \|_2^2 - 2 t_k (f(\bx^k) - f^*) + t_k^2 \| \bg^k \|_2^2. \end{split}\]

Applying this inequality recursively, we get

\[ \| \bx^{k+1} - \bx^* \|_2^2 \leq \| \bx^1 - \bx^* \|_2^2 - 2 \sum_{i=1}^k t_i (f(\bx^i) - f^*) + \sum_{i=1}^k t_i^2 \| \bg^i \|_2^2. \]

Using the fact that \(\| \bx^{k+1} - \bx^* \|_2^2 \geq 0\) and \(\| \bx^1 - \bx^* \|_2 \leq R\), we have

\[ 2 \sum_{i=1}^k t_i (f(\bx^i) - f^*) \leq R^2 + \sum_{i=1}^k t_i^2 \| \bg^i \|_2^2. \]

On the L.H.S., we can see that

\[ \sum_{i=1}^k t_i (f(\bx^i) - f^*) \geq \left ( \sum_{i=1}^k t_i \right ) \min_{i=1,\dots,k} (f(\bx^i) - f^*) = \left ( \sum_{i=1}^k t_i \right ) (f_{\best}^k - f^* ). \]

Combining this with the previous inequality, we get

\[ 2 \left ( \sum_{i=1}^k t_i \right ) (f_{\best}^k - f^* ) \leq R^2 + \sum_{i=1}^k t_i^2 \| \bg^i \|_2^2. \]

This can be rewritten as

\[ f_{\best}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \| \bg_i \|_2^2 }{2 \sum_{i=1}^k t_i } \]

as desired.

Applying the upper bound on the subgradient \(\| \bg \|_2 \leq G\), we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2\sum_{i=1}^k t_i } \]

as desired.

Based on this result, we can provide upper bounds on the estimation error for different step size selection strategies.

11.1.2.1. Constant Step Size#

Corollary 11.1 (Convergence of subgradient method with constant step size)

If \(t_k = t\) for all \(k\), then we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + G^2 t^2 k}{2 t k }. \]

The subgradient method converges to within \(G^2 t / 2\) of the optimal value \(f^*\). In other words,

\[ \lim_{k \to \infty} f_{\best}^k - f^* \leq \frac{G^2 t}{2}. \]

Also, \(f_{\best}^k - f^* \leq G^2 t\) within at most \(R^2 / (G^2 t^2)\) steps.

Proof. By putting \(t_k = t\) in (11.3), we obtain the desired upper bound. Taking the limit \(k \to \infty\), we see that

\[ \lim_{k \to \infty} (f_{\best}^k - f^*) \leq \frac{G^2 t}{2}. \]

Now,

\[\begin{split} & \frac{R^2 + G^2 t^2 k}{2 t k } \leq G^2 t \\ \iff & R^2 + G^2 t^2 k \leq 2 G^2 t^2 k \\ \iff & R^2 \leq G^2 t^2 k \\ \iff & k \geq \frac{R^2}{G^2 t^2}. \end{split}\]

Since \(f_{\best}^k\) is a nonincreasing sequence, hence for all \(k \geq R^2 / (G^2 t^2)\), we must have

\[ f_{\best}^k - f^* \leq G^2 t. \]

11.1.2.2. Constant Step Length#

Corollary 11.2 (Convergence of subgradient method with constant step length)

Let \(c\) be the constant step length for the subgradient method. Let \(t_k = \frac{c}{\| \bg^k \|_2}\) for every \(k\). Then we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + c^2 k}{2 c k / G }. \]

The subgradient method converges to within \(G c / 2\) of the optimal value \(f^*\). In other words,

\[ \lim_{k \to \infty} f_{\best}^k - f^* \leq \frac{G c}{2}. \]

Proof. By putting \(t_i = \frac{c}{\| \bg^i \|_2}\) in (11.2), we see that

\[ f_{\best}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k c^2 }{2 \sum_{i=1}^k t_i }. \]

Also, note that

\[\begin{split} & t_i = \frac{c}{\| \bg^i \|_2}\\ \implies & t_i \geq \frac{c}{ G} \\ \implies & \sum_{i=1}^k t_i \geq \frac{c k}{G} \\ \implies & \frac{1}{\sum_{i=1}^k t_i} \leq \frac{1}{c k / G}. \end{split}\]

Putting this back, we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + c^2 k }{2 c k / G }. \]

Taking the limit \(k \to \infty\), we see that

\[ \lim_{k \to \infty} (f_{\best}^k - f^*) \leq \frac{G c}{2}. \]

11.1.2.3. Square Summable Step Sizes#

Corollary 11.3 (Convergence of subgradient method with square summable step sizes)

Let the step sizes satisfy the rule

\[ t_k > 0 \Forall k, \quad \sum_{k=1}^{\infty} t_k^2 < \infty, \sum_{k=1}^{\infty} t_k = \infty. \]

Further, let \(T = \sum_{k=1}^{\infty} t_k^2\).

Then we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + G^2 T}{2 \sum_{i=1}^k t_k}. \]

The subgradient method converges to \(f^*\).

Proof. By putting the step size constraints into (11.3), we obtain

\[ f_{\best}^k - f^* \leq \frac{R^2 + G^2 T}{2 \sum_{i=1}^k t_i }. \]

Taking the limit \(k \to \infty\), the denominator grows to \(\infty\) while the numerator converges to \(R^2 + G^2 T\). Hence

\[ \lim_{k \to \infty} f_{\best}^k - f^* = 0. \]

11.1.2.4. Diminishing Step Sizes#

Corollary 11.4 (Convergence of subgradient method with diminishing step sizes)

Let the step sizes satisfy the rule

\[ t_k > 0 \Forall k, \quad \lim_{k \to \infty} t_k = 0, \quad \sum_{k=1}^{\infty} t_k = \infty. \]

Then the subgradient method converges to \(f^*\).

Proof. We shall show that the R.H.S. of (11.3) converges to 0 as \(k \to \infty\).

Let \(\epsilon > 0\).
There exists integer \(n_1\) such that \(t_i \leq \epsilon / G^2\) for all \(i > n_1\) since \(t_k \to 0\).
There also exists an integer \(n_2\) such that

\[ \sum_{i=1}^{n_2} t_i \geq \frac{1}{\epsilon} \left ( R^2 + G^2 \sum_{i=1}^{n_1} t_i^2 \right) \]

since \(t_i\) is nonsummable.
Let \(n = \max\{n_1, n_2 \}\).
Then, for every \(k > n\), we have

\[ \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = \frac{R^2 + G^2 \sum_{i=1}^{n_1} t_i^2}{2 \sum_{i=1}^k t_i} + \frac{G^2 \sum_{i=n_1 + 1}^k t_i^2}{2 \sum_{i=1}^{n_1} t_i + 2 \sum_{i=n_1 + 1}^k t_i}. \]
We can see that

\[ \sum_{i=1}^k t_i \geq \sum_{i=1}^{n_2} t_i \geq \frac{1}{\epsilon} \left ( R^2 + G^2 \sum_{i=1}^{n_1} t_i^2 \right). \]
Hence

\[ \frac{R^2 + G^2 \sum_{i=1}^{n_1} t_i^2}{2 \sum_{i=1}^k t_i} \leq \frac{\epsilon}{2}. \]
For every \(i > n_1\), we have \(t_i < \epsilon / G^2\).
Hence

\[ G^2 \sum_{i=n_1 + 1}^k t_i^2 \leq \epsilon \sum_{i=n_1 + 1}^k t_i. \]
We can now see that

\[\begin{split} & \frac{G^2 \sum_{i=n_1 + 1}^k t_i^2}{2 \sum_{i=1}^{n_1} t_i + 2 \sum_{i=n_1 + 1}^k t_i}\\ & < \frac{G^2 \sum_{i=n_1 + 1}^k t_i^2}{2 \sum_{i=n_1 + 1}^k t_i}\\ & \leq \frac{\epsilon \sum_{i=n_1 + 1}^k t_i}{2 \sum_{i=n_1 + 1}^k t_i}\\ &= \frac{\epsilon}{2}. \end{split}\]
Combining, we have

\[ \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon \]

for every \(k > n\).
Thus, for every \(\epsilon > 0\), there exists \(n\) such that for all \(k > n\), we have

\[ \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} < \epsilon. \]
Hence

\[ \lim_{k \to \infty}\frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i} = 0. \]
Hence the subgradient method converges.

11.1.2.5. Diminishing Step Lengths#

Corollary 11.5 (Convergence of subgradient method with diminishing step lengths)

Let the step sizes \(t_k = \frac{c_k}{\| \bg^k \|_2}\) be chosen such that

\[ c_k > 0, \quad \lim_{k \to \infty} c_k = 0,\quad \sum_{k=1}^{\infty} c_k = \infty \]

where \(c_k\) is the step length for the \(k\)-th iteration.

Then the subgradient method converges to \(f^*\).

Proof. From (11.2) we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k t_i^2 \| \bg_i \|_2^2 }{2 \sum_{i=1}^k t_i } = \frac{R^2 + \sum_{i=1}^k c_i^2}{2 \sum_{i=1}^k t_i}. \]

We have

\[ \sum_{i=1}^k t_i = \sum_{i=1}^k \frac{c_i}{\| \bg^i \|_2} \geq \sum_{i=1}^k \frac{c_i}{G}. \]
Hence we have

\[ f_{\best}^k - f^* \leq \frac{R^2 + \sum_{i=1}^k c_i^2}{(2/G) \sum_{i=1}^k c_i}. \]
Following an argument similar to the proof of Corollary 11.4, the R.H.S. converges to 0 as \(k \to \infty\).
Hence

\[ \lim_{k \to \infty} f_{\best}^k - f^* = 0. \]

11.1.2.6. On the Suboptimality Bound#

If we run the subgradient method for \(k\) iterations, we get a suboptimality bound given by (11.3).

\[ f_{\best}^k - f^* \leq \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i }. \]

A natural question is how tight can this bound be by an appropriate selection of step sizes \(t_1, \dots, t_k\).

Note that the R.H.S. is a convex and symmetric function of \(t_1, \dots, t_k\).
Hence, at the optimal value, we must have \(t_1 = \dots = t_k\).
Let \(t = t_1 = \dots = t_k\).
Then the suboptimality bound reduces to

\[ \frac{R^2 + G^2 k t^2}{2 k t }. \]
This is minimized at \(t = \frac{R}{G \sqrt{k}}\).
In other words, the finite sequence of step-sizes \(t_1, \dots, t_k\) that minimizes the suboptimality bound in (11.3) after exactly \(k\) iterations is given by

\[ t_i = \frac{R}{G \sqrt{k}}, \Forall i=1,\dots,k. \]
It must be noted that this suboptimality bound is dependent on the number of iterations.
This choice of constant step size gives us the bound

\[ f_{\best}^k - f^* \leq \frac{RG}{\sqrt{k}}. \]

Theorem 11.2 (A bound on the suboptimality bound)

Let \(t_1, \dots, t_k\) be any choice of step sizes for the subgradient method for \(k\) iterations. Then we must have

\[ \frac{R^2 + G^2 \sum_{i=1}^k t_i^2}{2 \sum_{i=1}^k t_i } \geq \frac{RG}{\sqrt{k}} \]

where the L.H.S. is the step-size selection dependent suboptimality bound on \(f_{\best}^k - f^*\).

Accordingly, the number of steps required to achieve a guaranteed accuracy of \(f_{\best}^k - f^* \leq \epsilon\) for some \(\epsilon > 0\) is at least \((RG / \epsilon)^2\) as per the suboptimality bound (11.3) irrespective of the step size selection.

Proof. We have already shown that the suboptimality bound is minimized by the constant step size selection where \(t_i = \frac{R}{G \sqrt{k}}\) for every \(i=1,\dots,k\) and is given by \(\frac{RG}{\sqrt{k}}\).

Let \(\epsilon > 0\) be the required guaranteed accuracy.
Then \(\epsilon \geq \frac{RG}{\sqrt{k}}\) since \(\frac{RG}{\sqrt{k}}\) is the minimum guaranteed suboptimality bound after \(k\) iterations as per (11.3).
Hence we have \(\sqrt{k} \geq \frac{RG}{\epsilon}\).
Hence we have \(k \geq \left ( \frac{RG}{\epsilon} \right )^2\).

We note that the suboptimality bound of \(\epsilon\) can be guaranteed in \(k\) iterations only if we use the constant step sizes of \(t = \frac{R}{G \sqrt{k}}\).

Observation 11.2 (Interpretation of \(R G\))

Initial uncertainty

Assume that \(f\) is Lipschitz continuous with the constant \(G\).
By Lipschitz continuity, we have

\[ f^1 - f^* \leq G \| \bx^1 - \bx^* \|_2 \leq R G. \]
Hence \(RG\) is an estimate of the initial uncertainty in \(f^*\).

Reduction in uncertainty

If our goal is to achieve \(f_{\best}^k - f^* \leq \epsilon\), then \(\epsilon\) is an estimate of the final uncertainty in \(f^*\).
Hence \(RG / \epsilon\) is the ratio of initial uncertainty in \(f^*\) to the final uncertainty in \(f^*\).
By Theorem 11.2, we require a minimum of \((R G / \epsilon)^2\) iterations to achieve this much reduction in uncertainty.

This analysis shows that the subgradient method is very slow. To achieve a 1000x reduction in the uncertainty of \(f^*\), we need a million iterations at least.

Topics in Signal Processing

Basic Subgradient Method

Contents