10.5. Constrained Optimization I

In this section, we focus on objective functions of type f : R^n → R which are convex and differentiable. Our goal is to minimize f over a convex set C ⊆ dom f.

We also concern ourselves with functions of type f : V → (−∞, ∞] where V is a Euclidean space and f is Fréchet differentiable over some open set C ⊆ dom f. For the results concerning Fréchet differentiable functions, we ask the reader to review the concepts of Gateaux and Fréchet differentiability in Differentiation in Banach Spaces.

Main references for this section are [6, 17].

10.5.1. Stationary Points

We first look at functions which are differentiable over a convex set. Here, we don’t require that the function itself be convex. Thus, we may not characterize the global optimality of a point. However, we can still characterize the local optimality of a point.

In Definition 8.6, we defined stationary points for a real valued function as points where the gradient vanishes; i.e., ∇f(x) = 0.

In this section, we wish to restrict the domain of feasible points to a convex set C and consider the problem

minimize f(x) subject to x ∈ C

where f is differentiable over the convex set C.

We now introduce the notion of stationary points for the given optimization problem of minimizing f over C.

Definition 10.21 (Stationary point for an optimization problem)

Let f : R^n → R be a real valued function which is differentiable over a convex set C.

Then, a ∈ C is called a stationary point of the problem of minimizing f over C if

∇f(a)^T (x − a) ≥ 0 for every x ∈ C.

If C = R^n, then the condition reduces to ∇f(a) = 0.

Theorem 10.45 (Local minimum points are stationary points)

Let f : R^n → R be a real valued function which is differentiable over a convex set C. Consider the optimization problem

minimize f(x) subject to x ∈ C.

If a ∈ C is a local minimum, then a must be a stationary point.

Proof. Let a be a local minimum and for contradiction assume that it is not a stationary point.

  1. Then, there exists x ∈ C such that ∇f(a)^T (x − a) < 0.

  2. Let t ∈ [0, 1] be a parameter.

  3. Let z_t = t x + (1 − t) a.

  4. Since C is convex, hence z_t ∈ C.

  5. Differentiating f(z_t) w.r.t. t at t = 0, we obtain

    d/dt f(z_t) |_{t=0} = ∇f(a)^T (x − a) < 0.
  6. Thus, for small enough t, f(z_t) < f(a).

  7. This contradicts the hypothesis that a is a local minimum.

  8. Thus, all local minimum points must be stationary points.
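The stationarity condition of Definition 10.21 is easy to check numerically. The sketch below uses an illustrative one-dimensional problem (not from the text): minimizing f(x) = x² over C = [1, 2]. The constrained local minimum a = 1 satisfies f′(a)(x − a) ≥ 0 over C, while an interior point with nonzero derivative does not.

```python
# Numerical check of Definition 10.21 for f(x) = x^2 on C = [1, 2].
# The unconstrained minimizer 0 lies outside C; the constrained
# minimum is a = 1, where stationarity requires f'(a)(x - a) >= 0 on C.

def grad_f(x):
    # f(x) = x^2, so f'(x) = 2x
    return 2.0 * x

def is_stationary(a, grad, points):
    """Check grad(a) * (x - a) >= 0 for a sample of feasible points."""
    return all(grad(a) * (x - a) >= 0 for x in points)

C = [1 + 0.01 * k for k in range(101)]   # grid over [1, 2]

print(is_stationary(1.0, grad_f, C))   # True: a = 1 is stationary
print(is_stationary(1.5, grad_f, C))   # False: f'(1.5) != 0 at an interior point
```

The grid stands in for the full feasible set; for this one-dimensional problem the sampled check agrees with the analytical condition.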

Example 10.6 (Unconstrained minimization)

Let f : R^n → R be a real valued function which is differentiable. Consider the unconstrained optimization problem

minimize f(x).

In this case, the feasible set is C = R^n.

  1. If a is a stationary point for this problem, then

    ∇f(a)^T (x − a) ≥ 0 for every x ∈ R^n.
  2. In particular, if we choose x = a − ∇f(a), we get

    ∇f(a)^T (x − a) = ∇f(a)^T (−∇f(a)) = −‖∇f(a)‖² ≥ 0.
  3. This is true only if ∇f(a) = 0.

  4. Thus, for unconstrained minimization, the gradient vanishes at stationary points.

Theorem 10.46 (Stationary point as an orthogonal projection)

Let f : R^n → R be a real valued function which is differentiable over a convex set C. Consider the optimization problem

minimize f(x) subject to x ∈ C.

Let s > 0. Then a ∈ C is a stationary point of the optimization problem if and only if

a = P_C(a − s ∇f(a)).

Proof. Recall from Theorem 10.7 that z ∈ C is the projection of x if and only if

⟨y − z, x − z⟩ ≤ 0 for every y ∈ C.
  1. Substituting z = a and x = a − s ∇f(a), we get

    ⟨y − a, (a − s ∇f(a)) − a⟩ ≤ 0 for every y ∈ C
    ⟺ −s ⟨y − a, ∇f(a)⟩ ≤ 0 for every y ∈ C
    ⟺ ∇f(a)^T (y − a) ≥ 0 for every y ∈ C.
  2. But this is the same condition as the definition for a stationary point.
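The fixed-point characterization of Theorem 10.46 can likewise be verified numerically. This sketch reuses the illustrative problem f(x) = x² over C = [1, 2]: the stationary point a = 1 is a fixed point of the map x ↦ P_C(x − s f′(x)) for several values of s > 0, while a non-stationary point moves under the map.

```python
def project_interval(x, lo, hi):
    """Orthogonal projection of x onto the interval [lo, hi]."""
    return min(max(x, lo), hi)

def grad_f(x):
    # f(x) = x^2, so f'(x) = 2x
    return 2.0 * x

a = 1.0   # stationary point of f over C = [1, 2]
for s in (0.1, 1.0, 10.0):
    # a remains fixed under the gradient-projection map for every s > 0
    assert project_interval(a - s * grad_f(a), 1.0, 2.0) == a

# A non-stationary interior point is not a fixed point of the map.
b = 1.5
b_new = project_interval(b - 0.1 * grad_f(b), 1.0, 2.0)
assert b_new != b
```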

10.5.2. First Order Optimality Criteria

We now turn our attention to the case where f is convex as well as differentiable. In this case, a point is a global optimal point if and only if it is a stationary point.

Theorem 10.47 (Optimality criterion for differentiable objective function)

Let f : R^n → R be a differentiable convex function. Let C ⊆ dom f be a convex set. Consider the minimization problem

minimize f(x) subject to x ∈ C.

Then, x ∈ C is an optimal point if and only if

(10.8) ∇f(x)^T (y − x) ≥ 0 for every y ∈ C.

In other words, x is optimal if and only if it is a stationary point.

Proof. By Theorem 9.98,

f(y) ≥ f(x) + ∇f(x)^T (y − x)

for every x, y ∈ dom f.

Assume that some x ∈ C satisfies the optimality criterion in (10.8).

  1. Let y ∈ C.

  2. Then, by differentiability,

    f(y) ≥ f(x) + ∇f(x)^T (y − x).
  3. By hypothesis, ∇f(x)^T (y − x) ≥ 0.

  4. Thus, f(y) ≥ f(x).

  5. Since this is true for every y ∈ C, hence x is an optimal point for the minimization problem.

Now for the converse, assume that x ∈ C is an optimal point.

  1. For contradiction, assume that (10.8) doesn't hold.

  2. Then, there exists y ∈ C such that

    ∇f(x)^T (y − x) < 0.
  3. Let t ∈ [0, 1] be a parameter.

  4. Let z(t) = t y + (1 − t) x.

  5. Since C is convex, hence z(t) ∈ C.

  6. Differentiating f(z(t)) w.r.t. t at t = 0, we obtain

    d/dt f(z(t)) |_{t=0} = ∇f(x)^T (y − x) < 0.
  7. Thus, for small enough t, f(z(t)) < f(x).

  8. Thus, x cannot be an optimal point for the minimization problem.

  9. This contradicts our hypothesis that x is an optimal point.

  10. Hence, (10.8) must hold for every yC.

Theorem 10.48 (Optimality criterion for unconstrained problem with differentiable objective function)

Let f : R^n → R be a differentiable convex function. Consider the unconstrained minimization problem

minimize f(x).

Then, x ∈ R^n is an optimal point if and only if ∇f(x) = 0.

Proof. In this case, the set of feasible points is C = dom f.

  1. Since f is differentiable, hence C is an open set.

  2. By Theorem 10.47, x is an optimal point if and only if

    ∇f(x)^T (y − x) ≥ 0 for every y ∈ C.
  3. If ∇f(x) = 0, then this inequality is satisfied. Hence, x must be an optimal point.

  4. Now, assume that x is an optimal point.

  5. Since x ∈ C and C is open, hence there exists a closed ball B[x, r] ⊆ C.

  6. Let y = x − t ∇f(x).

  7. For sufficiently small t > 0, y ∈ B[x, r] ⊆ C.

  8. Then,

    ∇f(x)^T (y − x) = ∇f(x)^T (−t ∇f(x)) = −t ‖∇f(x)‖² ≥ 0

    must hold true for t > 0.

  9. This means that ‖∇f(x)‖² ≤ 0 must be true.

  10. Thus, ∇f(x) = 0 must be true.

10.5.2.1. Nondifferentiability at the Boundary

There are some specific results available for the unconstrained minimization of a convex function f which is not differentiable everywhere in its domain. If dom f is not open, then f may be differentiable over int dom f but not differentiable at the boundary points (dom f) ∖ (int dom f). The key questions are:

  1. Under what conditions is the minimizer of f a point of differentiability?

  2. Under what conditions may the minimizer of f be at a point of nondifferentiability?

Theorem 10.49 (Zero gradient implies minimizer)

Let f : V → (−∞, ∞] with S = dom f be a proper convex function which is differentiable over some open set U ⊆ S. If ∇f(x) = 0 at some x ∈ U, then x is one of the minimizers of f.

Proof. Recall from Theorem 9.220 that f is differentiable at x if and only if

∂f(x) = {∇f(x)}.
  1. Assume that f is differentiable at x ∈ U and ∇f(x) = 0.

  2. Then, ∇f(x) is the one and only subgradient of f at x.

  3. Due to the subgradient inequality,

    f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ for every y ∈ V.
  4. Since ∇f(x) = 0, hence

    f(y) ≥ f(x) for every y ∈ V.
  5. Thus f attains a minimum at x and x is one of its minimizers.

Next we consider the special case of the minimization of a convex function f which is differentiable in the interior of its domain but the gradient never vanishes. In this case, if a minimizer for f exists, it must be at the boundary of its domain.

Theorem 10.50 (Minimizers at points of nondifferentiability)

Let f : V → (−∞, ∞] with S = dom f be a proper convex function which is differentiable over some open set U ⊆ S and not differentiable over S ∖ U. Assume that f attains its minimum value at some a ∈ S; i.e., a minimizer of f exists. If ∇f(x) ≠ 0 at every x ∈ U, then the minimizer must be at some point of nondifferentiability. In other words, a ∈ S ∖ U.

Proof. We are given that there exists a minimizer a of f and that ∇f(x) ≠ 0 for every x ∈ U.

  1. Pick any x ∈ U.

  2. We need to show that x cannot be a minimizer.

  3. For contradiction, assume that x is a minimizer.

  4. By the subgradient inequality,

    f(y) ≥ f(x) + ⟨y − x, ∇f(x)⟩ for every y ∈ V.
  5. Since x is a minimizer, hence by the first order optimality condition we must have

    ⟨y − x, ∇f(x)⟩ ≥ 0 for every y ∈ V.
  6. Let y = x − ∇f(x).

  7. Then,

    ⟨y − x, ∇f(x)⟩ = ⟨−∇f(x), ∇f(x)⟩ = −‖∇f(x)‖² < 0.
  8. This contradicts our assumption that x is a minimizer.

  9. Hence, if a is a minimizer of f, then a ∈ S ∖ U.

We next show how the condition in (10.8) simplifies for specific optimization problem structures.

10.5.2.2. Equality Constraints

Theorem 10.51 (Differentiable objective minimization with equality constraints)

Let f : R^n → R be a differentiable convex function with dom f = R^n. Consider the minimization problem:

minimize f(x) subject to A x = b

where A ∈ R^{p×n} and b ∈ R^p represent p linear equality constraints. Assume that the problem is feasible.

Then, x is an optimal point for the minimization problem if and only if there exists a vector v ∈ R^p such that

∇f(x) + A^T v = 0.

Proof. The feasible set is given by C = {x | A x = b}. We recall from Theorem 10.47 that x is an optimal point if and only if ∇f(x)^T (y − x) ≥ 0 for every y ∈ C.

  1. Thus, x is optimal if and only if ∇f(x)^T (y − x) ≥ 0 for every y satisfying A y = b.

  2. Since both x and y are feasible, hence A (y − x) = 0.

  3. Thus, z = y − x ∈ N(A).

  4. In fact, y ∈ C if and only if y = x + z for some z ∈ N(A).

  5. Thus, the optimality criterion reduces to ∇f(x)^T z ≥ 0 for every z ∈ N(A).

  6. Note that ∇f(x)^T z is a linear function of z as ∇f(x) is a fixed vector.

  7. If a linear function is nonnegative on a subspace, then it must be identically zero on the subspace (for any z in the subspace, −z also lies in it, forcing ∇f(x)^T z = 0).

  8. Thus, ∇f(x)^T z = 0 for every z ∈ N(A).

  9. In other words, x is optimal if and only if ∇f(x) ⟂ N(A).

  10. Recall that N(A)^⟂ = C(A^T); i.e., the orthogonal complement of the null space of A is the column space (range) of A^T.

  11. Thus, x is optimal if and only if ∇f(x) ∈ C(A^T).

  12. In other words, x is optimal if and only if there exists a vector v ∈ R^p such that

    ∇f(x) + A^T v = 0.

This result is a Lagrange multiplier optimality condition to be discussed in more detail in later sections.
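The pair of conditions ∇f(x) + A^T v = 0 and A x = b can be checked directly on a small instance. The problem below, minimizing ‖x‖² subject to x₁ + x₂ = 1, is an illustrative choice (not from the text); its minimizer is x* = (0.5, 0.5) with multiplier v = −1.

```python
def grad_f(x):
    # gradient of f(x) = x1^2 + x2^2
    return [2.0 * xi for xi in x]

A = [1.0, 1.0]        # single equality constraint: A x = b
b = 1.0
x_star = [0.5, 0.5]   # candidate minimizer
v = -1.0              # candidate Lagrange multiplier

# Feasibility: A x* = b
assert abs(sum(ai * xi for ai, xi in zip(A, x_star)) - b) < 1e-12

# Optimality: grad f(x*) + A^T v = 0, componentwise
residual = [g + ai * v for g, ai in zip(grad_f(x_star), A)]
assert all(abs(r) < 1e-12 for r in residual)
```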

10.5.2.3. Nonnegative Orthant Constraints

Example 10.7 (Differentiable objective minimization over nonnegative orthant)

Let f : R^n → R be a differentiable convex function with dom f = R^n. Consider the minimization problem:

minimize f(x) subject to x ⪰ 0.
  1. The feasible set is the nonnegative orthant R_+^n.

  2. x is optimal if and only if x ⪰ 0 and ∇f(x)^T (y − x) ≥ 0 for every y ⪰ 0.

  3. The term ∇f(x)^T y is unbounded below over y ∈ R_+^n unless ∇f(x) ∈ R_+^n.

  4. Thus, ∇f(x) must be nonnegative.

  5. Then, the minimum value of ∇f(x)^T y over y ⪰ 0 is 0.

  6. Consequently, the optimality condition reduces to −∇f(x)^T x ≥ 0; i.e., ∇f(x)^T x ≤ 0.

  7. But x ⪰ 0 and ∇f(x) ⪰ 0.

  8. Thus, we must have ∇f(x)^T x = 0.

  9. We note that

    ∇f(x)^T x = ∑_{i=1}^n (∇f(x))_i x_i.
  10. Thus, it is a sum of products of nonnegative numbers.

  11. So each term in the sum must be 0.

  12. Thus, (∇f(x))_i x_i = 0 must hold true for every i = 1, …, n.

  13. Thus, the optimality condition can be rephrased as

    x ⪰ 0 and ∇f(x) ⪰ 0 and (∇f(x))_i x_i = 0 for every i = 1, …, n.

The condition (∇f(x))_i x_i = 0 for every i is known as complementarity. It means that for every i, either x_i or (∇f(x))_i or both must be 0. In other words, x_i and (∇f(x))_i cannot both be nonzero at the same time.

Thus, the sparsity patterns of x and ∇f(x) are complementary. In other words,

supp(x) ∩ supp(∇f(x)) = ∅

where supp(x) denotes the index set of nonzero entries of x.
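The complementarity conditions can be observed on a concrete instance. For the illustrative objective f(x) = ‖x − c‖² (not from the text), the minimizer over the nonnegative orthant is x* = max(c, 0) taken componentwise; the sketch below checks feasibility, gradient nonnegativity, and complementarity.

```python
# For f(x) = ||x - c||^2 over the nonnegative orthant, the minimizer
# is x* = max(c, 0) componentwise. We verify the three optimality
# conditions from Example 10.7.

def minimize_over_orthant(c):
    return [max(ci, 0.0) for ci in c]

def grad_f(x, c):
    # gradient of ||x - c||^2
    return [2.0 * (xi - ci) for xi, ci in zip(x, c)]

c = [3.0, -1.0, 0.0, -2.5]
x = minimize_over_orthant(c)
g = grad_f(x, c)

assert all(xi >= 0 for xi in x)                    # x >= 0
assert all(gi >= 0 for gi in g)                    # grad f(x) >= 0
assert all(gi * xi == 0 for gi, xi in zip(g, x))   # complementarity
print(x, g)   # the supports of x and grad f(x) are disjoint
```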

10.5.2.4. Unit Sum Set Constraint

Example 10.8 (Minimization over unit sum set)

Let f : R^n → R be a differentiable convex function with dom f = R^n. Consider the minimization problem:

minimize f(x) subject to 1^T x = 1.
  1. The feasible set is given by

    C = {x ∈ R^n | 1^T x = 1} = {x ∈ R^n | ∑_{i=1}^n x_i = 1}.
  2. This set is also known as the unit sum set.

  3. A point a ∈ C is optimal if and only if

    ∇f(a)^T (x − a) ≥ 0 for every x ∈ C.

    Let us call this condition (I).

  4. This is equivalent to the condition:

    ∂f/∂x_1(a) = ∂f/∂x_2(a) = ⋯ = ∂f/∂x_n(a).

    Let us call this condition (II).

  5. We shall show that (I) and (II) are equivalent.

We first show that (II) implies (I).

  1. Assume that some a ∈ C satisfies (II) with

    α = ∂f/∂x_1(a) = ∂f/∂x_2(a) = ⋯ = ∂f/∂x_n(a).
  2. Then, for any x ∈ C,

    ∇f(a)^T (x − a) = ∑_{i=1}^n ∂f/∂x_i(a) (x_i − a_i) = α ∑_{i=1}^n (x_i − a_i) = α (∑_{i=1}^n x_i − ∑_{i=1}^n a_i) = α (1 − 1) = 0.
  3. Thus, ∇f(a)^T (x − a) ≥ 0 indeed, and (I) is satisfied.

Now, assume that (I) is satisfied for some a ∈ C.

  1. For contradiction, assume that a doesn't satisfy (II).

  2. Then, there exist i, j ∈ {1, …, n} such that

    ∂f/∂x_i(a) > ∂f/∂x_j(a).
  3. Pick a vector x ∈ C with the following definition:

    x_k = a_k for k ∉ {i, j}, x_i = a_i − 1, x_j = a_j + 1.
  4. Note that

    ∑_{k=1}^n x_k = ∑_{k=1}^n a_k = 1.

    Thus, x ∈ C holds.

  5. Now,

    ∇f(a)^T (x − a) = ∑_{k=1}^n ∂f/∂x_k(a) (x_k − a_k) = ∂f/∂x_i(a) (x_i − a_i) + ∂f/∂x_j(a) (x_j − a_j) = −∂f/∂x_i(a) + ∂f/∂x_j(a) < 0.
  6. This violates the hypothesis that (I) holds true.

  7. Thus, (II) must be true.

  8. Thus, (I) implies (II).
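Condition (II) can be seen at work on a simple instance: for the illustrative objective f(x) = ∑ x_i² (an assumption for this sketch, not from the text), the minimizer over the unit sum set is x_i = 1/n for every i, and all partial derivatives at this point share the common value α = 2/n.

```python
# For f(x) = sum(x_i^2) over the unit sum set, the minimizer is
# x_i = 1/n for every i; all partial derivatives are then equal.

def grad_f(x):
    # gradient of sum(x_i^2): (grad f(x))_i = 2 x_i
    return [2.0 * xi for xi in x]

n = 5
x_star = [1.0 / n] * n
g = grad_f(x_star)

assert abs(sum(x_star) - 1.0) < 1e-12           # feasibility: 1^T x = 1
assert all(abs(gi - g[0]) < 1e-12 for gi in g)  # equal partial derivatives
print(g[0])   # the common value alpha = 2/n
```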

10.5.2.5. Unit Ball Constraint

Example 10.9 (Minimization over unit ball)

Let f : R^n → R be a convex function which is differentiable over B[0, 1]. Consider the minimization problem:

minimize f(x) subject to ‖x‖ ≤ 1.
  1. The feasible set is given by

    C = B[0, 1] = {x | ‖x‖ ≤ 1}.
  2. A point a ∈ B[0, 1] is an optimal point if and only if

    ∇f(a)^T (x − a) ≥ 0 for every x ∈ B[0, 1].
  3. This is equivalent to saying that

    inf_{‖x‖ ≤ 1} (∇f(a)^T x − ∇f(a)^T a) ≥ 0.
  4. Recall from Theorem 8.11 that for any v ∈ R^n, the optimal value of the problem

    inf {v^T x | ‖x‖ ≤ 1}

    is −‖v‖.

  5. Thus,

    inf_{‖x‖ ≤ 1} ∇f(a)^T x = −‖∇f(a)‖.
  6. Thus, the inequality simplifies to

    ∇f(a)^T a ≤ −‖∇f(a)‖.
  7. At the same time, by the Cauchy-Schwarz inequality,

    ∇f(a)^T a ≥ −‖∇f(a)‖ ‖a‖ ≥ −‖∇f(a)‖

    since a ∈ B[0, 1].

  8. Thus, the inequality must be an equality. In other words, a is an optimal point if and only if

    ∇f(a)^T a = −‖∇f(a)‖.

We now have the following possibilities for this condition.

  1. If ∇f(a) = 0, then the condition holds and a is indeed an optimal point.

  2. Otherwise, if ∇f(a) ≠ 0, then ‖a‖ = 1 must be true.

    1. For contradiction, assume that ‖a‖ < 1.

    2. Then, by the Cauchy-Schwarz inequality,

      ∇f(a)^T a ≥ −‖∇f(a)‖ ‖a‖ > −‖∇f(a)‖,

      a contradiction.

  3. Thus, if ∇f(a) ≠ 0, then a is an optimal point if and only if ‖a‖ = 1 and

    ∇f(a)^T a = −‖∇f(a)‖ = −‖∇f(a)‖ ‖a‖.
  4. But this is possible only when there exists t ≥ 0 such that

    ∇f(a) = −t a.
  5. Thus, if ∇f(a) ≠ 0, then a is an optimal point if and only if ‖a‖ = 1 and there exists t ≥ 0 such that ∇f(a) = −t a.
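The anti-parallel condition ∇f(a) = −t a is easiest to see for a linear objective. The sketch below uses the illustrative choice f(x) = c^T x (an assumption for this example), whose minimizer over the unit ball is a = −c/‖c‖ with t = ‖c‖.

```python
import math

# For the linear objective f(x) = c^T x, the minimizer over the unit
# ball is a = -c/||c||; then grad f(a) = c = -t a with t = ||c|| >= 0.

c = [3.0, 4.0]
norm_c = math.sqrt(sum(ci * ci for ci in c))   # ||c|| = 5
a = [-ci / norm_c for ci in c]

assert abs(sum(ai * ai for ai in a) - 1.0) < 1e-12   # ||a|| = 1
t = norm_c
# grad f(a) = c satisfies c + t a = 0 componentwise
assert all(abs(ci + t * ai) < 1e-12 for ci, ai in zip(c, a))
print(a)   # [-0.6, -0.8]
```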

10.5.3. Descent Directions Method

We first consider the problem of unconstrained minimization of a continuously differentiable function f.

Typical iterative algorithms which aim to find the solution of the minimization problem start with an initial guess x_0 and perform a step of the form

x_{k+1} = x_k + t_k d_k

where x_k is the current guess (starting from x_0), d_k is a direction in which we move to make the next guess, and t_k is a step size in that direction. x_{k+1} is the next guess obtained from the current guess. We say that an algorithm has made progress if f(x_{k+1}) < f(x_k).

This brings us to the notion of a descent direction.

Definition 10.22 (Descent direction)

Let f : V → R be a continuously differentiable function over V. A nonzero vector d is called a descent direction of f at x if the directional derivative f'(x; d) is negative.

In other words,

f'(x; d) = ⟨d, ∇f(x)⟩ < 0.

If the directional derivative is negative, then it is clear that for a small enough step in this direction, the value of f will decrease.

Lemma 10.4 (Descent property of descent direction)

Let f : V → R be a continuously differentiable function over V. Let x ∈ V. Assume that d is a descent direction for f at x. Then, there exists ε > 0 such that

f(x + t d) < f(x)

for every t ∈ (0, ε].

Proof. This follows from the negativity of the directional derivative.

  1. Recall from Definition 9.70 that

    f'(x; d) = lim_{t → 0+} (f(x + t d) − f(x)) / t.
  2. Since d is a descent direction, hence f'(x; d) < 0.

  3. Thus,

    lim_{t → 0+} (f(x + t d) − f(x)) / t < 0.
  4. Thus, there exists ε > 0 such that

    (f(x + t d) − f(x)) / t < 0

    for every t ∈ (0, ε].

10.5.3.1. Descent Directions Method

Algorithm 10.1 (The descent directions method)

Initialization

  1. Pick x_0 ∈ V arbitrarily as the initial solution.

General iteration: for k = 0, 1, 2, …, execute the following steps

  1. Pick a descent direction d_k.

  2. Pick a step size t_k satisfying f(x_k + t_k d_k) < f(x_k).

  3. Update: x_{k+1} ← x_k + t_k d_k.

  4. Check for convergence: If converged, then STOP.

Return x_{k+1} as the output.

This method is not really an actual algorithm. It can be considered as a template for an actual algorithm. Several aspects of the method need to be carefully chosen to come up with a viable algorithm.

  1. How to select the initial point x_0?

  2. How to choose the descent direction?

  3. How to select the step size?

  4. How to decide when to stop the iterations (stopping criterion)?

The key result that we establish here shows that if the step size t_k is sufficiently small in the descent direction d_k, then there is a sufficient decrease in the value of f going from x_k to x_{k+1}.

Theorem 10.52 (Sufficient decrease condition for descent direction method)

Let f be a continuously differentiable function over R^n. Let x ∈ R^n. Assume that a nonzero d is a descent direction for f at x. Let α ∈ (0, 1). Then, there exists ε > 0 such that the inequality

f(x) − f(x + t d) ≥ −α t ⟨d, ∇f(x)⟩

holds for every t ∈ [0, ε].

Proof. We proceed as follows.

  1. Since f is continuously differentiable, hence due to Theorem 5.1,

    f(x + t d) = f(x) + t ∇f(x)^T d + o(t ‖d‖).
  2. Rearranging the terms and introducing an α ∈ (0, 1), we obtain

    f(x) − f(x + t d) = −α t ∇f(x)^T d − (1 − α) t ∇f(x)^T d − o(t ‖d‖).
  3. Since d is a descent direction of f, hence f'(x; d) = ∇f(x)^T d < 0.

  4. Thus,

    lim_{t → 0+} ((1 − α) t ∇f(x)^T d + o(t ‖d‖)) / t = (1 − α) ∇f(x)^T d < 0.
  5. Hence, there exists ε > 0 such that for every t ∈ (0, ε],

    (1 − α) t ∇f(x)^T d + o(t ‖d‖) < 0.
  6. Thus, from the previous equation,

    f(x) − f(x + t d) ≥ −α t ∇f(x)^T d

    for every t ∈ [0, ε].

10.5.3.2. Step Size Selection

Following are common methods for step size selection. Each method has advantages as well as drawbacks.

  1. Constant step size

  2. Exact line search

  3. Backtracking

Constant step size uses t_k = t̄ for every iteration.

  • A large step size might cause the algorithm to take non-decreasing steps.

  • The algorithm may never converge with a large step size.

  • A small constant step size may lead to very slow convergence.

Exact line search solves the minimization problem

 minimize f along the ray x_k + t d_k.

The optimization variable is the step size parameter t ∈ R_+. The solution is the value t_k ≥ 0.

  • Minimizing f along the ray may not be straightforward.

  • A closed form or algorithmic solution for the exact line search may not be available.

Backtracking is a compromise between these two approaches.

  1. Input parameters: s > 0, α ∈ (0, 1), β ∈ (0, 1).

  2. Initialize t_k = s.

  3. While

    f(x_k) − f(x_k + t_k d_k) < −α t_k ∇f(x_k)^T d_k:
    1. Set t_k ← β t_k.

    2. Continue.

  4. Return step size t_k.

Essentially, we are reducing the step size by a factor β iteratively till f shows sufficient decrease as stipulated by Theorem 10.52.

  • There is no exact line search.

  • Does find a good enough step size satisfying the sufficient decrease condition.

  • Involves multiple evaluations of f.

  • s should be chosen carefully.
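The backtracking loop above can be sketched in a few lines of code. The quadratic objective, starting point, and parameter values below are illustrative choices; the search direction is the negative gradient, which is a descent direction whenever the gradient is nonzero.

```python
def backtracking(f, grad, x, d, s=1.0, alpha=0.5, beta=0.5):
    """Shrink t by beta until the sufficient decrease condition
    f(x) - f(x + t d) >= -alpha * t * grad(x)^T d holds."""
    t = s
    gTd = sum(gi * di for gi, di in zip(grad(x), d))
    fx = f(x)
    while fx - f([xi + t * di for xi, di in zip(x, d)]) < -alpha * t * gTd:
        t *= beta
    return t

# Illustrative objective and gradient
f = lambda x: x[0] ** 2 + 10.0 * x[1] ** 2
grad = lambda x: [2.0 * x[0], 20.0 * x[1]]

x = [1.0, 1.0]
d = [-g for g in grad(x)]        # negative gradient: a descent direction
t = backtracking(f, grad, x, d)
x_new = [xi + t * di for xi, di in zip(x, d)]
assert f(x_new) < f(x)           # the chosen step strictly decreases f
```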

10.5.4. Gradient Method

The gradient method is a descent direction method in which the descent direction is always chosen to be the negative of the gradient at the current point (solution).

d_k = −∇f(x_k).

Lemma 10.5 (Negative of gradient is a descent direction)

Let f be a continuously differentiable function over an open set S. At any point x ∈ S, the negative of the gradient, d = −∇f(x), is a descent direction whenever ∇f(x) ≠ 0.

Proof. We just need to compute the directional derivative.

f'(x; d) = ⟨d, ∇f(x)⟩ = ⟨−∇f(x), ∇f(x)⟩ = −‖∇f(x)‖² < 0.

The last strict inequality is valid since ∇f(x) ≠ 0.
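A minimal gradient method with a constant step size is sketched below on an illustrative quadratic (an assumed example, not from the text): with t = 0.1, each coordinate contracts by the factor 1 − 2t = 0.8 per iteration, so the iterates converge to the minimizer at the origin.

```python
# Gradient method with constant step size on f(x) = x1^2 + x2^2.

def grad_f(x):
    return [2.0 * xi for xi in x]

x = [4.0, -2.0]
t = 0.1
for _ in range(200):
    g = grad_f(x)
    # descent direction is the negative gradient: x <- x - t * grad f(x)
    x = [xi - t * gi for xi, gi in zip(x, g)]

assert all(abs(xi) < 1e-8 for xi in x)   # iterates reached the minimizer
print(x)
```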

10.5.5. Gradient Projection Method

In this subsection, we present a method to solve the optimization problem

minimize f(x) subject to x ∈ C

where C is a convex set and f is differentiable over C.

Recall from Theorem 10.46 that a is a stationary point for the problem of minimizing f over a convex set C if and only if

a = P_C(a − s ∇f(a)) for any s > 0.

This stationarity condition is the basis for the gradient projection method presented below.

Algorithm 10.2 (The gradient projection method)

Inputs

  1. ϵ>0 - tolerance parameter

Initialization

  1. Pick x_0 ∈ C arbitrarily.

General iteration: for k = 0, 1, 2, …, execute the following steps

  1. Pick a step size tk by a line search procedure.

  2. Update: x_{k+1} ← P_C(x_k − t_k ∇f(x_k)).

  3. Check for convergence: If ‖x_{k+1} − x_k‖ ≤ ε, then STOP.

Return x_{k+1} as the output.

In the case of unconstrained minimization:

  1. C = R^n.

  2. P_C(x_k − t_k ∇f(x_k)) = x_k − t_k ∇f(x_k).

  3. x_{k+1} = x_k − t_k ∇f(x_k).

  4. We see that gradient projection reduces to gradient descent.

Another way to look at the algorithm is:

  1. y_{k+1} = x_k − t_k ∇f(x_k) computes the next candidate solution assuming no constraints.

  2. The step x_{k+1} = P_C(y_{k+1}) projects the next candidate solution back to the feasible set C.

  3. Thus, we have a gradient step followed by a projection step.
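The gradient step followed by the projection step can be sketched as follows on an illustrative box-constrained quadratic (an assumed example, not from the text). The unconstrained minimizer (3, −1) is infeasible, and the projected iterates settle at the boundary point (1, 0).

```python
# Gradient projection on: minimize (x1 - 3)^2 + (x2 + 1)^2
# subject to x in the box [0, 1]^2.

def grad_f(x):
    return [2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)]

def project_box(x, lo=0.0, hi=1.0):
    # componentwise projection onto [lo, hi]^2
    return [min(max(xi, lo), hi) for xi in x]

x = [0.5, 0.5]
t = 0.1                      # constant step size, for simplicity
for _ in range(100):
    g = grad_f(x)
    y = [xi - t * gi for xi, gi in zip(x, g)]   # gradient step
    x = project_box(y)                          # projection step

assert abs(x[0] - 1.0) < 1e-6 and abs(x[1]) < 1e-6
print(x)   # approximately [1.0, 0.0]
```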

10.5.5.1. Convergence

If we can establish conditions under which each iteration of the gradient projection algorithm leads to a sufficient decrease in the value of the objective function, then we can use this to establish convergence guarantees for the algorithm.

Lemma 10.6 (Sufficient decrease lemma for constrained problems)

Consider the optimization problem:

minimize f(x) subject to x ∈ C.

Assume that C is a nonempty closed convex set, f is continuously differentiable over C, and ∇f is Lipschitz continuous with a constant L over C. Then, for any x ∈ C and t ∈ (0, 2/L), the following inequality holds:

f(x) − f(y) ≥ t (1 − L t / 2) ‖(1/t)(x − y)‖²

where y = P_C(x − t ∇f(x)).

Proof.