5.1. Differentiation
We consider functions from \(\RR^n\) to \(\RR^m\).
5.1.1. Differentiability and Jacobian
(Differentiability at a point)
Let \(f : \RR^n \to \RR^m\). Let \(\bx \in \interior \dom f\). The function \(f\) is differentiable at \(\bx\) if there exists a matrix \(Df(\bx) \in \RR^{m \times n}\) that satisfies

\[
\lim_{\substack{\bz \in \dom f, \bz \neq \bx \\ \bz \to \bx}} \frac{\| f(\bz) - f(\bx) - Df(\bx) (\bz - \bx) \|_2}{\| \bz - \bx \|_2} = 0. \tag{5.1}
\]
Such a matrix \(Df(\bx)\) is called the derivative (or Jacobian) of \(f\) at \(\bx\).
There can be at most one \(Df(\bx)\) satisfying the limit in (5.1).
If we write \(\bz = \bx + \bh\) then an alternative form for (5.1) is given by:

\[
\lim_{\bh \to \bzero} \frac{\| f(\bx + \bh) - f(\bx) - Df(\bx) \bh \|_2}{\| \bh \|_2} = 0.
\]
The matrix \(Df(\bx)\) can be obtained from the partial derivatives:

\[
Df(\bx)_{i j} = \frac{\partial f_i(\bx)}{\partial x_j}, \quad i = 1, \dots, m; \; j = 1, \dots, n.
\]
The Jacobian \(Df(\bx)\) is an \(m \times n\) real matrix.
Partial derivatives of each component of \(f\) (i.e., \(f_i\)) line up on the \(i\)-th row.
Partial derivatives for one coordinate \(x_j\) line up on the \(j\)-th column.
If \(f\) is single valued, then the Jacobian \(Df(\bx)\) is a row vector.
(Jacobian of identity function)
Let \(f: \RR^n \to \RR^n\) be defined as:

\[
f(\bx) = \bx.
\]

Then, \(f_i(\bx) = x_i\). Hence,

\[
\frac{\partial f_i(\bx)}{\partial x_j} = \begin{cases} 1 & \text{ if } i = j, \\ 0 & \text{ otherwise}. \end{cases}
\]

Thus,

\[
Df(\bx) = \bI,
\]

the \(n\times n\) identity matrix.
(Jacobian of linear transformation)
Let \(f: \RR^n \to \RR^m\) be defined as:

\[
f(\bx) = \bA \bx,
\]

where \(\bA = (a_{i j})\) is an \(m \times n\) real matrix.

Then, \(f_i(\bx) = \sum_{j=1}^n a_{i j} x_j\). Hence,

\[
\frac{\partial f_i(\bx)}{\partial x_j} = a_{i j}.
\]

Thus,

\[
Df(\bx) = \bA.
\]
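As a quick numerical check (a sketch, not part of the original text; it assumes NumPy and an illustrative helper `jacobian_fd`), a finite-difference Jacobian of the map \(f(\bx) = \bA \bx\) should reproduce \(\bA\) itself at any point:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # an arbitrary 3 x 4 matrix
f = lambda x: A @ x              # the linear map f(x) = A x

def jacobian_fd(f, x, eps=1e-6):
    """Approximate Df(x) column by column with central differences."""
    cols = []
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.column_stack(cols)

x = rng.standard_normal(4)
print(np.allclose(jacobian_fd(f, x), A))  # True: Df(x) = A everywhere
```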
(Jacobian of affine transformation)
Let \(f: \RR^n \to \RR^m\) be defined as:

\[
f(\bx) = \bA \bx + \bb,
\]

where \(\bA = (a_{i j}) \in \RR^{m \times n}\) and \(\bb \in \RR^m\).

Then, \(f_i(\bx) = \sum_{j=1}^n a_{i j} x_j + b_i\). Hence,

\[
\frac{\partial f_i(\bx)}{\partial x_j} = a_{i j}.
\]

Thus,

\[
Df(\bx) = \bA.
\]
The vector \(\bb\) is a constant offset. It has no impact on the derivative.
(Differentiable function)
A function \(f\) is called differentiable if its domain \(\dom f\) is open and it is differentiable at every point of \(\dom f\).
(First order approximation)
The affine function given by:

\[
\hat{f}(\bz) = f(\ba) + Df(\ba) (\bz - \ba)
\]

is called the first order approximation of \(f\) at \(\bx = \ba \in \interior \dom f\).
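A small sketch (not in the original; the function and the point are arbitrary choices) illustrating the definition: the error of the first order approximation vanishes faster than \(\| \bz - \ba \|_2\):

```python
import numpy as np

# f : R^2 -> R^2 with f(x, y) = (x * y, sin x); Jacobian computed by hand
f = lambda v: np.array([v[0] * v[1], np.sin(v[0])])
a = np.array([1.0, 2.0])
Df_a = np.array([[a[1], a[0]],
                 [np.cos(a[0]), 0.0]])

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * np.array([1.0, -1.0])
    err = np.linalg.norm(f(a + h) - f(a) - Df_a @ h)
    print(t, err / np.linalg.norm(h))  # the ratio tends to 0 with t
```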
5.1.2. Real Valued Functions
The rest of this section focuses mostly on real valued functions of type \(f : \RR^n \to \RR\).
The first order derivative of a real valued function is called the gradient.
The second order derivative of a real valued function is called the Hessian.
We consider first order and second order approximations of a real valued function.
5.1.3. Gradient
(Gradient)
When \(f : \RR^n \to \RR\) is a real valued function, the derivative \(Df(\bx)\) is a \(1 \times n\) matrix. The gradient of a real valued function is defined as:

\[
\nabla f(\bx) = Df(\bx)^T
\]
at \(\bx \in \interior \dom f\) if \(f\) is differentiable at \(\bx\).
For real valued functions, the derivative is a row vector but the gradient is a column vector.
The components of the gradient are given by the partial derivatives as:

\[
\nabla f(\bx) = \begin{bmatrix} \frac{\partial f(\bx)}{\partial x_1} \\ \vdots \\ \frac{\partial f(\bx)}{\partial x_n} \end{bmatrix}.
\]
(Gradient of linear functional)
Let \(f : \RR^n \to \RR\) be a linear functional given by:

\[
f(\bx) = \ba^T \bx,
\]

where \(\ba \in \RR^n\).

We can expand it as:

\[
f(\bx) = \sum_{i=1}^n a_i x_i.
\]

Computing the partial derivative with respect to \(x_i\), we get:

\[
\frac{\partial f(\bx)}{\partial x_i} = a_i.
\]

Putting the partial derivatives together, we get:

\[
\nabla f(\bx) = \ba.
\]
(Gradient of affine functional)
Let \(f : \RR^n \to \RR\) be an affine functional given by:

\[
f(\bx) = \ba^T \bx + b,
\]

where \(\ba \in \RR^n\) and \(b \in \RR\).

We can expand it as:

\[
f(\bx) = \sum_{i=1}^n a_i x_i + b.
\]

Computing the partial derivative with respect to \(x_i\), we get:

\[
\frac{\partial f(\bx)}{\partial x_i} = a_i.
\]

Putting the partial derivatives together, we get:

\[
\nabla f(\bx) = \ba.
\]
The intercept \(b\) is a constant term which doesn’t affect the gradient.
(Gradient of quadratic form)
Let \(f : \RR^n \to \RR\) be a quadratic form given by:

\[
f(\bx) = \bx^T \bA \bx,
\]

where \(\bA \in \RR^{n \times n}\).

We can expand it as:

\[
f(\bx) = \sum_{i=1}^n \sum_{j=1}^n a_{i j} x_i x_j.
\]

Note that the diagonal elements \(a_{ii}\) give us terms of the form \(a_{i i} x_i^2\). Let us split the expression into diagonal and non-diagonal terms:

\[
f(\bx) = \sum_{i=1}^n a_{i i} x_i^2 + \sum_{i \neq j} a_{i j} x_i x_j.
\]
There are \(n\) terms in the first sum (the diagonal entries of \(\bA\)) and \(n^2 - n\) terms in the second sum (the non-diagonal entries of \(\bA\)).
Taking the partial derivative w.r.t. \(x_k\), we obtain:

\[
\frac{\partial f(\bx)}{\partial x_k} = 2 a_{k k} x_k + \sum_{i \neq k} a_{i k} x_i + \sum_{j \neq k} a_{k j} x_j.
\]

The first term comes from the \(a_{k k} x_k^2\) term, which is quadratic in \(x_k\).
The first sum comes from the linear terms with \(j = k\) and \(i \in \{1, \dots, n\}\), \(i \neq k\).
The second sum comes from the linear terms with \(i = k\) and \(j \in \{1, \dots, n\}\), \(j \neq k\).
There are \(2n - 2\) terms in the two sums and two \(a_{k k} x_k\) terms.
We can move one \(a_{k k} x_k\) into each sum to simplify the partial derivative as:

\[
\frac{\partial f(\bx)}{\partial x_k} = \sum_{i=1}^n a_{i k} x_i + \sum_{j=1}^n a_{k j} x_j.
\]
Note that the \(k\)-th component of the vector \(\bu = \bA \bx\) is \(\sum_{j=1}^n a_{k j} x_j\).
Similarly, the \(k\)-th component of the vector \(\bv = \bA^T \bx\) is \(\sum_{i=1}^n a_{i k} x_i\).
Thus,

\[
\frac{\partial f(\bx)}{\partial x_k} = v_k + u_k = (\bA^T \bx)_k + (\bA \bx)_k.
\]

Putting together the partial derivatives, we obtain:

\[
\nabla f(\bx) = \bA \bx + \bA^T \bx = (\bA + \bA^T) \bx.
\]

If \(\bA\) is symmetric then,

\[
\nabla f(\bx) = 2 \bA \bx.
\]
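A numerical sanity check of the formula \(\nabla f(\bx) = (\bA + \bA^T) \bx\) (a sketch assuming NumPy; `grad_fd` is an illustrative central-difference helper):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))  # a generic (non-symmetric) matrix
f = lambda x: x @ A @ x          # the quadratic form x^T A x

def grad_fd(f, x, eps=1e-6):
    """Approximate the gradient with central differences."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = rng.standard_normal(5)
print(np.allclose(grad_fd(f, x), (A + A.T) @ x))  # True
```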
(Gradient of squared \(\ell_2\) norm)
Let \(f : \RR^n \to \RR\) be a quadratic form given by:

\[
f(\bx) = \| \bx \|_2^2 = \bx^T \bx.
\]

We can write this as

\[
f(\bx) = \bx^T \bI \bx,
\]

where \(\bI\) is the identity matrix.

Following Example 5.6,

\[
\nabla f(\bx) = (\bI + \bI^T) \bx = 2 \bx.
\]
(Gradient of quadratic functional)
Let \(\bP \in \SS^n\) be a symmetric matrix. Let \(\bq \in \RR^n\) and \(r \in \RR\). Consider the quadratic functional \(f: \RR^n \to \RR\) given as:

\[
f(\bx) = \frac{1}{2} \bx^T \bP \bx + \bq^T \bx + r.
\]

We can compute the gradient as follows:

\[
\nabla f(\bx) = \frac{1}{2} \nabla \left( \bx^T \bP \bx \right) + \nabla (\bq^T \bx) + \nabla r = \frac{1}{2} (\bP + \bP^T) \bx + \bq = \bP \bx + \bq.
\]
We took advantage of the fact that the gradient operation commutes with scalar multiplication and distributes over vector addition.
Since \(r\) is a constant, it has no contribution to the derivative.
We reused results from previous examples.
We utilized the fact that \(\bP = \bP^T\) since \(\bP\) is symmetric.
In summary:

\[
\nabla f(\bx) = \bP \bx + \bq.
\]

The derivative of \(f\) is then obtained by taking the transpose of the gradient:

\[
Df(\bx) = (\nabla f(\bx))^T = \bx^T \bP + \bq^T.
\]
(Gradient mapping)
If a real valued function \(f: \RR^n \to \RR\) is differentiable, the gradient mapping of \(f\) is the function \(\nabla f : \RR^n \to \RR^n\) with \(\dom \nabla f = \dom f\), with the value \(\nabla f(\bx)\) at every \(\bx \in \dom f\).
5.1.4. Continuous Differentiability
(Continuously differentiable real valued function)
Let \(f: \RR^n \to \RR\) be a real valued function with \(S = \dom f\). Let \(U \subseteq S\) be an open set. If all the partial derivatives of \(f\) exist and are continuous at every \(\bx \in U\), then \(f\) is called continuously differentiable over \(U\).
If \(f\) is continuously differentiable over an open set \(U \subseteq S\), then it is continuously differentiable over every subset \(C \subseteq U\).
If \(S\) is open itself and \(f\) is continuously differentiable over \(S\), then \(f\) is called continuously differentiable.
5.1.5. First Order Approximation
(First order approximation of real valued functions)
The affine function given by:

\[
\hat{f}(\bz) = f(\ba) + \nabla f(\ba)^T (\bz - \ba)
\]

is the first order approximation of a real valued function \(f\) at \(\bx = \ba \in \interior \dom f\).
(First order approximation accuracy)
Let \(f : \RR^n \to \RR\) be defined on an open set \(S = \dom f\). Assume that \(f\) is continuously differentiable on \(S\). Then,

\[
\lim_{\bz \to \ba} \frac{f(\bz) - f(\ba) - \nabla f(\ba)^T (\bz - \ba)}{\| \bz - \ba \|} = 0.
\]

Another way to write this result is:

\[
f(\bz) = f(\ba) + \nabla f(\ba)^T (\bz - \ba) + o(\| \bz - \ba \|),
\]
where \(\ba \in S\) and \(o(\cdot) : \RR_+ \to \RR\) is a one dimensional function satisfying \(\frac{o(t)}{t} \to 0\) as \(t \to 0^+\).
5.1.6. Chain Rule
(Chain rule)
Suppose \(f : \RR^n \to \RR^m\) is differentiable at \(\bx \in \interior \dom f\) and \(g : \RR^m \to \RR^p\) is differentiable at \(f(\bx) \in \interior \dom g\). Define the composition \(h: \RR^n \to \RR^p\) as:

\[
h(\bx) = g(f(\bx)).
\]

Then, \(h\) is differentiable at \(\bx\) with the derivative given by:

\[
Dh(\bx) = Dg(f(\bx)) Df(\bx).
\]
Notice how the derivative lines up as a simple matrix multiplication.
(Chain rule for real valued functions)
Suppose \(f : \RR^n \to \RR\) is differentiable at \(\bx \in \interior \dom f\) and \(g : \RR \to \RR\) is differentiable at \(f(\bx) \in \interior \dom g\). Define the composition \(h: \RR^n \to \RR\) as:

\[
h(\bx) = g(f(\bx)).
\]

Then, \(h\) is differentiable at \(\bx\) with the gradient given by:

\[
\nabla h(\bx) = g'(f(\bx)) \nabla f(\bx).
\]
(Gradient of log-sum-exp)
Let \(h : \RR^n \to \RR\) be given by:

\[
h(\bx) = \ln \left( \sum_{i=1}^n e^{x_i} \right)
\]

with \(\dom h = \RR^n\).

Let \(g(y) = \ln y\) and

\[
f(\bx) = \sum_{i=1}^n e^{x_i}.
\]

Then, we can see that \(h(\bx) = g (f (\bx))\). Now \(g'(y) = \frac{1}{y}\) and

\[
\nabla f(\bx) = \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix}.
\]

Thus,

\[
\nabla h(\bx) = g'(f(\bx)) \nabla f(\bx) = \frac{1}{\sum_{i=1}^n e^{x_i}} \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix}.
\]

Now, if we define

\[
\bz = \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix},
\]

then, we see that:

\[
h(\bx) = \ln (\bone^T \bz).
\]

Using this notation:

\[
\nabla h(\bx) = \frac{1}{\bone^T \bz} \bz.
\]
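A sketch (assuming NumPy; `lse` and `grad_fd` are illustrative helper names) verifying numerically that the gradient of log-sum-exp is \(\bz / (\bone^T \bz)\):

```python
import numpy as np

def lse(x):
    """log-sum-exp: ln(sum_i e^{x_i})."""
    return np.log(np.sum(np.exp(x)))

def grad_fd(f, x, eps=1e-6):
    """Approximate the gradient with central differences."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([0.5, -1.0, 2.0])
z = np.exp(x)
print(np.allclose(grad_fd(lse, x), z / z.sum()))  # True
```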
(Gradient of \(\ell_2\) norm at nonzero vectors)
Let \(h : \RR^n \to \RR\) be given by:

\[
h(\bx) = \| \bx \|_2
\]

with \(\dom h = \RR^n\).

Let \(g : \RR \to \RR\) with \(\dom g = \RR_+\) be given by \(g(y) = \sqrt{y}\).

Let \(f : \RR^n \to \RR\) with \(\dom f = \RR^n\) be given by

\[
f(\bx) = \| \bx \|_2^2 = \bx^T \bx.
\]
Then, we can see that \(h(\bx) = g (f (\bx))\) or \(h = g \circ f\).
\(g\) is differentiable on the open set \(\RR_{++}\). For every \(y \in \RR_{++}\),

\[
g'(y) = \frac{1}{2 \sqrt{y}},
\]

and (from Example 5.7)

\[
\nabla f(\bx) = 2 \bx.
\]

Thus, for every \(\bx \neq \bzero\), following Corollary 5.1,

\[
\nabla h(\bx) = g'(f(\bx)) \nabla f(\bx) = \frac{1}{2 \sqrt{\bx^T \bx}} 2 \bx = \frac{\bx}{\| \bx \|_2}.
\]
The gradient of \(\ell_2\) norm at \(\bzero\) doesn’t exist. However, subgradients can be computed. See Example 9.71 and Example 9.72.
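A quick check away from the origin (a sketch; the test point is an arbitrary choice):

```python
import numpy as np

norm = lambda x: np.sqrt(x @ x)  # the l2 norm

def grad_fd(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([3.0, -4.0])
print(np.allclose(grad_fd(norm, x), x / norm(x)))  # True for x != 0
```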
(Chain rule for composition with affine function)
Suppose \(f : \RR^n \to \RR^m\) is differentiable. Let \(\bA \in \RR^{n \times p}\) and \(\bb \in \RR^n\). Define \(g : \RR^p \to \RR^m\) as:

\[
g(\bx) = f(\bA \bx + \bb)
\]
with \(\dom g = \{ \bx \ST \bA \bx + \bb \in \dom f \}\).
The derivative of \(g\) at \(\bx \in \interior \dom g\) is given by:

\[
Dg(\bx) = Df(\bA \bx + \bb) \bA.
\]

If \(f\) is real valued (i.e., \(m=1\)), then the gradient of a composition of a function with an affine function is given by:

\[
\nabla g(\bx) = \bA^T \nabla f(\bA \bx + \bb).
\]
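A sketch (assuming NumPy; log-sum-exp plays the role of \(f\) purely for illustration) checking \(\nabla g(\bx) = \bA^T \nabla f(\bA \bx + \bb)\):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))  # maps R^3 into R^4
b = rng.standard_normal(4)

lse = lambda y: np.log(np.sum(np.exp(y)))         # f = log-sum-exp on R^4
grad_lse = lambda y: np.exp(y) / np.exp(y).sum()  # its gradient
g = lambda x: lse(A @ x + b)                      # composition with affine map

def grad_fd(f, x, eps=1e-6):
    out = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        out[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return out

x = rng.standard_normal(3)
print(np.allclose(grad_fd(g, x), A.T @ grad_lse(A @ x + b)))  # True
```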
(Chain rule for restriction on a line)
Let \(f : \RR^n \to \RR\) be a real valued differentiable function. Consider the restriction of \(f\) on a line in its domain

\[
g(t) = f(\bx + t \bv),
\]

where \(\bx \in \dom f\) and \(\bv \in \RR^n\), with the domain

\[
\dom g = \{ t \in \RR \ST \bx + t \bv \in \dom f \}.
\]

If we define \(h : \RR \to \RR^n\) as:

\[
h(t) = \bx + t \bv,
\]

we can see that:

\[
g(t) = f(h(t)).
\]

By the chain rule:

\[
g'(t) = Df(h(t)) Dh(t) = \nabla f(\bx + t \bv)^T \bv.
\]

In particular, if \(\bv = \by - \bx\), with \(\by \in \dom f\),

\[
g'(t) = \nabla f(\bx + t (\by - \bx))^T (\by - \bx).
\]
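A sketch (with log-sum-exp standing in for an arbitrary differentiable \(f\)) confirming \(g'(t) = \nabla f(\bx + t \bv)^T \bv\):

```python
import numpy as np

lse = lambda y: np.log(np.sum(np.exp(y)))         # f = log-sum-exp
grad_lse = lambda y: np.exp(y) / np.exp(y).sum()  # gradient of f

x = np.array([0.3, -0.2, 0.9])
v = np.array([1.0, 2.0, -1.0])
g = lambda t: lse(x + t * v)  # restriction of f on a line

t, eps = 0.4, 1e-6
fd = (g(t + eps) - g(t - eps)) / (2 * eps)      # numerical g'(t)
print(np.isclose(fd, grad_lse(x + t * v) @ v))  # True
```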
5.1.7. Hessian
In this section, we review the second derivative of a real valued function \(f: \RR^n \to \RR\).
(Hessian)
The second derivative or Hessian matrix of \(f\) at \(\bx \in \interior \dom f\), denoted by \(\nabla^2 f(\bx)\), is given by:

\[
\left( \nabla^2 f(\bx) \right)_{i j} = \frac{\partial^2 f(\bx)}{\partial x_i \partial x_j}, \quad i = 1, \dots, n; \; j = 1, \dots, n,
\]
provided \(f\) is twice differentiable at \(\bx\).
(Hessian of linear functional)
Let \(f : \RR^n \to \RR\) be a linear functional given by:

\[
f(\bx) = \ba^T \bx.
\]

We can expand it as:

\[
f(\bx) = \sum_{i=1}^n a_i x_i.
\]

Computing the partial derivative with respect to \(x_i\), we get:

\[
\frac{\partial f(\bx)}{\partial x_i} = a_i.
\]

If we further compute the partial derivative w.r.t. \(x_j\), we get:

\[
\frac{\partial^2 f(\bx)}{\partial x_j \partial x_i} = 0.
\]

Thus, the Hessian is the \(n \times n\) zero matrix:

\[
\nabla^2 f(\bx) = \bO.
\]
The Hessian is the derivative of the gradient mapping.
(Hessian of quadratic form)
Let \(f : \RR^n \to \RR\) be a quadratic form given by:

\[
f(\bx) = \bx^T \bA \bx,
\]

where \(\bA \in \RR^{n \times n}\).

Recall from Example 5.6 that:

\[
\nabla f(\bx) = (\bA + \bA^T) \bx.
\]

Also recall from Example 5.2 that

\[
D (\bC \bx) = \bC
\]

for all \(\bC \in \RR^{m \times n}\).

Thus, using Theorem 5.3,

\[
\nabla^2 f(\bx) = D (\nabla f)(\bx) = \bA + \bA^T.
\]

If \(\bA\) is symmetric then

\[
\nabla^2 f(\bx) = 2 \bA.
\]
(Hessian of log-sum-exp)
Let \(f : \RR^n \to \RR\) be given by:

\[
f(\bx) = \ln \left( \sum_{i=1}^n e^{x_i} \right)
\]

with \(\dom f = \RR^n\).

Define

\[
\bz = \begin{bmatrix} e^{x_1} \\ \vdots \\ e^{x_n} \end{bmatrix},
\]

then, we see that:

\[
f(\bx) = \ln (\bone^T \bz).
\]

Using this notation:

\[
\nabla f(\bx) = \frac{1}{\bone^T \bz} \bz.
\]

We have:

\[
\frac{\partial z_j}{\partial x_j} = z_j = e^{x_j},
\]

and \(\frac{\partial z_j}{\partial x_i} = 0\) for \(i \neq j\). Now,

\[
\frac{\partial f(\bx)}{\partial x_i} = \frac{z_i}{\bone^T \bz}.
\]

Proceeding to compute the second derivatives:

\[
\frac{\partial^2 f(\bx)}{\partial x_i \partial x_j} = \frac{\delta_{i j} z_i}{\bone^T \bz} - \frac{z_i z_j}{(\bone^T \bz)^2}.
\]
Now, note that \((\bz \bz^T)_{i j} = z_i z_j\). And, \((\Diag (\bz))_{i j} = \delta_{ i j} z_i \).
Thus,

\[
\nabla^2 f(\bx) = \frac{1}{\bone^T \bz} \Diag(\bz) - \frac{1}{(\bone^T \bz)^2} \bz \bz^T.
\]

Alternatively,

\[
\nabla^2 f(\bx) = \frac{1}{(\bone^T \bz)^2} \left( (\bone^T \bz) \Diag(\bz) - \bz \bz^T \right).
\]
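A numerical check of this Hessian formula (a sketch; `hess_fd` is an illustrative second-order central-difference helper):

```python
import numpy as np

lse = lambda x: np.log(np.sum(np.exp(x)))

def hess_fd(f, x, eps=1e-4):
    """Approximate the Hessian with second order central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

x = np.array([0.1, -0.7, 1.3])
z = np.exp(x)
s = z.sum()
H = np.diag(z) / s - np.outer(z, z) / s ** 2  # the formula derived above
print(np.allclose(hess_fd(lse, x), H, atol=1e-6))  # True
```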
(Derivatives for least squares cost function)
Let \(\bA \in \RR^{m \times n}\). Let \(\bb \in \RR^m\). Consider the least squares cost function:

\[
f(\bx) = \| \bA \bx - \bb \|_2^2.
\]

Expanding it, we get:

\[
f(\bx) = \bx^T \bA^T \bA \bx - 2 \bb^T \bA \bx + \bb^T \bb.
\]

Note that \(\bA^T \bA\) is symmetric. Using previous results, we obtain the gradient:

\[
\nabla f(\bx) = 2 \bA^T \bA \bx - 2 \bA^T \bb = 2 \bA^T (\bA \bx - \bb).
\]

And the Hessian is:

\[
\nabla^2 f(\bx) = 2 \bA^T \bA.
\]
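A sketch (assuming NumPy; the problem data are random) checking the gradient \(2 \bA^T (\bA \bx - \bb)\); the Hessian \(2 \bA^T \bA\) does not depend on \(\bx\):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
b = rng.standard_normal(6)
f = lambda x: np.sum((A @ x - b) ** 2)  # least squares cost

def grad_fd(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = rng.standard_normal(4)
print(np.allclose(grad_fd(f, x), 2 * A.T @ (A @ x - b)))  # True
```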
(Derivatives for quadratic over linear function)
Let \(f : \RR \times \RR \to \RR\) be given by:

\[
f(x, y) = \frac{x^2}{y}
\]

with \(\dom f = \{ (x, y) \ST y > 0\}\).

The gradient is obtained by computing the partial derivatives w.r.t. \(x\) and \(y\):

\[
\nabla f(x, y) = \begin{bmatrix} \frac{2 x}{y} \\ - \frac{x^2}{y^2} \end{bmatrix}.
\]

The Hessian is obtained by computing the second order partial derivatives:

\[
\nabla^2 f(x, y) = \frac{2}{y^3} \begin{bmatrix} y^2 & - x y \\ - x y & x^2 \end{bmatrix}.
\]
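A sketch checking both formulas at an arbitrary point of the domain (the finite-difference helpers are illustrative):

```python
import numpy as np

f = lambda v: v[0] ** 2 / v[1]  # f(x, y) = x^2 / y, defined for y > 0

def grad_fd(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def hess_fd(f, x, eps=1e-4):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

v = np.array([1.5, 2.0])
x, y = v
grad = np.array([2 * x / y, -x ** 2 / y ** 2])
hess = (2 / y ** 3) * np.array([[y ** 2, -x * y],
                                [-x * y, x ** 2]])
print(np.allclose(grad_fd(f, v), grad))             # True
print(np.allclose(hess_fd(f, v), hess, atol=1e-6))  # True
```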
5.1.8. Twice Continuous Differentiability
(Twice continuously differentiable real valued function)
Let \(f: \RR^n \to \RR\) be a real valued function with \(S = \dom f\). Let \(U \subseteq S\) be an open set. If all the second order partial derivatives of \(f\) exist and are continuous at every \(\bx \in U\), then \(f\) is called twice continuously differentiable over \(U\).
If \(f\) is twice continuously differentiable over an open set \(U \subseteq S\), then it is twice continuously differentiable over every subset \(C \subseteq U\).
If \(S\) is open itself and \(f\) is twice continuously differentiable over \(S\), then \(f\) is called twice continuously differentiable.
(Symmetry of Hessian)
If \(f : \RR^n \to \RR\) with \(S = \dom f\) is twice continuously differentiable over a set \(U \subseteq S\), then its Hessian matrix \(\nabla^2 f(\bx)\) is symmetric at every \(\bx \in U\).
5.1.9. Second Order Approximation
(Linear approximation theorem)
Let \(f : \RR^n \to \RR\) with \(S = \dom f\) be twice continuously differentiable over an open set \(U \subseteq S\). Let \(\bx \in U\). Let \(r > 0\) be such that \(B(\bx, r) \subseteq U\). Then, for any \(\by \in B(\bx, r)\), there exists \(\bz \in [\bx, \by]\) such that

\[
f(\by) = f(\bx) + \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bz) (\by - \bx).
\]
(Quadratic approximation theorem)
Let \(f : \RR^n \to \RR\) with \(S = \dom f\) be twice continuously differentiable over an open set \(U \subseteq S\). Let \(\bx \in U\). Let \(r > 0\) be such that \(B(\bx, r) \subseteq U\). Then, for any \(\by \in B(\bx, r)\),

\[
f(\by) = f(\bx) + \nabla f(\bx)^T (\by - \bx) + \frac{1}{2} (\by - \bx)^T \nabla^2 f(\bx) (\by - \bx) + o(\| \by - \bx \|^2).
\]
(Second order approximation)
The second order approximation of \(f\) at or near \(\bx = \ba\) is the quadratic function defined by:

\[
\hat{f}(\bz) = f(\ba) + \nabla f(\ba)^T (\bz - \ba) + \frac{1}{2} (\bz - \ba)^T \nabla^2 f(\ba) (\bz - \ba).
\]
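A sketch (log-sum-exp again, with its gradient and Hessian from the earlier examples) showing that the quadratic approximation error decays faster than \(\| \bz - \ba \|^2\):

```python
import numpy as np

lse = lambda x: np.log(np.sum(np.exp(x)))

a = np.array([0.2, -0.5, 1.0])
z = np.exp(a)
s = z.sum()
g = z / s                                     # gradient of lse at a
H = np.diag(z) / s - np.outer(z, z) / s ** 2  # Hessian of lse at a

d = np.array([1.0, 0.5, -1.0])
for t in [1e-1, 1e-2, 1e-3]:
    h = t * d
    quad = lse(a) + g @ h + 0.5 * h @ H @ h
    print(t, abs(lse(a + h) - quad) / t ** 2)  # the ratio tends to 0
```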
5.1.10. Smoothness
5.1.10.1. Real Functions
(Class of continuous functions)
The class of continuous real functions, denoted by \(C\), is the set of functions of type \(f: \RR \to \RR\) which are continuous over their domain \(\dom f\).
(Differentiability class \(C^k\))
Let \(f: \RR \to \RR\) be a real function with \(S = \dom f\).
Then, we say that \(f\) belongs to the differentiability class \(C^k\) on \(S\) if and only if

\[
\frac{d^k f}{d x^k} \in C.
\]
In other words, the \(k\)-th derivative of \(f\) exists and is continuous.
\(C^0\) is the class of continuous real functions.
\(C^1\) is the class of continuously differentiable functions.
\(C^{\infty}\) is the class of smooth functions, which are infinitely differentiable.
5.1.10.2. Real Valued Functions on Euclidean Space
(Differentiability class \(C^k\))
A function \(f: \RR^n \to \RR\) with \(S = \dom f\), where \(S\) is an open subset of \(\RR^n\), is said to be of class \(C^k\) on \(S\), for a positive integer \(k\), if all the partial derivatives of \(f\)

\[
\frac{\partial^m f}{\partial x_1^{m_1} \partial x_2^{m_2} \dots \partial x_n^{m_n}}
\]

exist and are continuous for every \(m_1, m_2, \dots, m_n \geq 0\) and \(m = m_1 + m_2 + \dots + m_n \leq k\).
If \(f\) is continuous, it is said to belong to \(C\) or \(C^0\).
If \(f\) is continuously differentiable, it is said to belong to \(C^1\).
If \(f\) is twice continuously differentiable, it is said to belong to \(C^2\).
If \(f\) is infinitely differentiable, it is said to belong to \(C^{\infty}\).