8 The Hessian

In Chapter 6 we exploited the second derivative of a one-variable real function to analyze convexity along with local minima and maxima.
In this chapter we introduce an analogue of the second derivative for real functions of several variables. This will be an $n \times n$ matrix. The important notion of a matrix being positive (semi-)definite introduced in Section 3.7 will now make its appearance.

8.1 Introduction

In Section 6.4 the Taylor expansion for a one-variable differentiable function $f$ centered at $x$ with step size $h$ was introduced as
$$f(x + h) = f(x) + f'(x)\, h + \frac{1}{2} f''(x)\, h^2 + \epsilon(h)\, h^2, \tag{8.1}$$
where $\epsilon(h) \to 0$ for $h \to 0$.
Recall that the second derivative contains a wealth of information about the function. Especially if $f'(x) = 0$, then we might glean from $f''(x)$ whether $x$ is a local maximum, a local minimum or neither of these (see Theorem 6.47 and review Exercise 6.50).
We also noticed that gradient descent did not work so well when only descending along the gradient. We need to take the second derivative into account to get a more detailed picture of the function.

8.2 Several variables

Our main character is a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ in several variables. We already know that
$$f(x + h) \approx f(x) + \nabla f(x)\, h,$$
where $x$ and $h$ are vectors in $\mathbb{R}^n$ (as opposed to the good old numbers in (8.1)). Take a look back at Definition 7.5 for the general definition of differentiability.
We wish to have an analogue of the Taylor expansion in (8.1) for such a function of several variables. To this end we introduce the function $g: \mathbb{R} \to \mathbb{R}$ given by
$$g(t) = f(x + t h). \tag{8.2}$$
Notice that
$$g = f \circ \alpha,$$
where $\alpha: \mathbb{R} \to \mathbb{R}^n$ is the function given by $\alpha(t) = x + t h$. In particular we get
$$g'(t) = \nabla f(x + t h)\, h \tag{8.3}$$
by using the chain rule (see Theorem 7.24).
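Formula (8.3) is easy to check symbolically for a concrete choice of $f$, $x$ and $h$. Below is a minimal Sage sketch of such a check; the function and the vectors are arbitrary choices made for illustration, not taken from the text.

# Sage sketch: check that g'(t) = grad f(x + t*h) * h, as in (8.3).
x1, x2, t = var('x1 x2 t')
f = x1^2*x2 + sin(x1)                      # an arbitrary test function
x0 = [1, 2]                                # base point x
h  = [3, -1]                               # direction h

point = {x1: x0[0] + t*h[0], x2: x0[1] + t*h[1]}   # the substitution x + t*h

g = f.subs(point)                          # g(t) = f(x + t*h)
lhs = diff(g, t)                           # left hand side of (8.3)
rhs = sum(diff(f, v).subs(point)*hi for v, hi in zip([x1, x2], h))

print(bool((lhs - rhs).simplify_full() == 0))      # True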
Explain how the chain rule is applied to get (8.3) .
The derivative $g'$ is also composed of several functions and again we may compute $g''$ by using the chain rule:
$$g''(t) = \nabla \beta(x + t h)\, h,$$
where $\beta: \mathbb{R}^n \to \mathbb{R}$ is defined by
$$\beta(z) = \nabla f(z)\, h$$
and $\alpha: \mathbb{R} \to \mathbb{R}^n$ by $\alpha(t) = x + t h$ as before, so that $g' = \beta \circ \alpha$.
The Hessian matrix of $f$ at the point $x$ is defined by
$$Hf(x) = \begin{pmatrix}
\dfrac{\partial^2 f}{\partial x_1 \partial x_1}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(x)\\
\vdots & \ddots & \vdots\\
\dfrac{\partial^2 f}{\partial x_n \partial x_1}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_n \partial x_n}(x)
\end{pmatrix}.$$
A very important observation is that $Hf(x)$ above is a symmetric matrix if $f$ satisfies the condition in the last part of Theorem 7.13.
Suppose that $f$ is given by a concrete expression. Then the gradient $\nabla f(x)$ and the Hessian $Hf(x)$ of $f$ are computed in the Sage window below.
See the further documentation for Calculus functions in Sage.
Verify (just this once) by hand the computations done by Sage in Example 8.3 .
Also, experiment with a few other functions in the Sage window and compute their Hessians.
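If you want to redo such computations outside the book's Sage window, a small snippet along the following lines computes a gradient and a Hessian; the function below is only a stand-in, not necessarily the one from Example 8.3.

# Sage sketch: gradient and Hessian of a function of two variables.
x, y = var('x y')
f = x^3 - 3*x*y + y^2                      # a stand-in function
vs = [x, y]

gradient = vector([diff(f, v) for v in vs])
hessian  = matrix([[diff(f, u, v) for v in vs] for u in vs])

print(gradient)                            # (3*x^2 - 3*y, -3*x + 2*y)
print(hessian)                             # rows [6*x, -3] and [-3, 2]

# Sage also has built-in methods that should give the same result:
# f.gradient(vs) and f.hessian()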
By applying Proposition 7.11 it is not too hard to see that the Hessian matrix fits nicely into the framework above, since
$$\nabla \beta(z) = h^T Hf(z)$$
for the function $\beta(z) = \nabla f(z)\, h$ introduced above (using that $Hf(z)$ is symmetric). The full application of the chain rule then gives
$$g''(t) = \nabla \beta(x + t h)\, h = h^T Hf(x + t h)\, h. \tag{8.5}$$
Give a detailed explanation as to why (8.5) holds.

8.3 Newton's method for finding critical points

We may use Newton's method for computing critical points for a function of several variables. Recall that a critical point is a point $x$ with $\nabla f(x) = 0$. By (7.8) and (8.5) the computation in Newton's method becomes
$$x_{k+1} = x_k - Hf(x_k)^{-1}\, \nabla f(x_k)^T. \tag{8.7}$$
In practice the (inverse) Hessian appearing in (8.7) is often a heavy computational burden. This leads to the so-called quasi-Newton methods, where the inverse Hessian in (8.7) is replaced by other matrices.
We will return to the logistic regression in Example 7.34 about the Challenger disaster. Here we sought to maximize the function
In order to employ Newton's method we compute the gradient and the Hessian of (8.8)
where
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
is the sigmoid function.
Notice the potential problem in using Newton's method here: the formula for the second order derivatives in (8.9) shows that if the data values are just mildly big, then the Hessian is extremely close to the zero matrix; Sage then considers it non-invertible and (8.7) fails.
In the code below we have nudged the initial vector so that it works, but you can easily set other values and see the method fail. Optimization is not just mathematics; it also calls for some good (engineering) implementation skills (see for example the details of the quasi-Newton algorithms).
In the instance below we do, however, get a gradient that is practically zero.
Code for Newton's method
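The book's Sage window is not reproduced here, but a minimal sketch of the Newton iteration (8.7) for this kind of problem could look as follows. The data set is made up (replace it with the Challenger data from Example 7.34), and the log-likelihood is written as $\sum_i \big(y_i(a x_i + b) - \log(1 + e^{a x_i + b})\big)$, which is one common convention; the exact formula and variable names in the book may differ.

# Sage sketch: Newton's method for a critical point of a logistic log-likelihood.
xdata = [1.0, 2.0, 3.0, 4.0, 5.0]          # made-up data, not the Challenger data
ydata = [1, 1, 0, 1, 0]

a, b = var('a b')
l = sum(yi*(a*xi + b) - log(1 + exp(a*xi + b)) for xi, yi in zip(xdata, ydata))
grad = vector([diff(l, a), diff(l, b)])
hess = matrix([[diff(l, u, v) for v in (a, b)] for u in (a, b)])

def newton(v0, iterations):
    v = vector(RDF, v0)
    for _ in range(iterations):
        g = vector(RDF, [gi.subs({a: v[0], b: v[1]}) for gi in grad])
        H = matrix(RDF, [[hij.subs({a: v[0], b: v[1]}) for hij in row] for row in hess.rows()])
        v = v - H.solve_right(g)           # the Newton step from (8.7)
    return v

v = newton([0, 0], 10)
print(v)
print([RDF(gi.subs({a: v[0], b: v[1]})) for gi in grad])   # should be practically zero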

8.3.1 Transforming data for better numerical performance

The numerical problems with Newton's method in Example 8.6 can be prevented by transforming the input data. It makes sense to transform data from large numbers to smaller numbers around $0$. There is a rather standard way of doing this.
Suppose in logistic regression we have a set of data
$$x_1, \dots, x_N \in \mathbb{R} \tag{8.10}$$
associated with outcomes $y_1, \dots, y_N \in \{0, 1\}$. Then the function $\ell$ from Example 7.32 becomes much more manageable if we first transform the data according to
$$x_i \mapsto z_i = \frac{x_i - \mu}{\sigma}$$
and instead optimize the function $\ell$ for the transformed data $z_1, \dots, z_N$. Here
$$\mu = \frac{1}{N}(x_1 + \dots + x_N)$$
is the mean value and
$$\sigma^2 = \frac{1}{N}\big((x_1 - \mu)^2 + \dots + (x_N - \mu)^2\big)$$
the variance of the data in (8.10).
Now if $(\alpha, \beta)$ is an optimum for the transformed problem, then
$$\left(\frac{\alpha}{\sigma},\; \beta - \frac{\alpha \mu}{\sigma}\right) \tag{8.11}$$
is an optimum for the original problem, since
$$\alpha\, \frac{x_i - \mu}{\sigma} + \beta = \frac{\alpha}{\sigma}\, x_i + \Big(\beta - \frac{\alpha \mu}{\sigma}\Big).$$
Why is the claim/trick alluded to in (8.11) true?
Below is a snippet of Sage code implementing the trick in (8.11) . The function test takes as input x0 (an initial vector like [0,0]) and noofits (the number of iterations of Newton's method). You execute this in the Sage window by adding for example

test([0,0], 10)

and then pressing Compute.
Experiment and compare with the official output from Example 7.34 . Also, compute the gradient of the output below for the original non-transformed problem.
Transformed code
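The Sage code for the function test is not reproduced above, so here is a hedged sketch of how the transformation and the back-transformation in (8.11) could be wired together with the Newton iteration. The names test, x0 and noofits follow the description in the text; the data set and the log-likelihood convention are placeholders.

# Sage sketch: Newton's method on standardized data, then back-transform the optimum.
xdata = [10.0, 250.0, 470.0, 600.0, 890.0]  # made-up large inputs
ydata = [1, 1, 0, 1, 0]

mu = sum(xdata)/len(xdata)                  # mean value
sigma = sqrt(sum((xi - mu)^2 for xi in xdata)/len(xdata))   # standard deviation
zdata = [(xi - mu)/sigma for xi in xdata]   # transformed data around 0

a, b = var('a b')
l = sum(yi*(a*zi + b) - log(1 + exp(a*zi + b)) for zi, yi in zip(zdata, ydata))
grad = vector([diff(l, a), diff(l, b)])
hess = matrix([[diff(l, u, v) for v in (a, b)] for u in (a, b)])

def test(x0, noofits):
    v = vector(RDF, x0)
    for _ in range(noofits):
        g = vector(RDF, [gi.subs({a: v[0], b: v[1]}) for gi in grad])
        H = matrix(RDF, [[hij.subs({a: v[0], b: v[1]}) for hij in row] for row in hess.rows()])
        v = v - H.solve_right(g)
    # back-transform as in (8.11) to get an optimum for the original data
    return vector(RDF, [v[0]/sigma, v[1] - v[0]*mu/sigma])

print(test([0, 0], 10))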

8.4 The Hessian and critical points

Now we are in a position to state at least the first terms in the Taylor expansion for a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$. The angle of the proof is to reduce to the one-dimensional case through the function $g(t) = f(x + t h)$ defined in (8.2). Here one may prove that
$$g(t) = g(0) + g'(0)\, t + \frac{1}{2} g''(0)\, t^2 + \epsilon(t)\, t^2, \tag{8.12}$$
where $\epsilon(t) \to 0$ for $t \to 0$ with $\epsilon$ continuous at $0$, much like in the definition of differentiability except that we also include the second derivative.
Now (8.12) translates into
$$f(x + h) = f(x) + \nabla f(x)\, h + \frac{1}{2}\, h^T Hf(x)\, h + \epsilon(h)\, |h|^2, \tag{8.13}$$
where $\epsilon(h) \to 0$ for $h \to 0$, by using (8.3) and (8.6).
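As a sanity check of (8.13), the following Sage sketch compares $f(x + h)$ with the second order approximation $f(x) + \nabla f(x)\, h + \frac{1}{2} h^T Hf(x)\, h$ for a small step $h$; the function and the numbers are arbitrary choices.

# Sage sketch: the second order Taylor approximation from (8.13), checked numerically.
x, y = var('x y')
f = exp(x)*sin(y) + x^2*y                  # an arbitrary smooth function
vs = [x, y]

x0 = [0.5, 1.0]                            # base point
h  = [0.01, -0.02]                         # a small step
point = {x: x0[0], y: x0[1]}

lin  = sum(diff(f, v).subs(point)*hi for v, hi in zip(vs, h))
quad = sum(h[i]*diff(f, vs[i], vs[j]).subs(point)*h[j]
           for i in range(2) for j in range(2))

exact  = f.subs({x: x0[0] + h[0], y: x0[1] + h[1]})
approx = f.subs(point) + lin + quad/2

print(RDF(exact), RDF(approx))             # the two values agree to several decimal places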
A point $x_0$ is called a saddle point for $f$ if there exist two vectors $u, v \in \mathbb{R}^n$, such that $t \mapsto f(x_0 + t u)$ has a local maximum at $t = 0$ and $t \mapsto f(x_0 + t v)$ has a local minimum at $t = 0$, as illustrated in the graphics below.
Now go back and recall the definition of positive definite matrices in Section 3.7. We call a symmetric matrix $A$ negative definite if $-A$ is positive definite. One more concept (related to the definition of saddle point above):
A symmetric matrix $A$ is called indefinite if there exist $u, v \in \mathbb{R}^n$ with
$$u^T A u < 0 < v^T A v.$$
So an indefinite matrix is mixed up in the sense that it is neither positive definite nor negative definite.
We have the following addition to Proposition 3.41 (with a similar proof).
Let $A$ be a symmetric matrix and $M$ an invertible matrix. Then $A$ is indefinite (positive definite, negative definite) if and only if
$$M^T A M$$
is indefinite (positive definite, negative definite).
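The proposition is easy to probe numerically: pick a symmetric matrix and an invertible matrix and compare the signs of the eigenvalues before and after the transformation. The matrices below are arbitrary examples.

# Sage sketch: definiteness is preserved under A -> M^T * A * M for invertible M.
A = matrix(QQ, [[2, 1], [1, 3]])           # an arbitrary symmetric, positive definite matrix
M = matrix(QQ, [[1, 4], [0, 2]])           # an arbitrary invertible matrix
B = M.transpose()*A*M

print(A.eigenvalues())                     # both positive
print(B.eigenvalues())                     # different numbers, but still both positive
print(all(e > 0 for e in B.eigenvalues())) # True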
From (8.13) one can prove the following nice criterion, which may be viewed as a several variable generalization of Theorem 6.47 .
Let $x_0$ be a critical point for $f$. Then
  1. $x_0$ is a local minimum if $Hf(x_0)$ is positive definite.
  2. $x_0$ is a local maximum if $Hf(x_0)$ is negative definite.
  3. $x_0$ is a saddle point if $Hf(x_0)$ is indefinite.
Consider, with our new technology in Theorem 8.12 , Exercise 7.22 once again. Here we analyzed the point for the function
and showed (by a trick) that is neither a local maximum nor a local minimum for . The Hessian matrix for at is
Now Theorem 8.12 (ⅲ) shows that is a saddle point, since
and
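Classifications like the one in Example 8.13 can be carried out mechanically in Sage: find the critical points and look at the signs of the eigenvalues of the Hessian (for a symmetric matrix, all eigenvalues positive means positive definite, all negative means negative definite, and mixed signs means indefinite). The sketch below uses a hypothetical function, not the one from Example 8.13.

# Sage sketch: classify critical points using the eigenvalues of the Hessian.
x, y = var('x y')
f = x^3 - 3*x + y^2                        # a hypothetical example
vs = [x, y]

grad = [diff(f, v) for v in vs]
hess = matrix([[diff(f, u, v) for v in vs] for u in vs])

critical = solve([g == 0 for g in grad], vs, solution_dict=True)

for sol in critical:
    # exact arithmetic works here because the critical points are rational
    H = matrix(QQ, [[hij.subs(sol) for hij in row] for row in hess.rows()])
    eigs = H.eigenvalues()
    if all(e > 0 for e in eigs):
        kind = 'local minimum'
    elif all(e < 0 for e in eigs):
        kind = 'local maximum'
    elif any(e > 0 for e in eigs) and any(e < 0 for e in eigs):
        kind = 'saddle point'
    else:
        kind = 'inconclusive (semidefinite Hessian)'
    print([sol[v] for v in vs], kind)
# Expected: (-1, 0) is a saddle point and (1, 0) is a local minimum.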
Try plotting the graph for different values of a (a = 4 shows the saddle point clearly) in the Sage window in Example 8.13. What do you observe for the point with respect to the function? Does a have to be a number? Could it be a symbolic expression in the variables x and y like a = -10*cos(x)*sin(y)?
Check the computation of the Hessian matrix in Example 8.13 by showing that the Hessian matrix for at the point is
What about and in Example 8.13 ? How do they relate to the hint given in Exercise 7.22 ?
Give an example of a function $f$ having a local minimum at $x_0$, where $Hf(x_0)$ is not positive definite.
The following exercise is a sci2u exercise from the Calculus book.
  1. The point is a critical point for
    What does Theorem 8.12 say about this point?
  2. The point is a critical point for
    What does Theorem 8.12 say about this point?
  3. The point is a critical point for
    What does Theorem 8.12 say about this point?
Consider the function
Compute its critical points and decide on their types according to Theorem 8.12 . Try to convince yourself that
for every .
Look at the minimization problem
subject to
where is a big number.
Give an example of a function that has a local maximum, but where there exists with for any given (large) number .

8.5 Differential convex functions of several variables

Below is the generalization of Theorem 6.58 to several variables. You have already seen this in Exercise 6.59 , right?
Let $f: U \to \mathbb{R}$ be a differentiable function, where $U \subseteq \mathbb{R}^n$ is an open convex subset. Then $f$ is convex if and only if
$$f(y) \ge f(x) + \nabla f(x)(y - x) \tag{8.14}$$
for every $x, y \in U$.
Proof
Suppose that (8.14) holds and let $x, y \in U$ and $\lambda \in \mathbb{R}$, where $0 \le \lambda \le 1$. To prove that $f$ is convex we must verify the inequality
$$f\big((1 - \lambda) x + \lambda y\big) \le (1 - \lambda) f(x) + \lambda f(y). \tag{8.15}$$
Let $z = (1 - \lambda) x + \lambda y$. Then
$$f(x) \ge f(z) + \nabla f(z)(x - z) \qquad\text{and}\qquad f(y) \ge f(z) + \nabla f(z)(y - z)$$
by (8.14). If you multiply the first inequality by $1 - \lambda$, the second by $\lambda$ and then add the two, you get (8.15), since $(1 - \lambda)(x - z) + \lambda (y - z) = 0$.
Suppose on the other hand that $f$ is a convex function. Let $x, y \in U$. Since $U$ is an open subset, it follows that $x + t(y - x) \in U$ for $-\delta < t < 1 + \delta$, where $\delta > 0$ is sufficiently small. Now define the function $g: (-\delta, 1 + \delta) \to \mathbb{R}$ by
$$g(t) = f\big(x + t (y - x)\big).$$
Being the composition of two differentiable functions, $g$ is differentiable. Suppose that $0 \le \lambda \le 1$ and $t_1, t_2 \in (-\delta, 1 + \delta)$. Then
$$g\big((1 - \lambda) t_1 + \lambda t_2\big) = f\big((1 - \lambda)(x + t_1(y - x)) + \lambda (x + t_2(y - x))\big) \le (1 - \lambda) g(t_1) + \lambda g(t_2),$$
showing that $g$ is a convex function. By Theorem 6.58,
$$g(1) \ge g(0) + g'(0),$$
which translates into
$$f(y) \ge f(x) + \nabla f(x)(y - x)$$
by using the chain rule in computing $g'(0) = \nabla f(x)(y - x)$.
Prove that a bounded convex differentiable function is constant.
The following is the generalization of Corollary 6.52 .
Let $f: U \to \mathbb{R}$ be a differentiable function with continuous second order partial derivatives, where $U \subseteq \mathbb{R}^n$ is a convex open subset. Then $f$ is convex if and only if the Hessian $Hf(x)$ is positive semidefinite for every $x \in U$. If $Hf(x)$ is positive definite for every $x \in U$, then $f$ is strictly convex.
Proof
We have done all the work for a convenient reduction to the one variable case. Suppose that $f$ is convex. Then the same reasoning as in the proof of Theorem 8.21 shows that
$$g(t) = f(x + t h)$$
is a convex function for every $x \in U$ and every $h \in \mathbb{R}^n$, viewed as a function from a suitable open interval around $0$ to $\mathbb{R}$. Therefore
$$0 \le g''(0) = h^T Hf(x)\, h$$
by Theorem 6.51. This proves that the matrix $Hf(x)$ is positive semidefinite for every $x \in U$. Suppose on the other hand that $Hf(x)$ is positive semidefinite for every $x \in U$. Then Theorem 6.51 shows that $g(t) = f(x + t(y - x))$ is a convex function from $(-\delta, 1 + \delta)$ to $\mathbb{R}$ for small $\delta > 0$ and $x, y \in U$, since
$$g''(t) = (y - x)^T Hf\big(x + t(y - x)\big)(y - x) \ge 0$$
for $-\delta < t < 1 + \delta$. Therefore $f$ is a convex function, since
$$f\big((1 - \lambda) x + \lambda y\big) = g(\lambda) \le (1 - \lambda) g(0) + \lambda g(1) = (1 - \lambda) f(x) + \lambda f(y).$$
The same argument (using the last part of Theorem 6.51 on strict convexity) shows that $g$ is strictly convex if the Hessian is positive definite along the segment. It follows that $f$ is strictly convex if $Hf(x)$ is positive definite for every $x \in U$.
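For a concrete function, convexity can often be confirmed by inspecting the Hessian directly, as the theorem above suggests. A minimal Sage sketch with an arbitrary example function:

# Sage sketch: verifying convexity of f(x, y) = exp(x) + exp(y) + x^2 via its Hessian.
x, y = var('x y')
f = exp(x) + exp(y) + x^2
vs = [x, y]

hess = matrix([[diff(f, u, v) for v in vs] for u in vs])
print(hess)

# The Hessian is diagonal with diagonal entries e^x + 2 and e^y, which are strictly
# positive for every (x, y). Hence it is positive definite everywhere and f is
# strictly convex.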
Prove that
is a strictly convex function from to . Also, prove that
is a convex subset of .
Is strictly convex on some non-empty open convex subset of the plane?
Show that given by
is a convex function. Is strictly convex?
Let be given by
where .
  1. Show that is a strictly convex function if and only if and .
    This is a hint for the only if part. If is the Hessian for , then
    where - this is seen by a matrix multiplication computation. We know that is positive semidefinite. If was not positive definite, there would exist with . Now use to complete the proof that is positive definite by looking at .
  2. Suppose now that and . Show that has a unique global minimum and give a formula for this minimum in terms of and .

8.6 How to decide the definiteness of a matrix

In this section we will outline a straightforward method for deciding if a matrix is positive definite, positive semidefinite, negative definite or indefinite.
Before proceeding it is a must that you do the following exercise.
Show that a diagonal matrix
$$D = \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}$$
is positive definite if and only if $\lambda_i > 0$ for every $i$, positive semidefinite if and only if $\lambda_i \ge 0$ for every $i$, and indefinite if and only if there exist $i$ and $j$ with $\lambda_i > 0$ and $\lambda_j < 0$.
The crucial ingredient is the following result.
Let $A$ be a real symmetric matrix. Then there exists an invertible matrix $M$, such that $M^T A M$ is a diagonal matrix.
The proof contains an algorithm for building $M$ in different steps. We will supply examples afterwards illustrating these. An operational procedure implementing the steps is outlined in Section 8.7.
Proof
Suppose that $A = (a_{ij})$ is an $n \times n$ real symmetric matrix. If $A$ has a non-zero entry in the upper left hand corner, i.e., $a_{11} = c \neq 0$, then
$$M_1^T A M_1 = \begin{pmatrix} c & 0 \\ 0 & A_1 \end{pmatrix},$$
where $A_1$ is a real symmetric $(n-1) \times (n-1)$ matrix and $M_1$ is the invertible matrix
$$M_1 = \begin{pmatrix} 1 & -c^{-1} v^T \\ 0 & I_{n-1} \end{pmatrix}, \qquad \text{writing } A = \begin{pmatrix} c & v^T \\ v & B \end{pmatrix}.$$
By induction on $n$ we may find an invertible matrix $M_2$ such that
$$M_2^T A_1 M_2$$
is a diagonal matrix. Putting
$$M = M_1 \begin{pmatrix} 1 & 0 \\ 0 & M_2 \end{pmatrix},$$
it follows that
$$M^T A M = \begin{pmatrix} c & 0 \\ 0 & M_2^T A_1 M_2 \end{pmatrix}$$
is a diagonal matrix.
We now treat the case of a zero entry in the upper left hand corner, i.e., $a_{11} = 0$. Suppose first that $a_{ii} \neq 0$ for some $i > 1$. Let $P$ denote the identity matrix with the first and $i$-th rows interchanged. The operation $A P$ amounts to interchanging the first and $i$-th columns in $A$. Similarly $P^T A$ is interchanging the first and $i$-th rows in $A$. The matrix $P$ is invertible and $P^T A P$ is a symmetric matrix with $a_{ii} \neq 0$ in the upper left hand corner, and we have reduced to the case of a non-zero entry in the upper left hand corner.
If $a_{ii} = 0$ for every $i$, we may assume that $a_{1j} \neq 0$ for some $j$. Let $S$ denote the identity matrix where the entry in the first column and $j$-th row is $1$. The operation $A S$ amounts to adding the $j$-th column to the first column in $A$. Similarly $S^T A$ is adding the $j$-th row to the first row in $A$. All in all we get $(S^T A S)_{11} = 2 a_{1j} \neq 0$, where we have used that $a_{11} = a_{jj} = 0$. Again we have reduced to the case of a non-zero entry in the upper left hand corner.
Consider the real symmetric matrix.
Here . Therefore the fundamental step in the proof of Theorem 8.29 applies and
and again
Summing up we get
You are invited to check that
Let
Here , but the diagonal element . So we are in the second step of the proof of Theorem 8.29 . Using the matrix
we get
As argued in the proof, this corresponds to interchanging the first and third columns and then interchanging the first and third rows. In total you move the non-zero entry to the upper left corner of the matrix.
Consider the symmetric matrix
We have zero entries in the diagonal. As in the third step in the proof of Theorem 8.29 we must find an invertible matrix , such that the upper left corner in is non-zero. In the proof it is used that every diagonal element is zero: if we locate a non-zero element in the -th column in the first row, we can add the -th column to the first column and then the -th row to the first row obtaining a non-zero element in the upper left corner. For above we choose and the matrix becomes
so that
Let $A$ be any $m \times n$ matrix. Show that
$$A^T A$$
is positive semidefinite.
Find inequalities defining the set
Same question with positive semidefinite. Sketch and compare the two subsets of the plane .
Let denote the function given by
where . Let denote the Hessian of in a point .
  1. Compute .
  2. Show that for and .
  3. Compute a non-zero vector , such that in the case, where . Is invertible in this case?
  4. Show that is strictly convex if .
  5. Is strictly convex if ?
    Hint
    Consider the line segment between and a suitable vector , where .
Why is the subset given by the inequalities
a convex subset of ?

8.7 A schematic procedure for transforming symmetric matrices

Suppose that $A$ is a symmetric matrix. We wish to find an invertible matrix $M$ and a diagonal matrix $D$ so that
$$M^T A M = D.$$
Every step in the algorithm in the proof of Theorem 8.29 involves an operation on the columns of $A$ followed by a similar operation on the rows. These steps can be carried out systematically by transforming the extended matrix
$$\begin{pmatrix} I \\ A \end{pmatrix}. \tag{8.16}$$
The recipe is: every column operation (on $A$) is carried out on the full matrix, whereas every row operation is only carried out on the lower matrix in (8.16). When the lower matrix has become the diagonal matrix $D$, the upper matrix is the desired $M$.
Here is how this plays out for the examples above.
Here is the schematic procedure applied to Example 8.30 :
Here is the schematic procedure applied to Example 8.31 :
Here is the schematic procedure applied to Example 8.32 :
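The schematic procedure lends itself to a small program. Below is a hedged sketch (not the book's code) of a Sage function that carries out the steps from the proof of Theorem 8.29 on a symmetric matrix and returns an invertible M together with the diagonal matrix M^T * A * M; the example matrix at the end is arbitrary, not one of the examples above. The definiteness of the input can then be read off from the signs of the diagonal entries, as in the exercise on diagonal matrices in Section 8.6.

# Sage sketch: find an invertible M with M^T * A * M diagonal, for symmetric A.
def elementary(n, entries):
    # identity matrix over QQ with some entries changed
    E = copy(identity_matrix(QQ, n))
    for (i, j), value in entries.items():
        E[i, j] = value
    return E

def diagonalize_symmetric(A):
    n = A.nrows()
    D = matrix(QQ, A)                      # working copy of A
    M = identity_matrix(QQ, n)
    for k in range(n):
        if D[k, k] == 0:
            # second step: try to move a non-zero diagonal entry to position (k, k)
            moved = False
            for i in range(k + 1, n):
                if D[i, i] != 0:
                    E = elementary(n, {(k, k): 0, (i, i): 0, (k, i): 1, (i, k): 1})
                    D = E.transpose()*D*E
                    M = M*E
                    moved = True
                    break
            if not moved:
                # third step: all remaining diagonal entries are zero; add a column (and row)
                for i in range(k + 1, n):
                    if D[k, i] != 0:
                        E = elementary(n, {(i, k): 1})
                        D = E.transpose()*D*E
                        M = M*E
                        break
        if D[k, k] == 0:
            continue                       # the remaining part of row/column k is zero
        # fundamental step: clear the off-diagonal entries in row/column k
        for i in range(k + 1, n):
            if D[k, i] != 0:
                E = elementary(n, {(k, i): -D[k, i]/D[k, k]})
                D = E.transpose()*D*E
                M = M*E
    return M, D

A = matrix(QQ, [[0, 1, 2],
                [1, 0, 3],
                [2, 3, 0]])                # an arbitrary symmetric example
M, D = diagonalize_symmetric(A)
print(D)
print(M.transpose()*A*M == D)              # True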