8 The Hessian

In Chapter 6 we exploited the second derivative $f''(x)$ of a one variable real function $f:(a, b) \rightarrow \mathbb{R}$ to analyze convexity along with local minima and maxima.
In this chapter we introduce an analogue of the second derivative for real functions $f:\mathbb{R}^n\rightarrow \mathbb{R}$ of several variables. This will be an $n\times n$ matrix. The important notion of a matrix being positive (semi-)definite introduced in Section 3.7 will now make its appearance.

8.1 Introduction

In Section 6.4 the Taylor expansion of a one variable differentiable function $f:\mathbb{R}\rightarrow \mathbb{R}$ centered at $x_0$ with step size $h = x - x_0$ was introduced as
f(x_0 + h) = f(x_0) + f'(x_0) h + \frac{1}{2!} f''(x_0) h^2 + \cdots \tag{8.1}
Recall that the second derivative $f''(x_0)$ contains a wealth of information about the function. In particular, if $f'(x_0) = 0$, then we may often glean from $f''(x_0)$ whether $x_0$ is a local maximum, a local minimum or neither (see Theorem 6.43 and review Exercise 6.46).
We also noticed that gradient descent, which only descends along the gradient, did not always work so well. We need to take the second derivative into account to get a more detailed picture of the function.

8.2 Several variables

Our main character is a differentiable function $F:\mathbb{R}^n\rightarrow \mathbb{R}$ in several variables. We already know that
F(x_0 + h) = F(x_0) + \nabla F(x_0) h + \epsilon(h) \left\vert h \right\vert,
where $x_0$ and $h$ are vectors in $\mathbb{R}^n$ (as opposed to the good old numbers in (8.1)). Take a look back at Definition 7.5 for the general definition of differentiability.
We wish to have an analogue of the Taylor expansion in (8.1) for such a function of several variables. To this end we introduce the function $g:\mathbb{R}\rightarrow \mathbb{R}$ given by
g(t) = F(x_0 + t h). \tag{8.2}
Notice that
g(t) = (F\circ A)(t),
where $A: \mathbb{R}\rightarrow \mathbb{R}^n$ is the function given by $A(t) = x_0 + t h$. In particular we get
g'(t) = F'(x_0 + t h) h = \nabla F(x_0 + t h) h \tag{8.3}
by using the chain rule (see Theorem 7.24).
Explain how the chain rule is applied to get (8.3).
The derivative $g'(t)$ is also composed of several functions and again we may compute $g''(t)$ by using the chain rule:
\begin{aligned} g''(t) &= (C\circ B \circ A)'(t)\\ &= (C\circ B)'(A(t)) A'(t) \\ &= C'(B(A(t))) B'(A(t)) A'(t), \end{aligned}\tag{8.4}
where $B: \mathbb{R}^n\rightarrow \mathbb{R}^n$ is defined by
B(v) = \nabla F(v)^T
and $C:\mathbb{R}^n \rightarrow \mathbb{R}$ by
C(v) = v^T h.
The Hessian matrix of $F$ at the point $x\in \mathbb{R}^n$ is defined by
\nabla^2 F(x) := \begin{pmatrix} \dfrac{ \partial^2 F}{ \partial x_1 \partial x_1}(x) & \cdots & \dfrac{ \partial^2 F}{ \partial x_1 \partial x_n}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{ \partial^2 F}{ \partial x_n \partial x_1}(x) & \cdots & \dfrac{\partial^2 F}{ \partial x_n\partial x_n}(x) \end{pmatrix}.
A very important observation is that $\nabla^2 F(x)$ above is a symmetric matrix (provided the second order partial derivatives of $F$ are continuous). Again, you should review what this means by clicking back to Section 3.7.
Why is the Hessian matrix symmetric?
Suppose that $f: \mathbb{R}^2\rightarrow \mathbb{R}$ is given by
f(x, y) = \sin(x y) + x^2 y^2 + y.
Then the gradient
\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)
and the Hessian
\nabla^2 f = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \\ \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{pmatrix}
of $f$ are computed in the Sage window below.
See the further documentation for Calculus functions in Sage.
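If you want to experiment outside the Sage window, a minimal SymPy sketch of the same computation might look as follows. This is only a stand-in for the book's Sage code; it assumes a standard Python installation with SymPy available.

# Gradient and Hessian of f(x, y) = sin(x*y) + x^2*y^2 + y, computed symbolically.
from sympy import symbols, sin, diff, hessian, Matrix

x, y = symbols('x y')
f = sin(x*y) + x**2*y**2 + y

grad = Matrix([diff(f, x), diff(f, y)])   # the gradient (f_x, f_y)
H = hessian(f, (x, y))                    # the 2x2 Hessian matrix

print(grad)
print(H)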
Verify (just this once) by hand the computations done by Sage in Example 8.4.
Also, experiment with a few other functions in the Sage window and compute their Hessians.
By applying Proposition 7.11 it is not too hard to see that the Hessian matrix fits nicely into the framework above, since
B'(v) = \nabla^2 F(v). \tag{8.5}
The full application of the chain rule then gives
g''(t) = h^T \nabla^2 F(x_0 + t h) h. \tag{8.6}
Give a detailed explanation as to why (8.5) holds.
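Independently of the exercise, the identity (8.6) can be sanity checked symbolically for a concrete choice of $F$, $x_0$ and $h$. The SymPy sketch below uses the function from Example 8.4; the base point and the step direction are arbitrary choices made for illustration.

# Symbolic check of (8.6) for F(x, y) = sin(x*y) + x^2*y^2 + y.
from sympy import symbols, sin, diff, hessian, Matrix, simplify

x, y, t = symbols('x y t')
F = sin(x*y) + x**2*y**2 + y

x0 = Matrix([1, 2])          # an arbitrary base point
h = Matrix([3, -1])          # an arbitrary step direction

# g(t) = F(x0 + t*h), written out by substituting the coordinates
g = F.subs({x: x0[0] + t*h[0], y: x0[1] + t*h[1]})
lhs = diff(g, t, 2)

# h^T * Hessian(F)(x0 + t*h) * h
H = hessian(F, (x, y)).subs({x: x0[0] + t*h[0], y: x0[1] + t*h[1]})
rhs = (h.T * H * h)[0, 0]

print(simplify(lhs - rhs))   # should print 0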

8.3 Newton's method for finding critical points

We may use Newton's method for computing critical points of a function $F:\mathbb{R}^n\rightarrow \mathbb{R}$ of several variables. Recall that a critical point is a point $x_0\in \mathbb{R}^n$ with $\nabla F(x_0) = 0$. By (7.8) and (8.5) the computation in Newton's method becomes
x_1 = x_0 - \left(\nabla^2 F(x_0)\right)^{-1} \nabla F(x_0). \tag{8.7}
In practice the (inverse) Hessian appearing in (8.7) is often a heavy computational burden. This leads to the so-called quasi-Newton methods, where the inverse Hessian in (8.7) is replaced by other matrices.
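A bare-bones NumPy sketch of the iteration (8.7) is shown below. The quadratic test function and the names grad_F and hess_F are illustrative choices, not taken from the text; for a quadratic function a single Newton step already lands exactly on the critical point.

# A minimal NumPy sketch of the Newton step (8.7),
# applied to the illustrative quadratic F(x, y) = x^2 + x*y + 2*y^2 - 3*x.
import numpy as np

def grad_F(v):
    x, y = v
    return np.array([2*x + y - 3, x + 4*y])

def hess_F(v):
    # the Hessian of a quadratic function is constant
    return np.array([[2.0, 1.0], [1.0, 4.0]])

x = np.array([0.0, 0.0])                              # initial guess
for _ in range(5):
    x = x - np.linalg.solve(hess_F(x), grad_F(x))     # step (8.7)

print(x, grad_F(x))   # gradient is (numerically) zero at the critical point

Note that the linear system is solved directly instead of forming the inverse Hessian explicitly; this is cheaper and numerically more stable.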
We will return to the logistic regression in Example 7.34 about the Challenger disaster. Here we sought to maximize the function
\ell(\alpha, \beta) = \sum_{i=1}^m E_i (\alpha + \beta x_i) - \log(1 + e^{\alpha + \beta x_i}). \tag{8.8}
In order to employ Newton's method we compute the gradient and the Hessian of (8.8):
\begin{aligned} \frac{\partial \ell}{\partial \alpha} &= \sum_{i=1}^m E_i - \sigma(\alpha + \beta x_i)\\ \frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^m E_i x_i - x_i\sigma(\alpha + \beta x_i)\\ \frac{\partial^2 \ell}{\partial \alpha^2} &= \sum_{i=1}^m - \sigma'(\alpha + \beta x_i)\\ \frac{\partial^2 \ell}{\partial \beta \partial \alpha} &= \sum_{i=1}^m -\sigma'(\alpha + \beta x_i) x_i\\ \frac{\partial^2 \ell}{\partial \beta^2} &= \sum_{i=1}^m - \sigma'(\alpha + \beta x_i) x_i^2, \end{aligned}\tag{8.9}
where
\sigma(t) = \frac{1}{1 + e^{-t}}
is the sigmoid function.
Notice the potential problem in using Newton's method here: the formulas for the second order derivatives in (8.9) show that if the $\alpha +\beta x_i$ are just mildly big, say $\geq 50$, then the Hessian is extremely close to the zero matrix; Sage then considers it non-invertible and (8.7) fails.
In the code below we have nudged the initial vector so that it works, but you can easily set other values and see it fail. Optimization is not just mathematics; it also calls for some good (engineering) implementation skills (see for example the details on the quasi-Newton algorithms).
In the instance below we do, however, get a gradient that is practically $(0, 0)$.
Code for Newton's method
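The Sage code behind the window is not reproduced here. A NumPy sketch of the same Newton iteration, built directly from (8.7) and (8.9), might look as follows. The arrays xs and Es, and the helper names sigma, gradient and hess, are my own; the data values are small made-up placeholders and are not the Challenger data from Example 7.34.

# Newton's method (8.7) for maximizing the log-likelihood (8.8),
# using the derivatives in (8.9).  xs and Es are placeholder data.
import numpy as np

xs = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])   # hypothetical inputs
Es = np.array([0, 0, 1, 0, 1, 1])                  # hypothetical 0/1 outcomes

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient(a, b):
    s = sigma(a + b * xs)
    return np.array([np.sum(Es - s), np.sum(Es * xs - xs * s)])

def hess(a, b):
    s = sigma(a + b * xs)
    ds = s * (1 - s)                 # sigma'(t) = sigma(t)(1 - sigma(t))
    return -np.array([[np.sum(ds),      np.sum(ds * xs)],
                      [np.sum(ds * xs), np.sum(ds * xs**2)]])

v = np.array([0.0, 0.0])             # initial (alpha, beta)
for _ in range(20):
    v = v - np.linalg.solve(hess(*v), gradient(*v))

print(v, gradient(*v))               # gradient should be practically (0, 0)

With the large temperature values of the actual Challenger data this iteration runs into exactly the numerical problems described above, which is the point of the next subsection.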

8.3.1 Transforming data for better numerical performance

The numerical problems with Newton's method in Example 8.7 can be prevented by transforming the input data. It makes sense to transform the data from large numbers to smaller numbers centered around $0$. There is a rather standard way of doing this.
Suppose in logistic regression we have a set of data
x_1, x_2, \dots, x_n \tag{8.10}
associated with outcomes $E_1, \dots, E_n$. Then the function
\ell(\alpha, \beta) = \sum_{i=1}^m E_i (\alpha + \beta x_i) - \log(1 + e^{\alpha + \beta x_i})
from Example 7.32 becomes much more manageable if we first transform the data according to
x'_i = \frac{x_i - \overline{x}}{\sigma}
and instead optimize the function
\ell'(\alpha, \beta) = \sum_{i=1}^m E_i (\alpha + \beta x'_i) - \log(1 + e^{\alpha + \beta x'_i}).
Here
\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}
is the mean value and
\sigma^2 = \frac{(x_1 - \overline{x})^2 + (x_2 - \overline{x})^2 + \cdots + (x_n - \overline{x})^2}{n}
the variance of the data in (8.10) (note that $\sigma$ here denotes the standard deviation of the data and not the sigmoid function from Example 8.7).
Now if $(\alpha', \beta')$ is an optimum for $\ell'$, then
\begin{aligned} \alpha &= \alpha' - \frac{\overline{x}}{\sigma} \beta'\\ \beta &= \frac{\beta'}{\sigma} \end{aligned}\tag{8.11}
is an optimum for $\ell$, since
\ell'(\alpha, \beta) = \ell\left(\alpha - \beta \frac{\overline{x}}{\sigma}, \frac{\beta}{\sigma}\right).
Why is the claim/trick alluded to in (8.11) true?
Below is a snippet of Sage code implementing the trick in (8.11). The function test takes as input x0 (an initial vector like [0,0]) and noofits (the number of iterations of Newton's method). You execute this in the Sage window by adding for example

test([0,0], 10)

and then pressing Compute.
Experiment and compare with the official output from Example 7.34. Also, compute the gradient of the output below for the original non-transformed problem.
Transformed code
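The Sage implementation is not reproduced above. In plain Python the transformation and the back-substitution (8.11) amount to only a few lines; the helper names standardize and back_transform below are my own, and any optimizer returning an optimum $(\alpha', \beta')$ for $\ell'$ can be plugged in between the two calls.

# Standardizing the data and mapping an optimum for l' back to one for l, cf. (8.11).
import numpy as np

def standardize(xs):
    """Return the transformed data x'_i = (x_i - mean)/std together with mean and std."""
    mean = np.mean(xs)
    std = np.std(xs)          # population standard deviation, as in the text
    return (xs - mean) / std, mean, std

def back_transform(alpha_p, beta_p, mean, std):
    """Map an optimum (alpha', beta') for l' to the optimum (alpha, beta) for l, as in (8.11)."""
    return alpha_p - (mean / std) * beta_p, beta_p / std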

8.4 The Taylor series in several variables

Now we are in a position to state at least the first terms in the Taylor expansion for a differentiable function $F:\mathbb{R}^n\rightarrow \mathbb{R}$. The idea of the proof is to reduce to the one dimensional case through the function $g(t)$ defined in (8.2). Here one may prove that
g(t) = g(0) + g'(0) t + \frac{1}{2} g''(0) t^2 + \epsilon(t)t^2, \tag{8.12}
where $\epsilon(0) = 0$ with $\epsilon$ continuous at $0$, much like in the definition of differentiability except that we also include the second derivative.
Now (8.12) translates into
F(x_0 + t h) = F(x_0) + \left(\nabla F(x_0) h \right) t + \frac{1}{2} \left( h^T \nabla^2 F(x_0) h \right) t^2 + \epsilon(t) t^2 \tag{8.13}
by using (8.3) and (8.6).
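Numerically, (8.13) says that the error of the quadratic model shrinks faster than $t^2$. The short NumPy sketch below illustrates this for a function, base point and direction chosen purely for illustration (they are not taken from the text).

# Checking the quadratic Taylor model (8.13) numerically for
# F(x, y) = x^2*y + y^3 at x0 = (1, 1) with h = (0.3, -0.2).
import numpy as np

def F(v):
    x, y = v
    return x**2 * y + y**3

x0 = np.array([1.0, 1.0])
h = np.array([0.3, -0.2])
grad = np.array([2*x0[0]*x0[1], x0[0]**2 + 3*x0[1]**2])     # gradient of F at x0
H = np.array([[2*x0[1], 2*x0[0]], [2*x0[0], 6*x0[1]]])      # Hessian of F at x0

for t in [1.0, 0.1, 0.01, 0.001]:
    model = F(x0) + grad @ h * t + 0.5 * (h @ H @ h) * t**2
    err = F(x0 + t*h) - model
    print(t, err / t**2)      # epsilon(t) = err/t^2 tends to 0 as t -> 0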
From (8.13) one reads the following nice criterion, which may be viewed as a several variable generalization of Theorem 6.43.
Let $x_0$ be a critical point for $F:\mathbb{R}^n\rightarrow \mathbb{R}$. Then
  1. $x_0$ is a local minimum if $\nabla^2 F(x_0)$ is positive definite.
  2. $x_0$ is a local maximum if $-\nabla^2 F(x_0)$ is positive definite (here we call $\nabla^2 F(x_0)$ negative definite).
  3. $x_0$ is a saddle point if $\nabla^2 F(x_0)$ is indefinite.
We need to clarify two things concerning the above theorem.
  1. An $n\times n$ indefinite matrix is a symmetric matrix $A$ with the property that there exist $u, v\in \mathbb{R}^n$ with
    \begin{aligned} u^T A u &>0\quad\text{and}\\ v^T A v &<0, \end{aligned}
    i.e., neither $A$ nor $-A$ is positive semidefinite.
  2. A saddle point $x_0$ for $F$ is defined by the existence of two vectors $u, v\in \mathbb{R}^n$, such that
    \begin{aligned} &t=0\quad \text{is a local minimum for the function}\quad f(t) = F(x_0 + t u)\\ &t=0\quad \text{is a local maximum for the function}\quad g(t) = F(x_0 + t v) \end{aligned}
    as illustrated in the graphics below.
Consider, with our new technology in Theorem 8.9, Exercise 7.22 once again. Here we analyzed the point $v_0 = (0, 0)$ for the function
f(x, y) = x^3 + x y + y^3
and showed (by a trick) that $v_0$ is neither a local maximum nor a local minimum for $f$. The Hessian matrix for $f(x, y)$ at $v_0$ is
H = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.
Now Theorem 8.9 (3) shows that $v_0$ is a saddle point, since
\begin{pmatrix} x & y \end{pmatrix} H \begin{pmatrix} x \\ y \end{pmatrix} = 2 x y
and
\begin{aligned} u^T H u &> 0\qquad\text{for } u = \begin{pmatrix} 1 \\ 1 \end{pmatrix}\\ v^T H v &< 0\qquad\text{for } v = \begin{pmatrix} 1 \\ -1 \end{pmatrix}. \end{aligned}
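The computation in the example is easy to double check in software. Below is a small SymPy sketch (again only a stand-in for the book's Sage window) computing the Hessian of $f$ and evaluating the two quadratic forms at $v_0 = (0, 0)$.

# Checking Example 8.11: the Hessian of f(x, y) = x^3 + x*y + y^3 at (0, 0)
# and the quadratic form evaluated at u = (1, 1) and v = (1, -1).
from sympy import symbols, hessian, Matrix

x, y = symbols('x y')
f = x**3 + x*y + y**3

H = hessian(f, (x, y))            # [[6*x, 1], [1, 6*y]]
H0 = H.subs({x: 0, y: 0})         # [[0, 1], [1, 0]]

u = Matrix([1, 1])
v = Matrix([1, -1])
print((u.T * H0 * u)[0, 0])       # 2  > 0
print((v.T * H0 * v)[0, 0])       # -2 < 0, so (0, 0) is a saddle point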
Try plotting the graph for different values of a in the Sage window in Example 8.11 (the value a=4 shows the saddle point clearly).
What do you observe for the point $v_0$ with respect to the function? Does a have to be a number? Could it be a symbolic expression in the variables x and y like a = -10*cos(x)*sin(y)?
Check the computation of the Hessian matrix $H$ in Example 8.11
by showing that the Hessian matrix for $f$ at the point $(x, y)$ is
\begin{pmatrix} 6 x & 1\\ 1 & 6 y \end{pmatrix}.
What about $u$ and $v$ in Example 8.11?
How do they relate to the hint given in Exercise 7.22?
Give an example of a function $F:\mathbb{R}^2\rightarrow \mathbb{R}$ having a local minimum at $x_0$, where $\nabla^2 F(x_0)$ is not positive definite.
The following exercise is a sci2u exercise from the Calculus book.
  1. The point $\left(0, \frac{\sqrt{3}}{3}\right)$ is a critical point for
    f(x, y) = x^3 + y^3 - y.
    What does Theorem 8.9 say about this point?
  2. The point $\left(\frac{1}{3}, \frac{1}{3}\right)$ is a critical point for
    f(x, y) = -x^3 - x^2 + x - y^3 + 2 y^2 - y.
    What does Theorem 8.9 say about this point?
  3. The point $(0, 1)$ is a critical point for
    f(x, y) = x^3 - x^2 + y^3 - y^2 - y.
    What does Theorem 8.9 say about this point?
Consider the function
f(x, y) = x^4 y^2 + x^2 y^4 - 3 x^2 y^2.
Compute its critical points and decide on their types according to Theorem 8.9. Try to convince yourself that
f(x, y) \geq -1
for every $x, y\in \mathbb{R}$.
Look at the minimization problem
\min f(x, y)
subject to
(x, y)\in C = \{(x, y) \mid -M \leq x \leq M, \,\, -M \leq y \leq M\},
where $M$ is a big number.
Give an example of a function $f:\mathbb{R}\rightarrow \mathbb{R}$ that has a local maximum, but where there exists $x\in \mathbb{R}$ with $f(x) > M$ for any given (large) number $M$.

8.5 Convex functions of several variables

Below is the generalization of Theorem 6.53 to several variables. You have already seen this in Exercise 6.54, right?
Let $f: U\rightarrow \mathbb{R}$ be a differentiable function, where $U\subseteq \mathbb{R}^n$ is an open convex subset. Then $f$ is convex if and only if
f(x) \geq f(x_0) + \nabla f(x_0) (x-x_0) \tag{8.14}
for every $x, x_0\in U$.
Proof
Suppose that (8.14) holds and let $x_t = (1-t)x_0 + t x$ with $0 \leq t \leq 1$, where $x_0, x\in U$. To prove that $f$ is convex we must verify the inequality
f(x_t) \leq (1-t) f(x_0) + t f(x). \tag{8.15}
Let $\xi = \nabla f(x_t)$. Then
\begin{aligned} f(x) &\geq f(x_t) + \xi (1-t) (x-x_0)\\ f(x_0) &\geq f(x_t) - \xi t (x-x_0) \end{aligned}
by (8.14) applied with $x_t$ as the base point, since $x - x_t = (1-t)(x - x_0)$ and $x_0 - x_t = -t(x - x_0)$. If you multiply the first inequality by $t$, the second by $1-t$ and then add the two, you get (8.15).
Suppose on the other hand that $f$ is a convex function. Let $x_0, x\in U$. Since $U$ is open and convex, it follows that $(1-t)x_0 + t x\in U$ for $t\in I=(-\delta, 1 + \delta)$, where $\delta>0$ is sufficiently small. Now define the function $g:I\rightarrow \mathbb{R}$ by
g(t) = f((1-t) x_0 + t x) = f(x_0 + t (x-x_0)).
Being the composition of two differentiable functions, $g$ is differentiable. Suppose that $0\leq \alpha \leq 1$ and $t_1, t_2\in I$. Then
\begin{aligned} g((1- \alpha) t_1 + \alpha t_2) &= f(x_0 + ((1-\alpha) t_1 + \alpha t_2)(x-x_0)) \\ &= f((1-\alpha)(x_0 + t_1(x-x_0)) + \alpha (x_0+t_2(x-x_0))) \\ &\leq (1-\alpha) f(x_0 + t_1 (x-x_0)) + \alpha f(x_0 + t_2(x-x_0)) \\ &= (1-\alpha) g(t_1) + \alpha g(t_2), \end{aligned}
showing that $g$ is a convex function. By Theorem 6.53,
g(1) \geq g(0) + g'(0),
which translates into
f(x) \geq f(x_0) + \nabla f(x_0) (x-x_0)
by using the chain rule in computing $g'(0)$.
Prove that a bounded convex differentiable function $f:\mathbb{R}^n\rightarrow \mathbb{R}$ is constant.
The following is the generalization of Corollary 6.48.
Let $f: U\rightarrow \mathbb{R}$ be a differentiable function with continuous second order partial derivatives, where $U\subseteq \mathbb{R}^n$ is a convex open subset. Then $f$ is convex if and only if the Hessian $\nabla^2 f(x)$ is positive semidefinite for every $x\in U$. If $\nabla^2 f(x)$ is positive definite for every $x\in U$, then $f$ is strictly convex.
Proof
We have done all the work for a convenient reduction to the one variable case. Suppose that $f$ is convex. Then the same reasoning as in the proof of Theorem 8.19 shows that
g(t) = f(x + t v)
is a convex function from an open interval $(-\delta, \delta)$ to $\mathbb{R}$ (for suitable $\delta>0$), for every $x\in U$ and every $v\in \mathbb{R}^n$. Therefore $g''(0) = v^T \nabla^2 f(x) v \geq 0$ by Theorem 6.47. This proves that the matrix $\nabla^2 f(x)$ is positive semidefinite for every $x\in U$. Suppose on the other hand that $\nabla^2 f(x)$ is positive semidefinite for every $x\in U$. Then Theorem 6.47 shows that $g(t) = f(x + t(y-x))$ is a convex function from $(-\delta, 1+\delta)$ to $\mathbb{R}$ for $\delta>0$ small and $x, y\in U$, since
g''(\alpha) = (y-x)^T \nabla^2 f(x + \alpha(y-x)) (y-x) \geq 0
for $0\leq \alpha \leq 1$. Therefore $f$ is a convex function, since
f((1-t) x + t y) = g((1-t)\cdot 0 + t\cdot 1) \leq (1-t) g(0) + t g(1) = (1-t) f(x) + t f(y).
The same argument (using the last part of Theorem 6.47 on strict convexity) shows that $g$ is strictly convex if $\nabla^2 f(x)$ is positive definite. It follows that $f$ is strictly convex if $\nabla^2 f(x)$ is positive definite for every $x\in U$.
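As a quick illustration of the criterion, the SymPy sketch below verifies that $f(x, y) = e^{x+y}$ (an illustrative choice, not one of the exercise functions below) has a positive semidefinite Hessian at every point, so it is convex; the Hessian is only semidefinite, and indeed $f$ is not strictly convex since it is constant along every line $x + y = c$.

# The quadratic form v^T H v for f(x, y) = exp(x + y) simplifies to
# exp(x + y) * (v1 + v2)^2 >= 0, so the Hessian is positive semidefinite everywhere.
from sympy import symbols, exp, hessian, Matrix, factor

x, y, v1, v2 = symbols('x y v1 v2', real=True)
f = exp(x + y)

H = hessian(f, (x, y))
v = Matrix([v1, v2])
q = factor((v.T * H * v)[0, 0])
print(H)    # all entries equal exp(x + y)
print(q)    # (v1 + v2)**2 * exp(x + y), which is >= 0 for all real values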
Prove that
f(x, y) = x^2 + y^2
is a strictly convex function from $\mathbb{R}^2$ to $\mathbb{R}$. Also, prove that
\{(x, y)\in \mathbb{R}^2 \mid x^2 + y^2 \leq 1\}
is a convex subset of $\mathbb{R}^2$.
Is $f(x, y) = \cos(x) + \sin(y)$ strictly convex on some non-empty open convex subset of the plane?
Show that $f:\mathbb{R}^2 \rightarrow \mathbb{R}$ given by
f(x, y) = \log(e^x + e^y)
is a convex function. Is $f$ strictly convex?
Let $f:\mathbb{R}^2\rightarrow \mathbb{R}$ be given by
f(x, y) = a x^2 + b y^2 + c x y,
where $a, b, c\in \mathbb{R}$.
  1. Show that $f$ is a strictly convex function if and only if $a > 0$ and $4 a b - c^2 > 0$.
    This is a hint for the only if part. If $H$ is the Hessian for $f$, then
    f(v) = \frac{1}{2} v^T H v,
    where $v = (x, y)^T$; this is seen by a matrix multiplication computation. We know that $H$ is positive semidefinite. If $H$ were not positive definite, there would exist $v\neq 0$ with $f(v) = 0$. Now use $f(t v) = t^2 f(v)$ to complete the proof that $H$ is positive definite by looking at $f((1-t)\cdot 0 + t\cdot v)$.
  2. Suppose now that $a > 0$ and $4 a b - c^2>0$. Show that $g(x, y) = f(x, y) + x + y$ has a unique global minimum and give a formula for this minimum in terms of $a, b$ and $c$.

8.6 How to decide the definiteness of a matrix

In this section we will outline a straightforward method for deciding if a matrix is positive definite, positive semidefinite, negative definite or indefinite.
Before proceeding it is a must that you do the following exercise.
Show that a diagonal matrix
\begin{pmatrix} \lambda_1 & 0 & \dots & 0\\ 0 &\lambda_2 & \dots &0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 &\dots & \lambda_n \end{pmatrix}
is positive definite if and only if $\lambda_1 > 0, \dots, \lambda_n > 0$ and positive semidefinite if and only if $\lambda_1 \geq 0, \dots, \lambda_n \geq 0$.
Also, show that if $A$ is a symmetric matrix, then $A$ is positive definite if and only if
B^T A B
is positive definite, where $B$ is an invertible matrix.
The crucial ingredient is the following result.
Let $A$ be a real symmetric $n\times n$ matrix. Then there exists an invertible matrix $B$, such that $B^T A B$ is a diagonal matrix.
The proof contains an algorithm for building $B$ step by step. We will supply examples afterwards illustrating these steps. Incidentally, how does this help you in deciding for example that a given matrix is positive definite? If you cannot answer this question, please do the exercise above.
Proof
Suppose that $A=(a_{ij})$. If $A$ has a non-zero entry in the upper left hand corner, i.e., $a_{11}\neq 0$, then
B_1^T A B_1 = \begin{pmatrix} a_{11} & 0 & \cdots & 0\\ 0 & c_{11} & \cdots & c_{1, n-1}\\ \vdots & \vdots & \ddots & \vdots \\ 0 & c_{n-1, 1} & \cdots & c_{n-1, n-1} \end{pmatrix},
where $C = (c_{ij})$ is a real symmetric matrix and $B_1$ is the invertible $n\times n$ matrix
\begin{pmatrix} 1 & -\frac{ a_{12}}{a_{11}} & \cdots & -\frac{ a_{1n}}{a_{11}}\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1 \end{pmatrix}.
By induction on $n$ we may find an invertible $(n-1)\times (n-1)$ matrix $B_2$ such that
B_2^T C B_2 = \begin{pmatrix} a_1 & 0 &\cdots & 0\\ 0 & a_2 & \cdots & 0\\ \vdots & \vdots & \ddots &\vdots\\ 0 & 0 & \cdots & a_{n-1} \end{pmatrix}.
Putting
B = B_1 \begin{pmatrix} 1 & 0\\ 0 & B_2 \end{pmatrix},
it follows that
B^T A B = \begin{pmatrix} a_{11} & 0 &\cdots & 0\\ 0 & a_1 & \cdots & 0\\ \vdots & \vdots & \ddots &\vdots\\ 0 & 0 & \cdots & a_{n-1} \end{pmatrix}.
We now treat the case of a zero entry in the upper left hand corner, i.e., $a_{11}= 0$. Suppose first that $a_{jj} \neq 0$ for some $j > 1$. Let $P$ denote the identity matrix with the first and $j$-th rows interchanged. The operation $A\mapsto A P$ amounts to interchanging the first and $j$-th columns in $A$. Similarly $A\mapsto P^T A$ interchanges the first and $j$-th rows in $A$. The matrix $P$ is invertible and $P^T A P$ is a symmetric matrix with $(P^T A P)_{11} = a_{jj}\neq 0$, and we have reduced to the case of a non-zero entry in the upper left hand corner.
If $a_{ii} = 0$ for every $i = 1, \dots, n$ we may assume that $a_{1 j}\neq 0$ for some $j>1$. Let $B$ denote the identity matrix where the entry in the first column and $j$-th row is $1$. The operation $A\mapsto A B$ amounts to adding the $j$-th column to the first column in $A$. Similarly $A\mapsto B^T A$ adds the $j$-th row to the first row in $A$. All in all we get $(B^T A B)_{11} = 2 a_{1 j} \neq 0$, where we have used that $a_{ii} = 0$ for $i=1, \dots, n$. Again we have reduced to the case of a non-zero entry in the upper left hand corner.
Consider the $3\times 3$ real symmetric matrix
A = (a_{ij}) = \begin{pmatrix} 1 & 5 & 2\\ 5 & 3 & 0\\ 2 & 0 & 5 \end{pmatrix}.
Here $a_{11} = 1 \neq 0$. Therefore the fundamental step in the proof of Theorem 8.27 applies and
\begin{pmatrix} 1 & 0 & 0 \\ -5 & 1 & 0\\ -2 & 0 & 1 \end{pmatrix} A \begin{pmatrix} 1 & -5 & -2\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 0 & -22 & -10\\ 0 & -10 & 1 \end{pmatrix}
and again
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0\\ 0 & -\frac{5}{11} & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0\\ 0 & -22 & -10\\ 0 & -10 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & -\frac{5}{11}\\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 0 & -22 & 0\\ 0 & 0 & \frac{61}{11} \end{pmatrix}.
Summing up we get
B = \begin{pmatrix} 1 & -5 & -2\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & -\frac{5}{11}\\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & -5 & \frac{3}{11}\\ 0 & 1 & -\frac{5}{11}\\ 0 & 0 & 1 \end{pmatrix}.
You are invited to check that
B^T A B = \begin{pmatrix} 1 & 0 & 0\\ 0 & -22 & 0\\ 0 & 0 & \frac{61}{11} \end{pmatrix}.
In particular, $B^T A B$ has both positive and negative diagonal entries, so $A$ is indefinite.
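The steps in the proof of Theorem 8.27 are mechanical enough to be automated. Below is a SymPy sketch of the algorithm (a plain Python stand-in for a Sage implementation); applied to the matrix of Example 8.28 it reproduces the matrix $B$ and the diagonal found there.

# Congruence diagonalization of a real symmetric matrix, following the three
# cases in the proof of Theorem 8.27.  Returns (B, D) with D = B^T * A * B diagonal.
from sympy import Matrix, eye

def congruence_diagonalize(A):
    n = A.shape[0]
    D = Matrix(A)          # working copy, kept symmetric throughout
    B = eye(n)
    for k in range(n):
        if D[k, k] == 0:
            j = next((j for j in range(k + 1, n) if D[j, j] != 0), None)
            if j is not None:
                P = eye(n)                     # case 2: swap to bring a non-zero
                P[k, k] = P[j, j] = 0          # diagonal entry to position (k, k)
                P[k, j] = P[j, k] = 1
                D, B = P.T * D * P, B * P
            else:
                j = next((j for j in range(k + 1, n) if D[k, j] != 0), None)
                if j is None:
                    continue                   # row k is already cleared
                E = eye(n)                     # case 3: add column j to column k
                E[j, k] = 1                    # (and row j to row k)
                D, B = E.T * D * E, B * E
        for j in range(k + 1, n):              # case 1: clear out row and column k
            if D[k, j] != 0:
                E = eye(n)
                E[k, j] = -D[k, j] / D[k, k]
                D, B = E.T * D * E, B * E
    return B, D

A = Matrix([[1, 5, 2], [5, 3, 0], [2, 0, 5]])  # the matrix from Example 8.28
B, D = congruence_diagonalize(A)
print(B)   # matches the B of Example 8.28
print(D)   # diag(1, -22, 61/11)

Together with the exercise at the beginning of this section, the signs of the diagonal entries of $D$ then settle whether $A$ is positive definite, positive semidefinite, negative definite or indefinite.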
Let
A = \begin{pmatrix} 0 & 0 & 1 & 1\\ 0 & 0 & 2 & 3\\ 1 & 2 & 1 & 4\\ 1 & 3 & 4 & 0 \end{pmatrix}.
Here $a_{11} = a_{22} = 0$, but the diagonal element $a_{33}\neq 0$. So we are in the second step of the proof of Theorem 8.27. Using the matrix
P = \begin{pmatrix} 0 & 0 & 1 & 0\\ 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}
we get
P^T A P = \begin{pmatrix} 1 & 2 & 1 & 4\\ 2 & 0 & 0 & 3\\ 1 & 0 & 0 & 1\\ 4 & 3 & 1 & 0 \end{pmatrix}.
As argued in the proof, this corresponds to interchanging the first and third columns and then interchanging the first and third rows. In total you move the non-zero $a_{33}$ to the upper left corner of the matrix.
Consider the symmetric matrix
A = \begin{pmatrix} 0 & 1 & 1 & 1\\ 1 & 0 & 1 & 1\\ 1 & 1 & 0 & 1\\ 1 & 1 & 1 & 0 \end{pmatrix}.
We have zero entries in the diagonal. As in the third step in the proof of Theorem 8.27 we must find an invertible matrix $B_1$, such that the upper left corner in $B_1^T A B_1$ is non-zero. In the proof it is used that every diagonal element is zero: if we locate a non-zero element in the $j$-th column in the first row, we can add the $j$-th column to the first column and then the $j$-th row to the first row, obtaining a non-zero element in the upper left corner. For $A$ above we choose $j=2$ and the matrix $B_1$ becomes
B_1 = \begin{pmatrix} 1 & 0 & 0 & 0\\ 1 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 \end{pmatrix}
so that
B_1^T A B_1 = \begin{pmatrix} 2 & 1 & 2 & 2\\ 1 & 0 & 1 & 1\\ 2 & 1 & 0 & 1\\ 2 & 1 & 1 & 0 \end{pmatrix}.
Let $A$ be any matrix. Show that
A^T A
is positive semidefinite.
Find inequalities defining the set
\left\{(a, b)\in \mathbb{R}^2 \,\middle\vert\, \begin{pmatrix} 2 & 1 & a\\ 1 & 1 & 1 \\ a & 1 & b \end{pmatrix} \,\,\text{is positive definite} \right\}.
Same question with positive semidefinite. Sketch and compare the two subsets of the plane $\{(a, b) \mid a, b\in \mathbb{R}\}$.
Let $f:\mathbb{R}^3\rightarrow \mathbb{R}$ denote the function given by
f(x, y, z) = x^2 + y^2 + z^2 + a x y + x z + y z,
where $a\in \mathbb{R}$. Let $H$ denote the Hessian of $f$ in a point $(x, y, z)\in \mathbb{R}^3$.
  1. Compute $H$.
  2. Show that $f(v) = v^T A v$ for $v=(x, y, z)\in \mathbb{R}^3$ and $A = \frac{1}{2} H$.
  3. Compute a non-zero vector $v\in \mathbb{R}^3$, such that $H v = 0$ in the case where $a=2$. Is $H$ invertible in this case?
  4. Show that $f$ is strictly convex if $-1 < a < 2$.
  5. Is $f$ strictly convex if $a=2$?
    Hint
    Consider the line segment between $0$ and a suitable vector $u\neq 0$, where $f(u) = 0$.
Why is the subset given by the inequalities
\begin{aligned} x &\geq 0\\ y &\geq 0\\ x y - z^2 &\geq 0 \end{aligned}
a convex subset of $\mathbb{R}^3$?