7 Several variables
7.1 Introduction
Illustrate the gradient descent method for . Pay attention to
the learning rate . How big is allowed to be, when
is required and ?
This is a hands-on exercise: carry out the gradient descent method
numerically for the function
to solve the minimization problem
starting with .
Hint
Is a convex function?
Explain how the
Newton-Raphson method
This is an iterative method for approximating a zero of a differentiable function . It works by guessing and then iterating to obtain a sequence approximating a zero ().
may be used to solve the minimization problem and
compute the minimum also using this method.
Helpful code
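Since the function from the exercise is not reproduced above, the following is only a generic sketch: it applies the Newton-Raphson iteration to the derivative, so that the limit is a critical point, using a placeholder convex function.

# Newton-Raphson for one-variable minimization: apply the iteration to f'.
# The function below is a placeholder -- substitute the f from the exercise.
def f(x):
    return x**4 + x**2 - x       # placeholder convex function
def df(x):
    return 4*x**3 + 2*x - 1      # f'
def ddf(x):
    return 12*x**2 + 2           # f''

x = 1.0                          # placeholder starting guess
for _ in range(20):
    x_new = x - df(x) / ddf(x)   # Newton step for the equation f'(x) = 0
    if abs(x_new - x) < 1e-12:
        break
    x = x_new
print('approximate minimizer:', x, 'value:', f(x))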
It is not clear how to choose the step size here. Proceed by letting be the
smallest natural number such that
Stop the process when .
Helpful code
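The interactive cell is not reproduced here. Below is a minimal sketch of the halving rule described above, assuming the missing condition simply asks for a decrease in the function value; the function, starting point and stopping tolerance are placeholders.

import numpy as np

# Gradient descent where the step size is 1/2**n for the smallest natural
# number n satisfying the descent condition; stop when the gradient is small.
def f(v):
    return v[0]**2 + 3*v[1]**2            # placeholder function
def grad_f(v):
    return np.array([2*v[0], 6*v[1]])     # gradient of the placeholder

v = np.array([1.0, 1.0])                  # placeholder starting point
while np.linalg.norm(grad_f(v)) > 1e-8:   # placeholder stopping criterion
    g = grad_f(v)
    n = 0
    while f(v - g / 2**n) >= f(v):        # smallest n giving a decrease
        n += 1
    v = v - g / 2**n
print('approximate minimizer:', v)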
7.2 Vector functions
Look back at Exercise
1.115
. Write down precisely the
vector function occurring there.
The function rotates
a vector degrees counterclockwise. What are
and in
Hint
Try rotating some specific vectors like degrees.
Do you see a pattern?
7.3 Differentiability
Let be a
function with an open subset. Then is
differentiable at if there exists
- an matrix ,
- an open subset with , such that for every ,
- a function continuous at with ,
The function is called
differentiable if it is differentiable at every .
7.3.1 Partial derivatives
Let be a function, where is an open subset
of . Fix a point and let
for . If is differentiable at according to Definition
7.3
, then
we say that the partial derivative
of with respect to exists at and use the notation
The partial derivative with respect to a specific variable is computed by
treating all the other variables as constants.
Consider the function given by
Then
where . This example illustrates that
can be computed just like in the
one variable case, when the other variables () are treated
as constants. Notice that
Use the Sage window above to verify the computation of the partial
derivative in Example
7.9
.
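The Sage window referred to above is not reproduced here. In SymPy (Sage uses essentially the same syntax) such a verification might look as follows, with a placeholder function since the one from Example 7.9 is missing.

from sympy import symbols, diff, exp

x, y = symbols('x y')
f = x**2 * exp(y) + y          # placeholder -- use the function from Example 7.9
print(diff(f, x))              # partial derivative with respect to x
print(diff(f, y))              # partial derivative with respect to y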
Let be a function with an
open subset. If is differentiable at , then the
partial derivatives
exist for and and the matrix
in Definition
7.5
is
The -th column in is . Putting for
in Definition
7.5
gives
The -th coordinate of this identity of -dimensional vectors
can be written
where
and
(7.6)
shows that .
Compute the matrix derivative of the vector function in Exercise
7.4
.
Let be a function with an
open subset. If the partial derivatives for exist at every with
continuous (for and ), then
is differentiable. If the second order partial
derivatives exist for a function
and are continuous functions,
then
for .
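As a machine check of the symmetry statement (complementing the by-hand exercise below), one may compare the two mixed second order partial derivatives symbolically; the function used here is only a placeholder.

from sympy import symbols, diff, simplify, sin, exp

x, y = symbols('x y')
f = sin(x*y) + x**2 * exp(y)   # placeholder function with continuous partials
fxy = diff(f, x, y)            # first x, then y
fyx = diff(f, y, x)            # first y, then x
print(simplify(fxy - fyx))     # prints 0, as the theorem predicts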
Verify (by hand!) the symmetry of the second order partial derivatives for the function
in Example
7.9
i.e., show that
Verify that given by
is a differentiable function by computing
and applying Theorem
7.13
. Check also that
7.4 Newton-Raphson in several variables!
Verify the claim in
(7.10)
by applying
(7.8)
to
Carry out sufficiently many iterations starting with the vector in
(7.10)
to see the iteration stabilize. You should do this
using a computer, for example by modifying the Sage code in the last half of Example
7.18
.
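The Sage code from Example 7.18 is not reproduced here. The sketch below shows what such an iteration might look like in Python; the map F, its matrix derivative and the starting vector are placeholders for the ones in (7.10).

import numpy as np

# Newton-Raphson in several variables: v_{k+1} = v_k - DF(v_k)^{-1} F(v_k).
def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 1, x - y])     # placeholder map
def DF(v):
    x, y = v
    return np.array([[2*x, 2*y],
                     [1.0, -1.0]])                # its matrix derivative

v = np.array([1.0, 0.5])                          # placeholder starting vector
for _ in range(20):
    step = np.linalg.solve(DF(v), F(v))           # solve DF * step = F
    v = v - step
    if np.linalg.norm(step) < 1e-12:              # iteration has stabilized
        break
print(v)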
7.5 Local extrema in several variables
Let be a function, where is
an open subset. Suppose that the partial derivatives exist at
. Then is called a critical
point for if .
Consider the function given by
corresponding to finding critical points for the function
You can left click and hold the graph computed below (after it has rendered) and rotate
the surface to get a feeling for what
(7.11)
looks like. Zooming in is also possible.
Here
In the Sage code below, Newton's method is started at and iterated four times.
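The Sage cell itself is not reproduced in this text. A minimal sketch of four Newton iterations applied to the gradient might look as follows; the gradient, Hessian and starting point are placeholders for the ones in the example.

import numpy as np

# Newton's method for critical points: v_{k+1} = v_k - H(v_k)^{-1} grad f(v_k).
def grad(v):
    x, y = v
    return np.array([4*x**3 - y, 2*y - x])        # placeholder gradient
def hess(v):
    x, y = v
    return np.array([[12*x**2, -1.0],
                     [-1.0,     2.0]])            # placeholder Hessian

v = np.array([1.0, 1.0])                          # placeholder starting point
for _ in range(4):                                # four iterations, as in the text
    v = v - np.linalg.solve(hess(v), grad(v))
    print(v)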
Let be a differentiable function, where
is an open subset. Suppose that and
for . Then
for small.
By the differentiability of ,
where is a function satisfying
for . For
with we have
When tends to zero from the right, it follows that for small .
Let us briefly pause and see Lemma
7.19
in action. Consider the function
given by
and with
.
In this case and
. Therefore we may find a small , such that .
How do we choose optimally? If is too big we fail and end up at a worse point than .
Here
This is a quadratic polynomial, which is minimal for
. Therefore the minimal value reached in the direction of is .
The process now continues replacing by .
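The computation above can be mirrored numerically. The sketch below performs one such step for an illustrative quadratic function (the function from the example is not reproduced above): the restriction of f to the ray from the starting point in the direction of the negative gradient is a quadratic polynomial in t, which is minimized exactly.

import numpy as np

# One gradient descent step with exact line search for f(v) = 0.5 v^T A v - b^T v.
A = np.array([[2.0, 0.0],
              [0.0, 6.0]])                        # illustrative choice
b = np.array([1.0, 1.0])

def grad(v):
    return A @ v - b

v0 = np.array([1.0, 1.0])                         # illustrative starting point
d = grad(v0)
t_star = (d @ d) / (d @ (A @ d))                  # minimizer of the quadratic in t
v1 = v0 - t_star * d
print(t_star, v1, 0.5 * v1 @ (A @ v1) - b @ v1)   # f(v1) < f(v0)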
Let be a differentiable function, where
is an open subset. If is a local
extremum, then is a critical point for .
Suppose that . If is a local minimum,
then we may use in
Lemma
7.19
to deduce that for small. This contradicts the local minimality
of . If is a local maximum we can apply
Lemma
7.19
with and
to reach a similar contradiction. Therefore and
is a critical point for .
Compute the critical points of
Is a local maximum or a local minimum for ?
Hint
Look at
and and (along with Theorem
6.47
).
We will prove later that a differentiable function is
strictly convex if the so-called Hessian matrix given by
is positive definite for every . This is a multivariable generalization of
the fact that is strictly convex if for every
.
Now let
3D graph
You can left click the surface computed below after it has rendered and rotate
or zoom in.
- Show that is strictly convex.
- Compute the critical point(s) of . This is a numerical computation! Modify the relevant Sage window for Newton's method in the previous chapter to do it (see the sketch after this list).
- For a differentiable convex function we have in general that
for every . This is a multivariable generalization of Theorem
6.58
.
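A sketch of how the numerical computation of the critical point(s) might be set up with SymPy (nsolve runs a Newton-type iteration on the gradient); the strictly convex function g below is only a stand-in for the one defined above.

from sympy import symbols, exp, nsolve

x, y = symbols('x y')
g = exp(x) + exp(y) + x**2 + y**2 - x*y           # stand-in function
grad = (g.diff(x), g.diff(y))                     # the gradient of g
print(nsolve(grad, (x, y), (0, 0)))               # the (unique) critical point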
7.6 The chain rule
Let and with
, open subsets and
. If is differentiable at and is
differentiable at , then is differentiable
at with
7.6.1 Matrix multiplication graphically
The matrix
is represented below as a graph
Similarly the matrix
is represented as
You know that the matrix product is a matrix. Let us line and up
graphically:
There are three paths from to and three paths from to . Here are the three paths from
to :
Finally, the matrix product is represented by the graph
The number on the edge from to is obtained by adding the products of the weights on
each of the three paths from to , i.e., . This is the graphical interpretation of matrix multiplication!
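The path description can be checked directly. In the sketch below the two matrices are only illustrative stand-ins (the ones drawn above are not reproduced in this text); the entry of the product is recovered as the sum of the products of the weights along the three paths.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # illustrative 2x3 matrix
B = np.array([[ 7,  8],
              [ 9, 10],
              [11, 12]])           # illustrative 3x2 matrix

i, j = 0, 1
paths = [A[i, k] * B[k, j] for k in range(3)]   # products of weights along the 3 paths
print(paths, sum(paths), (A @ B)[i, j])         # sum(paths) equals the (i, j) entry of A B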
7.6.2 Unpacking the chain rule
The function given by
may be evaluated using the computational graph
where is the function and is the function .
Similar to the one variable case discussed in the beginning of section
7.6
,
we label each edge, but now by the partial derivative of the function in its ending node
with respect to the variable in its beginning node:
From the graphical interpretation of the matrix product and the chain rule you follow
the two paths from the input node to the output node and conclude that
Construct a computational graph for
and detail the computation of the gradient in this context.
Compute the
gradient of at using pytorch.
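A minimal sketch of such a computation with PyTorch autograd; the point and the function are placeholders, since the ones from the exercise are not reproduced above.

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)   # placeholder point
f = x[0]**2 * torch.sin(x[1]) + x[1]                # placeholder function
f.backward()                                        # backpropagation through the graph
print(x.grad)                                       # the gradient of f at the point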
Consider and
given by
Compute using the chain rule and check the result
with an explicit computation of the derivative of .
We wish to show that the function given by
is convex. This means that we need to prove that
for every and every with .
This can be accomplished from the one variable case in the following way. Define
and show that is convex by using the chain rule to show that . Show
how the convexity of follows from this by using that
7.7 Logistic regression
Prove that
and
7.7.1 Estimating the parameters
Suppose that the event depends on only one observation, i.e., above.
For example, could be the event of not showing up on a Monday, paired with the amount of sleep
during the weekend.
Here
and
I remember exactly where I was when first hearing about the Challenger disaster in 1986.
(See byuistats.github.io for more details on this example.)
This dreadful event was caused by failure of a so-called O-ring. The
O-rings had been tested before the launch for failure (=1 below) at different
temperatures (in degrees Fahrenheit), resulting in the (partial) table below.
On the morning of the launch the outside temperature was
(uncharacteristically low for Florida) degrees Fahrenheit. We
wish to use logistic regression to compute the probability that the
O-ring fails.
Below we have sketched how the logistic regression is carried out using the python library SciKit-Learn.
The option solver='lbfgs' chooses an algorithm for maximizing .
Press the
Compute button and see the probability of failure during the launch.
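Since the interactive cell is not reproduced in this text, here is a minimal sketch of what it might contain; the temperature and failure arrays and the launch temperature are placeholders to be replaced by the values from the table and the text above.

import numpy as np
from sklearn.linear_model import LogisticRegression

temperatures = np.array([[53.0], [57.0], [63.0], [70.0], [75.0], [81.0]])   # placeholders
failures     = np.array([1, 1, 1, 0, 0, 0])                                 # placeholders

model = LogisticRegression(C=25, solver='lbfgs')   # the options from the text
model.fit(temperatures, failures)
print(model.predict_proba([[31.0]]))               # placeholder launch temperature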
Explain the function LogisticRegression in sklearn. In particular, what do the
parameters in
model = LogisticRegression(C=25, solver='lbfgs')
model.fit(X,y)
mean?
7.8 3Blue1Brown
7.8.1 Introduction to neural networks
7.8.2 Gradient descent
7.8.3 Backpropagation and training
7.8.4 The chain rule in action
Watch the video above before solving this exercise.
Consider the simple neural network
where
and is the sigmoid function. This neural network has input
and output . Let be a function of the output . For fixed
, we consider as a function of via
Backpropagation for training neural networks uses the
chain rule to compute the gradient
Explain how to do this.
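The exact network is only partially reproduced above. The sketch below assumes a single unit y = sigmoid(w . x + b) and a squared-error function of the output, and shows how the chain rule produces the gradient with respect to w and b.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0])        # fixed input (placeholder)
w = np.array([0.3,  0.8])        # weights (placeholder)
b = 0.1                          # bias (placeholder)
t = 1.0                          # target value (placeholder)

# Forward pass.
z = w @ x + b
y = sigmoid(z)
L = (y - t)**2

# Backward pass: dL/dw = dL/dy * dy/dz * dz/dw, and similarly for b.
dL_dy = 2 * (y - t)
dy_dz = y * (1 - y)              # derivative of the sigmoid at z
dL_dw = dL_dy * dy_dz * x
dL_db = dL_dy * dy_dz
print(dL_dw, dL_db)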
7.9 Lagrange multipliers
Suppose that is a local maximum/minimum for
(7.18)
. Then there exists
, such that is a critical point for .
Consider the minimization problem
First of all, why does this problem have a solution at all? We write
up the non-linear equations
coming from the critical points of the Lagrange function. Now we know that
these can be solved and that amongst our solutions there is a minimum!
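The objective and constraint of this example are not reproduced above. The SymPy sketch below shows how such a system of critical-point equations for the Lagrange function might be written up and solved, for a placeholder problem (extremizing x + y on the unit circle).

from sympy import symbols, solve

x, y, lam = symbols('x y lambda', real=True)
f = x + y                         # placeholder objective
g = x**2 + y**2 - 1               # placeholder constraint g = 0
L = f - lam * g                   # the Lagrange function
equations = [L.diff(x), L.diff(y), L.diff(lam)]
print(solve(equations, [x, y, lam]))   # candidate points; the minimum is among them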
Computing the distance from the line to the point gives rise to the
minimization problem
Solve this minimization problem using Theorem
7.37
.
Use Theorem
7.37
to
maximize subject to .
Hint
Consider the subset
. Why is a closed subset?
Why is bounded?
Hint
How does this relate to Theorem
5.66
?
Does the optimization
problem have a geometric interpretation?
Hint
Here you end up with the system
of linear equations in and , where you
regard as a constant. Use Gaussian
elimination to solve this system in order to
derive a (nice) quadratic equation in coming from
where you assume that . Handle the case separately.
To prove that is bounded you can keep fixed in
and solve for . A last resort is using the plot in Sage in the Hint button below, but that
does not give any real insight unless you explain how Sage makes the plot from
the equation
(7.21)
.
A rectangular box has side lengths , and . What is its
maximal volume when we assume that lies on the plane
for ?
A company is planning to produce a box with volume
. For design reasons it needs different
materials for the sides, top and bottom. The cost of the materials
per square meter is dollar for the sides, dollars for the
bottom and the top. Find the measurements of the box minimizing the
production costs.
Hint
Let and be the measurements. Use to
rewrite the Lagrange equations so that and are expressed in terms
of .
7.10 Optimization using the interior and boundary of a subset
Consider an optimization problem
where is a subset, a
differentiable function and an optimal solution to
(7.22)
. If , then is a critical point of .
Consider the minimization problem
from Example
7.38
. Let us modify it to
where
We are now minimizing not only over the unit circle, but over
the whole unit disk. Here
Proposition
7.44
guides us first to look for
optimal points in . Here we use Proposition
7.21
to
show that there can be no optimal points in , because
the gradient of the function is
Therefore the boundary needs to be analyzed and the usual technique
(as was implicit in Lagrange multipliers) is to find
a parametrization for the points satisfying
. There are two of those (one for the upper half and one for the lower half of the unit circle):
where .
This means that the optimization problem for the boundary turns into the two
simpler optimization problems of minimizing
subject to . These can then be solved in the usual way as one-variable optimization problems.
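A sketch of this two-step strategy for a placeholder objective on the unit disk: since the placeholder has a nowhere vanishing gradient, only the two one-variable boundary problems on [-1, 1] need to be minimized.

import numpy as np
from scipy.optimize import minimize_scalar

def f(x, y):
    return x + 2*y                # placeholder objective; its gradient (1, 2) never vanishes

# Boundary parametrizations y = +/- sqrt(1 - x^2) for x in [-1, 1].
upper = minimize_scalar(lambda x: f(x,  np.sqrt(1 - x**2)), bounds=(-1, 1), method='bounded')
lower = minimize_scalar(lambda x: f(x, -np.sqrt(1 - x**2)), bounds=(-1, 1), method='bounded')
print(upper.x, upper.fun)
print(lower.x, lower.fun)         # the smaller of the two values is the minimum over the disk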
Solve the two optimization problems
where . But first give a
reason as to why they both are solvable.
Hint
First find and . Then try with Proposition
7.44
supposing that a maximal point really is to be found in and not
on .
Solve the two optimization problems
where . But first give a
reason as to why they both are solvable.
Solve the two optimization problems
where is the triangle with vertices in and . But first give a
reason as to why they both are solvable.
Use Proposition
7.44
to give all the minute details in applying
Theorem
7.37
to solve Exercise
7.42
.
First rewrite the problem so that you minimize subject to by using .
Then explain why this problem may be solved by restricting with upper and lower bounds on and . The minimum () is attained at a critical point and not on the
boundary. For one may optimize over the compact subset
and analyze what happens when .