7 Several variables

A function of several variables usually refers to a function
$$f : \mathbb{R}^n \to \mathbb{R}, \tag{7.1}$$
where $n$ is a natural number. We have already seen functions of several variables. In particular, in Chapter 4, we saw linear functions (in connection with linear programming) like
$$f(x_1, \dots, x_n) = c_1 x_1 + \cdots + c_n x_n. \tag{7.2}$$
This is a rather simple function of several variables as in (7.1). In general, functions as in (7.1) can be wildly complicated. One of the main purposes of this chapter is to zero in on the class of differentiable functions in (7.1). In Chapter 6 we defined what it means for a function of one variable to be differentiable. This was inspired by a drawing of the graph of the function. In several variables (for $n > 1$) one has to be a bit clever in the definition of differentiability. The upshot is that the derivative at a point now is a row vector (or more generally a matrix) instead of being a single number. As an example, using notation that we introduce in this chapter, the derivative of the linear function in (7.2) is
$$\left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right) = (c_1, \dots, c_n).$$
This notation means that partial differentiation with respect to a variable occurs, i.e., one fixes a variable and computes the derivative with respect to this variable, viewing all the other variables as constants.
First some treasured memories from the author's past.

7.1 Introduction

Many years ago (1986-89), I had a job as a financial analyst in a bank (now a hotel!) working (often late at night) with a spectacular view of Copenhagen from the office circled below.
This was long before a financial analyst became a quant and machine learning became a buzzword. Digging through my old notes from that time, I found the outlines below.
These were notes I made in connection with modelling the yield curve for zero coupon bonds. I had to fit a very non-linear function in several variables to financial data and had to use effective numerical tools (programming them in APL), tools that are also used today in machine learning and data science.
Ultimately we are interested in solving optimization problems like
$$\min\; f(x), \quad x \in \mathbb{R}^n,$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable (read nice for now) function.
Training neural networks is a fancy name for solving an optimization problem, where usually $n$ is very large and $f$ is built, just like in the least squares method, from some data points. The difference is that in neural networks, $f$ is an incredibly complicated (differentiable) function composed of several intermediate functions. We do not, as in the method of least squares, have an explicit formula for finding a minimum. We have to rely on iterative methods. One such method is called gradient descent.
Let me illustrate this in the simplest case, where $n = 1$. The general case is conceptually very similar (see Lemma 7.19).
Suppose that $f$ is differentiable at $x_0$ with $f'(x_0) \neq 0$ and we wish to solve the minimization problem
$$\min\; f(x), \quad x \in \mathbb{R}.$$
Solving the equation $f'(x) = 0$ (to find potential minima) may be difficult. Instead we try something else.
We know for sure that $x_0$ is not a local minimum (why?). It turns out that we can move a little bit in the direction opposite to the sign of $f'(x_0)$ (left if $f'(x_0) > 0$ and right if $f'(x_0) < 0$) and get a better candidate for a minimum than $x_0$, i.e., for small $\gamma > 0$ and
$$x_1 = x_0 - \gamma f'(x_0)$$
we have $f(x_1) < f(x_0)$.
This is a consequence of the definition of $f$ being differentiable at $x_0$ with $f'(x_0) \neq 0$: if you use the definition of differentiability with $h = -\gamma f'(x_0)$, you will see that
$$f(x_0 - \gamma f'(x_0)) = f(x_0) - \gamma f'(x_0)^2 - \gamma f'(x_0)\, \epsilon(-\gamma f'(x_0)).$$
For small $\gamma > 0$ this shows that $f(x_0 - \gamma f'(x_0)) < f(x_0)$, since $f'(x_0)^2 > 0$ and $\epsilon(-\gamma f'(x_0)) \to 0$ as $\gamma \to 0$.
The process is then repeated, putting
$$x_{n+1} = x_n - \gamma f'(x_n),$$
until the absolute value of $f'(x_n)$ is sufficiently small (indicating that we are close to a point with $f'(x) = 0$).
The number $\gamma > 0$ is called the learning rate in machine learning.
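To make this concrete, here is a minimal Python sketch of the iteration, for a made-up function $f(x) = (x - 1)^2 + 1$ (the function, starting point and learning rate are my own choices for illustration, not the ones from the text):

# Gradient descent in one variable: x_{n+1} = x_n - gamma * f'(x_n).
# The function below is a made-up example.
def f(x):
    return (x - 1)**2 + 1

def fprime(x):
    return 2 * (x - 1)

x = 5.0        # initial guess
gamma = 0.1    # learning rate
while abs(fprime(x)) > 1e-8:    # stop when close to a critical point
    x = x - gamma * fprime(x)

print(x, f(x))   # approximately 1.0 and 1.0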
Illustrate the gradient descent method. Pay attention to the learning rate $\gamma$. How big is $\gamma$ allowed to be when a strict decrease $f(x_{n+1}) < f(x_n)$
is required?
This is a hands-on exercise: carry out the gradient descent method numerically to solve the minimization problem, starting with an initial guess $x_0$.
Hint
It is not clear how to choose the step size here. Proceed by letting $n$ be the smallest natural number such that the step size $\gamma = 2^{-n}$ gives a strict decrease in $f$.
Stop the process when $|f'(x)|$ is sufficiently small.
Helpful code
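The interactive window is not reproduced here; the sketch below implements one reading of the hint, with a placeholder function and starting point (substitute the ones from the exercise):

# Gradient descent where the step size 2^(-n) is chosen with the
# smallest natural number n giving a strict decrease in f.
# f and the starting point are placeholders.
def f(x):
    return x**4 - 3 * x + 1

def fprime(x):
    return 4 * x**3 - 3

x = 0.0
while abs(fprime(x)) > 1e-6:
    n = 0
    while f(x - 2**(-n) * fprime(x)) >= f(x):
        n += 1
    x = x - 2**(-n) * fprime(x)

print(x, f(x))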
Is $f$ a convex function?
Explain how the Newton-Raphson method may be used to solve the minimization problem and compute the minimum also using this method. (Newton-Raphson is an iterative method for approximating a zero of a differentiable function $f$. It works by guessing $x_0$ and then iterating
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
to get a sequence approximating a zero of $f$.)
Helpful code
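Again the interactive window is missing here; the following sketch shows the idea of applying Newton-Raphson to $f'$ in order to find a critical point of $f$, for the same placeholder function as above:

# Newton-Raphson applied to f': x_{n+1} = x_n - f'(x_n) / f''(x_n).
def fprime(x):
    return 4 * x**3 - 3

def fdoubleprime(x):
    return 12 * x**2

x = 1.0                 # a guess (placeholder)
for _ in range(20):
    x = x - fprime(x) / fdoubleprime(x)

print(x)   # approximates a zero of f', i.e. a candidate for the minimum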
Recall the definition of a function $f$ being differentiable at a point $x_0$ with derivative $f'(x_0)$. Here we measured the change of $f$ in terms of the change $h$ (in $x$). It had to have the form
$$f(x_0 + h) = f(x_0) + f'(x_0)\, h + \epsilon(h)\, h, \tag{7.3}$$
where $\epsilon$ is a function continuous in $0$ with $\epsilon(0) = 0$ and $h$ small. If you divide both sides of (7.3) by $h$ you recover the usual more geometric definition of differentiability as a limiting slope:
$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}. \tag{7.4}$$
We wish to define differentiability at $x_0 \in \mathbb{R}^n$ for a function $f : \mathbb{R}^n \to \mathbb{R}^m$. In this setting the quotient
$$\frac{f(x_0 + h) - f(x_0)}{h}$$
in (7.4) does not make any sense. There is no way we can divide the vector $f(x_0 + h) - f(x_0)$ by the vector $h$, unless of course $n = m = 1$ as in (7.4), where we faced usual numbers.
The natural thing here is to generalize the definition in (7.3). First let us recall what functions $f : \mathbb{R}^n \to \mathbb{R}^m$ look like.

7.2 Vector functions

We will flesh out the general Definition 1.99 in a special case below.
A function $f : \mathbb{R}^n \to \mathbb{R}^m$ takes a vector $v \in \mathbb{R}^n$ as input and gives a vector $f(v) \in \mathbb{R}^m$ as output. This means that every coordinate in the output must be a function of $v$, i.e.,
$$f(v) = (f_1(v), \dots, f_m(v))$$
for $v \in \mathbb{R}^n$. So in total, we may write $f$ as
$$f\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} f_1(x_1, \dots, x_n) \\ \vdots \\ f_m(x_1, \dots, x_n) \end{pmatrix}. \tag{7.5}$$
Each of the (coordinate) functions $f_1, \dots, f_m$ are functions from $\mathbb{R}^n$ to $\mathbb{R}$.
Look back at Exercise 1.115. Write down precisely the vector function occurring there.
The function $f : \mathbb{R}^2 \to \mathbb{R}^2$ rotates a vector a fixed number of degrees counterclockwise. What are $f_1$ and $f_2$ in this case?
Hint
Try rotating some specific vectors. Do you see a pattern?

7.3 Differentiability

The definition of differentiability for a function $f : \mathbb{R}^n \to \mathbb{R}^m$ mimics (7.3), except that the derivative is replaced by a matrix. Also, the open interval of definition is replaced by an open subset $U \subseteq \mathbb{R}^n$ and the (open) interval around $x_0$ is replaced by an open subset containing $x_0$. Notice, however, that now the derivative is a matrix!
Let $f : U \to \mathbb{R}^m$ be a function with $U \subseteq \mathbb{R}^n$ an open subset. Then $f$ is differentiable at $x_0 \in U$ if there exist
  1. an $m \times n$ matrix $C$,
  2. an open subset $V \subseteq \mathbb{R}^n$ with $0 \in V$, such that $x_0 + h \in U$ for every $h \in V$,
  3. a function $\epsilon : V \to \mathbb{R}^m$ continuous at $0$ with $\epsilon(0) = 0$,
such that
$$f(x_0 + h) = f(x_0) + C\, h + \epsilon(h)\, |h|$$
for every $h \in V$. In this case, the matrix $C$ is called the (matrix) derivative of $f$ at $x_0$ and denoted by $f'(x_0)$.
The function $f$ is called differentiable if it is differentiable at every $x_0 \in U$.
How do we compute the matrix derivative in the above definition? We need to look at the representation of $f$ in (7.5) and introduce the partial derivatives.

7.3.1 Partial derivatives

A function $f(x)$ of one variable has a derivative with respect to $x$. For a function of several variables we have a well defined derivative with respect to each of these variables. These are called the partial derivatives (if they exist) and they are defined below.
Let $f : U \to \mathbb{R}$ be a function, where $U$ is an open subset of $\mathbb{R}^n$. Fix a point $x_0 = (c_1, \dots, c_n) \in U$ and let
$$g_j(t) = f(c_1, \dots, c_{j-1}, t, c_{j+1}, \dots, c_n)$$
for $j = 1, \dots, n$. If $g_j$ is differentiable at $c_j$ according to Definition 7.3, then we say that the partial derivative of $f$ with respect to $x_j$ exists at $x_0$ and use the notation
$$\frac{\partial f}{\partial x_j}(x_0) = g_j'(c_j).$$
The partial derivative with respect to a specific variable is computed by letting all the other variables appear as constants.
To get a feeling for the definition and computation of partial derivatives, take a look at the example below, where we compute a partial derivative using the classical (geometric) definition of the one variable derivative.
Consider the function given by
Then
This example illustrates that a partial derivative can be computed just like in the one variable case, when the other variables are treated as constants.
Partial derivatives behave almost like the usual derivatives of one variable functions. You simply fix one variable that you consider the "real" variable and treat the other variables as constants.
Below are examples of Sage code computing partial derivatives. Notice that the variables must be declared first.
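The interactive Sage windows are not reproduced here, but input along the following lines (with a made-up function $f$) shows the idea:

# Sage: declare the variables first, then take partial derivatives.
# The function f is a made-up example.
x, y = var('x, y')
f = x^3 * y^2 + sin(x * y)
show(diff(f, x))       # partial derivative with respect to x
show(diff(f, y))       # partial derivative with respect to y
show(diff(f, x, y))    # first x, then y
show(diff(f, y, x))    # first y, then x -- the same answer!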
The Sage computations above point to a really surprising result. It seems that it makes no difference if you compute the partial derivative with respect to $x$ and then with respect to $y$, or the other way around. You could, just for fun, try this out on a more complicated function.
This result is formulated in Theorem 7.13 below.
Use the Sage window above to verify the computation of the partial derivative in Example 7.9 .
The following result tells us how to compute the matrix derivative.
Let $f : U \to \mathbb{R}^m$ be a function with $U \subseteq \mathbb{R}^n$ an open subset. If $f$ is differentiable at $x_0 \in U$, then the partial derivatives
$$\frac{\partial f_i}{\partial x_j}(x_0)$$
exist for $i = 1, \dots, m$ and $j = 1, \dots, n$, and the matrix $C$ in Definition 7.5 is
$$C = \begin{pmatrix} \dfrac{\partial f_1}{\partial x_1}(x_0) & \cdots & \dfrac{\partial f_1}{\partial x_n}(x_0) \\ \vdots & & \vdots \\ \dfrac{\partial f_m}{\partial x_1}(x_0) & \cdots & \dfrac{\partial f_m}{\partial x_n}(x_0) \end{pmatrix}.$$
The $j$-th column in $C$ is $C e_j$, where $e_j$ is the $j$-th standard basis vector. Putting $h = t\, e_j$ for $t \in \mathbb{R}$ in Definition 7.5 gives
$$f(x_0 + t\, e_j) = f(x_0) + t\, C e_j + \epsilon(t\, e_j)\, |t|.$$
The $i$-th coordinate of this identity of $m$-dimensional vectors can be written
$$f_i(x_0 + t\, e_j) = f_i(x_0) + c_{ij}\, t + \epsilon_i(t\, e_j)\, |t|, \tag{7.6}$$
where $c_{ij}$ denotes the $(i, j)$-th entry of $C$ and $\epsilon_i$ the $i$-th coordinate function of $\epsilon$, and (7.6) shows that
$$c_{ij} = \frac{\partial f_i}{\partial x_j}(x_0).$$
Compute the matrix derivative of the vector function in Exercise 7.4 .
For a function $f : U \to \mathbb{R}$ with $U \subseteq \mathbb{R}^n$ an open subset, the partial derivative, if it exists for every $x \in U$, is a new function
$$\frac{\partial f}{\partial x_j} : U \to \mathbb{R}.$$
We will use the notation
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial}{\partial x_i}\left( \frac{\partial f}{\partial x_j} \right)$$
for the iterated (second order) partial derivative.
The first part of the following result is a converse to Proposition 7.11. The second part contains the surprising symmetry of the second order partial derivatives under rather mild conditions. We will not go into the proof of this result, which is known as Clairaut's theorem.
Let $f : U \to \mathbb{R}^m$ be a function with $U \subseteq \mathbb{R}^n$ an open subset. If the partial derivatives $\partial f_i / \partial x_j$ exist at every $x \in U$ with
$$\frac{\partial f_i}{\partial x_j} : U \to \mathbb{R}$$
continuous (for $i = 1, \dots, m$ and $j = 1, \dots, n$), then $f$ is differentiable. If the second order partial derivatives exist for a function $f : U \to \mathbb{R}$ and are continuous functions, then
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$
for $i, j = 1, \dots, n$.
Verify (by hand!) the symmetry of the second order partial derivatives for the function in Example 7.9, i.e., show that
$$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}.$$
Verify that the function given in the exercise is a differentiable function by computing its partial derivatives and applying Theorem 7.13. Check also that its second order partial derivatives are symmetric.

7.4 Newton-Raphson in several variables!

There is a beautiful generalization of the Newton-Raphson method to functions $f : \mathbb{R}^n \to \mathbb{R}^n$ of several variables. Consider first a system
$$f_1(x, y) = 0, \qquad f_2(x, y) = 0 \tag{7.7}$$
of non-linear equations in the two variables $x$ and $y$. Notice that we are talking non-linear here. This is so much more difficult than the systems of linear equations that you encountered in a previous chapter.
However, just like we used Newton's method in one variable for solving a non-linear equation, Newton's method for finding a zero of a function $f : \mathbb{R}^n \to \mathbb{R}^n$ generalizes to the iterative scheme
$$x_{k+1} = x_k - f'(x_k)^{-1}\, f(x_k), \tag{7.8}$$
provided that the matrix derivative $f'(x_k)$ is invertible.
The reason that (7.8) works comes again from the powerful definition of differentiability in Definition 7.5, using that
$$f(x + h) \approx f(x) + f'(x)\, h \tag{7.9}$$
provided that $h$ is small. In fact, you get (again) (7.8) from (7.9) by putting $x = x_k$ and $h = x_{k+1} - x_k$, replacing "$f(x_{k+1})$ is close to $0$" by "$f(x_{k+1}) = 0$" and then isolating $x_{k+1}$.
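As an illustration, here is a minimal numpy sketch of the scheme (7.8) for a made-up system (not the one from the text): $f_1(x, y) = x^2 + y^2 - 1$ and $f_2(x, y) = x - y$.

import numpy as np

# Newton's method in two variables for a made-up system with a
# zero at (1/sqrt(2), 1/sqrt(2)).
def f(v):
    x, y = v
    return np.array([x**2 + y**2 - 1, x - y])

def fprime(v):           # the matrix derivative
    x, y = v
    return np.array([[2 * x, 2 * y],
                     [1.0, -1.0]])

v = np.array([1.0, 0.5])                         # initial guess
for _ in range(10):
    v = v - np.linalg.solve(fprime(v), f(v))     # (7.8) via a linear solve

print(v)   # approximately [0.70710678, 0.70710678]

Note that instead of computing the inverse matrix in (7.8) one solves a linear system in each step; this is cheaper and numerically more stable.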
For the equations in (7.7), the iteration scheme (7.8) becomes
$$\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} x_k \\ y_k \end{pmatrix} - \begin{pmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_1}{\partial y} \\ \dfrac{\partial f_2}{\partial x} & \dfrac{\partial f_2}{\partial y} \end{pmatrix}^{-1} \begin{pmatrix} f_1(x_k, y_k) \\ f_2(x_k, y_k) \end{pmatrix}, \tag{7.10}$$
where the partial derivatives are evaluated at $(x_k, y_k)$.
Verify the claim in (7.10) by applying (7.8) to the system in (7.7).
Carry out sufficiently many iterations, starting from an initial vector, to see the iteration stabilize. You should do this using a computer, for example by modifying the Sage code in the last half of Example 7.18.

7.5 Local extrema in several variables

For a function $f : U \to \mathbb{R}$, where $U \subseteq \mathbb{R}^n$, the derivative at $x_0$ is called the gradient for $f$ at $x_0$. Classically, it is denoted $\nabla f(x_0)$, i.e.,
$$\nabla f(x_0) = \left( \frac{\partial f}{\partial x_1}(x_0), \dots, \frac{\partial f}{\partial x_n}(x_0) \right).$$
The definition below is inspired by the one variable case (see Definition 6.34 ).
Let $f : U \to \mathbb{R}$ be a function, where $U \subseteq \mathbb{R}^n$ is an open subset. Suppose that the partial derivatives of $f$ exist at $x_0 \in U$. Then $x_0$ is called a critical point for $f$ if $\nabla f(x_0) = 0$.
Consider the function given by
corresponding to finding critical points for the function
You can left click and hold the graph computed below (after it has rendered) and rotate the surface to get a feeling for what (7.11) looks like. Zooming in is also possible.
Here
In the Sage code below, Newton's method is started at and iterated four times.
If $x_0$ is not a critical point for $f$ we can use the gradient vector to move in a direction making $f$ strictly smaller/larger. This is very important for optimization problems.
Let $f : U \to \mathbb{R}$ be a differentiable function, where $U \subseteq \mathbb{R}^n$ is an open subset. Suppose that $x_0 \in U$ and $\nabla f(x_0) \neq 0$. Then
$$f\big(x_0 - \gamma\, \nabla f(x_0)^T\big) < f(x_0)$$
for $\gamma > 0$ small.
By the differentiability of $f$,
$$f(x_0 + h) = f(x_0) + \nabla f(x_0)\, h + \epsilon(h)\, |h|,$$
where $\epsilon$ is a function satisfying $\epsilon(h) \to 0$ for $h \to 0$. For $h = -\gamma\, \nabla f(x_0)^T$ with $\gamma > 0$ we have
$$f\big(x_0 - \gamma\, \nabla f(x_0)^T\big) = f(x_0) - \gamma\, |\nabla f(x_0)|^2 + \epsilon\big(-\gamma\, \nabla f(x_0)^T\big)\, \gamma\, |\nabla f(x_0)|.$$
When $\gamma$ tends to zero from the right, it follows that $f(x_0 - \gamma \nabla f(x_0)^T) < f(x_0)$ for $\gamma > 0$ small.
Lemma 7.19 looks innocent, but it is the bread and butter in the training of neural networks. In mathematical terms, training means minimizing a function. In machine learning terms, $\gamma$ above is called the learning rate. One iteration
$$x_1 = x_0 - \gamma\, \nabla f(x_0)^T$$
(why do I choose the transpose here?) of Lemma 7.19 is the central ingredient in an epoch in training a neural network.
Let us briefly pause and see Lemma 7.19 in action. Consider the function $f$ given in the example, at a point $x_0$ with $\nabla f(x_0) \neq 0$. Then we may find a small $\gamma > 0$, such that $f(x_0 - \gamma \nabla f(x_0)^T) < f(x_0)$. How do we choose $\gamma$ optimally? If $\gamma$ is too big we fail and end up in a worse point than $x_0$. Here
$$g(\gamma) = f\big(x_0 - \gamma\, \nabla f(x_0)^T\big)$$
is a quadratic polynomial in $\gamma$, which is minimal for a unique $\gamma^* > 0$. Therefore the minimal value reached in the direction of $-\nabla f(x_0)^T$ is $g(\gamma^*)$. The process now continues replacing $x_0$ by $x_1 = x_0 - \gamma^* \nabla f(x_0)^T$.
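Here is a small numpy sketch of this idea of an optimal step size, for the made-up quadratic function $f(x, y) = x^2 + 2y^2$ (minimum at the origin), written as $f(v) = \frac{1}{2} v^T A v$:

import numpy as np

# Gradient descent with exact line search for f(v) = 0.5 * v^T A v.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])

def grad(v):
    return A @ v

v = np.array([2.0, 1.0])
for _ in range(20):
    g = grad(v)
    # f(v - gamma * g) is a quadratic polynomial in gamma,
    # minimized at gamma = (g.g) / (g.Ag).
    gamma = (g @ g) / (g @ A @ g)
    v = v - gamma * g

print(v)   # very close to [0, 0]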
The result below is the multivariable generalization of looking for local extrema by solving $f'(x) = 0$ in the one variable case.
Let $f : U \to \mathbb{R}$ be a differentiable function, where $U \subseteq \mathbb{R}^n$ is an open subset. If $x_0 \in U$ is a local extremum, then $x_0$ is a critical point for $f$.
Suppose that $\nabla f(x_0) \neq 0$. If $x_0$ is a local minimum, then we may use Lemma 7.19 to deduce that $f(x_0 - \gamma \nabla f(x_0)^T) < f(x_0)$ for $\gamma > 0$ small. This contradicts the local minimality of $x_0$. If $x_0$ is a local maximum we can apply Lemma 7.19 with $-f$ to reach a similar contradiction. Therefore $\nabla f(x_0) = 0$ and $x_0$ is a critical point for $f$.
Compute the critical points of
Is it a local maximum or a local minimum for $f$?
Hint
Look at the behavior of $f$ along suitable lines through the critical point (along with Theorem 6.47).
We will prove later that a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is strictly convex if the so-called Hessian matrix given by
$$H(x) = \left( \frac{\partial^2 f}{\partial x_i \partial x_j}(x) \right)_{1 \le i, j \le n}$$
is positive definite for every $x$. This is a multivariable generalization of the fact that $f : \mathbb{R} \to \mathbb{R}$ is strictly convex if $f''(x) > 0$ for every $x$.
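In practice one can check positive definiteness of a (symmetric) Hessian numerically through its eigenvalues, which must all be positive. A sketch for the made-up function $f(x, y) = x^2 + xy + y^2$, whose Hessian happens to be constant:

import numpy as np

# Hessian of the made-up f(x, y) = x^2 + x*y + y^2.
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues = np.linalg.eigvalsh(H)    # eigvalsh is for symmetric matrices
print(eigenvalues)                     # [1. 3.] -- all positive
print(bool(np.all(eigenvalues > 0)))   # True: H is positive definite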
Now let
3D graph
You can left click the surface computed below after it has rendered and rotate or zoom in.
  1. Show that $f$ is strictly convex.
  2. Compute the critical point(s) of $f$.
    Hint
    This is a numerical computation! Modify the relevant Sage window for Newton's method in the previous chapter to do it.
  3. For a differentiable convex function $f$ we have in general that
$$f(x) \ge f(x_0) + \nabla f(x_0)\,(x - x_0) \tag{7.13}$$
    for every $x, x_0$. This is a multivariable generalization of Theorem 6.58.
    Explain how one can use (7.13) to find a global minimum for the function in (7.12) . Is this minimum unique? Is for every ?

7.6 The chain rule

Suppose you want to compute the value of a composition of functions at a given point. Then you would start by evaluating the inner function, then applying the next function and finally the outermost function. This computation can be illustrated in the (computational) graph below,
where you plug the input into the leftmost node and fill in each node taking input from its left neighbor.
Suppose we want to compute the derivative of the composition at the given point. Can we use the computational graph for this?
Recall the chain rule for functions of one variable. Here we have functions $f$ and $g$, such that the composition $g(f(x))$ makes sense. If $f$ is differentiable at $x_0$ and $g$ is differentiable at $f(x_0)$, the chain rule says that $g \circ f$ is differentiable at $x_0$ with
$$(g \circ f)'(x_0) = g'(f(x_0))\, f'(x_0).$$
The chain rule tells us that
This expression involves three derivatives corresponding to the three edges in the computational graph. We can illustrate the chain rule by labeling each edge with the derivative of its end node:
Then the derivative can be computed as the product of the labels evaluated on their left nodes in the filled-in computational graph:
This observation is the basis of the famous backpropagation rule used in training neural networks.
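To see this in numbers, here is a small Python sketch of a forward and backward pass through the made-up chain $x \mapsto u = x^2 \mapsto v = \sin(u) \mapsto w = e^v$:

import math

x = 2.0

# Forward pass: fill in each node from its left neighbor.
u = x**2
v = math.sin(u)
w = math.exp(v)

# Backward pass: multiply the edge labels (local derivatives),
# each evaluated on the value in its left node.
dw_dv = math.exp(v)    # derivative of exp at v
dv_du = math.cos(u)    # derivative of sin at u
du_dx = 2 * x          # derivative of x^2 at x

print(dw_dv * dv_du * du_dx)   # the derivative of exp(sin(x^2)) at x = 2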
The chain rule for functions of one variable generalizes verbatim to functions of several variables:
$$(g \circ f)'(x_0) = g'(f(x_0))\, f'(x_0) \tag{7.14}$$
for compatible multivariate functions $f$ and $g$, when you replace usual multiplication by matrix multiplication.
Let $f : U \to \mathbb{R}^m$ and $g : V \to \mathbb{R}^k$ with $U \subseteq \mathbb{R}^n$, $V \subseteq \mathbb{R}^m$ open subsets and $f(U) \subseteq V$. If $f$ is differentiable at $x_0$ and $g$ is differentiable at $f(x_0)$, then $g \circ f$ is differentiable at $x_0$ with
$$(g \circ f)'(x_0) = g'(f(x_0))\, f'(x_0).$$
The proof of the chain rule in this general setting uses Definition 7.5 just as in the one variable case. It is not conceptually difficult, but severely cumbersome. We will not give it here.

7.6.1 Matrix multiplication graphically

To really understand the chain rule, it pays to view the matrix multiplication in (7.14) in a new light (inspired by computer science and neural networks).
An $m \times n$ matrix $A = (a_{ij})$ is a rectangular table with $m$ rows and $n$ columns containing numbers. We may also view it as a (bipartite) graph with $n$ left nodes, $m$ right nodes and an edge from the $j$-th left node to the $i$-th right node with weight $a_{ij}$. This is best illustrated by an example, which also tells you how matrix multiplication is (beautifully) interpreted in this setting.
The matrix $A$
is represented below as a graph
Similarly the matrix $B$
is represented as
You know that the matrix product $BA$ is again a matrix. Let us line $A$ and $B$ up graphically:
There are three paths from to and three paths from to . Here are the three paths from to :
Finally, the matrix product is represented by the graph
The number on the edge from a left node to a right node is obtained by adding the products of the weights on each of the three paths connecting them, i.e., the $(i, j)$-th entry of $BA$ is $\sum_k b_{ik}\, a_{kj}$. This is the graphical interpretation of matrix multiplication!
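A quick numpy check of the path interpretation, with two made-up matrices (here there are two paths between each pair of outer nodes):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # 3 left nodes -> 2 middle nodes
B = np.array([[7.0, 8.0],
              [9.0, 10.0]])        # 2 middle nodes -> 2 right nodes

# Entry (i, j) of B @ A: sum over all paths from left node j
# through a middle node k to right node i.
i, j = 0, 1
paths = sum(B[i, k] * A[k, j] for k in range(2))
print(paths, (B @ A)[i, j])   # the same number (54.0)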

7.6.2 Unpacking the chain rule

The matrix multiplication in (7.14) looks deceivingly simple. Let us write it out. Assume for simplicity that $g$ is a function in the variables $y_1, \dots, y_m$ and that $f$ is a function in the variables $x_1, \dots, x_n$:
$$f(x_1, \dots, x_n) = \begin{pmatrix} f_1(x_1, \dots, x_n) \\ \vdots \\ f_m(x_1, \dots, x_n) \end{pmatrix}.$$
Then $g \circ f$ is a function in the variables $x_1, \dots, x_n$:
$$(g \circ f)(x_1, \dots, x_n) = g\big(f_1(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)\big),$$
and we wish to compute $(g \circ f)'$, which is a $k \times n$ matrix with entries
$$\frac{\partial (g \circ f)_i}{\partial x_j},$$
where $i$ represents the rows and $j$ the columns. Here (7.14) says that
$$\frac{\partial (g \circ f)_i}{\partial x_j} = \sum_{l=1}^{m} \frac{\partial g_i}{\partial y_l}\big(f(x)\big)\, \frac{\partial f_l}{\partial x_j}(x).$$
When using the chain rule in computations it pays to use the graphical interpretation of matrix multiplication in subsection 7.6.1 with edges labeled by the derivatives in a computational graph. We illustrate this below.
The function given by
may be evaluated using the computational graph
where is the function and is the function . Similar to the one variable case discussed in the beginning of section 7.6 , we label each edge, but now by the partial derivative of the function in its ending node with respect to the variable in its beginning node:
From the graphical interpretation of the matrix product and the chain rule you follow the two paths from the input node to the output node and conclude that
Here is another example of the chain rule in action through a computational graph. In the end you will see an implementation in a famous python library.
Consider the function from Example 7.9. Even though it superficially looks rather simple, it is composed of several smaller functions as displayed in the computational graph
Every node in the above graph, except the input nodes (with no ingoing arrows), represents some function . For example the node represents a function and represents a function .
To emphasize that the non-input nodes really are functions we replace them by letters:
Here we see that
where
The gradient is then available from the decorated graph below
by multiplying the decorations on each path from the top to the input variable and then summing up. For example,
Computational graphs and the chain rule are important components in machine learning libraries. Below is an example of the computation of a gradient in the computational graph above using the pytorch library.
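The interactive window is not reproduced here; the sketch below shows the same kind of pytorch computation for a stand-in function $f(x, y) = xy + \sin(x)$ (not the actual function from Example 7.9):

import torch

# Stand-in function: f(x, y) = x*y + sin(x).
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)

f = x * y + torch.sin(x)   # the forward pass builds the computational graph
f.backward()               # backpropagation through the graph

print(x.grad)   # df/dx = y + cos(x) = 2 + cos(1)
print(y.grad)   # df/dy = x = 1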
Construct a computational graph for
and detail the computation of the gradient in this context.
Compute the gradient of the function at the given point using pytorch.
Consider functions $f$ and $g$ given by the formulas in the exercise. Compute $(g \circ f)'$ using the chain rule and check the result with an explicit computation of the derivative of $g \circ f$.
We wish to show that the function $f$ given in the exercise is convex. This means that we need to prove that
$$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$
for every $u, v$ and every $\lambda$ with $0 \le \lambda \le 1$. This can be accomplished from the one variable case in the following way. Define
$$g(t) = f(u + t(v - u))$$
and show that $g$ is convex by using the chain rule to show that $g''(t) \ge 0$. Show how the convexity of $f$ follows from this by using that
$$f(\lambda u + (1 - \lambda) v) = g(1 - \lambda).$$

7.7 Logistic regression

The beauty of the sigmoid function
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
is that it takes any value $t \in \mathbb{R}$ and turns it into a probability $\sigma(t) \in (0, 1)$,
i.e., $\sigma(t) \to 0$ for $t \to -\infty$ and $\sigma(t) \to 1$ for $t \to \infty$.
Graph of the sigmoid function
Prove that
$$\sigma'(t) = \sigma(t)\,(1 - \sigma(t))$$
and
$$1 - \sigma(t) = \sigma(-t).$$
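A quick numeric sanity check (not a proof) of the two identities:

import math

def sigma(t):
    return 1 / (1 + math.exp(-t))

t, h = 0.7, 1e-6
numeric_derivative = (sigma(t + h) - sigma(t - h)) / (2 * h)
print(numeric_derivative, sigma(t) * (1 - sigma(t)))   # nearly equal
print(1 - sigma(t), sigma(-t))                         # equal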
We will not go into all the details (some of which can be traced to introductory probability and statistics), but suppose that we have an outcome $A$, which may or may not happen.
We have an idea that the probability of $A$ depends on certain parameters $\beta_0, \beta_1, \dots, \beta_d$ and observations $x_1, \dots, x_d$, which fit into the sigmoid function as
$$P(A) = \sigma(\beta_0 + \beta_1 x_1 + \cdots + \beta_d x_d). \tag{7.15}$$
An example of this could be where $x_1, \dots, x_d$ denote the gray scales of the pixels in an image and the event $A$ is whether the image contains a given digit. Here $P(A)$ would be the probability that the image contains that digit.
7.7.1 Estimating the parameters

Suppose also that we have a table of observations (data set)
$$\begin{array}{cccc|c} x_{11} & x_{12} & \cdots & x_{1d} & y_1 \\ x_{21} & x_{22} & \cdots & x_{2d} & y_2 \\ \vdots & & & \vdots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} & y_N \end{array} \tag{7.16}$$
where each row has observations $x_{i1}, \dots, x_{id}$ along with a binary variable $y_i$, which is $1$ if $A$ was observed to occur and $0$ if not.
Assuming that (7.15) holds, the probability of observing the observations in (7.16) is
$$L = \prod_{i=1}^{N} p_i^{y_i}\, (1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_d x_{id}). \tag{7.17}$$
Notice that (7.17) is a function of the parameters $\beta_0, \dots, \beta_d$ for fixed observations.
We wish to choose the parameters so that $L$ is maximized (this is called maximum likelihood estimation). So we are in fact dealing with an optimization problem here, which is usually solved by gradient descent (for a large number of parameters) or by solving the equations
$$\frac{\partial L}{\partial \beta_0} = \cdots = \frac{\partial L}{\partial \beta_d} = 0.$$
Instead of maximizing $L$ one usually maximizes the logarithm
$$\log L = \sum_{i=1}^{N} \big( y_i \log p_i + (1 - y_i) \log(1 - p_i) \big).$$
Notice that we have used Exercise 7.31 and the logarithm rules $\log(ab) = \log a + \log b$ and $\log(a^b) = b \log a$ in the computation above.
Suppose that the event $A$ is assumed to be dependent on only one observation, i.e., $d = 1$ above. For example, $A$ could be the event of not showing up on a Monday, paired with the amount of sleep during the weekend.
Here
and
Explain how the end result of the computation in Example 7.32 is obtained and compute the gradient.
I remember exactly where I was when first hearing about the Challenger disaster in 1986. (See byuistats.github.io for more details on this example.)
This dreadful event was caused by the failure of a so-called O-ring. The O-rings had been tested before the launch for failure (= 1 below) at different temperatures (in °F), resulting in the (partial) table below.
On the morning of the launch the outside temperature was uncharacteristically low for Florida. We wish to use logistic regression to compute the probability that an O-ring fails at this temperature.
Below we have sketched how the logistic regression is carried out using the Python library SciKit-Learn. The option solver='lbfgs' chooses an algorithm for maximizing the log likelihood.
Press the Compute button and see the probability of failure during the launch.
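For readers without the interactive window, here is a self-contained sketch along the same lines. The data arrays below are placeholders, not the actual O-ring table, and 31 is used as a stand-in for the low launch-morning temperature:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data (NOT the actual O-ring table):
# temperatures in Fahrenheit and a binary failure indicator.
X = np.array([[53.0], [57.0], [63.0], [66.0], [70.0], [75.0], [79.0], [81.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

model = LogisticRegression(C=25, solver='lbfgs')
model.fit(X, y)

# Probability of failure (class 1) at a low temperature.
print(model.predict_proba([[31.0]])[0][1])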
LLM
💬

Explain the function LogisticRegression in sklearn. In particular, what do the
parameters in

model = LogisticRegression(C=25, solver='lbfgs')
model.fit(X,y)

mean?

In the button below is a naive implementation of gradient descent (in fact gradient ascent, because we are dealing with a maximization problem) for the Challenger data set and logistic regression. The implementation is derived from the introduction to gradient descent in this chapter, where we adjusted the step with successive negative powers of $2$.
Run experiments with different initial values and number of iterations. Compare with the official output from scikit-learn in the example above. What is going on?
Also try adjusting the scikit-learn output in the example above by removing C=25 first and then solver='lbfgs'. What happens? Compare the quality of the solutions in terms of the gradient (which is available in the output from the Naive code).
Yes, you are allowed (and encouraged) to use generative AI tools here!
Naive code

7.8 3Blue1Brown

Sit back and enjoy the masterful presentations of neural networks (and the chain rule) by the YouTuber 3Blue1Brown.

7.8.1 Introduction to neural networks

7.8.2 Gradient descent

7.8.3 Backpropagation and training

7.8.4 The chain rule in action

Watch the video above before solving this exercise.
Consider the simple neural network
where
and $\sigma$ is the sigmoid function. This neural network has an input and an output. Let $C$ be a function of the output. For a fixed input, we consider $C$ as a function of the weights of the network.
Backpropagation for training neural networks is using the chain rule for computing the gradient of $C$ with respect to the weights.
Explain how to do this.

7.9 Lagrange multipliers

The method of Lagrange multipliers is a super classical way of solving optimization problems with non-linear (equality) constraints. We will only consider the special case
$$\max / \min\; f(x) \quad \text{subject to} \quad g(x) = 0, \tag{7.18}$$
where both $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}$ are differentiable functions.
There is a very useful trick for attacking (7.18). One introduces an extra variable $\lambda$ (a Lagrange multiplier) and the Lagrangian function $L$ given by
$$L(x, \lambda) = f(x) - \lambda\, g(x).$$
The main result is the following.
Suppose that $x_0$ is a local maximum/minimum for (7.18). Then there exists $\lambda_0 \in \mathbb{R}$, such that $(x_0, \lambda_0)$ is a critical point for $L$.
So to solve (7.18) we simply (well, this is not always so simple) look for critical points for $L$. This amounts to solving the (non-linear) equations coming from $\nabla L = 0$, i.e.,
$$\frac{\partial f}{\partial x_i}(x) = \lambda\, \frac{\partial g}{\partial x_i}(x) \quad (i = 1, \dots, n) \qquad \text{and} \qquad g(x) = 0.$$
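With a computer algebra system the critical point equations can often be solved directly. A sympy sketch for the made-up problem of optimizing $f(x, y) = x + y$ subject to $x^2 + y^2 = 1$:

import sympy as sp

# Made-up problem: optimize f(x, y) = x + y subject to x^2 + y^2 = 1.
x, y, lam = sp.symbols('x y lam', real=True)
f = x + y
g = x**2 + y**2 - 1
L = f - lam * g                                     # the Lagrangian

equations = [sp.diff(L, v) for v in (x, y, lam)]    # grad L = 0
print(sp.solve(equations, (x, y, lam), dict=True))  # two candidate points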
For $n = 2$ we can quickly give a sketch of the idea behind the proof. The (difficult) fact is that we may find a differentiable function $\gamma(t) = (x(t), y(t))$ in one variable $t$, such that
$$g(\gamma(t)) = 0$$
and the local minimum has the form $\gamma(0) = (x_0, y_0)$.
Once we have this, the chain rule does its magic. We consider the one variable functions
$$t \mapsto f(\gamma(t)) \qquad \text{and} \qquad t \mapsto g(\gamma(t)).$$
For both of these the derivative at $t = 0$ is $0$ (why?). The chain rule now gives a non-zero vector $\gamma'(0)$ orthogonal to both $\nabla f(x_0, y_0)$ and $\nabla g(x_0, y_0)$. This is only possible if they are parallel as vectors, i.e., there exists $\lambda$, such that
$$\nabla f(x_0, y_0) = \lambda\, \nabla g(x_0, y_0).$$
Consider the minimization problem
First of all, why does this problem have a solution at all? We write up the non-linear equations coming from the critical points of the Lagrange function. Now we know that these can be solved and that amongst our solutions there is a minimum!
Computing the distance from the line to the point gives rise to the minimization problem
Solve this minimization problem using Theorem 7.37 .
Use Theorem 7.37 to maximize subject to .
Hint
Here you end up with the system
of linear equations in $x$ and $y$, where you regard $\lambda$ as a constant. Use Gaussian elimination to solve this system in order to derive a (nice) quadratic equation in $\lambda$ coming from the constraint, where you assume a certain expression is non-zero. Handle the remaining case separately.
Consider the subset defined by the equation (7.21). Why is it a closed subset? Why is it bounded?
Hint
To prove that the subset is bounded you can keep one of the variables fixed in (7.21) and solve for the other. A last resort is using the plot in Sage in the Hint button below, but that does not give any real insight unless you explain how Sage makes the plot from the equation (7.21).
How does this relate to Theorem 5.66 ?
Does the optimization problem have a geometric interpretation?
Hint
A rectangular box has side lengths $x$, $y$ and $z$. What is its maximal volume when we assume that $(x, y, z)$ lies on the given plane, for $x, y, z > 0$?
A company is planning to produce a box with volume . For design reasons it needs different materials for the sides, top and bottom. The cost of the materials per square meter is dollar for the sides, dollars for the bottom and the top. Find the measurements of the box minimizing the production costs.
Hint
Let $x$, $y$ and $z$ be the measurements. Use the volume constraint to rewrite the Lagrange equations so that $x$ and $y$ are expressed in terms of $z$.
The sum
$$H = -(p_1 \log p_1 + \cdots + p_n \log p_n)$$
is called the (Shannon) entropy of the discrete probability distribution $p_1, \dots, p_n$. One may use Jensen's inequality applied to the convex function $x \mapsto x \log x$ to prove that
$$H \le \log n.$$

7.10 Optimization using the interior and boundary of a subset

Suppose that $K \subseteq \mathbb{R}^n$ is a closed subset and $f : K \to \mathbb{R}$ is a continuous function. Recall (see Theorem 5.66) that the optimization problem
$$\max / \min\; f(x), \quad x \in K$$
always has a solution if, in addition to being closed, $K$ is also bounded. To solve such an optimization problem, it often pays to decompose $K$ as
$$K = \partial K \cup K^\circ,$$
where $\partial K$ is the boundary of $K$ (recall Definition 5.44) and $K^\circ$ the interior of $K$ (recall Definition 5.46). The strategy is then to look for an optimal solution both in $\partial K$ and $K^\circ$ and then compare these. In some sense we are making a "recursive" call to a lower dimensional optimization problem for the boundary $\partial K$. This is illustrated by the basic example $K = [a, b] \subseteq \mathbb{R}$. Here $K^\circ = (a, b)$ and $\partial K = \{a, b\}$. Notice that $\partial K$ is finite here.
If $x$ is an element of $K^\circ$, then there exists an open subset $U \subseteq K^\circ$ with $x \in U$. Therefore the following proposition holds, when you take Proposition 7.21 into account.
Consider an optimization problem
$$\max / \min\; f(x), \quad x \in K, \tag{7.22}$$
where $K \subseteq \mathbb{R}^n$ is a subset, $f$ a differentiable function and $x_0$ an optimal solution to (7.22). If $x_0 \in K^\circ$, then $x_0$ is a critical point of $f$.
Basically, to solve an optimization problem like (7.22) one needs to consider the boundary and interior as separate cases. For points on the boundary we cannot use the critical point test in Proposition 7.21 . This test only applies to the interior points. Usually the boundary cases are of smaller dimension and easier to handle as illustrated in the example below.
Consider the minimization problem
from Example 7.38 . Let us modify it to
where
$$K = \{ (x, y) \in \mathbb{R}^2 \mid x^2 + y^2 \le 1 \}.$$
We are now minimizing not only over the unit circle, but over the whole unit disk.
Proposition 7.44 guides us first to look for optimal points in the interior $K^\circ$. Here we use Proposition 7.21 to show that there can be no optimal points in $K^\circ$, because the gradient of the function never vanishes there.
Therefore the boundary $\partial K$ needs to be analyzed, and the usual technique (as was implicit in Lagrange multipliers) is to find a parametrization of the points satisfying $x^2 + y^2 = 1$. There are two of those (one for the upper unit circle and one for the lower unit circle):
$$(x, \sqrt{1 - x^2}) \qquad \text{and} \qquad (x, -\sqrt{1 - x^2}),$$
where $-1 \le x \le 1$. This means that the optimization problem for the boundary turns into the two simpler optimization problems of minimizing
$$f(x, \sqrt{1 - x^2}) \qquad \text{and} \qquad f(x, -\sqrt{1 - x^2})$$
subject to $-1 \le x \le 1$. These can be solved the usual way as one variable optimization problems.
The exercises below are taken from an older Calculus course at Aarhus.
Solve the two optimization problems
where . But first give a reason as to why they both are solvable.
Hint
First find the interior $K^\circ$ and the boundary $\partial K$. Then try with Proposition 7.44, supposing that a maximal point really is to be found in $K^\circ$ and not on $\partial K$.
Solve the two optimization problems
where . But first give a reason as to why they both are solvable.
Solve the two optimization problems
where is the triangle with vertices in and . But first give a reason as to why they both are solvable.
Use Proposition 7.44 to give all the minute details in applying Theorem 7.37 to solve Exercise 7.42 .
First rewrite to the problem where you minimize over two of the variables, using the constraint to eliminate the third. Then explain why this problem may be solved by restricting to a region with upper and lower bounds on the two remaining variables. The minimum is attained in a critical point and not on the boundary. One may optimize over a compact subset
and analyze what happens as the bounds are relaxed.