5 Euclidean vector spaces

Big data consist of many numbers organized in data sets. Such data sets can be represented as vectors in a high dimensional euclidean vector space. A vector is nothing but a list of numbers, but we need to talk mathematically about the size of a vector and perform operations on vectors. The term euclidean refers to vector spaces equipped with a dot product, as known from the plane.
The purpose of this chapter is to set the stage for this, especially by introducing the dot product (or inner product) for general vectors. Having a dot product is immensely useful, and we give several applications such as linear regression and the perceptron learning algorithm.
In the last part of the chapter, the rudimentary basics of analysis are introduced: sequences, continuous functions, and open, closed and compact subsets. Some results will in this context only be quoted and not proved.

5.1 Vectors in the plane

The dot product (or inner product) between two vectors $u$ and $v$ in the plane is given by
$$u \cdot v = u_1 v_1 + u_2 v_2, \tag{5.1}$$
where
$$u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \quad\text{and}\quad v = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}.$$
We may also interpret $u$ and $v$ as matrices (or column vectors). Then the dot product in (5.1) may be realized as the matrix product:
$$u \cdot v = u^T v = \begin{pmatrix} u_1 & u_2 \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}.$$
The length or norm of the vector $u$ is given by
$$|u| = \sqrt{u \cdot u} = \sqrt{u_1^2 + u_2^2}.$$
This follows from the Pythagorean theorem:
Also, the cosine of the angle $\theta$ between $u$ and $v$ is given by
$$\cos \theta = \frac{u \cdot v}{|u|\,|v|}.$$
We will not go into this formula. It is a byproduct of considering the projection of a vector on another vector (see Exercise 5.5).

5.2 Higher dimensions

The notions of dot product, norm and the formula for cosine of the angle generalize immediately to vectors in dimensions higher than two.
We denote the set of column vectors with $n$ rows by $\mathbb{R}^n$ and call it the euclidean vector space of dimension $n$. An element $v \in \mathbb{R}^n$ is called a vector and it has the form (column vector with $n$ entries)
$$v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}.$$
A vector in $\mathbb{R}^n$ is a model for a data set in real life: a collection of numbers, which could signify measurements. You will see an example of this below, where a vector represents a data set counting words in a string.
Being column vectors, vectors in $\mathbb{R}^n$ can be added and multiplied by numbers:
The dot product generalizes as follows to higher dimensions.

5.2.1 Dot product and norm

The dot product of
$$u = \begin{pmatrix} u_1 \\ \vdots \\ u_n \end{pmatrix}, \qquad v = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} \in \mathbb{R}^n$$
is defined by
$$u \cdot v = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n. \tag{5.3}$$
The norm of $v \in \mathbb{R}^n$ is defined by
$$|v| = \sqrt{v \cdot v} = \sqrt{v_1^2 + \cdots + v_n^2}. \tag{5.4}$$
A vector $u$ with $|u| = 1$ is called a unit vector.
Two vectors $u, v \in \mathbb{R}^n$ are called orthogonal if $u \cdot v = 0$. We write this as $u \perp v$.
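These definitions are easy to experiment with on a computer. Here is a small numpy sketch (not part of the original notes) computing a dot product, a norm and checking orthogonality; the vectors are made up for illustration.

```python
# A small numpy sketch (not from the notes) illustrating dot product, norm and orthogonality.
import numpy as np

u = np.array([1.0, 2.0, 2.0])
v = np.array([2.0, 1.0, -2.0])

print(u @ v)                  # dot product u . v = 1*2 + 2*1 + 2*(-2) = 0
print(np.linalg.norm(u))      # norm |u| = sqrt(1 + 4 + 4) = 3
print(u @ v == 0)             # True: u and v are orthogonal
print(u / np.linalg.norm(u))  # a unit vector in the direction of u
```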
Show that
where .
Use the definition in (5.3) to show that
for and .
Let be a nonzero vector and . Use the definition in (5.4) to show that and that
is a unit vector.
You could perhaps use Exercise 5.3 to do this. Notice also that is the absolute value for if .
Given two vectors with , find , such that and are orthogonal, i.e.
This is an equation, where is unknown!
For , it is sketched below that if and are orthogonal, then and are the sides in a right triangle.
In this case, if is the angle between and , show that
Use this to show that
Finally show that
where and are two angles.
In the last question, you could use that the vectors
are unit vectors.
Given two vectors , solve the minimization problem
First convince yourself that minimizes if and only if it minimizes
which happens to be a quadratic polynomial in .
Let denote the distance from to the line through and . What is true about ?
Mentimeter: Distance from point to line

5.2.2 The dist formula from high school

The infamous dist formula from high school says that the distance from the point $(x_0, y_0)$ to the line given by $ax + by + c = 0$ is
$$\mathrm{dist} = \frac{|a x_0 + b y_0 + c|}{\sqrt{a^2 + b^2}}. \tag{5.5}$$
Where does this magical formula come from? Consider a general line in parametrized form (see Definition 4.9)
If , then the distance from to is given by the solution to the optimization problem
This looks scary, but simply boils down to finding the vertex of a parabola. The solution is
and the point on closest to is .
Now we put (see Example 4.10)
in order to derive (5.5). The solution to (5.6) becomes
We must compute the distance from to in this case. The distance squared is
This is a mouthful and I have to admit that I used symbolic software (see below) to verify that
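The symbolic software cell referred to above is not reproduced in this text. As a stand-in, here is a sketch in sympy, assuming the line is written as $ax + by + c = 0$ with $b \neq 0$ and the point as $(x_0, y_0)$; it carries out the minimization over the parametrized line and compares the result with the dist formula.

```python
# A sketch (not the notes' original cell) verifying the dist formula with sympy.
# Assumption: the line is ax + by + c = 0 (with b != 0) and the point is (x0, y0).
import sympy as sp

a, b, c, x0, y0, t = sp.symbols('a b c x0 y0 t', real=True)

# Parametrize the line: starting point P plus t times a direction vector.
# For ax + by + c = 0 we may use P = (0, -c/b) and direction (b, -a).
px, py = sp.Integer(0), -c/b
dx, dy = b, -a

# Squared distance from (x0, y0) to a point on the line.
dist2 = (px + t*dx - x0)**2 + (py + t*dy - y0)**2

# Minimize over t (vertex of the parabola in t) and simplify.
t_star = sp.solve(sp.diff(dist2, t), t)[0]
min_dist2 = sp.simplify(dist2.subs(t, t_star))

# Compare with the high school formula squared.
formula2 = (a*x0 + b*y0 + c)**2 / (a**2 + b**2)
print(sp.simplify(min_dist2 - formula2))  # prints 0
```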

5.2.3 The perceptron algorithm

Already at this point we have the necessary definitions for explaining the perceptron algorithm. This is one of the early algorithms of machine learning. It aims at finding a high dimensional line (hyperplane) that separates data organized in two clusters. In terms of the dot product, the core of the algorithm is described below.
A line in the plane is given by an equation
for , where . Given finitely many points each with a label of (or blue and red for that matter), we wish to find a line (given by ), such that if (above the line) and if (below the line). In some cases this is impossible (an example is illustrated below).
A clever approach to finding such a line, if it exists, is to reformulate the problem by looking at the vectors given by
in . Then the existence of the line is equivalent to the existence of a vector with for . If is such a vector, then we have for ,
Therefore we may take and as the line. So in general the following question is interesting.
Given finitely many vectors , can we find , such that
for every ?
Come up with a simple example, where this problem is unsolvable i.e., come up with vectors , where such an does not exist.
Hint
Try out some simple examples for and .
In case exists, the following ridiculously simple algorithm works in computing . It is called the perceptron (learning) algorithm.
  1. Begin by putting $x = 0$.
  2. If there exists $i$ with $x \cdot v_i \le 0$, then replace $x$ by $x + v_i$ and repeat this step. Otherwise $x$ is the desired output vector.
Let us try out the algorithm on the simple example of just two points in given by
In this case the algorithm proceeds as pictured below.
It patiently crawls its way ending with the vector , which satisfies and .
Let us see how (5.7) works in a concrete example.
Consider the points
in , where and are labeled by and is labeled by . Then we let
Now we run the simple algorithm above Example 5.9:
From the last vector we see that determines a line separating the labeled points.
Consider the points
in , where the first point is labeled with and the rest by . Use the perceptron algorithm to compute a separating hyperplane.
What happens when you run the perceptron algorithm on the above points, but where the label of
is changed from to ?

Why does the perceptron algorithm work?

We will assume that there exists , such that
for every . This is equivalent to the existence of , such that
for every .
Suppose that there exists , such that for every . Show then that there exists , such that
for every .
Let . Show that works.
The basic insight is the following
Let . After iterations of the perceptron algorithm,
These statements follow from the inequalities
and
Proposition 5.13 implies that
Therefore we get and there is an upper bound on the number of iterations used in the second step. At a certain iteration within this bound we must have for every . Below is an implementation of the perceptron (learning) algorithm in python (with numpy) with input from Example 5.10 (it also works in higher dimensions).
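The python (numpy) implementation mentioned above is not reproduced in this text, so here is a minimal sketch of the update step. The reformulation into augmented vectors $v_i = \text{label}_i \cdot (\text{point}_i, 1)$ is an assumption spelled out in the comments, and the placeholder data should be replaced by the points from Example 5.10.

```python
# A minimal sketch of the perceptron step described above (not the notes' original cell).
# The points and labels below are placeholders; replace them with your own labeled points.
import numpy as np

def perceptron(points, labels):
    """points: (m, d) array, labels: array of +1/-1.

    Returns x with x . v_i > 0 for the augmented vectors
    v_i = label_i * (p_i, 1), assuming the points are separable."""
    # Augment each point with a 1 (for the constant term) and multiply by its label.
    V = labels[:, None] * np.hstack([points, np.ones((len(points), 1))])
    x = np.zeros(V.shape[1])
    while True:
        bad = [v for v in V if v @ x <= 0]   # vectors not yet on the right side
        if not bad:
            return x                          # every dot product is positive
        x = x + bad[0]                        # perceptron update: add an offending vector

# Placeholder data (two clusters in the plane).
pts = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0]])
lbs = np.array([1, 1, -1])
print(perceptron(pts, lbs))
```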

5.2.4 Pythagoras and the least squares method

The result below is a generalization of the theorem of Pythagoras about right triangles to higher dimensions.
If $u, v \in \mathbb{R}^n$ and $u \perp v$, then
$$|u + v|^2 = |u|^2 + |v|^2.$$
This follows from
$$|u + v|^2 = (u + v) \cdot (u + v) = |u|^2 + 2\, u \cdot v + |v|^2 = |u|^2 + |v|^2,$$
since $u \cdot v = 0$.
The dot product and the norm have a vast number of applications. One of them is the method of least squares: suppose that you are presented with a system
$$A x = b \tag{5.8}$$
of linear equations, where $A$ is an $m \times n$ matrix.
You may not be able to solve (5.8). There could for example be more equations than unknowns, making it impossible for all the equations to hold. As an example, the system
of three linear equations and two unknowns does not have any solutions.
The method of (linear) least squares seeks the best approximate solution to (5.8) as a solution to the minimization problem
$$\min\ \{\, |A x - b| \;:\; x \in \mathbb{R}^n \,\}. \tag{5.10}$$
There is a surprising way of finding optimal solutions to (5.10):
If $x$ is a solution to the system
$$A^T A\, x = A^T b \tag{5.11}$$
of $n$ linear equations with $n$ unknowns, then $x$ is an optimal solution to (5.10). If on the other hand $x$ is an optimal solution to (5.10), then $x$ is a solution to (5.11).
Suppose we know that is orthogonal to for every . Then
for every by Proposition 5.14. So, in the case that for every we have
for every proving that is an optimal solution to (5.10).
Now we wish to show that is orthogonal to for every if and only if . This is a computation involving the matrix arithmetic introduced in Chapter 3:
for every if and only if . But
so that .
On the other hand, if for every , then for every : if we could find with , then
for a small number . This follows, since
which is
By picking sufficiently small,
In a future course on linear algebra you will see that the system of linear equations in Theorem 5.15 is always solvable i.e., an optimal solution to (5.10) can always be found in this way.
Mentimeter: Least squares example
Show that (5.9) has no solutions. Compute the best approximate solution to (5.9) using Theorem 5.15.
The classical application of the least squares method is to find the best line through a given set of points
in the plane .
Usually we cannot find a line matching the points precisely. This corresponds to the fact that the system of equations
has no solutions.
Working with the least squares solution, we try to compute the best line in the sense that
is minimized.
Best fit of line to random points from Wikipedia.
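As a concrete illustration, here is a sketch (with made-up data points, not the notes' example) of setting up and solving the system of Theorem 5.15 in numpy, assuming that system is the normal equations $A^T A x = A^T b$; the names a and b for slope and intercept are just the ones chosen here.

```python
# A sketch of fitting the best line y = a*x + b to points via the normal equations
# A^T A x = A^T b. The data points below are made up for illustration.
import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.2, 2.8])

# Each point gives one equation a*x_i + b = y_i; stack them as A @ [a, b] = ys.
A = np.column_stack([xs, np.ones_like(xs)])

# Solve the normal equations for the least squares solution.
a, b = np.linalg.solve(A.T @ A, A.T @ ys)
print(a, b)

# np.linalg.lstsq solves the same minimization problem directly.
print(np.linalg.lstsq(A, ys, rcond=None)[0])
```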
We might as well have asked for the best quadratic polynomial
passing through the points
in .
The same method gives us the system
of linear equations.
Best fit of quadratic polynomial to random points from Wikipedia.
The method generalizes naturally to finding the best polynomial of degree
through a given set of points.
Find the best line through the points and and the best quadratic polynomial through the points and .
It is important here that you write down the relevant system of linear equations according to Theorem 5.15. It is however ok to solve the equations on a computer (or check your best fit on WolframAlpha).
Also, you can get a graphical illustration of your result in the sage window below.
A circle with center and radius is given by the equation
  1. Explain how (5.12) can be rewritten to the equation
    where .
  2. Explain how fitting a circle to the points in the least squares context using (5.13) leads to the system
    of linear equations.
  3. Compute the best circle through the points
    by giving the center coordinates and radius with two decimals. Use the Sage window below to plot your result to see if it matches the drawing.

5.2.5 The Cauchy-Schwarz inequality

Even though the generalizations of the dot product and norm to higher dimensions amount to just adding some coordinates, they entail a rather stunning result called the Cauchy-Schwarz inequality. The proof is not long, but revolves around a rather beautiful trick.
For two vectors $u, v \in \mathbb{R}^n$,
$$|u \cdot v| \le |u|\,|v|.$$
We consider the function $f : \mathbb{R} \to \mathbb{R}$ given by
$$f(t) = |u + t v|^2 = |v|^2\, t^2 + 2 (u \cdot v)\, t + |u|^2.$$
Then $f$ is a quadratic polynomial with $f(t) \ge 0$ for every $t$. Therefore its discriminant must be $\le 0$, i.e.,
$$4 (u \cdot v)^2 - 4\, |u|^2\, |v|^2 \le 0,$$
which gives the result.
The Cauchy-Schwarz inequality implies that
$$-1 \le \frac{u \cdot v}{|u|\,|v|} \le 1$$
for two nonzero vectors $u, v \in \mathbb{R}^n$, and it makes sense to define the angle $\theta$ between these vectors by
$$\cos \theta = \frac{u \cdot v}{|u|\,|v|}. \tag{5.14}$$
For two arbitrary numbers ,
since
Why is
for arbitrary numbers ?
When vectors are interpreted as data sets, the number in (5.14) is known as the cosine similarity and measures the correlation between the two data sets and .
An application could be the similarity between two strings. Consider the two strings "Mathematics is fun and matrices are useful" and "Mathematics is fun and matrices are applicable".
From the words in the two strings we form the following vectors in .
where every word in the two strings has an entry counting the number of occurrences in the string. A measure of the similarity between the two strings is the cosine of the angle between the two vectors.
The closer the cosine gets to $1$ (corresponding to an angle of $0$ degrees), the more similar we consider the strings.
In the above case the cosine similarity is approximately .
Below is a snippet of python code (using numpy) for computing the cosine similarity of two strings, where words are separated by blanks. It can be extended in many ways.
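The snippet itself is not reproduced in this text; the following is a sketch of how such a computation could look in numpy, under the assumption that words are split on blanks exactly as described above.

```python
# A sketch of the cosine similarity computation described above
# (the notes' original cell is not reproduced here). Words are separated by blanks.
import numpy as np

def cosine_similarity(s1, s2):
    # Build a common vocabulary from the two strings.
    words1, words2 = s1.split(), s2.split()
    vocab = sorted(set(words1) | set(words2))
    # Count occurrences of each word in each string.
    u = np.array([words1.count(w) for w in vocab], dtype=float)
    v = np.array([words2.count(w) for w in vocab], dtype=float)
    # Cosine of the angle between the two count vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(
    "Mathematics is fun and matrices are useful",
    "Mathematics is fun and matrices are applicable"))
```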
This application is based on rather basic mathematics, but we do get a quantitative measure for how close two strings are. This is a crude tool applicable for flagging potential plagiarism.
The cosine similarity is crucial in machine learning, especially in NLP. On a more advanced (and modern!) level, we have word embeddings. These are maps from a finite set of words to a potentially very high dimensional euclidean space. One usually requires that two words similar in meaning should map to vectors close to each other. A breakthrough occurred in 2013, when Google introduced the word embedding word2vec.

5.2.6 Distance of vectors and the triangle inequality

We know how to measure the size of a vector $v$ by its norm $|v|$. We need to measure how close two vectors are, i.e., we need to measure their distance. A perfectly good measure for the distance from $u$ to $v$ is the norm
$$|u - v|.$$
You can see from (5.4) that if $|u - v|$ is small, then the coordinates of $u$ and $v$ are close. Also we want $u = v$ if their distance is zero. This is satisfied. Similarly we want the distance from $u$ to $v$ to equal the distance from $v$ to $u$. This is true, since $|w| = |-w|$ for any vector $w$.
Show the above, that $|w| = |-w|$ for any vector $w$. Explain why this implies $|u - v| = |v - u|$ for every $u, v \in \mathbb{R}^n$.
One other, not so obvious property, is the triangle inequality.
For two vectors $u, v \in \mathbb{R}^n$,
$$|u + v| \le |u| + |v|.$$
From the Cauchy-Schwarz inequality (Theorem 5.22) it follows that
$$|u + v|^2 = |u|^2 + 2\, u \cdot v + |v|^2 \le |u|^2 + 2\,|u|\,|v| + |v|^2.$$
Since the right hand side of this inequality is $(|u| + |v|)^2$, the result follows.
Why is this result called the triangle inequality? A consequence is that
$$|u - v| \le |u - w| + |w - v|,$$
i.e., that the distance from $u$ to $v$ is always less than or equal to the distance from $u$ to $w$ plus the distance from $w$ to $v$, where $w$ is a third vector.
In boiled down terms: the length of any one side in a triangle is less than or equal to the sum of the lengths of the two other sides.
Find two typos in the figure above. Correct them!
The triangle inequality implies that
for every .
Show how (5.15) follows from Theorem 5.26.

5.2.7 Bounded subsets

A subset is called bounded if there exists a (potentially very, very big) number , such that
for every . Putting this is the same as
This condition is true if and only if (the norm is bounded)
Every finite subset is bounded (why?). For , (5.17) simply says
This implies that an interval is bounded by putting in (5.17).
Describe geometrically what it means for a subset of resp. to be bounded by using intervals resp. circles (disks).
Show precisely that the subset of is not bounded, whereas the subset is.
Sketch why
is bounded. Now use Fourier-Motzkin elimination to show the same without sketching.

5.3 An important remark about the real numbers

In the beginning of this course, we postulated the existence of the real numbers as an extension of the rational numbers with their ordering .
The rational numbers had the glaring defect that the graph of the function $f : \mathbb{Q} \to \mathbb{Q}$ given by
$$f(x) = x^2 - 2$$
does not intersect the $x$-axis between $1$ and $2$, in spite of the fact that $f(1) < 0$ and $f(2) > 0$.
It seems from the sage plot below that the graph intersects the $x$-axis around $1.414$, but it really does not happen! Your computer and its screen only handle rational numbers.
Surely the most natural property for a well behaved function (like ) is that it must intersect the -axis in a point with if and .
I will not be completely precise about how to repair this defect of the rational numbers, but just state one exceedingly important property of the real numbers in the button below.
In fact this one property guarantees that $\mathbb{R}$ does not have any holes as in the graph above.
Supremum and infimum

5.3.1 Supremum

A subset $X$ of $\mathbb{R}$ is called bounded from above if there exists $b \in \mathbb{R}$, such that $x \le b$ for every $x \in X$. Here $b$ is called an upper bound for $X$.
Give an example of a subset of the real numbers, which is not bounded from above and one that is.
The set of real numbers satisfies that for every subset $X \subseteq \mathbb{R}$ bounded from above, there exists a smallest upper bound denoted $\sup(X)$ called the supremum of $X$. In precise terms,
  1. $x \le \sup(X)$ for every $x \in X$
  2. If we move a little to the left of $\sup(X)$ we encounter elements from $X$: for every $\epsilon > 0$, there exists $x \in X$, such that $x > \sup(X) - \epsilon$
Notice that we may have $\sup(X) \notin X$.

5.3.2 Infimum

In the same way a subset $X$ of $\mathbb{R}$ is called bounded from below, if there exists $a \in \mathbb{R}$, such that $x \ge a$ for every $x \in X$. Every subset bounded from below has a largest lower bound denoted $\inf(X)$ called the infimum of $X$. In precise terms,
  1. $x \ge \inf(X)$ for every $x \in X$
  2. If we move a little to the right of $\inf(X)$ we encounter elements from $X$: for every $\epsilon > 0$, there exists $x \in X$, such that $x < \inf(X) + \epsilon$
Give a simple example of a subset bounded from above, where .
Show that the subset of is bounded from above and below and that and .
Show that is infinite if .

5.4 Sequences and limits in

For the first time in the notes we are now moving towards infinite processes. We will introduce limits of vectors organized in an infinite sequence.
A sequence in is an infinite list of vectors
in , where repetitions are allowed. Such a sequence is denoted .
In order to define a sequence we just need to tell what its -th element is. So in abstract terms a sequence in is nothing but a function .
Below we give two examples of sequences in .
The first sequence is given by and the second for . The first sequence explodes to infinity, whereas the second sequence gets closer and closer to . In the latter case we write
What does it mean that a sequence $x_1, x_2, x_3, \dots$ of vectors has limit $x$? Intuitively, we can get as close to $x$ as we want by choosing $n$ sufficiently big. Here is the precise way of saying this: for every $\epsilon > 0$ there exists $N \in \mathbb{N}$, such that
$$|x_n - x| < \epsilon \tag{5.18}$$
for every $n \ge N$.
If a sequence has a limit , then we write
A sequence is called convergent if it has a limit. Let us see how our new technology works on two intuitively obvious examples.
Let us use (5.18) to give a precise proof of
where . So given any we must find , such that
for . But
So we simply choose to be the smallest natural number bigger than .
An even simpler example is a constant sequence like
i.e., for all . Here we want the limit to be and (5.18) agrees. We can put :
If a sequence is convergent, then it can have only one limit. You can not have a convergent sequence with two different limits! In particular, the constant sequence
cannot converge to .
Give a precise proof of the fact that a convergent sequence can only have one limit using proof by contradiction i.e., start by assuming that it has two different limits . Then show that
cannot be true by showing that
Try in the definition of being a limit and apply (5.15) to
Now that we have the definition of a convergent sequence, we go on to use it in a rather typical proof of a rather typical result. In this (typical) proof we first handle the infinite and then the finite.
A convergent sequence is bounded i.e., there exists , such that for every .
Let denote the limit of . Then for , we may find , such that for . Therefore for by (5.15). Let and then letting , we see that for every .
Proposition 5.39 shows that the sequence
cannot be convergent. Why?
What is the limit of the sequence
It does not have a limit.
Try to prove first that the sequence cannot converge to , by assuming that it does.
Sage may be helpful in computing limits (see below).
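The Sage cell itself is not shown here; as a stand-in, here is a small sympy sketch computing limits of two simple (made-up) sequences.

```python
# A sketch (standing in for the Sage cell referred to above) of computing limits symbolically.
import sympy as sp

n = sp.symbols('n', positive=True)

print(sp.limit(1/n, n, sp.oo))                # 0, as for the sequence 1/n above
print(sp.limit((2*n + 1)/(n + 3), n, sp.oo))  # 2, consistent with Proposition 5.42
```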
For convergent sequences we have the following result.
Let and be convergent sequences in with limits and respectively. Then
  1. the sequence is convergent with limit .
  2. the sequence is convergent with limit (if )
  3. the sequence is convergent with limit provided that and for every (if ).
I will give the proof of (ⅱ.). By definition (see (5.18)) we are given and we must find , such that
for . An old trick shows that
Therefore we may find so that
where and for every (see Proposition 5.39). We are assuming the and are convergent sequences. Therefore we may find and in , so that
Choosing , we get
for .
The proof of (ⅰ.) in Proposition 5.42 is much less involved than the given proof of (ⅱ.) in the same result. In the proof of (ⅱ.) we used a trick using . Use the same trick with and the triangle inequality to prove (ⅰ.).
What is the limit of the sequence given by
It does not have a limit
Consider the sequence given by
Carry out a computer experiment in Sage below to find the limit of . Can you prove what you observe in the experiment?
Assume that is a convergent sequence in . Show that is a convergent sequence in .
Let be a sequence bounded below with the property that
Show that is the limit of .
Similarly let be a sequence bounded above with the property that
Show that is the limit of .
  1. Show that
    for .
  2. Prove that and for .
  3. Start with two numbers and with and define
    where and . Carry out computer experiments in the sage (python) window below to analyze the sequences and for different values of and .
  4. Prove for that
    if .
  5. Let and . Show that the limits exist and that .
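The sage (python) window referred to in the exercise is not included in this text. Below is a sketch of such an experiment; it assumes the recursion $a_{n+1} = (a_n + b_n)/2$, $b_{n+1} = \sqrt{a_n b_n}$, which is the standard arithmetic-geometric mean iteration.

```python
# A sketch of the experiment suggested in the exercise above: iterate the arithmetic
# and geometric means and watch the two sequences squeeze together.
import math

def agm(a, b, steps=8):
    for n in range(steps):
        print(n, a, b)
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a, b

agm(1.0, 2.0)   # the two sequences converge rapidly to a common limit
```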
The common limit of the sequences and is called the arithmetic-geometric mean of and . Just for the fun of it, here is a cool way of computing involving this quantity:
Mentimeter: Harmonic series.

5.4.1 Closed and open subsets

We have defined what it means for a subset of a euclidean space to be bounded. Now we come to an exceedingly important definition about subsets being closed, meaning that they should (in a mathematically precise way) contain their boundary points. For example, we want the interval $[0, 1]$ to be closed, whereas the interval $(0, 1]$ should not be closed, because it is missing its boundary point $0$.
A subset is called closed if it contains all its limit vectors. This means that if is a convergent sequence contained in , then its limit must be contained in .
We can immediately come up with a non-closed subset using the definition. Consider the subset
$$(0, 1] = \{x \in \mathbb{R} \mid 0 < x \le 1\}.$$
Here $(1/n)$ is a convergent sequence, whose elements all are contained in $(0, 1]$, but its limit $0$ is outside (see Example 5.37). We have, however, the following important result relating to this example.
The following subsets
are closed subsets of for every .
Closed subsets are preserved by finite unions and intersections.
Let be finitely many closed subsets of . Then
are closed subsets of .
The complementary notion of a closed subset is an open subset.
A subset $U \subseteq \mathbb{R}^n$ is called open if its complement $\mathbb{R}^n \setminus U$ is closed.
Prove that
are open subsets if are open subsets.
Let for , where . Show that is an open subset of .
In fact, an arbitrary (also infinite) intersection of closed subsets is closed and an arbitrary (also infinite) union of open subsets is open. However, for a first course, introducing intersections and unions over arbitrary families is pushing the envelope.

5.4.2 Infinite series

Given a sequence $(a_n)$ in $\mathbb{R}$ we may form the new sequence $(s_n)$ given by the sums
$$s_n = a_1 + a_2 + \cdots + a_n.$$
Such a sequence is called an infinite series. It is denoted
$$\sum_{n=1}^{\infty} a_n$$
and is defined to converge if the sequence $(s_n)$ of partial sums converges.
Infinite series give rise to very beautiful identities like
We will not go deeper into the rich theory of infinite series, but settle for defining a widely used infinite series called the geometric series. Let $q \in \mathbb{R}$ with $|q| < 1$. We saw in the first chapter that
$$1 + q + q^2 + \cdots + q^n = \frac{1 - q^{n+1}}{1 - q}$$
for any number $q \neq 1$. If $|q| < 1$, then $q^{n+1} \to 0$ for $n \to \infty$.
Show that $q^n \to 0$ for $n \to \infty$ if $|q| < 1$.
Therefore
$$\sum_{n=0}^{\infty} q^n = 1 + q + q^2 + \cdots = \frac{1}{1 - q}. \tag{5.19}$$
The series in (5.19) is called the geometric series.
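A quick numerical sanity check (a sketch, not part of the notes) of the geometric series:

```python
# Partial sums of the geometric series approach 1/(1 - q) when |q| < 1.
q = 0.5
partial = 0.0
for n in range(30):
    partial += q**n
print(partial, 1 / (1 - q))   # the two numbers agree to many decimals
```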
Compute the (infinite) sums
The series given by $a_n = \frac{1}{n}$, i.e.,
$$\sum_{n=1}^{\infty} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots$$
is called the harmonic series. Explore the growth of the harmonic series as a function of the number of terms using the sage window below.
What does this video on twitter have to do with the harmonic series?
Suppose that the inequality
holds. What does (5.20) imply for the harmonic series? Is (5.20) true? Compare with the graphs in the sage window below.
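The sage window mentioned above is not reproduced here; the following sketch computes partial sums of the harmonic series so you can observe the (very slow) growth.

```python
# A sketch of the experiment: partial sums of the harmonic series grow without bound,
# but extremely slowly.
partial = 0.0
for n in range(1, 10**6 + 1):
    partial += 1 / n
    if n in (10, 100, 1000, 10**4, 10**5, 10**6):
        print(n, partial)
```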
Use the sage window below to investigate if the sequence given by
converges. In particular, make a clever statement about the convergence by studying a finite table of
observing for .

5.5 Continuous functions

A function $f : X \to Y$, where $X \subseteq \mathbb{R}^n$ and $Y \subseteq \mathbb{R}^m$, is called continuous at $x \in X$ if for every convergent sequence $(x_k)$ in $X$ with limit $x$, $(f(x_k))$ is a convergent sequence with limit $f(x)$ in $Y$. The function is called continuous if it is continuous at every $x \in X$.
The above is the formal definition of a continuous function. It is short and sweet. To get an understanding, you should study the mother of all examples of non-continuous functions given below:
This is a function from to . It is impossible to plot it without lifting the pencil or defining such a beast without using a bracket as in (5.21).
Let me sketch how the formal Definition 5.59 kills any hope of (5.21) being continuous. It is enough to come up with just one sequence converging to , such that the sequence does not converge to . We pick the sequence , which converges to . But is the constant sequence
because for . Therefore has to converge to , which is different from and fails to be continuous according to Definition 5.59. Had we picked the sequence , then we would not have revealed that is not continuous. It is very important to notice the for every in Definition 5.59.
Almost all functions we encounter will be continuous. The function above is an anomaly.
Let us stop briefly once more and see Definition 5.59 in action.
Let in Definition 5.59. We consider the two functions
where i.e., is the identity function and is a constant function given by the real number . Both of these functions are continuous. Let us see why.
A sequence is convergent with limit if
according to (5.18). To verify Definition 5.59, we must prove that
But , so that the above claim is true by (5.22) with the same . Similarly . Here we may pick , since to begin with.
We now give three important results, which can be used in concrete situations to verify that a given function is continuous. They can be proved without too much hassle. The first result below basically follows from the definition of the norm of a vector (see (5.4)).
The functions given by
for are continuous. In general a function is continuous if and only if is continuous for every , where and .
Lemma 5.61 shows for example that the functions and are continuous functions from to .
Consider the vector function given by
as an example. To prove that is continuous, Lemma 5.61 tells us that it is enough to prove that its coordinate functions
are continuous.
Definition 5.59 also behaves nicely when continuous functions are composed. This is the content of the following
Suppose that and are continuous functions, where and . Then the composition
is continuous.
To get continuous functions from functions already known to be continuous using arithmetic operations, the result below is useful.
Let be functions defined on a subset . If and are continuous, then the functions
are continuous functions, where (the last function is defined only if ).
This result is a consequence of the definition of continuity and Proposition 5.42.
Show in detail that the function given by
is continuous by using Proposition 5.63 combined with Lemma 5.61.
By combining Example 5.60 with Proposition 5.63, one finds that every polynomial is a continuous function and that
is continuous for , where .
Verify the claim in Remark 5.65.
More advanced (transcendental) functions like and also turn out to be continuous.
We are now in a position to prove a famous result from 1817 due to Bolzano.
Let $f : [a, b] \to \mathbb{R}$ be a continuous function, where $a < b$. If $f(a) < 0$ and $f(b) > 0$, then there exists $c$ with $a < c < b$, such that $f(c) = 0$.
This is proved using the supremum property of the real numbers. The subset
is non-empty (since ) and bounded from above. We let .
We will need the following observation about the continuous function : If for , then there exists a small , such that
for every .
Similarly if for , then there exists a small , such that
for every .
These observations imply that by the definition of supremum. Similarly we cannot according to these observations have or . In this case and by definition of supremum we have for every . But for some there must exist , such that . This is impossible.
The only possibility remaining is .
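Bolzano's theorem guarantees the existence of a root but does not locate it. As a supplement (not part of the original notes), here is a small bisection sketch showing how the sign change $f(a) < 0 < f(b)$ can be exploited to approximate the root numerically; the example polynomial is made up.

```python
# A bisection sketch exploiting the sign change in Bolzano's theorem:
# repeatedly halve the interval, keeping the endpoint values of opposite sign.
def bisect(f, a, b, tol=1e-10):
    assert f(a) < 0 < f(b)
    while b - a > tol:
        m = (a + b) / 2
        if f(m) < 0:
            a = m          # the root lies in [m, b]
        else:
            b = m          # the root lies in [a, m]
    return (a + b) / 2

# Example: x**3 - x - 2 changes sign on [1, 2], so it has a root there.
print(bisect(lambda x: x**3 - x - 2, 1, 2))   # approximately 1.52138
```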
Again, by Proposition 5.63, polynomials are continuous functions. Now, as promised previously, we state and prove the following result.
Let
be a polynomial of odd degree, i.e. is odd. Then has a root, i.e. there exists , such that .
We will assume that (if not, just multiply by ). Consider written as
By choosing negative with extremely big, we have , since is negative and
as is positive. Notice here that the terms
are extremely small, when is extremely big.
Similarly by choosing positive and tremendously big, we have . By Theorem 5.67, there exists with with .
Before giving more examples of closed subsets, we will define abstractly the preimage of a subset under a function.
Consider a function
$$f : X \to Y,$$
where $X$ and $Y$ are sets. If $B \subseteq Y$, then the preimage of $B$ under $f$ is defined by
$$f^{-1}(B) = \{x \in X \mid f(x) \in B\}.$$
Definition 5.69 is short and sweet. Here is a first example of the preimage.
Consider the function , where and given by
For , as illustrated below.
What is when and ?
Consider the function given by
and let . What is true about ?
The following result is often a very useful tool in showing that a subset is closed.
If $F \subseteq \mathbb{R}^m$ is a closed subset and $f : \mathbb{R}^n \to \mathbb{R}^m$ a continuous function, then the preimage
$$f^{-1}(F) = \{x \in \mathbb{R}^n \mid f(x) \in F\}$$
is a closed subset of $\mathbb{R}^n$.
Given a convergent sequence with and , we must prove that by Definition 5.49. Since is continuous, it follows by Definition 5.59 that . As is closed, we must have . Therefore by Definition 5.69.
The function given by is continuous. Therefore the subset
of is closed, since is a closed subset of by Proposition 5.50.
Show that
is a continuous function , where and and
Use Proposition 5.50 and Proposition 5.73 to show that
is a closed subset of .
Hint
Write
where is a suitable (closed) interval.
Experiment a bit and compute , when is close to . Is close to a special value when is close to ? What happens when is close to ? How do you explain this in terms of and ?
We end the section on continuous functions by introducing compact subsets and a crucial optimization result.
A subset of euclidean space is called compact if it is bounded and closed.
Let be a compact subset of and a continuous function. Then there exists , such that
for every .
This is a rather stunning result! You are guaranteed solutions to optimization problems of the type
where is a compact subset and a continuous function. Finding the optimal solutions in this setting is another story. It can be extremely hard. For the rest of these notes we will actually dive into methods for computing optimal solutions of optimization problems such as the one above.