5 Euclidean vector spaces

Big data consists of many numbers organized in data sets. Such data sets can be represented as vectors in a high dimensional euclidean vector space. A vector is nothing but a list of numbers, but we need to talk mathematically about the size of a vector and to perform operations on vectors. The term euclidean refers to vectors equipped with a dot product as known from the plane.
The purpose of this chapter is to set the stage for this, especially by introducing the dot product (or inner product) for general vectors. Having a dot product is immensely useful and we give several applications, such as linear regression and the perceptron learning algorithm.
In the last part of the chapter we cover the rudiments of analysis, starting with bounded, open, closed and compact subsets of euclidean spaces, leading to continuous functions and the so-called extreme value theorem, Theorem 5.60. This result states that a huge class of optimization problems always has a solution.

5.1 Vectors in the plane

The dot product (or inner product) between two vectors is given by
where
We may also interpret and as matrices (or column vectors). Then the dot product in (5.1) may be realized as the matrix product:
The length or norm of the vector is given by
This follows from the Pythagorean theorem:
The distance between the two vectors and is given by
Also, the cosine of the angle between and is given by
We will not go into this formula. It is a byproduct of considering the projection of a vector on another vector (see Exercise 5.6 ).
All of these rather natural notions in the plane generalize naturally to for .

5.2 Higher dimensions

We denote the set of column vectors with rows by and call it the euclidean vector space of dimension . An element is called a vector and it has the form (column vector with entries)
A vector in is a model for a real-life data set: a collection of numbers, which could signify measurements. You will see an example of this below, where a vector represents a data set counting words in a string.
Being column vectors, vectors in can be added and multiplied by numbers:
The dot product generalizes as follows to higher dimensions.

5.2.1 Dot product, norm and cosine

Suppose that
are vectors in .
  1. The dot product between and is defined by
  2. Two vectors are called orthogonal if . We write this as .
  3. The norm of is defined by
  4. The distance between the two vectors and is defined by
  5. The cosine of the angle between and is defined by
    provided that they both are non-zero.
All of the definitions above are present in modern machine learning frameworks. Below we see their incarnations in the python library numpy.
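For instance, a minimal sketch in numpy (with two example vectors chosen here purely for illustration) could look as follows:

```python
import numpy as np

# Two example vectors (chosen here just for illustration).
u = np.array([1.0, 2.0, 2.0])
v = np.array([3.0, 0.0, 4.0])

dot = np.dot(u, v)                        # dot product
norm_u = np.linalg.norm(u)                # norm (length) of u
dist = np.linalg.norm(u - v)              # distance between u and v
cos = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # cosine of the angle

print(dot, norm_u, dist, cos)
```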
Show that
where .
Use the definition in (5.3) to show that
for and .
Let be a nonzero vector and . Use the definition in (5.4) to show that and that
is a unit vector.
You could perhaps use Exercise 5.4 to do this. Notice also that is the absolute value for if .
Given two vectors with , find , such that and are orthogonal, i.e.
This is an equation, where is unknown!
For , it is sketched below that if and are orthogonal, then and are the sides in a right triangle.
In this case, if is the angle between and , show that
Use this to show that
Finally show that
where and are two angles.
In the last question, you could use that the vectors
are unit vectors.
Given two vectors , solve the minimization problem
First convince yourself that minimizes if and only if it minimizes
which happens to be a quadratic polynomial in .

5.3 The unreasonable effectiveness of the dot product

Let denote the distance from to the line through and . What is true about ?

5.3.1 The dist formula from high school

The infamous dist formula from high school says that the distance from the point to the line given by is
Where does this magical formula come from? Consider a general line in parametrized form (see Definition 4.10 )
If , then the distance from to is given by the solution to the optimization problem
This looks scary, but simply boils down to finding the vertex of a parabola. The solution is
and the point on closest to is .
Now we put (see Example 4.11 )
in order to derive (5.5) . The solution to (5.6) becomes
We must compute the distance from to in this case. The distance squared is
This is a mouthful and I have to admit that I used symbolic software (see below) to verify that
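A possible sketch of such a symbolic verification in sympy (assuming the line is written as y = a*x + b and the point is (x0, y0); these names are chosen for this sketch) could look like this:

```python
import sympy as sp

a, b, x0, y0, t = sp.symbols('a b x0 y0 t', real=True)

# A point on the line y = a*x + b, parametrized by t, and the given point.
P = sp.Matrix([t, a*t + b])
Q = sp.Matrix([x0, y0])

# Squared distance from Q to P(t): a quadratic polynomial in t.
d2 = (P - Q).dot(P - Q)

# Minimize over t and simplify the minimal squared distance.
t_star = sp.solve(sp.diff(d2, t), t)[0]
min_d2 = sp.simplify(d2.subs(t, t_star))

# Compare with the squared high school formula (a*x0 + b - y0)^2 / (a^2 + 1).
print(sp.simplify(min_d2 - (a*x0 + b - y0)**2 / (a**2 + 1)))  # prints 0
```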

5.3.2 The perceptron algorithm

Already at this point we have the necessary definitions for explaining the perceptron algorithm. This is one of the early algorithms of machine learning. It aims at finding a high dimensional line (hyperplane) that separates data organized in two clusters. In terms of the dot product, the idea of the algorithm is described below in dimension two.
A line in the plane is given by an equation
for , where . Given finitely many points
each with a label of (or blue and red for that matter), we wish to find a line (given by ), such that
for . Such a line is called a separating line for the labeled points.
In some cases this is impossible (an example is illustrated below).
Show that it is impossible to find a line separating the red and blue points above.
A clever approach to finding such a line, if it exists, is to reformulate the problem by looking at the vectors given by
in . Then the existence of the line is equivalent to the existence of a vector with for . If is such a vector, then we have for ,
Therefore we may take and as the line.

A ridiculously simple algorithm

In view of the approach introduced in (5.8), the following general question is interesting.
Given finitely many vectors , can we find , such that
for every ?
Come up with a simple example, where this problem is unsolvable i.e., come up with vectors , where such an does not exist.
Hint
Try out some simple examples for and .
In case exists, the following ridiculously simple algorithm computes . It is called the perceptron (learning) algorithm.
  1. Begin by putting .
  2. If there exists with , then replace by and repeat this step. Otherwise is the desired output vector.
Let us try out the algorithm on the simple example of just two points in given by
In this case the algorithm proceeds as pictured below.
It patiently crawls its way ending with the vector , which satisfies and .
Let us see how (5.7) works in a concrete example.
Consider the points
in , where and are labeled by and is labeled by . Then we let
Now we run the simple algorithm given above Example 5.11:
From the last vector we see that determines a line separating the labeled points.
Below is an implementation of the perceptron (learning) algorithm in python (with numpy) with input from Example 5.12 (it also works in higher dimensions).
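A minimal sketch of such an implementation could look as follows (the points and labels below are placeholders rather than the actual data from Example 5.12, and the ordering of the coordinates in the lift is an assumption):

```python
import numpy as np

def perceptron(vs, max_sweeps=1000):
    """Return w with w . v > 0 for every row v of vs, if one is found."""
    w = np.zeros(vs.shape[1])
    for _ in range(max_sweeps):
        updated = False
        for v in vs:
            if np.dot(w, v) <= 0:   # v is not yet on the correct side
                w = w + v           # the update step of the algorithm
                updated = True
        if not updated:             # every v satisfies w . v > 0
            return w
    raise RuntimeError("no separating vector found")

# Placeholder points in the plane with labels +1 / -1.
points = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
labels = np.array([1, 1, -1])

# Lift to three dimensions as in (5.8): append 1 and multiply by the label.
vs = labels[:, None] * np.hstack([points, np.ones((len(points), 1))])

w = perceptron(vs)
print(w)   # w = (a, b, c) then gives the line a*x + b*y + c = 0
```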
Consider the points
in , where the first point is labeled with and the rest by . Use the perceptron algorithm to compute a separating hyperplane.
What happens when you run the perceptron algorithm on the above points, but where the label of
is changed from to ?

5.3.3 Why does the perceptron algorithm work?

We will assume that there exists , such that
for every . Therefore and if we put
then for every .
The basic insight is the following
Let . After iterations of the perceptron algorithm, satisfies
where is defined in (5.9) .
The algorithm starts with . In the second step we update to if . For such a we have the following inequalities
and
If the second step of the algorithm is executed after steps, then we get for the new that
Proposition 5.14 implies that
Therefore we get and there is an upper bound on the number of iterations used in the second step. So after a finite number of steps, we must have for every .
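For reference, one standard way of writing out this estimate (with the symbols $\gamma$ and $R$ chosen for this sketch; they need not match the notation in (5.9) exactly) is:

```latex
% With \gamma = \min_i \frac{v_i \cdot u}{\|u\|} and R = \max_i \|v_i\|,
% Proposition 5.14 gives, after N updates of w,
\[
  N\gamma \;\le\; w \cdot \frac{u}{\|u\|} \;\le\; \|w\| \;\le\; \sqrt{N}\,R,
  \qquad\text{and hence}\qquad
  N \;\le\; \frac{R^2}{\gamma^2}.
\]
```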

5.4 Pythagoras and the least squares method

The result below is a generalization of the theorem of Pythagoras about right triangles to higher dimensions.
If and , then
This follows from
since .
The dot product and the norm have a vast number of applications. One of them is the method of least squares: suppose that you are presented with a system
of linear equations, where is an matrix.
You may not be able to solve (5.10) . There could be for example equations and only unknowns making it impossible for all the equations to hold. As an example, the system
of three linear equations and two unknowns does not have any solutions.
The method of (linear) least squares seeks the best approximate solution to (5.10) as a solution to the minimization problem
There is a surprising way of finding optimal solutions to (5.12) :
If is a solution to the system
of linear equations with unknowns, then is an optimal solution to (5.12) . If on the other hand is an optimal solution to (5.12) , then is a solution to (5.13) .
Suppose we know that is orthogonal to for every . Then
for every by Proposition 5.15 . So, in the case that for every we have
for every proving that is an optimal solution to (5.12) .
Now we wish to show that is orthogonal to for every if and only if . This is a computation involving the matrix arithmetic introduced in Chapter 3 :
for every if and only if . But
so that .
On the other hand, if for every , then for every : if we could find with , then
for a small number . This follows, since
which is
By picking sufficiently small, this becomes strictly smaller, contradicting that is an optimal solution to (5.12).
In a future course on linear algebra you will see that the system of linear equations in Theorem 5.16 is always solvable i.e., an optimal solution to (5.12) can always be found in this way.
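As a concrete illustration of Theorem 5.16, the normal equations can be solved directly in numpy; the sketch below uses a made-up overdetermined system and compares with numpy's built-in least squares routine.

```python
import numpy as np

# A hypothetical overdetermined system: three equations, two unknowns.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the system A^T A x = A^T b from Theorem 5.16.
x = np.linalg.solve(A.T @ A, A.T @ b)

# numpy's built-in least squares solver should give the same answer.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x, x_lstsq)
```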
Show that (5.11) has no solutions. Compute the best approximate solution to (5.11) using Theorem 5.16 .
The classical application of the least squares method is to find the best line through a given set of points
in the plane .
Usually we cannot find a line matching the points precisely. This corresponds to the fact that the system of equations
has no solutions.
Working with the least squares solution, we try to compute the best line in the sense that
is minimized.
Best fit of line to random points from Wikipedia.
We might as well have asked for the best quadratic polynomial
passing through the points
in .
The same method gives us the system
of linear equations.
Best fit of quadratic polynomial to random points from Wikipedia.
The method generalizes naturally to finding the best polynomial of degree
through a given set of points.
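A sketch of how this might look in numpy is given below; the helper fit_polynomial is hypothetical and simply sets up the matrix whose rows are (1, x, x^2, ..., x^d) and solves the normal equations from Theorem 5.16.

```python
import numpy as np

def fit_polynomial(xs, ys, d):
    """Least squares fit of a degree d polynomial to the points (xs[i], ys[i])."""
    A = np.vander(xs, d + 1, increasing=True)   # rows (1, x, x^2, ..., x^d)
    return np.linalg.solve(A.T @ A, A.T @ ys)   # normal equations

# Randomly perturbed points, in the spirit of the figures above.
rng = np.random.default_rng(0)
xs = np.linspace(0.0, 1.0, 20)
ys = xs**2 - xs + 0.1 * rng.standard_normal(20)

coeffs = fit_polynomial(xs, ys, 2)   # coefficients of c0 + c1*x + c2*x^2
print(coeffs)
```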
Find the best line through the points and and the best quadratic polynomial through the points and .
It is important here that you write down the relevant system of linear equations according to Theorem 5.16. It is, however, ok to solve the equations on a computer (or to check your best fit on WolframAlpha).
Also, you can get a graphical illustration of your result in the Sage window below.
A circle with center and radius is given by the equation
  1. Explain how (5.14) can be rewritten to the equation
    where .
  2. Explain how fitting a circle to the points in the least squares context using (5.15) leads to the system
    of linear equations.
  3. Compute the best circle through the points
    by giving the center coordinates and radius with two decimals. Use the Sage window below to plot your result to see if it matches the drawing.

5.5 The Cauchy-Schwarz inequality

Take another look at (5) in Definition 5.1 . It is actually a small miracle that no matter which (non-zero) vectors and you use as input to the cosine function defined in Example 5.2 , you always get a number between and . The mathematics behind this is rather elegant. It is a consequence of the famous Cauchy-Schwarz inequality stated and proved below.
For two vectors ,
We consider the function given by
Then is a quadratic polynomial with . Therefore its discriminant must be i.e.,
which gives the result.
Why are the two inequalities ensuring that the cosine in (5) of Definition 5.1 lies between -1 and 1 a consequence of Theorem 5.23?
For arbitrary two numbers ,
since
Why is
for arbitrary numbers ?

5.5.1 The triangle inequality

Another nice consequence of the Cauchy-Schwarz inequality is the triangle inequality.
For three vectors ,
From the Cauchy-Schwarz inequality (Theorem 5.23 ) it follows that
for two vectors . Since the right hand side of this inequality is , we have
By the definition of , we then get the desired inequality as
Apply the triangle inequality in the form
for to show that

5.5.2 Cosine similarity in machine learning

When two vectors are interpreted as data sets, the number in (5) of Definition 5.1 is known as the cosine similarity. It measures the correlation between the vectors and .
A very primitive way of modelling sentences in a language is the so-called one-hot encoding of its words. We will illustrate this by an example. Suppose that our language consists of the words
'a', 'and', 'applicable', 'are', 'fun', 'is', 'mathematics', 'matrices', 'matrix', 'useful'
Each word gets embedded into with a vector associated to its row below
Now consider the two sentences "mathematics is fun and a matrix is useful" and "mathematics is fun and matrices are applicable".
From the words in the two strings we form the following vectors in using the one-hot embedding in (5.16) .
Here a sentence is mapped to the vector, which is the sum of all the vectors corresponding to the words in the sentence, where each vector is multiplied by its multiplicity i.e., how many times the word occurs. The closer the cosine gets to (corresponding to an angle of degrees), the more similar we consider the sentences. Use the python snippet below to experiment and compute the cosine similarity in the example.
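A minimal sketch of such a computation in python (assuming the one-hot vectors are indexed by the word order listed above) could look like this:

```python
import numpy as np

vocabulary = ['a', 'and', 'applicable', 'are', 'fun',
              'is', 'mathematics', 'matrices', 'matrix', 'useful']

def sentence_vector(sentence):
    """Sum of one-hot vectors of the words, counted with multiplicity."""
    v = np.zeros(len(vocabulary))
    for word in sentence.split():
        v[vocabulary.index(word)] += 1
    return v

s1 = sentence_vector("mathematics is fun and a matrix is useful")
s2 = sentence_vector("mathematics is fun and matrices are applicable")

cosine = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
print(cosine)
```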
Cosine similarity is crucial in machine learning, especially in NLP. The one-hot embedding is very crude and does not really capture the semantics of a sentence. The bread and butter of modern (large) language models is more advanced (dense) embeddings constructed using deep learning. The embeddings even take whole sentences as input! The recent breakthroughs can be traced back to 2013, when Google introduced the word embedding word2vec. When embedding a sentence one usually considers tokens and not words. This means that every sentence (as input to ChatGPT) must be broken down into a sequence of tokens. Modern large language models typically operate with a vocabulary of around 50,000 tokens. Each token is embedded into a euclidean space of dimension usually .

5.6 Special subsets of euclidean spaces

Recall that a circle (or an open disk) centered at with radius is defined as the subset
Similarly an open ball in centered at with radius is defined as the subset
The natural generalization of this definition to higher dimensions is given below.
The open ball centered at with radius is defined as

5.6.1 Bounded subsets

It makes sense to define bounded subsets as subsets that can be contained in a large enough open ball centered at :
A subset is called bounded if there exists , such that
Written out, Definition 5.29 says that
for every . Boundedness of is also equivalent to the following two conditions
  1. There exists , such that
    for every .
  2. There exists , such that
    for and every .
Every finite subset is bounded (why?). For , Definition 5.29 simply says
This implies that an interval is bounded by putting in Definition 5.29 .
Show precisely that the subset of is not bounded, whereas the subset is.
Sketch why
is bounded. Now use Fourier-Motzkin elimination to show the same without sketching.

5.6.2 Open, closed and compact subsets and boundaries of subsets

Open subsets

An open subset of is a subset consisting of points that are interior in the following sense:
A subset is called open if for every , there exists , such that
Prove that an open ball given by is an open subset.
Suppose that . Define a suitable for and use Corollary 5.26 to conclude that .
We will need the result below.
If are open subsets, then
are open subsets.

Closed subsets

A subset is called closed if is open.
In analogy with Proposition 5.35 we have the result below.
If are closed subsets, then
are closed subsets.

Open intervals

The following subsets
are open subsets of for every .
Let us prove that is an open subset of . If , then we let . Suppose that . If , then and . If , then and therefore and . We have proved that is an open subset.
A similar proof shows that is an open subset. If , then
which is an open subset by the above and Proposition 5.35 .

Closed intervals

We have a similar result for closed subsets.
The following subsets
are closed subsets of for every .
The proof follows from Definition 5.36 and Proposition 5.38 . For example,

Compact subsets

We single out the following very important class of subsets
A subset is called compact if it is bounded and closed.

The boundary of a subset

The boundary of a subset is informally the subset of points barely touching :
This is made precise in the following definition.
The boundary of a subset is defined as

5.7 Continuous functions

A function , where and is called continuous at if for every , there exists , such that
for every . Equivalently,
The function is called continuous if it is continuous at every .
The above is the formal definition of a continuous function. It is short and sweet, but takes some time to assimilate. To get an understanding, you should study the mother of all examples of non-continuous functions given below:
This is a function from to . It is impossible to plot it without lifting the pencil, or to define such a beast without using a bracket as in (5.18).
Let me sketch how the formal Definition 5.42 kills any hope of (5.18) being continuous at . To prove this we must prove that the negation of the proposition in (5.17) is true. This reads
for the function defined in (5.18) . You can verify that the above is true by setting and . For these values,
Almost all functions we encounter will be continuous. The function above is an anomaly.
Let us stop briefly once more and see Definition 5.42 in action.
Let in Definition 5.42 . We consider the two functions
where i.e., is the identity function and is a constant function given by the real number . Both of these functions are continuous. Let us see why.
For the function , (5.17) reads
This is certainly true if we pick .
For the function , (5.17) reads
Here can be picked arbitrarily, since is always true.

5.7.1 An elegant way of characterizing a continuous function

Recall the definition of the preimage from Definition 1.110 and the definition of an open subset from Definition 5.33 . The following characterization of continuous functions came rather late in the history of mathematics.
Let be a function. Then is continuous if and only if is open in for every open subset .
Let be an open subset. Assume first that is continuous. We wish to prove that is open. Pick and so that . Now use the continuity of to pick so that (5.17) is satisfied i.e.,
Since , (5.19) says that
showing that is an open subset.
Now suppose that is open whenever is open. For and we put . Since is an open subset, is open and . So we may find so that . But this is exactly the statement that
showing that is continuous.
The following result is often a very useful tool in showing that a subset is closed.
If is a closed subset and a continuous function, then the preimage
is a closed subset of .
If is closed, then is open. Therefore
is open by Proposition 5.44 . This implies that is closed.
Let us assume for now that given by is continuous (see Exercise 5.52 ). Then Proposition 5.45 shows that the subset
of is closed, since is a closed subset of by Proposition 5.39 .
Show formally that the subset
is an open subset of .

5.7.2 Working with continuous functions

We now give three important results, which can be used in concrete situations to verify that a given function is continuous. They can be proved without too much hassle. The first result below basically follows from the definition of the norm of a vector (see (5.4)).
The projection functions defined in Definition 1.99 are continuous. In general a function is continuous if and only if is continuous for every , where and .
Lemma 5.48 shows for example that the functions and are continuous functions from to .
Consider the vector function given by
as an example. To prove that is continuous, Lemma 5.48 tells us that it is enough to prove that its coordinate functions
are continuous.
Definition 5.42 also behaves nicely when continuous functions are composed. This is the content of the following
Suppose that and are continuous functions, where and . Then the composition
is continuous.
To get continuous functions from functions already known to be continuous using arithmetic operations, the result below is useful.
Let be functions defined on a subset . If and are continuous, then the functions
are continuous functions, where (the last function is defined only if ).
This result is a consequence of the definition of continuity and the corresponding rules for the arithmetic of limits.
Show in detail that the function given by
is continuous by using Proposition 5.51 combined with Lemma 5.48 .
By combining Example 5.43 with Proposition 5.51 , one finds that every polynomial is a continuous function and that
is continuous for , where .
Verify the claim in Remark 5.53 .
More advanced (transcendental) functions like and also turn out to be continuous. We will return to this in the next chapter, where differentiable functions are defined.
Show from scratch (without using Remark 5.53 ) that
is a continuous function , where and and
Use Proposition 5.39 and Proposition 5.45 to show that
is a closed subset of .
Hint
Write
where is a suitable (closed) interval.
Experiment a bit and compute , when is close to . Is close to a special value when is close to ? What happens when is close to ? How do you explain this in terms of and ?

5.8 Important and special results for continuous functions

Below we quote a famous and very intuitive result from 1817 due to Bolzano. This result is also known as the intermediate value theorem.
Let be a continuous function, where . If and , then there exists with , such that .
Polynomials are continuous functions. Bolzano's result fits perfectly in the proof of the result below. This result is wrong for polynomials in as witnessed by , which does not have a rational root.
Use the methods of Example 1.38 to show that there is no with , where .
Let
be a polynomial of odd degree, i.e. is odd and . Then has a root, i.e. there exists , such that .
We will assume that (if not, just multiply by ). Consider written as
By choosing negative with extremely big, we have , since is negative and
as is positive. Notice here that the terms
are extremely small, when is extremely big.
Similarly, by choosing positive and tremendously big, we have . By Theorem 5.57, there exists with , such that .
We end this section with a result that might be coined the mathematical cornerstone of optimization (also due to Bolzano, at least for ). The result below is called the extreme value theorem.
Let be a compact subset of and a continuous function. Then there exists , such that
for every .
This is a rather stunning result! You are guaranteed solutions to optimization problems of the type
where is a compact subset and a continuous function. Finding the optimal solutions in this setting is another story. It can be extremely hard. For the rest of these notes we will actually dive into methods for computing optimal solutions of optimization problems such as the one above.
Give two examples, where Theorem 5.60 fails for if we relax the conditions on . One, where is open and another one where is not bounded.