9 Convex optimization

In this last chapter we will deal exclusively with convex optimization problems.
Recall that a convex optimization problem has the form
where is a convex subset (see Definition 4.19 ) and a convex function (see Definition 4.24 ). We will mainly deal with the case where is differentiable and defined on all of , in addition to just being convex on . Also recall that convex optimization problems are very well behaved in the sense that local minima are global (see Theorem 6.13 ).
Below is an example of a convex optimization problem in the plane .
where is the subset of points in satisfying
Sketch the subset in Example 9.1 . Show that Example 9.1 really is a convex optimization problem and solve it.
Below is an example of a convex optimization problem modelling the real-life problem of placing a fire station (center of circle) so that the maximal distance to the surrounding houses (points to be enclosed) is minimal.
Given points , what is the center and radius of the smallest circle containing these points?
We can write this optimization problem as
where
Upon rewriting this turns into the optimization problem
where
and .
Prove that (9.1) and (9.2) both are convex optimization problems. Explain how (9.1) is rewritten into (9.2) .
Hint
Expand
and put .
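
As a small numerical illustration of the enclosing circle problem (separate from the exercise above), one can minimize the radius directly subject to the distance constraints. The sketch below uses scipy; the three data points are made up for illustration and are not the points of Example 9.3.

import numpy as np
from scipy.optimize import minimize

# Hypothetical data points; the variable v = (a, b, r) holds the center (a, b)
# and the radius r of the circle.
points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])

def radius(v):
    return v[2]

# One constraint r - |(a, b) - p| >= 0 per point.
constraints = [{'type': 'ineq',
                'fun': lambda v, p=p: v[2] - np.linalg.norm(v[:2] - p)}
               for p in points]

res = minimize(radius, x0=[0.0, 0.0, 10.0], constraints=constraints)
print(res.x)   # approximate center and radius of the smallest enclosing circle
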

9.1 Finding the optimal hyperplane separating data

In section 5.3.2 we were presented with labeled data
where and . The task at hand was to separate differently labeled data by a hyperplane , such that
for . Please browse back to Definition 4.38 for the definition of a hyperplane in .
To make this more real, consider the points with label and the points with label (as in Exercise 5.9 and above it). Here a hyperplane satisfying (9.4) cannot exist: suppose that . Then (9.4) is tantamount to the following inequalities
But these inequalities are unsolvable in and (why?).
If the data in (9.3) can be separated according to (9.4) , we may find a hyperplane , such that
How do you go from (9.4) to (9.5) ?
Suppose that
Let
Show that and . How can and be applied in constructing and ?
Consider
What is special about ?
What does optimal hyperplane mean in this setting? This is the one maximizing the width of the strip between the two labeled clusters.
Figure from the Cortes and Vapnik paper: Support vector networks, Machine Learning, 1995.
A rather crucial insight (not explicitly mentioned in the paper by Cortes and Vapnik) is that such a hyperplane is given as one satisfying (9.5) with maximal distance to all of the data points. Therefore the function to maximize (with respect to and ) is
since the distance from to a point is (see Exercise 9.8 )
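
For reference, if the hyperplane is written as H = \{x : \langle w, x\rangle + b = 0\} with normal vector w, the distance formula referred to above (and derived in Exercise 9.8) is the standard one

d(H, x_0) = \frac{|\langle w, x_0\rangle + b|}{\lVert w \rVert}.
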
The conditions in (9.5) may be written as for . If , then we multiply and by , so that we may assume
Let be the hyperplane in given by and let . The point closest to in can be found by solving the optimization problem
Explain why (9.6) is a convex optimization problem.
Show how Theorem 7.37 can be used to solve this optimization problem by first deducing the equations
for the Lagrange multiplier . Notice here that above really contains equations, whereas is only one equation in , where . Solve the equations (9.7) for and . How can we be sure that really is a minimum in (9.6) ?
Finally show that the distance from to is given by the formula
We have therefore proved the following result (notice that maximizing is the same as minimizing )
Let the points be labeled by . Then the optimal hyperplane separating the points is given by the optimization problem
for .
The vectors among the data points satisfying or are called support vectors.
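
The optimization problem in Theorem 9.9 is a quadratic optimization problem and can be solved numerically. Below is a small sketch using the CVXOPT package (see also Example 9.14 below), written in the standard form of minimizing (1/2)|w|^2 subject to y_i(<w, x_i> + b) >= 1; the data points and labels are made up for illustration.

import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

X = np.array([[1.0, 1.0], [2.0, 1.0], [0.0, 3.0], [-1.0, 3.0]])   # hypothetical points
y = np.array([1.0, 1.0, -1.0, -1.0])                              # hypothetical labels
m, n = X.shape

# Variables z = (w, b); objective (1/2) z^T P z with P = diag(1, ..., 1, 0).
# The tiny entry for b only keeps the solver's KKT system nonsingular.
P = matrix(np.diag([1.0] * n + [1e-8]))
q = matrix(np.zeros(n + 1))

# The constraints y_i(<w, x_i> + b) >= 1 written as G z <= h.
G = matrix(-y[:, None] * np.hstack([X, np.ones((m, 1))]))
h = matrix(-np.ones(m))

sol = solvers.qp(P, q, G, h)
w, b = np.array(sol['x'][:n]).ravel(), sol['x'][n]
print(w, b)
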
Let us explicitly write up the optimization problem in Theorem 9.9 in a very simple situation: finding the best line separating the points and . In the notation of (9.5) , we have (without the stars on and )
so that
The points are
where and .
Therefore the optimization problem in Theorem 9.9 becomes
Solve the optimization problem (9.8) and verify that the best line from the optimization problem is the one we expect it to be. Also, check how WolframAlpha solves this optimization problem.
Hint
You could maybe use Fourier-Motzkin elimination to show that
implies .
Notice that the optimization problem in Theorem 9.9 has a number of constraints equal to the number of points to be separated. For an extended (soft margin) optimization problem, which applies when the data at hand cannot be separated, we refer to section 3 of the Cortes and Vapnik paper.
Usually one does not use the optimization problem in Theorem 9.9 , but rather its so-called (Lagrange) dual for finding the optimal hyperplane. This dual optimization problem uses that the normal vector is a linear combination
of the support vectors. It is an optimization problem in from (9.9) and looks like
where is the vector of labels attached to the points and is the symmetric matrix given by
Notice that the dual optimization problem is an optimization problem in with , where is the number of data points. This can be in stark contrast to the original optimization problem in Theorem 9.9 , which is an optimization problem in , where is the dimension of the data points. Sometimes the data points are high dimensional and it pays to solve the dual optimization problem.
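
As a companion to the primal sketch above, here is how the dual optimization problem (9.10) can be handed to CVXOPT, written in the standard dual form (maximize the sum of the multipliers minus the quadratic term, subject to non-negativity and the equality constraint with the label vector). Again the data is made up, and the offset of the hyperplane is not computed here (it can be recovered from any support vector).

import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

X = np.array([[1.0, 1.0], [2.0, 1.0], [0.0, 3.0], [-1.0, 3.0]])   # hypothetical points
y = np.array([1.0, 1.0, -1.0, -1.0])                              # hypothetical labels
m = len(X)

Q = matrix(np.outer(y, y) * (X @ X.T))   # Q_ij = y_i y_j <x_i, x_j>
p = matrix(-np.ones(m))                  # maximize sum(alpha) - (1/2) alpha^T Q alpha
G = matrix(-np.eye(m))                   # alpha >= 0 written as -alpha <= 0
h = matrix(np.zeros(m))
A = matrix(y.reshape(1, m))              # the equality constraint <y, alpha> = 0
b = matrix(0.0)

sol = solvers.qp(Q, p, G, h, A, b)
alpha = np.array(sol['x']).ravel()
w = (alpha * y) @ X                      # normal vector as a combination of support vectors
print(alpha, w)
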
Let us write down the dual optimization problem for the points in Example 9.10 . Here
so that the dual optimization problem becomes
This reduces to the optimization problem of maximizing subject to , which has the solution . Therefore the optimal hyperplane has normal vector
The dual optimization problem (9.10) can be derived formally from the original optimization problem in Theorem 9.9 . This is, however, beyond the scope of this course (see section 2.1 of the Cortes and Vapnik paper).
Quadratic optimization problems, such as the one in Theorem 9.9 , can in fact be handled by Sage (well, Python in this case). See CVXOPT for further information. Note that the code below needs to be executed as Python code (choose Python in the pull-down menu). It attempts (in general) to solve the optimization problem
In the Sage window below the optimization problem
has been entered.
What happens if you remove

RealNumber=float
Integer=int

from the code above?
Take a look at the input format in Example 9.14 . Can you tell which optimization problem in this chapter is solved below? Also, the code below seems to report some errors after pressing the compute button. Can you make it run smoothly by making a very, very small change?

9.1.1 Separating by non-linear functions

Sometimes one needs more complex separating curves than just a line. Consider the five points
where we wish to separate from the other points. This is impossible using a line, but certainly doable by a circle
where :
The circle (9.12) may be a circle in two dimensions, but viewed in three dimensions it turns into a hyperplane in the following way.
By using the function given by
points lying on (9.12) map to points lying on the hyperplane in given by
in .
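
One standard choice of such a function (possibly, but not necessarily, the one intended above) is the lifting map \varphi(x_1, x_2) = (x_1, x_2, x_1^2 + x_2^2). Writing the circle (9.12) as (x_1 - a)^2 + (x_2 - b)^2 = r^2 and expanding, a point on the circle satisfies

-2a\, z_1 - 2b\, z_2 + z_3 = r^2 - a^2 - b^2, \qquad (z_1, z_2, z_3) = \varphi(x_1, x_2),

which is the equation of an (affine) hyperplane in \mathbb{R}^3.
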
An even simpler example is given in dimension one. Consider the points (or numbers)
with respective labels . These cannot be separated by a hyperplane . Things change dramatically if we use the function to embed the points in . Using , the points (9.13) map to
with the (same) labels . Here they can be separated by the hyperplane . This means that the original numbers can be separated by the non-linear function or rather so that and .
The general trick is to find a suitable map
such that the transformed data
becomes linearly separable. Since,
for suitable , the (dual) optimization problem in (9.10) becomes
where is the vector of labels attached to the points and is the symmetric matrix given by
The beauty of the dual problem (as mentioned in Remark 9.12 ) is that we do not have to care about the (sometimes astronomical, even "infinite") dimension of . The optimization problem is situated in , where is the number of data points (or more precisely the number of constraints). We only need a clever way of getting our hands on in (9.14) . Here an old concept from pure mathematics called kernel functions helps us.

9.1.2 Kernel functions

A kernel function is a function that is a hidden dot product in the following sense: there exists a function
such that
Let be given by
Then
One gleans from (9.16) that is a kernel function, since (9.15) is satisfied for given by
Once we have a kernel function for we can replace the matrix in (9.14) by
and proceed to solve the optimization problem without worrying about the sometimes insurmountable size of .
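
To make the definition concrete, here is a small numerical check of (9.15) for the quadratic kernel k(x, y) = <x, y>^2 on the plane (not necessarily the kernel of Example 9.17), whose hidden feature map is phi(x) = (x_1^2, sqrt(2) x_1 x_2, x_2^2).

import numpy as np

def k(x, y):
    return np.dot(x, y) ** 2                 # the quadratic kernel

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
print(k(x, y), np.dot(phi(x), phi(y)))       # the two numbers agree
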

9.1.3 The kernel perceptron algorithm

Recall the stunningly simple perceptron algorithm from section 5.3.2 . This algorithm can be modified to handle non-linear separation too by using kernel functions. In fact, this modification was one of the inspirations for the development of the support vector machines described above.
After having mapped a set of vectors with labels to , via , we are looking for a vector , such that
Such a vector is expressible as
The (dual) perceptron algorithm works by adjusting the coefficients successively as follows: if is wrong about the placement of in (9.17) , i.e., if , then let
If we have a kernel function for , then
and we can use the kernel function in the algorithm without resorting to computing and the inner product in .
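
Here is a sketch of one common variant of the kernel perceptron, in which the classifier is f(x) = sum_j a_j k(x_j, x), a point x_i is misclassified when y_i f(x_i) <= 0, and a mistake updates a_i by y_i. The kernel (1 + <x, y>)^2 and the five points below are chosen for illustration only.

import numpy as np

def k(x, y):
    return (1.0 + np.dot(x, y)) ** 2      # a kernel with a hidden dot product in R^6

X = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
y = np.array([-1, 1, 1, 1, 1])            # hypothetical labels: origin versus the rest

a = np.zeros(len(X))                      # coefficients of w = sum_j a_j phi(x_j)
for _ in range(100):                      # passes over the data
    mistakes = 0
    for i in range(len(X)):
        if y[i] * sum(a[j] * k(X[j], X[i]) for j in range(len(X))) <= 0:
            a[i] += y[i]                  # only the coefficient of x_i is adjusted
            mistakes += 1
    if mistakes == 0:                     # every point is classified correctly
        break
print(a)
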
Use the kernel function in Example 9.17 and the kernel perceptron algorithm to separate
Sketch the points and the separating curve.

9.2 Logarithmic barrier functions

We need an algorithm for solving optimization problems like (9.9) . There is a very nice trick (probably going back to von Neumann) for solving constrained optimization problems of the form
where is defined by the differentiable functions as
The functions define the boundary (or barrier) of . We use them to define the logarithmic barrier function
defined on the interior
The boundary of is
You can see that the logarithmic barrier function explodes (becomes unbounded) when a vector approaches , since is unbounded as for .
The cool idea is to consider the function
for . This function has a global minimum .
Prove that is a convex function if and are convex functions.
Hint
Prove and use that if is a decreasing convex function (in one variable) and is a convex function, then is a convex function, where we assume the composition makes sense.
The upshot is that as . This is the content of the following theorem, which we will not prove.
Let be a point in with
for and . Then
and as . If (9.18) has a unique optimum , then by using we obtain a sequence as .
We move on to give concrete examples of Theorem 9.20 in action.
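
As a first tiny illustration, consider minimizing f(x) = (x - 2)^2 subject to x <= 1 (a made-up one-variable example). The sketch below minimizes the barrier function f(x) - eps * log(1 - x) over the interior for a decreasing sequence of values of eps, using scipy.

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: (x - 2.0) ** 2                            # objective
barrier = lambda x, eps: f(x) - eps * np.log(1.0 - x)   # defined on the interior x < 1

for eps in [1.0, 0.1, 0.01, 0.001]:
    res = minimize_scalar(barrier, args=(eps,),
                          bounds=(-10.0, 1.0 - 1e-12), method='bounded')
    print(eps, res.x)    # the minimizers approach the constrained optimum x = 1
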

9.2.1 Quadratic function with polyhedral constraints

A much-used setup in optimization is minimizing a quadratic function subject to polyhedral constraints. This is the optimization problem
where is an matrix, is an matrix, and .
Certainly the constraints define a convex subset of , but the function is not strictly convex unless is positive definite. If is not positive semidefinite, (9.20) is difficult.
If is positive semidefinite, the interior point method outlined above usually works well.
The optimization problem (9.2) has the form (9.20) , when we put
The optimization problem in Theorem 9.9 has the form (9.20) , when we put
Here is a matrix, is an matrix and .
Optimization of a quadratic function as in (9.20) is implemented below using the interior point method and exact line search. See Section 10.5.1 of my book Undergraduate Convexity for further details. Only Python with NumPy is used.
Below are samples of output from running the interior point algorithm on the enclosing circle problem in Example 9.3 for decreasing values of the parameter in the barrier function in (9.19) . We are attempting to compute the center of the smallest enclosing circle of the points

>>> Newton(Q1, c1, A1, b1, [0,0,18], 1)
array([-0.43814179,  1.84835643,  7.69779763])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.5)
array([-0.45244596,  1.81135335,  4.99353576])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.1)
array([-0.49569549,  1.84020923,  2.91071666])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.05)
array([-0.49875614,  1.87002433,  2.63917096])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.01)
array([-0.4999099 ,  1.93327221,  2.28805064])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.005)
array([-0.49996901,  1.95176304,  2.20331123])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.001)
array([-0.49999729,  1.97792915,  2.09031159])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.0005)
array([-0.49999904,  1.98432672,  2.06370265])
>>> Newton(Q1, c1, A1, b1, [0,0,18], 0.0001)
array([-0.49999991,  1.99295433,  2.0283835 ])

The first two coordinates of the output are the - and -coordinates of the center. The third is from (9.2) .
Try out the code in the Sage window above on the Exercises 7.46 , 7.47 and 7.48 . Check the output of the code by actually solving these exercises.
Compute the best line separating the labeled data

((1, 0), +1), ((2, 0), +1), ((3, 0), +1), ((3, 2), +1), ((1, 1), -1), ((2, 2), -1).

9.3 A geometric optimality criterion

Consider the general optimization problem
where is a subset of .
Suppose that is a convex subset and a differentiable function in (9.21) . If is an optimal solution of (9.21) , then
If in addition is a convex function, then (9.22) implies that is optimal.
Proof
If is an optimal solution and , then
for every with , where denotes the epsilon function in the definition of differentiability (see Definition 7.5 ). Therefore
for . This is only possible if . We have silently applied the convexity of and the differentiability of at .
If in addition is convex and (9.22) holds, then Theorem 8.21 shows that is an optimal solution.
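
In the standard notation, with the convex set written as C and the optimal solution as x^*, the condition (9.22) reads

\langle \nabla f(x^*),\, x - x^* \rangle \ge 0 \quad \text{for all } x \in C,

that is, moving from x^* towards any other feasible point cannot decrease f to first order.
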
A nice application of Proposition 9.26 is for example to the optimization problem
Here Proposition 9.26 shows that is optimal, since the hyperplane
touches the boundary of
as shown below.
Sketch how Proposition 9.26 applies to show that an optimum in a linear programming problem
in the plane can always be found at a vertex.
Let be a differentiable convex function and
Suppose that for . Prove that is a minimum for defined on .
Guess the solution to the optimization problem
Show that your guess was correct!

9.4 KKT

The KKT in the title of this section is short for Karush-Kuhn-Tucker.
We will limit ourselves to a convex optimization problem of the form
where is defined by the differentiable convex functions for as
and is a convex function.
To the optimization problem (9.23) we associate the (famous) Karush-Kuhn-Tucker (KKT) conditions:
Notice that the KKT conditions consist of inequalities and equations in the unknowns . The KKT conditions form a surprising theoretical foundation for optimization problems of the type in (9.23) . You should take a peek back to the theory of Lagrange multipliers in section 7.9 and compare with (9.24) .
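
For orientation, with S = \{x : g_1(x) \le 0, \dots, g_m(x) \le 0\} as above and Lagrange multipliers \lambda_1, \dots, \lambda_m, the conditions in (9.24) take (up to notation) the standard form

\nabla f(x) + \lambda_1 \nabla g_1(x) + \cdots + \lambda_m \nabla g_m(x) = 0,
\qquad g_i(x) \le 0, \qquad \lambda_i \ge 0, \qquad \lambda_i\, g_i(x) = 0 \quad (i = 1, \dots, m).
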
The KKT conditions associated with the convex optimization problem in (9.1) are
Verify that the KKT conditions of the optimization problem in (9.1) are the ones given in Example 9.31 .
To state our main theorem we need a definition.
The optimization problem (9.23) is called strictly feasible if there exists with
Below is the main result in our limited convex setting. We will not go into the proof, which can be found in my book Undergraduate Convexity.
  1. Let be an optimal solution of (9.23) . If (9.23) is strictly feasible, then the KKT conditions are satisfied at for suitable .
  2. If the KKT conditions are satisfied at for some , then is an optimal solution to (9.23) .
Let us now touch base with a rather simple example. Consider the optimization problem
Here and in (9.23) . Therefore the KKT conditions in (9.24) are
Before even thinking about moving on to the next section, you should attempt to find a solution to the above KKT conditions (inequalities) and then verify using Theorem 9.34 (ⅱ) that is optimal. Also, try only using Theorem 9.34 (ⅰ) and (9.26) to show that is not a solution to (9.25) .
Give an example of a convex optimization problem as in (9.23) , which is not strictly feasible and with an optimal solution that does not satisfy the KKT conditions. Such an example shows that strict feasibility is necessary in Theorem 9.34 (ⅰ) .

9.5 Computing with KKT

9.5.1 Strategy

A general strategy for finding solutions to the KKT conditions in (9.24) is to zoom in on (the Lagrange multipliers), testing each of them against the two cases and .
One important point, which you can read from (9.24) , is that if . To further elaborate, if , then an optimal solution must satisfy .
So where exactly in (9.24) is the above claim verified?
The condition simplifies the equations
in (9.24) .
In principle, to solve the KKT conditions one has to try out all the possibilities coming from or for .
Why possibilities above?
How do you solve the optimization problem (or decide there is no solution) if ?
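
The case-splitting strategy can also be carried out symbolically. The sketch below does this for a small made-up problem (not one from this chapter): minimize f(x, y) = (x - 2)^2 + (y - 1)^2 subject to g1 = x - 1 <= 0 and g2 = -y <= 0, running through the 2^2 cases with sympy and keeping only the solutions with a feasible point and non-negative multipliers.

import itertools
import sympy as sp

x, y, l1, l2 = sp.symbols('x y lambda1 lambda2', real=True)
f = (x - 2) ** 2 + (y - 1) ** 2
g = [x - 1, -y]
lam = [l1, l2]

# The gradient equation of the KKT conditions.
grad_eqs = [sp.diff(f, v) + sum(l * sp.diff(gi, v) for l, gi in zip(lam, g))
            for v in (x, y)]

for case in itertools.product([0, 1], repeat=len(g)):
    # case[i] == 0 imposes lambda_i = 0, case[i] == 1 imposes g_i = 0
    eqs = grad_eqs + [lam[i] if c == 0 else g[i] for i, c in enumerate(case)]
    for sol in sp.solve(eqs, [x, y, l1, l2], dict=True):
        feasible = all(gi.subs(sol) <= 0 for gi in g)
        multipliers_ok = all(sol[l] >= 0 for l in lam)
        if feasible and multipliers_ok:
            print(case, sol)    # only the case giving the optimum survives
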

9.5.2 Example

Let denote the set (see Figure 9.40 ) of points with
We will illustrate the mechanics of solving the KKT conditions in finding an optimal solution for
The convex set with optimal solution for (9.27) marked.
Putting
and , we are in a position to apply Theorem 9.34 , since are convex functions and for . This means that an optimal solution of (9.27) satisfies the KKT conditions. The same theorem tells us that the in a solution of the KKT conditions is an optimal solution (here we also use that is a convex function). The full set of KKT conditions in are
A strategy for finding a solution to the KKT conditions is to try the (eight) different combinations of strict inequalities in . You can see from the last two equations that is impossible. The condition shows that an optimal solution has to occur on the lower arc in Figure 9.40 . If , then and by the last two equations. This implies violating . Therefore . If , then and by and the last two equations. Therefore . So we are left with the case and giving
Inserting this into we end up with (see Figure 9.40 )
Theorem 9.34 is beautiful mathematics. Going through the KKT conditions as above can be quite lengthy, if not impossible, in practice. As we have seen, there are other methods for (at least) approximating an optimal solution.

9.6 Optimization exercises

Below are some exercises especially related to the KKT conditions. In some of the exercises the minimization problem
is denoted
This should cause no confusion.
Consider the optimization problem
  1. Show that (9.29) is a convex optimization problem.
  2. Sketch the set of constraints in and show that cannot be an optimal solution to (9.29) .
  3. Write up the KKT conditions for (9.29) and explain theoretically (without actually solving them) why they must have a solution.
  4. Now solve (9.29) . Is the solution unique?
Consider the function given by
  1. Show that is strictly convex.
  2. Let denote the subset of points satisfying
    Show that is a closed convex subset.
  3. Solve the optimization problem
Let , where
  1. Why does the optimization problem
    have a solution?
  2. Find all optimal solutions to (9.31) .
  3. Let , where at least one of is non-zero. Show that an optimal solution to
    belongs to .
Let
  1. Use the KKT conditions to solve the minimization problem
  2. Use the KKT conditions to solve the minimization problem
Solve the optimization problem
Let and .
  1. State the KKT conditions for for .
  2. Suppose now that . For which and does have optimum in ? State the KKT conditions when .
Let be given by
  1. Show that is a convex function.
  2. Find . Is this minimum unique? Is a strictly convex function?
    Let
  3. Apply the KKT-conditions to decide if is an optimal solution to
  4. Find
    and
Let be given by
and let
  1. Show that is not a convex function.
  2. Show that is a convex function on the open subset
    and conclude that is convex on .
  3. Show that is an optimal solution for the optimization problem . Is a unique optimal solution here?
Let be given by
and by
  1. Show that is a convex function. Is strictly convex?
  2. Show that is a convex subset of .
  3. Does there exist an optimal point for the minimization problem
    with ?
  4. Does there exist an optimal point for the minimization problem
    with ?
Let
and
Solve the optimization problem
Let be given by
Below, the minimization problem
is analyzed for various subsets .
  1. Show that is a convex function.
  2. Let
    Show that cannot be an optimal solution to (9.32) . Find an optimal solution to (9.32) .
  3. Find an optimal solution in (9.32) for
  4. Are the optimal solutions in (2) and (3) unique?
Let be given by
  1. Show that is a convex function and solve the minimization problem .
    Now let
    and consider the minimization problem (P) given by
  2. Show using the KKT conditions that is not optimal for (P).
  3. Find an optimal solution for (P). Is it unique?