Lecture 7 | Machine Learning (Stanford)

0:12 - 0:15

This presentation is delivered by the Stanford Center for Professional
0:15 - 0:22

Development.
0:24 - 0:26

So welcome back.
0:26 - 0:31

And what I wanna do today is continue our discussion on support vector machines. And in
0:31 - 0:34

particular, I wanna talk about the optimal margin classifier.
0:34 - 0:38

Then I wanna take a brief digression and talk about primal and dual optimization problems, and
0:38 - 0:39

in particular,
0:39 - 0:42

what's called the KKT conditions.
0:42 - 0:45

And then we'll derive the dual of the optimization problem that I
0:45 - 0:47

had posed earlier.
0:47 - 0:51

And that will lead us into a discussion of kernels, which I won't really -
0:51 - 0:54

which we'll just get to say a couple of words about, but which I'll probably do
0:54 - 0:58

only in the next lecture.
0:58 - 1:02

And as part of today's lecture, I'll spend some time talking about optimization problems.
1:02 - 1:04

And in
1:04 - 1:08

the little time I have today, I won't really be able to do this topic justice.
1:08 - 1:13

I wanna talk about convex optimization and do that topic justice.
1:13 - 1:18

And so at this week's discussion session, the TAs will have more
1:18 - 1:22

time - they'll teach a discussion session focused on convex optimization
1:22 - 1:25

- sort of very beautiful and useful theory.
1:25 - 1:32

So if you want to learn more about that, listen to this Friday's discussion session.
1:33 - 1:37

Just to recap what we did in the previous lecture,
1:37 - 1:39

as
1:39 - 1:42

we were beginning to develop support vector machines, I
1:42 - 1:44

said that a
1:44 - 1:49

hypothesis is represented as h sub w, b of x equals g of
1:49 - 1:53

w transpose x + b, where
1:53 - 1:55
1:55 - 1:56

g of z
1:56 - 2:03

will be
2:04 - 2:09

+1 or -1, depending on whether z is greater than
2:09 - 2:10

0.
2:10 - 2:13

And I said that in our development of support vector machines,
2:13 - 2:17

we'll use - we'll change our convention to letting y be +1 or -1 to denote
2:17 - 2:20

the class labels.
2:20 - 2:24

So last time,
2:24 - 2:28

we also talked about the functional margin, which was this thing, gamma
2:28 - 2:34

hat i.
2:34 - 2:36
2:36 - 2:39

And so we had the intuition that
2:39 - 2:42

if the functional margin is a large positive number,
2:42 - 2:47

then that means that we are classifying a training example correctly and very
2:47 - 2:48

confidently. So
2:48 - 2:50

if yi is +1, we
2:50 - 2:53

would like w transpose xi + b to be very large.
2:53 - 2:55

And - if, excuse me, if
2:55 - 2:57

yi is -1,
2:57 - 2:59

then we'd like w transpose xi + b to be a large negative number. So
2:59 - 3:03

we'd sort of like functional margins to be large. We
3:03 - 3:06

also said the functional margin has a strange property -
3:06 - 3:10

that you can increase functional margin just by, say, taking your parameters,
3:10 - 3:11

w and b,
3:11 - 3:14

and multiplying them by
3:14 - 3:16

2.
3:16 - 3:18

And then we also
3:18 - 3:21

defined the geometric margin,
3:21 - 3:23

which
3:23 - 3:25
3:25 - 3:32

was
3:33 - 3:37

that we just - essentially, the functional margin
3:37 - 3:38

divided by
3:38 - 3:40

the norm of w.
3:40 - 3:43

And so the geometric margin had
3:43 - 3:47

the interpretation as being - I'll give
3:47 - 3:49

you a few examples. The
3:49 - 3:52

geometric margin, for example, is - has
3:52 - 3:56

the interpretation as a distance between a training example
3:56 - 3:57

and a hyperplane.
3:57 - 4:01

And it'll actually be a signed distance, so that this distance will be positive
4:01 - 4:05

if you're classifying the example correctly. And if you misclassify the example, this
4:05 - 4:08

distance - it'll be the minus of the distance,
4:08 - 4:10

between the point, between the training example,
4:10 - 4:12

and your separating hyperplane,
4:12 - 4:16

where your separating hyperplane is defined by the equation w transpose x
4:16 - 4:18

+
4:18 - 4:21

b =
4:21 - 4:24

0.
4:24 - 4:28

So - oh,
4:28 - 4:32

well, and I guess
4:32 - 4:37

I also defined these things, the functional margin and geometric margin,
4:37 - 4:38

with respect to a training set, which I defined as
4:38 - 4:45

the worst case, or the minimum, functional or geometric margin. So in our
4:49 - 4:53

development of the optimal margin classifier,
4:53 - 4:56

our learning algorithm would choose parameters w and b so as to maximize
4:56 - 5:00

the geometric margin. So our goal is to find the separating hyperplane
5:00 - 5:05

that separates the positive and negative examples with as large a distance as possible
5:05 - 5:06

between the hyperplane
5:06 - 5:11

and the positive and negative examples. And if you
5:11 - 5:15

go to choose parameters w and b to maximize this, one property of
5:15 - 5:17

the geometric margin
5:17 - 5:18

is that you
5:18 - 5:20

can actually scale w
5:20 - 5:24

and b arbitrarily. So you look at this definition for the geometric margin.
5:24 - 5:25
5:25 - 5:28

I can choose to multiply my parameters w and b
5:28 - 5:32

by 2 or by 10 or any other constant.
5:32 - 5:34

And it doesn't change
5:34 - 5:36

my geometric margin.
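The scaling behavior described here can be checked numerically. The following is a minimal sketch, assuming the usual definitions (functional margin y(w^T x + b), geometric margin equal to the functional margin divided by the norm of w); the point, the labels, and the helper names are made up for illustration and do not come from the lecture.

```python
import math

def functional_margin(w, b, x, y):
    """Functional margin of one example: y * (w^T x + b)."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(w, b, x, y):
    """Functional margin divided by the Euclidean norm of w."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm_w

# Hypothetical hyperplane w^T x + b = 0 and one positively labeled point.
w, b = [3.0, 4.0], -5.0
x, y = [3.0, 4.0], +1

f1 = functional_margin(w, b, x, y)
g1 = geometric_margin(w, b, x, y)

# Multiply both parameters by 2: same hyperplane, different functional margin.
w2, b2 = [2 * wi for wi in w], 2 * b
f2 = functional_margin(w2, b2, x, y)
g2 = geometric_margin(w2, b2, x, y)

print("functional margins:", f1, f2)  # the second is twice the first
print("geometric margins: ", g1, g2)  # unchanged by the rescaling
```

Doubling w and b doubles the functional margin but leaves the geometric margin alone, which is why any convenient scaling constraint can be imposed later.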
5:36 - 5:38

And one way of interpreting that is you're looking
5:38 - 5:39

at
5:39 - 5:43

just the separating hyperplane. You look at this line that's separating the positive and negative training examples. If
5:43 - 5:45

I
5:45 - 5:46

scale w
5:46 - 5:47

and b,
5:47 - 5:49

that doesn't change the position of this plane, though
5:49 - 5:52

because the equation w transpose x +
5:52 - 5:59

b = 0 is the same as the equation 2 w transpose x + 2b = 0. So it's the same straight
5:59 - 6:03

line. And what that means is that I can actually choose whatever scaling
6:03 - 6:06

for w and b is convenient for me.
6:06 - 6:08

And in particular,
6:08 - 6:09

as we'll use in a minute,
6:09 - 6:14

I can impose a constraint like that the norm of w equals 1
6:14 - 6:17

because this means that you can find a solution to w and b.
6:17 - 6:22

And then by rescaling the parameters, you can easily meet this condition, that the norm of the rescaled w
6:22 - 6:24

equals 1. And so I can
6:24 - 6:28

add the condition like this and then essentially not change the problem. Or I
6:28 - 6:32

can add other conditions. I can actually add a
6:32 - 6:35

condition
6:35 - 6:38

that - excuse me, the absolute value of w1 = 1. I can have
6:38 - 6:41

only one of these conditions at a time. And I'm adding the condition that the absolute
6:41 - 6:42

value of the
6:42 - 6:45

first component of w must be 1. And again,
6:45 - 6:49

you can find the optimal solution and just rescale w to meet this
6:49 - 6:51

condition.
6:51 - 6:53

And I can have other,
6:53 - 6:56

more esoteric conditions like that
6:56 - 6:57

because again,
6:57 - 7:01

this is a condition that you can solve for the optimal margin, and then just
7:01 - 7:03

by scaling,
7:03 - 7:05

w up and down, you can - you can then
7:05 - 7:09

ensure you meet this condition as well. So
7:09 - 7:13

again, I can have only one of these conditions at a time, not all of them.
7:13 - 7:15

And so our ability to choose
7:15 - 7:19

any scaling condition on w that's convenient to us
7:19 - 7:24

will be useful again in a second. All right.
7:24 - 7:29

So let's go ahead and write down the optimization problem. And again, my goal is to choose
7:29 - 7:30

parameters w and b
7:30 - 7:34

so as to maximize the geometric margin.
7:34 - 7:39

Here's my first attempt at writing down the optimization problem. Actually wrote this one down
7:39 - 7:40

right at
7:40 - 7:43

the end of the previous
7:43 - 7:47

lecture. We maximize gamma over the parameters gamma, w, and b
7:47 - 7:48

such that
7:48 - 7:49
7:49 - 7:54

-
7:54 - 7:57

the functional margin of example i is at least gamma
7:57 - 8:02

for all of my training examples.
8:02 - 8:02

Let's say I
8:02 - 8:06

choose to add this normalization condition.
8:06 - 8:11

So the normalization condition that the norm of w is equal to 1 just makes
8:11 - 8:15

the geometric and the functional margin the same.
8:15 - 8:18

And so I'm saying I want to find a value -
8:18 - 8:23

I want to find a value for gamma as big as possible
8:23 - 8:26

so that all of my training examples have functional margin
8:26 - 8:29

greater than or equal to gamma,
8:29 - 8:30

and
8:30 - 8:34

with the constraint that the norm of w equals 1,
8:34 - 8:36

functional margin and geometric margin are the same.
8:36 - 8:38

So it's the same.
8:38 - 8:40

Find the value for gamma so that
8:40 - 8:43

all the values - all the geometric margins are greater or equal to
8:43 - 8:46

gamma.
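Written out, the first attempt described above looks roughly like the following. This is a reconstruction from the spoken description, not a copy of the board; the superscript (i) indexing and the symbol m for the number of training examples are notational assumptions.

```latex
\begin{aligned}
\max_{\gamma,\, w,\, b} \quad & \gamma \\
\text{s.t.} \quad & y^{(i)}\left(w^{T} x^{(i)} + b\right) \ge \gamma, \quad i = 1, \dots, m, \\
& \lVert w \rVert = 1 .
\end{aligned}
```

With the norm of w fixed at 1, the left-hand side of each inequality is both the functional and the geometric margin of example i, so gamma ends up being the worst-case geometric margin.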
8:46 - 8:51

So you solve this optimization problem, then you have derived
8:51 - 8:55

the optimal margin classifier -
8:55 - 8:57

but
8:57 - 9:00

this is not a very nice optimization problem because this is a
9:00 - 9:04

nasty, nonconvex constraint. And it is asking that you
9:04 - 9:06

solve for parameters w
9:06 - 9:11

that lie on the surface of a unit sphere.
9:11 - 9:15

It lies on a unit circle - a unit sphere.
9:15 - 9:16
9:16 - 9:18

And so
9:18 - 9:21

if we can come up with a convex optimization problem, then
9:21 - 9:25

we'd be guaranteed that our gradient descent or other local search methods will
9:25 - 9:25

not have
9:25 - 9:26

local optima. And
9:26 - 9:31

it turns out this is an example of a nonconvex constraint. This is a nasty constraint
9:31 - 9:38

that I would like to get rid of. So
9:41 - 9:43

let's change the optimization problem
9:43 - 9:48

one more time.
9:48 - 9:52

Now, let me
9:52 - 9:57

pose a slightly different optimization problem. Let
9:57 - 10:03

me maximize the functional margin divided by the norm of w
10:03 - 10:05

subject
10:05 - 10:12

to yi times (w transpose xi + b) being greater than or equal to gamma hat.
10:13 - 10:15

So in other words, I want to find
10:15 - 10:17

a number, gamma hat,
10:17 - 10:20

so that every one of my training examples has functional margin greater
10:20 - 10:23

than or equal to gamma hat,
10:23 - 10:27

and my optimization objective is I want to maximize gamma hat divided by the norm of
10:27 - 10:28
10:28 - 10:31

w. And so I wanna maximize the
10:31 - 10:36

functional margin divided by the norm of w. And we saw previously the functional
10:36 - 10:38

margin divided by the norm of
10:38 - 10:42

w is just the geometric margin, and so this is a different way of posing
10:42 - 10:45

the same
10:45 - 10:47

optimization problem. [Inaudible] confused, though. Are there questions
10:47 - 10:54

about this?
10:54 - 10:54
10:54 - 10:57

Student:[Inaudible] the second statement has to be made of the functional margin y
10:57 - 10:59

divided by - why don't you just have it
10:59 - 11:02

the geometric
11:02 - 11:05

margin? Why do
11:05 - 11:09

you [inaudible]? Instructor (Andrew Ng):[Inaudible] say it again? Student:For the second statement, where we're saying the data of the functional margin is divided [inaudible]. Instructor (Andrew
11:09 - 11:11

Ng):Oh, I see, yes. Student:[Inaudible]
11:11 - 11:13
11:13 - 11:17

is that [inaudible]? Instructor (Andrew Ng):So let's see, this is the functional margin, right? This is not the geometric margin.
11:17 - 11:19

Student:Yeah.
11:19 - 11:26

Instructor (Andrew Ng):So - oh, I want to divide by the norm of w in my optimization objective. Student:I'm just wondering how come you end up dividing also under the second stage [inaudible] the functional
11:26 - 11:28

margin.
11:28 - 11:32

Why are you dividing there by the norm of w? Instructor (Andrew Ng):Let's see. I'm not sure I get the question. Let me
11:32 - 11:36

try saying
11:36 - 11:40

this again. So here's my goal. My - I want [inaudible]. So
11:40 - 11:42

let's see,
11:42 - 11:47

the parameters of this optimization problem are gamma hat, w, and b - so
11:47 - 11:50

the convex optimization software
11:50 - 11:52

solves this problem for some set of parameters gamma
11:52 - 11:54

w and b.
11:54 - 11:57

And I'm imposing the constraint that
11:57 - 11:59

whatever values it comes up with,
11:59 - 12:04

yi times (w transpose xi + b) must be greater than or equal to gamma hat.
12:04 - 12:06

And so this means that
12:06 - 12:09

the functional margin of every example had
12:09 - 12:09

better be
12:09 - 12:11

greater than or equal to gamma hat. So
12:11 - 12:14

there's a constraint to the function margin and a constraint to the gamma hat.
12:14 - 12:18

But what I care about is not really maximizing the functional margin. What I
12:18 - 12:19

really care about -
12:19 - 12:21

in other words, in optimization objective,
12:21 - 12:25

is maximizing gamma hat divided by the norm of w,
12:25 - 12:28

which is the geometric margin.
12:28 - 12:32

So in other words, my optimization objective is I want to maximize the functional margin
12:32 - 12:35

divided by the norm of
12:35 - 12:38

w, subject to the constraint that every example must have
12:38 - 12:40

functional margin at least gamma hat. Does that make
12:40 - 12:46

sense now? Student:[Inaudible] when you
12:46 - 12:48

said that to maximize gamma
12:48 - 12:51

or gamma hat, respect to gamma w and with
12:51 - 12:52

respect to gamma hat
12:52 - 12:57

so that
12:57 - 13:01

[inaudible] gamma hat are no
13:01 - 13:05

longer [inaudible]? Instructor (Andrew Ng):So this is the -
13:05 - 13:07

so it turns out -
13:07 - 13:11

so this is how I write down the - this is how I write down an optimization problem
13:11 - 13:13

in order to solve for
13:13 - 13:16

the geometric margin. What is it -
13:16 - 13:18

so it turns out that
13:18 - 13:21

the question is this - is gamma hat a function of w and b? And it turns out
13:21 - 13:22

that
13:22 - 13:25

in my previous mathematical definition, it was,
13:25 - 13:30

but the way I'm going to pose this as an optimization problem is
13:30 - 13:35

I'm going to ask the convex optimization solver - and this is software - let's say you have software for solving
13:35 - 13:38

convex optimization problems -
13:38 - 13:40

then I'm going to
13:40 - 13:44

pretend that these are independent variables and ask my convex optimization software
13:44 - 13:46

to find me values for gamma, w, and b,
13:46 - 13:51

to make this value as big as possible and subject to this constraint.
13:51 - 13:53

And it'll turn out that
13:53 - 13:55

when it does that, it will choose - or
13:55 - 13:59

obviously, it will choose for gamma to be as big as possible
13:59 - 14:01

because the optimization objective is this:
14:01 - 14:03

You're trying to maximize gamma hat.
14:03 - 14:06

So for a fixed value of w and b,
14:06 - 14:09

my software will choose to make gamma hat as big as possible -
14:09 - 14:13

well, but how big can we make gamma hat? Well, it's limited by these
14:13 - 14:15

constraints. It says that every training example
14:15 - 14:17

must have functional margin
14:17 - 14:20

greater than or equal to gamma hat.
14:20 - 14:24

And so my - the biggest you can make gamma hat
14:24 - 14:28

will be the value of the smallest functional margin.
14:28 - 14:30

And so when you solve this optimization problem,
14:30 - 14:34

the value of gamma hat you get out will be, indeed,
14:34 - 14:39

the minimum of the functional margins of your training set. Okay, so
14:39 - 14:41

Justin? Student:Yeah, I was just
14:41 - 14:42

wondering, I
14:42 - 14:46

guess I'm a little confused because it's like, okay, you have two class of data. And you can say, "Okay, please draw me a line
14:46 - 14:50

such that you maximize the distance between - the smallest distance that [inaudible] between the line and
14:50 - 14:52

the
14:52 - 14:54

data points."
14:54 - 14:56

And it seems like that's kind of what we're doing, but it's - it seems like
14:56 - 15:00

this is more complicated than that. And I guess I'm wondering what
15:00 - 15:04

is the difference. Instructor (Andrew Ng):I see. So I mean, this is - the question is [inaudible]. Two classes of data - trying to find a separating hyperplane. And
15:04 - 15:09

this seems
15:09 - 15:12

more complicated than trying to find a line [inaudible]. So I'm
15:12 - 15:18

just repeating the questions in case - since I'm not sure how all the audio catches it.
15:18 - 15:21

So the answer is this is actually exactly that problem. This is exactly that problem
15:21 - 15:23

of
15:23 - 15:27

given two classes of data, positive and negative examples,
15:27 - 15:30

this is exactly the formalization of the problem
15:30 - 15:32

where my goal is to find
15:32 - 15:36

a line that separates the two - the positive and negative examples,
15:36 - 15:43

maximizing the worst-case distance between the [inaudible] point and this line. Okay? Yeah, [Inaudible]? Student:So why do you care about the worst-case
15:43 - 15:44

distance [inaudible]?
15:44 - 15:46

Instructor (Andrew Ng):Yeah, let me - for now, why do we care about the worst-case distance? For now,
15:46 - 15:48
15:48 - 15:53

let's just say - let's just care about the worst-case distance for now. We'll come back,
15:53 - 15:55

and we'll fix that later. We'll - that's a -
15:55 - 15:58

caring about the worst case is just -
15:58 - 16:01

is just a nice way to formulate this optimization problem. I'll come back, and I'll
16:01 - 16:04

change that later. Okay,
16:04 - 16:10

raise your hand if this makes sense - if this formulation makes sense? Okay, yeah, cool.
16:10 - 16:14

Great. So let's see -
16:14 - 16:16

so this is just a different way of posing
16:16 - 16:18

the same optimization problem. And
16:18 - 16:22

on the one hand, I've gotten rid of this nasty, nonconvex constraint, while on
16:22 - 16:24

the other hand, I've now
16:24 - 16:25

added
16:25 - 16:30

a nasty, nonconvex objective. In particular, this is not a convex function in parameters
16:30 - 16:31

w.
16:31 - 16:33

And so you can't -
16:33 - 16:37

you don't have the usual guarantees like if you run gradient descent, you'll converge to the
16:37 - 16:38

global minimum.
16:38 - 16:40

At least that
16:40 - 16:45

does not follow immediately from this because this is nonconvex.
16:45 - 16:48
16:48 - 16:51

So what
16:51 - 16:55

I'm going to do is,
16:55 - 16:57

earlier, I said that I can impose
16:57 - 16:58

any of
16:58 - 17:03

a number of even fairly bizarre scaling constraints on w. So you can
17:03 - 17:06

choose any scaling constraint like this, and things are still fine.
17:06 - 17:08

And so here's
17:08 - 17:14

the scaling I'm going to choose to add. Again, I'm
17:14 - 17:18

gonna assume for the purposes of today's lecture, I'm gonna assume that these examples are linearly
17:18 - 17:20

separable, that
17:20 - 17:23

you can actually separate the positive and negative classes, and that we'll come back and
17:23 - 17:26

fix this later as well.
17:26 - 17:29

But here's the scaling constraint I want to impose on w. I want
17:29 - 17:34

to impose a constraint that
17:34 - 17:38

the functional margin is equal to 1.
17:38 - 17:40

And another way of writing that
17:40 - 17:44

is that I want to impose a constraint that min
17:44 - 17:51

over i of yi times (w transpose xi + b) is equal to 1 -
17:52 - 17:55

that in the worst case, the functional margin is equal to 1.
17:55 - 17:59

And clearly, this is a scaling constraint because if
17:59 - 18:02
18:02 - 18:06

you solve for w and b, and you find that your worst-case functional margin is actually 10 or
18:06 - 18:07

whatever,
18:07 - 18:10

then by dividing through w and b by a factor of 10, I can get
18:10 - 18:13

my functional margin to be equal to 1.
18:13 - 18:16

So this is a scaling constraint [inaudible] would
18:16 - 18:17

imply. And this is
18:17 - 18:19

just more compactly written
18:19 - 18:23

as follows. This is imposing a constraint that the functional margin be
18:23 - 18:26

equal to 1.
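The rescaling argument can be made concrete with a small sketch. It assumes a toy, linearly separable data set and a separating (w, b) whose worst-case functional margin happens to be 10, mirroring the number used in the lecture; the data and helper names are hypothetical.

```python
import math

def functional_margins(w, b, data):
    """List of y * (w^T x + b) over a data set of (x, y) pairs."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) for x, y in data]

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separable data: two positive and two negative examples.
data = [([2.0, 2.0], +1), ([4.0, 0.0], +1), ([0.0, 0.0], -1), ([-1.0, 1.0], -1)]
w, b = [5.0, 5.0], -10.0

worst = min(functional_margins(w, b, data))           # worst-case functional margin
geo_before = worst / norm(w)                          # worst-case geometric margin

# Divide w and b by the worst-case functional margin to meet the constraint.
w_s, b_s = [wi / worst for wi in w], b / worst
worst_scaled = min(functional_margins(w_s, b_s, data))
geo_after = worst_scaled / norm(w_s)

print("worst-case functional margin before/after:", worst, worst_scaled)
print("worst-case geometric margin before/after: ", geo_before, geo_after)
print("1 / ||w|| after rescaling:                ", 1.0 / norm(w_s))
```

After rescaling, the worst-case functional margin is 1 and the geometric margin is just 1 over the norm of w, which is what turns the objective into something that depends on w alone.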
18:26 - 18:29

And so if we just take
18:29 - 18:32

what I wrote down as No. 2 of our previous optimization problem and add the
18:32 - 18:35

scaling constraint,
18:35 - 18:38

we then get the following optimization problem:
18:38 - 18:45

min over w, b.
18:57 - 19:01

I guess previously, we had a maximization
19:01 - 19:06

over gamma hat divided by the norm of w. So that's maximizing
19:06 - 19:10

1 over the norm of w, but so that's the same as minimizing the norm of w squared. So maximizing
19:10 - 19:11

1 over the norm of w
19:11 - 19:15

is minimizing the norm of w squared. And then these
19:15 - 19:16

are our constraints.
19:16 - 19:21

Since I've added the constraint that the functional margin
19:21 - 19:24

is equal to 1. And this is actually
19:24 - 19:27

my
19:27 - 19:30

final - well,
19:30 - 19:37

final formulation of the optimal margin classifier problem, at least for now.
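For reference, the chain of reformulations that leads here can be summarized as below. This is a reconstruction from the spoken description; the (i) superscripts and m are notational assumptions, and many write-ups put a factor of one half in front of the squared norm, which does not change the minimizer.

```latex
\begin{aligned}
&\max_{\hat\gamma,\, w,\, b} \ \frac{\hat\gamma}{\lVert w \rVert}
 \quad \text{s.t.} \quad y^{(i)}\left(w^{T} x^{(i)} + b\right) \ge \hat\gamma,
 \quad i = 1, \dots, m \\[4pt]
&\xrightarrow{\ \hat\gamma \,=\, 1\ }\quad
 \max_{w,\, b} \ \frac{1}{\lVert w \rVert}
 \quad \text{s.t.} \quad y^{(i)}\left(w^{T} x^{(i)} + b\right) \ge 1 \\[4pt]
&\Longleftrightarrow\quad
 \min_{w,\, b} \ \lVert w \rVert^{2}
 \quad \text{s.t.} \quad y^{(i)}\left(w^{T} x^{(i)} + b\right) \ge 1,
 \quad i = 1, \dots, m .
\end{aligned}
```

The last form has a convex quadratic objective and linear inequality constraints in w and b.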
19:37 - 19:38

So the picture
19:38 - 19:41

to
19:41 - 19:45

keep in mind for this, I guess, is that our
19:45 - 19:48

optimization objective is you want to minimize the norm of w squared. And so our
19:48 - 19:50

optimization objective
19:50 - 19:52

is just a convex quadratic function.
19:52 - 19:55

And [inaudible] those pictures [inaudible] can draw
19:55 - 19:57

it. So it -
19:57 - 19:59

if the axes are w1 and w2, and you
19:59 - 20:03

want to minimize the quadratic function like this - so quadratic function
20:03 - 20:06

just has contours that look like this.
20:06 - 20:10

And moreover, you have a number of linear constraints in your parameters, so you may have linear
20:10 - 20:12

constraints that eliminates that half space or
20:12 - 20:17

linear constraint eliminates that half space [inaudible]. So there's that half space
20:17 - 20:20
20:20 - 20:23

and so on.
20:23 - 20:26

And so the picture is you have a quadratic function,
20:26 - 20:27

and you're ruling out
20:27 - 20:29
20:29 - 20:34

various half spaces with each of these linear constraints. And I hope - if
20:34 - 20:38

you can picture this in 3D, I guess [inaudible] kinda draw our own 3D, hope you can
20:38 - 20:43

convince yourself that this is a convex problem that has no local optima. So if you
20:43 - 20:46

run gradient
20:46 - 20:50

descent within this set of points that hasn't been ruled out, then you converge to the
20:50 - 20:53

global optimum.
20:53 - 20:56

And so that's the convex optimization problem.
20:56 - 20:59

The - does this [inaudible] nice and
20:59 - 21:02

[inaudible].
21:02 - 21:09

Questions about this?
21:17 - 21:24

Actually, just raise your hand if this makes sense. Okay, cool.
21:26 - 21:30

So this gives you the optimal margin classifier algorithm.
21:30 - 21:32

And it turns out that
21:32 - 21:34

this is the convex optimization problem,
21:34 - 21:37

so you can actually take this formulation of the problem
21:37 - 21:41

and throw it at off-the-shelf software - what's called a QP or quadratic
21:41 - 21:45

program software. This kind of optimization problem is called a quadratic program,
21:45 - 21:49

where you have a convex quadratic objective function and linear constraints -
21:49 - 21:54

so you can actually download software to solve these optimization problems for you.
21:54 - 21:56

Usually, as you wanna use the -
21:56 - 21:57

use
21:57 - 22:00

[inaudible] because you have constraints like these, although you could actually modify [inaudible] work with
22:00 - 22:04

this, too.
22:04 - 22:06

So we could just declare success and say that we're done with this formulation of the
22:06 - 22:08

problem. But
22:08 - 22:11

what I'm going to do now is take a
22:11 - 22:15

digression to talk about primal and dual optimization problems.
22:15 - 22:16

And in particular, I'm going to
22:16 - 22:18

-
22:18 - 22:22

later, I'm going to come back and derive yet another very different form
22:22 - 22:25

of this optimization problem.
22:25 - 22:29

And the reason we'll do that is because it turns out this optimization problem has
22:29 - 22:33

certain properties that make it amenable to very efficient algorithms.
22:33 - 22:35

And moreover, I'll be deriving
22:35 - 22:37

what's called the dual formulation of this
22:37 - 22:40

that allows us to
22:40 - 22:43

apply the optimal margin classifier even in
22:43 - 22:47

very high-dimensional feature spaces - even in sometimes infinite
22:47 - 22:50

dimensional feature spaces. So we
22:50 - 22:52

can come back to that later.
22:52 - 22:53

But
22:53 - 22:54

let me now,
22:54 - 23:01

since I'm talking about
23:03 - 23:06

convex optimization. So how many here is - how many of you, from, I don't
23:06 - 23:08

know, calculus,
23:08 - 23:12

remember the method of Lagrange multipliers for
23:12 - 23:14

solving an optimization problem
23:14 - 23:18

like minimum - minimization, maximization problem subject to some constraint?
23:18 - 23:23

How many of you remember the method of Lagrange multipliers for that? Oh, okay,
23:23 - 23:24

cool. Some of you, yeah.
23:24 - 23:28

So if you don't remember, don't worry. I - I'll describe that briefly
23:28 - 23:28

here
23:28 - 23:32

as well, but what I'm really gonna do is talk about the generalization of this method
23:32 - 23:34

of Lagrange multipliers
23:34 - 23:38

that you may or may not have seen in some calculus classes. But if you haven't
23:38 - 23:40

seen it before, don't worry about it.
23:40 - 23:43
23:43 - 23:49

So the method of Lagrange multipliers is - was - well, suppose there's some
23:49 - 23:53

function you want to minimize, or minimize f of w.
23:53 - 23:56

subject to
23:56 - 24:03

some set of constraints that each hi of w must equal 0 - for i = 1 through l. And
24:03 - 24:07

given this constraint, I'll actually usually write it in vectorial
24:07 - 24:10

form in which I write h of w
24:10 - 24:11

as this
24:11 - 24:17

vector-valued function.
24:17 - 24:21

So that is equal to 0, where 0 is the arrow on top. I used
24:21 - 24:25

that to denote the vector of
24:25 - 24:27

all 0s. So you want to solve this optimization problem.
24:27 - 24:29
24:29 - 24:32

Some of you have seen method of Lagrange multipliers where
24:32 - 24:35

you construct this
24:35 - 24:39

Lagrangian,
24:39 - 24:46
24:46 - 24:50

which is the original optimization objective plus some Lagrange multipliers times the
24:50 - 24:52

hi constraints.
24:52 - 24:55

And these parameters - they derive -
24:55 - 25:02

we call the Lagrange multipliers.
25:02 - 25:06

And so the way you actually solve the optimization problem is
25:06 - 25:11

you take the partial derivative of this with respect to the original parameters
25:11 - 25:14

and set that to 0. So
25:14 - 25:20

you also take the partial derivative with respect to your Lagrange multipliers, and set that to 0.
25:20 - 25:24

And then the theorem, due to, I guess,
25:24 - 25:25

Lagrange,
25:25 - 25:31

was that for w - for some value w star to be a
25:31 - 25:32

solution,
25:32 - 25:35

it
25:35 - 25:42

is necessary
25:43 - 25:48

that - can this
25:48 - 25:49

be
25:49 - 25:50
25:50 - 25:54

the star? Student:Right. Instructor (Andrew Ng):The backwards e - there exists. So there exists
25:54 - 25:55

beta star
25:55 - 26:00
26:00 - 26:07
26:17 - 26:20

such that those partial derivatives are
26:20 - 26:22

equal to
26:22 - 26:22

0.
26:22 - 26:24

So the
26:24 - 26:25

method
26:25 - 26:29

of Lagrange multipliers is to solve this problem,
26:29 - 26:31

you construct a Lagrangian,
26:31 - 26:33

take the derivative with respect to
26:33 - 26:35

the original parameters b, the original
26:35 - 26:40

parameters w, and with respect to the Lagrange multipliers beta.
26:40 - 26:42

Set the partial derivatives equal to 0, and solve
26:42 - 26:45

for our solutions. And then you check each of the solutions to see if it is indeed a minimum.
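As a small worked instance of this recipe, consider a toy problem that is not from the lecture: minimize the squared length of a two-dimensional w subject to a single linear equality constraint. The symbols follow the lecture's f, h, and beta.

```latex
\begin{aligned}
&\min_{w} \ f(w) = w_1^{2} + w_2^{2}
 \quad \text{s.t.} \quad h(w) = w_1 + w_2 - 1 = 0, \\[4pt]
&\mathcal{L}(w, \beta) = w_1^{2} + w_2^{2} + \beta\,(w_1 + w_2 - 1), \\[4pt]
&\frac{\partial \mathcal{L}}{\partial w_1} = 2 w_1 + \beta = 0, \qquad
 \frac{\partial \mathcal{L}}{\partial w_2} = 2 w_2 + \beta = 0, \qquad
 \frac{\partial \mathcal{L}}{\partial \beta} = w_1 + w_2 - 1 = 0, \\[4pt]
&\Rightarrow\quad w_1^{*} = w_2^{*} = \tfrac{1}{2}, \qquad \beta^{*} = -1 .
\end{aligned}
```

Here the stationary point is the closest point on the line to the origin, so it is indeed the minimum.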
26:45 - 26:48

Great.
26:48 - 26:52

So great
26:52 - 26:57

-
26:57 - 26:59

so what I'm going to do is actually
26:59 - 27:00

write
27:00 - 27:03

down the generalization of this. And
27:03 - 27:06

if you haven't seen that before, don't worry about it. This is [inaudible].
27:06 - 27:11

So what I'm going to do is actually write down the generalization of this to solve
27:11 - 27:16

a slightly more difficult type of constrained optimization problem,
27:16 - 27:20

which is suppose you want to minimize f of w
27:20 - 27:24

subject to the constraint that gi of w,
27:24 - 27:26

excuse me,
27:26 - 27:31

is less than or equal to
27:31 - 27:35

0, and that hi of w is equal to 0.
27:35 - 27:38

And
27:38 - 27:43

again, using my vector notation, I'll write this as g of w is less than or equal to 0. And h of w is equal to 0.
27:43 - 27:45

So
27:45 - 27:49

in this
27:49 - 27:52

case, we now have inequality constraints as well as
27:52 - 27:59

equality constraints.
28:02 - 28:03

I then have
28:03 - 28:07

a Lagrangian, or it's actually - let's call it, say, a generalized
28:07 - 28:08

Lagrangian,
28:08 - 28:13

which is now a function of my original optimization parameters w,
28:13 - 28:18

as well as two sets of Lagrange multipliers, alpha and beta.
28:18 - 28:20

And so this will be
28:20 - 28:27

f of w.
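The rest of the expression is written on the board rather than spoken. A reconstruction consistent with the alpha i, gi, beta i, and hi terms referred to later in the lecture is given below; the count k of inequality constraints is an assumed symbol, while l is the number of equality constraints mentioned earlier.

```latex
\mathcal{L}(w, \alpha, \beta)
  = f(w)
  + \sum_{i=1}^{k} \alpha_i \, g_i(w)
  + \sum_{i=1}^{l} \beta_i \, h_i(w)
```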
28:29 - 28:33

Now,
28:33 - 28:37
28:37 - 28:39

here's a
28:39 - 28:40

cool part.
28:40 - 28:43

I'm going to define theta
28:43 - 28:44
28:44 - 28:46

subscript p of
28:46 - 28:47

w
28:47 - 28:49

to be equal to
28:49 - 28:52

max over alpha, beta subject to the
28:52 - 28:59

constraint that the alphas are greater than or equal
28:59 - 29:06

to 0 of the Lagrangian.
29:10 - 29:15

And so
29:15 - 29:18

I want you to consider
29:18 - 29:22

the optimization problem
29:22 - 29:24

min over w of
29:24 - 29:31

max over alpha, beta, such that the alphas are greater than or equal to 0, of the Lagrangian. And
29:32 - 29:36

that's just equal to min over w,
29:36 - 29:42

theta p of
29:42 - 29:48

w. And just to give this a name, the - the subscript p here stands for the
29:48 - 29:50

primal problem.
29:50 - 29:57

And that refers to this entire thing.
29:57 - 30:01

This optimization problem that I've written down is called the primal problem. This means
30:01 - 30:04

it's the original optimization problem which we're solving.
30:04 - 30:08

And later on, I'll derive another version of this, but that's what p
30:08 - 30:13

stands for. It's a - this is a primal problem.
30:13 - 30:18

Now, I want you to look at - consider theta p again. And in particular, I wanna
30:18 - 30:22

consider the problem of what happens if you minimize w
30:22 - 30:24

- minimize as a function of w
30:24 - 30:27

this quantity theta p.
30:27 - 30:34
30:34 - 30:38

So let's look at what theta p of w is. Notice
30:38 - 30:40

that
30:40 - 30:42

if
30:42 - 30:43

gi of w
30:43 - 30:47

is greater than 0 - so let's pick a value of w.
30:47 - 30:50

And let's ask what is theta p of w?
30:50 - 30:57

So if w violates one of your primal problem's constraints,
30:57 - 31:00
31:00 - 31:04

then theta p of w would be infinity.
31:04 - 31:06

Why is that?
31:06 - 31:09

[Inaudible] p [inaudible] second.
31:09 - 31:13

Suppose I pick a value of w that violates one of these constraints. So gi of
31:13 - 31:17

w is positive.
31:17 - 31:22

Then - well, theta p is this - maximize this function of alpha and beta - the Lagrangian. So if
31:22 - 31:26

one of these gi of w's is positive,
31:26 - 31:30

then by setting the corresponding alpha i to plus infinity, I can make this
31:30 - 31:32

arbitrarily large.
31:32 - 31:35

And so if w violates one of my
31:35 - 31:38

primal problem's constraints in one of the gis, then
31:38 - 31:42

max over alpha of this Lagrangian will be plus
31:42 - 31:49

infinity. There's some of - and in the same way - I guess in a similar way,
31:50 - 31:55

if hi of w is not equal to
31:55 - 31:58

0,
31:58 - 32:03

then theta p of w will also be infinity for a very similar reason because
32:03 - 32:06

if hi of w is not equal to 0 for some value of i,
32:06 - 32:10

then in my Lagrangian, I had a beta i times hi term.
32:10 - 32:13

And so by setting beta i to be plus infinity or minus
32:13 - 32:16

infinity depending on the sign of hi,
32:16 - 32:20

I can make this plus infinity as well.
32:20 - 32:22

And otherwise,
32:22 - 32:29
32:30 - 32:34
32:34 - 32:38

theta p of w is just equal to f of w. Turns
32:38 - 32:40

out if
32:40 - 32:42

I had a value of w that
32:42 - 32:45

satisfies all of the gi and the hi constraints,
32:45 - 32:48

then when we maximize in terms of alpha and beta
32:48 - 32:49

- the max over all the Lagrange multiplier
32:49 - 32:54

terms will actually be attained by
32:54 - 32:58

setting all the Lagrange multiplier terms to be 0,
32:58 - 32:59

and so theta P
32:59 - 33:04

is just left with f of w. Thus, theta
33:04 - 33:06

p
33:06 - 33:11

of w is equal to
33:11 - 33:14

f of w
33:14 - 33:17

if constraints are
33:17 - 33:18

satisfied
33:18 - 33:21

[inaudible]
33:21 - 33:23

the gi and
33:23 - 33:24

hi constraints,
33:24 - 33:28

and is equal to plus infinity
33:28 - 33:32

otherwise.
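Written out, the case analysis above can be summarized as follows. This is a sketch in LaTeX, assuming the generalized Lagrangian has the usual form with k inequality constraints gi and l equality constraints hi, and that the max is taken over alpha i greater than or equal to 0. The counts k and l are notation assumed here, not read off the board.

```latex
\begin{align*}
\mathcal{L}(w,\alpha,\beta) &= f(w) + \sum_{i=1}^{k}\alpha_i\, g_i(w) + \sum_{i=1}^{l}\beta_i\, h_i(w) \\
\theta_P(w) &= \max_{\alpha,\beta\,:\,\alpha_i \ge 0} \mathcal{L}(w,\alpha,\beta)
 = \begin{cases}
     f(w) & \text{if } g_i(w) \le 0 \text{ and } h_i(w) = 0 \text{ for all } i \\
     +\infty & \text{otherwise}
   \end{cases}
\end{align*}
```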
33:32 - 33:34

So the problem I
33:34 - 33:39

wrote down, to minimize as a function of w -
33:39 - 33:45

theta p of w - this is [inaudible] problem. That's
33:45 - 33:48

just
33:48 - 33:51

exactly the same problem as my original primal problem
33:51 - 33:55

because if you choose a value of w that violates the constraints, you get infinity.
33:55 - 33:58

And if you satisfy the constraints, you get f of w.
33:58 - 34:00

So this is really just the same as - well,
34:00 - 34:01

we'll say,
34:01 - 34:05

"Satisfy the constraints, and minimize f of w." That's really what
34:05 - 34:09

minimizing theta P
34:09 - 34:16

of w is. Raise your hand if this makes sense. Yeah, okay, cool. So all right.
34:25 - 34:28

I hope no one's getting mad at me because I'm doing so much work, and when we come back, it'll be exactly
34:28 - 34:31

the same thing we started with. So here's
34:31 - 34:33

the cool part.
34:33 - 34:40

Let me now define the dual problem. Define
34:40 - 34:42

theta
34:42 - 34:45

d
34:45 - 34:47

and D stands for dual,
34:47 - 34:50

and this is a function of alpha and beta. It's a function of the Lagrange
34:50 - 34:53

multipliers. It's not of w.
34:53 - 34:58

To define this, we minimize over w
34:58 - 35:05

my generalized Lagrangian. And
35:06 - 35:13

my dual problem is this. So in
35:17 - 35:20

other words, this is max over
35:20 - 35:22

that.
35:22 - 35:23
35:23 - 35:28
35:28 - 35:33

And so this is my dual optimization problem. To maximize over alpha and
35:33 - 35:34

beta, theta
35:34 - 35:35

D of alpha
35:35 - 35:38

and beta. So this optimization problem, I
35:38 - 35:41

guess, is my dual problem. I
35:41 - 35:45

want you to compare this to our previous primal optimization problem.
35:45 - 35:47

The only difference is that
35:47 - 35:52

I took the max and min, and I switched the order around with the max and
35:52 - 35:56

min. That's the difference in the primal and the dual optimization [inaudible].
35:56 - 36:00

And it turns out that
36:00 - 36:02
36:02 - 36:05

it's a - it's sort of - it's a fact - it's
36:05 - 36:06

true, generally, that d
36:06 - 36:10

star is less than or equal to p star. In other words, I think I defined
36:10 - 36:16

p star previously. p star was the value of the primal optimization problem.
36:16 - 36:20

And in other words, that it's just generally true
36:20 - 36:25

that the max of the min of something is less than or equal to the min of the max
36:25 - 36:28

of something.
36:28 - 36:29

And this is a
36:29 - 36:30

general fact.
36:30 - 36:33

And just as a concrete example, the
36:33 - 36:37

max over y in the set {0, 1} -
36:37 - 36:40

oh, excuse me, of the min over x in the
36:40 - 36:44

set {0, 1}
36:44 - 36:48

of indicator of x =
36:48 - 36:51

y - this is
36:51 - 36:58

less than or equal to
37:03 - 37:10

min over x of max over y of that same indicator.
37:10 - 37:14

So this equality - this inequality actually holds true for any
37:14 - 37:16

function you might put in here.
37:16 - 37:18

And this is one specific example
37:18 - 37:21

where
37:21 - 37:24

the min over x, y - excuse me, min over x of indicator of x equals y -
37:24 - 37:27

this is always equal to 0
37:27 - 37:31

because whatever y is, you can choose x to be something different. So
37:31 - 37:33

this is always 0,
37:33 - 37:34

whereas if
37:34 - 37:38

I exchange the order to min
37:38 - 37:40

and max, then the thing here is always equal to
37:40 - 37:42

1. So 0 is less than or equal to
37:42 - 37:45

1. And more generally, this min/max - excuse
37:45 - 37:51

me, this max/min being less than or equal to the min/max holds true for any function you might put in
37:51 - 37:52

there.
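The indicator example can be checked by brute force. This is a small sketch in Python; the names `values`, `indicator`, `max_min` and `min_max` are made up for illustration.

```python
# Indicator 1{x == y} over x, y in {0, 1}.
values = (0, 1)


def indicator(x, y):
    return 1 if x == y else 0


# max over y of (min over x): whatever y is, x can be chosen
# different from y, so the inner min is always 0.
max_min = max(min(indicator(x, y) for x in values) for y in values)

# min over x of (max over y): whatever x is, y can be chosen
# equal to x, so the inner max is always 1.
min_max = min(max(indicator(x, y) for y in values) for x in values)

print(max_min, min_max)
assert max_min <= min_max  # max-min never exceeds min-max
```

Here the inequality is strict, 0 versus 1, which is the kind of gap that can separate the dual value from the primal value when the conditions for equality do not hold.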
37:52 - 37:58

But it turns out that sometimes under certain conditions,
37:58 - 37:59
37:59 - 38:01
38:01 - 38:02
38:02 - 38:03
38:03 - 38:07

these two optimization problems have the same value. Sometimes under certain
38:07 - 38:08

conditions,
38:08 - 38:10

the primal and the dual problems
38:10 - 38:12

have the same value.
38:12 - 38:15

And so
38:15 - 38:16

you might be able to solve
38:16 - 38:20

the dual problem rather than the primal problem.
38:20 - 38:22

And the reason to do that is that
38:22 - 38:26

sometimes, which we'll see in the optimal margin classifier problem, the support vector machine problem,
38:26 - 38:31

the dual problem turns out to be much easier - it often has many useful properties that
38:31 - 38:34

we'll make use of
38:34 - 38:41

compared to the primal. So for the sake of -
38:42 - 38:48

so
38:48 - 38:55

what
39:02 - 39:05

I'm going to do now is write down formally the certain conditions under which that's
39:05 - 39:09

true - where the primal and the dual problems are equivalent.
39:09 - 39:14

And so our strategy for working out the [inaudible] of support vector machine algorithm
39:14 - 39:18

will be that we'll write down the primal optimization problem, which we did
39:18 - 39:21

previously, for the maximum margin classifier.
39:21 - 39:22

And then we'll
39:22 - 39:24

derive the dual optimization problem for that.
39:24 - 39:26

And then we'll solve the dual problem.
39:26 - 39:29

And by modifying that a little bit, that's how we'll derive this support vector machine. But let me ask you - for
39:29 - 39:30
39:30 - 39:34

now, let me just first, for
39:34 - 39:37

the sake of completeness, I just write down the conditions under which the primal
39:37 - 39:42

and the dual optimization problems give you the same solutions. So let f
39:42 - 39:45

be convex. If you're not
39:45 - 39:47
39:47 - 39:50

sure what convex means, for the purposes of this class, you can take it to
39:50 - 39:53

mean that the
39:53 - 39:56

Hessian, H, is positive semi-definite, so it just means it's
39:56 - 39:58

a [inaudible] function like that.
39:58 - 39:59

And
39:59 - 40:01

once you learn more about optimization
40:01 - 40:07

- again, please come to this week's discussion session taught by the TAs.
40:07 - 40:09

Then
40:09 - 40:11

suppose
40:11 - 40:13

hi
40:13 - 40:18

- the hi constraints are affine, and what that means is that hi of w equals a i
40:18 - 40:25

transpose w plus b i. This actually
40:25 - 40:29

means the same thing as linear. Without the term b i here, we say
40:29 - 40:30

that hi is linear;
40:30 - 40:34

where we have a constant intercept as well, this is technically called affine rather than
40:34 - 40:37

linear.
40:37 - 40:40

And let's suppose
40:40 - 40:43
40:43 - 40:50

that gi's are strictly feasible.
40:51 - 40:54

And
40:54 - 40:57

what that means is that
40:57 - 41:01

there exists a value of w such that
41:01 - 41:04

for all i,
41:04 - 41:07

gi of w is less
41:07 - 41:11

than 0. Don't worry too much [inaudible]. I'm writing these things down for the sake of completeness, but don't worry too much about all the
41:11 - 41:13

technical details.
41:13 - 41:16

Strictly feasible, which just means that there's a value of w such that
41:16 - 41:19

all of these constraints are satisfied with strict inequality rather than with
41:19 - 41:22

less than or equal to.
41:22 - 41:22
41:22 - 41:25

Under these conditions,
41:25 - 41:26
41:26 - 41:30

there will exist w star,
41:30 - 41:32

alpha
41:32 - 41:34

star, beta
41:34 - 41:40

star such that w star solves the primal problem.
41:40 - 41:42

And alpha star
41:42 - 41:49

and beta star, the Lagrange multipliers, solve the dual problem.
41:52 - 41:54
41:54 - 41:59

And the value of the primal problem will be equal to the value of the dual problem will
41:59 - 42:04

be equal to the value of your Lagrange multiplier - excuse
42:04 - 42:06

me, will be equal to the value of your generalized Lagrangian, evaluated
42:06 - 42:09

at w star, alpha star, beta star.
42:09 - 42:13

In other words, you can solve either the primal or the dual problem. You get
42:13 - 42:15

the same
42:15 - 42:20

solution. Further,
42:20 - 42:23

your parameters will satisfy
42:23 - 42:24
42:24 - 42:27
42:27 - 42:32

these conditions. Partial derivatives with respect to the parameters w will be
42:32 - 42:33

0. And
42:33 - 42:37

actually, to keep this equation in mind, we'll actually use this in a second
42:37 - 42:39

when we take the Lagrangian of
42:39 - 42:43

our support vector machine problem, and take a derivative with respect to w to solve a -
42:43 - 42:45

to solve our -
42:45 - 42:47

to derive our dual problem. We'll actually
42:47 - 42:50

perform this step ourselves in a second.
42:50 - 42:52
42:52 - 42:58

Partial derivative with respect to the Lagrange multiplier beta is
42:58 - 43:02
43:02 - 43:05

equal
43:05 - 43:06

to
43:06 - 43:08

0.
43:08 - 43:10

Turns out this will hold true,
43:10 - 43:12

too.
43:12 - 43:14

This is called
43:14 - 43:19
43:19 - 43:21

the -
43:21 - 43:27

well
43:27 - 43:29

-
43:29 - 43:32

this is called the KKT dual complementarity condition.
43:32 - 43:38

KKT stands for Karush-Kuhn-Tucker, which were the authors of this theorem.
43:38 - 43:42

Well, and by tradition, usually this [inaudible] KKT conditions.
43:42 - 43:49

But the other two are
43:50 - 43:53

- just that alpha i star is greater than or equal to 0, which we had
43:53 - 43:55

previously
43:55 - 44:02

and that your constraints are actually satisfied.
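The conditions written on the board are not captured in the transcript. As a sketch, assuming the standard statement of the KKT conditions for the problem above (n parameters, k inequality constraints, l equality constraints; these counts are assumed notation), they read:

```latex
\begin{align*}
\frac{\partial}{\partial w_i}\mathcal{L}(w^*,\alpha^*,\beta^*) &= 0, & i &= 1,\dots,n \\
\frac{\partial}{\partial \beta_i}\mathcal{L}(w^*,\alpha^*,\beta^*) &= 0, & i &= 1,\dots,l \\
\alpha_i^*\, g_i(w^*) &= 0, & i &= 1,\dots,k \\
g_i(w^*) &\le 0, & i &= 1,\dots,k \\
\alpha_i^* &\ge 0, & i &= 1,\dots,k
\end{align*}
```

The third line is the dual complementarity condition; the last two are the ones described as "the other two".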
44:03 - 44:10

So let's see. [Inaudible] All
44:22 - 44:29

right.
44:30 - 44:34

So let's take those and apply this to our
44:34 - 44:37

optimal margin
44:37 - 44:41

optimization problem that we had previously. I
44:41 - 44:44

was gonna say one word about this,
44:44 - 44:47

which is -
44:47 - 44:49

was gonna say one word about this
44:49 - 44:51

KKT
44:51 - 44:52

complementarity condition, which is
44:52 - 44:54

that it's a condition that -
44:54 - 45:01

at your solution, you must have that alpha star i times gi of w star is equal to 0.
45:01 - 45:03

So
45:03 - 45:04

let's see.
45:04 - 45:09

So the product of two numbers is equal to 0. That means that
45:09 - 45:09
45:09 - 45:12

at least one of these things must be equal to
45:12 - 45:16

0. For the product of two things to be equal to 0, well, that's just saying either alpha
45:16 - 45:17

i
45:17 - 45:21

or gi is equal to 0.
45:21 - 45:26

So what that implies is that the - just Karush-Kuhn-Tucker
45:26 - 45:32
45:32 - 45:38
45:38 - 45:43

- most people just say KKT, but we wanna show you the right
45:43 - 45:46

spelling
45:46 - 45:47

of their names. So
45:47 - 45:48

KKT
45:48 - 45:52

complementarity condition implies that if alpha
45:52 - 45:55

i is not 0,
45:55 - 46:01

that necessarily implies that
46:01 - 46:03

gi of w star
46:03 - 46:06

is equal to
46:06 - 46:11

0.
46:11 - 46:13

And
46:13 - 46:15

usually,
46:15 - 46:21
46:21 - 46:24
46:24 - 46:26

it turns out - so
46:26 - 46:29

all the KKT condition guarantees is that
46:29 - 46:33

at least one of them is 0. It may actually be the case that both
46:33 - 46:35

alpha and gi are both equal to 0.
46:35 - 46:39

But in practice, when you solve this optimization problem, you find that
46:39 - 46:43

for the most part, alpha i star is not equal to 0 if and only if gi of
46:43 - 46:45

w star is equal to 0.
46:45 - 46:50

This is not strictly true because it's possible that both of these may be 0.
46:50 - 46:52

But in practice,
46:52 - 46:55

when we solve problems like these, for the most part,
46:55 - 46:59

exactly one of these will be non-0.
46:59 - 47:03

And also, when this holds true, when gi of w star is equal to 0,
47:03 - 47:06

we say that
47:06 - 47:07

gi
47:07 - 47:14

- gi of w, I guess, is an active
47:16 - 47:18

constraint
47:18 - 47:22

because our constraint was that gi of w must be less than or equal to
47:22 - 47:23

0.
47:23 - 47:26

And so if it is equal to 0, then
47:26 - 47:33

we say that this is an active constraint.
47:34 - 47:37

Once we talk about [inaudible], we come back and [inaudible] and just extend this
47:37 - 47:43

idea a little bit more.
47:43 - 47:48

[Inaudible] board. [Inaudible]
47:48 - 47:52

turn
47:52 - 47:59

to this board in a second, but -
48:07 - 48:11

so let's go back and work out the primal and the dual optimization problems for
48:11 - 48:12
48:12 - 48:17

our optimal margin classifier for the optimization problem that we worked on just now. As
48:17 - 48:20

a point of notation,
48:20 - 48:25

in whatever I've been writing down so far in deriving the KKT conditions,
48:25 - 48:28
48:28 - 48:32

the Lagrange multipliers were alpha i and beta i.
48:32 - 48:36

It turns out that when applied to the
48:36 - 48:37

SVM,
48:37 - 48:41

turns out we only have one set of Lagrange multipliers alpha i.
48:41 - 48:44

And also,
48:44 - 48:47

as I was working out the KKT conditions, I used w
48:47 - 48:52

to denote the parameters of my primal optimization problem. [Inaudible] I wanted to
48:52 - 48:54

minimize f of w. In my
48:54 - 48:58

very first optimization problem, I had
48:58 - 48:59

that optimization problem [inaudible] finding the parameters
48:59 - 49:02

w. In my SVM problem, I'm
49:02 - 49:05

actually gonna have two sets of parameters, w and b. So
49:05 - 49:08

this is just a - keep that
49:08 - 49:15

sort of slight notation change in mind. So
49:16 - 49:20

the problem we worked out previously was we want to minimize the norm of w squared and just add a
49:20 - 49:21

half there
49:21 - 49:26

by convention because it makes other work - math work a little
49:26 - 49:27

nicer.
49:27 - 49:33

And subject to the constraint that yi times (w transpose xi + b) must be greater than or equal to 1.
49:41 - 49:46

And so let me just take this constraint, and I'll rewrite it as a constraint that gi of w,
49:46 - 49:49

b is less than or equal to 0. Again, previously,
49:49 - 49:53

I had gi
49:53 - 49:56

of w, but now I have parameters w and b. So
49:56 - 50:03

gi of w, b defined
50:03 - 50:10

as minus yi times (w transpose xi plus b), plus 1.
50:10 - 50:16

So
50:16 - 50:20

let's look at the implications of this in terms
50:20 - 50:25

of the KKT dual complementarity condition again.
50:25 - 50:29

So we have that if alpha i is strictly greater than
50:29 - 50:32

0, that necessarily implies that
50:32 - 50:35

gi of w, b
50:35 - 50:42

is equal to 0. In other words, this is an active constraint.
50:47 - 50:54

And what does this mean? It means that it actually turns out gi of wb equal to 0 that
50:54 - 51:01

is - that means exactly that the training example xi, yi
51:01 - 51:08

has functional margin
51:09 - 51:10

equal to 1.
51:10 - 51:11

Because
51:11 - 51:15

this constraint was that
51:15 - 51:19

the functional margin of every example has to be greater than or equal to
51:19 - 51:22

1. And so if this is an active constraint, it just means the
51:22 - 51:24

inequality holds with equality.
51:24 - 51:27

That means that my training example i
51:27 - 51:30

must have functional margin equal to exactly 1. And so - actually, yeah,
51:30 - 51:34

right now, I'll
51:34 - 51:40

do this on a different board, I guess.
51:40 - 51:47
51:57 - 51:59

So in pictures,
51:59 - 52:04

what that means is that, you have some
52:04 - 52:11

training sets, and you'll
52:11 - 52:15

have some separating hyperplane.
52:15 - 52:19

And so the examples with functional margin equal to 1
52:19 - 52:23

will be exactly those which are -
52:23 - 52:26

so they're closest
52:26 - 52:28

to my separating hyperplane. So
52:28 - 52:31

that's
52:31 - 52:34

my equation: w transpose x plus b equal to 0.
52:34 - 52:39

And so in this - in this cartoon example that I've done, it'll be
52:39 - 52:45

exactly
52:45 - 52:50

- these three
52:50 - 52:53

examples that have functional margin
52:53 - 52:54

equal to 1,
52:54 - 52:59

and all of the other examples as being further away than these
52:59 - 53:06

three will have functional margin that is strictly greater than 1.
53:06 - 53:08

And
53:08 - 53:11

the examples with functional margin equal to 1 will usually correspond to the
53:11 - 53:15
53:15 - 53:22
53:22 - 53:23

ones where
53:23 - 53:27

the corresponding Lagrange multipliers are also not equal to 0. And again, it may not
53:27 - 53:29

hold true. It may be the case that
53:29 - 53:33

both gi and alpha i are equal to 0. But usually, when gi
53:33 - 53:36

is 0, alpha i will be non-0.
53:36 - 53:39

And so the examples with functional margin equal to 1 will be the ones where alpha i is not equal
53:39 - 53:46

to 0. One
53:47 - 53:49

useful property of this is that
53:49 - 53:53

as suggested by this picture, and it's true in general as well, it
53:53 - 53:57

turns out that when we find a solution to this - to the optimization problem,
53:57 - 54:02

you find that relatively few training examples have functional margin equal to 1. In
54:02 - 54:03

this picture I've drawn,
54:03 - 54:06

there are three examples with functional margin equal to 1. There
54:06 - 54:09

are just a few examples at this minimum possible distance to your separating hyperplane.
54:09 - 54:11
54:11 - 54:14

And these are three -
54:14 - 54:18

these examples with functional margin equal to 1 - they
54:18 - 54:21

are what we're going to call
54:21 - 54:26

the support vectors. And
54:26 - 54:30

this leads to the name support vector machine. There'll be these three points with functional margin
54:30 - 54:31

1
54:31 - 54:35

that we're calling support vectors.
54:35 - 54:37

And
54:37 - 54:40

the fact that there are relatively few support vectors also means that
54:40 - 54:41

usually,
54:41 - 54:43

most of the alpha i's are equal to
54:43 - 54:45

0. So with alpha i equal
54:45 - 54:52

to
54:53 - 54:54

0,
54:54 - 55:01

for examples that are not support vectors.
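To make the picture concrete, here is a minimal sketch in Python. The three training points and the parameters w and b are made up for illustration (they are not the points drawn on the board); w and b are assumed to be the optimal margin solution for this toy set, which is the hyperplane x1 = 1.

```python
# Toy training set (hypothetical): two positive examples, one negative.
X = [(2.0, 0.0), (0.0, 0.0), (3.0, 1.0)]
y = [1, -1, 1]

# Assumed optimal parameters for this toy set: the hyperplane x1 = 1.
w = (1.0, 0.0)
b = -1.0


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Functional margin of each example: y_i * (w^T x_i + b).
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

# Constraint g_i(w, b) = 1 - margin_i <= 0; it is active when g_i == 0.
g = [1.0 - m for m in margins]

# Support vectors are the examples whose constraint is active,
# that is, whose functional margin is exactly 1.
support = [i for i, gi in enumerate(g) if abs(gi) < 1e-9]

print(margins)
print(support)
```

In this toy set the first two points sit at functional margin 1 and are the support vectors; the third point is further away, its constraint is inactive, and by complementarity its alpha i has to be 0.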
55:03 - 55:07

Let's go ahead and work out the actual
55:07 - 55:08

optimization problem.
55:08 - 55:12
55:12 - 55:19
55:28 - 55:29

So
55:29 - 55:32

we have our optimal margin
55:32 - 55:33

optimization problem.
55:33 - 55:34
55:34 - 55:39

So let me go and write down the Lagrangian,
55:39 - 55:39

and
55:39 - 55:44

because we only have inequality constraints - we really have gi-style
55:44 - 55:47

constraints, no hi-style constraints. We have
55:47 - 55:50

inequality constraints and no equality constraints,
55:50 - 55:52

I'll only have
55:52 - 55:53

Lagrange multipliers of type
55:53 - 55:57

alpha - no betas in my generalized Lagrangian. But
55:57 - 56:01
56:01 - 56:03

my Lagrangian will be
56:03 - 56:10

one-half norm of w squared, minus the sum over i of alpha i times (yi times (w transpose xi plus b), minus 1).
56:15 - 56:18
56:18 - 56:20

That's my
56:20 - 56:27

Lagrangian.
56:27 - 56:29

And so let's work out what the dual problem is.
56:29 - 56:33

And to do that, I need to figure out what theta D of alpha - and again, there are no betas here
56:33 - 56:37

- so what theta d of alpha
56:37 - 56:44

is min with
56:44 - 56:48

respect to w, b of L of w, b, alpha. So the dual problem is to maximize theta D as a function of alpha.
56:48 - 56:55

So let's work out what theta D is, and then that'll give us our dual problem.
56:55 - 56:58

So then to work out what this is, what do you need to do? We need to
56:58 - 57:02

take the Lagrangian and minimize it as a function of w and b
57:02 - 57:03

so - and what is this? How do you
57:03 - 57:05

minimize the Lagrangian? So in order to
57:05 - 57:06
57:06 - 57:09

minimize the Lagrangian as a function of w and b,
57:09 - 57:11

we do the usual thing. We
57:11 - 57:13

take the derivatives of the
57:13 - 57:17

Lagrangian with respect to w and b. And we set that to 0. That's how we
57:17 - 57:20

minimize the Lagrangian with respect
57:20 - 57:26

to w and b. So take the derivative with respect to w of the Lagrangian.
57:26 - 57:28

And
57:28 - 57:35

I'll just write down the answer. You know how to do calculus like this.
57:38 - 57:41

So I wanna minimize this function of w, so I take the derivative and set it
57:41 - 57:43

to 0. And
57:43 - 57:45

I get that. And
57:45 - 57:52

then so this implies that w
57:56 - 57:59

must be the sum over i of alpha i, yi, xi.
57:59 - 58:01

And so
58:01 - 58:05

w, therefore, is actually a linear combination of your input feature vectors xi.
58:05 - 58:07

This is
58:07 - 58:10

sum of your various weights given by the alpha i's and times
58:10 - 58:13

the xi's, which are your examples in your training set.
58:13 - 58:16

And this will be useful later. The
58:16 - 58:23

other equation we have is - here, partial derivative of
58:23 - 58:28

Lagrangian with respect to b is equal to minus sum
58:28 - 58:34

from i equals 1 to m of alpha i, y
58:34 - 58:35

i.
58:35 - 58:37

And so I'll just set that to equal to
58:37 - 58:40

0. And so these are my two constraints.
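The board work for these two conditions is not in the transcript. A sketch in LaTeX, assuming the Lagrangian is the standard one for the optimal margin problem with constraints gi of w, b equal to 1 minus yi times (w transpose xi plus b), and m training examples:

```latex
\begin{align*}
\mathcal{L}(w,b,\alpha) &= \tfrac{1}{2}\lVert w\rVert^2
   - \sum_{i=1}^{m}\alpha_i\left[\,y^{(i)}\left(w^{T}x^{(i)} + b\right) - 1\right] \\
\nabla_w \mathcal{L}(w,b,\alpha) &= w - \sum_{i=1}^{m}\alpha_i\, y^{(i)} x^{(i)} = 0
   \quad\Longrightarrow\quad w = \sum_{i=1}^{m}\alpha_i\, y^{(i)} x^{(i)} \\
\frac{\partial}{\partial b}\mathcal{L}(w,b,\alpha) &= -\sum_{i=1}^{m}\alpha_i\, y^{(i)} = 0
\end{align*}
```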
58:40 - 58:43

And so
58:43 - 58:48
58:48 - 58:50

[inaudible].
58:50 - 58:54

So what I'm going to do is I'm actually going to take these two constraints,
58:54 - 58:58

and well, I'm going to take whatever I thought to be the value for w.
58:58 - 59:00

And I'm
59:00 - 59:01

gonna
59:01 - 59:04

take what I've worked out to be the
59:04 - 59:05

value for w, and
59:05 - 59:07

I'll plug it back in there
59:07 - 59:09

to figure out what the Lagrangian really is
59:09 - 59:14

when I minimize with respect to w. [Inaudible] and I'll
59:14 - 59:21

deal with b in a second.
59:28 - 59:35

So
59:38 - 59:41

let's see. So my Lagrangian is 1/2
59:41 - 59:44

w transpose w minus the sum over i of alpha i times (yi times (w transpose xi plus b), minus 1).
59:44 - 59:51
59:51 - 59:58
59:58 - 60:04

So this first term, w transpose w
60:04 - 60:06

- this becomes
60:06 - 60:11

sum i equals one to m, alpha i, yi,
60:11 - 60:18
60:21 - 60:25

xi transpose. This is just putting in the value for w that I worked out previously.
60:25 - 60:28

But since this is w transpose w -
60:28 - 60:32

and so when I expand out this quadratic function, and when I plug in w
60:32 - 60:34

over there as well,
60:34 - 60:36

I
60:36 - 60:39

find
60:39 - 60:41
60:41 - 60:43
60:43 - 60:50

that
60:50 - 60:55
60:55 - 60:58

I
60:58 - 61:00

have
61:00 - 61:04
61:04 - 61:09
61:09 - 61:11
61:11 - 61:18

that.
61:21 - 61:22

Oh,
61:22 - 61:29

where I'm using these angle brackets to denote inner product, so this
61:29 - 61:35

thing here, it just means the inner product, xi transpose
61:35 - 61:36

xj. And
61:36 - 61:40

the first and second terms are actually the same except for the minus one half. So this
61:40 - 61:42

simplifies to be
61:42 - 61:45

equal to
61:45 - 61:51
61:51 - 61:58
62:08 - 62:13

that.
62:13 - 62:16

So
62:16 - 62:19

let me go ahead and
62:19 - 62:22

call this capital W of alpha.
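The expressions pointed to as "that" are not in the transcript. As a sketch, assuming the Lagrangian is one-half w transpose w minus the sum of alpha i times (yi (w transpose xi plus b) minus 1), plugging in w equal to the sum of alpha i yi xi gives:

```latex
\begin{align*}
\mathcal{L}(w,b,\alpha)
 &= \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)},x^{(j)}\rangle
  - \sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)},x^{(j)}\rangle
  - b\sum_{i=1}^{m}\alpha_i y^{(i)} + \sum_{i=1}^{m}\alpha_i \\
 &= \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)},x^{(j)}\rangle
  - b\sum_{i=1}^{m}\alpha_i y^{(i)} \\
W(\alpha) &= \sum_{i=1}^{m}\alpha_i
  - \tfrac{1}{2}\sum_{i,j=1}^{m} y^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)},x^{(j)}\rangle
\end{align*}
```

The term involving b drops out once the sum of alpha i yi is 0, which is the condition obtained from the derivative with respect to b.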
62:22 - 62:23
62:23 - 62:29
62:29 - 62:36
62:47 - 62:52

My dual problem is, therefore, the following. I want to maximize W
62:52 - 62:56

of alpha, which is that [inaudible].
62:56 - 62:58

And
62:58 - 63:02

I want to the - I realize the notation is somewhat
63:02 - 63:07

unfortunate. I'm using capital W of alpha to denote that formula I wrote down earlier.
63:07 - 63:14

And then we also had our lowercase w. The original [inaudible] is the primal
63:14 - 63:16

problem. Lowercase w transpose xi. So
63:16 - 63:18

uppercase and lowercase w
63:18 - 63:21

are totally different
63:21 - 63:27

things, so unfortunately, the notation is standard as well, as far as I know,
63:27 - 63:30

so. So the dual problem is
63:30 - 63:33

that, subject to the alpha i's being greater than or equal to 0,
63:33 - 63:37

and
63:37 - 63:39

we also have that the
63:39 - 63:41

sum of i,
63:41 - 63:44

yi, alpha i is equal to 0.
63:44 - 63:46

That last constraint
63:46 - 63:50

was the constraint I got from
63:50 - 63:52

this - the
63:52 - 63:55

sum of i - sum of i, yi alpha i equals to 0. But that's where
63:55 - 64:00

that [inaudible] came
64:00 - 64:02

from. Let
64:02 - 64:02

me just
64:02 - 64:06

- I think in previous years that I taught this,
64:06 - 64:09

where this constraint comes from is just - is slightly confusing. So let
64:09 - 64:13

me just take two minutes to say what the real interpretation of that is. And if you
64:13 - 64:16

don't understand it, it's
64:16 - 64:18

not a big deal, I guess.
64:18 - 64:22

So when we took the partial derivative of the Lagrange with
64:22 - 64:23

respect to b,
64:23 - 64:28

we end up with this constraint that sum of i, yi, alpha i must be equal to 0.
64:28 - 64:34

The interpretation of that, it turns out, is that if sum of i, yi, alpha i
64:34 - 64:37

is not equal to
64:37 - 64:41

0, then
64:41 - 64:43
64:43 - 64:50
64:51 - 64:53

theta d of wb
64:53 - 64:56

is -
64:56 - 65:00

actually, excuse me.
65:00 - 65:02
65:02 - 65:03
65:03 - 65:06

Then theta d of alpha is equal to
65:06 - 65:08
65:08 - 65:13

minus infinity for minimizing.
65:13 - 65:16

So in other words, it turns out my Lagrangian is
65:16 - 65:20

actually a linear function of my parameter b. And so the interpretation of
65:20 - 65:23

that constraint we worked out previously was that if sum of i, yi, alpha i
65:23 - 65:26

is not equal to 0, then
65:26 - 65:29

theta d of alpha is equal to minus infinity.
65:29 - 65:32

And so if your goal is to
65:32 - 65:37

maximize as a function of alpha, theta
65:37 - 65:38

d of alpha,
65:38 - 65:45

then you've gotta choose values of alpha for which sum of yi alpha i is equal to 0.
65:45 - 65:51

And then when sum of yi alpha i is equal to 0, then
65:51 - 65:54
65:54 - 65:56
65:56 - 66:01

theta d of
66:01 - 66:04

alpha is equal to capital W of alpha.
66:04 - 66:09

And so that's why we ended up deciding to maximize capital W of alpha subject to
66:09 - 66:14

that sum of yi alpha i is equal to 0.
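As a sanity check, the dual objective can be evaluated on a tiny example. This is a sketch in Python under stated assumptions: the three training points are made up, `alpha_star` is the dual solution assumed for this toy set (it satisfies the KKT conditions with w = (1, 0), b = -1), and `W` and `feasible` are hypothetical helper names.

```python
# Toy training set (hypothetical).
X = [(2.0, 0.0), (0.0, 0.0), (3.0, 1.0)]
y = [1, -1, 1]
m = len(X)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def W(alpha):
    """Dual objective: sum_i alpha_i - 1/2 sum_ij y_i y_j alpha_i alpha_j <x_i, x_j>."""
    linear = sum(alpha)
    quad = sum(
        y[i] * y[j] * alpha[i] * alpha[j] * dot(X[i], X[j])
        for i in range(m)
        for j in range(m)
    )
    return linear - 0.5 * quad


def feasible(alpha, tol=1e-9):
    """Dual constraints: alpha_i >= 0 and sum_i y_i alpha_i = 0."""
    nonneg = all(a >= 0.0 for a in alpha)
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) < tol
    return nonneg and balanced


# Assumed dual solution for this toy set: only the first two
# points (the support vectors) get non-zero multipliers.
alpha_star = [0.5, 0.5, 0.0]

# Primal value at the assumed primal solution w = (1, 0): 1/2 ||w||^2.
primal_value = 0.5 * dot((1.0, 0.0), (1.0, 0.0))

print(W(alpha_star), primal_value)
```

On this toy set the dual value at `alpha_star` matches the primal value, and other feasible choices such as `[0.25, 0.25, 0.0]` or `[0.3, 0.5, 0.2]` give a smaller `W`.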
66:14 - 66:19

Yeah, the - unfortunately, the fact of that d would be [inaudible]
66:19 - 66:22

adds just a little bit of extra notation in our
66:22 - 66:23

derivation of the dual. But
66:23 - 66:27

by the way, and [inaudible] all the action of the optimization problem is with w
66:27 - 66:33

because b is just one parameter.
66:33 - 66:40

So let's check. Are there any questions about this? Okay, cool.
66:47 - 66:49

So
66:49 - 66:54

we've derived a dual optimization problem - and really, don't worry about this
66:54 - 66:56

if you're not quite sure where this was. Just think of this as
66:56 - 67:00

we worked out this constraint, and we worked out, and we took partial derivative with
67:00 - 67:01

respect to b,
67:01 - 67:06

that this constraint has the [inaudible] and so I just copied that over here. But so - worked out
67:06 - 67:11

the dual of the optimization problem,
67:11 - 67:16

so our approach to finding - to deriving the optimal margin classifier or support vector
67:16 - 67:17

machine
67:17 - 67:19

will be that we'll solve
67:19 - 67:26

this dual optimization problem for the parameters alpha
67:28 - 67:29

star.
67:29 - 67:31

And then
67:31 - 67:34

if you want, you can then - this is the equation that we worked out on
67:34 - 67:36

the previous board. We said that
67:36 - 67:43

w - this [inaudible] alpha - w must be equal to
67:44 - 67:46

that.
67:46 - 67:48

And so
67:48 - 67:53

once you solve for alpha, you can then go back and quickly derive
67:53 - 67:57

w, the parameters of your primal problem. And we worked this out earlier.
67:57 - 67:58
67:58 - 68:00

And moreover,
68:00 - 68:05

once you solve alpha and w, you can then focus back into your - once you solve for alpha and w,
68:05 - 68:07
68:07 - 68:10

it's really easy to solve for b, so
68:10 - 68:14
68:14 - 68:15

that b gives us the interpretation of [inaudible]
68:15 - 68:20

training set, and you found the direction for w. So you know where your separating
68:20 - 68:22

hyperplane's direction is. You know it's got to be
68:22 - 68:25

one of these things.
68:25 - 68:28

And you know the orientation of the separating hyperplane. You just have to
68:28 - 68:31

decide where to place
68:31 - 68:33

this hyperplane. And that's what solving for b is.
68:33 - 68:37

So once you solve for alpha and w, it's really easy to solve b.
68:37 - 68:40

You can plug alpha and w back into the
68:40 - 68:43

primal optimization problem
68:43 - 68:46
68:46 - 68:50
68:50 - 68:57

and solve for b.
69:02 - 69:05

And I just wrote it down
69:05 - 69:12

for the sake of completeness,
69:15 - 69:20
69:20 - 69:21
69:21 - 69:23

but
69:23 - 69:28
69:28 - 69:30

- and the
69:30 - 69:36

intuition behind this formula is just that find the worst positive
69:36 - 69:37

[inaudible] and the
69:37 - 69:42

worst negative example. Let's say
69:42 - 69:45

this one and this one - say [inaudible] and [inaudible] the difference between them. And
69:45 - 69:46

that tells you where you should
69:46 - 69:48

set the threshold for
69:48 - 69:53

where to place the separating hyperplane.
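The formula for b referred to here is not captured in the transcript. A minimal sketch in Python, assuming the standard form: b is minus the average of the largest w transpose xi over negative examples and the smallest w transpose xi over positive examples. The toy data and the dual solution `alpha` are made up for illustration.

```python
# Toy training set and an assumed dual solution (both hypothetical).
X = [(2.0, 0.0), (0.0, 0.0), (3.0, 1.0)]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]
m = len(X)
dim = len(X[0])


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Recover w from alpha: w = sum_i alpha_i * y_i * x_i.
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(m)) for d in range(dim)]

# "Worst" negative example: largest w^T x among examples with y = -1.
neg_max = max(dot(w, X[i]) for i in range(m) if y[i] == -1)
# "Worst" positive example: smallest w^T x among examples with y = +1.
pos_min = min(dot(w, X[i]) for i in range(m) if y[i] == 1)

# Place the threshold halfway between the two.
b = -(neg_max + pos_min) / 2.0

print(w, b)
```

With w and b recovered this way, every toy example ends up with functional margin at least 1, and the two closest examples sit at exactly 1.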
69:53 - 69:53

And then
69:53 - 69:55

that's the -
69:55 - 69:58

this is the optimal margin classifier. This is also called a support vector
69:58 - 70:02

machine. If you do not use one y [inaudible], it's called kernels. And I'll say a few words
70:02 - 70:04

about that. But I
70:04 - 70:08

hope the process is clear. It's a dual problem. We're going to solve the dual
70:08 - 70:10

problem for the alpha i's.
70:10 - 70:11

That gives us w, and that gives
70:11 - 70:14

us b.
70:14 - 70:17

So
70:17 - 70:22

there's just one more thing I wanna point out as I lead into the next lecture,
70:22 - 70:23

which is that - I'll just
70:23 - 70:28

write this out again,
70:28 - 70:29

I guess -
70:29 - 70:31

which is that it turns out
70:31 - 70:34

we can take the entire algorithm,
70:34 - 70:38

and we can express the entire algorithm in terms of inner products. And here's what I
70:38 - 70:41

mean by that. So
70:41 - 70:42

we said that the parameters w
70:42 - 70:45

is a weighted sum of your input examples.
70:45 - 70:49

And so we need to make a prediction.
70:49 - 70:55

Someone gives you a new value of x. You want to evaluate the hypothesis on the value of x.
70:55 - 70:58

That's given by g of w transpose x plus b, or
70:58 - 71:03

where g was this threshold function that outputs minus 1 or plus 1.
71:03 - 71:06

And so you need to compute w transpose x plus b.
71:06 - 71:11

And that is equal to the sum over i of alpha i,
71:11 - 71:18

yi times the inner product of xi and x, plus b.
71:20 - 71:24

And that can be expressed as a sum of these inner products between
71:24 - 71:26

your training examples
71:26 - 71:33

and this new value of x [inaudible] value [inaudible]. And this will
71:34 - 71:39

lead into our next lecture, which is the idea of kernels.
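Prediction written purely in terms of inner products can be sketched as follows. This is an illustration in Python, not the lecture's code: the toy data, `alpha` and `b` are assumed values, and `decision_value` and `predict` are hypothetical names. The threshold function g is taken to output +1 when its argument is at least 0 and -1 otherwise.

```python
# Toy training set with an assumed dual solution and intercept (hypothetical).
X = [(2.0, 0.0), (0.0, 0.0), (3.0, 1.0)]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]
b = -1.0


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Only examples with non-zero alpha contribute to the sum.
support = [i for i, a in enumerate(alpha) if a > 0.0]


def decision_value(x):
    """w^T x + b, computed as sum_i alpha_i y_i <x_i, x> + b."""
    return sum(alpha[i] * y[i] * dot(X[i], x) for i in support) + b


def predict(x):
    """Threshold function g: +1 if the decision value is >= 0, else -1."""
    return 1 if decision_value(x) >= 0 else -1


print(predict((5.0, 2.0)), predict((-1.0, 3.0)))
```

Note that `decision_value` never forms w explicitly; it touches the training data only through `dot`, which is the piece a kernel would later replace.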
71:39 - 71:40

And
71:40 - 71:44

it turns out that in the sorts of feature spaces where we use support vector
71:44 - 71:45

machines -
71:45 - 71:52

it turns out that sometimes your training examples may be very high-dimensional. It may even be the case
71:54 - 71:58

that the features that you want to use
71:58 - 72:00

are
72:00 - 72:04

infinite-dimensional feature vectors.
72:04 - 72:11

But despite this, it'll turn out that there'll be an interesting representation that
72:11 - 72:13

you can use
72:13 - 72:15

that will allow you
72:15 - 72:19

to compute inner products like these efficiently.
72:19 - 72:26
72:26 - 72:29

And this holds true only for certain feature spaces. It doesn't hold true for arbitrary sets
72:29 - 72:30

of features.
72:30 - 72:34

But when we talk about the idea of
72:34 - 72:35

kernels in the next lecture, we'll
72:35 - 72:38

see examples where
72:38 - 72:41

even though you have extremely high-dimensional feature vectors, you can compute
72:41 - 72:46

- you may never want to represent xi explicitly as an infinite-dimensional
72:46 - 72:48

feature vector. You can't even store it in computer memory.
72:48 - 72:52

But you will nonetheless be able to compute inner products between different
72:52 - 72:54

[inaudible] feature vectors very efficiently.
72:54 - 72:57

And so you can - for example, you can make predictions by making use of these inner
72:57 - 72:59

products.
72:59 - 73:02

This is just xi
73:02 - 73:04

transpose.
73:04 - 73:09

You will compute these inner products very efficiently and, therefore, make predictions.
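
As a rough illustration of how an inner product between high-dimensional feature vectors can be computed without building those vectors, here is a small sketch. The quadratic feature map below is a standard textbook example chosen for this sketch, not something stated in this lecture; the names `phi` and `quadratic_kernel` are hypothetical.

```python
def inner(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map: all n*n pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]


def quadratic_kernel(x, z):
    """Compute <phi(x), phi(z)> as (x . z)^2 without ever building phi."""
    return inner(x, z) ** 2


x = (1.0, 2.0, 3.0)
z = (4.0, 5.0, 6.0)

explicit = inner(phi(x), phi(z))      # works in n*n = 9 dimensions
via_kernel = quadratic_kernel(x, z)   # works in n = 3 dimensions

print(explicit, via_kernel)
```

Both numbers agree, but the second computation never stores the 9-dimensional vectors. The same idea is what makes very high-dimensional feature spaces usable.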
73:09 - 73:11

And this pointed also - the other
73:11 - 73:15

reason we derived the dual was because
73:15 - 73:18

on this board, when we worked out what w of alpha is, w of alpha
73:18 - 73:24

- actually has the same property - w of alpha is again
73:24 - 73:27

written in terms of these inner products.
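
To make that concrete, here is a minimal sketch of evaluating the dual objective, assuming it has the standard form W(alpha) = sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>. The function name and the toy data are assumptions made for illustration.

```python
def inner(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alphas, ys, xs, inner_product=inner):
    """W(alpha) = sum_i alpha_i
                  - 1/2 * sum_{i,j} y_i y_j alpha_i alpha_j <x_i, x_j>.

    The inputs xs appear only inside inner products.
    """
    m = len(xs)
    linear = sum(alphas)
    quadratic = sum(
        ys[i] * ys[j] * alphas[i] * alphas[j] * inner_product(xs[i], xs[j])
        for i in range(m)
        for j in range(m)
    )
    return linear - 0.5 * quadratic


# Toy values chosen for illustration (hypothetical, not from the lecture).
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]

w_alpha = dual_objective(alphas, ys, xs)
print(w_alpha)
```

For these toy values the corresponding weight vector is (0.5, 0.5), so one half of its squared norm is 0.25, which matches the value of the dual objective here.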
73:27 - 73:29

And so if you
73:29 - 73:33

actually look at the dual optimization problem and step - for all the steps of the
73:33 - 73:34

algorithm,
73:34 - 73:37

you'll find that you can actually do everything you want - learn the parameters
73:37 - 73:38

alpha. So
73:38 - 73:42

suppose you solve the optimization problem for the parameters alpha, and you can do everything you want
73:42 - 73:47

without ever needing to represent xi directly. And all you need to do
73:47 - 73:54

is compute inner products between your feature vectors like these. Well,
73:54 - 73:56

one last property of
73:56 - 73:58

this algorithm that's kinda nice is that
73:58 - 74:00

I said previously
74:00 - 74:01

that
74:01 - 74:07

the alpha i's are 0 only for the - are non-0 only for the support vectors,
74:07 - 74:09

only for the vectors
74:09 - 74:10

that have functional margin equal to 1.
74:10 - 74:14

And in practice, there are usually fairly few of them.
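
Because alpha i is non-zero only for the support vectors, the sum used for prediction can skip every other training example. A minimal sketch of that idea follows; the tolerance `tol` and the toy data are assumptions made for illustration, since a numerical solver may return tiny non-zero values instead of exact zeros.

```python
def inner(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def support_vectors(alphas, ys, xs, tol=1e-8):
    """Keep only the training examples whose alpha exceeds a small tolerance."""
    return [(a, y, x) for a, y, x in zip(alphas, ys, xs) if a > tol]


def decision_value_sparse(svs, b, x_new):
    """Sum alpha_i * y_i * <x_i, x_new> over the support vectors only, plus b."""
    return sum(a * y * inner(x, x_new) for a, y, x in svs) + b


# Toy values chosen for illustration (hypothetical, not from the lecture).
# The last two points lie well outside the margin, so their alphas are 0.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 4.0), (-5.0, -2.0)]
ys = [1, -1, 1, -1]
alphas = [0.25, 0.25, 0.0, 0.0]
b = 0.0

svs = support_vectors(alphas, ys, xs)
print(len(svs), decision_value_sparse(svs, b, (2.0, 0.0)))
```

Only two of the four inner products are computed at prediction time in this example.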
74:14 - 74:17

And so what this means is that if you're representing w this way,
74:17 - 74:18

then
74:18 - 74:23

w is represented in terms of a fairly small fraction of training examples
74:23 - 74:25

because most of the alpha i's are 0 -
74:25 - 74:27

and so when you're summing up
74:27 - 74:28

the sum,
74:28 - 74:32

you need to compute inner products only with the support vectors, which is
74:32 - 74:36

usually a small fraction of your training set. So that's another nice [inaudible]
74:36 - 74:39

because [inaudible] alpha is
74:39 - 74:41

0. And well,
74:41 - 74:45

much of this will make much more sense when we talk about kernels. [Inaudible] quick
74:45 - 74:52

questions
74:52 - 74:53
74:53 - 74:58

before I close? Yeah. Student:It seems that for anything we've done to work, the data points have to be really well
74:58 - 75:00

behaved, and if any of the points are kinda on the wrong side - Instructor (Andrew Ng):No, oh, yeah, so again, today's lecture assumes that
75:00 - 75:04

the data is linearly separable - that you can actually get perfect
75:04 - 75:11

[inaudible]. I'll fix this in the next lecture as well. But excellent assumption. Yes? Student:So can't we assume that [inaudible]
75:11 - 75:13

point [inaudible], so [inaudible]
75:13 - 75:18

have
75:18 - 75:19

[inaudible]? Instructor (Andrew Ng):Yes, so unless I - says that
75:19 - 75:23

there are ways to generalize this to multiple classes that I probably won't [inaudible] -
75:23 - 75:26

but yeah, that's a generalization [inaudible].
75:26 - 75:28

Okay. Let's close for today, then.
75:28 - 75:29

We'll talk about kernels in our next lecture.

Title:: Lecture 7 | Machine Learning (Stanford)
Description:: Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng lectures on optimal margin classifiers, KKT conditions, and SVM duals.

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.

Complete Playlist for the Course:
http://www.youtube.com/view_play_list?p=A89DCFA6ADACE599

CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Stanford University:
http://www.stanford.edu/

Stanford University Channel on YouTube:
http://www.youtube.com/stanford

Video Language:: English
Duration:: 01:15:45

	N. Ueda edited English subtitles for Lecture 7 \| Machine Learning (Stanford)
	jhprks2 added a translation
