
Lecture 9 | Machine Learning (Stanford)

  • 0:12 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development. Okay.
  • 0:29 - 0:32
    So welcome back.
  • 0:32 - 0:36
    What I want to do today is start a new chapter
  • 0:36 - 0:41
    here. In particular, I want to talk about learning theory.
  • 0:41 - 0:45
    So in the previous, I guess eight lectures so far, you've learned about
  • 0:45 - 0:50
    a lot of learning algorithms, and I hope you now understand a little about
  • 0:50 - 0:54
    some of the best and most powerful tools of machine learning in the [inaudible].
  • 0:54 - 0:58
    And all of you are now sort of well qualified to go into industry
  • 0:58 - 1:02
    and apply these powerful learning algorithms, really the most powerful
  • 1:02 - 1:05
    learning algorithms we know to all sorts of problems, and in fact, I hope
  • 1:05 - 1:11
    you start to do that on your projects right away as well.
  • 1:11 - 1:14
    You might remember, I think it was in the very first lecture,
  • 1:14 - 1:19
    that I made an analogy to if you're trying to learn to be a carpenter, so
  • 1:19 - 1:21
    if
  • 1:21 - 1:24
    you imagine you're going to carpentry school to learn to be a carpenter,
  • 1:24 - 1:29
    then only a small part of what you need to do is to acquire a set of tools. If you learn to
  • 1:29 - 1:30
    be a carpenter
  • 1:30 - 1:33
    you don't just walk in and pick up a toolbox and [inaudible]; knowing,
  • 1:33 - 1:36
    when you need to
  • 1:36 - 1:40
    cut a piece of wood, whether to use a rip saw, or a jig saw, or a keyhole saw,
  • 1:40 - 1:43
    whatever; really mastering the tools is also
  • 1:43 - 1:46
    an essential part of becoming a good carpenter.
  • 1:46 - 1:47
    And
  • 1:47 - 1:50
    what I want to do in the next few lectures is
  • 1:50 - 1:53
    actually give you a sense of mastery of the machine learning tools that all of
  • 1:53 - 1:55
    you have. Okay?
  • 1:55 - 1:57
    And so in particular,
  • 1:57 - 2:01
    in the next few lectures what I want to do is talk more deeply about
  • 2:01 - 2:05
    the properties of different machine learning algorithms so that you can get a sense of
  • 2:05 - 2:07
    when it's most appropriate to use
  • 2:07 - 2:08
    each one.
  • 2:08 - 2:12
    And it turns out that one of the most common scenarios in machine learning is
  • 2:13 - 2:15
    someday you'll be doing research or
  • 2:15 - 2:17
    [inaudible] a company.
  • 2:17 - 2:20
    And you'll apply one of the learning algorithms you learned about, you may apply logistic regression,
  • 2:20 - 2:24
    or support vector machines, or Naive Bayes or something,
  • 2:24 - 2:28
    and for whatever bizarre reason, it won't work as well as you were hoping, or it
  • 2:28 - 2:32
    won't quite do what you were hoping it to.
  • 2:33 - 2:35
    To
  • 2:35 - 2:37
    me, what really separates the people
  • 2:37 - 2:41
    that really understand and really get machine
  • 2:41 - 2:42
    learning,
  • 2:42 - 2:46
    from people that maybe read the textbook and work through the math,
  • 2:46 - 2:50
    will be what you do next. It will be your decisions of when
  • 2:50 - 2:54
    you apply a support vector machine and it doesn't quite do what you wanted,
  • 2:54 - 2:57
    do you really understand enough about support vector machines to know what to
  • 2:57 - 2:59
    do next and how to modify the algorithm?
  • 2:59 - 3:02
    And to me that's often what really separates the great people in machine
  • 3:02 - 3:06
    learning from the people that read the textbook, work through the math, and have
  • 3:06 - 3:09
    just understood that. Okay?
  • 3:09 - 3:10
    So
  • 3:10 - 3:14
    what I want to do today, today's lecture will mainly be on learning theory and
  • 3:14 - 3:15
    we'll start to
  • 3:15 - 3:19
    talk about some of the theoretical results of machine learning.
  • 3:19 - 3:22
    The next lecture, later this week, will be on algorithms for
  • 3:22 - 3:26
    sort of [inaudible], or fixing some of the problems
  • 3:26 - 3:30
    that the learning theory will point out to us and help us understand. And then
  • 3:30 - 3:34
    two lectures from now, that lecture will be almost entirely focused on
  • 3:34 - 3:36
    the practical advice for
  • 3:36 - 3:40
    how to apply learning algorithms. Okay?
  • 3:40 - 3:47
    So you have any questions about this before I start? Okay.
  • 3:51 - 3:54
    So the very first thing we're gonna talk about is something that
  • 3:54 - 3:59
    you've probably already seen on the first homework, and something that
  • 3:59 - 4:01
    I alluded to previously,
  • 4:01 - 4:07
    which is the bias variance trade-off. So take
  • 4:07 - 4:11
    ordinary least squares, the first learning algorithm we learned
  • 4:11 - 4:16
    about, if you fit a straight line through this data, this is not a very good model.
  • 4:16 - 4:20
    Right.
  • 4:20 - 4:21
    And if
  • 4:21 - 4:24
    this happens, we say it has
  • 4:24 - 4:24
    underfit
  • 4:24 - 4:27
    the data, or we say that this is a learning algorithm
  • 4:27 - 4:31
    with a very high bias, because it is
  • 4:31 - 4:36
    failing to fit the evident quadratic structure in the data.
  • 4:36 - 4:36
    And
  • 4:36 - 4:38
    for the purposes
  • 4:38 - 4:41
    of [inaudible] you can formally think of the
  • 4:41 - 4:45
    bias of the learning algorithm as representing the fact that even if you
  • 4:45 - 4:48
    had an infinite amount of training data, even if
  • 4:48 - 4:51
    you had tons of training data,
  • 4:51 - 4:53
    this algorithm would still fail
  • 4:53 - 4:54
    to fit the quadratic
  • 4:54 - 4:58
    function, the quadratic structure in
  • 4:58 - 5:02
    the data. And so we think of this as a learning algorithm with high bias.
  • 5:02 - 5:04
    Then there's the opposite problem, so that's the
  • 5:04 - 5:06
    same dataset.
  • 5:06 - 5:13
    If
  • 5:13 - 5:20
    you fit a fourth-order polynomial to this dataset,
  • 5:20 - 5:20
    then you have
  • 5:20 - 5:26
    you'll be able to interpolate the five data points exactly, but clearly, this is also
  • 5:26 - 5:27
    not a great model
  • 5:27 - 5:31
    of the structure that you and I probably see in the data.
  • 5:31 - 5:34
    And we say that this
  • 5:34 - 5:37
    algorithm has a problem, excuse me,
  • 5:37 - 5:43
    is overfitting the data,
  • 5:43 - 5:49
    or alternatively that this algorithm has high variance. Okay? And the intuition behind
  • 5:49 - 5:50
    overfitting and high variance is that
  • 5:50 - 5:54
    the algorithm is fitting spurious patterns in the data, or is fitting
  • 5:54 - 5:58
    idiosyncratic properties of this specific dataset, be it the dataset of
  • 5:58 - 6:01
    housing prices or whatever.
  • 6:01 - 6:08
    And quite often, there'll be some happy medium
  • 6:08 - 6:10
    of fitting a quadratic function
  • 6:10 - 6:15
    that maybe won't interpolate your data points perfectly, but captures more of the structure
  • 6:15 - 6:17
    in your data
  • 6:17 - 6:21
    than a simple model which underfits.
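As a rough illustration (not from the lecture, with made-up data), here is a minimal NumPy sketch of this picture: a degree-one polynomial underfits five roughly quadratic points, a degree-four polynomial interpolates them exactly, and a quadratic is the happy medium.

```python
# Hypothetical illustration of underfitting vs. overfitting with polynomial fits.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # five hypothetical inputs (e.g., living area)
y = np.array([1.2, 3.8, 9.1, 15.9, 25.2])    # roughly quadratic outputs (e.g., price)

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)         # ordinary least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    # degree 1 underfits (high bias), degree 4 interpolates all five points
    # exactly (high variance), degree 2 captures the quadratic structure.
    print(f"degree {degree}: training MSE = {mse:.4f}")
```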
  • 6:21 - 6:24
    And you can sort of have exactly the same picture
  • 6:24 - 6:26
    for classification problems as well,
  • 6:26 - 6:28
    so lets say
  • 6:28 - 6:35
    this is my training set, right,
  • 6:36 - 6:37
    of
  • 6:37 - 6:39
    positive and negative examples,
  • 6:44 - 6:47
    and so you can fit
  • 6:47 - 6:52
    logistic regression with very high order polynomial features, so H of X
  • 6:52 - 6:57
    equals the sigmoid function of
  • 7:01 - 7:04
    whatever. The sigmoid function applied to a tenth-order polynomial.
  • 7:04 - 7:08
    And you do that, maybe you get a decision boundary
  • 7:08 - 7:13
    like this. Right.
  • 7:13 - 7:17
    That does indeed perfectly separate the positive and negative classes, this is
  • 7:17 - 7:19
    another example of
  • 7:19 - 7:21
    overfitting, and
  • 7:21 - 7:24
    in contrast, if you fit logistic regression to this data with just the linear features,
  • 7:24 - 7:27
    with none
  • 7:27 - 7:30
    of the quadratic features, then maybe you get a decision boundary like that, which
  • 7:30 - 7:33
    can also underfit.
  • 7:33 - 7:35
    Okay.
  • 7:35 - 7:38
    So what I want to do now is
  • 7:38 - 7:42
    understand this problem of overfitting versus underfitting, of high bias versus high
  • 7:42 - 7:43
    variance, more explicitly,
  • 7:45 - 7:49
    and I will do that by posing a more formal model of machine learning and
  • 7:49 - 7:51
    trying to prove when
  • 7:51 - 7:57
    each of these two twin problems comes up.
  • 7:57 - 7:59
    And as the motivating
  • 7:59 - 8:01
  • 8:01 - 8:04
    example for our initial foray into learning theory,
  • 8:04 - 8:10
    I want to talk about learning classification,
  • 8:10 - 8:13
    in which
  • 8:14 - 8:16
    H of X is equal
  • 8:16 - 8:19
    to G of theta transpose X. Okay?
  • 8:19 - 8:21
    So, a linear classifier.
  • 8:21 - 8:27
    And for this class I'm going to use, Z
  • 8:27 - 8:33
    excuse me. I'm gonna use G of Z equal to the indicator of Z greater than or equal to
  • 8:33 - 8:40
    zero.
  • 8:40 - 8:44
    With apologies in advance for changing the notation yet again,
  • 8:44 - 8:45
    for the support vector machine
  • 8:45 - 8:49
    lectures we use Y equals minus one or plus one. For
  • 8:49 - 8:53
    learning theory lectures, turns out it'll be a bit cleaner if I switch back to
  • 8:53 - 8:58
    Y equals zero-one again, so I'm gonna switch back to my original notation.
  • 8:58 - 9:02
    And so you can think of this model as
  • 9:02 - 9:05
    logistic regression, say, and think of this as being
  • 9:05 - 9:07
    similar to logistic regression,
  • 9:07 - 9:08
    except that now we're going to force
  • 9:08 - 9:12
    the logistic regression algorithm to output labels that are
  • 9:12 - 9:14
    either zero or one. Okay?
  • 9:14 - 9:16
    So you can think of this as a
  • 9:16 - 9:21
    classifier that outputs labels zero or one instead of probabilities.
  • 9:21 - 9:25
    And so as
  • 9:25 - 9:32
    usual, let's say we're given a training set of M examples.
  • 9:34 - 9:38
    That's just my notation for writing a set of M examples ranging from I equals
  • 9:38 - 9:40
    one
  • 9:40 - 9:45
    through M. And I'm going to assume that the training examples XI, YI
  • 9:45 - 9:48
    are drawn IID
  • 9:48 - 9:50
    from some distribution,
  • 9:50 - 9:51
    script
  • 9:51 - 9:54
    D. Okay? Independently and identically distributed.
  • 9:54 - 10:00
    And if you are running a classification problem on houses, like
  • 10:00 - 10:03
    features of the house comma, whether the house will be sold in the next six months, then this
  • 10:03 - 10:05
    is just
  • 10:05 - 10:08
    the probability distribution over
  • 10:08 - 10:12
    features of houses and whether or not they'll be sold. Okay? So I'm gonna assume that
  • 10:12 - 10:17
    training examples are drawn IID from some probability distribution,
  • 10:18 - 10:20
    script D. Well, same thing for spam, if you're
  • 10:20 - 10:23
    trying to build a spam classifier then this would be the distribution of
  • 10:23 - 10:30
    what emails look like comma, whether they are spam or not.
  • 10:30 - 10:32
    And in particular, to understand
  • 10:32 - 10:38
    or simplify, to understand the phenomena of bias and variance, I'm actually going to use a
  • 10:38 - 10:42
    simplified model of machine learning.
  • 10:42 - 10:45
    And in particular,
  • 10:45 - 10:46
    logistic regression fits
  • 10:46 - 10:50
    the parameters of a model like this by maximizing the log likelihood.
  • 10:50 - 10:55
    But in order to understand learning algorithms more deeply, I'm just going to assume a simplified
  • 10:55 - 10:59
    model of machine learning, let me just write that down.
  • 11:02 - 11:05
    So I'm going to define training error
  • 11:05 - 11:09
    as
  • 11:09 - 11:14
    so this is the training error of a hypothesis H subscript theta.
  • 11:14 - 11:18
    I'll write this as epsilon hat of H subscript theta.
  • 11:18 - 11:20
    If I want to make the
  • 11:20 - 11:23
    dependence on a training set explicit, I'll write this with
  • 11:23 - 11:27
    a subscript S there where S is a training set.
  • 11:27 - 11:34
    And I'll define this as,
  • 11:42 - 11:44
    let's see. Okay. I
  • 11:44 - 11:48
    hope the notation is clear. This is a sum of indicator functions for whether your hypothesis
  • 11:48 - 11:52
    misclassifies the i-th example.
  • 11:52 - 11:55
    And so when you divide by M, this is just
  • 11:55 - 11:58
    in your training set, the fraction
  • 11:58 - 12:00
    of training examples your
  • 12:00 - 12:01
    hypothesis misclassifies,
  • 12:01 - 12:05
    so that's defined as the training error. And
  • 12:05 - 12:08
    training error is also called the empirical risk.
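Written out, the definition of training error (empirical risk) being described here is the following; the subscript S marks the dependence on the training set when needed.

```latex
\hat{\varepsilon}_S(h_\theta) \;=\; \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{\, h_\theta(x^{(i)}) \neq y^{(i)} \,\}
```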
  • 12:10 - 12:14
    The simplified model of machine learning I'm gonna talk about is
  • 12:14 - 12:17
    called empirical risk minimization.
  • 12:17 - 12:21
    And in particular, I'm going to assume that the way my learning algorithm works
  • 12:21 - 12:26
    is it will choose parameters
  • 12:26 - 12:33
    theta, that
  • 12:34 - 12:39
    minimize my training error. Okay?
  • 12:39 - 12:43
    And it will be this learning algorithm that we'll prove properties about.
  • 12:43 - 12:45
    And it turns out that
  • 12:45 - 12:46
    you
  • 12:46 - 12:50
    can think of this as the most basic learning algorithm, the algorithm that minimizes
  • 12:50 - 12:52
    your training error. It
  • 12:52 - 12:55
    turns out that logistic regression and support vector machines can be
  • 12:55 - 12:59
    formally viewed as approximations to this, so it
  • 12:59 - 13:03
    turns out that if you actually want to do this, this is a nonconvex optimization
  • 13:03 - 13:03
    problem.
  • 13:03 - 13:09
    It's actually NP-hard to solve this optimization problem.
  • 13:09 - 13:14
    And logistic regression and support vector machines can both be viewed as
  • 13:14 - 13:14
    approximations
  • 13:14 - 13:17
    to this nonconvex optimization problem
  • 13:17 - 13:20
    by finding a convex approximation to it.
  • 13:20 - 13:23
    Think of this as
  • 13:23 - 13:24
    similar to what
  • 13:24 - 13:28
    algorithms like logistic regression
  • 13:28 - 13:30
    are doing. So
  • 13:30 - 13:33
    let me take that
  • 13:33 - 13:34
    definition of empirical risk
  • 13:34 - 13:36
    minimization
  • 13:36 - 13:43
    and actually just rewrite it in a different equivalent way.
  • 13:43 - 13:45
    For the results I want to prove today, it turns out
  • 13:45 - 13:48
    that it will be useful to think of
  • 13:48 - 13:49
    our learning algorithm
  • 13:49 - 13:52
    as not choosing a set of parameters,
  • 13:52 - 13:55
    but as choosing a function.
  • 13:55 - 13:58
    So let me say what I mean by that. Let me define
  • 13:58 - 14:00
    the hypothesis
  • 14:00 - 14:03
    class,
  • 14:03 - 14:05
    script h,
  • 14:05 - 14:09
    as the class of all hypotheses, in other words, as the class of all linear
  • 14:09 - 14:11
    classifiers, that your
  • 14:11 - 14:14
    learning algorithm is choosing from.
  • 14:14 - 14:17
    Okay? So
  • 14:17 - 14:19
    H subscript theta
  • 14:19 - 14:23
    is a specific linear classifier, so H subscript theta,
  • 14:23 - 14:30
    in each of these functions each
  • 14:30 - 14:34
    of these is a function mapping from the input domain X to the labels zero and one. Each
  • 14:34 - 14:35
    of
  • 14:35 - 14:38
    these is a function, and as you vary the parameters theta,
  • 14:38 - 14:41
    you get different functions.
  • 14:41 - 14:44
    And so let me define the hypothesis class
  • 14:44 - 14:48
    script H to be the class of all functions that say logistic regression can choose
  • 14:48 - 14:49
    from. Okay. So
  • 14:49 - 14:53
    this is the class of all linear classifiers
  • 14:53 - 14:54
    and so
  • 14:57 - 15:01
    I'm going to define,
  • 15:01 - 15:05
    or maybe redefine empirical risk minimization as
  • 15:05 - 15:09
    instead of writing this as choosing a set of parameters, I want to think of it as
  • 15:09 - 15:10
    choosing a
  • 15:10 - 15:12
    function
  • 15:12 - 15:15
    in the hypothesis class script H
  • 15:15 - 15:22
    that minimizes
  • 15:22 - 15:27
    that minimizes my training error. Okay?
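In symbols, the two equivalent views of empirical risk minimization being described here are roughly:

```latex
\hat{\theta} \;=\; \arg\min_{\theta}\ \hat{\varepsilon}(h_\theta)
\qquad\Longleftrightarrow\qquad
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}}\ \hat{\varepsilon}(h),
\quad \text{where } \mathcal{H} = \{\, h_\theta : \theta \,\}.
```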
  • 15:27 - 15:31
    So actually can you raise your
  • 15:31 - 15:35
    hand if it makes sense to you why this is equivalent to the previous
  • 15:35 - 15:41
    formulation? Okay, cool.
  • 15:41 - 15:44
    Thanks. So for this development, it's useful to think of the algorithm as choosing a
  • 15:44 - 15:47
    function from the class instead,
  • 15:47 - 15:48
    because
  • 15:48 - 15:51
    in a more general case this set, script H,
  • 15:51 - 15:54
    can be some other class of functions. Maybe
  • 15:54 - 15:58
    it's the class of all functions represented by a neural network, or the class of all
  • 15:58 - 16:04
    some other class of functions the learning algorithm wants to choose from.
  • 16:04 - 16:05
    And
  • 16:05 - 16:11
    this definition for empirical risk minimization will still apply. Okay?
  • 16:11 - 16:13
    So
  • 16:13 - 16:16
    what we'd like to do is understand
  • 16:16 - 16:19
    whether empirical risk minimization
  • 16:19 - 16:24
    is a reasonable algorithm. Alex? Student:[Inaudible] a function that's
  • 16:24 - 16:26
    defined
  • 16:26 - 16:30
    by G
  • 16:30 - 16:35
    of theta transpose X, or is it now more general? Instructor (Andrew Ng): I see, right, so let's see,
  • 16:35 - 16:36
    I guess this
  • 16:36 - 16:41
    the question is, is H theta still defined by G of theta transpose
  • 16:41 - 16:47
    X, or is it more general? So Student:[Inaudible] Instructor (Andrew Ng): Oh,
  • 16:47 - 16:48
    yeah, so there are two answers
  • 16:48 - 16:51
    to that. One is, this framework is general, so
  • 16:51 - 16:55
    for the purpose of this lecture it may be useful to you to keep in mind a model of the
  • 16:55 - 16:57
    example of
  • 16:57 - 17:01
    when the hypothesis class is the class of all linear classifiers, such as those used by,
  • 17:01 - 17:04
    say, the perceptron algorithm or logistic regression.
  • 17:04 - 17:06
    This
  • 17:06 - 17:07
    everything on this board,
  • 17:07 - 17:11
    however, is actually more general. H can be any set of functions, mapping
  • 17:11 - 17:15
    from the input domain to the set of class labels zero and one,
  • 17:15 - 17:18
    and then you can perform empirical risk minimization
  • 17:18 - 17:21
    over any hypothesis class. For the purpose
  • 17:21 - 17:23
    of today's lecture,
  • 17:23 - 17:27
    I am going to restrict myself to talking about binary classification, but it turns
  • 17:27 - 17:31
    out everything I say generalizes to regression
  • 17:31 - 17:36
    and other problems as
  • 17:36 - 17:40
    well. Does that answer your question? Yes. Cool. All right. So I wanna understand if empirical risk minimization is a reasonable algorithm.
  • 17:40 - 17:45
    In particular, what are the things we can prove about it? So
  • 17:45 - 17:49
    clearly we don't actually care about training error, we don't really care about
  • 17:49 - 17:53
    making accurate predictions on the training set, or at least that's not the ultimate goal. The
  • 17:53 - 17:56
    ultimate goal is
  • 17:56 - 17:58
    how well it makes
  • 17:58 - 17:59
    generalization
  • 17:59 - 18:04
    how well it makes predictions on examples that we haven't seen before. How
  • 18:04 - 18:06
    well it predicts prices or
  • 18:06 - 18:07
    sale or no sale
  • 18:07 - 18:10
    outcomes of houses you haven't seen before.
  • 18:10 - 18:13
    So what we really care about
  • 18:13 - 18:16
    is generalization error, which I write
  • 18:16 - 18:18
    as epsilon of H.
  • 18:18 - 18:22
    And this is defined as the probability
  • 18:22 - 18:24
    that if I sample
  • 18:24 - 18:26
    a new example, X
  • 18:26 - 18:28
    comma Y,
  • 18:28 - 18:35
    from that distribution script D,
  • 18:40 - 18:43
    my hypothesis mislabels
  • 18:43 - 18:47
    that example.
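In symbols, the generalization error being defined here is:

```latex
\varepsilon(h) \;=\; P_{(x,y)\sim\mathcal{D}}\big[\, h(x) \neq y \,\big]
```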
  • 18:47 - 18:50
    And in terms of notational convention, usually,
  • 18:50 - 18:53
    if I use if I place a hat on top of something, it
  • 18:53 - 18:57
    usually means, not always, but it usually means that it is an attempt to
  • 18:57 - 18:59
    estimate the quantity without the hat. So
  • 18:59 - 19:01
    for example,
  • 19:01 - 19:03
    epsilon hat here
  • 19:03 - 19:06
    this is something where we're trying to think of epsilon hat, the training error,
  • 19:06 - 19:09
    as an attempt to approximate generalization error.
  • 19:09 - 19:11
    Okay, so the notation convention is
  • 19:11 - 19:15
    usually the things with the hats on top are things we're using to estimate other
  • 19:15 - 19:16
    quantities.
  • 19:16 - 19:20
    And H hat is the hypothesis output by the learning algorithm to try to
  • 19:20 - 19:25
    estimate a good function from X to Y. So
  • 19:25 - 19:29
    let's actually prove some things about when empirical risk minimization
  • 19:29 - 19:30
    will do well
  • 19:30 - 19:33
    in a sense of giving us low generalization error, which is what
  • 19:33 - 19:40
    we really care about.
  • 19:47 - 19:50
    In order to prove our first learning theory result, I'm going to have to state
  • 19:50 - 19:53
    two lemmas, the first is
  • 19:53 - 20:00
    the union
  • 20:00 - 20:07
    bound, which is the following,
  • 20:08 - 20:11
    let A1 through AK be
  • 20:11 - 20:12
    K events. And when
  • 20:12 - 20:16
    I say events, I mean events in a sense of a probabilistic event
  • 20:16 - 20:18
    that either happens or not.
  • 20:18 - 20:25
    And these are not necessarily independent.
  • 20:28 - 20:32
    So there's some joint distribution over the events A one through AK,
  • 20:32 - 20:35
    and maybe they're independent maybe not,
  • 20:35 - 20:41
    no assumption on that. Then
  • 20:41 - 20:45
    the probability of A one or
  • 20:45 - 20:46
    A two
  • 20:46 - 20:48
    or dot, dot,
  • 20:48 - 20:49
    dot, up to
  • 20:49 - 20:52
    AK, this union symbol, this
  • 20:52 - 20:54
    cup, this just means
  • 20:54 - 20:58
    this is just set notation; for probabilities it just means or. So the probability
  • 20:58 - 20:59
    of
  • 20:59 - 21:02
    at least one of these events occurring, of A one or A two, or up
  • 21:02 - 21:03
    to AK,
  • 21:03 - 21:08
    this is less than or equal to the probability of A one plus probability of A two plus dot,
  • 21:08 - 21:11
    dot, dot, plus probability of AK.
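Written out, the union bound stated here is:

```latex
P(A_1 \cup A_2 \cup \cdots \cup A_k) \;\le\; P(A_1) + P(A_2) + \cdots + P(A_k)
```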
  • 21:11 - 21:18
    Okay? So
  • 21:18 - 21:21
    the intuition behind this is just that
  • 21:21 - 21:23
    I'm not sure if you've seen Venn diagrams
  • 21:23 - 21:26
    depictions of probability before, if you haven't,
  • 21:26 - 21:29
    what I'm about to do may be a little cryptic, so just ignore that. Just
  • 21:29 - 21:32
    ignore what I'm about to do if you haven't seen it before.
  • 21:32 - 21:37
    But if you have seen it before then this is really
  • 21:37 - 21:41
    this is really just saying the probability of A one,
  • 21:41 - 21:44
    union A two, union A three, is less
  • 21:44 - 21:47
    than the P of A one, plus
  • 21:47 - 21:51
    P of A two, plus P of A
  • 21:51 - 21:52
    three.
  • 21:52 - 21:54
    Right. So that the total mass in
  • 21:54 - 21:58
    the union of these three things is less than or equal to the sum of the masses
  • 21:58 - 22:01
    in the three individual sets, it's not very surprising.
  • 22:01 - 22:05
    It turns out that depending on how you define your axioms of probability,
  • 22:05 - 22:08
    this is actually one of the axioms of probability theory, so
  • 22:08 - 22:12
    I won't actually try to prove this. This is usually
  • 22:12 - 22:17
    written as an axiom. Sigma subadditivity of probability
  • 22:17 - 22:17
    measures is what
  • 22:17 - 22:24
    this is sometimes called, as well.
  • 22:26 - 22:31
    But in learning theory it's commonly called the union bound, so I'll just call it that. The
  • 22:31 - 22:32
    other
  • 22:32 - 22:39
    lemma I need is called the Hoeffding inequality. And
  • 22:40 - 22:43
    again, I won't actually prove this, I'll just state it,
  • 22:43 - 22:44
    which is
  • 22:44 - 22:46
    let's let Z1 up to ZM,
  • 22:46 - 22:49
    be M IID
  • 22:49 - 22:52
    Bernoulli
  • 22:52 - 22:55
  • 22:55 - 23:02
    random variables with mean Phi.
  • 23:07 - 23:09
    So the probability of ZI equals 1 is equal to
  • 23:09 - 23:16
    Phi.
  • 23:16 - 23:18
    So
  • 23:18 - 23:23
    let's say you observe M IID Bernoulli random variables and you want to estimate their
  • 23:23 - 23:24
    mean.
  • 23:24 - 23:29
    So let me define Phi hat, and this is again that notational convention, Phi hat means
  • 23:30 - 23:30
    it's an attempt
  • 23:30 - 23:34
    to estimate something. So we define Phi
  • 23:34 - 23:39
    hat to be one over M, sum from i equals one through M of ZI. Okay?
  • 23:39 - 23:41
    So this is
  • 23:41 - 23:44
    our attempt to estimate the mean of these Bernoulli random variables by sort of taking
  • 23:44 - 23:47
    its average.
  • 23:49 - 23:53
    And let any gamma
  • 23:53 - 23:55
    be fixed.
  • 23:59 - 24:01
    Then,
  • 24:01 - 24:04
    the Hoeffding inequality
  • 24:04 - 24:11
    is that
  • 24:14 - 24:17
    the probability your estimate of Phi
  • 24:17 - 24:21
    is more than gamma away from the true value of Phi,
  • 24:21 - 24:26
    that this is bounded by two E to the negative two gamma squared M. Okay?
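Written out, the Hoeffding inequality stated here is:

```latex
\hat{\phi} = \frac{1}{m}\sum_{i=1}^{m} Z_i,\quad Z_i \sim \mathrm{Bernoulli}(\phi)\ \text{i.i.d.}
\qquad\Longrightarrow\qquad
P\big(\,|\phi - \hat{\phi}| > \gamma\,\big) \;\le\; 2\exp(-2\gamma^2 m)
```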
  • 24:26 - 24:31
    So just in pictures,
  • 24:31 - 24:35
    so this theorem holds, this lemma, the Hoeffding inequality,
  • 24:35 - 24:38
    this is just a statement of fact, this just holds true.
  • 24:38 - 24:42
    But let me now draw a cartoon to describe some of the intuition behind this, I
  • 24:42 - 24:44
    guess.
  • 24:44 - 24:45
    So
  • 24:45 - 24:49
    lets say [inaudible] this is a real number line from zero to one.
  • 24:49 - 24:53
    And so Phi is the mean of your Bernoulli random variables.
  • 24:54 - 24:56
    You will remember from you know,
  • 24:56 - 24:59
    whatever some undergraduate probability or statistics class,
  • 24:59 - 25:03
    the central limit theorem that says that when you average all the things together,
  • 25:03 - 25:05
    you tend to get a Gaussian distribution.
  • 25:05 - 25:10
    And so when you toss M coins with bias Phi, we observe these M
  • 25:10 - 25:12
    Bernoulli random variables,
  • 25:12 - 25:18
    and we average them, then the probability distribution of
  • 25:18 - 25:25
    Phi hat
  • 25:26 - 25:29
    will roughly be a Gaussian lets say. Okay? It
  • 25:29 - 25:30
    turns out if
  • 25:30 - 25:33
    you haven't seen this before, this is actually saying that the
  • 25:33 - 25:36
    cumulative distribution function of Phi hat will converge to that of the Gaussian.
  • 25:36 - 25:40
    Technically Phi hat can only take on a discrete set of values
  • 25:40 - 25:43
    because these are multiples of one over M. It doesn't really have a
  • 25:43 - 25:45
    density, but just as a cartoon
  • 25:45 - 25:49
    think of it as converging roughly to a Gaussian.
  • 25:49 - 25:56
    So what the Hoeffding inequality says is that if you pick a value of gamma, let me put
  • 25:56 - 26:00
    one interval of gamma there, and another interval of gamma there.
  • 26:00 - 26:03
    Then it's saying that the probability mass of the tails,
  • 26:03 - 26:06
    in other words the probability that
  • 26:06 - 26:08
    my value of Phi hat is more than
  • 26:08 - 26:11
    a gamma away from the true value,
  • 26:11 - 26:16
    that the total mass
  • 26:16 - 26:18
    that the
  • 26:18 - 26:22
    total probability mass in these tails is at most two
  • 26:22 - 26:26
    E to the negative two gamma squared M. Okay?
  • 26:26 - 26:28
    That's what the Hoeffding inequality says. So if you
  • 26:28 - 26:31
    can't read that, this just says this is just the right hand side of the bound, two E to the
  • 26:31 - 26:33
    negative two gamma squared M.
  • 26:33 - 26:36
    So this bounds the probability that you make a mistake in estimating the
  • 26:36 - 26:42
    mean of a Bernoulli random variable.
  • 26:42 - 26:43
    And the
  • 26:43 - 26:47
    cool thing about this bound the interesting thing behind this bound is that
  • 26:48 - 26:51
    it decreases exponentially in M,
  • 26:51 - 26:52
    so it says
  • 26:52 - 26:54
    that for a fixed value of gamma,
  • 26:54 - 26:56
    as you increase the size
  • 26:56 - 26:58
    of your training set, as you toss a coin more and more,
  • 26:58 - 27:02
    then the width of this Gaussian will shrink. The width of this
  • 27:02 - 27:06
    Gaussian will actually shrink like one over root M.
  • 27:06 - 27:12
    And that will cause the probability mass left in the tails to decrease exponentially,
  • 27:12 - 27:19
    quickly, as a function of M. And this will be important later. Yeah? Student: Does this come from the
  • 27:21 - 27:23
    central limit theorem [inaudible]. Instructor (Andrew Ng): No it doesn't. So this is proved by a different
  • 27:23 - 27:27
    argument. No, so for the central limit theorem, there may be a
  • 27:27 - 27:29
    version of the central limit theorem, but the
  • 27:29 - 27:33
    versions I'm familiar with tend to be sort of asymptotic, but this works
  • 27:33 - 27:36
    for any finite value of M. Oh, and
  • 27:36 - 27:40
    this bound holds even if M is equal to two, or M is [inaudible], if M is
  • 27:40 - 27:41
    very small,
  • 27:41 - 27:45
    the central limit theorem approximation is not gonna hold, but this theorem holds
  • 27:45 - 27:46
    regardless.
  • 27:46 - 27:48
    Okay? I'm
  • 27:48 - 27:52
    drawing this just as a cartoon to help explain the intuition, but this
  • 27:52 - 27:59
    theorem just holds true, without reference to the central limit theorem.
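As a quick sanity check (not from the lecture, with hypothetical parameters), here is a small simulation comparing the observed tail probability of the coin-flip average against the Hoeffding bound.

```python
# Hypothetical Monte Carlo check of the Hoeffding bound for Bernoulli averages.
import numpy as np

rng = np.random.default_rng(0)
phi, m, gamma = 0.3, 50, 0.1                      # assumed bias, sample size, tolerance
trials = 100_000

phi_hat = rng.binomial(m, phi, size=trials) / m            # average of m coin flips, per trial
observed_tail = np.mean(np.abs(phi_hat - phi) > gamma)     # estimated P(|phi_hat - phi| > gamma)
hoeffding_bound = 2 * np.exp(-2 * gamma**2 * m)

# The observed tail mass should come out below the bound.
print(f"observed tail = {observed_tail:.4f}, Hoeffding bound = {hoeffding_bound:.4f}")
```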
  • 28:09 - 28:13
    All right. So let's start to understand empirical risk minimization,
  • 28:13 - 28:20
    and what I want to do is
  • 28:23 - 28:25
    begin with
  • 28:25 - 28:28
    studying empirical risk minimization
  • 28:28 - 28:31
    for a
  • 28:31 - 28:33
    [inaudible] case
  • 28:33 - 28:37
    like logistic regression, and in particular I want to start with studying
  • 28:37 - 28:41
    the case of finite hypothesis classes.
  • 28:41 - 28:48
    So let's say script H is a class of
  • 28:49 - 28:56
    K hypotheses.
  • 28:56 - 29:00
    Right. So these are K functions; each of these is just a function mapping
  • 29:00 - 29:05
    from inputs to outputs, and there are no parameters in this.
  • 29:05 - 29:06
    And so
  • 29:06 - 29:13
    what the empirical risk minimization would do is it would take the training set
  • 29:13 - 29:14
    and it'll then
  • 29:14 - 29:17
    look at each of these K functions,
  • 29:17 - 29:22
    and it'll pick whichever of these functions has the lowest training error. Okay?
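As a minimal sketch (not from the lecture), empirical risk minimization over a finite class might look like the following in Python; the three threshold classifiers and the tiny dataset here are hypothetical.

```python
# Hypothetical sketch of empirical risk minimization over a finite hypothesis class.
import numpy as np

def training_error(h, X, y):
    """Fraction of training examples that hypothesis h misclassifies."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """Return the hypothesis in the finite class with the lowest training error."""
    return min(hypotheses, key=lambda h: training_error(h, X, y))

# A finite class of K = 3 fixed threshold classifiers on a single feature.
hypotheses = [lambda X, t=t: (X[:, 0] > t).astype(int) for t in (0.0, 0.5, 1.0)]

X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
h_hat = erm(hypotheses, X, y)
print(training_error(h_hat, X, y))    # 0.0: the threshold at 0.5 fits this toy set exactly
```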
  • 29:22 - 29:24
    So now, logistic regression uses an infinitely
  • 29:24 - 29:29
    large, a continuous, infinitely large class of hypotheses, script H,
  • 29:29 - 29:32
    but to prove the first result I actually want to just
  • 29:32 - 29:33
    describe
  • 29:33 - 29:38
    our first learning theorem for the case of when you have a finite hypothesis class, and then
  • 29:38 - 29:42
    we'll later generalize that to infinite hypothesis classes.
  • 29:43 - 29:46
    So
  • 29:53 - 29:58
    empirical risk minimization takes the hypothesis of the lowest training error,
  • 29:58 - 30:04
    and what I'd like to do is prove a bound on the generalization error
  • 30:04 - 30:06
    of H hat. All right. So in other words I'm
  • 30:06 - 30:07
    gonna prove that
  • 30:07 - 30:14
    somehow minimizing training error allows me to do well on generalization error.
  • 30:14 - 30:16
    And here's the strategy,
  • 30:16 - 30:20
    I'm going to
  • 30:20 - 30:23
    the first step in this proof is I'm going to show that
  • 30:23 - 30:25
    training error
  • 30:25 - 30:30
    is a good approximation to generalization error,
  • 30:32 - 30:35
    and then I'm going to show
  • 30:35 - 30:38
    that this implies a bound
  • 30:38 - 30:40
    on
  • 30:40 - 30:43
    the generalization error
  • 30:43 - 30:48
    of the hypothesis output by empirical risk
  • 30:48 - 30:53
    minimization. And I just realized, this class I guess is also maybe slightly notation
  • 30:53 - 30:56
    heavy class
  • 30:56 - 30:59
    in that I'm introducing a reasonably large set of new symbols, so if,
  • 30:59 - 31:03
    again, in the course of today's lecture, you're looking at some symbol and you don't quite
  • 31:03 - 31:07
    remember what it is, please raise your hand and ask. [Inaudible] what's that, what was that, was that a
  • 31:07 - 31:12
    generalization error or was it something else? So raise your hand and
  • 31:12 - 31:17
    ask if you don't understand the notation I was defining.
  • 31:17 - 31:20
    Okay. So let me introduce this in two steps. The strategy is I'm gonna show that training error
  • 31:20 - 31:23
    gives a good approximation to generalization error,
  • 31:23 - 31:26
    and this will imply that minimizing training error
  • 31:26 - 31:30
    will also do pretty well in terms of minimizing generalization error.
  • 31:30 - 31:33
    And this will give us a bound on the generalization error
  • 31:33 - 31:40
    of the hypothesis output by empirical risk minimization. Okay?
  • 31:40 - 31:47
    So here's the idea.
  • 31:48 - 31:52
    So
  • 31:52 - 31:56
    lets even not consider all the hypotheses at once, lets pick any
  • 31:56 - 32:00
    hypothesis, HJ in the class script H, and
  • 32:00 - 32:05
    so until further notice lets just consider there one fixed hypothesis. So pick any
  • 32:05 - 32:10
    one hypothesis and let's talk about that
  • 32:10 - 32:12
    one. Let
  • 32:12 - 32:15
    me define ZI
  • 32:15 - 32:22
    to be indicator function
  • 32:24 - 32:25
    for
  • 32:25 - 32:30
    whether this hypothesis misclassifies the i-th example, excuse me,
  • 32:30 - 32:34
    or Z subscript I. Okay?
  • 32:34 - 32:40
    So
  • 32:40 - 32:43
    ZI would be zero or one
  • 32:43 - 32:47
    depending on whether this one hypothesis which is the only one
  • 32:47 - 32:49
    I'm gonna even consider now,
  • 32:49 - 32:52
    whether this hypothesis misclassifies that example.
  • 32:52 - 32:57
    And so
  • 32:57 - 33:01
    my training set is drawn randomly from some distribution script
  • 33:01 - 33:02
    d,
  • 33:02 - 33:05
    and
  • 33:05 - 33:10
    depending on what training examples I've got, these ZIs would be either zero or one.
  • 33:10 - 33:14
    So let's figure out what the probability distribution ZI is.
  • 33:14 - 33:15
    Well,
  • 33:15 - 33:16
    so
  • 33:16 - 33:21
    ZI takes on the value of either zero or one, so clearly it's a Bernoulli random variable, it can only
  • 33:21 - 33:25
    take on these values.
  • 33:25 - 33:31
    Well,
  • 33:31 - 33:35
    what's the probability that ZI is equal to one? In other words, what's the
  • 33:35 - 33:36
    probability
  • 33:36 - 33:41
    that from a fixed hypothesis HJ,
  • 33:41 - 33:45
    when I sample my training set IID from distribution D, what is the chance
  • 33:45 - 33:46
    that
  • 33:46 - 33:50
    my hypothesis will misclassify it?
  • 33:50 - 33:54
    Well, by definition,
  • 33:54 - 33:57
    that's just the generalization error of my
  • 33:57 - 34:01
    hypothesis HJ.
  • 34:01 - 34:04
    So ZI is a Bernoulli random variable
  • 34:04 - 34:05
    with mean
  • 34:05 - 34:12
    given by the generalization error of this hypothesis.
  • 34:14 - 34:20
    Raise your hand if that made sense. Oh, cool. Great.
  • 34:20 - 34:22
    And moreover,
  • 34:22 - 34:24
    all the ZIs have the same
  • 34:24 - 34:28
    probability of being one, and all my training examples I've drawn are IID,
  • 34:28 - 34:33
    and so the ZIs are also independent
  • 34:33 - 34:39
    and therefore
  • 34:39 - 34:42
    the ZIs themselves are IID random
  • 34:42 - 34:49
    variables. Okay? Because my training examples were drawn independently of each other, by assumption.
  • 34:56 - 34:59
    If you read this as the definition of training error,
  • 34:59 - 35:04
    the training error of my hypothesis HJ, that's just that.
  • 35:07 - 35:09
    That's just the average
  • 35:09 - 35:16
    of my ZIs, which was well I previously defined it like this. Okay?
  • 35:26 - 35:29
    And so epsilon hat of HJ
  • 35:29 - 35:30
    is exactly
  • 35:30 - 35:33
    the average of M IID
  • 35:33 - 35:36
    Bernoulli random variables,
  • 35:36 - 35:40
    drawn from a Bernoulli distribution with mean given by
  • 35:40 - 35:42
    the
  • 35:42 - 35:44
    generalization error, so this is
  • 35:44 - 35:48
    well, this is the average of M IID Bernoulli random variables,
  • 35:48 - 35:55
    each of which has mean given by the
  • 35:55 - 35:59
    generalization error of HJ.
  • 35:59 - 36:04
    And therefore, by
  • 36:04 - 36:07
    the Hoeffding inequality,
  • 36:07 - 36:11
    we have that the
  • 36:11 - 36:16
    probability
  • 36:16 - 36:20
    that the difference between training and generalization error, the probability that this is greater than gamma is
  • 36:20 - 36:22
    less than or equal to two E to
  • 36:22 - 36:25
    the negative two,
  • 36:25 - 36:27
    gamma squared M. Okay?
  • 36:27 - 36:32
    Exactly by the Hoeffding inequality.
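Putting the pieces just described into symbols, for the one fixed hypothesis h_j:

```latex
Z_i = \mathbf{1}\{h_j(x^{(i)}) \neq y^{(i)}\},\qquad
\hat{\varepsilon}(h_j) = \frac{1}{m}\sum_{i=1}^{m} Z_i,\qquad
\mathbb{E}[Z_i] = \varepsilon(h_j)
\;\Longrightarrow\;
P\big(\,|\varepsilon(h_j) - \hat{\varepsilon}(h_j)| > \gamma\,\big) \;\le\; 2\exp(-2\gamma^2 m)
```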
  • 36:32 - 36:35
    And what this proves is that,
  • 36:35 - 36:37
    for my fixed hypothesis HJ,
  • 36:37 - 36:41
    my training error, epsilon hat will
  • 36:41 - 36:44
    with high probability, assuming M is large, if
  • 36:44 - 36:47
    M is large then this thing on the right hand side will be small, because
  • 36:47 - 36:49
    this is two E to the negative two gamma squared M.
  • 36:49 - 36:53
    So this says that if my training set is large enough,
  • 36:53 - 36:54
    then the probability
  • 36:54 - 36:57
    my training error is far from generalization error,
  • 36:57 - 36:59
    meaning that it is more than gamma,
  • 36:59 - 37:06
    will be small, will be bounded by this thing on the right hand side. Okay? Now,
  • 37:09 - 37:12
    here's the [inaudible] tricky part,
  • 37:12 - 37:17
    what we've done is prove this bound for one fixed hypothesis, for HJ.
  • 37:17 - 37:20
    What I want to prove is that training error will be a good estimate for generalization
  • 37:20 - 37:21
    error,
  • 37:21 - 37:26
    not just for this one hypothesis HJ, but actually for all K hypotheses in my
  • 37:26 - 37:28
    hypothesis class
  • 37:28 - 37:30
    script
  • 37:30 - 37:31
    H.
  • 37:31 - 37:33
    So let's do it
  • 37:36 - 37:43
    well,
  • 37:55 - 37:58
    better do it on a new board. So in order to show that, let me define
  • 37:58 - 38:01
    a random event, let me define AJ to be the event
  • 38:01 - 38:05
    that
  • 38:05 - 38:12
    let
  • 38:21 - 38:24
    me define AJ to be the event that
  • 38:24 - 38:28
    you know, the difference between training and generalization error is more than gamma on a
  • 38:28 - 38:30
    hypothesis HJ.
  • 38:30 - 38:32
    And so what we
  • 38:32 - 38:36
    put on the previous board was that the probability of AJ is less than or equal to two E
  • 38:36 - 38:43
    to the negative two, gamma squared M, and this is pretty small. Now,
  • 38:44 - 38:50
    What I want to bound is the probability that
  • 38:50 - 38:54
    there exists some hypothesis in my class
  • 38:54 - 39:01
    script H,
  • 39:03 - 39:08
    such that I make a large error in my estimate of generalization error. Okay?
  • 39:08 - 39:10
    Such that this holds true.
  • 39:12 - 39:14
    So this is really just
  • 39:14 - 39:18
    that the probability that there exists a hypothesis for which this
  • 39:18 - 39:21
    holds. This is really the probability that
  • 39:21 - 39:25
    A one or A two, or up to AK holds.
  • 39:25 - 39:29
    The chance
  • 39:29 - 39:33
    there exists a hypothesis is just, well, the probability
  • 39:33 - 39:37
    that for hypothesis one I make a large error in estimating the generalization error,
  • 39:37 - 39:41
    or for hypothesis two I make a large error in estimating generalization error,
  • 39:41 - 39:44
    and so on.
  • 39:44 - 39:51
    And so by the union bound, this is less than or equal to
  • 39:51 - 39:56
    that,
  • 39:56 - 39:59
    which is therefore less than or equal to
  • 40:07 - 40:14
    is
  • 40:14 - 40:21
    equal to that. Okay?
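Written out, the chain of inequalities described here is:

```latex
P\big(\exists\, h_j \in \mathcal{H}: |\varepsilon(h_j)-\hat{\varepsilon}(h_j)| > \gamma\big)
\;=\; P(A_1 \cup \cdots \cup A_K)
\;\le\; \sum_{j=1}^{K} P(A_j)
\;\le\; 2K\exp(-2\gamma^2 m)
```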
  • 40:39 - 40:40
    So let
  • 40:40 - 40:42
    me just take
  • 40:42 - 40:48
    one minus both sides of the equation
  • 40:48 - 40:51
    on the previous board let me take one minus both sides, so
  • 40:51 - 40:57
    the probability that there does not exist a
  • 40:57 - 40:59
    hypothesis
  • 40:59 - 41:06
    such that,
  • 41:08 - 41:11
    that. The probability that there does not exist a hypothesis on which I make a large
  • 41:11 - 41:13
    error in this estimate
  • 41:13 - 41:20
    well, this is equal to the probability that for all hypotheses,
  • 41:23 - 41:25
    I make a
  • 41:25 - 41:27
    small error, or at
  • 41:27 - 41:32
    most gamma, in my estimate of generalization error.
  • 41:32 - 41:36
    And taking one minus on the right hand side, I get one minus two KE
  • 41:36 - 41:39
    to the negative two
  • 41:39 - 41:44
    gamma squared M. Okay?
  • 41:44 - 41:46
    And so
  • 41:46 - 41:50
    and the sign of the inequality flipped because I took one minus both
  • 41:50 - 41:54
    sides. The minus sign flips the sign of the
  • 41:54 - 41:56
    inequality.
  • 41:56 - 42:00
    So what we've shown is that
  • 42:00 - 42:05
    with probability which abbreviates to WP with probability one minus
  • 42:05 - 42:09
    two KE to the negative two gamma squared M.
  • 42:09 - 42:12
    We have that, epsilon hat
  • 42:12 - 42:16
    of H
  • 42:16 - 42:21
    will be
  • 42:21 - 42:28
    within gamma of epsilon of H,
  • 42:31 - 42:33
    simultaneously for all
  • 42:33 - 42:35
    hypotheses in our
  • 42:35 - 42:42
    class script H.
  • 42:43 - 42:47
    And so
  • 42:47 - 42:54
    just to give this result a name, this is called
  • 42:56 - 43:00
    this is one instance of what's called a uniform convergence result,
  • 43:00 - 43:01
    and
  • 43:01 - 43:04
    the term uniform convergence
  • 43:04 - 43:06
    this sort of alludes to the fact that
  • 43:06 - 43:10
    this shows that as M becomes large,
  • 43:10 - 43:12
    then these epsilon hats
  • 43:12 - 43:16
    will all simultaneously converge to epsilon of H. That
  • 43:16 - 43:18
    training error will
  • 43:18 - 43:21
    become very close to generalization error
  • 43:21 - 43:23
    simultaneously for all hypotheses H.
  • 43:23 - 43:25
    That's what the term uniform
  • 43:25 - 43:29
    refers to, is the fact that this converges for all hypotheses H and not just
  • 43:29 - 43:31
    for one hypothesis. And so
  • 43:31 - 43:36
    what we've shown is one example of a uniform convergence result. Okay?
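Stated compactly, the uniform convergence result shown here is:

```latex
\text{With probability at least } 1 - 2K\exp(-2\gamma^2 m):\qquad
|\varepsilon(h) - \hat{\varepsilon}(h)| \le \gamma \quad \text{for every } h \in \mathcal{H}.
```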
  • 43:36 - 43:39
    So let me clean a couple more boards. I'll come back and ask what questions you have
  • 43:39 - 43:46
    about this. We should take another look at this and make sure it all makes sense. Yeah, okay.
  • 44:20 - 44:27
    What questions do you have about this?
  • 44:28 - 44:31
  • 44:31 - 44:34
    Student: How is the
  • 44:34 - 44:36
    value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is how is the value of gamma computed?
  • 44:36 - 44:40
    So for these purposes for the purposes of this, gamma is a constant.
  • 44:40 - 44:43
    Imagine a gamma is some constant that we chose in advance,
  • 44:43 - 44:49
    and this is a bound that holds true for any fixed value
  • 44:49 - 44:51
    of gamma. Later on as we
  • 44:51 - 44:54
    take this bound and then sort of develop this result further,
  • 44:54 - 44:58
    we'll choose specific values of gamma as a [inaudible] of this bound. For now
  • 44:58 - 45:05
    we'll just imagine that what we've proved holds true for any value of gamma. Any questions?
  • 45:09 - 45:12
    Yeah? Student: [Inaudible] hypothesis class is infinite [inaudible]? Instructor (Andrew Ng): Yes, so if the hypothesis class is infinite,
  • 45:12 - 45:16
    so this simple result won't work in this present form, but we'll generalize this
  • 45:16 - 45:20
    probably won't get to it today but we'll generalize this at the beginning of the
  • 45:20 - 45:27
    next lecture to infinite hypothesis classes. Student:How do we use this
  • 45:30 - 45:32
    theory [inaudible]? Instructor (Andrew Ng): How do you use this theory in practice? So let me
  • 45:32 - 45:37
    I might get to a little of that later today, we'll talk concretely about algorithms,
  • 45:37 - 45:39
    the consequences of
  • 45:39 - 45:46
    the understanding of these things in the next lecture as well. Yeah, okay? Cool. Can you
  • 45:46 - 45:47
    just raise your hand if the
  • 45:47 - 45:50
    things I've proved so far
  • 45:50 - 45:55
    make sense? Okay. Cool. Great.
  • 45:55 - 45:59
    Thanks. All right. Let me just take this uniform convergence bound and rewrite
  • 45:59 - 46:02
    it in a couple of other forms.
  • 46:02 - 46:05
    So
  • 46:05 - 46:09
    this is a sort of a bound on probability, this is saying suppose I fix my training
  • 46:09 - 46:11
    set, and then fix my
  • 46:11 - 46:15
    threshold, my error threshold gamma,
  • 46:15 - 46:19
    what is the probability that uniform convergence holds, and well,
  • 46:19 - 46:23
    that's my formula that gives the answer. This is the probability of something
  • 46:23 - 46:26
    happening.
  • 46:26 - 46:29
    So there are actually three parameters of interest. One is,
  • 46:29 - 46:32
    What is this probability? The
  • 46:32 - 46:34
    other parameter is, What's the training set size M?
  • 46:34 - 46:36
    And the third parameter is, What
  • 46:36 - 46:37
    is the value
  • 46:37 - 46:40
    of this error
  • 46:40 - 46:42
    threshold gamma? I'm not gonna
  • 46:42 - 46:44
    vary K for these purposes.
  • 46:44 - 46:48
    So there are two other equivalent forms of the bound,
  • 46:48 - 46:50
    which so you can ask,
  • 46:50 - 46:51
    given
  • 46:51 - 46:52
    gamma
  • 46:52 - 46:53
    so what
  • 46:53 - 46:56
    we proved was given gamma and given M, what
  • 46:56 - 46:58
    is the probability of
  • 46:58 - 47:01
    uniform convergence? The
  • 47:01 - 47:04
    other equivalent forms are,
  • 47:04 - 47:07
    so that given gamma
  • 47:07 - 47:14
    and the probability delta of making a large error,
  • 47:15 - 47:19
    how large a training set size do you need in order to give
  • 47:19 - 47:22
    a bound on how large a
  • 47:22 - 47:25
    training set size do you need
  • 47:25 - 47:30
    to give a uniform convergence bound with parameters gamma
  • 47:30 - 47:31
    and delta?
  • 47:31 - 47:34
    And so well,
  • 47:34 - 47:38
    so if you set delta to be two KE to the negative two gamma
  • 47:38 - 47:38
    squared M.
  • 47:38 - 47:41
    This is that form that I had on the left.
  • 47:41 - 47:44
    And if you solve
  • 47:44 - 47:46
    for M,
  • 47:46 - 47:48
    what you find is that
  • 47:48 - 47:54
    there's an equivalent form of this result that says that
  • 47:54 - 47:56
    so long as your training
  • 47:56 - 47:59
    set size M is greater than this.
  • 47:59 - 48:05
    And this is the formula that I get by solving for M.
  • 48:05 - 48:09
    Okay? So long as M is greater than or equal to this,
  • 48:09 - 48:13
    then with probability, which I'm abbreviating to WP again,
  • 48:13 - 48:17
    with probability at least one minus delta,
  • 48:17 - 48:24
    we have for all.
  • 48:25 - 48:31
    Okay?
  • 48:31 - 48:34
    So this says how large a training set size that I need
  • 48:34 - 48:38
    to guarantee that with probability at least one minus delta,
  • 48:38 - 48:42
    we have that training error is within gamma of generalization error for all my
  • 48:42 - 48:45
    hypotheses, and this gives an answer.
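Solving delta = 2K exp(-2 gamma squared m) for m, the form being described here comes out as:

```latex
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2K}{\delta}
\quad\Longrightarrow\quad
\text{w.p. at least } 1-\delta,\ \ |\varepsilon(h)-\hat{\varepsilon}(h)| \le \gamma \ \text{ for all } h \in \mathcal{H}.
```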
  • 48:45 - 48:49
    And just to give this another name, this is an example of a sample
  • 48:49 - 48:53
    complexity
  • 48:53 - 48:58
  • 48:58 - 48:59
    bound.
  • 48:59 - 49:03
    So from undergrad computer science classes you may have heard of
  • 49:03 - 49:06
    computational complexity, which is how much computation you need to achieve
  • 49:06 - 49:07
    something.
  • 49:07 - 49:11
    So sample complexity just means how large a training set, how
  • 49:11 - 49:14
    large a sample of examples do you need
  • 49:14 - 49:17
    in order to achieve a certain bound on the error.
  • 49:17 - 49:20
    And it turns out that in many of the theorems we write out you can
  • 49:20 - 49:24
    pose them in sort of a form of probability bound or a sample complexity bound or in some other
  • 49:24 - 49:24
    form.
  • 49:24 - 49:28
    I personally often find the sample complexity bounds the most
  • 49:28 - 49:32
    easy to interpret because it says how large a training set do you need to give a certain bound on the
  • 49:32 - 49:36
    errors.
  • 49:36 - 49:39
    And in fact well, we'll see this later,
  • 49:39 - 49:41
    sample complexity bounds often sort of
  • 49:41 - 49:43
    help to give guidance for really if
  • 49:43 - 49:44
    you're trying to
  • 49:44 - 49:46
    achieve something on a machine learning problem,
  • 49:46 - 49:47
    this really is
  • 49:47 - 49:50
    trying to give guidance on how much training data you need
  • 49:50 - 49:54
    to prove something.
  • 49:54 - 49:58
    The one thing I want to note here is that M
  • 49:58 - 50:01
    grows like the log of K, right, so
  • 50:01 - 50:06
    the log of K grows extremely slowly as a function of K. The log is one
  • 50:06 - 50:11
    of the slowest growing functions, right. It's one of well,
  • 50:11 - 50:18
    some of you may have heard this, right? That for all values of K, right I learned
  • 50:19 - 50:22
    this from a colleague, Andrew Moore,
  • 50:22 - 50:24
    at Carnegie Mellon
  • 50:24 - 50:25
    that in
  • 50:25 - 50:31
    computer science, for all practical purposes, for all values of K, log K is less than [inaudible]; this is
  • 50:31 - 50:32
    almost true.
  • 50:32 - 50:36
    So the log is one of the slowest growing functions, and so the
  • 50:36 - 50:40
    fact that the sample complexity M grows like the log of K,
  • 50:40 - 50:42
    means that
  • 50:42 - 50:46
    you can increase the number of hypotheses in your hypothesis class quite a lot
  • 50:46 - 50:51
    and the number of training examples you need won't grow
  • 50:51 - 50:56
  • 50:56 - 51:00
    very much. [Inaudible]. This property will be important later when we talk about infinite
  • 51:00 - 51:03
    hypothesis classes.
  • 51:03 - 51:06
    The final form is the
  • 51:06 - 51:08
    I guess it's sometimes called the error bound,
  • 51:08 - 51:14
    which is when you hold M and delta fixed and solve for gamma.
  • 51:14 - 51:21
    And so
  • 51:23 - 51:27
    and what you get then is that, with
  • 51:27 - 51:31
    probability at least one minus delta,
  • 51:31 - 51:38
    we have that.
  • 51:41 - 51:44
    For all hypotheses in my hypothesis class,
  • 51:44 - 51:51
    the difference between training and generalization error will be less
  • 51:54 - 51:55
    than or equal to that. Okay? And that's just
  • 51:55 - 52:02
    solving for gamma and plugging the value I get in there. Okay? All right.
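Solving for gamma instead, the error bound form being described here is:

```latex
\text{W.p. at least } 1-\delta:\qquad
|\varepsilon(h) - \hat{\varepsilon}(h)| \;\le\; \sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}
\quad \text{for all } h \in \mathcal{H}.
```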
  • 52:03 - 52:10
    So the
  • 52:31 - 52:34
    second
  • 52:34 - 52:36
    step of
  • 52:36 - 52:43
    the overall proof I want to execute is the following.
  • 52:45 - 52:47
    The result on training error is essentially that uniform
  • 52:47 - 52:49
    convergence will hold true
  • 52:49 - 52:51
    with high probability.
  • 52:51 - 52:55
    What I want to show now is, let's assume that uniform convergence holds. So let's
  • 52:55 - 52:58
    assume that for all
  • 52:58 - 53:00
    hypotheses H,
  • 53:00 - 53:05
    we have that epsilon of H minus epsilon hat of H, is
  • 53:05 - 53:09
    less than or equal to gamma. Okay?
  • 53:09 - 53:11
    What I want to do now is
  • 53:11 - 53:13
    use this to see what we can
  • 53:13 - 53:15
    prove about the bound
  • 53:15 - 53:22
    of see what we can prove about the generalization error.
  • 53:29 - 53:31
  • 53:31 - 53:33
    So I want to know
  • 53:33 - 53:36
    suppose this holds true I want
  • 53:36 - 53:40
    to know can we prove something about the generalization error of H hat, where
  • 53:40 - 53:43
    again, H hat
  • 53:43 - 53:46
    was the hypothesis
  • 53:46 - 53:52
    selected by empirical risk minimization.
  • 53:52 - 53:56
    Okay? So in order to show this, let me make one more definition, let me define H
  • 53:56 - 54:00
    star,
  • 54:00 - 54:03
    to be the hypothesis
  • 54:03 - 54:06
    in my class
  • 54:06 - 54:10
    script H that has the smallest generalization error.
  • 54:10 - 54:11
    So this is
  • 54:11 - 54:15
    if I had an infinite amount of training data or if I really I
  • 54:15 - 54:19
    could go in and find the best possible hypothesis
  • 54:19 - 54:23
    best possible hypothesis in the sense of minimizing generalization error
  • 54:23 - 54:29
    what's the hypothesis I would get? Okay? So
  • 54:29 - 54:33
    in some sense, it sort of makes sense to compare the performance of our learning
  • 54:33 - 54:34
    algorithm
  • 54:34 - 54:36
    to the performance of H star, because we
  • 54:36 - 54:40
    sort of we clearly can't hope to do better than H star.
  • 54:40 - 54:42
    Another way of saying that is that if
  • 54:42 - 54:48
    your hypothesis class is a class of all linear decision boundaries, that
  • 54:48 - 54:51
    the data just can't be separated by any linear functions.
  • 54:51 - 54:54
    So if even H star is really bad,
  • 54:54 - 54:55
    then there's sort of
  • 54:55 - 54:59
    it's unlikely then there's just not much hope that your learning algorithm could do even
  • 54:59 - 55:04
    better
  • 55:04 - 55:09
    than H star. So I actually prove this result in three steps.
  • 55:09 - 55:13
    So the generalization error of H hat, the hypothesis I chose,
  • 55:13 - 55:20
    this is going to be less than equal to that, actually let me number these
  • 55:22 - 55:25
    equations, right. This
  • 55:25 - 55:27
    is
  • 55:27 - 55:30
    because of equation one, because I see that
  • 55:30 - 55:33
    epsilon of H hat and epsilon hat of H hat will then gamma of each
  • 55:33 - 55:37
    other.
  • 55:37 - 55:38
    Now
  • 55:38 - 55:42
    because H star, excuse me,
  • 55:42 - 55:46
    now by the
  • 55:46 - 55:51
    definition of empirical risk minimization, H
  • 55:51 - 55:54
    hat was chosen to minimize training error
  • 55:54 - 55:59
    and so there can't be any hypothesis with lower training error than H hat.
  • 55:59 - 56:05
    So the training error of H hat must be less than the equal to
  • 56:05 - 56:07
    the training error of H star. So
  • 56:07 - 56:11
    this is sort of by two, or by the definition of H hat,
  • 56:11 - 56:18
    as the hypothesis that minimizes training error H hat.
  • 56:18 - 56:20
    And the final step is I'm
  • 56:20 - 56:23
    going to apply this uniform conversions result again. We know that
  • 56:23 - 56:25
    epsilon hat
  • 56:25 - 56:30
    of H star must be moving gamma of epsilon of H star.
  • 56:30 - 56:34
    And so this is at most
  • 56:34 - 56:35
    plus gamma. Then
  • 56:35 - 56:36
  • 56:36 - 56:38
    I have my original gamma
  • 56:38 - 56:41
    there. Okay? And so this is by
  • 56:41 - 56:45
    equation one again because oh, excuse me
  • 56:45 - 56:48
    because I know the training error of H star must be within gamma of the
  • 56:48 - 56:50
    generalization error of H star.
  • 56:50 - 56:52
    And so well, I'll just
  • 56:52 - 56:57
    write this as plus two gamma. Okay? Yeah? Student: [Inaudible] notation, epsilon of [inaudible] H hat, that's not the training
  • 56:57 - 57:04
    error, that's the generalization error of the estimated hypothesis? Instructor (Andrew Ng): Oh,
  • 57:25 - 57:27
    okay. Let me just well, let
  • 57:27 - 57:33
    me write that down on this board. So actually let me
  • 57:33 - 57:36
    think
  • 57:36 - 57:39
    [inaudible]
  • 57:39 - 57:44
    fit this in here. So epsilon hat
  • 57:44 - 57:45
    of H is the
  • 57:45 - 57:49
    training
  • 57:49 - 57:51
  • 57:51 - 57:52
    error of the hypothesis H. In
  • 57:52 - 57:56
    other words, given the hypothesis (a hypothesis is just a function, right, mapping from X's
  • 57:56 - 57:58
    to Y's),
  • 57:58 - 57:59
    so epsilon hat of H is
  • 57:59 - 58:03
    given the hypothesis H, what's the fraction of training examples it
  • 58:03 - 58:05
    misclassifies?
  • 58:05 - 58:12
    And generalization error
  • 58:12 - 58:14
    of
  • 58:14 - 58:18
    H, is given the hypothesis H if I
  • 58:18 - 58:19
    sample another
  • 58:19 - 58:22
    example from my distribution
  • 58:22 - 58:23
    scripts
  • 58:23 - 58:28
    D, what's the probability that H will misclassify that example? Does that make sense?
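In symbols (a restatement of the two definitions just given, using the notation from earlier in the lecture), for a training set of m examples drawn i.i.d. from the distribution D:

$$\hat{\varepsilon}(h) \;=\; \frac{1}{m}\sum_{i=1}^{m} 1\{h(x^{(i)}) \neq y^{(i)}\}, \qquad \varepsilon(h) \;=\; P_{(x,y)\sim\mathcal{D}}\big(h(x) \neq y\big).$$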
  • 58:28 - 58:32
    Oh,
  • 58:32 - 58:38
    okay. And H hat is the hypothesis that's chosen by empirical risk minimization.
  • 58:38 - 58:43
    So when I talk about empirical risk minimization, that's the algorithm that
  • 58:43 - 58:48
    minimizes training error, and so epsilon hat of H is the training error of H,
  • 58:48 - 58:50
    and so H hat is defined as
  • 58:50 - 58:52
    the hypothesis that out
  • 58:52 - 58:55
    of all hypotheses in my class script H,
  • 58:55 - 58:58
    the one that minimizes training error, epsilon hat of H. Okay? All right. Yeah? Student: [Inaudible] H is [inaudible] a member
  • 58:58 - 59:05
    of
  • 59:07 - 59:09
    script H, [inaudible]
  • 59:09 - 59:11
    family
  • 59:11 - 59:13
    right? Instructor (Andrew Ng): Yes, it is.
  • 59:13 - 59:20
    Student: So what happens with the generalization error [inaudible]?
  • 59:20 - 59:24
    Instructor (Andrew Ng): I'll talk about that later. So
  • 59:24 - 59:31
    let me tie all these things together into a
  • 59:34 - 59:38
    theorem.
  • 59:38 - 59:43
    Let there be a hypothesis class given with a
  • 59:43 - 59:49
    finite set of K hypotheses,
  • 59:49 - 59:51
    and let any M
  • 59:51 - 59:52
    and delta
  • 59:52 - 59:58
    be fixed.
  • 59:58 - 60:02
    Then so I fixed M and delta, so this will be the error bound
  • 60:02 - 60:04
    form of the theorem, right?
  • 60:04 - 60:07
    Then, with
  • 60:07 - 60:10
    probability at least one minus delta.
  • 60:10 - 60:14
    We have that. The generalization error of H hat is
  • 60:14 - 60:19
    less than or equal to
  • 60:19 - 60:23
    the minimum over all hypotheses in
  • 60:23 - 60:25
    set H epsilon of H,
  • 60:25 - 60:27
    plus
  • 60:27 - 60:34
    two times,
  • 60:38 - 60:39
    that square root term. Okay?
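Written out, the bound being stated (a reconstruction of what is on the board, with the square-root term coming from the uniform convergence result discussed just below) is:

$$\varepsilon(\hat{h}) \;\le\; \Big(\min_{h\in\mathcal{H}}\varepsilon(h)\Big) \;+\; 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}\,.$$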
  • 60:39 - 60:42
    So to prove this, well,
  • 60:42 - 60:46
    this term of course is just epsilon of H star.
  • 60:46 - 60:48
    And so to prove this
  • 60:48 - 60:50
    we set
  • 60:50 - 60:53
    gamma equal to that
  • 60:53 - 60:56
    this is two times the square root term.
  • 60:56 - 61:03
    To prove this theorem we set gamma equal to that square root term. Say that again?
  • 61:04 - 61:08
    Wait. Say that again?
  • 61:08 - 61:11
    Oh,
  • 61:14 - 61:16
    yes. Thank you. That didn't
  • 61:16 - 61:20
    make sense at all.
  • 61:20 - 61:21
    Thanks. Great.
  • 61:21 - 61:26
    So set gamma to that square root term,
  • 61:26 - 61:30
    and so we know equation one,
  • 61:30 - 61:32
    right, from the previous board
  • 61:32 - 61:35
    holds with probability one minus delta.
  • 61:35 - 61:40
    Right. Equation one was the uniform convergence result, right, that well,
  • 61:40 - 61:47
    IE.
  • 61:50 - 61:52
    This is equation one from the previous board, right, so
  • 61:52 - 61:55
    if we set gamma equal to this, we know that with probability
  • 61:55 - 61:58
    one minus delta this uniform convergence holds,
  • 61:58 - 62:01
    and whenever that holds, that implies you
  • 62:01 - 62:05
    know, I
  • 62:05 - 62:05
    guess
  • 62:05 - 62:08
    if we call this equation star I guess.
  • 62:08 - 62:12
    And whenever uniform convergence holds, we showed again, on the previous boards
  • 62:12 - 62:17
    that this result holds, that the generalization error of H hat is less than or equal
  • 62:17 - 62:21
    to the generalization error of H star plus two times gamma. Okay? And
  • 62:21 - 62:23
    so that proves
  • 62:23 - 62:30
    this theorem.
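In summary, the three-step chain of inequalities used in the proof (each step as argued on the boards above) is:

$$\varepsilon(\hat{h}) \;\overset{(1)}{\le}\; \hat{\varepsilon}(\hat{h}) + \gamma \;\overset{(2)}{\le}\; \hat{\varepsilon}(h^{*}) + \gamma \;\overset{(1)}{\le}\; \varepsilon(h^{*}) + 2\gamma,$$

where steps (1) use uniform convergence with gamma set to the square-root term, and step (2) uses the fact that H hat minimizes training error.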
  • 62:42 - 62:44
    So this result sort of helps us to quantify
  • 62:44 - 62:45
    a little bit
  • 62:45 - 62:48
    that bias variance tradeoff
  • 62:48 - 62:50
    that I talked about
  • 62:50 - 62:54
    at the beginning of actually near the very start of this lecture.
  • 62:54 - 62:58
    And in particular
  • 62:58 - 63:04
    let's say I have some hypothesis class script H, that
  • 63:04 - 63:06
    I'm using, maybe the class of all
  • 63:06 - 63:10
    linear functions, as in linear regression or logistic regression with
  • 63:10 - 63:12
    just the linear features.
  • 63:12 - 63:15
    And let's say I'm considering switching
  • 63:15 - 63:20
    to some new class H prime by having more features. So let's say this is
  • 63:20 - 63:23
    linear
  • 63:23 - 63:27
    and this is quadratic,
  • 63:27 - 63:28
  • 63:28 - 63:32
    so the class of all linear functions is a subset of the class of all quadratic functions,
  • 63:32 - 63:35
    and so H is the subset of H prime. And
  • 63:35 - 63:37
    let's say I'm considering
  • 63:37 - 63:41
    instead of using my linear hypothesis class let's say I'm considering switching
  • 63:41 - 63:45
    to a quadratic hypothesis class, or switching to a larger hypothesis
  • 63:45 - 63:46
    class.
  • 63:46 - 63:47
  • 63:47 - 63:50
    Then what are the tradeoffs involved? Well, I
  • 63:50 - 63:53
    proved this only for finite hypothesis classes, but we'll see that something
  • 63:53 - 63:56
    very similar holds for infinite hypothesis classes too.
  • 63:56 - 63:58
    But the tradeoff is
  • 63:58 - 64:01
    what if I switch from H to H prime, or I switch from linear to quadratic
  • 64:01 - 64:04
    functions. Then
  • 64:04 - 64:08
    epsilon of H star will become better because
  • 64:08 - 64:12
    the best hypothesis in my hypothesis class will become better.
  • 64:12 - 64:15
    The best quadratic function
  • 64:15 - 64:16
  • 64:16 - 64:19
    by best I mean in the sense of generalization error
  • 64:19 - 64:21
    the hypothesis function
  • 64:21 - 64:25
    the quadratic function with the lowest generalization error
  • 64:25 - 64:26
    has to have
  • 64:26 - 64:27
    equal or,
  • 64:27 - 64:28
    more likely, lower generalization error than the best
  • 64:28 - 64:30
    linear function.
  • 64:30 - 64:32
  • 64:32 - 64:38
    So by switching to a more complex hypothesis class you can get this first term to go down.
  • 64:38 - 64:40
    But what I pay for then is that
  • 64:40 - 64:43
    K will increase.
  • 64:43 - 64:46
    By switching to a larger hypothesis class,
  • 64:46 - 64:48
    the first term will go down,
  • 64:48 - 64:52
    but the second term will increase because I now have a larger class of
  • 64:52 - 64:53
    hypotheses
  • 64:53 - 64:55
    and so the second term K
  • 64:55 - 64:57
    will increase.
  • 64:57 - 65:02
    And so this is usually called the bias variance
  • 65:02 - 65:03
    tradeoff. Whereby
  • 65:03 - 65:08
    going to a larger hypothesis class maybe I have the hope of finding a better function,
  • 65:08 - 65:10
    but my risk of sort of
  • 65:10 - 65:14
    not fitting my model so accurately also increases, and
  • 65:14 - 65:19
    that's illustrated by the second term going up when the
  • 65:19 - 65:21
    size of your hypothesis class,
  • 65:21 - 65:28
    when K goes up.
  • 65:28 - 65:32
    And so
  • 65:32 - 65:35
    speaking very loosely,
  • 65:35 - 65:37
    we can think of
  • 65:37 - 65:40
    this first term as corresponding
  • 65:40 - 65:41
    maybe to the bias
  • 65:41 - 65:45
    of the learning algorithm, or the bias of the hypothesis class.
  • 65:45 - 65:50
    And you can again speaking very loosely,
  • 65:50 - 65:53
    think of the second term as corresponding
  • 65:53 - 65:57
    to the variance in your hypothesis, in other words how well you can actually fit a
  • 65:57 - 65:58
    hypothesis,
  • 65:58 - 66:00
    how well you
  • 66:00 - 66:03
    actually fit this hypothesis class to the data.
  • 66:03 - 66:07
    And by switching to a more complex hypothesis class, your variance increases and your bias decreases.
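Very loosely, then, the two terms of the bound above line up with this intuition:

$$\varepsilon(\hat{h}) \;\le\; \underbrace{\min_{h\in\mathcal{H}}\varepsilon(h)}_{\text{``bias''}} \;+\; \underbrace{2\sqrt{\tfrac{1}{2m}\log\tfrac{2k}{\delta}}}_{\text{``variance''}}\,,$$

with the first term improving and the second term growing as the hypothesis class gets larger.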
  • 66:07 - 66:09
  • 66:09 - 66:14
    As a note of warning, it turns out that if you take like a statistics class you've seen
  • 66:14 - 66:15
    definitions of bias and variance,
  • 66:15 - 66:18
    which are often defined in terms of squared error or something.
  • 66:18 - 66:22
    It turns out that for classification problems,
  • 66:22 - 66:25
    there actually is no
  • 66:25 - 66:26
  • 66:26 - 66:27
    universally accepted
  • 66:27 - 66:29
    formal definition of
  • 66:29 - 66:31
    bias and
  • 66:31 - 66:35
    variance for classification problems. For regression problems, there is this square error
  • 66:35 - 66:39
    definition. For classification problems it
  • 66:39 - 66:40
    turns out there've
  • 66:40 - 66:44
    been several competing proposals for definitions of bias and variance. So when I
  • 66:44 - 66:45
    say bias and variance here,
  • 66:45 - 66:47
    think of these as very loose, informal,
  • 66:47 - 66:54
    intuitive definitions, and not formal definitions. Okay.
  • 67:00 - 67:07
    The
  • 67:15 - 67:18
    cartoon associated with the intuition I just
  • 67:18 - 67:19
    said would be
  • 67:19 - 67:22
    as follows: Let's
  • 67:22 - 67:24
    say
  • 67:24 - 67:27
  • 67:27 - 67:30
    and everything about the plot will be for a fixed value of M, for a
  • 67:30 - 67:35
    fixed training set size M. On the vertical axis I'll plot error, and
  • 67:35 - 67:39
    on the horizontal axis I'll plot model
  • 67:39 - 67:44
    complexity.
  • 67:44 - 67:47
    And by model complexity I mean
  • 67:47 - 67:53
    sort of degree of polynomial,
  • 67:53 - 67:58
  • 67:58 - 68:00
    size of your
  • 68:00 - 68:05
    hypothesis class script H etc. It actually turns out,
  • 68:05 - 68:08
    you remember the bandwidth parameter
  • 68:08 - 68:11
    from
  • 68:11 - 68:14
    locally weighted linear regression, that also
  • 68:14 - 68:17
    has a similar effect in controlling how complex your model is.
  • 68:17 - 68:21
    Model complexity [inaudible] polynomial I guess.
  • 68:21 - 68:23
    So the more complex your model,
  • 68:23 - 68:25
    the better your
  • 68:25 - 68:26
    training error,
  • 68:26 - 68:30
    and so
  • 68:30 - 68:33
    your training error will tend to [inaudible] zero as you increase the complexity of your model because the
  • 68:33 - 68:36
    more complex your model the better you can fit your training set.
  • 68:36 - 68:38
    But
  • 68:38 - 68:45
    because of this bias variance tradeoff,
  • 68:45 - 68:49
    you find
  • 68:49 - 68:53
    that
  • 68:53 - 68:57
    generalization error will come down for a while and then it will go back up.
  • 68:57 - 69:04
    And this regime on the left is when you're underfitting the data or when
  • 69:04 - 69:06
    you have high bias.
  • 69:06 - 69:08
    And this regime
  • 69:08 - 69:10
    on the right
  • 69:10 - 69:16
    is when you have high variance or
  • 69:16 - 69:21
    you're overfitting the data. Okay?
  • 69:21 - 69:24
    And this is why a model of sort of intermediate
  • 69:24 - 69:27
    complexity, somewhere here
  • 69:27 - 69:29
    is often preferable
  • 69:29 - 69:30
    if you want to
  • 69:30 - 69:33
    [inaudible] and minimize generalization error.
  • 69:33 - 69:36
    Okay?
  • 69:36 - 69:40
    So that's just a cartoon. In the next lecture we'll actually talk about a number of
  • 69:40 - 69:40
    algorithms
  • 69:40 - 69:44
    for trying to automatically select model complexities, say to get
  • 69:44 - 69:48
    you as close as possible to this minimum to
  • 69:48 - 69:55
    this area of minimized generalization error.
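A minimal numerical sketch of this cartoon, assuming synthetic one-dimensional data and using polynomial degree as the notion of model complexity (the data-generating function, noise level, and sample sizes here are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    # Illustrative target: a smooth function plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, m)
    y = np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=m)
    return x, y

x_train, y_train = make_data(30)      # fixed training set size M
x_test, y_test = make_data(2000)      # large held-out set approximates generalization error

for degree in range(1, 13):           # model complexity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}   train MSE {train_err:.3f}   held-out MSE {test_err:.3f}")
```

Training error keeps falling as the degree increases, while the held-out error typically falls and then climbs back up, which corresponds to the underfitting (high bias) regime on the left of the cartoon and the overfitting (high variance) regime on the right.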
  • 70:11 - 70:15
    The last thing I want to do is actually
  • 70:15 - 70:19
    go back to the theorem I wrote out. I just want to take that theorem, well,
  • 70:19 - 70:20
    so the theorem I wrote out
  • 70:20 - 70:25
    was an error bound theorem. This says for fixed M and delta, with probability
  • 70:25 - 70:29
    one minus delta, I get a bound on
  • 70:29 - 70:31
    gamma, which is what this term is.
  • 70:31 - 70:35
    So the very last thing I wanna do today is just come back to this theorem and write out a
  • 70:35 - 70:36
    corollary
  • 70:36 - 70:40
    where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta
  • 70:40 - 70:41
    and solve for M.
  • 70:41 - 70:46
    And if you do that,
  • 70:46 - 70:47
    you
  • 70:47 - 70:49
    get the following corollary: Let H
  • 70:49 - 70:52
    be
  • 70:52 - 70:55
    fixed with K hypotheses
  • 70:55 - 70:59
    and let
  • 70:59 - 71:01
    any delta
  • 71:01 - 71:04
    and gamma be fixed.
  • 71:09 - 71:11
    Then
  • 71:11 - 71:18
    in order to guarantee that,
  • 71:23 - 71:25
    let's say I want a guarantee
  • 71:25 - 71:27
    that the generalization error
  • 71:27 - 71:31
    of the hypothesis I choose with empirical risk minimization,
  • 71:31 - 71:34
    that this is at most two times gamma worse
  • 71:34 - 71:38
    than the best possible error I could obtain with this hypothesis class.
  • 71:38 - 71:40
    Let's say I want this to
  • 71:40 - 71:45
    hold true with probability at least one minus delta,
  • 71:45 - 71:50
    then it suffices
  • 71:50 - 71:54
    that M
  • 71:54 - 72:01
    is [inaudible] to that. Okay? And this is
  • 72:01 - 72:01
    sort of
  • 72:01 - 72:06
    solving for the error
  • 72:06 - 72:08
    bound for M. One thing to convince yourselves of,
  • 72:08 - 72:10
    the easy part of this, is that
  • 72:10 - 72:15
    if you set that term [inaudible] gamma and solve for M you will get this.
  • 72:15 - 72:19
    One thing I want you to go home and sort of convince yourselves of is that this
  • 72:19 - 72:21
    result really holds true.
  • 72:21 - 72:24
    That this really logically follows from the theorem we've proved.
  • 72:24 - 72:26
    In other words,
  • 72:26 - 72:30
    you can take that formula we wrote and solve for M, and this is the formula you get for M;
  • 72:30 - 72:32
    that's just, that's the easy part. The thing I want is that
  • 72:32 - 72:34
    you go back and convince yourselves that
  • 72:34 - 72:38
    this theorem is a true fact and that it does indeed logically follow from the
  • 72:38 - 72:39
    other one. In
  • 72:39 - 72:40
    particular,
  • 72:40 - 72:44
    make sure that if you solve for that you really get M greater than or equal to this, and why is this M
  • 72:44 - 72:47
    greater than or equal to that and not M less than or equal to. And just to make sure,
  • 72:47 - 72:50
    since I can write this down and it sounds plausible, why don't you just go back and convince yourself this is really true. Okay?
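For reference, the corollary being discussed, written out with the same square-root term as before (the constant inside the logarithm is a plausible reading of the board rather than a quote): to guarantee, with probability at least 1 - delta, that

$$\varepsilon(\hat{h}) \;\le\; \min_{h\in\mathcal{H}}\varepsilon(h) + 2\gamma,$$

it suffices that

$$m \;\ge\; \frac{1}{2\gamma^{2}}\log\frac{2k}{\delta}\,.$$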
  • 72:53 - 72:55
    And
  • 72:55 - 72:58
    it turns out that when we prove these bounds in learning theory,
  • 72:58 - 73:03
    very often the constants are sort of loose. So it turns out that when we prove
  • 73:03 - 73:06
    these bounds,
  • 73:06 - 73:10
    usually we're not very interested in the constants, and so I
  • 73:10 - 73:12
    write this as big O of
  • 73:12 - 73:16
    one over gamma squared, log
  • 73:16 - 73:18
    K over delta,
  • 73:18 - 73:21
    and again, the key
  • 73:21 - 73:24
    step in this is that the dependence of M
  • 73:24 - 73:27
    on the size of the hypothesis class is logarithmic.
  • 73:27 - 73:30
    And this will be very important later when we
  • 73:30 - 73:35
    talk about infinite hypothesis classes. Okay? Any
  • 73:35 - 73:42
    questions about this? No? Okay, cool.
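As a small sketch of what this sample complexity looks like numerically, here is a hypothetical helper that plugs numbers into the bound above (the explicit constant is the one from the corollary as reconstructed, so treat it as illustrative):

```python
import math

def sample_complexity(gamma: float, delta: float, k: int) -> int:
    """Training set size sufficient, per the finite-class ERM bound, for the
    chosen hypothesis to be within 2*gamma of the best in a class of k
    hypotheses with probability at least 1 - delta."""
    return math.ceil((1.0 / (2.0 * gamma ** 2)) * math.log(2.0 * k / delta))

# The dependence on k is logarithmic: even huge hypothesis classes only
# increase the required number of examples by a modest additive amount.
for k in (10, 10_000, 10_000_000):
    print(f"k = {k:>10,d}   m >= {sample_complexity(0.05, 0.05, k)}")
```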
  • 73:50 - 73:51
    So
  • 73:51 - 73:54
    next lecture we'll come back, we'll actually start from this result again.
  • 73:54 - 73:58
    Remember this. I'll write this down as the first thing I do in the next lecture
  • 73:58 - 74:03
    and we'll generalize these to infinite hypothesis classes and then talk about
  • 74:03 - 74:05
    practical algorithms for model selection. So I'll see you guys in a couple days.