
Lecture 9 | Machine Learning (Stanford)

  • 0:12 - 0:15
    This presentation is delivered by the Stanford Center for Professional
  • 0:15 - 0:22
    Development. Okay.
  • 0:29 - 0:32
    So welcome back.
  • 0:32 - 0:36
    What I want to do today is start a new chapter
  • 0:36 - 0:41
    here. In particular, I want to talk about learning theory.
  • 0:41 - 0:45
    So in the previous, I guess eight lectures so far, you've learned about
  • 0:45 - 0:50
    a lot of learning algorithms, and I hope you now understand a little about
  • 0:50 - 0:54
    some of the best and most powerful tools of machine learning in the [inaudible].
  • 0:54 - 0:58
    And all of you are now sort of well qualified to go into industry
  • 0:58 - 1:02
    and apply these powerful learning algorithms, really the most powerful
  • 1:02 - 1:05
    learning algorithms we know to all sorts of problems, and in fact, I hope
  • 1:05 - 1:11
    you start to do that on your projects right away as well.
  • 1:11 - 1:14
    You might remember, I think it was in the very first lecture,
  • 1:14 - 1:19
    that I made an analogy to if you're trying to learn to be a carpenter, so
  • 1:19 - 1:21
    if
  • 1:21 - 1:24
    you imagine you're going to carpentry school to learn to be a carpenter,
  • 1:24 - 1:29
    then only a small part of what you need to do is to acquire a set of tools. If you learn to
  • 1:29 - 1:30
    be a carpenter
  • 1:30 - 1:33
    you don't just walk in and pick up a toolbox and [inaudible]; knowing,
  • 1:33 - 1:36
    when you need to
  • 1:36 - 1:40
    cut a piece of wood, whether to use a rip saw, or a jig saw, or a keyhole saw,
  • 1:40 - 1:43
    whatever; really mastering the tools is also
  • 1:43 - 1:46
    an essential part of becoming a good carpenter.
  • 1:46 - 1:47
    And
  • 1:47 - 1:50
    what I want to do in the next few lectures is
  • 1:50 - 1:53
    actually give you a sense of mastery of the machine learning tools that all of
  • 1:53 - 1:55
    you have. Okay?
  • 1:55 - 1:57
    And so in particular,
  • 1:57 - 2:01
    in the next few lectures what I want to do is talk more deeply about
  • 2:01 - 2:05
    the properties of different machine learning algorithms so that you can get a sense of
  • 2:05 - 2:07
    when it's most appropriate to use
  • 2:07 - 2:08
    each one.
  • 2:08 - 2:12
    And it turns out that one of the most common scenarios in machine learning is
  • 2:13 - 2:15
    someday you'll be doing research or
  • 2:15 - 2:17
    [inaudible] a company.
  • 2:17 - 2:20
    And you'll apply one of the learning algorithms you learned about, you may apply logistic regression,
  • 2:20 - 2:24
    or support vector machines, or Naive Bayes or something,
  • 2:24 - 2:28
    and for whatever bizarre reason, it won't work as well as you were hoping, or it
  • 2:28 - 2:32
    won't quite do what you were hoping it to.
  • 2:33 - 2:35
    To
  • 2:35 - 2:37
    me, what really separates the people
  • 2:37 - 2:41
    that really understand and really get machine
  • 2:41 - 2:42
    learning,
  • 2:42 - 2:46
    from people that maybe read the textbook and work through the math,
  • 2:46 - 2:50
    will be what you do next. It will be your decisions of when
  • 2:50 - 2:54
    you apply a support vector machine and it doesn't quite do what you wanted,
  • 2:54 - 2:57
    do you really understand enough about support vector machines to know what to
  • 2:57 - 2:59
    do next and how to modify the algorithm?
  • 2:59 - 3:02
    And to me that's often what really separates the great people in machine
  • 3:02 - 3:06
    learning from the people that read the textbook, work through the math, and have
  • 3:06 - 3:09
    just understood that. Okay?
  • 3:09 - 3:10
    So
  • 3:10 - 3:14
    what I want to do today, today's lecture will mainly be on learning theory and
  • 3:14 - 3:15
    we'll start to
  • 3:15 - 3:19
    talk about some of the theoretical results of machine learning.
  • 3:19 - 3:22
    The next lecture, later this week, will be on algorithms for
  • 3:22 - 3:26
    sort of [inaudible], or fixing some of the problems
  • 3:26 - 3:30
    that the learning theory will point out to us and help us understand. And then
  • 3:30 - 3:34
    two lectures from now, that lecture will be almost entirely focused on
  • 3:34 - 3:36
    the practical advice for
  • 3:36 - 3:40
    how to apply learning algorithms. Okay?
  • 3:40 - 3:47
    So you have any questions about this before I start? Okay.
  • 3:51 - 3:54
    So the very first thing we're gonna talk about is something that
  • 3:54 - 3:59
    you've probably already seen on the first homework, and something that
  • 3:59 - 4:01
    I alluded to previously,
  • 4:01 - 4:07
    which is the bias variance trade-off. So take
  • 4:07 - 4:11
    ordinary least squares, the first learning algorithm we learned
  • 4:11 - 4:16
    about, if you fit a straight line through this data, this is not a very good model.
  • 4:16 - 4:20
    Right.
  • 4:20 - 4:21
    And if
  • 4:21 - 4:24
    this happens, we say it has
  • 4:24 - 4:24
    underfit
  • 4:24 - 4:27
    the data, or we say that this is a learning algorithm
  • 4:27 - 4:31
    with a very high bias, because it is
  • 4:31 - 4:36
    failing to fit the evident quadratic structure in the data.
  • 4:36 - 4:36
    And
  • 4:36 - 4:38
    for the purposes
  • 4:38 - 4:41
    of [inaudible] you can formally think of the
  • 4:41 - 4:45
    bias of the learning algorithm as representing the fact that even if you
  • 4:45 - 4:48
    had an infinite amount of training data, even if
  • 4:48 - 4:51
    you had tons of training data,
  • 4:51 - 4:53
    this algorithm would still fail
  • 4:53 - 4:54
    to fit the quadratic
  • 4:54 - 4:58
    function, the quadratic structure in
  • 4:58 - 5:02
    the data. And so we think of this as a learning algorithm with high bias.
  • 5:02 - 5:04
    Then there's the opposite problem, so that's the
  • 5:04 - 5:06
    same dataset.
  • 5:06 - 5:13
    If
  • 5:13 - 5:20
    you fit a fourth-order polynomial to this dataset,
  • 5:20 - 5:20
    then you have
  • 5:20 - 5:26
    you'll be able to interpolate the five data points exactly, but clearly, this is also
  • 5:26 - 5:27
    not a great model
  • 5:27 - 5:31
    of the structure that you and I probably see in the data.
  • 5:31 - 5:34
    And we say that this
  • 5:34 - 5:37
    algorithm has a problem, excuse me,
  • 5:37 - 5:43
    is overfitting the data,
  • 5:43 - 5:49
    or alternatively that this algorithm has high variance. Okay? And the intuition behind
  • 5:49 - 5:50
    overfitting and high variance is that
  • 5:50 - 5:54
    the algorithm is fitting spurious patterns in the data, or is fitting
  • 5:54 - 5:58
    idiosyncratic properties of this specific dataset, be it the dataset of
  • 5:58 - 6:01
    housing prices or whatever.
  • 6:01 - 6:08
    And quite often, there'll be some happy medium
  • 6:08 - 6:10
    of fitting a quadratic function
  • 6:10 - 6:15
    that maybe won't interpolate your data points perfectly, but captures more of the structure
  • 6:15 - 6:17
    in your data
  • 6:17 - 6:21
    than a simple model which underfits.
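As a rough illustration (not from the lecture, with made-up data), here is a minimal NumPy sketch of this picture: a degree-one polynomial underfits five roughly quadratic points, a degree-four polynomial interpolates them exactly, and a quadratic is the happy medium.

```python
# Hypothetical illustration of underfitting vs. overfitting with polynomial fits.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # five hypothetical inputs (e.g., living area)
y = np.array([1.2, 3.8, 9.1, 15.9, 25.2])    # roughly quadratic outputs (e.g., price)

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)         # ordinary least-squares polynomial fit
    mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
    # degree 1 underfits (high bias), degree 4 interpolates all five points
    # exactly (high variance), degree 2 captures the quadratic structure.
    print(f"degree {degree}: training MSE = {mse:.4f}")
```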
  • 6:21 - 6:24
    And you can sort of have exactly the same picture
  • 6:24 - 6:26
    for classification problems as well,
  • 6:26 - 6:28
    so lets say
  • 6:28 - 6:35
    this is my training set, right,
  • 6:36 - 6:37
    of
  • 6:37 - 6:39
    positive and negative examples,
  • 6:44 - 6:47
    and so you can fit
  • 6:47 - 6:52
    logistic regression with very high order polynomial features, so H of X
  • 6:52 - 6:57
    equals the sigmoid function of
  • 7:01 - 7:04
    whatever. The sigmoid function applied to a tenth-order polynomial.
  • 7:04 - 7:08
    And you do that, maybe you get a decision boundary
  • 7:08 - 7:13
    like this. Right.
  • 7:13 - 7:17
    That does indeed perfectly separate the positive and negative classes, this is
  • 7:17 - 7:19
    another example of
  • 7:19 - 7:21
    overfitting, and
  • 7:21 - 7:24
    in contrast, if you fit logistic regression to this data with just the linear features,
  • 7:24 - 7:27
    with none
  • 7:27 - 7:30
    of the quadratic features, then maybe you get a decision boundary like that, which
  • 7:30 - 7:33
    can also underfit.
  • 7:33 - 7:35
    Okay.
  • 7:35 - 7:38
    So what I want to do now is
  • 7:38 - 7:42
    understand this problem of overfitting versus underfitting, of high bias versus high
  • 7:42 - 7:43
    variance, more explicitly,
  • 7:45 - 7:49
    and I will do that by posing a more formal model of machine learning and
  • 7:49 - 7:51
    trying to prove when
  • 7:51 - 7:57
    each of these two twin problems comes up.
  • 7:57 - 7:59
    And as the motivating
  • 7:59 - 8:01
  • 8:01 - 8:04
    example for our initial foray into learning theory,
  • 8:04 - 8:10
    I want to talk about learning classification,
  • 8:10 - 8:13
    in which
  • 8:14 - 8:16
    H of X is equal
  • 8:16 - 8:19
    to G of theta transpose X. Okay?
  • 8:19 - 8:21
    So, a linear classifier.
  • 8:21 - 8:27
    And for this class I'm going to use, Z
  • 8:27 - 8:33
    excuse me. I'm gonna use G of Z equal to the indicator of Z greater than or equal to
  • 8:33 - 8:40
    zero.
  • 8:40 - 8:44
    With apologies in advance for changing the notation yet again,
  • 8:44 - 8:45
    for the support vector machine
  • 8:45 - 8:49
    lectures we use Y equals minus one or plus one. For
  • 8:49 - 8:53
    learning theory lectures, turns out it'll be a bit cleaner if I switch back to
  • 8:53 - 8:58
    Y equals zero-one again, so I'm gonna switch back to my original notation.
  • 8:58 - 9:02
    And so you can think of this model as
  • 9:02 - 9:05
    logistic regression, say, and think of this as being
  • 9:05 - 9:07
    similar to logistic regression,
  • 9:07 - 9:08
    except that now we're going to force
  • 9:08 - 9:12
    the logistic regression algorithm to output labels that are
  • 9:12 - 9:14
    either zero or one. Okay?
  • 9:14 - 9:16
    So you can think of this as a
  • 9:16 - 9:21
    classifier that outputs labels zero or one instead of probabilities.
  • 9:21 - 9:25
    And so as
  • 9:25 - 9:32
    usual, let's say we're given a training set of M examples.
  • 9:34 - 9:38
    That's just my notation for writing a set of M examples ranging from I equals
  • 9:38 - 9:40
    one
  • 9:40 - 9:45
    through M. And I'm going to assume that the training examples XI, YI
  • 9:45 - 9:48
    are drawn IID
  • 9:48 - 9:50
    from some distribution,
  • 9:50 - 9:51
    script
  • 9:51 - 9:54
    D. Okay? Independently and identically distributed.
  • 9:54 - 10:00
    And if you are running a classification problem on houses, like
  • 10:00 - 10:03
    features of the house comma, whether the house will be sold in the next six months, then this
  • 10:03 - 10:05
    is just
  • 10:05 - 10:08
    the probability distribution over
  • 10:08 - 10:12
    features of houses and whether or not they'll be sold. Okay? So I'm gonna assume that
  • 10:12 - 10:17
    training examples are drawn IID from some probability distribution,
  • 10:18 - 10:20
    script D. Well, same thing for spam, if you're
  • 10:20 - 10:23
    trying to build a spam classifier then this would be the distribution of
  • 10:23 - 10:30
    what emails look like comma, whether they are spam or not.
  • 10:30 - 10:32
    And in particular, to understand
  • 10:32 - 10:38
    or simplify, to understand the phenomena of bias and variance, I'm actually going to use a
  • 10:38 - 10:42
    simplified model of machine learning.
  • 10:42 - 10:45
    And in particular,
  • 10:45 - 10:46
    logistic regression fits
  • 10:46 - 10:50
    the parameters of a model like this by maximizing the log likelihood.
  • 10:50 - 10:55
    But in order to understand learning algorithms more deeply, I'm just going to assume a simplified
  • 10:55 - 10:59
    model of machine learning, let me just write that down.
  • 11:02 - 11:05
    So I'm going to define training error
  • 11:05 - 11:09
    as
  • 11:09 - 11:14
    so this is the training error of a hypothesis H subscript theta.
  • 11:14 - 11:18
    I'll write this as epsilon hat of H subscript theta.
  • 11:18 - 11:20
    If I want to make the
  • 11:20 - 11:23
    dependence on a training set explicit, I'll write this with
  • 11:23 - 11:27
    a subscript S there where S is a training set.
  • 11:27 - 11:34
    And I'll define this as,
  • 11:42 - 11:44
    let's see. Okay. I
  • 11:44 - 11:48
    hope the notation is clear. This is a sum of indicator functions for whether your hypothesis
  • 11:48 - 11:52
    misclassifies the i-th example.
  • 11:52 - 11:55
    And so when you divide by M, this is just
  • 11:55 - 11:58
    in your training set, the fraction
  • 11:58 - 12:00
    of training examples your
  • 12:00 - 12:01
    hypothesis misclassifies,
  • 12:01 - 12:05
    so that's defined as the training error. And
  • 12:05 - 12:08
    training error is also called the empirical risk.
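Written out, the definition of training error (empirical risk) being described here is the following; the subscript S marks the dependence on the training set when needed.

```latex
\hat{\varepsilon}_S(h_\theta) \;=\; \frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\{\, h_\theta(x^{(i)}) \neq y^{(i)} \,\}
```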
  • 12:10 - 12:14
    The simplified model of machine learning I'm gonna talk about is
  • 12:14 - 12:17
    called empirical risk minimization.
  • 12:17 - 12:21
    And in particular, I'm going to assume that the way my learning algorithm works
  • 12:21 - 12:26
    is it will choose parameters
  • 12:26 - 12:33
    theta, that
  • 12:34 - 12:39
    minimize my training error. Okay?
  • 12:39 - 12:43
    And it will be this learning algorithm that we'll prove properties about.
  • 12:43 - 12:45
    And it turns out that
  • 12:45 - 12:46
    you
  • 12:46 - 12:50
    can think of this as the most basic learning algorithm, the algorithm that minimizes
  • 12:50 - 12:52
    your training error. It
  • 12:52 - 12:55
    turns out that logistic regression and support vector machines can be
  • 12:55 - 12:59
    formally viewed as approximations to this, so it
  • 12:59 - 13:03
    turns out that if you actually want to do this, this is a nonconvex optimization
  • 13:03 - 13:03
    problem.
  • 13:03 - 13:09
    It's actually NP-hard to solve this optimization problem.
  • 13:09 - 13:14
    And logistic regression and support vector machines can both be viewed as
  • 13:14 - 13:14
    approximations
  • 13:14 - 13:17
    to this nonconvex optimization problem
  • 13:17 - 13:20
    by finding a convex approximation to it.
  • 13:20 - 13:23
    Think of this as
  • 13:23 - 13:24
    similar to what
  • 13:24 - 13:28
    algorithms like logistic regression
  • 13:28 - 13:30
    are doing. So
  • 13:30 - 13:33
    let me take that
  • 13:33 - 13:34
    definition of empirical risk
  • 13:34 - 13:36
    minimization
  • 13:36 - 13:43
    and actually just rewrite it in a different equivalent way.
  • 13:43 - 13:45
    For the results I want to prove today, it turns out
  • 13:45 - 13:48
    that it will be useful to think of
  • 13:48 - 13:49
    our learning algorithm
  • 13:49 - 13:52
    as not choosing a set of parameters,
  • 13:52 - 13:55
    but as choosing a function.
  • 13:55 - 13:58
    So let me say what I mean by that. Let me define
  • 13:58 - 14:00
    the hypothesis
  • 14:00 - 14:03
    class,
  • 14:03 - 14:05
    script h,
  • 14:05 - 14:09
    as the class of all hypotheses, in other words, as the class of all linear
  • 14:09 - 14:11
    classifiers, that your
  • 14:11 - 14:14
    learning algorithm is choosing from.
  • 14:14 - 14:17
    Okay? So
  • 14:17 - 14:19
    H subscript theta
  • 14:19 - 14:23
    is a specific linear classifier, so H subscript theta,
  • 14:23 - 14:30
    in each of these functions each
  • 14:30 - 14:34
    of these is a function mapping from the input domain X to the labels zero and one. Each
  • 14:34 - 14:35
    of
  • 14:35 - 14:38
    these is a function, and as you vary the parameters theta,
  • 14:38 - 14:41
    you get different functions.
  • 14:41 - 14:44
    And so let me define the hypothesis class
  • 14:44 - 14:48
    script H to be the class of all functions that say logistic regression can choose
  • 14:48 - 14:49
    from. Okay. So
  • 14:49 - 14:53
    this is the class of all linear classifiers
  • 14:53 - 14:54
    and so
  • 14:57 - 15:01
    I'm going to define,
  • 15:01 - 15:05
    or maybe redefine empirical risk minimization as
  • 15:05 - 15:09
    instead of writing this as choosing a set of parameters, I want to think of it as
  • 15:09 - 15:10
    choosing a
  • 15:10 - 15:12
    function
  • 15:12 - 15:15
    in the hypothesis class script H
  • 15:15 - 15:22
    that minimizes
  • 15:22 - 15:27
    that minimizes my training error. Okay?
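In symbols, the two equivalent views of empirical risk minimization being described here are roughly:

```latex
\hat{\theta} \;=\; \arg\min_{\theta}\ \hat{\varepsilon}(h_\theta)
\qquad\Longleftrightarrow\qquad
\hat{h} \;=\; \arg\min_{h \in \mathcal{H}}\ \hat{\varepsilon}(h),
\quad \text{where } \mathcal{H} = \{\, h_\theta : \theta \,\}.
```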
  • 15:27 - 15:31
    So actually can you raise your
  • 15:31 - 15:35
    hand if it makes sense to you why this is equivalent to the previous
  • 15:35 - 15:41
    formulation? Okay, cool.
  • 15:41 - 15:44
    Thanks. So for this development, it's useful to think of the algorithm as choosing a
  • 15:44 - 15:47
    function from the class instead,
  • 15:47 - 15:48
    because
  • 15:48 - 15:51
    in a more general case this set, script H,
  • 15:51 - 15:54
    can be some other class of functions. Maybe
  • 15:54 - 15:58
    it's the class of all functions represented by a neural network, or the class of all
  • 15:58 - 16:04
    some other class of functions the learning algorithm wants to choose from.
  • 16:04 - 16:05
    And
  • 16:05 - 16:11
    this definition for empirical risk minimization will still apply. Okay?
  • 16:11 - 16:13
    So
  • 16:13 - 16:16
    what we'd like to do is understand
  • 16:16 - 16:19
    whether empirical risk minimization
  • 16:19 - 16:24
    is a reasonable algorithm. Alex? Student:[Inaudible] a function that's
  • 16:24 - 16:26
    defined
  • 16:26 - 16:30
    by G
  • 16:30 - 16:35
    of theta transpose X, or is it now more general? Instructor (Andrew Ng): I see, right, so let's see,
  • 16:35 - 16:36
    I guess this
  • 16:36 - 16:41
    the question is, is H theta still defined by G of theta transpose
  • 16:41 - 16:47
    X, or is it more general? So Student:[Inaudible] Instructor (Andrew Ng): Oh,
  • 16:47 - 16:48
    yeah, so there are two answers
  • 16:48 - 16:51
    to that. One is, this framework is general, so
  • 16:51 - 16:55
    for the purpose of this lecture it may be useful to you to keep in mind a model of the
  • 16:55 - 16:57
    example of
  • 16:57 - 17:01
    when the hypothesis class is the class of all linear classifiers, such as those used by,
  • 17:01 - 17:04
    say, the perceptron algorithm or logistic regression.
  • 17:04 - 17:06
    This
  • 17:06 - 17:07
    everything on this board,
  • 17:07 - 17:11
    however, is actually more general. H can be any set of functions, mapping
  • 17:11 - 17:15
    from the input domain to the set of class labels zero and one,
  • 17:15 - 17:18
    and then you can perform empirical risk minimization
  • 17:18 - 17:21
    over any hypothesis class. For the purpose
  • 17:21 - 17:23
    of today's lecture,
  • 17:23 - 17:27
    I am going to restrict myself to talking about binary classification, but it turns
  • 17:27 - 17:31
    out everything I say generalizes to regression
  • 17:31 - 17:36
    and other problems as
  • 17:36 - 17:40
    well. Does that answer your question? Yes. Cool. All right. So I wanna understand if empirical risk minimization is a reasonable algorithm.
  • 17:40 - 17:45
    In particular, what are the things we can prove about it? So
  • 17:45 - 17:49
    clearly we don't actually care about training error, we don't really care about
  • 17:49 - 17:53
    making accurate predictions on the training set, or at least that's not the ultimate goal. The
  • 17:53 - 17:56
    ultimate goal is
  • 17:56 - 17:58
    how well it makes
  • 17:58 - 17:59
    generalization
  • 17:59 - 18:04
    how well it makes predictions on examples that we haven't seen before. How
  • 18:04 - 18:06
    well it predicts prices or
  • 18:06 - 18:07
    sale or no sale
  • 18:07 - 18:10
    outcomes of houses you haven't seen before.
  • 18:10 - 18:13
    So what we really care about
  • 18:13 - 18:16
    is generalization error, which I write
  • 18:16 - 18:18
    as epsilon of H.
  • 18:18 - 18:22
    And this is defined as the probability
  • 18:22 - 18:24
    that if I sample
  • 18:24 - 18:26
    a new example, X
  • 18:26 - 18:28
    comma Y,
  • 18:28 - 18:35
    from that distribution script D,
  • 18:40 - 18:43
    my hypothesis mislabels
  • 18:43 - 18:47
    that example.
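In symbols, the generalization error being defined here is:

```latex
\varepsilon(h) \;=\; P_{(x,y)\sim\mathcal{D}}\big[\, h(x) \neq y \,\big]
```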
  • 18:47 - 18:50
    And in terms of notational convention, usually,
  • 18:50 - 18:53
    if I use if I place a hat on top of something, it
  • 18:53 - 18:57
    usually means, not always, but it usually means that it is an attempt to
  • 18:57 - 18:59
    estimate the quantity without the hat. So
  • 18:59 - 19:01
    for example,
  • 19:01 - 19:03
    epsilon hat here
  • 19:03 - 19:06
    this is something where we're trying to think of epsilon hat, the training error,
  • 19:06 - 19:09
    as an attempt to approximate generalization error.
  • 19:09 - 19:11
    Okay, so the notation convention is
  • 19:11 - 19:15
    usually the things with the hats on top are things we're using to estimate other
  • 19:15 - 19:16
    quantities.
  • 19:16 - 19:20
    And H hat is the hypothesis output by the learning algorithm to try to
  • 19:20 - 19:25
    estimate a good function from X to Y. So
  • 19:25 - 19:29
    let's actually prove some things about when empirical risk minimization
  • 19:29 - 19:30
    will do well
  • 19:30 - 19:33
    in a sense of giving us low generalization error, which is what
  • 19:33 - 19:40
    we really care about.
  • 19:47 - 19:50
    In order to prove our first learning theory result, I'm going to have to state
  • 19:50 - 19:53
    two lemmas, the first is
  • 19:53 - 20:00
    the union
  • 20:00 - 20:07
    bound, which is the following,
  • 20:08 - 20:11
    let A1 through AK be
  • 20:11 - 20:12
    K events. And when
  • 20:12 - 20:16
    I say events, I mean events in a sense of a probabilistic event
  • 20:16 - 20:18
    that either happens or not.
  • 20:18 - 20:25
    And these are not necessarily independent.
  • 20:28 - 20:32
    So there's some joint distribution over the events A one through AK,
  • 20:32 - 20:35
    and maybe they're independent maybe not,
  • 20:35 - 20:41
    no assumption on that. Then
  • 20:41 - 20:45
    the probability of A one or
  • 20:45 - 20:46
    A two
  • 20:46 - 20:48
    or dot, dot,
  • 20:48 - 20:49
    dot, up to
  • 20:49 - 20:52
    AK, this union symbol, this
  • 20:52 - 20:54
    cup, this just means
  • 20:54 - 20:58
    this is just set notation; for probabilities it just means or. So the probability
  • 20:58 - 20:59
    of
  • 20:59 - 21:02
    at least one of these events occurring, of A one or A two, or up
  • 21:02 - 21:03
    to AK,
  • 21:03 - 21:08
    this is less than or equal to the probability of A one plus probability of A two plus dot,
  • 21:08 - 21:11
    dot, dot, plus probability of AK.
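Written out, the union bound stated here is:

```latex
P(A_1 \cup A_2 \cup \cdots \cup A_k) \;\le\; P(A_1) + P(A_2) + \cdots + P(A_k)
```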
  • 21:11 - 21:18
    Okay? So
  • 21:18 - 21:21
    the intuition behind this is just that
  • 21:21 - 21:23
    I'm not sure if you've seen Venn diagrams
  • 21:23 - 21:26
    depictions of probability before, if you haven't,
  • 21:26 - 21:29
    what I'm about to do may be a little cryptic, so just ignore that. Just
  • 21:29 - 21:32
    ignore what I'm about to do if you haven't seen it before.
  • 21:32 - 21:37
    But if you have seen it before then this is really
  • 21:37 - 21:41
    this is really just saying the probability of A one,
  • 21:41 - 21:44
    union A two, union A three, is less
  • 21:44 - 21:47
    than the P of A one, plus
  • 21:47 - 21:51
    P of A two, plus P of A
  • 21:51 - 21:52
    three.
  • 21:52 - 21:54
    Right. So that the total mass in
  • 21:54 - 21:58
    the union of these three things is less than or equal to the sum of the masses
  • 21:58 - 22:01
    in the three individual sets, it's not very surprising.
  • 22:01 - 22:05
    It turns out that depending on how you define your axioms of probability,
  • 22:05 - 22:08
    this is actually one of the axioms of probability theory, so
  • 22:08 - 22:12
    I won't actually try to prove this. This is usually
  • 22:12 - 22:17
    written as an axiom. Sigma subadditivity of probability
  • 22:17 - 22:17
    measures is what
  • 22:17 - 22:24
    this is sometimes called, as well.
  • 22:26 - 22:31
    But in learning theory it's commonly called the union bound, so I'll just call it that. The
  • 22:31 - 22:32
    other
  • 22:32 - 22:39
    lemma I need is called the Hoeffding inequality. And
  • 22:40 - 22:43
    again, I won't actually prove this, I'll just state it,
  • 22:43 - 22:44
    which is
  • 22:44 - 22:46
    let's let Z1 up to ZM,
  • 22:46 - 22:49
    be M IID
  • 22:49 - 22:52
    Bernoulli
  • 22:52 - 22:55
  • 22:55 - 23:02
    random variables with mean Phi.
  • 23:07 - 23:09
    So the probability of ZI equals 1 is equal to
  • 23:09 - 23:16
    Phi.
  • 23:16 - 23:18
    So
  • 23:18 - 23:23
    let's say you observe M IID Bernoulli random variables and you want to estimate their
  • 23:23 - 23:24
    mean.
  • 23:24 - 23:29
    So let me define Phi hat, and this is again that notational convention, Phi hat means
  • 23:30 - 23:30
    it's an attempt
  • 23:30 - 23:34
    to estimate something. So we define Phi
  • 23:34 - 23:39
    hat to be one over M, sum from i equals one through M of ZI. Okay?
  • 23:39 - 23:41
    So this is
  • 23:41 - 23:44
    our attempt to estimate the mean of these Bernoulli random variables by sort of taking
  • 23:44 - 23:47
    its average.
  • 23:49 - 23:53
    And let any gamma
  • 23:53 - 23:55
    be fixed.
  • 23:59 - 24:01
    Then,
  • 24:01 - 24:04
    the Hoeffding inequality
  • 24:04 - 24:11
    is that
  • 24:14 - 24:17
    the probability your estimate of Phi
  • 24:17 - 24:21
    is more than gamma away from the true value of Phi,
  • 24:21 - 24:26
    that this is bounded by two E to the negative two gamma squared M. Okay?
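Written out, the Hoeffding inequality stated here is:

```latex
\hat{\phi} = \frac{1}{m}\sum_{i=1}^{m} Z_i,\quad Z_i \sim \mathrm{Bernoulli}(\phi)\ \text{i.i.d.}
\qquad\Longrightarrow\qquad
P\big(\,|\phi - \hat{\phi}| > \gamma\,\big) \;\le\; 2\exp(-2\gamma^2 m)
```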
  • 24:26 - 24:31
    So just in pictures,
  • 24:31 - 24:35
    so this theorem holds, this lemma, the Hoeffding inequality,
  • 24:35 - 24:38
    this is just a statement of fact, this just holds true.
  • 24:38 - 24:42
    But let me now draw a cartoon to describe some of the intuition behind this, I
  • 24:42 - 24:44
    guess.
  • 24:44 - 24:45
    So
  • 24:45 - 24:49
    lets say [inaudible] this is a real number line from zero to one.
  • 24:49 - 24:53
    And so Phi is the mean of your Bernoulli random variables.
  • 24:54 - 24:56
    You will remember from you know,
  • 24:56 - 24:59
    whatever some undergraduate probability or statistics class,
  • 24:59 - 25:03
    the central limit theorem that says that when you average all the things together,
  • 25:03 - 25:05
    you tend to get a Gaussian distribution.
  • 25:05 - 25:10
    And so when you toss M coins with bias Phi, we observe these M
  • 25:10 - 25:12
    Bernoulli random variables,
  • 25:12 - 25:18
    and we average them, then the probability distribution of
  • 25:18 - 25:25
    Phi hat
  • 25:26 - 25:29
    will roughly be a Gaussian lets say. Okay? It
  • 25:29 - 25:30
    turns out if
  • 25:30 - 25:33
    you haven't seen this before, this is actually saying that the
  • 25:33 - 25:36
    cumulative distribution function of Phi hat will converge to that of the Gaussian.
  • 25:36 - 25:40
    Technically Phi hat can only take on a discrete set of values
  • 25:40 - 25:43
    because these are multiples of one over M. It doesn't really have a
  • 25:43 - 25:45
    density, but just as a cartoon
  • 25:45 - 25:49
    think of it as converging roughly to a Gaussian.
  • 25:49 - 25:56
    So what the Hoeffding inequality says is that if you pick a value of gamma, let me put
  • 25:56 - 26:00
    one interval of gamma there, and another interval of gamma there.
  • 26:00 - 26:03
    Then it's saying that the probability mass of the tails,
  • 26:03 - 26:06
    in other words the probability that
  • 26:06 - 26:08
    my value of Phi hat is more than
  • 26:08 - 26:11
    a gamma away from the true value,
  • 26:11 - 26:16
    that the total mass
  • 26:16 - 26:18
    that the
  • 26:18 - 26:22
    total probability mass in these tails is at most two
  • 26:22 - 26:26
    E to the negative two gamma squared M. Okay?
  • 26:26 - 26:28
    That's what the Hoeffding inequality says. So if you
  • 26:28 - 26:31
    can't read that, this just says this is just the right hand side of the bound, two E to the
  • 26:31 - 26:33
    negative two gamma squared M.
  • 26:33 - 26:36
    So this bounds the probability that you make a mistake in estimating the
  • 26:36 - 26:42
    mean of a Bernoulli random variable.
  • 26:42 - 26:43
    And the
  • 26:43 - 26:47
    cool thing about this bound the interesting thing behind this bound is that
  • 26:48 - 26:51
    it decreases exponentially in M,
  • 26:51 - 26:52
    so it says
  • 26:52 - 26:54
    that for a fixed value of gamma,
  • 26:54 - 26:56
    as you increase the size
  • 26:56 - 26:58
    of your training set, as you toss a coin more and more,
  • 26:58 - 27:02
    then the width of this Gaussian will shrink. The width of this
  • 27:02 - 27:06
    Gaussian will actually shrink like one over root M.
  • 27:06 - 27:12
    And that will cause the probability mass left in the tails to decrease exponentially,
  • 27:12 - 27:19
    quickly, as a function of M. And this will be important later. Yeah? Student: Does this come from the
  • 27:21 - 27:23
    central limit theorem [inaudible]. Instructor (Andrew Ng): No it doesn't. So this is proved by a different
  • 27:23 - 27:27
    argument. No, so for the central limit theorem, there may be a
  • 27:27 - 27:29
    version of the central limit theorem, but the
  • 27:29 - 27:33
    versions I'm familiar with tend to be sort of asymptotic, but this works
  • 27:33 - 27:36
    for any finite value of M. Oh, and
  • 27:36 - 27:40
    this bound holds even if M is equal to two, or M is [inaudible], if M is
  • 27:40 - 27:41
    very small,
  • 27:41 - 27:45
    the central limit theorem approximation is not gonna hold, but this theorem holds
  • 27:45 - 27:46
    regardless.
  • 27:46 - 27:48
    Okay? I'm
  • 27:48 - 27:52
    drawing this just as a cartoon to help explain the intuition, but this
  • 27:52 - 27:59
    theorem just holds true, without reference to the central limit theorem.
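As a quick sanity check (not from the lecture, with hypothetical parameters), here is a small simulation comparing the observed tail probability of the coin-flip average against the Hoeffding bound.

```python
# Hypothetical Monte Carlo check of the Hoeffding bound for Bernoulli averages.
import numpy as np

rng = np.random.default_rng(0)
phi, m, gamma = 0.3, 50, 0.1                      # assumed bias, sample size, tolerance
trials = 100_000

phi_hat = rng.binomial(m, phi, size=trials) / m            # average of m coin flips, per trial
observed_tail = np.mean(np.abs(phi_hat - phi) > gamma)     # estimated P(|phi_hat - phi| > gamma)
hoeffding_bound = 2 * np.exp(-2 * gamma**2 * m)

# The observed tail mass should come out below the bound.
print(f"observed tail = {observed_tail:.4f}, Hoeffding bound = {hoeffding_bound:.4f}")
```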
  • 28:09 - 28:13
    All right. So let's start to understand empirical risk minimization,
  • 28:13 - 28:20
    and what I want to do is
  • 28:23 - 28:25
    begin with
  • 28:25 - 28:28
    studying empirical risk minimization
  • 28:28 - 28:31
    for a
  • 28:31 - 28:33
    [inaudible] case
  • 28:33 - 28:37
    like logistic regression, and in particular I want to start with studying
  • 28:37 - 28:41
    the case of finite hypothesis classes.
  • 28:41 - 28:48
    So let's say script H is a class of
  • 28:49 - 28:56
    K hypotheses.
  • 28:56 - 29:00
    Right. So these are K functions; each of these is just a function mapping
  • 29:00 - 29:05
    from inputs to outputs, and there are no parameters in this.
  • 29:05 - 29:06
    And so
  • 29:06 - 29:13
    what the empirical risk minimization would do is it would take the training set
  • 29:13 - 29:14
    and it'll then
  • 29:14 - 29:17
    look at each of these K functions,
  • 29:17 - 29:22
    and it'll pick whichever of these functions has the lowest training error. Okay?
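As a minimal sketch (not from the lecture), empirical risk minimization over a finite class might look like the following in Python; the three threshold classifiers and the tiny dataset here are hypothetical.

```python
# Hypothetical sketch of empirical risk minimization over a finite hypothesis class.
import numpy as np

def training_error(h, X, y):
    """Fraction of training examples that hypothesis h misclassifies."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """Return the hypothesis in the finite class with the lowest training error."""
    return min(hypotheses, key=lambda h: training_error(h, X, y))

# A finite class of K = 3 fixed threshold classifiers on a single feature.
hypotheses = [lambda X, t=t: (X[:, 0] > t).astype(int) for t in (0.0, 0.5, 1.0)]

X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
h_hat = erm(hypotheses, X, y)
print(training_error(h_hat, X, y))    # 0.0: the threshold at 0.5 fits this toy set exactly
```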
  • 29:22 - 29:24
    So now, logistic regression uses an infinitely
  • 29:24 - 29:29
    large, a continuous, infinitely large class of hypotheses, script H,
  • 29:29 - 29:32
    but to prove the first result I actually want to just
  • 29:32 - 29:33
    describe
  • 29:33 - 29:38
    our first learning theorem for the case of when you have a finite hypothesis class, and then
  • 29:38 - 29:42
    we'll later generalize that to infinite hypothesis classes.
  • 29:43 - 29:46
    So
  • 29:53 - 29:58
    empirical risk minimization takes the hypothesis of the lowest training error,
  • 29:58 - 30:04
    and what I'd like to do is prove a bound on the generalization error
  • 30:04 - 30:06
    of H hat. All right. So in other words I'm
  • 30:06 - 30:07
    gonna prove that
  • 30:07 - 30:14
    somehow minimizing training error allows me to do well on generalization error.
  • 30:14 - 30:16
    And here's the strategy,
  • 30:16 - 30:20
    I'm going to
  • 30:20 - 30:23
    the first step in this proof is I'm going to show that
  • 30:23 - 30:25
    training error
  • 30:25 - 30:30
    is a good approximation to generalization error,
  • 30:32 - 30:35
    and then I'm going to show
  • 30:35 - 30:38
    that this implies a bound
  • 30:38 - 30:40
    on
  • 30:40 - 30:43
    the generalization error
  • 30:43 - 30:48
    of the hypothesis output by empirical risk
  • 30:48 - 30:53
    minimization. And I just realized, this class I guess is also maybe slightly notation
  • 30:53 - 30:56
    heavy class
  • 30:56 - 30:59
    in that I'm introducing a reasonably large set of new symbols, so if,
  • 30:59 - 31:03
    again, in the course of today's lecture, you're looking at some symbol and you don't quite
  • 31:03 - 31:07
    remember what it is, please raise your hand and ask. [Inaudible] what's that, what was that, was that a
  • 31:07 - 31:12
    generalization error or was it something else? So raise your hand and
  • 31:12 - 31:17
    ask if you don't understand the notation I was defining.
  • 31:17 - 31:20
    Okay. So let me introduce this in two steps. The strategy is I'm gonna show that training error
  • 31:20 - 31:23
    gives a good approximation to generalization error,
  • 31:23 - 31:26
    and this will imply that minimizing training error
  • 31:26 - 31:30
    will also do pretty well in terms of minimizing generalization error.
  • 31:30 - 31:33
    And this will give us a bound on the generalization error
  • 31:33 - 31:40
    of the hypothesis output by empirical risk minimization. Okay?
  • 31:40 - 31:47
    So here's the idea.
  • 31:48 - 31:52
    So
  • 31:52 - 31:56
    lets even not consider all the hypotheses at once, lets pick any
  • 31:56 - 32:00
    hypothesis, HJ in the class script H, and
  • 32:00 - 32:05
    so until further notice lets just consider there one fixed hypothesis. So pick any
  • 32:05 - 32:10
    one hypothesis and let's talk about that
  • 32:10 - 32:12
    one. Let
  • 32:12 - 32:15
    me define ZI
  • 32:15 - 32:22
    to be indicator function
  • 32:24 - 32:25
    for
  • 32:25 - 32:30
    whether this hypothesis misclassifies the i-th example, excuse me,
  • 32:30 - 32:34
    or Z subscript I. Okay?
  • 32:34 - 32:40
    So
  • 32:40 - 32:43
    ZI would be zero or one
  • 32:43 - 32:47
    depending on whether this one hypothesis which is the only one
  • 32:47 - 32:49
    I'm gonna even consider now,
  • 32:49 - 32:52
    whether this hypothesis misclassifies that example.
  • 32:52 - 32:57
    And so
  • 32:57 - 33:01
    my training set is drawn randomly from some distribution script
  • 33:01 - 33:02
    d,
  • 33:02 - 33:05
    and
  • 33:05 - 33:10
    depending on what training examples I've got, these ZIs would be either zero or one.
  • 33:10 - 33:14
    So let's figure out what the probability distribution ZI is.
  • 33:14 - 33:15
    Well,
  • 33:15 - 33:16
    so
  • 33:16 - 33:21
    ZI takes on the value of either zero or one, so clearly it's a Bernoulli random variable, it can only
  • 33:21 - 33:25
    take on these values.
  • 33:25 - 33:31
    Well,
  • 33:31 - 33:35
    what's the probability that ZI is equal to one? In other words, what's the
  • 33:35 - 33:36
    probability
  • 33:36 - 33:41
    that from a fixed hypothesis HJ,
  • 33:41 - 33:45
    when I sample my training set IID from distribution D, what is the chance
  • 33:45 - 33:46
    that
  • 33:46 - 33:50
    my hypothesis will misclassify it?
  • 33:50 - 33:54
    Well, by definition,
  • 33:54 - 33:57
    that's just the generalization error of my
  • 33:57 - 34:01
    hypothesis HJ.
  • 34:01 - 34:04
    So ZI is a Bernoulli random variable
  • 34:04 - 34:05
    with mean
  • 34:05 - 34:12
    given by the generalization error of this hypothesis.
  • 34:14 - 34:20
    Raise your hand if that made sense. Oh, cool. Great.
  • 34:20 - 34:22
    And moreover,
  • 34:22 - 34:24
    all the ZIs have the same
  • 34:24 - 34:28
    probability of being one, and all my training examples I've drawn are IID,
  • 34:28 - 34:33
    and so the ZIs are also independent
  • 34:33 - 34:39
    and therefore
  • 34:39 - 34:42
    the ZIs themselves are IID random
  • 34:42 - 34:49
    variables. Okay? Because my training examples were drawn independently of each other, by assumption.
  • 34:56 - 34:59
    If you read this as the definition of training error,
  • 34:59 - 35:04
    the training error of my hypothesis HJ, that's just that.
  • 35:07 - 35:09
    That's just the average
  • 35:09 - 35:16
    of my ZIs, which was well I previously defined it like this. Okay?
  • 35:26 - 35:29
    And so epsilon hat of HJ
  • 35:29 - 35:30
    is exactly
  • 35:30 - 35:33
    the average of M IID
  • 35:33 - 35:36
    Bernoulli random variables,
  • 35:36 - 35:40
    drawn from a Bernoulli distribution with mean given by
  • 35:40 - 35:42
    the
  • 35:42 - 35:44
    generalization error, so this is
  • 35:44 - 35:48
    well, this is the average of M IID Bernoulli random variables,
  • 35:48 - 35:55
    each of which has mean given by the
  • 35:55 - 35:59
    generalization error of HJ.
  • 35:59 - 36:04
    And therefore, by
  • 36:04 - 36:07
    the Hoeffding inequality,
  • 36:07 - 36:11
    we have that the
  • 36:11 - 36:16
    probability
  • 36:16 - 36:20
    that the difference between training and generalization error, the probability that this is greater than gamma is
  • 36:20 - 36:22
    less than or equal to two E to
  • 36:22 - 36:25
    the negative two,
  • 36:25 - 36:27
    gamma squared M. Okay?
  • 36:27 - 36:32
    Exactly by the Hoeffding inequality.
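Putting the pieces just described into symbols, for the one fixed hypothesis h_j:

```latex
Z_i = \mathbf{1}\{h_j(x^{(i)}) \neq y^{(i)}\},\qquad
\hat{\varepsilon}(h_j) = \frac{1}{m}\sum_{i=1}^{m} Z_i,\qquad
\mathbb{E}[Z_i] = \varepsilon(h_j)
\;\Longrightarrow\;
P\big(\,|\varepsilon(h_j) - \hat{\varepsilon}(h_j)| > \gamma\,\big) \;\le\; 2\exp(-2\gamma^2 m)
```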
  • 36:32 - 36:35
    And what this proves is that,
  • 36:35 - 36:37
    for my fixed hypothesis HJ,
  • 36:37 - 36:41
    my training error, epsilon hat will
  • 36:41 - 36:44
    with high probability, assuming M is large, if
  • 36:44 - 36:47
    M is large then this thing on the right hand side will be small, because
  • 36:47 - 36:49
    this is two E to the negative two gamma squared M.
  • 36:49 - 36:53
    So this says that if my training set is large enough,
  • 36:53 - 36:54
    then the probability
  • 36:54 - 36:57
    my training error is far from generalization error,
  • 36:57 - 36:59
    meaning that it is more than gamma,
  • 36:59 - 37:06
    will be small, will be bounded by this thing on the right hand side. Okay? Now,
  • 37:09 - 37:12
    here's the [inaudible] tricky part,
  • 37:12 - 37:17
    what we've done is prove this bound for one fixed hypothesis, for HJ.
  • 37:17 - 37:20
    What I want to prove is that training error will be a good estimate for generalization
  • 37:20 - 37:21
    error,
  • 37:21 - 37:26
    not just for this one hypothesis HJ, but actually for all K hypotheses in my
  • 37:26 - 37:28
    hypothesis class
  • 37:28 - 37:30
    script
  • 37:30 - 37:31
    H.
  • 37:31 - 37:33
    So let's do it
  • 37:36 - 37:43
    well,
  • 37:55 - 37:58
    better do it on a new board. So in order to show that, let me define
  • 37:58 - 38:01
    a random event, let me define AJ to be the event
  • 38:01 - 38:05
    that
  • 38:05 - 38:12
    let
  • 38:21 - 38:24
    me define AJ to be the event that
  • 38:24 - 38:28
    you know, the difference between training and generalization error is more than gamma on a
  • 38:28 - 38:30
    hypothesis HJ.
  • 38:30 - 38:32
    And so what we
  • 38:32 - 38:36
    put on the previous board was that the probability of AJ is less than or equal to two E
  • 38:36 - 38:43
    to the negative two, gamma squared M, and this is pretty small. Now,
  • 38:44 - 38:50
    What I want to bound is the probability that
  • 38:50 - 38:54
    there exists some hypothesis in my class
  • 38:54 - 39:01
    script H,
  • 39:03 - 39:08
    such that I make a large error in my estimate of generalization error. Okay?
  • 39:08 - 39:10
    Such that this holds true.
  • 39:12 - 39:14
    So this is really just
  • 39:14 - 39:18
    that the probability that there exists a hypothesis for which this
  • 39:18 - 39:21
    holds. This is really the probability that
  • 39:21 - 39:25
    A one or A two, or up to AK holds.
  • 39:25 - 39:29
    The chance
  • 39:29 - 39:33
    there exists a hypothesis is just, well, the probability
  • 39:33 - 39:37
    that for hypothesis one I make a large error in estimating the generalization error,
  • 39:37 - 39:41
    or for hypothesis two I make a large error in estimating generalization error,
  • 39:41 - 39:44
    and so on.
  • 39:44 - 39:51
    And so by the union bound, this is less than or equal to
  • 39:51 - 39:56
    that,
  • 39:56 - 39:59
    which is therefore less than or equal to
  • 40:07 - 40:14
    is
  • 40:14 - 40:21
    equal to that. Okay?
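Written out, the chain of inequalities described here is:

```latex
P\big(\exists\, h_j \in \mathcal{H}: |\varepsilon(h_j)-\hat{\varepsilon}(h_j)| > \gamma\big)
\;=\; P(A_1 \cup \cdots \cup A_K)
\;\le\; \sum_{j=1}^{K} P(A_j)
\;\le\; 2K\exp(-2\gamma^2 m)
```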
  • 40:39 - 40:40
    So let
  • 40:40 - 40:42
    me just take
  • 40:42 - 40:48
    one minus both sides of the equation
  • 40:48 - 40:51
    on the previous board let me take one minus both sides, so
  • 40:51 - 40:57
    the probability that there does not exist a
  • 40:57 - 40:59
    hypothesis
  • 40:59 - 41:06
    such that,
  • 41:08 - 41:11
    that. The probability that there does not exist a hypothesis on which I make a large
  • 41:11 - 41:13
    error in this estimate
  • 41:13 - 41:20
    well, this is equal to the probability that for all hypotheses,
  • 41:23 - 41:25
    I make a
  • 41:25 - 41:27
    small error, or at
  • 41:27 - 41:32
    most gamma, in my estimate of generalization error.
  • 41:32 - 41:36
    And taking one minus on the right hand side, I get one minus two KE
  • 41:36 - 41:39
    to the negative two
  • 41:39 - 41:44
    gamma squared M. Okay?
  • 41:44 - 41:46
    And so
  • 41:46 - 41:50
    and the sign of the inequality flipped because I took one minus both
  • 41:50 - 41:54
    sides. The minus sign flips the sign of the
  • 41:54 - 41:56
    inequality.
  • 41:56 - 42:00
    So what we've shown is that
  • 42:00 - 42:05
    with probability which abbreviates to WP with probability one minus
  • 42:05 - 42:09
    two KE to the negative two gamma squared M.
  • 42:09 - 42:12
    We have that, epsilon hat
  • 42:12 - 42:16
    of H
  • 42:16 - 42:21
    will be
  • 42:21 - 42:28
    within gamma of epsilon of H,
  • 42:31 - 42:33
    simultaneously for all
  • 42:33 - 42:35
    hypotheses in our
  • 42:35 - 42:42
    class script H.
  • 42:43 - 42:47
    And so
  • 42:47 - 42:54
    just to give this result a name, this is called
  • 42:56 - 43:00
    this is one instance of what's called a uniform convergence result,
  • 43:00 - 43:01
    and
  • 43:01 - 43:04
    the term uniform convergence
  • 43:04 - 43:06
    this sort of alludes to the fact that
  • 43:06 - 43:10
    this shows that as M becomes large,
  • 43:10 - 43:12
    then these epsilon hats
  • 43:12 - 43:16
    will all simultaneously converge to epsilon of H. That
  • 43:16 - 43:18
    training error will
  • 43:18 - 43:21
    become very close to generalization error
  • 43:21 - 43:23
    simultaneously for all hypotheses H.
  • 43:23 - 43:25
    That's what the term uniform
  • 43:25 - 43:29
    refers to, is the fact that this converges for all hypotheses H and not just
  • 43:29 - 43:31
    for one hypothesis. And so
  • 43:31 - 43:36
    what we've shown is one example of a uniform convergence result. Okay?
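Stated compactly, the uniform convergence result shown here is:

```latex
\text{With probability at least } 1 - 2K\exp(-2\gamma^2 m):\qquad
|\varepsilon(h) - \hat{\varepsilon}(h)| \le \gamma \quad \text{for every } h \in \mathcal{H}.
```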
  • 43:36 - 43:39
    So let me clean a couple more boards. I'll come back and ask what questions you have
  • 43:39 - 43:46
    about this. We should take another look at this and make sure it all makes sense. Yeah, okay.
  • 44:20 - 44:27
    What questions do you have about this?
  • 44:28 - 44:31
  • 44:31 - 44:34
    Student: How is the
  • 44:34 - 44:36
    value of gamma computed [inaudible]? Instructor (Andrew Ng): Right. Yeah. So let's see, the question is how is the value of gamma computed?
  • 44:36 - 44:40
    So for these purposes for the purposes of this, gamma is a constant.
  • 44:40 - 44:43
    Imagine a gamma is some constant that we chose in advance,
  • 44:43 - 44:49
    and this is a bound that holds true for any fixed value
  • 44:49 - 44:51
    of gamma. Later on as we
  • 44:51 - 44:54
    take this bound and then sort of develop this result further,
  • 44:54 - 44:58
    we'll choose specific values of gamma as a [inaudible] of this bound. For now
  • 44:58 - 45:05
    we'll just imagine that what we've proved holds true for any value of gamma. Any questions?
  • 45:09 - 45:12
    Yeah? Student: [Inaudible] hypothesis class is infinite [inaudible]? Instructor (Andrew Ng): Yes, so if the hypothesis class is infinite,
  • 45:12 - 45:16
    so this simple result won't work in this present form, but we'll generalize this
  • 45:16 - 45:20
    probably won't get to it today but we'll generalize this at the beginning of the
  • 45:20 - 45:27
    next lecture to infinite hypothesis classes. Student:How do we use this
  • 45:30 - 45:32
    theory [inaudible]? Instructor (Andrew Ng): How do you use this theory in practice? So let me
  • 45:32 - 45:37
    I might get to a little of that later today, we'll talk concretely about algorithms,
  • 45:37 - 45:39
    the consequences of
  • 45:39 - 45:46
    the understanding of these things in the next lecture as well. Yeah, okay? Cool. Can you
  • 45:46 - 45:47
    just raise your hand if the
  • 45:47 - 45:50
    things I've proved so far
  • 45:50 - 45:55
    make sense? Okay. Cool. Great.
  • 45:55 - 45:59
    Thanks. All right. Let me just take this uniform convergence bound and rewrite
  • 45:59 - 46:02
    it in a couple of other forms.
  • 46:02 - 46:05
    So
  • 46:05 - 46:09
    this is a sort of a bound on probability, this is saying suppose I fix my training
  • 46:09 - 46:11
    set, and then fix my
  • 46:11 - 46:15
    threshold, my error threshold gamma,
  • 46:15 - 46:19
    what is the probability that uniform convergence holds, and well,
  • 46:19 - 46:23
    that's my formula that gives the answer. This is the probability of something
  • 46:23 - 46:26
    happening.
  • 46:26 - 46:29
    So there are actually three parameters of interest. One is,
  • 46:29 - 46:32
    What is this probability? The
  • 46:32 - 46:34
    other parameter is, What's the training set size M?
  • 46:34 - 46:36
    And the third parameter is, What
  • 46:36 - 46:37
    is the value
  • 46:37 - 46:40
    of this error
  • 46:40 - 46:42
    threshold gamma? I'm not gonna
  • 46:42 - 46:44
    vary K for these purposes.
  • 46:44 - 46:48
    So there are two other equivalent forms of the bound,
  • 46:48 - 46:50
    which so you can ask,
  • 46:50 - 46:51
    given
  • 46:51 - 46:52
    gamma
  • 46:52 - 46:53
    so what
  • 46:53 - 46:56
    we proved was given gamma and given M, what
  • 46:56 - 46:58
    is the probability of
  • 46:58 - 47:01
    uniform convergence? The
  • 47:01 - 47:04
    other equivalent forms are,
  • 47:04 - 47:07
    so that given gamma
  • 47:07 - 47:14
    and the probability delta of making a large error,
  • 47:15 - 47:19
    how large a training set size do you need in order to give
  • 47:19 - 47:22
    a bound on how large a
  • 47:22 - 47:25
    training set size do you need
  • 47:25 - 47:30
    to give a uniform convergence bound with parameters gamma
  • 47:30 - 47:31
    and delta?
  • 47:31 - 47:34
    And so well,
  • 47:34 - 47:38
    so if you set delta to be two KE to the negative two gamma
  • 47:38 - 47:38
    squared M.
  • 47:38 - 47:41
    This is that form that I had on the left.
  • 47:41 - 47:44
    And if you solve
  • 47:44 - 47:46
    for M,
  • 47:46 - 47:48
    what you find is that
  • 47:48 - 47:54
    there's an equivalent form of this result that says that
  • 47:54 - 47:56
    so long as your training
  • 47:56 - 47:59
    set size M is greater than this.
  • 47:59 - 48:05
    And this is the formula that I get by solving for M.
  • 48:05 - 48:09
    Okay? So long as M is greater than or equal to this,
  • 48:09 - 48:13
    then with probability, which I'm abbreviating to WP again,
  • 48:13 - 48:17
    with probability at least one minus delta,
  • 48:17 - 48:24
    we have for all.
  • 48:25 - 48:31
    Okay?
  • 48:31 - 48:34
    So this says how large a training set size that I need
  • 48:34 - 48:38
    to guarantee that with probability at least one minus delta,
  • 48:38 - 48:42
    we have that training error is within gamma of generalization error for all my
  • 48:42 - 48:45
    hypotheses, and this gives an answer.
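Solving delta = 2K exp(-2 gamma squared m) for m, the form being described here comes out as:

```latex
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2K}{\delta}
\quad\Longrightarrow\quad
\text{w.p. at least } 1-\delta,\ \ |\varepsilon(h)-\hat{\varepsilon}(h)| \le \gamma \ \text{ for all } h \in \mathcal{H}.
```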
  • 48:45 - 48:49
    And just to give this another name, this is an example of a sample
  • 48:49 - 48:53
    complexity
  • 48:53 - 48:58
  • 48:58 - 48:59
    bound.
  • 48:59 - 49:03
    So from undergrad computer science classes you may have heard of
  • 49:03 - 49:06
    computational complexity, which is how much computation you need to achieve
  • 49:06 - 49:07
    something.
  • 49:07 - 49:11
    So sample complexity just means how large a training set, how
  • 49:11 - 49:14
    large a sample of examples do you need
  • 49:14 - 49:17
    in order to achieve a certain bound on the error.
  • 49:17 - 49:20
    And it turns out that in many of the theorems we write out you can
  • 49:20 - 49:24
    pose them in sort of a form of probability bound or a sample complexity bound or in some other
  • 49:24 - 49:24
    form.
  • 49:24 - 49:28
    I personally often find the sample complexity bounds the most
  • 49:28 - 49:32
    easy to interpret because it says how large a training set do you need to give a certain bound on the
  • 49:32 - 49:36
    errors.
  • 49:36 - 49:39
    And in fact well, we'll see this later,
  • 49:39 - 49:41
    sample complexity bounds often sort of
  • 49:41 - 49:43
    help to give guidance for really if
  • 49:43 - 49:44
    you're trying to
  • 49:44 - 49:46
    achieve something on a machine learning problem,
  • 49:46 - 49:47
    this really is
  • 49:47 - 49:50
    trying to give guidance on how much training data you need
  • 49:50 - 49:54
    to prove something.
  • 49:54 - 49:58
    The one thing I want to note here is that M
  • 49:58 - 50:01
    grows like the log of K, right, so
  • 50:01 - 50:06
    the log of K grows extremely slowly as a function of K. The log is one
  • 50:06 - 50:11
    of the slowest growing functions, right. It's one of well,
  • 50:11 - 50:18
    some of you may have heard this, right? That for all values of K, right I learned
  • 50:19 - 50:22
    this from a colleague, Andrew Moore,
  • 50:22 - 50:24
    at Carnegie Mellon
  • 50:24 - 50:25
    that in
  • 50:25 - 50:31
    computer science, for all practical purposes, for all values of K, log K is less than [inaudible]; this is
  • 50:31 - 50:32
    almost true.
  • 50:32 - 50:36
    So the log is one of the slowest growing functions, and so the
  • 50:36 - 50:40
    fact that the sample complexity M grows like the log of K,
  • 50:40 - 50:42
    means that
  • 50:42 - 50:46
    you can increase the number of hypotheses in your hypothesis class quite a lot
  • 50:46 - 50:51
    and the number of training examples you need won't grow
  • 50:51 - 50:56
  • 50:56 - 51:00
    very much. [Inaudible]. This property will be important later when we talk about infinite
  • 51:00 - 51:03
    hypothesis classes.
  • 51:03 - 51:06
    The final form is the
  • 51:06 - 51:08
    I guess it's sometimes called the error bound,
  • 51:08 - 51:14
    which is when you hold M and delta fixed and solve for gamma.
  • 51:14 - 51:21
    And so
  • 51:23 - 51:27
    and what you get then is that, with
  • 51:27 - 51:31
    probability at least one minus delta,
  • 51:31 - 51:38
    we have that.
  • 51:41 - 51:44
    For all hypotheses in my hypothesis class,
  • 51:44 - 51:51
    the difference between training and generalization error will be less
  • 51:54 - 51:55
    than or equal to that. Okay? And that's just
  • 51:55 - 52:02
    solving for gamma and plugging the value I get in there. Okay? All right.
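Solving for gamma instead, the error bound form being described here is:

```latex
\text{W.p. at least } 1-\delta:\qquad
|\varepsilon(h) - \hat{\varepsilon}(h)| \;\le\; \sqrt{\frac{1}{2m}\log\frac{2K}{\delta}}
\quad \text{for all } h \in \mathcal{H}.
```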
  • 52:03 - 52:10
    So the
  • 52:31 - 52:34
    second
  • 52:34 - 52:36
    step of
  • 52:36 - 52:43
    the overall proof I want to execute is the following.
  • 52:45 - 52:47
    The result on training error is essentially that uniform
  • 52:47 - 52:49
    convergence will hold true
  • 52:49 - 52:51
    with high probability.
  • 52:51 - 52:55
    What I want to show now is, let's assume that uniform convergence holds. So let's
  • 52:55 - 52:58
    assume that for all
  • 52:58 - 53:00
    hypotheses H,
  • 53:00 - 53:05
    we have that epsilon of H minus epsilon hat of H, is
  • 53:05 - 53:09
    less than or equal to gamma. Okay?
  • 53:09 - 53:11
    What I want to do now is
  • 53:11 - 53:13
    use this to see what we can
  • 53:13 - 53:15
    prove about the bound
  • 53:15 - 53:22
    of see what we can prove about the generalization error.
  • 53:29 - 53:31
  • 53:31 - 53:33
    So I want to know
  • 53:33 - 53:36
    suppose this holds true I want
  • 53:36 - 53:40
    to know can we prove something about the generalization error of H hat, where
  • 53:40 - 53:43
    again, H hat
  • 53:43 - 53:46
    was the hypothesis
  • 53:46 - 53:52
    selected by empirical risk minimization.
  • 53:52 - 53:56
    Okay? So in order to show this, let me make one more definition, let me define H
  • 53:56 - 54:00
    star,
  • 54:00 - 54:03
    to be the hypothesis
  • 54:03 - 54:06
    in my class
  • 54:06 - 54:10
    script H that has the smallest generalization error.
  • 54:10 - 54:11
    So this is
  • 54:11 - 54:15
    if I had an infinite amount of training data or if I really I
  • 54:15 - 54:19
    could go in and find the best possible hypothesis
  • 54:19 - 54:23
    best possible hypothesis in the sense of minimizing generalization error
  • 54:23 - 54:29
    what's the hypothesis I would get? Okay? So
  • 54:29 - 54:33
    in some sense, it sort of makes sense to compare the performance of our learning
  • 54:33 - 54:34
    algorithm
  • 54:34 - 54:36
    to the performance of H star, because we
  • 54:36 - 54:40
    sort of we clearly can't hope to do better than H star.
  • 54:40 - 54:42
    Another way of saying that is that if
  • 54:42 - 54:48
    your hypothesis class is a class of all linear decision boundaries, that
  • 54:48 - 54:51
    the data just can't be separated by any linear functions.
  • 54:51 - 54:54
    So if even H star is really bad,
  • 54:54 - 54:55
    then there's sort of
  • 54:55 - 54:59
    it's unlikely then there's just not much hope that your learning algorithm could do even
  • 54:59 - 55:04
    better
  • 55:04 - 55:09
    than H star. So I actually prove this result in three steps.
  • 55:09 - 55:13
    So the generalization error of H hat, the hypothesis I chose,
  • 55:13 - 55:20
    this is going to be less than equal to that, actually let me number these
  • 55:22 - 55:25
    equations, right. This
  • 55:25 - 55:27
    is
  • 55:27 - 55:30
    because of equation one, because I see that
  • 55:30 - 55:33
    epsilon of H hat and epsilon hat of H hat will then gamma of each
  • 55:33 - 55:37
    other.
  • 55:37 - 55:38
    Now
  • 55:38 - 55:42
    because H star, excuse me,
  • 55:42 - 55:46
    now by the
  • 55:46 - 55:51
    definition of empirical risk minimization, H
  • 55:51 - 55:54
    hat was chosen to minimize training error
  • 55:54 - 55:59
    and so there can't be any hypothesis with lower training error than H hat.
  • 55:59 - 56:05
    So the training error of H hat must be less than the equal to
  • 56:05 - 56:07
    the training error of H star. So
  • 56:07 - 56:11
    this is sort of by two, or by the definition of H hat,
  • 56:11 - 56:18
    as the hypothesis that minimizes training error H hat.
  • 56:18 - 56:20
    And the final step is I'm
  • 56:20 - 56:23
    going to apply this uniform conversions result again. We know that
  • 56:23 - 56:25
    epsilon hat
  • 56:25 - 56:30
    of H star must be moving gamma of epsilon of H star.
  • 56:30 - 56:34
    And so this is at most
  • 56:34 - 56:35
    plus gamma. Then
  • 56:35 - 56:36
  • 56:36 - 56:38
    I have my original gamma
  • 56:38 - 56:41
    there. Okay? And so this is by
  • 56:41 - 56:45
    equation one again because oh, excuse me
  • 56:45 - 56:48
    because I know the training error of H star must be within gamma of the
  • 56:48 - 56:50
    generalization error of H star.
  • 56:50 - 56:52
    And so well, I'll just
  • 56:52 - 56:57
    write this as plus two gamma. Okay? Yeah? Student: [Inaudible] notation, epsilon of [inaudible] H hat, that's not the training
  • 56:57 - 57:04
    error, that's the generalization error of the estimated hypothesis? Instructor (Andrew Ng): Oh,
  • 57:25 - 57:27
    okay. Let me just well, let
  • 57:27 - 57:33
    me write that down on this board. So actually let me
  • 57:33 - 57:36
    think
  • 57:36 - 57:39
    [inaudible]
  • 57:39 - 57:44
    fit this in here. So epsilon hat
  • 57:44 - 57:45
    of H is the
  • 57:45 - 57:49
    training
  • 57:49 - 57:51
  • 57:51 - 57:52
    error of the hypothesis H. In
  • 57:52 - 57:56
    other words, given the hypothesis (a hypothesis is just a function, right, mapping from X's
  • 57:56 - 57:58
    to Y's),
  • 57:58 - 57:59
    so epsilon hat of H is
  • 57:59 - 58:03
    given the hypothesis H, what's the fraction of training examples it
  • 58:03 - 58:05
    misclassifies?
  • 58:05 - 58:12
    And generalization error
  • 58:12 - 58:14
    of
  • 58:14 - 58:18
    H, is given the hypothesis H if I
  • 58:18 - 58:19
    sample another
  • 58:19 - 58:22
    example from my distribution
  • 58:22 - 58:23
    scripts
  • 58:23 - 58:28
    D, what's the probability that H will misclassify that example? Does that make sense?
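In symbols (a restatement of the two definitions just given, using the notation from earlier in the lecture), for a training set of m examples drawn i.i.d. from the distribution D:

$$\hat{\varepsilon}(h) \;=\; \frac{1}{m}\sum_{i=1}^{m} 1\{h(x^{(i)}) \neq y^{(i)}\}, \qquad \varepsilon(h) \;=\; P_{(x,y)\sim\mathcal{D}}\big(h(x) \neq y\big).$$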
  • 58:28 - 58:32
    Oh,
  • 58:32 - 58:38
    okay. And H hat is the hypothesis that's chosen by empirical risk minimization.
  • 58:38 - 58:43
    So when I talk about empirical risk minimization, that's the algorithm that
  • 58:43 - 58:48
    minimizes training error, and so epsilon hat of H is the training error of H,
  • 58:48 - 58:50
    and so H hat is defined as
  • 58:50 - 58:52
    the hypothesis that out
  • 58:52 - 58:55
    of all hypotheses in my class script H,
  • 58:55 - 58:58
    the one that minimizes training error, epsilon hat of H. Okay? All right. Yeah? Student: [Inaudible] H is [inaudible] a member
  • 58:58 - 59:05
    of
  • 59:07 - 59:09
    script H, [inaudible]
  • 59:09 - 59:11
    family
  • 59:11 - 59:13
    right? Instructor (Andrew Ng): Yes, it is.
  • 59:13 - 59:20
    Student: So what happens with the generalization error [inaudible]?
  • 59:20 - 59:24
    Instructor (Andrew Ng): I'll talk about that later. So
  • 59:24 - 59:31
    let me tie all these things together into a
  • 59:34 - 59:38
    theorem.
  • 59:38 - 59:43
    Let there be a hypothesis class given with a
  • 59:43 - 59:49
    finite set of K hypotheses,
  • 59:49 - 59:51
    and let any M
  • 59:51 - 59:52
    and delta
  • 59:52 - 59:58
    be fixed.
  • 59:58 - 60:02
    Then so I fixed M and delta, so this will be the error bound
  • 60:02 - 60:04
    form of the theorem, right?
  • 60:04 - 60:07
    Then, with
  • 60:07 - 60:10
    probability at least one minus delta.
  • 60:10 - 60:14
    We have that. The generalization error of H hat is
  • 60:14 - 60:19
    less than or equal to
  • 60:19 - 60:23
    the minimum over all hypotheses in
  • 60:23 - 60:25
    set H epsilon of H,
  • 60:25 - 60:27
    plus
  • 60:27 - 60:34
    two times,
  • 60:38 - 60:39
    that square root term. Okay?
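Written out, the bound being stated (a reconstruction of what is on the board, with the square-root term coming from the uniform convergence result discussed just below) is:

$$\varepsilon(\hat{h}) \;\le\; \Big(\min_{h\in\mathcal{H}}\varepsilon(h)\Big) \;+\; 2\sqrt{\frac{1}{2m}\log\frac{2k}{\delta}}\,.$$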
  • 60:39 - 60:42
    So to prove this, well,
  • 60:42 - 60:46
    this term of course is just epsilon of H star.
  • 60:46 - 60:48
    And so to prove this
  • 60:48 - 60:50
    we set
  • 60:50 - 60:53
    gamma equal to that
  • 60:53 - 60:56
    this is two times the square root term.
  • 60:56 - 61:03
    To prove this theorem we set gamma equal to that square root term. Say that again?
  • 61:04 - 61:08
    Wait. Say that again?
  • 61:08 - 61:11
    Oh,
  • 61:14 - 61:16
    yes. Thank you. That didn't
  • 61:16 - 61:20
    make sense at all.
  • 61:20 - 61:21
    Thanks. Great.
  • 61:21 - 61:26
    So set gamma to that square root term,
  • 61:26 - 61:30
    and so we know equation one,
  • 61:30 - 61:32
    right, from the previous board
  • 61:32 - 61:35
    holds with probability one minus delta.
  • 61:35 - 61:40
    Right. Equation one was the uniform convergence result, right, that well,
  • 61:40 - 61:47
    IE.
  • 61:50 - 61:52
    This is equation one from the previous board, right, so
  • 61:52 - 61:55
    if we set gamma equal to this, we know that with probability
  • 61:55 - 61:58
    one minus delta this uniform convergence holds,
  • 61:58 - 62:01
    and whenever that holds, that implies you
  • 62:01 - 62:05
    know, I
  • 62:05 - 62:05
    guess
  • 62:05 - 62:08
    if we call this equation star I guess.
  • 62:08 - 62:12
    And whenever uniform convergence holds, we showed again, on the previous boards
  • 62:12 - 62:17
    that this result holds, that the generalization error of H hat is less than or equal
  • 62:17 - 62:21
    to the generalization error of H star plus two times gamma. Okay? And
  • 62:21 - 62:23
    so that proves
  • 62:23 - 62:30
    this theorem.
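In summary, the three-step chain of inequalities used in the proof (each step as argued on the boards above) is:

$$\varepsilon(\hat{h}) \;\overset{(1)}{\le}\; \hat{\varepsilon}(\hat{h}) + \gamma \;\overset{(2)}{\le}\; \hat{\varepsilon}(h^{*}) + \gamma \;\overset{(1)}{\le}\; \varepsilon(h^{*}) + 2\gamma,$$

where steps (1) use uniform convergence with gamma set to the square-root term, and step (2) uses the fact that H hat minimizes training error.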
  • 62:42 - 62:44
    So this result sort of helps us to quantify
  • 62:44 - 62:45
    a little bit
  • 62:45 - 62:48
    that bias variance tradeoff
  • 62:48 - 62:50
    that I talked about
  • 62:50 - 62:54
    at the beginning of actually near the very start of this lecture.
  • 62:54 - 62:58
    And in particular
  • 62:58 - 63:04
    let's say I have some hypothesis class script H, that
  • 63:04 - 63:06
    I'm using, maybe the class of all
  • 63:06 - 63:10
    linear functions, as in linear regression or logistic regression with
  • 63:10 - 63:12
    just the linear features.
  • 63:12 - 63:15
    And let's say I'm considering switching
  • 63:15 - 63:20
    to some new class H prime by having more features. So let's say this is
  • 63:20 - 63:23
    linear
  • 63:23 - 63:27
    and this is quadratic,
  • 63:27 - 63:28
  • 63:28 - 63:32
    so the class of all linear functions is a subset of the class of all quadratic functions,
  • 63:32 - 63:35
    and so H is the subset of H prime. And
  • 63:35 - 63:37
    let's say I'm considering
  • 63:37 - 63:41
    instead of using my linear hypothesis class let's say I'm considering switching
  • 63:41 - 63:45
    to a quadratic hypothesis class, or switching to a larger hypothesis
  • 63:45 - 63:46
    class.
  • 63:46 - 63:47
  • 63:47 - 63:50
    Then what are the tradeoffs involved? Well, I
  • 63:50 - 63:53
    proved this only for finite hypothesis classes, but we'll see that something
  • 63:53 - 63:56
    very similar holds for infinite hypothesis classes too.
  • 63:56 - 63:58
    But the tradeoff is
  • 63:58 - 64:01
    what if I switch from H to H prime, or I switch from linear to quadratic
  • 64:01 - 64:04
    functions. Then
  • 64:04 - 64:08
    epsilon of H star will become better because
  • 64:08 - 64:12
    the best hypothesis in my hypothesis class will become better.
  • 64:12 - 64:15
    The best quadratic function
  • 64:15 - 64:16
  • 64:16 - 64:19
    by best I mean in the sense of generalization error
  • 64:19 - 64:21
    the hypothesis function
  • 64:21 - 64:25
    the quadratic function with the lowest generalization error
  • 64:25 - 64:26
    has to have
  • 64:26 - 64:27
    equal or,
  • 64:27 - 64:28
    more likely, lower generalization error than the best
  • 64:28 - 64:30
    linear function.
  • 64:30 - 64:32
  • 64:32 - 64:38
    So by switching to a more complex hypothesis class you can get this first term to go down.
  • 64:38 - 64:40
    But what I pay for then is that
  • 64:40 - 64:43
    K will increase.
  • 64:43 - 64:46
    By switching to a larger hypothesis class,
  • 64:46 - 64:48
    the first term will go down,
  • 64:48 - 64:52
    but the second term will increase because I now have a larger class of
  • 64:52 - 64:53
    hypotheses
  • 64:53 - 64:55
    and so the second term K
  • 64:55 - 64:57
    will increase.
  • 64:57 - 65:02
    And so this is usually called the bias variance
  • 65:02 - 65:03
    tradeoff. Whereby
  • 65:03 - 65:08
    going to a larger hypothesis class maybe I have the hope of finding a better function,
  • 65:08 - 65:10
    but my risk of sort of
  • 65:10 - 65:14
    not fitting my model so accurately also increases, and
  • 65:14 - 65:19
    that's illustrated by the second term going up when the
  • 65:19 - 65:21
    size of your hypothesis class,
  • 65:21 - 65:28
    when K goes up.
  • 65:28 - 65:32
    And so
  • 65:32 - 65:35
    speaking very loosely,
  • 65:35 - 65:37
    we can think of
  • 65:37 - 65:40
    this first term as corresponding
  • 65:40 - 65:41
    maybe to the bias
  • 65:41 - 65:45
    of the learning algorithm, or the bias of the hypothesis class.
  • 65:45 - 65:50
    And you can again speaking very loosely,
  • 65:50 - 65:53
    think of the second term as corresponding
  • 65:53 - 65:57
    to the variance in your hypothesis, in other words how well you can actually fit a
  • 65:57 - 65:58
    hypothesis,
  • 65:58 - 66:00
    how well you
  • 66:00 - 66:03
    actually fit this hypothesis class to the data.
  • 66:03 - 66:07
    And by switching to a more complex hypothesis class, your variance increases and your bias decreases.
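Very loosely, then, the two terms of the bound above line up with this intuition:

$$\varepsilon(\hat{h}) \;\le\; \underbrace{\min_{h\in\mathcal{H}}\varepsilon(h)}_{\text{``bias''}} \;+\; \underbrace{2\sqrt{\tfrac{1}{2m}\log\tfrac{2k}{\delta}}}_{\text{``variance''}}\,,$$

with the first term improving and the second term growing as the hypothesis class gets larger.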
  • 66:07 - 66:09
  • 66:09 - 66:14
    As a note of warning, it turns out that if you take like a statistics class you've seen
  • 66:14 - 66:15
    definitions of bias and variance,
  • 66:15 - 66:18
    which are often defined in terms of squared error or something.
  • 66:18 - 66:22
    It turns out that for classification problems,
  • 66:22 - 66:25
    there actually is no
  • 66:25 - 66:26
  • 66:26 - 66:27
    universally accepted
  • 66:27 - 66:29
    formal definition of
  • 66:29 - 66:31
    bias and
  • 66:31 - 66:35
    variance for classification problems. For regression problems, there is this square error
  • 66:35 - 66:39
    definition. For classification problems it
  • 66:39 - 66:40
    turns out there've
  • 66:40 - 66:44
    been several competing proposals for definitions of bias and variance. So when I
  • 66:44 - 66:45
    say bias and variance here,
  • 66:45 - 66:47
    think of these as very loose, informal,
  • 66:47 - 66:54
    intuitive definitions, and not formal definitions. Okay.
  • 67:00 - 67:07
    The
  • 67:15 - 67:18
    cartoon associated with the intuition I just
  • 67:18 - 67:19
    said would be
  • 67:19 - 67:22
    as follows: Let's
  • 67:22 - 67:24
    say
  • 67:24 - 67:27
  • 67:27 - 67:30
    and everything about the plot will be for a fixed value of M, for a
  • 67:30 - 67:35
    fixed training set size M. On the vertical axis I'll plot error, and
  • 67:35 - 67:39
    on the horizontal axis I'll plot model
  • 67:39 - 67:44
    complexity.
  • 67:44 - 67:47
    And by model complexity I mean
  • 67:47 - 67:53
    sort of degree of polynomial,
  • 67:53 - 67:58
  • 67:58 - 68:00
    size of your
  • 68:00 - 68:05
    hypothesis class script H etc. It actually turns out,
  • 68:05 - 68:08
    you remember the bandwidth parameter
  • 68:08 - 68:11
    from
  • 68:11 - 68:14
    locally weighted linear regression, that also
  • 68:14 - 68:17
    has a similar effect in controlling how complex your model is.
  • 68:17 - 68:21
    Model complexity [inaudible] polynomial I guess.
  • 68:21 - 68:23
    So the more complex your model,
  • 68:23 - 68:25
    the better your
  • 68:25 - 68:26
    training error,
  • 68:26 - 68:30
    and so
  • 68:30 - 68:33
    your training error will tend to [inaudible] zero as you increase the complexity of your model because the
  • 68:33 - 68:36
    more complex your model the better you can fit your training set.
  • 68:36 - 68:38
    But
  • 68:38 - 68:45
    because of this bias variance tradeoff,
  • 68:45 - 68:49
    you find
  • 68:49 - 68:53
    that
  • 68:53 - 68:57
    generalization error will come down for a while and then it will go back up.
  • 68:57 - 69:04
    And this regime on the left is when you're underfitting the data or when
  • 69:04 - 69:06
    you have high bias.
  • 69:06 - 69:08
    And this regime
  • 69:08 - 69:10
    on the right
  • 69:10 - 69:16
    is when you have high variance or
  • 69:16 - 69:21
    you're overfitting the data. Okay?
  • 69:21 - 69:24
    And this is why a model of sort of intermediate
  • 69:24 - 69:27
    complexity, somewhere here
  • 69:27 - 69:29
    is often preferable
  • 69:29 - 69:30
    if you want to
  • 69:30 - 69:33
    [inaudible] and minimize generalization error.
  • 69:33 - 69:36
    Okay?
  • 69:36 - 69:40
    So that's just a cartoon. In the next lecture we'll actually talk about a number of
  • 69:40 - 69:40
    algorithms
  • 69:40 - 69:44
    for trying to automatically select model complexities, say to get
  • 69:44 - 69:48
    you as close as possible to this minimum to
  • 69:48 - 69:55
    this area of minimized generalization error.
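A minimal numerical sketch of this cartoon, assuming synthetic one-dimensional data and using polynomial degree as the notion of model complexity (the data-generating function, noise level, and sample sizes here are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(m):
    # Illustrative target: a smooth function plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, m)
    y = np.sin(2.0 * np.pi * x) + 0.3 * rng.normal(size=m)
    return x, y

x_train, y_train = make_data(30)      # fixed training set size M
x_test, y_test = make_data(2000)      # large held-out set approximates generalization error

for degree in range(1, 13):           # model complexity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}   train MSE {train_err:.3f}   held-out MSE {test_err:.3f}")
```

Training error keeps falling as the degree increases, while the held-out error typically falls and then climbs back up, which corresponds to the underfitting (high bias) regime on the left of the cartoon and the overfitting (high variance) regime on the right.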
  • 70:11 - 70:15
    The last thing I want to do is actually
  • 70:15 - 70:19
    go back to the theorem I wrote out. I just want to take that theorem, well,
  • 70:19 - 70:20
    so the theorem I wrote out
  • 70:20 - 70:25
    was an error bound theorem. This says for fixed M and delta, with probability
  • 70:25 - 70:29
    one minus delta, I get a bound on
  • 70:29 - 70:31
    gamma, which is what this term is.
  • 70:31 - 70:35
    So the very last thing I wanna do today is just come back to this theorem and write out a
  • 70:35 - 70:36
    corollary
  • 70:36 - 70:40
    where I'm gonna fix gamma, I'm gonna fix my error bound, and fix delta
  • 70:40 - 70:41
    and solve for M.
  • 70:41 - 70:46
    And if you do that,
  • 70:46 - 70:47
    you
  • 70:47 - 70:49
    get the following corollary: Let H
  • 70:49 - 70:52
    be
  • 70:52 - 70:55
    fixed with K hypotheses
  • 70:55 - 70:59
    and let
  • 70:59 - 71:01
    any delta
  • 71:01 - 71:04
    and gamma be fixed.
  • 71:09 - 71:11
    Then
  • 71:11 - 71:18
    in order to guarantee that,
  • 71:23 - 71:25
    let's say I want a guarantee
  • 71:25 - 71:27
    that the generalization error
  • 71:27 - 71:31
    of the hypothesis I choose with empirical risk minimization,
  • 71:31 - 71:34
    that this is at most two times gamma worse
  • 71:34 - 71:38
    than the best possible error I could obtain with this hypothesis class.
  • 71:38 - 71:40
    Let's say I want this to
  • 71:40 - 71:45
    hold true with probability at least one minus delta,
  • 71:45 - 71:50
    then it suffices
  • 71:50 - 71:54
    that M
  • 71:54 - 72:01
    is [inaudible] to that. Okay? And this is
  • 72:01 - 72:01
    sort of
  • 72:01 - 72:06
    solving for the error
  • 72:06 - 72:08
    bound for M. One thing to convince yourselves of,
  • 72:08 - 72:10
    the easy part of this, is that
  • 72:10 - 72:15
    if you set that term [inaudible] gamma and solve for M you will get this.
  • 72:15 - 72:19
    One thing I want you to go home and sort of convince yourselves of is that this
  • 72:19 - 72:21
    result really holds true.
  • 72:21 - 72:24
    That this really logically follows from the theorem we've proved.
  • 72:24 - 72:26
    In other words,
  • 72:26 - 72:30
    you can take that formula we wrote and solve for M, and this is the formula you get for M;
  • 72:30 - 72:32
    that's just, that's the easy part. The thing I want is that
  • 72:32 - 72:34
    you go back and convince yourselves that
  • 72:34 - 72:38
    this theorem is a true fact and that it does indeed logically follow from the
  • 72:38 - 72:39
    other one. In
  • 72:39 - 72:40
    particular,
  • 72:40 - 72:44
    make sure that if you solve for that you really get M greater than or equal to this, and why is this M
  • 72:44 - 72:47
    greater than or equal to that and not M less than or equal to. And just to make sure,
  • 72:47 - 72:50
    since I can write this down and it sounds plausible, why don't you just go back and convince yourself this is really true. Okay?
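For reference, the corollary being discussed, written out with the same square-root term as before (the constant inside the logarithm is a plausible reading of the board rather than a quote): to guarantee, with probability at least 1 - delta, that

$$\varepsilon(\hat{h}) \;\le\; \min_{h\in\mathcal{H}}\varepsilon(h) + 2\gamma,$$

it suffices that

$$m \;\ge\; \frac{1}{2\gamma^{2}}\log\frac{2k}{\delta}\,.$$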
  • 72:53 - 72:55
    And
  • 72:55 - 72:58
    it turns out that when we prove these bounds in learning theory,
  • 72:58 - 73:03
    very often the constants are sort of loose. So it turns out that when we prove
  • 73:03 - 73:06
    these bounds,
  • 73:06 - 73:10
    usually we're not very interested in the constants, and so I
  • 73:10 - 73:12
    write this as big O of
  • 73:12 - 73:16
    one over gamma squared, log
  • 73:16 - 73:18
    K over delta,
  • 73:18 - 73:21
    and again, the key
  • 73:21 - 73:24
    step in this is that the dependence of M
  • 73:24 - 73:27
    on the size of the hypothesis class is logarithmic.
  • 73:27 - 73:30
    And this will be very important later when we
  • 73:30 - 73:35
    talk about infinite hypothesis classes. Okay? Any
  • 73:35 - 73:42
    questions about this? No? Okay, cool.
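As a small sketch of what this sample complexity looks like numerically, here is a hypothetical helper that plugs numbers into the bound above (the explicit constant is the one from the corollary as reconstructed, so treat it as illustrative):

```python
import math

def sample_complexity(gamma: float, delta: float, k: int) -> int:
    """Training set size sufficient, per the finite-class ERM bound, for the
    chosen hypothesis to be within 2*gamma of the best in a class of k
    hypotheses with probability at least 1 - delta."""
    return math.ceil((1.0 / (2.0 * gamma ** 2)) * math.log(2.0 * k / delta))

# The dependence on k is logarithmic: even huge hypothesis classes only
# increase the required number of examples by a modest additive amount.
for k in (10, 10_000, 10_000_000):
    print(f"k = {k:>10,d}   m >= {sample_complexity(0.05, 0.05, k)}")
```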
  • 73:50 - 73:51
    So
  • 73:51 - 73:54
    next lecture we'll come back, we'll actually start from this result again.
  • 73:54 - 73:58
    Remember this. I'll write this down as the first thing I do in the next lecture
  • 73:58 - 74:03
    and we'll generalize these to infinite hypothesis classes and then talk about
  • 74:03 - 74:05
    practical algorithms for model selection. So I'll see you guys in a couple days.