
Detection of 3D objects

  • 0:00 - 0:05
    Hello. After having spent a lot of time on
    the relatively simple two-dimensional
  • 0:05 - 0:09
    problem of handwritten digit recognition,
    we are now ready to tackle the general
  • 0:09 - 0:15
    problem of detecting 3D objects in
    scenes. The setting in which we study
  • 0:15 - 0:22
    this problem these days is most commonly
    the so-called PASCAL object
  • 0:22 - 0:28
    detection challenge. This has been going
    on for about
  • 0:28 - 0:34
    five years or so. What these folks have
    done is collected a set of about
  • 0:34 - 0:40
    10,000 images, where in each of these
    images they marked a certain set of
  • 0:40 - 0:45
    objects, and these object categories
    include dining table, dog, horse,
  • 0:45 - 0:50
    motorbike, person, potted plant, sheep,
    etc. So, they have twenty different
  • 0:50 - 0:57
    categories. For each object belonging to
    a category, they have marked the bounding
  • 0:57 - 1:02
    box. So, for example, here is the bounding
    box corresponding to the dog in this
  • 1:02 - 1:07
    image, there is a bounding box
    corresponding to a horse here, and there
  • 1:07 - 1:10
    will also be bounding boxes corresponding
    to the people, because in this image we
  • 1:10 - 1:18
    have horses and people. The goal is to
    detect these objects, so what a
  • 1:18 - 1:23
    computer program is supposed to do is,
    let's say we are trying to find dogs,
  • 1:23 - 1:28
    mark bounding boxes corresponding to
    where the dogs are in the image.
  • 1:28 - 1:35
    It will then be judged by whether each
    dog is in the right
  • 1:35 - 1:40
    location: the bounding box it reports has
    to overlap sufficiently with the correct
  • 1:40 - 1:46
    bounding box. So, this is the dominant
    dataset for studying 3D object
    recognition.
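
To make the task concrete, here is a minimal sketch of what the ground truth and a detector's output boil down to. PASCAL VOC actually distributes per-image XML annotation files; the image names, coordinates, and scores below are invented for illustration.

    # Ground truth: each image lists its objects as (category, box),
    # with boxes given as (xmin, ymin, xmax, ymax) pixel coordinates.
    ground_truth = {
        "image_001.jpg": [
            ("dog",    (48, 240, 195, 371)),
            ("person", (10, 12, 352, 498)),
        ],
        "image_002.jpg": [
            ("horse",  (100, 70, 420, 380)),
            ("person", (180, 30, 300, 260)),
        ],
    }

    # A detector for one category outputs scored boxes per image; each
    # guess is judged right or wrong by its overlap with the truth.
    detections = {
        "image_001.jpg": [
            ((52, 235, 200, 368), 0.94),  # (box, confidence)
            ((300, 40, 460, 200), 0.31),
        ],
    }
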
  • 1:46 - 1:51
    Now, let's see what techniques we can use
    for addressing this problem.
  • 1:51 - 1:57
    We start, of course, with the basic
    paradigm of the multi-scale sliding
  • 1:57 - 2:02
    window. This paradigm was introduced for
    face detection back in the
  • 2:02 - 2:08
    90s, and since then it has also been used
    for pedestrian detection and so forth.
  • 2:08 - 2:14
    The basic idea here is that we consider
    a window, let's say starting in
  • 2:14 - 2:21
    the top-left corner of the image. These
    green boxes correspond to one
  • 2:21 - 2:28
    of those windows, and we are going to
    evaluate the answer to the question: is
  • 2:28 - 2:34
    there a face there? Or is there a bus
    there? Then we shift the window slightly
  • 2:34 - 2:40
    and ask the same question. And since
    people could appear at a variety of
  • 2:40 - 2:46
    sizes, we have to repeat this process
    for windows of different sizes, so as to
  • 2:46 - 2:54
    detect small objects as well as large
    objects. A good and standard building
  • 2:54 - 2:58
    block is a linear support vector machine
    trained on Histogram of Oriented
  • 2:58 - 3:05
    Gradients (HOG) features. This is a
    framework introduced by Dalal & Triggs
  • 3:05 - 3:09
    in 2005, and they give details in their
    paper about how they compute each of the
  • 3:09 - 3:14
    blocks and how they normalize. If you
    are interested in the details, you should
  • 3:14 - 3:19
    read that paper.
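
Below is a minimal sketch of this pipeline, assuming scikit-image and scikit-learn; the window size, stride, scales, and threshold are illustrative, and clf is assumed to be a linear SVM already trained on HOG features of fixed-size positive and negative example windows.

    from skimage.feature import hog
    from skimage.transform import rescale

    WIN_H, WIN_W = 128, 64            # detection window (64x128, as for pedestrians)
    STEP = 8                          # stride in pixels
    SCALES = [1.0, 0.8, 0.64, 0.5]    # shrink image => bigger effective window

    def detect(gray, clf, threshold=0.5):
        """Slide a window over an image pyramid; return (box, score) hits."""
        hits = []
        for s in SCALES:
            img = rescale(gray, s)
            for y in range(0, img.shape[0] - WIN_H, STEP):
                for x in range(0, img.shape[1] - WIN_W, STEP):
                    feat = hog(img[y:y + WIN_H, x:x + WIN_W],
                               orientations=9, pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2))
                    score = clf.decision_function([feat])[0]
                    if score > threshold:
                        # map the window back to original image coordinates
                        hits.append(((x / s, y / s,
                                      (x + WIN_W) / s, (y + WIN_H) / s), score))
        return hits
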
  • 3:19 - 3:26
    Now, note that the Dalal & Triggs
    approach was tested on pedestrians, and
  • 3:26 - 3:32
    in the case of pedestrians, a single
    block is enough: you try to detect the
  • 3:32 - 3:38
    whole object in one go. When we deal
    with more complex objects, like people
  • 3:38 - 3:43
    in general poses, or dogs and cats, we
    find that these are very non-rigid, so a
  • 3:43 - 3:50
    single rigid template is not effective.
    What we really want are part-based
  • 3:50 - 3:56
    approaches. Nowadays, there are two
    dominant part-based approaches. The first
  • 3:56 - 4:03
    is the so-called deformable part models
    due to Felzenszwalb et al.; the paper on
  • 4:03 - 4:10
    that appeared around 2010. The other
    approach is the so-called Poselets, due
  • 4:10 - 4:16
    to Lubomir Bourdev and other
    collaborators in my group. So, what's the
    basic idea? Let me get into
    Felzenszwalb's approach first.
  • 4:16 - 4:24
    Their basic idea is to have a root filter
    which is trying to find the object as a
  • 4:24 - 4:29
    whole, and then a set of part filters
    which might correspond to,
  • 4:29 - 4:37
    say, trying to detect faces or legs and
    so forth, but these part filters have to
  • 4:37 - 4:43
    fire in certain spatial relationships
    with respect to the root filter. So, the
  • 4:43 - 4:50
    overall detector is the combination of a
    holistic detector and a set of part
  • 4:50 - 4:55
    filters which have to be in a certain
    relationship with respect to the whole
  • 4:55 - 5:01
    object. This requires training both the
    root filter and the various part filters,
  • 5:01 - 5:05
    and this can be done using the so-called
    latent SVM approach, which does not
  • 5:05 - 5:12
    require any extra annotation. And note
    that I said parts such as faces and legs;
  • 5:12 - 5:17
    that's me getting carried away. The
    learned parts need not correspond to
  • 5:17 - 5:24
    anything semantically meaningful.
  • 5:24 - 5:30
    In the case of the Poselets approach, the
    idea is to have semantically meaningful
  • 5:30 - 5:35
    parts, and the way they go about doing
    this is by making use of extra
  • 5:35 - 5:39
    annotation. Suppose you have images of
    people; these images might be annotated
  • 5:39 - 5:43
    with key points corresponding to left
    shoulder, right shoulder, left elbow,
  • 5:43 - 5:48
    right elbow, and so on, while other
    object categories will have other key
  • 5:48 - 5:54
    points. For example, for an airplane, you
    might have a key point on the tip of the
  • 5:54 - 5:57
    nose or the tips of the wings, and so on.
    This requires extra work, because
  • 5:57 - 6:02
    somebody has to go through all the images
    in the dataset and mark these key
  • 6:02 - 6:08
    points, but the consequence is that we
    will be able to do a few more things
    afterwards.
  • 6:08 - 6:14
    Here's a slide which shows how object
    detection with discriminatively trained
  • 6:14 - 6:21
    part-based models works. This is the DPM
    model of Felzenszwalb, Girshick,
  • 6:21 - 6:27
    McAllester, and Ramanan, and here the
    model is illustrated with the example of
  • 6:27 - 6:32
    bicycle detection. In fact, you don't
    train just one model, you train a
  • 6:32 - 6:40
    mixture of models. So, there is a model
    here corresponding to the side view of a
  • 6:40 - 6:45
    bicycle. The root filter is shown here;
    this is a HOG template, and it is
  • 6:45 - 6:50
    looking for edges of particular
    orientations, as might be found
  • 6:50 - 6:57
    on the side view of a bicycle. Then we
    have various part filters. The part
  • 6:57 - 7:02
    filters are in fact shown here: each of
    the rectangles here corresponds
  • 7:02 - 7:07
    to a part filter. This one might
    correspond to
  • 7:07 - 7:15
    something like a template detector for
    wheels. What we have to do to
  • 7:15 - 7:21
    come up with the final score is to
    combine the score corresponding to the
  • 7:21 - 7:26
    HOG template of the root filter as well
    as the HOG templates for each of the
  • 7:26 - 7:31
    parts. Note that this detector for the
    side view of a bicycle will probably not
  • 7:31 - 7:37
    do a good job on front views of bicycles,
    like here. And so for those, they have a
  • 7:37 - 7:44
    different model. Again, the model is
    shown here, and here the parts
  • 7:44 - 7:50
    may be somewhat different. So, overall,
    you have a mixture model, with multiple
  • 7:50 - 7:55
    models corresponding to different poses,
    and each model, as I said, consists of a
  • 7:55 - 8:01
    root filter and various part filters.
    There is some subtlety in training,
  • 8:01 - 8:07
    because the annotations do not include
    labels for key points and so forth. So,
  • 8:07 - 8:11
    in the learning approach, you have to
    guess where the parts should be as part
  • 8:11 - 8:15
    of the process of training, and you can
    find the details in the paper [inaudible].
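
As a rough sketch of the scoring rule (not the authors' implementation, which evaluates filters densely over a HOG feature pyramid and uses distance transforms for speed): a hypothesis scores the root filter's response plus, for each part, the best nearby part response minus a quadratic penalty for straying from its anchor. The search radius and penalty form here are illustrative.

    import numpy as np

    def score_hypothesis(x, y, root_score, parts, radius=4):
        """root_score(x, y): root filter response at (x, y).
        parts: list of (ax, ay, (wx, wy), part_score), where (ax, ay) is
        the part's anchor offset from the root, (wx, wy) weights the
        quadratic deformation cost, and part_score(x, y) is that part
        filter's response. Part placements are latent: searched over,
        never annotated."""
        total = root_score(x, y)
        for ax, ay, (wx, wy), part_score in parts:
            best = -np.inf
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    response = part_score(x + ax + dx, y + ay + dy)
                    best = max(best, response - (wx * dx**2 + wy * dy**2))
            total += best        # each part contributes its best placement
        return total
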
  • 8:15 - 8:24
    How well does it do? There is a standard
    methodology that we use in computer
  • 8:24 - 8:30
    vision for evaluating detection
    performance, and here is how we do this
  • 8:30 - 8:35
    for the case of, say, a motorcycle
    detector. One computes the so-called
  • 8:35 - 8:40
    precision-recall curves. The idea is
    that the detection algorithm is
  • 8:40 - 8:47
    going to come up with guesses of bounding
    boxes where the motorbike may be, and we
  • 8:47 - 8:52
    can then evaluate each of these guessed
    bounding boxes: is it right or wrong? It
  • 8:52 - 8:56
    is judged to be right if its intersection
    over union with the true motorbike's
    bounding box exceeds 50 percent.
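
The overlap test is simple to state in code; a sketch, with boxes given as (xmin, ymin, xmax, ymax):

    def iou(a, b):
        """Intersection over union of two boxes (xmin, ymin, xmax, ymax)."""
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    # A guessed box counts as correct when iou(guess, truth) > 0.5.
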
  • 8:56 - 9:05
    Then, we have a choice of how strict to
    be with the threshold.
  • 9:05 - 9:10
    We could pass through most of our
    candidate bounding boxes, and if
  • 9:10 - 9:15
    you guess enough of them, then of course
    you are guaranteed to find all of the
  • 9:15 - 9:19
    motorbikes, but many of the guesses will
    be wrong. So, the way we do this is that
  • 9:19 - 9:24
    we pick a threshold, and with that
    threshold we can evaluate the precision
  • 9:24 - 9:31
    and recall. Precision and recall:
    these terms have the following meaning.
  • 9:31 - 9:36
    Precision means: what fraction of the
    detections that you declared are actually
  • 9:36 - 9:44
    true motorcycles? Recall is the question
    of how many of the true motorcycles
  • 9:44 - 9:49
    you managed to detect. Ideally, I want
    precision to be 100 percent
  • 9:49 - 9:55
    and recall to be 100 percent. In reality,
    it doesn't work out that way. We're able
  • 9:55 - 10:01
    to detect some fraction of the true
    motorbikes. So here, for example, at this
  • 10:01 - 10:07
    point, the precision is 0.7. That means
    at this point, the detections
  • 10:07 - 10:13
    that we declare are 70 percent accurate.
    Now, this point corresponds to a
  • 10:13 - 10:19
    recall of something like 55 percent,
    meaning that at that threshold, we recall
  • 10:19 - 10:25
    55 percent of the true motorbikes. As we
    make the threshold more lenient, we are
  • 10:25 - 10:29
    going to get more false positives, but we
    will manage to detect more of the true
  • 10:29 - 10:33
    motorbikes. So, as this curve goes down
    in this range, this
  • 10:33 - 10:39
    particular detector manages to detect
    something like 70 to 80
  • 10:39 - 10:46
    percent of the true motorcycles. The
    curves in this figure correspond to
  • 10:46 - 10:51
    different algorithms, and the way we
    compare different algorithms is by
  • 10:51 - 10:59
    measuring the area under the curve. In
    the ideal case, of course, it would be
  • 10:59 - 11:04
    100 percent. In fact, it is something
    like 50 to 60 percent for these cases,
  • 11:04 - 11:10
    and that is what we call AP, or Average
    Precision. That is how we compare
    different algorithms.
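
Here is a sketch of that computation. Sorting detections by confidence and walking down the list is equivalent to sweeping the threshold from strict to lenient; the area is accumulated with a plain rectangle rule, whereas the actual PASCAL VOC tooling uses an interpolated variant.

    def average_precision(dets, num_true):
        """dets: (score, is_correct) for every detection declared on the
        test set; num_true: number of ground-truth objects of the class."""
        dets = sorted(dets, key=lambda d: -d[0])   # most confident first
        tp = fp = 0
        ap = prev_recall = 0.0
        for _, correct in dets:
            tp, fp = tp + correct, fp + (not correct)
            precision = tp / (tp + fp)     # fraction of declarations right
            recall = tp / num_true         # fraction of true objects found
            ap += precision * (recall - prev_recall)
            prev_recall = recall
        return ap
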
  • 11:10 - 11:17
    Here is the precision-recall curve for a
    different category, namely
  • 11:17 - 11:22
    person detection, and the different
    curves correspond to different
  • 11:22 - 11:27
    algorithms. So, this algorithm is
    probably not a good one; this algorithm
  • 11:27 - 11:33
    is a better one. And notice that in both
    examples, we are not able to detect all
  • 11:33 - 11:38
    the people, and if you look at the 30
    percent of the people who are not
  • 11:38 - 11:43
    detected by any approach, usually there
    is heavy occlusion or an unusual pose.
  • 11:43 - 11:53
    So, there are phenomena that make life
    difficult for us. Finally, the PASCAL
  • 11:53 - 12:04
    VOC people have computed the average
    precision for every class, and they give
  • 12:04 - 12:10
    two measures. Max means the best
    algorithm for that category. So, the max
  • 12:10 - 12:14
    for motorbike is something like 58
    percent; that means that the best
  • 12:14 - 12:19
    algorithm for detecting motorbikes has an
    average precision of 58 percent. And the
  • 12:19 - 12:24
    median is, of course, the median over the
    different algorithms that were submitted.
  • 12:24 - 12:29
    So, we conclude that some categories are
    easier than others. Motorbikes are
  • 12:29 - 12:35
    probably the easiest; their average
    precision is 58 percent. And something
  • 12:35 - 12:41
    like potted plant is really hard to
    detect; the average precision there is 16
  • 12:41 - 12:46
    percent. So, if you want to ask where we
    are doing quite well: it is, I think, all
  • 12:46 - 12:52
    the categories where the average
    precision is over 50 percent, and that is
  • 12:52 - 13:00
    motorbike, bicycle, bus, airplane, horse,
    car, cat, train. So, about 50 percent.
  • 13:00 - 13:05
    You may like it or not, in the sense that
    this is a glass half full or half empty
  • 13:05 - 13:12
    situation; since it's about 50 percent,
    maybe you can call it both.
  • 13:12 - 13:17
    Let's look a little bit at some of the
    difficult categories. Here is the
  • 13:17 - 13:22
    category of boat. The average precision
    here is about [inaudible], and if you
  • 13:22 - 13:27
    look at the set of examples, you will
    see why this is so hard: there is so much
  • 13:27 - 13:32
    variation in appearance from one boat to
    another that it is really difficult to
  • 13:32 - 13:39
    train a detector that manages all these
    cases. Okay, an even more difficult
  • 13:39 - 13:47
    example: chairs. Here, we are supposed to
    mark bounding boxes corresponding to the
  • 13:47 - 13:52
    chairs, and here they are. Now, imagine
    you're looking for a HOG template which
  • 13:52 - 13:57
    is going to detect the characteristic
    edges corresponding to a chair. You can
  • 13:57 - 14:03
    see that there is really no hope of
    managing that. Probably, the way humans
  • 14:03 - 14:08
    detect chairs is by making use of the
    fact that there is a human sitting on the
  • 14:08 - 14:12
    chair in a certain pose. So, there is a
    lot of contextual information which
    currently is not being captured by the
    algorithms.
  • 14:12 - 14:20
    I'll turn to images of people now.
    Analyzing images of people is
  • 14:20 - 14:28
    very important. It enables us to build
    human-computer interaction interfaces, it
  • 14:28 - 14:37
    enables us to analyze video, recognize
    actions, and so on and so forth. It is
  • 14:37 - 14:43
    made hard by the fact that people appear
    in a variety of poses and a variety of
  • 14:43 - 14:48
    clothing, can be occluded, can be small,
    can be large, and so on. So, this is a
  • 14:48 - 14:52
    really challenging category, even though
    it's perhaps the most important category
  • 14:52 - 14:57
    for object recognition. So, I'm going to
    show you some results from an approach
  • 14:57 - 15:03
    based on poselets, the other part-based
    paradigm that I referred to. The big idea
  • 15:03 - 15:09
    is that we can build on the success of
    face detectors and pedestrian detectors.
  • 15:09 - 15:14
    Face detection, we know, works well, and
    so does pedestrian detection, when
  • 15:14 - 15:22
    you're talking about a vertical, standing
    or walking pedestrian. Essentially,
  • 15:22 - 15:26
    both of these rely on pattern matching,
    and they capture patterns that are common
  • 15:26 - 15:31
    and visually characteristic. But these
    are not the only common and
  • 15:31 - 15:35
    characteristic patterns. For example, we
    can have a pattern corresponding to a
  • 15:35 - 15:44
    pair of legs, and if we can detect that,
    we are sure we are looking at a person.
  • 15:44 - 15:47
    Or we can have a pattern which doesn't
    correspond to a single anatomical part:
  • 15:47 - 15:53
    say, half of a face and half of the
    torso, centered at the shoulder. This
  • 15:53 - 15:57
    is still a pretty characteristic
    observation for a person.
  • 15:57 - 16:05
    The way, of course, that we trained face
    detectors was that we had images where
  • 16:05 - 16:09
    all the faces had been marked out. The
    faces were then used as
  • 16:09 - 16:14
    positive examples for a machine learning
    algorithm. But how are we going
  • 16:14 - 16:19
    to find all these configurations
    corresponding to legs and face and
  • 16:19 - 16:25
    shoulders and so on? The poselet idea is
    exactly to train such detectors,
  • 16:25 - 16:31
    but without having to determine the
    configurations in advance. But first, let
  • 16:31 - 16:38
    me show you some examples of Poselets.
    The 'let' implies a small part. And the
  • 16:38 - 16:44
    way it works is that we consider the
    human pose, and a poselet corresponds to
  • 16:44 - 16:51
    a small part of it. So, the top row shows
    poselets that correspond to face, upper
  • 16:51 - 16:56
    body, and hand in a certain
    configuration. The second row corresponds
  • 16:56 - 17:02
    to two legs. The third row corresponds to
    the back view of a person. So, in fact,
  • 17:02 - 17:12
    we can have a pretty long list of these
    Poselets.
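
Roughly, each poselet detector is trained like any other template classifier; the keypoint annotations are what select its positive examples. In the sketch below, examples are image crops paired with their keypoint coordinates (None for background crops), hog_of is an assumed HOG descriptor function, and the configuration distance and its threshold are illustrative simplifications of the actual selection procedure.

    import numpy as np
    from sklearn.svm import LinearSVC

    def config_distance(kp_a, kp_b):
        """Distance between two keypoint layouts (N x 2 arrays, same
        keypoints in the same order), ignoring translation and scale."""
        a = kp_a - kp_a.mean(axis=0)
        b = kp_b - kp_b.mean(axis=0)
        a = a / (np.linalg.norm(a) + 1e-9)
        b = b / (np.linalg.norm(b) + 1e-9)
        return np.linalg.norm(a - b)

    def train_poselet(seed_kps, examples, hog_of, max_dist=0.3):
        """One poselet = one classifier for a recurring pose fragment."""
        X, y = [], []
        for patch, kps in examples:
            if kps is None:                                  # background crop
                X.append(hog_of(patch)); y.append(0)
            elif config_distance(kps, seed_kps) < max_dist:  # similar layout
                X.append(hog_of(patch)); y.append(1)
        return LinearSVC().fit(np.array(X), np.array(y))
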
  • 17:12 - 17:17
    Now, the value of these is that they
    enable us to do later tasks more
  • 17:17 - 17:22
    easily. So, for example, we can train
    gender classifiers: we want to
  • 17:22 - 17:29
    distinguish men from women, and that can
    be done from the face, also from the back
  • 17:29 - 17:35
    view of a person, and from the legs,
    because the clothing worn by many
  • 17:35 - 17:40
    women is different. So, once we have this
    idea of training poselet detectors, we
  • 17:40 - 17:47
    can actually train two versions of a
    poselet detector, one for male faces and
  • 17:47 - 17:53
    one for female faces, and we can do
    that for each detector. Essentially,
  • 17:53 - 17:59
    this gives us a handle on how to come up
    with finer-grained classifications of
  • 17:59 - 18:05
    people. So, I'm going to show you some
    results here; these are actual results
  • 18:05 - 18:09
    from this approach. The top row is where
    the algorithm thinks there are men, and
  • 18:09 - 18:15
    the bottom row is where it thinks there
    are women. There are some mistakes here.
  • 18:15 - 18:23
    For example, these are really women, and
    so are these. So, there are some
    mistakes, but it's surprisingly good.
  • 18:23 - 18:29
    Here is what the detector thinks are
    people wearing long pants, in the top
  • 18:29 - 18:34
    row, and not wearing long pants, in the
    bottom row. Notice that once we can start
  • 18:34 - 18:39
    to do this, we gain the ability to
    describe people. So, for an image, I want
  • 18:39 - 18:43
    to be able to say that this image shows
    a tall, blond man
  • 18:43 - 18:52
    wearing green trousers. Here, in the top
    row, is what the algorithm thinks are
  • 18:52 - 18:59
    people wearing hats, and the bottom row
    is people not wearing hats. This approach
  • 18:59 - 19:05
    applies to detecting actions as well. So,
    here are actions as revealed in
  • 19:05 - 19:10
    still images; you just have a single
    frame here. So, this image
  • 19:10 - 19:15
    corresponds to a sitting person, here is
    a person talking on the telephone, a
  • 19:15 - 19:20
    person riding a horse, a person running,
    and so on.
  • 19:20 - 19:28
    So, again, this Poselet paradigm can be
    adapted to this framework. For
  • 19:28 - 19:33
    example, we can train Poselets
    corresponding to phoning people, running
  • 19:33 - 19:41
    people, walking people, and people riding
    horses. I should note that the problem of
  • 19:41 - 19:46
    detecting actions is a much more general
    problem, and we obviously don't want to
  • 19:46 - 19:51
    just make use of the static information.
    If we have video and can compute optical
  • 19:51 - 19:57
    flow vectors, then that gives us an extra
    handle on this problem. The kinds of
  • 19:57 - 20:01
    actions we want to be able to recognize
    include movement and posture change,
  • 20:01 - 20:07
    object manipulation, conversational
    gestures, sign language, etc. So, if you
  • 20:07 - 20:13
    want, you can think of objects as nouns
    in English and actions as verbs in
  • 20:13 - 20:16
    English. And it turns out that some of
    the techniques that have been applied for
  • 20:16 - 20:21
    object recognition carry over to this
    domain. Techniques such as bags of
  • 20:21 - 20:26
    spatio-temporal words, which are
    generalizations of SIFT features to
  • 20:26 - 20:35
    video, turn out to be quite useful and
    give some of the best results for action
    recognition tasks.
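
A sketch of that bag-of-spatio-temporal-words pipeline: quantize local descriptors against a k-means vocabulary and classify each clip's word histogram. The descriptor extractor descriptors_of is an assumed function returning one row per space-time interest point, and the vocabulary size is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def train_action_classifier(clips, labels, descriptors_of, k=500):
        """clips: training videos; labels: their action classes."""
        all_desc = np.vstack([descriptors_of(c) for c in clips])
        vocab = KMeans(n_clusters=k).fit(all_desc)       # visual vocabulary

        def bag_of_words(clip):
            words = vocab.predict(descriptors_of(clip))  # quantize descriptors
            hist = np.bincount(words, minlength=k).astype(float)
            return hist / (hist.sum() + 1e-9)            # normalized histogram

        X = np.array([bag_of_words(c) for c in clips])
        return LinearSVC().fit(X, labels), bag_of_words
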
  • 20:35 - 20:42
    Let me conclude here. I think our
    community has made a lot of progress in
  • 20:42 - 20:47
    object recognition, action recognition,
    and so on, but a lot remains to be done.
  • 20:47 - 20:51
    There is a phrase that people in the
    multimedia information
  • 20:51 - 20:59
    systems community talk about: the
    so-called semantic gap. Their point is
  • 20:59 - 21:04
    that typically, images and videos are
    represented as pixels: pixel brightness
  • 21:04 - 21:08
    values, pixel RGB values, and so on,
    whereas what we are really interested in
  • 21:08 - 21:13
    is the semantic content. What are the
    objects in the scene? What scene is it?
  • 21:13 - 21:19
    What are the events taking place? This is
    what we would like to get at, and we're
  • 21:19 - 21:25
    not there yet; we are nowhere near human
    performance. But I think we have made
  • 21:25 - 21:29
    significant progress, and more will
    continue to happen over the next few
    years. Thank you.