
Detection of 3D objects

  • 0:00 - 0:05
    Hello. After having spent a lot of time on
    the relatively simple two-dimensional
  • 0:05 - 0:09
    problem of handwritten digit recognition,
    we are now ready to tackle the general
  • 0:09 - 0:15
    problem of detecting 3D objects in
    scenes. The setting in which we study
  • 0:15 - 0:22
    this problem these days is most commonly
    the so-called PASCAL object
  • 0:22 - 0:28
    detection challenge. This has been going
    on for about
  • 0:28 - 0:34
    five years or so. What these folks have
    done is collected a set of about
  • 0:34 - 0:40
    10,000 images, where in each of these
    images they marked a certain set of
  • 0:40 - 0:45
    objects, and these object categories
    include dining table, dog, horse,
  • 0:45 - 0:50
    motorbike, person, potted plant, sheep,
    etc. So, they have twenty different
  • 0:50 - 0:57
    categories. For each object belonging to
    a category, they have marked the bounding
  • 0:57 - 1:02
    box. So, for example, here is the bounding
    box corresponding to the dog in this
  • 1:02 - 1:07
    image, there is a bounding box
    corresponding to a horse here, and there
  • 1:07 - 1:10
    will also be bounding boxes corresponding
    to the people, because in this image we
  • 1:10 - 1:18
    have horses and people. The goal is to
    detect these objects, so what a
  • 1:18 - 1:23
    computer program is supposed to do is,
    let's say we are trying to find dogs,
  • 1:23 - 1:28
    mark bounding boxes corresponding to
    where the dogs are in the image.
  • 1:28 - 1:35
    It will then be judged by whether each
    dog is in the right
  • 1:35 - 1:40
    location: the bounding box it reports has
    to overlap sufficiently with the correct
  • 1:40 - 1:46
    bounding box. So, this is the dominant
    dataset for studying 3D object
    recognition.
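
To make the task concrete, here is a minimal sketch of what the ground truth and a detector's output boil down to. PASCAL VOC actually distributes per-image XML annotation files; the image names, coordinates, and scores below are invented for illustration.

    # Ground truth: each image lists its objects as (category, box),
    # with boxes given as (xmin, ymin, xmax, ymax) pixel coordinates.
    ground_truth = {
        "image_001.jpg": [
            ("dog",    (48, 240, 195, 371)),
            ("person", (10, 12, 352, 498)),
        ],
        "image_002.jpg": [
            ("horse",  (100, 70, 420, 380)),
            ("person", (180, 30, 300, 260)),
        ],
    }

    # A detector for one category outputs scored boxes per image; each
    # guess is judged right or wrong by its overlap with the truth.
    detections = {
        "image_001.jpg": [
            ((52, 235, 200, 368), 0.94),  # (box, confidence)
            ((300, 40, 460, 200), 0.31),
        ],
    }
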
  • 1:46 - 1:51
    Now, let's see what techniques we can use
    for addressing this problem.
  • 1:51 - 1:57
    We start, of course, with the basic
    paradigm of the multi-scale sliding
  • 1:57 - 2:02
    window. This paradigm was introduced for
    face detection back in the
  • 2:02 - 2:08
    90s, and since then it has also been used
    for pedestrian detection and so forth.
  • 2:08 - 2:14
    The basic idea here is that we consider
    a window, let's say starting in
  • 2:14 - 2:21
    the top-left corner of the image. These
    green boxes correspond to one
  • 2:21 - 2:28
    of those windows, and we are going to
    evaluate the answer to the question: is
  • 2:28 - 2:34
    there a face there? Or is there a bus
    there? Then we shift the window slightly
  • 2:34 - 2:40
    and ask the same question. And since
    people could appear at a variety of
  • 2:40 - 2:46
    sizes, we have to repeat this process
    for windows of different sizes, so as to
  • 2:46 - 2:54
    detect small objects as well as large
    objects. A good and standard building
  • 2:54 - 2:58
    block is a linear support vector machine
    trained on Histogram of Oriented
  • 2:58 - 3:05
    Gradients (HOG) features. This is a
    framework introduced by Dalal & Triggs
  • 3:05 - 3:09
    in 2005, and they give details in their
    paper about how they compute each of the
  • 3:09 - 3:14
    blocks and how they normalize. If you
    are interested in the details, you should
  • 3:14 - 3:19
    read that paper.
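
Below is a minimal sketch of this pipeline, assuming scikit-image and scikit-learn; the window size, stride, scales, and threshold are illustrative, and clf is assumed to be a linear SVM already trained on HOG features of fixed-size positive and negative example windows.

    from skimage.feature import hog
    from skimage.transform import rescale

    WIN_H, WIN_W = 128, 64            # detection window (64x128, as for pedestrians)
    STEP = 8                          # stride in pixels
    SCALES = [1.0, 0.8, 0.64, 0.5]    # shrink image => bigger effective window

    def detect(gray, clf, threshold=0.5):
        """Slide a window over an image pyramid; return (box, score) hits."""
        hits = []
        for s in SCALES:
            img = rescale(gray, s)
            for y in range(0, img.shape[0] - WIN_H, STEP):
                for x in range(0, img.shape[1] - WIN_W, STEP):
                    feat = hog(img[y:y + WIN_H, x:x + WIN_W],
                               orientations=9, pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2))
                    score = clf.decision_function([feat])[0]
                    if score > threshold:
                        # map the window back to original image coordinates
                        hits.append(((x / s, y / s,
                                      (x + WIN_W) / s, (y + WIN_H) / s), score))
        return hits
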
  • 3:19 - 3:26
    Now, note that the Dalal & Triggs
    approach was tested on pedestrians, and
  • 3:26 - 3:32
    in the case of pedestrians, a single
    block is enough: you try to detect the
  • 3:32 - 3:38
    whole object in one go. When we deal
    with more complex objects, like people
  • 3:38 - 3:43
    in general poses, or dogs and cats, we
    find that these are very non-rigid, so a
  • 3:43 - 3:50
    single rigid template is not effective.
    What we really want are part-based
  • 3:50 - 3:56
    approaches. Nowadays, there are two
    dominant part-based approaches. The first
  • 3:56 - 4:03
    is the so-called deformable part models
    due to Felzenszwalb et al.; the paper on
  • 4:03 - 4:10
    that appeared around 2010. The other
    approach is the so-called Poselets, due
  • 4:10 - 4:16
    to Lubomir Bourdev and other
    collaborators in my group. So, what's the
    basic idea? Let me get into
    Felzenszwalb's approach first.
  • 4:16 - 4:24
    Their basic idea is to have a root filter
    which is trying to find the object as a
  • 4:24 - 4:29
    whole, and then a set of part filters
    which might correspond to,
  • 4:29 - 4:37
    say, trying to detect faces or legs and
    so forth, but these part filters have to
  • 4:37 - 4:43
    fire in certain spatial relationships
    with respect to the root filter. So, the
  • 4:43 - 4:50
    overall detector is the combination of a
    holistic detector and a set of part
  • 4:50 - 4:55
    filters which have to be in a certain
    relationship with respect to the whole
  • 4:55 - 5:01
    object. This requires training both the
    root filter and the various part filters,
  • 5:01 - 5:05
    and this can be done using the so-called
    latent SVM approach, which does not
  • 5:05 - 5:12
    require any extra annotation. And note
    that I said parts such as faces and legs;
  • 5:12 - 5:17
    that's me getting carried away. The
    learned parts need not correspond to
  • 5:17 - 5:24
    anything semantically meaningful.
  • 5:24 - 5:30
    In the case of the Poselets approach, the
    idea is to have semantically meaningful
  • 5:30 - 5:35
    parts, and the way they go about doing
    this is by making use of extra
  • 5:35 - 5:39
    annotation. Suppose you have images of
    people; these images might be annotated
  • 5:39 - 5:43
    with key points corresponding to left
    shoulder, right shoulder, left elbow,
  • 5:43 - 5:48
    right elbow, and so on, while other
    object categories will have other key
  • 5:48 - 5:54
    points. For example, for an airplane, you
    might have a key point on the tip of the
  • 5:54 - 5:57
    nose or the tips of the wings, and so on.
    This requires extra work, because
  • 5:57 - 6:02
    somebody has to go through all the images
    in the dataset and mark these key
  • 6:02 - 6:08
    points, but the consequence is that we
    will be able to do a few more things
    afterwards.
  • 6:08 - 6:14
    Here's a slide which shows how object
    detection with discriminatively trained
  • 6:14 - 6:21
    part-based models works. This is the DPM
    model of Felzenszwalb, Girshick,
  • 6:21 - 6:27
    McAllester, and Ramanan, and here the
    model is illustrated with the example of
  • 6:27 - 6:32
    bicycle detection. In fact, you don't
    train just one model, you train a
  • 6:32 - 6:40
    mixture of models. So, there is a model
    here corresponding to the side view of a
  • 6:40 - 6:45
    bicycle. The root filter is shown here;
    this is a HOG template, and it is
  • 6:45 - 6:50
    looking for edges of particular
    orientations, as might be found
  • 6:50 - 6:57
    on the side view of a bicycle. Then we
    have various part filters. The part
  • 6:57 - 7:02
    filters are in fact shown here: each of
    the rectangles here corresponds
  • 7:02 - 7:07
    to a part filter. This one might
    correspond to
  • 7:07 - 7:15
    something like a template detector for
    wheels. What we have to do to
  • 7:15 - 7:21
    come up with the final score is to
    combine the score corresponding to the
  • 7:21 - 7:26
    HOG template of the root filter as well
    as the HOG templates for each of the
  • 7:26 - 7:31
    parts. Note that this detector for the
    side view of a bicycle will probably not
  • 7:31 - 7:37
    do a good job on front views of bicycles,
    like here. And so for those, they have a
  • 7:37 - 7:44
    different model. Again, the model is
    shown here, and here the parts
  • 7:44 - 7:50
    may be somewhat different. So, overall,
    you have a mixture model, with multiple
  • 7:50 - 7:55
    models corresponding to different poses,
    and each model, as I said, consists of a
  • 7:55 - 8:01
    root filter and various part filters.
    There is some subtlety in training,
  • 8:01 - 8:07
    because the annotations do not include
    labels for key points and so forth. So,
  • 8:07 - 8:11
    in the learning approach, you have to
    guess where the parts should be as part
  • 8:11 - 8:15
    of the process of training, and you can
    find the details in the paper [inaudible].
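
As a rough sketch of the scoring rule (not the authors' implementation, which evaluates filters densely over a HOG feature pyramid and uses distance transforms for speed): a hypothesis scores the root filter's response plus, for each part, the best nearby part response minus a quadratic penalty for straying from its anchor. The search radius and penalty form here are illustrative.

    import numpy as np

    def score_hypothesis(x, y, root_score, parts, radius=4):
        """root_score(x, y): root filter response at (x, y).
        parts: list of (ax, ay, (wx, wy), part_score), where (ax, ay) is
        the part's anchor offset from the root, (wx, wy) weights the
        quadratic deformation cost, and part_score(x, y) is that part
        filter's response. Part placements are latent: searched over,
        never annotated."""
        total = root_score(x, y)
        for ax, ay, (wx, wy), part_score in parts:
            best = -np.inf
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    response = part_score(x + ax + dx, y + ay + dy)
                    best = max(best, response - (wx * dx**2 + wy * dy**2))
            total += best        # each part contributes its best placement
        return total
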
  • 8:15 - 8:24
    How well does it do? There is a standard
    methodology that we use in computer
  • 8:24 - 8:30
    vision for evaluating detection
    performance, and here is how we do this
  • 8:30 - 8:35
    for the case of, say, a motorcycle
    detector. One computes the so-called
  • 8:35 - 8:40
    precision-recall curves. The idea is
    that the detection algorithm is
  • 8:40 - 8:47
    going to come up with guesses of bounding
    boxes where the motorbike may be, and we
  • 8:47 - 8:52
    can then evaluate each of these guessed
    bounding boxes: is it right or wrong? It
  • 8:52 - 8:56
    is judged to be right if its intersection
    over union with the true motorbike's
    bounding box exceeds 50 percent.
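
The overlap test is simple to state in code; a sketch, with boxes given as (xmin, ymin, xmax, ymax):

    def iou(a, b):
        """Intersection over union of two boxes (xmin, ymin, xmax, ymax)."""
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    # A guessed box counts as correct when iou(guess, truth) > 0.5.
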
  • 8:56 - 9:05
    Then, we have a choice of how strict to
    be with the threshold.
  • 9:05 - 9:10
    We could pass through most of our
    candidate bounding boxes, and if
  • 9:10 - 9:15
    you guess enough of them, then of course
    you are guaranteed to find all of the
  • 9:15 - 9:19
    motorbikes, but many of the guesses will
    be wrong. So, the way we do this is that
  • 9:19 - 9:24
    we pick a threshold, and with that
    threshold we can evaluate the precision
  • 9:24 - 9:31
    and recall. Precision and recall:
    these terms have the following meaning.
  • 9:31 - 9:36
    Precision means: what fraction of the
    detections that you declared are actually
  • 9:36 - 9:44
    true motorcycles? Recall is the question
    of how many of the true motorcycles
  • 9:44 - 9:49
    you managed to detect. Ideally, I want
    precision to be 100 percent
  • 9:49 - 9:55
    and recall to be 100 percent. In reality,
    it doesn't work out that way. We're able
  • 9:55 - 10:01
    to detect some fraction of the true
    motorbikes. So here, for example, at this
  • 10:01 - 10:07
    point, the precision is 0.7. That means
    at this point, the detections
  • 10:07 - 10:13
    that we declare are 70 percent accurate.
    Now, this point corresponds to a
  • 10:13 - 10:19
    recall of something like 55 percent,
    meaning that at that threshold, we recall
  • 10:19 - 10:25
    55 percent of the true motorbikes. As we
    make the threshold more lenient, we are
  • 10:25 - 10:29
    going to get more false positives, but we
    will manage to detect more of the true
  • 10:29 - 10:33
    motorbikes. So, as this curve goes down
    in this range, this
  • 10:33 - 10:39
    particular detector manages to detect
    something like 70 to 80
  • 10:39 - 10:46
    percent of the true motorcycles. The
    curves in this figure correspond to
  • 10:46 - 10:51
    different algorithms, and the way we
    compare different algorithms is by
  • 10:51 - 10:59
    measuring the area under the curve. In
    the ideal case, of course, it would be
  • 10:59 - 11:04
    100 percent. In fact, it is something
    like 50 to 60 percent for these cases,
  • 11:04 - 11:10
    and that is what we call AP, or Average
    Precision. That is how we compare
    different algorithms.
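
Here is a sketch of that computation. Sorting detections by confidence and walking down the list is equivalent to sweeping the threshold from strict to lenient; the area is accumulated with a plain rectangle rule, whereas the actual PASCAL VOC tooling uses an interpolated variant.

    def average_precision(dets, num_true):
        """dets: (score, is_correct) for every detection declared on the
        test set; num_true: number of ground-truth objects of the class."""
        dets = sorted(dets, key=lambda d: -d[0])   # most confident first
        tp = fp = 0
        ap = prev_recall = 0.0
        for _, correct in dets:
            tp, fp = tp + correct, fp + (not correct)
            precision = tp / (tp + fp)     # fraction of declarations right
            recall = tp / num_true         # fraction of true objects found
            ap += precision * (recall - prev_recall)
            prev_recall = recall
        return ap
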
  • 11:10 - 11:17
    Here is the precision-recall curve for a
    different category, namely
  • 11:17 - 11:22
    person detection, and the different
    curves correspond to different
  • 11:22 - 11:27
    algorithms. So, this algorithm is
    probably not a good one; this algorithm
  • 11:27 - 11:33
    is a better one. And notice that in both
    examples, we are not able to detect all
  • 11:33 - 11:38
    the people, and if you look at the 30
    percent of the people who are not
  • 11:38 - 11:43
    detected by any approach, usually there
    is heavy occlusion or an unusual pose.
  • 11:43 - 11:53
    So, there are phenomena that make life
    difficult for us. Finally, the PASCAL
  • 11:53 - 12:04
    VOC people have computed the average
    precision for every class, and they give
  • 12:04 - 12:10
    two measures. Max means the best
    algorithm for that category. So, the max
  • 12:10 - 12:14
    for motorbike is something like 58
    percent; that means that the best
  • 12:14 - 12:19
    algorithm for detecting motorbikes has an
    average precision of 58 percent. And the
  • 12:19 - 12:24
    median is, of course, the median over the
    different algorithms that were submitted.
  • 12:24 - 12:29
    So, we conclude that some categories are
    easier than others. Motorbikes are
  • 12:29 - 12:35
    probably the easiest; their average
    precision is 58 percent. And something
  • 12:35 - 12:41
    like potted plant is really hard to
    detect; the average precision there is 16
  • 12:41 - 12:46
    percent. So, if you want to ask where we
    are doing quite well: it is, I think, all
  • 12:46 - 12:52
    the categories where the average
    precision is over 50 percent, and that is
  • 12:52 - 13:00
    motorbike, bicycle, bus, airplane, horse,
    car, cat, train. So, about 50 percent.
  • 13:00 - 13:05
    You may like it or not, in the sense that
    this is a glass half full or half empty
  • 13:05 - 13:12
    situation; since it's about 50 percent,
    maybe you can call it both.
  • 13:12 - 13:17
    Let's look a little bit at some of the
    difficult categories. Here is the
  • 13:17 - 13:22
    category of boat. The average precision
    here is about [inaudible], and if you
  • 13:22 - 13:27
    look at the set of examples, you will
    see why this is so hard: there is so much
  • 13:27 - 13:32
    variation in appearance from one boat to
    another that it is really difficult to
  • 13:32 - 13:39
    train a detector that manages all these
    cases. Okay, an even more difficult
  • 13:39 - 13:47
    example: chairs. Here, we are supposed to
    mark bounding boxes corresponding to the
  • 13:47 - 13:52
    chairs, and here they are. Now, imagine
    you're looking for a HOG template which
  • 13:52 - 13:57
    is going to detect the characteristic
    edges corresponding to a chair. You can
  • 13:57 - 14:03
    see that there is really no hope of
    managing that. Probably, the way humans
  • 14:03 - 14:08
    detect chairs is by making use of the
    fact that there is a human sitting on the
  • 14:08 - 14:12
    chair in a certain pose. So, there is a
    lot of contextual information which
    currently is not being captured by the
    algorithms.
  • 14:12 - 14:20
    I'll turn to images of people now.
    Analyzing images of people is
  • 14:20 - 14:28
    very important. It enables us to build
    human-computer interaction interfaces, it
  • 14:28 - 14:37
    enables us to analyze video, recognize
    actions, and so on and so forth. It is
  • 14:37 - 14:43
    made hard by the fact that people appear
    in a variety of poses and a variety of
  • 14:43 - 14:48
    clothing, can be occluded, can be small,
    can be large, and so on. So, this is a
  • 14:48 - 14:52
    really challenging category, even though
    it's perhaps the most important category
  • 14:52 - 14:57
    for object recognition. So, I'm going to
    show you some results from an approach
  • 14:57 - 15:03
    based on poselets, the other part-based
    paradigm that I referred to. The big idea
  • 15:03 - 15:09
    is that we can build on the success of
    face detectors and pedestrian detectors.
  • 15:09 - 15:14
    Face detection, we know, works well, and
    so does pedestrian detection, when
  • 15:14 - 15:22
    you're talking about a vertical, standing
    or walking pedestrian. Essentially,
  • 15:22 - 15:26
    both of these rely on pattern matching,
    and they capture patterns that are common
  • 15:26 - 15:31
    and visually characteristic. But these
    are not the only common and
  • 15:31 - 15:35
    characteristic patterns. For example, we
    can have a pattern corresponding to a
  • 15:35 - 15:44
    pair of legs, and if we can detect that,
    we are sure we are looking at a person.
  • 15:44 - 15:47
    Or we can have a pattern which doesn't
    correspond to a single anatomical part:
  • 15:47 - 15:53
    say, half of a face and half of the
    torso, centered at the shoulder. This
  • 15:53 - 15:57
    is still a pretty characteristic
    observation for a person.
  • 15:57 - 16:05
    The way, of course, that we trained face
    detectors was that we had images where
  • 16:05 - 16:09
    all the faces had been marked out. The
    faces were then used as
  • 16:09 - 16:14
    positive examples for a machine learning
    algorithm. But how are we going
  • 16:14 - 16:19
    to find all these configurations
    corresponding to legs and face and
  • 16:19 - 16:25
    shoulders and so on? The poselet idea is
    exactly to train such detectors,
  • 16:25 - 16:31
    but without having to determine the
    configurations in advance. But first, let
  • 16:31 - 16:38
    me show you some examples of Poselets.
    The 'let' implies a small part. And the
  • 16:38 - 16:44
    way it works is that we consider the
    human pose, and a poselet corresponds to
  • 16:44 - 16:51
    a small part of it. So, the top row shows
    poselets that correspond to face, upper
  • 16:51 - 16:56
    body, and hand in a certain
    configuration. The second row corresponds
  • 16:56 - 17:02
    to two legs. The third row corresponds to
    the back view of a person. So, in fact,
  • 17:02 - 17:12
    we can have a pretty long list of these
    Poselets.
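
Roughly, each poselet detector is trained like any other template classifier; the keypoint annotations are what select its positive examples. In the sketch below, examples are image crops paired with their keypoint coordinates (None for background crops), hog_of is an assumed HOG descriptor function, and the configuration distance and its threshold are illustrative simplifications of the actual selection procedure.

    import numpy as np
    from sklearn.svm import LinearSVC

    def config_distance(kp_a, kp_b):
        """Distance between two keypoint layouts (N x 2 arrays, same
        keypoints in the same order), ignoring translation and scale."""
        a = kp_a - kp_a.mean(axis=0)
        b = kp_b - kp_b.mean(axis=0)
        a = a / (np.linalg.norm(a) + 1e-9)
        b = b / (np.linalg.norm(b) + 1e-9)
        return np.linalg.norm(a - b)

    def train_poselet(seed_kps, examples, hog_of, max_dist=0.3):
        """One poselet = one classifier for a recurring pose fragment."""
        X, y = [], []
        for patch, kps in examples:
            if kps is None:                                  # background crop
                X.append(hog_of(patch)); y.append(0)
            elif config_distance(kps, seed_kps) < max_dist:  # similar layout
                X.append(hog_of(patch)); y.append(1)
        return LinearSVC().fit(np.array(X), np.array(y))
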
  • 17:12 - 17:17
    Now, the value of these is that they
    enable us to do later tasks more
  • 17:17 - 17:22
    easily. So, for example, we can train
    gender classifiers: we want to
  • 17:22 - 17:29
    distinguish men from women, and that can
    be done from the face, also from the back
  • 17:29 - 17:35
    view of a person, and from the legs,
    because the clothing worn by many
  • 17:35 - 17:40
    women is different. So, once we have this
    idea of training poselet detectors, we
  • 17:40 - 17:47
    can actually train two versions of a
    poselet detector, one for male faces and
  • 17:47 - 17:53
    one for female faces, and we can do
    that for each detector. Essentially,
  • 17:53 - 17:59
    this gives us a handle on how to come up
    with finer-grained classifications of
  • 17:59 - 18:05
    people. So, I'm going to show you some
    results here; these are actual results
  • 18:05 - 18:09
    from this approach. The top row is where
    the algorithm thinks there are men, and
  • 18:09 - 18:15
    the bottom row is where it thinks there
    are women. There are some mistakes here.
  • 18:15 - 18:23
    For example, these are really women, and
    so are these. So, there are some
    mistakes, but it's surprisingly good.
  • 18:23 - 18:29
    Here is what the detector thinks are
    people wearing long pants, in the top
  • 18:29 - 18:34
    row, and not wearing long pants, in the
    bottom row. Notice that once we can start
  • 18:34 - 18:39
    to do this, we gain the ability to
    describe people. So, for an image, I want
  • 18:39 - 18:43
    to be able to say that this image shows
    a tall, blond man
  • 18:43 - 18:52
    wearing green trousers. Here, in the top
    row, is what the algorithm thinks are
  • 18:52 - 18:59
    people wearing hats, and the bottom row
    is people not wearing hats. This approach
  • 18:59 - 19:05
    applies to detecting actions as well. So,
    here are actions as revealed in
  • 19:05 - 19:10
    still images; you just have a single
    frame here. So, this image
  • 19:10 - 19:15
    corresponds to a sitting person, here is
    a person talking on the telephone, a
  • 19:15 - 19:20
    person riding a horse, a person running,
    and so on.
  • 19:20 - 19:28
    So, again, this Poselet paradigm can be
    adapted to this framework. For
  • 19:28 - 19:33
    example, we can train Poselets
    corresponding to phoning people, running
  • 19:33 - 19:41
    people, walking people, and people riding
    horses. I should note that the problem of
  • 19:41 - 19:46
    detecting actions is a much more general
    problem, and we obviously don't want to
  • 19:46 - 19:51
    just make use of the static information.
    If we have video and can compute optical
  • 19:51 - 19:57
    flow vectors, then that gives us an extra
    handle on this problem. The kinds of
  • 19:57 - 20:01
    actions we want to be able to recognize
    include movement and posture change,
  • 20:01 - 20:07
    object manipulation, conversational
    gestures, sign language, etc. So, if you
  • 20:07 - 20:13
    want, you can think of objects as nouns
    in English and actions as verbs in
  • 20:13 - 20:16
    English. And it turns out that some of
    the techniques that have been applied for
  • 20:16 - 20:21
    object recognition carry over to this
    domain. Techniques such as bags of
  • 20:21 - 20:26
    spatio-temporal words, which are
    generalizations of SIFT features to
  • 20:26 - 20:35
    video, turn out to be quite useful and
    give some of the best results for action
    recognition tasks.
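
A sketch of that bag-of-spatio-temporal-words pipeline: quantize local descriptors against a k-means vocabulary and classify each clip's word histogram. The descriptor extractor descriptors_of is an assumed function returning one row per space-time interest point, and the vocabulary size is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def train_action_classifier(clips, labels, descriptors_of, k=500):
        """clips: training videos; labels: their action classes."""
        all_desc = np.vstack([descriptors_of(c) for c in clips])
        vocab = KMeans(n_clusters=k).fit(all_desc)       # visual vocabulary

        def bag_of_words(clip):
            words = vocab.predict(descriptors_of(clip))  # quantize descriptors
            hist = np.bincount(words, minlength=k).astype(float)
            return hist / (hist.sum() + 1e-9)            # normalized histogram

        X = np.array([bag_of_words(c) for c in clips])
        return LinearSVC().fit(X, labels), bag_of_words
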
  • 20:35 - 20:42
    Let me conclude here. I think our
    community has made a lot of progress in
  • 20:42 - 20:47
    object recognition, action recognition,
    and so on, but a lot remains to be done.
  • 20:47 - 20:51
    There is a phrase that people in the
    multimedia information
  • 20:51 - 20:59
    systems community talk about: the
    so-called semantic gap. Their point is
  • 20:59 - 21:04
    that typically, images and videos are
    represented as pixels: pixel brightness
  • 21:04 - 21:08
    values, pixel RGB values, and so on,
    whereas what we are really interested in
  • 21:08 - 21:13
    is the semantic content. What are the
    objects in the scene? What scene is it?
  • 21:13 - 21:19
    What are the events taking place? This is
    what we would like to get at, and we're
  • 21:19 - 21:25
    not there yet; we are nowhere near human
    performance. But I think we have made
  • 21:25 - 21:29
    significant progress, and more will
    continue to happen over the next few
    years. Thank you.