Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem, which is that of finding 3D objects and scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge. This has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images where, in each of these images, they have marked a certain set of objects, and these object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, etc. So, they have twenty different categories. For each object belonging to a category, they have marked the bounding box. So, for example, here is the bounding box corresponding to the dog in this image, and there is a bounding box corresponding to a horse here, and there will also be bounding boxes corresponding to the people, because in this image we have horses and people. The goal is to detect these objects, and so what a computer program is supposed to do is, let's say, we are trying to find dogs.
What you are supposed to do is mark bounding boxes corresponding to where the dogs are in the image, and then you will be judged by whether the dog is in the right location. So, the bounding box has to overlap sufficiently with the correct bounding box. This is the dominant dataset for studying 3D object recognition. Now, let's see what techniques we can use for addressing this problem. We start, of course, with the basic paradigm of the multi-scale sliding window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. The basic idea is that we are going to consider a window, let's say starting in the top-left corner of the image. So, this green box corresponds to one of those windows, and then we are going to evaluate the answer to the question: is there a face there? Or is there a bus there? Then we shift the window slightly and ask the same question. And since people could be a variety of sizes, we have to repeat this process for different sizes of windows, so as to detect small objects as well as large objects. A good and standard building block is a linear support vector machine trained on Histogram of Oriented Gradients (HOG) features.
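The multi-scale sliding-window paradigm can be sketched roughly as follows. This is a toy illustration, not the actual Dalal & Triggs pipeline: `score_fn` stands in for the trained classifier (a linear SVM over HOG features in their framework), and the image pyramid here uses crude nearest-neighbour downsampling.

```python
import numpy as np

def pyramid(image, scale_step=1.25, min_size=(32, 32)):
    """Yield (scale, level) pairs: the image repeatedly downsampled."""
    scale = 1.0
    level = image
    while level.shape[0] >= min_size[0] and level.shape[1] >= min_size[1]:
        yield scale, level
        scale *= scale_step
        # Nearest-neighbour downsample by index selection (a stand-in
        # for proper image resizing).
        rows = np.linspace(0, image.shape[0] - 1, int(image.shape[0] / scale)).astype(int)
        cols = np.linspace(0, image.shape[1] - 1, int(image.shape[1] / scale)).astype(int)
        level = image[rows][:, cols]

def detect(image, score_fn, window=(32, 32), stride=8, threshold=0.5):
    """Score every window at every pyramid level; keep those above threshold.

    Returns (x0, y0, x1, y1, score) tuples in original-image coordinates.
    score_fn stands in for a trained classifier such as a linear SVM on HOG.
    """
    hits = []
    for scale, level in pyramid(image, min_size=window):
        for y in range(0, level.shape[0] - window[0] + 1, stride):
            for x in range(0, level.shape[1] - window[1] + 1, stride):
                patch = level[y:y + window[0], x:x + window[1]]
                s = score_fn(patch)
                if s >= threshold:
                    # Map the window back to original-image coordinates.
                    hits.append((x * scale, y * scale,
                                 (x + window[1]) * scale,
                                 (y + window[0]) * scale, s))
    return hits
```

Detecting small objects corresponds to the fine pyramid levels and large objects to the coarse ones, which is exactly the point of repeating the scan over window sizes.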
This is a framework introduced by Dalal & Triggs in 2005, and they have details in their paper about how they compute each of the blocks and how they normalize; if you are interested in the details, you should read that paper. Now, note that the Dalal & Triggs approach was tested on pedestrians, and in the case of pedestrians a single block is enough: you try to detect the whole object in one go. Now, when we deal with more complex objects, like people in general poses, or dogs and cats, we find that these are very non-rigid, so a single rigid template is not effective.
What we really want are part-based approaches. Nowadays, there are two dominant part-based approaches. The first is the so-called deformable part models due to Felzenszwalb et al.; their paper appeared probably in 2010. The other approach is the so-called Poselets, and this is due to Lubomir Bourdev and various other collaborators in my group. So, what's the basic idea? Let me get into Felzenszwalb's approach first.
Their basic idea is to have a root filter which is trying to find the object as a whole, and then a set of part filters which might correspond to, say, trying to detect faces or legs and so forth; but these part filters have to fire in certain spatial relationships with respect to the root filter. So, the overall detector is the combination of a holistic detector and a set of part filters which have to be in a certain relationship with respect to the whole object. This requires training both the root filter and the various part filters, and this can be done using the so-called Latent SVM approach, which does not require any extra annotation. And note that I said parts such as faces and legs; that's me getting carried away. The detector parts need not correspond to anything semantically meaningful. In the case of the Poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. So, suppose you have images of people; these images might be annotated with keypoints corresponding to left shoulder, right shoulder, left elbow, right elbow and so on, while other object categories will have other keypoints: for example, for an airplane you might have a keypoint on the tip of the nose or the tips of the wings and so forth. This requires extra work, because somebody has to go through all the images in the dataset and mark these keypoints, but the consequence will be that we'll be able to do a few more things afterwards. Here's a slide which shows how object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model has been illustrated on the problem of bicycle detection. In fact, you don't train just one model, you train a mixture of models. So, there is a model here corresponding to the side view of a bicycle. The root filter is shown here; it is a HOG template, looking for edges of particular orientations as might be found on the side view of a bicycle. Then we have various part filters, which are shown here: each of these rectangles corresponds to a part filter.
So, this one here might correspond to something like a template detector for wheels. And what we have to do to come up with the final score is to combine the score corresponding to the HOG template of the root filter as well as the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like here. And so for those, they will have a different model. Again, the model is shown here, and here the wheel parts may be somewhat different. So, overall, you have a mixture model, with multiple models corresponding to different poses, and each model, as I said, consists of a root filter and various part filters. There is some subtlety in training, because there are no annotations of parts or keypoints; so in the learning approach here, you have to guess where the parts should be as part of the process of training, and you can find the details in their [inaudible]. How well does it do? Okay, there is a standard methodology that we use in computer vision for evaluating detection performance, and here is how we do this for the case of, say, a motorcycle detector.
One computes the so-called precision-recall curve. The idea is that the detection algorithm is going to come up with guesses of bounding boxes where the motorbike may be, and we can then evaluate, for each of these guessed bounding boxes, whether it is right or wrong; it is judged to be right if its intersection over union with respect to the true motorbike bounding box is above 50%. Then, we have a choice of how strict to be in setting a threshold.
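The overlap test is the standard intersection-over-union criterion: a guessed box counts as correct when its IoU with the true box exceeds 0.5. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes shifted by half a width: 50 intersection / 150 union.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ≈ 0.333, so below the 0.5 bar
```

Note how strict the 0.5 threshold is: even a box covering half of the true object fails the test.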
We could pass through most of our candidate bounding boxes, and if we guess enough of them, then of course we are guaranteed to find all of the motorbikes; but that hardly seems right. So, the way we do this is that we pick a threshold, and with that threshold we can evaluate the precision and recall. Precision and recall: these terms have the following meaning.
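As a minimal sketch (the counts here are illustrative, not from the lecture's figures):

```python
def precision_recall(num_true_positives, num_detections, num_ground_truth):
    """Precision: fraction of declared detections that are correct.
    Recall: fraction of true objects that were detected."""
    precision = num_true_positives / num_detections
    recall = num_true_positives / num_ground_truth
    return precision, recall

# Say 7 of our 10 declared detections are real motorbikes, and the
# image set contains 12 motorbikes in total.
print(precision_recall(7, 10, 12))  # (0.7, 0.5833...)
```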
Precision means: what fraction of the detections that you declared are actually true motorcycles? Recall is the question of how many of the true motorcycles you managed to detect. Ideally, we want precision to be 100 percent and recall to be 100 percent. In reality, it doesn't work out that way; we are able to detect some fraction of the true motorbikes. So here, for example, at this point the precision is 0.7, meaning that the detections we declare are 70 percent accurate. Now, this point corresponds to a recall of something like 55 percent, meaning that at that threshold we recall 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false [inaudible], but we will manage to detect more of the true motorbikes. So, as these curves go down in this range, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case, of course, that would be 100%.
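One common way to approximate the area under the precision-recall curve is to sweep down the score-ranked detections, accumulating precision at each true positive; a sketch, with made-up labels:

```python
def average_precision(labels, num_ground_truth):
    """Approximate area under the precision-recall curve.

    labels: detections sorted by descending score; True marks a correct hit.
    Each true positive contributes the precision at its rank, and the sum
    is normalized by the number of ground-truth objects.
    """
    true_positives = 0
    ap = 0.0
    for rank, is_hit in enumerate(labels, start=1):
        if is_hit:
            true_positives += 1
            ap += true_positives / rank  # precision at this recall step
    return ap / num_ground_truth

# Four ground-truth objects; our ranked detections hit, hit, miss, hit.
print(average_precision([True, True, False, True], 4))  # (1 + 1 + 0.75)/4 = 0.6875
```

A perfect detector, whose ranked list is all hits covering every object, gets an AP of 1.0; the curves in the figure land around 0.5 to 0.6.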
In fact, it is something like 50 to 60 percent for these cases, and that is what we call AP, or Average Precision; that is how we compare different algorithms. Here is the precision-recall curve for a different category, namely person detection, and the different curves correspond to different algorithms; this algorithm is probably not a good one, this algorithm is a better one. And notice that in both examples we are not able to detect all the people; if you look through the 30 percent of the people which are not detected by any approach, usually there is heavy occlusion or an unusual pose, and so on. So, these are the phenomena that make life difficult for us. Finally, the PASCAL VOC people have computed the average precision for every class, and they give two measures. Max means the best algorithm for that category; so the max for motorbike is something like 58%, meaning that the best algorithm for detecting motorbikes has an average precision of 58%. And the median is, of course, the median over the different algorithms that were submitted. So, we conclude that some categories are easier than others. Motorbikes are probably the easiest; their average precision is 58%.
And something like potted plant is really hard to detect: the average precision there is 16%. So, if you want to ask where we are doing quite well: I think it's all the categories where the average precision is over 50 percent, and that is motorbike, bicycle, bus, airplane, horse, car, cat, train. So, about 50%. You may like that or not; it's a case of the glass being half full or half empty, and since it's about 50%, maybe you can call it both. Let's look a little bit at some of the difficult categories. Here is the category of boat; the average precision here is about [inaudible], and if you look at the set of examples, you will see why this is so hard: there is so much variation in appearance from one boat to another that it's really difficult to train a detector to manage all these cases. Okay, an even more difficult example: chairs. Here we are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now, imagine you're looking for a HOG template which is going to detect the characteristic edges corresponding to a chair. You can really see that there is no hope of managing that. Probably, the way humans detect chairs is by making use of the fact that there's a human sitting on them in a certain pose; so there is a lot of contextual information which currently is not being captured by the algorithms. I'll turn to images of people now. Analyzing images of people is very important. It enables us to build human-computer interaction applications, it enables us to analyze video, recognize actions, and so on and so forth. It's made hard by the fact that people appear in a variety of poses and a variety of clothing, can be occluded, can be small, can be large, and so on. So, this is a really challenging category, even though it's perhaps the most important category for object recognition. So, I'm going to show you some research from an approach which is based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors.
Face detection, we know, works well, and so does pedestrian detection when you're talking about an upright standing or walking pedestrian. Essentially, both of these rely on pattern matching, and they capture patterns that are common and visually characteristic. But these are not the only common and visually characteristic patterns. For example, we can have patterns corresponding to a pair of legs.
And if we can detect those, we are sure that we are looking at a person. Or we can have a pattern which doesn't correspond to a single anatomical part, say half of the face, half of the torso and part of the shoulder; this is a pretty characteristic observation for a person.
Now, the way we train face detectors, of course, was that we had images where all the faces had been marked out, so the faces could be used as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs and face and shoulders and so on? The poselet idea is exactly to train such detectors, but without determining them in advance. First, let me show you some examples of poselets. The term poselet implies a small part of a pose: consider the human pose; a poselet corresponds to a small part of it. So, the top row here corresponds to the face, upper body, and hand in a certain configuration. The second row corresponds to two legs. The third row corresponds to the back view of a person. In fact, we can have a pretty long list of these poselets. Now, the value of these is that they enable us to do later tasks more easily. So, for example, we can train gender classifiers. Suppose we want to distinguish men from women; that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by men and women is different. So, once we have this idea of training poselet detectors, we can actually train two versions of each poselet detector, one for male faces and one for female faces, and we can do that for each detector; essentially this gives us a handle on how to come up with finer-grained classification of people. So, I'm going to show you some results here. These are actual results from this approach. The top row is what the algorithm thinks are men and the bottom row is what it thinks are women. There are some mistakes here; for example, these are really women, and so are these. So there are some mistakes, but it's surprisingly good.
Here is what the detector thinks are people wearing long pants in the top row, and not wearing long pants in the bottom row. Notice that once we can start to do this, we get the ability to describe people. So, in an image, I want to be able to say that this is a person who is a tall, blond man wearing green trousers. Here, in the top row, is what the algorithm thinks are people wearing hats, and in the bottom row, people not wearing hats. This approach applies to detecting actions as well. Here are actions as revealed in still images; you just have a single frame. So, this image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework; for example, we can train poselets corresponding to phoning people, running people, walking people, and riding people. I should note that the problem of detecting actions is a much more general problem, and we obviously don't want to just make use of the static information. If we have video and we can compute optical flow vectors, then that gives us an extra handle on this problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gesture, sign language, etc. So, if you want, you can think of objects as nouns in English and actions as verbs in English.
And it turns out that some of the techniques that have been applied to object recognition carry over to this domain: techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video.
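The bag-of-words representation itself is easy to sketch: quantize each local descriptor to its nearest codeword and histogram the assignments. In a real system the codebook would come from k-means over training descriptors (image SIFT or spatio-temporal words); the tiny codebook and descriptors below are made up purely for illustration:

```python
import numpy as np

def bag_of_words(descriptors, codebook):
    """Histogram of nearest-codeword assignments, L1-normalized.

    descriptors: (n, d) array of local features (spatio-temporal words
                 in the video case).
    codebook:    (k, d) array of cluster centres, e.g. from k-means.
    """
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy codebook of two codewords and four descriptors, two near each codeword.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descs = np.array([[0.1, 0.2], [9.5, 10.1], [0.3, -0.1], [10.2, 9.9]])
print(bag_of_words(descs, codebook))  # [0.5 0.5]
```

The resulting fixed-length histogram is what gets fed to a classifier, which is what makes the representation convenient for recognition.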
These turn out to be quite useful and give some of the best results on action recognition tasks. Let me conclude here. I think our community has made a lot of progress on object recognition, action recognition and so on, but a lot remains to be done. There is this phrase that people in the multimedia information systems community talk about: the so-called semantic gap. Their point is that typically images and videos are represented as pixels: pixel brightness values, pixel RGB values, and so on. What we are really interested in is the semantic content: What are the objects in the scene? What scene is it? What are the events taking place? This is what we would like to get at, and we're not there yet. We are nowhere near human performance, but I think we have made significant progress, and more will continue to happen over the next few years. Thank you.