Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem of detecting 3D objects and scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge. This has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images, and in each of these images they have marked a certain set of objects. These object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, etc. So, they have twenty different
categories. For each object belonging to a category, they have marked the bounding box. So, for example, here is the bounding box corresponding to the dog in this image, and there is a bounding box corresponding to a horse here; there will also be bounding boxes corresponding to the people, because in this image we have horses and people. The goal is to detect these objects, so what a computer program is supposed to do, say when we are trying to find dogs, is to mark bounding boxes corresponding to where the dogs are in the image. It will then be judged by whether the dog is in the right location, so the bounding box has to overlap sufficiently with the correct bounding box. So, this is the
dominant data set for studying 3D object
recognition. Now, let's see what
techniques we can use for addressing this
problem. So, we start, of course, with the
basic paradigm of the multi-scale sliding
window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. So,
the basic idea here is that we are going to consider a window, say starting in the top-left corner of the image. This green box corresponds to one of those windows, and we are going to evaluate the answer to the question: is there a face there, or is there a bus there? Then we shift the window slightly and ask the same question. And since people can appear at a variety of sizes, we have to repeat this process for windows of different sizes, so as to detect small objects as well as large objects. A good and standard building block is a linear support vector machine trained on histogram of oriented gradients (HOG) features.
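The loop just described can be sketched in a few lines. Everything here is illustrative: a 2x2 stand-in template plays the role of the scorer, and a real detector would substitute HOG features and a trained linear SVM for `score_window`.

```python
# Multi-scale sliding-window detection: a toy sketch.
# Assumptions (not from the lecture): a grayscale image as a list of
# lists of floats, a fixed 2x2 "template" standing in for the
# HOG + linear-SVM scorer, and simple 2x downsampling for scales.

def downsample(img):
    """Halve the image in each dimension by averaging 2x2 blocks."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*y][2*x] + img[2*y][2*x+1] +
              img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(w)] for y in range(h)]

def score_window(img, y, x, weights):
    """Linear score: dot product of a 2x2 patch with the weight template."""
    return sum(img[y+dy][x+dx] * weights[dy][dx]
               for dy in range(2) for dx in range(2))

def detect(img, weights, threshold, n_scales=3):
    """Slide a 2x2 window over several scales; return (scale, y, x, score)."""
    hits = []
    for s in range(n_scales):
        for y in range(len(img) - 1):
            for x in range(len(img[0]) - 1):
                sc = score_window(img, y, x, weights)
                if sc >= threshold:
                    hits.append((s, y, x, sc))
        if len(img) < 4 or len(img[0]) < 4:
            break
        img = downsample(img)  # coarser scale finds larger objects
    return hits
```

Real systems also apply non-maximum suppression afterwards, so that one object does not produce many overlapping hits.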
This is a framework introduced by Dalal & Triggs in 2005, and they give details in their paper about how they compute each of the blocks and how they normalize; if you are interested in the details, you should read that paper. Now, note that the Dalal & Triggs approach was tested on pedestrians, and in the case of pedestrians a single template is enough: you try to detect the whole object in one go. When we deal with more complex objects, like people in general poses, or dogs and cats, we find that these are very non-rigid, so one single rigid template is not effective.
What we really want are part-based approaches. Nowadays, there are two dominant part-based approaches. The first is the so-called deformable part models, due to Felzenszwalb et al.; their paper appeared around 2010. The other approach is so-called poselets, due to Lubomir Bourdev and various other collaborators in my group. So, what's the basic idea? Let me get into Felzenszwalb's approach first. Their basic idea is to have a root filter, which tries to find the object as a whole, and then a set of part filters, which might correspond to, say, detectors for faces or legs and so forth; but these part filters have to fire in certain spatial relationships with respect to the root filter. So, the overall detector is the combination of a holistic detector and a set of part filters which have to be in a certain relationship with respect to the whole object. This requires training both the root filter and the various part filters, and this can be done using a so-called latent SVM approach, which does not require any extra annotation. And note that I said parts
such as faces and legs; that's me getting carried away. The detector parts need not correspond to anything semantically meaningful. In the case of the poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. So, suppose you have images of people; these images might be annotated with key points corresponding to left shoulder, right shoulder, left elbow, right elbow and so on. For other object categories there will be other key points; for example, for an airplane you might have a key point on the tip of the nose or the tips of the wings, and so on and so forth. This requires extra work, because somebody has to go through all the images in the data set and mark these key points, but the consequence will be that we are able to do a few more things afterwards. Here's a slide which shows how
object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model is illustrated on bicycle detection. In fact, you don't train just one model; you train a mixture of models. So, there is a model here corresponding to the side view of a bicycle. The root filter is shown here: this is a HOG template, looking for edges of particular orientations, as might be found on the side view of a bicycle. Then we have various part filters, which are in fact shown here. Each of the rectangles corresponds to a part filter.
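The scoring rule of such a model — the root-filter response plus, for each part, the best part-filter response minus a penalty for drifting from its expected position — can be sketched as follows. The dict-based score maps, anchor offsets, and quadratic deformation weights are illustrative stand-ins, not the authors' implementation.

```python
# Toy sketch of deformable-part-model scoring.
# Assumptions (illustrative only): the responses of the root filter and
# of each part filter have already been computed as sparse score maps
# (dicts from (y, x) to float); each part has an anchor offset relative
# to the root and quadratic deformation weights (ay, ax).

def dpm_score(root_map, parts, y, x):
    """Score a root placement at (y, x).

    parts: list of (part_map, (anchor_dy, anchor_dx), (ay, ax)).
    For each part we take the best placement: the part response minus a
    quadratic penalty for displacement from its anchor position.
    """
    total = root_map[(y, x)]
    for part_map, (ady, adx), (ay, ax) in parts:
        best = float("-inf")
        for (py, px), resp in part_map.items():
            dy = py - (y + ady)   # displacement from the anchor
            dx = px - (x + adx)
            best = max(best, resp - ay * dy * dy - ax * dx * dx)
        total += best
    return total
```

In the real system this maximization over part placements is what the latent variables in the latent SVM training refer to: the part positions are never annotated, only inferred.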
One here, for example, corresponds to something like a template detector for wheels. To come up with the final score, we combine the score from the HOG template of the root filter with the scores of the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like here. For those, they have a different model, again shown here, and here the parts may be somewhat different. So, overall, you have a mixture model, with multiple models corresponding to different poses, and each model, as I said, consists of a root filter and various part filters. There is some subtlety in training, because there are no annotations at the level of parts or key points; in the learning approach here, you have to guess where the parts should be as part of the training process, and you can find the details in their [inaudible]. How well does it do? Okay,
there is a standard methodology that we use in computer vision for evaluating detection algorithms, and here is how we do this for the case of, say, a motorcycle detector. One computes the so-called precision-recall curve. The idea is that the detection algorithm is going to come up with guesses of bounding boxes where the motorbike may be, and we can then evaluate, for each of these guessed bounding boxes, whether it is right or wrong. A guess is judged to be right if its intersection over union with respect to the true motorbike bounding box is above 50%. Then we have a choice of how strict to be in the threshold. We could pass through most of our candidate bounding boxes, and if we guess enough of them, then of course we are guaranteed to find all of the motorbikes; but that by itself means little.
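The overlap test just mentioned can be sketched as follows, with boxes given as (x1, y1, x2, y2) tuples. This is a minimal illustration, not the PASCAL reference code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Overlap rectangle (zero area if the boxes are disjoint).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_correct(guess, truth, threshold=0.5):
    """PASCAL-style test: a detection counts if IoU reaches the threshold."""
    return iou(guess, truth) >= threshold
```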
So, the way we do this is that we pick a threshold, and with that threshold you can evaluate the precision and recall. These terms have the following meaning. Precision means: what fraction of the detections that you declared are actually true motorcycles? Recall is the question of how many of the true motorcycles you managed to detect. Ideally, we want precision to be 100 percent and recall to be 100 percent. In reality, it doesn't work out that way; we are able to detect some fraction of the true motorbikes;
so here, for example, at this point the precision is 0.7; that means that at this point the detections that we declare are 70 percent accurate. This point corresponds to a recall of something like 55 percent, meaning that at that threshold we recover 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false positives, but we will manage to detect more of the true motorbikes. So, as the curve goes down into this range, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case, of course, that would be 100%.
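The whole evaluation can be sketched end to end: sort the detections by confidence, mark each as a true or false positive (via the IoU test), sweep the threshold, and accumulate the area under the precision-recall curve. This is a simplified rectangle-rule average precision, not PASCAL's exact interpolated version.

```python
def average_precision(scores, is_true, n_positives):
    """Approximate AP: area under the precision-recall curve.

    scores: confidence of each detection; is_true: whether each
    detection matched a ground-truth box; n_positives: number of
    ground-truth objects. Simplified rectangle-rule AP, not the
    interpolated PASCAL version.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_true[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_positives
        ap += precision * (recall - prev_recall)  # rectangle of width d(recall)
        prev_recall = recall
    return ap
```

A perfect detector, whose detections are all correct and cover every true object, gets an AP of 1.0; missed objects or false positives pull the number down.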
In practice it is something like 50 to 60 percent for these cases. That area is what we call AP, or average precision, and that is how we compare different algorithms. Here is the precision-recall
curve for a different category, namely person detection, and the different curves correspond to different algorithms; so this algorithm is probably not a good one, and this algorithm is a better one. Notice that in both examples we are not able to detect all the people, and if you look through the roughly 30 percent of the people who are not detected by any approach, usually there is heavy occlusion or an unusual pose, and so on. So, these are the phenomena that make life difficult for us. Finally, the PASCAL VOC people have computed the average precision for every class, and they give
two measures. Max means the best algorithm for that category; so, max for motorbike is something like 58%, which means that the best algorithm for detecting motorbikes has an average precision of 58%. The median is, of course, the median over the different algorithms that were submitted. So, we conclude that some categories are easier than others. Motorbikes are probably the easiest; their average precision is 58%. And something like potted plant is really hard to detect; the average precision there is 16%. If you ask where we are doing quite well: I think it is the categories where the average precision is over 50 percent, and that is motorbike, bicycle, bus, airplane, horse, car, cat, train. So, about 50%; you may like that or not, in the sense that the glass is half full, or half empty. Let's look a little bit at some of
the difficult categories. So, here is the category of boat; the average precision here is about [inaudible], and if you look at the set of examples you will see why this is so hard: there is so much variation in appearance from one boat to another that it's really difficult to train a detector to manage all these cases. Okay, an even more difficult example: chairs. Here we are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now, imagine you're looking for a HOG template which is going to detect the contours, the edges corresponding to a chair. You can see that there is really no hope of managing that. Probably, the way humans detect chairs is by making use of the fact that there's a human sitting on one in a certain pose, and so there is a lot of contextual information which currently is not being captured by the algorithms. I'll turn to images of
people now. Analyzing images of people is very important. It enables us to build human-computer interaction APIs, it enables us to analyze video, recognize actions, and so on and so forth. It's made hard by the fact that people appear in a variety of poses and a variety of clothing, and can be occluded, can be small, can be large, and so on. So, this is a really challenging category, even though it's perhaps the most important category for object recognition. I'm going to show you some results from an approach which is based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors. Face detection, we know, works well.
And so does pedestrian detection, when you're talking about an upright standing or walking pedestrian. Essentially, both of these rely on pattern matching, and they capture patterns that are common and visually characteristic. But these are not the only common and characteristic patterns. For example, we can have patterns corresponding to a pair of legs, and if we can detect those, we are sure that we are looking at a person. Or we can have a pattern which doesn't correspond to a single anatomical part, say half of the face plus half of the torso and the shoulder; that is a pretty characteristic configuration for a person. The way, of course, we trained face detectors was that we had images where all the faces had been marked out, and the faces were then used as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs and faces and shoulders and so on? The poselet idea is exactly to train such detectors, but without having to determine them in advance. First, let me show you what examples of poselets are. A poselet is a small part of a pose: consider the human pose, and a poselet is a small part of it. The top row here corresponds to face, upper body, and hand in a certain configuration. The second row corresponds to two legs. The third row corresponds to the back view of a person. So, in fact, we can have a pretty long list of these poselets. Now, the value of these is
that it enables us to do later tasks more easily. For example, we can train gender classifiers. We want to distinguish men from women, and that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by many women is different. So, once we have this idea of training poselet detectors, we can actually train two versions of a poselet detector, one for male faces and one for female faces, and we can do that for each poselet; essentially this gives us a handle on how to come up with finer classifications of people. I'm going to show you some results here; these are actual results from this approach. The top row is where it thinks there are men, and the bottom row is where it thinks there are women. There are some mistakes here; for example, these are really women, and so are these. So, there are some mistakes, but it's surprisingly good.
Here is what the detector thinks are people wearing long pants in the top row, and not wearing long pants in the bottom row. Notice that once we can start to do this, we gain the ability to describe people: in an image, I want to be able to say that this is a tall, blond man wearing green trousers. Here in the top row is what the algorithm thinks are people wearing hats, and in the bottom row people not wearing hats. This approach applies to detecting actions as well. Here are actions as revealed in still images; you just have a single frame. So, this image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework; for example, we can train poselets corresponding to phoning people, running people, walking people, and people riding horses. I
should note that the problem of detecting actions is a much more general problem, and we obviously don't want to just make use of the static information. If we have video, we can compute optical flow vectors, and that gives us an extra handle on the problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gestures, sign language, etc. If you want, you can think of objects as nouns in English and actions as verbs in English. And it turns out that some of the techniques that have been applied for object recognition carry over to this domain: techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video, turn out to be quite useful and give some of the best results on action recognition tasks. Let me conclude here. I
think our community has made a lot of progress in object recognition, action recognition and so on, but a lot remains to be done. There is this phrase that people in the multimedia information systems community talk about: the so-called semantic gap. Their point is that images and videos are typically represented as pixels, as pixel brightness values, pixel RGB values, and so on, whereas what we are really interested in is the semantic content: what are the objects in the scene, what scene is it, what events are taking place? That is what we would like to get at, and we're not there yet; we are nowhere near human performance. But I think we have made significant progress, and more will continue to happen over the next few years. Thank you.