Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem, which is detecting 3D objects and scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge. This has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images where, in each image, they have marked a certain set of objects, and the object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, etc. In all, they have twenty different categories. For each object belonging to a category, they have marked the bounding box. So, for example, here is the bounding box corresponding to the dog in this image, and there is a bounding box corresponding to a horse here; there will also be bounding boxes corresponding to the people, because in this image we have horses and people. The goal is to detect these objects. So what a computer program is supposed to do, if, say, we are trying to find dogs, is to mark bounding boxes corresponding to where the dogs are in the image. It will then be judged by whether each dog is in the right location: the predicted bounding box has to overlap sufficiently with the correct bounding box. So this is the dominant dataset for studying 3D object recognition.

Now, let's see what techniques we can use for addressing this problem. We start, of course, with the basic paradigm of the multi-scale sliding window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. The basic idea is that we consider a window, let's say starting in the top-left corner of the image. This green box corresponds to one of those windows, and we evaluate the answer to a question: is there a face there? Is there a bus there? Then we shift the window slightly and ask the same question. And since the objects could be at a variety of sizes, we have to repeat this process for different sizes of windows, so as to detect small objects as well as large objects.

A good and standard building block is a linear support vector machine trained on Histogram of Oriented Gradients (HOG) features. This is a framework introduced by Dalal & Triggs in 2005; they give details in their paper about how they compute each of the blocks and how they normalize, and if you are interested in the details, you should read that paper.
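To make the sliding-window paradigm concrete, here is a minimal sketch of a multi-scale detector in the spirit of Dalal & Triggs, assuming a grayscale image, scikit-image for HOG features, and an already-trained scikit-learn LinearSVC passed in as `svm`. The window size, stride, scales, and threshold are illustrative assumptions, not the values from their paper, and the training step is omitted.

    # Sketch: multi-scale sliding-window detection with HOG + linear SVM.
    # Assumes a grayscale float image and a pretrained LinearSVC `svm`
    # (hypothetical; training on positive/negative windows is omitted).
    import numpy as np
    from skimage.feature import hog
    from skimage.transform import rescale

    WINDOW = (128, 64)   # detection window (rows, cols); illustrative
    STRIDE = 8           # shift the window 8 pixels at a time

    def detect(image, svm, scales=(1.0, 0.75, 0.5), threshold=0.0):
        """Slide a fixed-size window over the image at several scales;
        return boxes (x0, y0, x1, y1, score) where the SVM fires."""
        detections = []
        for s in scales:
            img = rescale(image, s, anti_aliasing=True)
            H, W = img.shape[:2]
            for y in range(0, H - WINDOW[0], STRIDE):
                for x in range(0, W - WINDOW[1], STRIDE):
                    patch = img[y:y + WINDOW[0], x:x + WINDOW[1]]
                    feat = hog(patch, orientations=9,
                               pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2))
                    score = svm.decision_function(feat[None, :])[0]
                    if score > threshold:
                        # map the window back to original image coordinates
                        detections.append((x / s, y / s,
                                           (x + WINDOW[1]) / s,
                                           (y + WINDOW[0]) / s, score))
        return detections

In practice one would follow this with non-maximum suppression to remove overlapping detections of the same object, but the loop above is the essence of the paradigm.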
Now, note that the Dalal & Triggs approach was tested on pedestrians, and in the case of pedestrians a single block is enough: you try to detect the whole object in one go. When we deal with more complex objects, like people in general poses, or dogs and cats, we find that these are very non-rigid, so one single rigid template is not effective. What we really want are part-based approaches. Nowadays, there are two dominant part-based approaches. The first is the so-called deformable part models, due to Felzenszwalb et al.; the journal article on this appeared around 2010. The other approach is the so-called poselets, due to Lubomir Bourdev and other collaborators in my group. So, what's the basic idea? Let me get into Felzenszwalb's approach first.

Their basic idea is to have a root filter, which tries to find the object as a whole, and then a set of part filters, which might correspond to, say, detectors for faces or legs and so forth; but these part filters have to fire in certain spatial relationships with respect to the root filter. So the overall detector is the combination of a holistic detector and a set of part filters which have to be in a certain relationship with respect to the whole object. This requires training both the root filter and the various part filters, which can be done using the so-called latent SVM approach, and it does not require any extra annotation. And note that I said parts such as faces and legs; that's me getting carried away. The detector's parts need not correspond to anything semantically meaningful.

In the case of the poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. So, suppose you have images of people; these images might be annotated with keypoints corresponding to the left shoulder, right shoulder, left elbow, right elbow, and so on, while other object categories will have other keypoints: for example, for an airplane, you might have a keypoint on the tip of the nose or the tips of the wings, and so on and so forth. This requires extra work, because somebody has to go through all the images in the training set and mark these keypoints, but the consequence is that we will be able to do a few more things afterwards.

Here's a slide which shows how object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model is illustrated on bicycle detection. In fact, you don't train just one model; you train a mixture of models. So there is a model here corresponding to the side view of a bicycle. The root filter is shown here; it is a HOG template, looking for edges of particular orientations, as might be found on the side view of a bicycle. Then we have various part filters: each of the rectangles here corresponds to a part filter, and this one might correspond to something like a template detector for wheels. To come up with the final score, we combine the score of the HOG template of the root filter with the scores of the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like here, so for those they have a different model, shown here, where the parts are somewhat different. So, overall, you have a mixture model, with multiple models corresponding to different poses, and each model consists of a root filter and various part filters. There is some subtlety in the training, because the annotations do not label where the parts are: in the learning approach, you have to guess where the parts should be as part of the training process, and you can find the details in their paper.
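As a rough sketch of the scoring idea just described: the score of a root placement is the root filter's response plus, for each part, the best nearby part-filter response minus a deformation cost that penalizes drifting from the part's ideal offset. Everything below (precomputed response maps, anchors, quadratic deformation weights, and the brute-force search that stands in for the distance transform used in the real system) is an illustrative assumption, not Felzenszwalb et al.'s implementation.

    # Sketch: deformable-part-model scoring for one root placement.
    # root_resp and part_resps[i] are assumed precomputed response maps
    # (e.g. cross-correlation of HOG features with the learned filters).
    import numpy as np

    def score_at(root_resp, part_resps, anchors, defs, y, x, radius=4):
        """anchors[i] is part i's ideal offset from the root;
        defs[i] = (wy, wx) weights the squared-displacement penalty.
        Brute force replaces the distance transform of the real code."""
        score = root_resp[y, x]
        for resp, (ay, ax), (wy, wx) in zip(part_resps, anchors, defs):
            H, W = resp.shape
            best = -np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    py, px = y + ay + dy, x + ax + dx
                    if 0 <= py < H and 0 <= px < W:
                        # part response minus quadratic deformation cost
                        best = max(best,
                                   resp[py, px] - wy * dy**2 - wx * dx**2)
            score += best
        return score

A mixture model then simply runs one such scoring pass per component (side view, front view, and so on) and takes the best-scoring component at each location.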
How well does it do? There is a standard methodology that we use in computer vision for evaluating detection algorithms, and here is how we do it for the case of, say, a motorcycle detector. One computes the so-called precision-recall curve. The idea is that the detection algorithm is going to come up with guesses of bounding boxes where a motorbike may be, and we can then evaluate, for each of these guessed bounding boxes, whether it is right or wrong: a guess is judged to be right if its intersection over union with respect to the true motorbike's bounding box is at least 50%. Then we have a choice of how strict to be in a threshold. We could pass through most of our candidate bounding boxes, and if you guess enough of them, then of course you are guaranteed to find all of the motorbikes, but that would come at the cost of many false alarms. So the way we do this is to pick a threshold, and with that threshold we can evaluate the precision and recall. Precision and recall have the following meaning: precision asks, what fraction of the detections that you declared are actually true motorcycles? Recall asks, how many of the true motorcycles did you manage to detect? Ideally, we want precision to be 100 percent and recall to be 100 percent. In reality, it doesn't work out that way; we are able to detect some fraction of the true motorbikes. Here, for example, at this point the precision is 0.7, meaning that the detections we declare are 70 percent accurate, and this point corresponds to a recall of something like 55 percent, meaning that at that threshold we recall 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false positives, but we will manage to detect more of the true motorbikes. So the curve goes down, and in this range, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case, that would be 100 percent; in fact, it is something like 50 to 60 percent for these cases. That is what we call AP, or average precision, and that is how we compare different algorithms.

Here is the precision-recall curve for a different category, namely person detection, and the different curves correspond to different algorithms; this algorithm is probably not a good one, and this algorithm is a better one. Notice that in both examples we are not able to detect all the people, and if you look through the 30 percent of the people who are not detected by any approach, usually there is heavy occlusion, or unusual poses, and so on. So there are phenomena that make life difficult for us.

Finally, the PASCAL VOC people have computed the average precision for every class, and they give two measures. Max means the best algorithm for that category; so the max for motorbike is something like 58%, meaning that the best algorithm for detecting motorbikes has an average precision of 58%. The median is, of course, the median over the different algorithms that were submitted. So, we conclude that some categories are easier than others. Motorbikes are probably the easiest; their average precision is 58%. Something like potted plant is really hard to detect, and the average precision there is 16%.
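Concretely, here is a minimal sketch of the evaluation just described: a detection counts as correct if its intersection over union with an as-yet-unmatched ground-truth box is at least 0.5, and AP is the area under the precision-recall curve traced out as the score threshold is relaxed. The matching rule and the plain numerical integral are simplifications of the official PASCAL protocol.

    # Sketch: IoU test and average precision for one image's detections.
    import numpy as np

    def iou(a, b):
        """Intersection over union of boxes (x0, y0, x1, y1)."""
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def average_precision(detections, gt_boxes):
        """detections: list of (box, score); gt_boxes: list of true boxes.
        Each ground-truth box may be matched at most once."""
        detections = sorted(detections, key=lambda d: -d[1])
        matched = [False] * len(gt_boxes)
        tp = np.zeros(len(detections))
        fp = np.zeros(len(detections))
        for i, (box, _) in enumerate(detections):
            ious = [iou(box, g) for g in gt_boxes]
            j = int(np.argmax(ious)) if ious else -1
            if j >= 0 and ious[j] >= 0.5 and not matched[j]:
                tp[i], matched[j] = 1, True
            else:
                fp[i] = 1   # false positive: wrong place or duplicate
        recall = np.cumsum(tp) / max(len(gt_boxes), 1)
        precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
        # area under the precision-recall curve (simple integral)
        return float(np.trapz(precision, recall))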
So, if you ask where we are doing quite well: I think it's in all the categories where the average precision is over 50 percent, and that is motorbike, bicycle, bus, airplane, horse, car, cat, and train. So, about 50%. You may like that or not; it is the classic case of the glass being half full or half empty, and since it's about 50%, you can call it either way.

Let's look a little bit at some of the difficult categories. Here is the category of boat, and the average precision here is about [inaudible]. If you look at this set of examples, you will see why this is so hard: there is so much variation in appearance from one boat to another that it's really difficult to train a detector to manage all these cases. Okay, an even more difficult example: chairs. Here we are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now, imagine you are looking for a HOG template which is going to detect the characteristic edges corresponding to a chair; you can see that there is no hope of managing that. Probably, the way humans detect chairs is by making use of the fact that there is a human sitting on the chair in a certain pose; so there is a lot of contextual information which currently is not being captured by the algorithms.

I'll turn to images of people now. Analyzing images of people is very important: it enables us to build human-computer interfaces, it enables us to analyze video, recognize actions, and so on and so forth. It is made hard by the fact that people appear in a variety of poses and a variety of clothing, and can be occluded, can be small, can be large, and so on. So this is a really challenging category, even though it is perhaps the most important category for object recognition. I'm going to show you some research from an approach which is based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors. Face detection, we know, works well, and so does pedestrian detection when we are talking about a vertical standing or walking pedestrian. Essentially, both of these rely on pattern matching: they capture patterns that are common and visually characteristic. But these are not the only common and visually characteristic patterns. For example, we can have a pattern corresponding to a pair of legs, and if we can detect that, we are sure that we are looking at a person. Or we can have a pattern which doesn't correspond to a single anatomical part, say half of the face, half of the torso, and the shoulder; that's fine, it is still a pretty characteristic pattern for a person. Now, the way we train face detectors, of course, is that we have images where all the faces have been marked out, so the faces can be used as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs, and face plus shoulder, and so on? The poselet idea is exactly to train such detectors, without our having to determine the set of parts in advance.

But first, let me show you some examples of poselets. The word poselet implies a small part of a pose: consider the human pose, and a poselet is a small part of it. The top row here corresponds to the face, upper body, and hands in a certain configuration. The second row corresponds to two legs.
The third row corresponds to the back view of a person. In fact, we can have a pretty long list of these poselets. Now, the value of these is that they enable us to do later tasks more easily. For example, we can train gender classifiers. We want to distinguish men from women, and that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by many women is different. So, once we have this idea of training poselet detectors, we can actually train two versions of each poselet detector, say one for male faces and one for female faces, and we can do that for each detector. Essentially, this gives us a handle on how to come up with finer-grained classifications of people.

I'm going to show you some results here; these are actual results from this approach. The top row is where the algorithm thinks the people are men, and the bottom row is where it thinks they are women. There are some mistakes: for example, these are really women, and so are these. So there are some mistakes, but it is surprisingly good. Here is what the detector thinks are people wearing long pants, in the top row, and people not wearing long pants, in the bottom row. Notice that once we can start to do this, we get the ability to describe people: in an image, I want to be able to say that this is a tall, blond man wearing green trousers. Here, in the top row, is what the algorithm thinks are people wearing hats, and in the bottom row, people not wearing hats.

This approach applies to detecting actions as well. Here are actions as revealed in still images, so you just have a single frame: this image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework; for example, we can train poselets corresponding to people phoning, running, walking, and riding. I should note that the problem of detecting actions is a much more general problem, and we obviously don't want to make use of just the static information. If we have video, then we can compute optical flow vectors, and that gives us an extra handle on the problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gestures, sign language, and so on. If you want, you can think of objects as nouns in English and actions as verbs. And it turns out that some of the techniques that have been applied to object recognition carry over to this domain: techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video, turn out to be quite useful and give some of the best results on action recognition tasks, as sketched below.
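Here is a minimal sketch of the bag-of-spatio-temporal-words idea: cluster local space-time descriptors into a visual vocabulary, represent each video clip as a normalized histogram of word counts, and train any standard classifier on those histograms. The descriptor extraction itself (e.g. cuboids around space-time interest points) is assumed as input here, and the vocabulary size is an illustrative choice.

    # Sketch: bag of spatio-temporal words for action recognition.
    # Assumes per-video descriptor arrays are already extracted.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_vocabulary(all_descriptors, k=200):
        """all_descriptors: (N, D) array of space-time descriptors
        pooled from the training videos; k is the vocabulary size."""
        return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

    def bag_of_words(vocab, descriptors):
        """Normalized histogram of visual-word counts for one clip."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)   # normalize for clip length

    # Usage sketch (train_descs: list of per-video descriptor arrays,
    # labels: the corresponding action classes):
    #   vocab = build_vocabulary(np.vstack(train_descs))
    #   X = np.array([bag_of_words(vocab, d) for d in train_descs])
    #   clf = LinearSVC().fit(X, labels)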
Let me conclude here. I think our community has made a lot of progress on object recognition, action recognition, and so on, but a lot remains to be done. There is this phrase that people in the multimedia information systems community use: the so-called semantic gap. Their point is that images and videos are typically presented as pixels (pixel brightness values, pixel RGB values, and so on), whereas what we are really interested in is the semantic content. What are the objects in the scene? What scene is it? What are the events taking place? That is what we would like to get at, and we're not there yet; we are nowhere near human performance. But I think we have made significant progress, and more will continue to happen over the next few years. Thank you.