Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem, which is that of finding 3D objects in scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge. This has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images, and in each of these images they have marked a certain set of objects. The object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, and so on; there are twenty different categories. For each object belonging to a category, they have marked the bounding box. So, for example, here is the bounding box corresponding to the dog in this image, there is a bounding box corresponding to a horse here, and there will also be bounding boxes corresponding to the people, because in this image we have horses and people.

The goal is to detect these objects. So, suppose we are trying to find dogs: what the computer program is supposed to do is mark bounding boxes corresponding to where the dogs are in the image, and it is then judged by whether the dog is in the right location, that is, whether the predicted bounding box overlaps sufficiently with the correct bounding box. This is the dominant dataset for studying 3D object recognition.

Now, let's see what techniques we can use for addressing this problem. We start, of course, with the basic paradigm of the multi-scale sliding window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. The basic idea is that we consider a window, say starting in the top-left corner of the image. This green box corresponds to one of those windows, and we evaluate the answer to a question: is there a face there? Or is there a bus there? Then we shift the window slightly and ask the same question. And since people can be of a variety of sizes, we have to repeat this process for different window sizes, so as to detect small objects as well as large objects. A good and standard building block for this is a linear support vector machine trained on histogram of oriented gradients (HOG) features.
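To make the sliding-window-plus-HOG idea concrete, here is a minimal sketch in Python. It is not the lecturer's code: the grayscale input, the 128x64 window, the stride, the scale set, and the pre-trained linear SVM `clf` (for example an sklearn `LinearSVC`) are all assumptions made purely for illustration.

```python
# Minimal sketch of multi-scale sliding-window detection with a HOG + linear SVM
# classifier. `clf`, the 128x64 window, stride, and scales are illustrative
# assumptions, not part of the lecture.
import numpy as np
from skimage.feature import hog
from skimage.transform import rescale

def sliding_window_detect(image, clf, window=(128, 64), stride=16,
                          scales=(1.0, 0.75, 0.5), threshold=0.0):
    """Return candidate boxes (x, y, w, h, score) in original-image coordinates.

    `image` is assumed to be a 2D grayscale array."""
    detections = []
    wh, ww = window                        # window height and width in pixels
    for s in scales:                       # rescale the image, not the window
        img = rescale(image, s, anti_aliasing=True)
        H, W = img.shape[:2]
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                patch = img[y:y + wh, x:x + ww]
                feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                           cells_per_block=(2, 2))
                score = clf.decision_function([feat])[0]   # linear SVM score
                if score > threshold:                      # keep confident windows
                    detections.append((int(x / s), int(y / s),
                                       int(ww / s), int(wh / s), float(score)))
    return detections
```

Rescaling the image rather than the window keeps the HOG feature dimensionality fixed, which is what allows one linear SVM to score every window at every scale.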
This is a framework introduced by Dalal and Triggs in 2005. They give details in their paper about how they compute each of the blocks and how they normalize, and if you are interested in the details, you should read that paper. Note that the Dalal and Triggs approach was tested on pedestrians, and in the case of pedestrians a single template is enough: you try to detect the whole object in one go. When we deal with more complex objects, like people in general poses or dogs and cats, we find that these are very non-rigid, so a single rigid template is not effective. What we really want are part-based approaches.

Nowadays there are two dominant part-based approaches. The first is the so-called deformable part models, due to Felzenszwalb et al.; the paper on that appeared around 2010. The other approach is so-called poselets, due to Lubomir Bourdev and other collaborators in my group.

So, what's the basic idea? Let me get into Felzenszwalb's approach first. Their basic idea is to have a root filter, which tries to find the object as a whole, and then a set of part filters, which might correspond to, say, detecting faces or legs and so forth, but these part filters have to fire in certain spatial relationships with respect to the root filter. So the overall detector is a combination of a holistic detector and a set of part filters that have to be in a certain relationship with respect to the whole object. This requires training both the root filter and the various part filters, which can be done using a so-called latent SVM approach, and it does not require any extra annotation. Note that I said parts such as faces and legs; that's me getting carried away. The detector parts need not correspond to anything semantically meaningful.

In the case of the poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. Suppose you have images of people; these images might be annotated with keypoints corresponding to left shoulder, right shoulder, left elbow, right elbow, and so on. For other object categories there will be other keypoints; for example, for an airplane you might have a keypoint on the tip of the nose or the tips of the wings, and so forth. This requires extra work, because somebody has to go through all the images in the dataset and mark these keypoints, but the consequence is that we are able to do a few more things afterwards.
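Before the slide walkthrough, here is a rough, hedged sketch of how a root-filter score could be combined with part-filter scores under a deformation penalty, in the spirit of the deformable part model just described. Real DPMs operate on a HOG feature pyramid and use distance transforms to make the inner maximization fast; the brute-force search below, and the assumed precomputed response maps, anchor offsets, and quadratic costs, are for illustration only.

```python
# Simplified DPM-style scoring: root score plus, for each part, the best
# part response minus a quadratic deformation cost around its anchor offset.
import numpy as np

def dpm_score(root_resp, part_resps, anchors, deform=(0.05, 0.05), radius=4):
    """
    root_resp : 2D array of root-filter responses over root locations
    part_resps: list of 2D arrays, one response map per part (same grid)
    anchors   : list of (dy, dx) ideal offsets of each part from the root
    Returns a 2D array of combined scores, one per root location.
    """
    H, W = root_resp.shape
    total = root_resp.copy()
    cy, cx = deform                              # quadratic deformation weights
    for resp, (ady, adx) in zip(part_resps, anchors):
        best = np.full((H, W), -np.inf)
        for y in range(H):
            for x in range(W):
                # search displacements around the anchored part position
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        py, px = y + ady + dy, x + adx + dx
                        if 0 <= py < H and 0 <= px < W:
                            s = resp[py, px] - cy * dy * dy - cx * dx * dx
                            if s > best[y, x]:
                                best[y, x] = s
        total += best
    return total
```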
Here's a slide which shows how object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model is illustrated on the problem of bicycle detection. In fact, you don't train just one model, you train a mixture of models. So there is a model here corresponding to the side view of a bicycle. The root filter is shown here: it is a HOG template, looking for edges of particular orientations such as might be found on the side view of a bicycle. Then we have various part filters; each of the rectangles here corresponds to a part filter, and this one might correspond to something like a template detector for wheels. To come up with the final score, we combine the score of the root filter's HOG template with the scores of the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like this one here, so for those they have a different model. Again the model is shown here, and here the parts may be somewhat different. So overall you have a mixture model, with multiple models corresponding to different poses, and each model, as I said, consists of a root filter and various part filters. There is some subtlety in training, because there are no annotations marking where the parts are, so the learning approach has to guess where the parts should be as part of the training process; you can find the details in their paper.

How well does it do? There is a standard methodology that we use in computer vision for evaluating detection performance, and here is how we do this for the case of, say, a motorcycle detector. One computes the so-called precision-recall curve. The idea is that the detection algorithm comes up with guesses of bounding boxes where the motorbike may be, and we can then evaluate, for each of these guessed bounding boxes, whether it is right or wrong: it is judged to be right if its intersection over union with the true motorbike box is at least 50%. Then we have a choice of how strict to be with the detection threshold. We could pass through most of our candidate bounding boxes, and if you guess enough of them then of course you are guaranteed to find all of the motorbikes.
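The 50 percent overlap criterion just mentioned is intersection over union between the predicted and ground-truth boxes. A small sketch, assuming boxes given as (x, y, width, height), which is just one possible convention:

```python
# Intersection over union between two axis-aligned boxes (x, y, w, h).
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_correct(detection, ground_truth, thresh=0.5):
    # PASCAL-style criterion: a detection counts as right if IoU >= 50%
    return iou(detection, ground_truth) >= thresh
```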
That hardly seems right by itself. So, the way we do this is that we pick a threshold, and with that threshold you can evaluate the precision and recall. These terms have the following meaning: precision means what fraction of the detections that you declared are actually true motorcycles; recall asks how many of the true motorcycles you managed to detect. Ideally we would want precision to be 100 percent and recall to be 100 percent. In reality it doesn't work out that way; we are able to detect some fraction of the true motorbikes. So here, for example, at this point the precision is 0.7, meaning that the detections we declare are 70 percent accurate, and this point corresponds to a recall of something like 55 percent, meaning that at that threshold we have found 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false positives, but we will manage to detect more of the true motorbikes. So, as this curve goes down in this range, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case, that would be 100 percent; in fact it is something like 50 to 60 percent for these cases. That is what we call AP, or average precision, and that is how we compare different algorithms.

Here is the precision-recall curve for a different category, namely person detection, and the different curves correspond to different algorithms: this algorithm is probably not a good one, this one is a better algorithm. Notice that in both examples we are not able to detect all the people, and if you look through the roughly 30 percent of people who are not detected by any approach, usually there is heavy occlusion or an unusual pose or the like. So there are phenomena that make life difficult for us.

Finally, the PASCAL VOC people have computed the average precision for every class, and they give two measures. Max means the best algorithm for that category; so the max for motorbike is something like 58 percent, which means that the best algorithm for detecting motorbikes has an average precision of 58 percent. The median is, of course, the median over the different algorithms that were submitted. So we conclude that some categories are easier than others; motorbikes are probably the easiest.
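Here is a minimal sketch of how precision, recall, and average precision could be computed from a detector's scored outputs. It assumes each detection has already been matched against a distinct ground-truth box using the IoU test above, and it integrates the precision-recall curve with simple rectangles rather than the exact PASCAL VOC interpolation.

```python
# Precision-recall curve and a simple average precision, given per-detection
# confidence scores and a boolean flag saying whether each detection matched
# a (distinct) ground-truth box.
import numpy as np

def average_precision(scores, matches, num_ground_truth):
    order = np.argsort(scores)[::-1]              # most confident detections first
    hits = np.array(matches, dtype=bool)[order]
    tp = np.cumsum(hits)                          # true positives so far
    fp = np.cumsum(~hits)                         # false positives so far
    precision = tp / (tp + fp)
    recall = tp / num_ground_truth
    # area under the precision-recall curve via rectangle integration
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return precision, recall, ap
```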
Their average precision is 58 percent, while something like potted plant is really hard to detect, and the average precision there is 16 percent. So, if you ask where we are doing quite well, I think it is all the categories where the average precision is over 50 percent, and that is motorbike, bicycle, bus, airplane, horse, car, cat, train. So about 50 percent; you may like that or not, in the sense that this is a case of the glass being about half full, and since it's about 50 percent, you could call it either way.

Let's look a little bit at some of the difficult categories. Here is the category of boat, and the average precision here is about [inaudible]. If you look at the set of examples you will see why this is so hard: there is so much variation in appearance from one boat to another that it is really difficult to train a detector to manage all these cases. And here is an even more difficult example: chairs. We are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now imagine you are looking for a HOG template which is going to detect the characteristic edges corresponding to a chair; you can really see that there is no hope of managing that. Probably the way humans detect chairs is by making use of the fact that there is a human sitting on it in a certain pose, so there is a lot of contextual information which currently is not being captured by the algorithms.

I'll turn to images of people now. Analyzing images of people is very important: it enables us to build human-computer interaction applications, to analyze video, to recognize actions, and so on and so forth. It is made hard by the fact that people appear in a variety of poses and a variety of clothing, can be occluded, can be small, can be large, and so on. So this is a really challenging category, even though it is perhaps the most important category for object recognition. I'm going to show you some research from an approach based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors. Face detection, we know, works well, and so does pedestrian detection when you are talking about a vertical, standing or walking pedestrian. Essentially, both of these rely on pattern matching, and they capture patterns that are common and visually characteristic. But these are not the only common and characteristic patterns. For example, we can have a pattern corresponding to a pair of legs, and if we can detect that, we are sure that we are looking at a person.
Or we can have a pattern which doesn't correspond to a single anatomical part: say, half of the face plus half of the torso and the shoulder. That is fine; it is a pretty characteristic pattern for a person. Now, the way we train face detectors, of course, presupposes that we have images where all the faces have been marked out, and those faces are then used as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs, face plus shoulder, and so on? The poselet idea is exactly to train such detectors, but without having to determine the configurations in advance.

First, let me show you some examples of what poselets are. The term "poselet" means a small part of a pose: consider the human pose, and a poselet corresponds to a small part of it. So the top row here corresponds to the face, upper body, and hand in a certain configuration; the second row corresponds to two legs; the third row corresponds to the back view of a person. In fact, we can have a pretty long list of these poselets.

Now, the value of these is that they enable us to do later tasks more easily. For example, we can train gender classifiers: we want to distinguish men from women, and that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by many women is different. So once we have this idea of training poselet detectors, we can actually train two versions of a poselet detector, one for male faces and one for female faces, and we can do that for each poselet. Essentially this gives us a handle on how to come up with finer-grained classification of people.

I'm going to show you some results from this approach. The top row is where the algorithm thinks the people are men and the bottom row is where it thinks they are women. There are some mistakes here: for example, these are really women, and so are these, so there are some mistakes, but it is surprisingly good. Here is what the detector thinks are people wearing long pants in the top row, and people not wearing long pants in the bottom row. Notice that once we can start to do this, we get the ability to describe people: in an image, I want to be able to say that this is a tall, blond man wearing green trousers. Here, in the top row, is what the algorithm thinks are people wearing hats, and in the bottom row, people not wearing hats.
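As an illustration of how per-poselet gender classifiers might be pooled into a single decision, here is a hedged sketch. The data layout and the score-weighted averaging are assumptions made for exposition, not the actual Bourdev et al. implementation.

```python
# Pool the opinions of per-poselet gender classifiers, weighting each poselet's
# opinion by how confidently that poselet fired on the image.
def classify_gender(activations):
    """
    activations: list of dicts like
        {"detection_score": float,   # how strongly the poselet fired (>= 0)
         "gender_score": float}      # that poselet's own classifier output,
                                     # > 0 meaning male, < 0 meaning female
    Returns "male" or "female".
    """
    num = sum(a["detection_score"] * a["gender_score"] for a in activations)
    den = sum(a["detection_score"] for a in activations) or 1.0
    return "male" if num / den > 0 else "female"
```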
This approach applies to detecting actions as well. Here are actions as revealed in still images; you just have a single frame. This image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework: for example, we can train poselets corresponding to phoning people, running people, walking people, people riding horses, and so on. I should note that the problem of detecting actions is a much more general problem, and we obviously don't want to make use of just the static information. If we have video, we can compute optical flow vectors, and that gives us an extra handle on the problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gestures, sign language, and so on. If you want, you can think of objects as nouns in English and actions as verbs. It turns out that some of the techniques that have been applied to object recognition carry over to this domain: techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video, turn out to be quite useful and give some of the best results on action recognition tasks.

Let me conclude here. I think our community has made a lot of progress on object recognition, action recognition, and so on, but a lot more needs to be done. There is a phrase that people in the multimedia information systems community use, the so-called semantic gap: their point is that images and videos are typically represented as pixels, pixel brightness values, pixel RGB values, and so on, whereas what we are really interested in is the semantic content. What are the objects in the scene? What scene is it? What are the events taking place? That is what we would like to get at, and we're not there yet; we are nowhere near human performance. But I think we have made significant progress, and more will continue to happen over the next few years. Thank you.