Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem, which is that of finding 3D objects in scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge. This has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images, and in each of these images they have marked a certain set of objects. The object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, and so on; there are twenty different categories. For each object belonging to a category, they have marked the bounding box. So, for example, here is the bounding box corresponding to the dog in this image, there is a bounding box corresponding to a horse here, and there will also be bounding boxes corresponding to the people, because in this image we have horses and people.

The goal is to detect these objects. So, suppose we are trying to find dogs: what the computer program is supposed to do is mark bounding boxes corresponding to where the dogs are in the image, and it is then judged by whether the dog is in the right location, that is, whether the predicted bounding box overlaps sufficiently with the correct bounding box. This is the dominant dataset for studying 3D object recognition.

Now, let's see what techniques we can use for addressing this problem. We start, of course, with the basic paradigm of the multi-scale sliding window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. The basic idea is that we consider a window, say starting in the top-left corner of the image. This green box corresponds to one of those windows, and we evaluate the answer to a question: is there a face there? Or is there a bus there? Then we shift the window slightly and ask the same question. And since people can be of a variety of sizes, we have to repeat this process for different window sizes, so as to detect small objects as well as large objects. A good and standard building block for this is a linear support vector machine trained on histogram of oriented gradients (HOG) features.
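To make the sliding-window-plus-HOG idea concrete, here is a minimal sketch in Python. It is not the lecturer's code: the grayscale input, the 128x64 window, the stride, the scale set, and the pre-trained linear SVM `clf` (for example an sklearn `LinearSVC`) are all assumptions made purely for illustration.

```python
# Minimal sketch of multi-scale sliding-window detection with a HOG + linear SVM
# classifier. `clf`, the 128x64 window, stride, and scales are illustrative
# assumptions, not part of the lecture.
import numpy as np
from skimage.feature import hog
from skimage.transform import rescale

def sliding_window_detect(image, clf, window=(128, 64), stride=16,
                          scales=(1.0, 0.75, 0.5), threshold=0.0):
    """Return candidate boxes (x, y, w, h, score) in original-image coordinates.

    `image` is assumed to be a 2D grayscale array."""
    detections = []
    wh, ww = window                        # window height and width in pixels
    for s in scales:                       # rescale the image, not the window
        img = rescale(image, s, anti_aliasing=True)
        H, W = img.shape[:2]
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                patch = img[y:y + wh, x:x + ww]
                feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
                           cells_per_block=(2, 2))
                score = clf.decision_function([feat])[0]   # linear SVM score
                if score > threshold:                      # keep confident windows
                    detections.append((int(x / s), int(y / s),
                                       int(ww / s), int(wh / s), float(score)))
    return detections
```

Rescaling the image rather than the window keeps the HOG feature dimensionality fixed, which is what allows one linear SVM to score every window at every scale.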
This is a framework introduced by Dalal and Triggs in 2005. They give details in their paper about how they compute each of the blocks and how they normalize, and if you are interested in the details, you should read that paper. Note that the Dalal and Triggs approach was tested on pedestrians, and in the case of pedestrians a single template is enough: you try to detect the whole object in one go. When we deal with more complex objects, like people in general poses or dogs and cats, we find that these are very non-rigid, so a single rigid template is not effective. What we really want are part-based approaches.

Nowadays there are two dominant part-based approaches. The first is the so-called deformable part models, due to Felzenszwalb et al.; the paper on that appeared around 2010. The other approach is so-called poselets, due to Lubomir Bourdev and other collaborators in my group.

So, what's the basic idea? Let me get into Felzenszwalb's approach first. Their basic idea is to have a root filter, which tries to find the object as a whole, and then a set of part filters, which might correspond to, say, detecting faces or legs and so forth, but these part filters have to fire in certain spatial relationships with respect to the root filter. So the overall detector is a combination of a holistic detector and a set of part filters that have to be in a certain relationship with respect to the whole object. This requires training both the root filter and the various part filters, which can be done using a so-called latent SVM approach, and it does not require any extra annotation. Note that I said parts such as faces and legs; that's me getting carried away. The detector parts need not correspond to anything semantically meaningful.

In the case of the poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. Suppose you have images of people; these images might be annotated with keypoints corresponding to left shoulder, right shoulder, left elbow, right elbow, and so on. For other object categories there will be other keypoints; for example, for an airplane you might have a keypoint on the tip of the nose or the tips of the wings, and so forth. This requires extra work, because somebody has to go through all the images in the dataset and mark these keypoints, but the consequence is that we are able to do a few more things afterwards.
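Before the slide walkthrough, here is a rough, hedged sketch of how a root-filter score could be combined with part-filter scores under a deformation penalty, in the spirit of the deformable part model just described. Real DPMs operate on a HOG feature pyramid and use distance transforms to make the inner maximization fast; the brute-force search below, and the assumed precomputed response maps, anchor offsets, and quadratic costs, are for illustration only.

```python
# Simplified DPM-style scoring: root score plus, for each part, the best
# part response minus a quadratic deformation cost around its anchor offset.
import numpy as np

def dpm_score(root_resp, part_resps, anchors, deform=(0.05, 0.05), radius=4):
    """
    root_resp : 2D array of root-filter responses over root locations
    part_resps: list of 2D arrays, one response map per part (same grid)
    anchors   : list of (dy, dx) ideal offsets of each part from the root
    Returns a 2D array of combined scores, one per root location.
    """
    H, W = root_resp.shape
    total = root_resp.copy()
    cy, cx = deform                              # quadratic deformation weights
    for resp, (ady, adx) in zip(part_resps, anchors):
        best = np.full((H, W), -np.inf)
        for y in range(H):
            for x in range(W):
                # search displacements around the anchored part position
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        py, px = y + ady + dy, x + adx + dx
                        if 0 <= py < H and 0 <= px < W:
                            s = resp[py, px] - cy * dy * dy - cx * dx * dx
                            if s > best[y, x]:
                                best[y, x] = s
        total += best
    return total
```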
Here's a slide which shows how object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model is illustrated on the problem of bicycle detection. In fact, you don't train just one model, you train a mixture of models. So there is a model here corresponding to the side view of a bicycle. The root filter is shown here: it is a HOG template, looking for edges of particular orientations such as might be found on the side view of a bicycle. Then we have various part filters; each of the rectangles here corresponds to a part filter, and this one might correspond to something like a template detector for wheels. To come up with the final score, we combine the score of the root filter's HOG template with the scores of the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like this one here, so for those they have a different model. Again the model is shown here, and here the parts may be somewhat different. So overall you have a mixture model, with multiple models corresponding to different poses, and each model, as I said, consists of a root filter and various part filters. There is some subtlety in training, because there are no annotations marking where the parts are, so the learning approach has to guess where the parts should be as part of the training process; you can find the details in their paper.

How well does it do? There is a standard methodology that we use in computer vision for evaluating detection performance, and here is how we do this for the case of, say, a motorcycle detector. One computes the so-called precision-recall curve. The idea is that the detection algorithm comes up with guesses of bounding boxes where the motorbike may be, and we can then evaluate, for each of these guessed bounding boxes, whether it is right or wrong: it is judged to be right if its intersection over union with the true motorbike box is at least 50%. Then we have a choice of how strict to be with the detection threshold. We could pass through most of our candidate bounding boxes, and if you guess enough of them then of course you are guaranteed to find all of the motorbikes.
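The 50 percent overlap criterion just mentioned is intersection over union between the predicted and ground-truth boxes. A small sketch, assuming boxes given as (x, y, width, height), which is just one possible convention:

```python
# Intersection over union between two axis-aligned boxes (x, y, w, h).
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # overlap width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_correct(detection, ground_truth, thresh=0.5):
    # PASCAL-style criterion: a detection counts as right if IoU >= 50%
    return iou(detection, ground_truth) >= thresh
```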
That hardly seems right by itself. So, the way we do this is that we pick a threshold, and with that threshold you can evaluate the precision and recall. These terms have the following meaning: precision means what fraction of the detections that you declared are actually true motorcycles; recall asks how many of the true motorcycles you managed to detect. Ideally we would want precision to be 100 percent and recall to be 100 percent. In reality it doesn't work out that way; we are able to detect some fraction of the true motorbikes. So here, for example, at this point the precision is 0.7, meaning that the detections we declare are 70 percent accurate, and this point corresponds to a recall of something like 55 percent, meaning that at that threshold we have found 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false positives, but we will manage to detect more of the true motorbikes. So, as this curve goes down in this range, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case, that would be 100 percent; in fact it is something like 50 to 60 percent for these cases. That is what we call AP, or average precision, and that is how we compare different algorithms.

Here is the precision-recall curve for a different category, namely person detection, and the different curves correspond to different algorithms: this algorithm is probably not a good one, this one is a better algorithm. Notice that in both examples we are not able to detect all the people, and if you look through the roughly 30 percent of people who are not detected by any approach, usually there is heavy occlusion or an unusual pose or the like. So there are phenomena that make life difficult for us.

Finally, the PASCAL VOC people have computed the average precision for every class, and they give two measures. Max means the best algorithm for that category; so the max for motorbike is something like 58 percent, which means that the best algorithm for detecting motorbikes has an average precision of 58 percent. The median is, of course, the median over the different algorithms that were submitted. So we conclude that some categories are easier than others; motorbikes are probably the easiest.
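Here is a minimal sketch of how precision, recall, and average precision could be computed from a detector's scored outputs. It assumes each detection has already been matched against a distinct ground-truth box using the IoU test above, and it integrates the precision-recall curve with simple rectangles rather than the exact PASCAL VOC interpolation.

```python
# Precision-recall curve and a simple average precision, given per-detection
# confidence scores and a boolean flag saying whether each detection matched
# a (distinct) ground-truth box.
import numpy as np

def average_precision(scores, matches, num_ground_truth):
    order = np.argsort(scores)[::-1]              # most confident detections first
    hits = np.array(matches, dtype=bool)[order]
    tp = np.cumsum(hits)                          # true positives so far
    fp = np.cumsum(~hits)                         # false positives so far
    precision = tp / (tp + fp)
    recall = tp / num_ground_truth
    # area under the precision-recall curve via rectangle integration
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return precision, recall, ap
```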
Their average precision is 58 percent, while something like potted plant is really hard to detect, and the average precision there is 16 percent. So, if you ask where we are doing quite well, I think it is all the categories where the average precision is over 50 percent, and that is motorbike, bicycle, bus, airplane, horse, car, cat, train. So about 50 percent; you may like that or not, in the sense that this is a case of the glass being about half full, and since it's about 50 percent, you could call it either way.

Let's look a little bit at some of the difficult categories. Here is the category of boat, and the average precision here is about [inaudible]. If you look at the set of examples you will see why this is so hard: there is so much variation in appearance from one boat to another that it is really difficult to train a detector to manage all these cases. And here is an even more difficult example: chairs. We are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now imagine you are looking for a HOG template which is going to detect the characteristic edges corresponding to a chair; you can really see that there is no hope of managing that. Probably the way humans detect chairs is by making use of the fact that there is a human sitting on it in a certain pose, so there is a lot of contextual information which currently is not being captured by the algorithms.

I'll turn to images of people now. Analyzing images of people is very important: it enables us to build human-computer interaction applications, to analyze video, to recognize actions, and so on and so forth. It is made hard by the fact that people appear in a variety of poses and a variety of clothing, can be occluded, can be small, can be large, and so on. So this is a really challenging category, even though it is perhaps the most important category for object recognition. I'm going to show you some research from an approach based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors. Face detection, we know, works well, and so does pedestrian detection when you are talking about a vertical, standing or walking pedestrian. Essentially, both of these rely on pattern matching, and they capture patterns that are common and visually characteristic. But these are not the only common and characteristic patterns. For example, we can have a pattern corresponding to a pair of legs, and if we can detect that, we are sure that we are looking at a person.
Or we can have a pattern which doesn't correspond to a single anatomical part: say, half of the face plus half of the torso and the shoulder. That is fine; it is a pretty characteristic pattern for a person. Now, the way we train face detectors, of course, presupposes that we have images where all the faces have been marked out, and those faces are then used as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs, face plus shoulder, and so on? The poselet idea is exactly to train such detectors, but without having to determine the configurations in advance.

First, let me show you some examples of what poselets are. The term "poselet" means a small part of a pose: consider the human pose, and a poselet corresponds to a small part of it. So the top row here corresponds to the face, upper body, and hand in a certain configuration; the second row corresponds to two legs; the third row corresponds to the back view of a person. In fact, we can have a pretty long list of these poselets.

Now, the value of these is that they enable us to do later tasks more easily. For example, we can train gender classifiers: we want to distinguish men from women, and that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by many women is different. So once we have this idea of training poselet detectors, we can actually train two versions of a poselet detector, one for male faces and one for female faces, and we can do that for each poselet. Essentially this gives us a handle on how to come up with finer-grained classification of people.

I'm going to show you some results from this approach. The top row is where the algorithm thinks the people are men and the bottom row is where it thinks they are women. There are some mistakes here: for example, these are really women, and so are these, so there are some mistakes, but it is surprisingly good. Here is what the detector thinks are people wearing long pants in the top row, and people not wearing long pants in the bottom row. Notice that once we can start to do this, we get the ability to describe people: in an image, I want to be able to say that this is a tall, blond man wearing green trousers. Here, in the top row, is what the algorithm thinks are people wearing hats, and in the bottom row, people not wearing hats.
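As an illustration of how per-poselet gender classifiers might be pooled into a single decision, here is a hedged sketch. The data layout and the score-weighted averaging are assumptions made for exposition, not the actual Bourdev et al. implementation.

```python
# Pool the opinions of per-poselet gender classifiers, weighting each poselet's
# opinion by how confidently that poselet fired on the image.
def classify_gender(activations):
    """
    activations: list of dicts like
        {"detection_score": float,   # how strongly the poselet fired (>= 0)
         "gender_score": float}      # that poselet's own classifier output,
                                     # > 0 meaning male, < 0 meaning female
    Returns "male" or "female".
    """
    num = sum(a["detection_score"] * a["gender_score"] for a in activations)
    den = sum(a["detection_score"] for a in activations) or 1.0
    return "male" if num / den > 0 else "female"
```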
This approach applies to detecting actions as well. Here are actions as revealed in still images; you just have a single frame. This image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework: for example, we can train poselets corresponding to phoning people, running people, walking people, people riding horses, and so on. I should note that the problem of detecting actions is a much more general problem, and we obviously don't want to make use of just the static information. If we have video, we can compute optical flow vectors, and that gives us an extra handle on the problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gestures, sign language, and so on. If you want, you can think of objects as nouns in English and actions as verbs. It turns out that some of the techniques that have been applied to object recognition carry over to this domain: techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video, turn out to be quite useful and give some of the best results on action recognition tasks.

Let me conclude here. I think our community has made a lot of progress on object recognition, action recognition, and so on, but a lot more needs to be done. There is a phrase that people in the multimedia information systems community use, the so-called semantic gap: their point is that images and videos are typically represented as pixels, pixel brightness values, pixel RGB values, and so on, whereas what we are really interested in is the semantic content. What are the objects in the scene? What scene is it? What are the events taking place? That is what we would like to get at, and we're not there yet; we are nowhere near human performance. But I think we have made significant progress, and more will continue to happen over the next few years. Thank you.