Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem, which is detecting objects in images of 3D scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge, which has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images, and in each of these images they have marked a certain set of objects. The object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, and so on; there are twenty different categories in all. For each object belonging to a category, they have marked its bounding box. So, for example, here is the bounding box corresponding to the dog in this image, there is a bounding box corresponding to a horse here, and there will also be bounding boxes corresponding to the people, because in this image we have horses and people. The goal is to detect these objects. So what a computer program is supposed to do, say when we are trying to find dogs, is to mark bounding boxes corresponding to where the dogs are in the image. It will then be judged on whether each dog is in the right location, that is, whether the predicted bounding box overlaps sufficiently with the correct bounding box. This is the dominant dataset for studying object detection. Now, let's see what techniques we can use for addressing this problem. We start, of course, with the basic paradigm of the multi-scale sliding window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. The basic idea is that we consider a window, let's say starting in the top-left corner of the image; this green box corresponds to one of those windows. We then evaluate the answer to the question: is there a face there, or is there a bus there? We shift the window slightly and ask the same question. And since the people in an image can come in a variety of sizes, we have to repeat this process for windows of different sizes, so as to detect small objects as well as large objects. A good and standard building block for the window classifier is a linear support vector machine trained on histogram of oriented gradients (HOG) features; a sketch of the scanning loop itself follows below.
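To make the scanning loop concrete, here is a minimal sketch of multi-scale sliding-window detection. The classifier `score_window`, the window size, the stride, and the scale step are all placeholders I am assuming for illustration, not parameters of any particular system discussed here:

```python
import numpy as np

def multiscale_sliding_window(image, score_window, window=(64, 128),
                              stride=8, scale_step=1.25, threshold=0.0):
    """Scan an image pyramid with a fixed-size window.

    `score_window` is any function mapping a window-sized crop to a
    real-valued score (e.g. a linear SVM on HOG features); higher means
    more object-like. Returns (x, y, w, h, score) boxes mapped back to
    original image coordinates.
    """
    detections = []
    scale = 1.0
    img = image
    while img.shape[0] >= window[1] and img.shape[1] >= window[0]:
        for y in range(0, img.shape[0] - window[1] + 1, stride):
            for x in range(0, img.shape[1] - window[0] + 1, stride):
                crop = img[y:y + window[1], x:x + window[0]]
                s = score_window(crop)
                if s > threshold:
                    # map the box back to the original resolution
                    detections.append((int(x * scale), int(y * scale),
                                       int(window[0] * scale),
                                       int(window[1] * scale), s))
        # downsample by scale_step (nearest-neighbor, to stay self-contained)
        scale *= scale_step
        h, w = int(image.shape[0] / scale), int(image.shape[1] / scale)
        if h < window[1] or w < window[0]:
            break
        ys = (np.arange(h) * scale).astype(int)
        xs = (np.arange(w) * scale).astype(int)
        img = image[np.ix_(ys, xs)]
    return detections
```

Scanning a shrunken image with a fixed window is equivalent to scanning the original with a larger window, which is why one pyramid plus one template covers objects of many sizes.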
This framework was introduced by Dalal and Triggs in 2005, and their paper has details about how they compute each of the blocks and how they normalize; if you are interested in those details, you should read that paper (a simplified sketch of the descriptor also appears at the end of this passage). Now, note that the Dalal and Triggs approach was tested on pedestrians, and in the case of pedestrians a single template is enough: you try to detect the whole object in one go. When we deal with more complex objects, such as people in general poses, or dogs and cats, we find that these are very non-rigid, so one single rigid template is not effective. What we really want are part-based approaches. Nowadays, there are two dominant part-based approaches. The first is the so-called deformable part models, due to Felzenszwalb et al.; the archival paper on that appeared around 2010. The other approach is the so-called poselets, and this is due to Lubomir Bourdev and various other collaborators in my group. So, what's the basic idea? Let me get into Felzenszwalb's approach first. Their basic idea is to have a root filter, which tries to find the object as a whole, and then a set of part filters, which might correspond to, say, detectors for faces or legs and so forth, except that these part filters have to fire in certain spatial relationships with respect to the root filter. So the overall detector is a combination of a holistic detector and a set of part filters that must be in certain spatial relationships with respect to the whole object. This requires training both the root filter and the various part filters, which can be done using the so-called latent SVM approach, and it does not require any extra annotation. And note that when I said parts such as faces and legs, that was me getting carried away: the detected parts need not correspond to anything semantically meaningful. In the case of the poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. Suppose you have images of people; these images might be annotated with keypoints corresponding to the left shoulder, right shoulder, left elbow, right elbow, and so on. Other object categories will have other keypoints: for an airplane, for example, you might have a keypoint on the tip of the nose or the tips of the wings, and so on and so forth.
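Picking up the promise above, here is a deliberately simplified sketch of a HOG-style descriptor. It keeps the per-cell orientation histograms but collapses Dalal and Triggs's overlapping block normalization into a single global normalization and skips their interpolation steps, so it illustrates the idea rather than reproducing their exact descriptor:

```python
import numpy as np

def hog_features(gray, cell=8, bins=9):
    """Simplified HOG: per-cell histograms of gradient orientation,
    weighted by gradient magnitude, followed by one global L2
    normalization (the real descriptor normalizes overlapping
    2x2-cell blocks)."""
    gray = gray.astype(float)
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]   # centered [-1, 0, 1] filter
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientations
    H, W = gray.shape
    ch, cw = H // cell, W // cell
    hist = np.zeros((ch, cw, bins))
    b = (ang / (180.0 / bins)).astype(int) % bins   # orientation bin per pixel
    for i in range(ch * cell):
        for j in range(cw * cell):
            hist[i // cell, j // cell, b[i, j]] += mag[i, j]
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-6)
```

A linear SVM trained on such vectors, one per window, is the standard detector building block mentioned above.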
This keypoint annotation requires extra work, because somebody has to go through all the images in the dataset and mark the keypoints, but the consequence is that we will be able to do a few more things afterwards. Here's a slide which shows how object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model is illustrated with the example of bicycle detection. In fact, you don't train just one model; you train a mixture of models. So there is a model here corresponding to the side view of a bicycle. The root filter is shown here; it is a HOG template, looking for edges of particular orientations, such as might be found on the side view of a bicycle. Then we have various part filters. Each of the rectangles here corresponds to a part filter; this one, for instance, corresponds to something like a template for detecting wheels. What we then have to do to come up with the final score is to combine the score of the root filter's HOG template with the scores of the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like the one here, and so for those they will have a different model. Again the model is shown here, and here the wheels and the other parts may be somewhat different. So overall, you have a mixture model with multiple components corresponding to different poses, and each component, as I said, consists of a root filter and various part filters. There is some subtlety in the training, because the annotations carry no labels for parts or keypoints, so the learning approach has to guess where the parts should be as part of the training process; you can find the details in their paper.
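As an illustration of how a root filter and part filters can combine, here is a much simplified sketch of scoring one mixture component. It brute-forces the displacement search that the real DPM implements efficiently with a generalized distance transform, and it ignores the fact that the real model evaluates parts at twice the root's resolution; the array names, the search radius, and the quadratic-plus-linear deformation cost form are assumptions for the sketch:

```python
import numpy as np

def dpm_component_score(root_score_map, part_score_maps, anchors, defcosts):
    """Score one component of a simplified deformable part model at
    every root location.

    root_score_map  : HxW root-filter response map
    part_score_maps : list of HxW response maps, one per part
    anchors         : list of (dy, dx) ideal part offsets from the root
    defcosts        : list of (a, b) deformation weights

    Each part contributes max over displacements d of
        part_score(p + anchor + d) - a*||d||^2 - b*||d||_1.
    """
    H, W = root_score_map.shape
    total = root_score_map.copy()
    R = 4  # search radius for part displacement
    for pmap, (ay, ax), (a, b) in zip(part_score_maps, anchors, defcosts):
        best = np.full((H, W), -np.inf)
        for dy in range(-R, R + 1):
            for dx in range(-R, R + 1):
                pen = a * (dy * dy + dx * dx) + b * (abs(dy) + abs(dx))
                shifted = np.full((H, W), -np.inf)
                ys = np.arange(H) + ay + dy
                xs = np.arange(W) + ax + dx
                vy = (ys >= 0) & (ys < H)
                vx = (xs >= 0) & (xs < W)
                shifted[np.ix_(vy, vx)] = pmap[np.ix_(ys[vy], xs[vx])]
                best = np.maximum(best, shifted - pen)  # best placement so far
        total += best
    return total
```

The deformation penalty is what makes the parts "deformable": a part may drift from its anchor if its filter response gains more than the penalty costs.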
How well does it do? There is a standard methodology that we use in computer vision for evaluating detection performance, and here is how we do it for the case of, say, a motorcycle detector. One computes the so-called precision-recall curve. The idea is that the detection algorithm comes up with guesses of bounding boxes where the motorbikes may be, and we then evaluate each of these guessed bounding boxes: is it right or wrong? A guess is judged to be right if its intersection over union with respect to a true motorbike's bounding box is at least 50%. Then we have a choice of how strict to be, via a threshold on the detection score. We could pass through most of our candidate bounding boxes, and if you guess enough of them then of course you are guaranteed to find all of the motorbikes, but that hardly seems right. So what we do is pick a threshold, and with that threshold we can evaluate the precision and the recall. These terms have the following meaning. Precision means what fraction of the detections you declared are actually true motorcycles; recall is the question of how many of the true motorcycles you managed to detect. Ideally, we want precision to be 100 percent and recall to be 100 percent. In reality, it doesn't work out that way; we are able to detect only some fraction of the true motorbikes. So here, for example, at this point the precision is 0.7, meaning that the detections we declare are 70 percent accurate, and this point corresponds to a recall of something like 55 percent, meaning that at that threshold we recover 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false positives, but we will manage to detect more of the true motorbikes. So as this curve goes down, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case that area would be 100 percent; in fact it is something like 50 to 60 percent for these algorithms. That area is what we call AP, or average precision, and it is how we compare different algorithms. Here is the precision-recall curve for a different category, namely person detection, and the different curves again correspond to different algorithms: this algorithm is probably not a good one, this algorithm is a better one. And notice that in both examples we are not able to detect all the people; if you look at the roughly 30 percent of the people who are not detected by any approach, usually there is heavy occlusion or an unusual pose, and so on. So there are phenomena that make life difficult for us.
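Since the 50% overlap criterion and the area under the precision-recall curve are the heart of this evaluation, here is a minimal sketch of both. It is the uninterpolated form; the official PASCAL protocol differs in details (early years used 11-point interpolated precision):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def average_precision(is_true_positive, scores, n_ground_truth):
    """Area under the precision-recall curve, sweeping the score threshold.

    is_true_positive[i] says whether detection i matched a previously
    unmatched ground-truth box with iou >= 0.5 (the PASCAL criterion)."""
    order = np.argsort(-np.asarray(scores))          # most confident first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1.0)  # fraction of detections correct
    recall = cum_tp / float(n_ground_truth)          # fraction of true objects found
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)                       # step integration over recall
        prev_r = r
    return ap
```

Sweeping the threshold from strict to lenient traces the curve from high precision and low recall toward the opposite corner, exactly as described above.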
Finally, the PASCAL VOC people have computed the average precision for every class, and they give two measures. Max means the best algorithm for that category: the max for motorbike is something like 58%, meaning that the best algorithm for detecting motorbikes has an average precision of 58%. The median is, of course, the median over the different algorithms that were submitted. So we conclude that some categories are easier than others. Motorbikes are probably the easiest; their average precision is 58%. And something like potted plant is really hard to detect; the average precision there is sixteen percent. So, if you want to ask how well we are doing: the categories where the best average precision is over 50 percent are motorbike, bicycle, bus, airplane, horse, car, cat, and train. So, about 50%. You may like that or not, in the sense that this is a case of the glass being half full or half empty; since it's about 50%, maybe you can call it both. Let's look a little bit at some of the difficult categories. Here is the category of boat, and the average precision here is about [inaudible]. If you look at this set of examples, you will see why this is so hard: there is so much variation in appearance from one boat to another that it's really difficult to train a detector to manage all these cases. Okay, an even more difficult example: chairs. Here we are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now imagine you're looking for a HOG template which is going to detect the characteristic edges corresponding to a chair. You can readily see that there is no hope of managing that. Probably, the way humans detect chairs is by making use of the fact that there's a human sitting on one in a certain pose, so there is a lot of contextual information which currently is not being captured by the algorithms. I'll turn to images of people now. Analyzing images of people is very important: it enables us to build human-computer interaction systems, it enables us to analyze video, recognize actions, and so on and so forth. It's made hard by the fact that people appear in a variety of poses and a variety of clothing, and can be occluded, can be small, can be large, and so on. So this is a really challenging category, even though it's perhaps the most important category for object recognition.
So, I'm going to show you some research from an approach which is based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors. Face detection, we know, works well, and so does pedestrian detection when you're talking about a vertical, standing or walking pedestrian. Essentially, both of these rely on pattern matching, and they capture patterns that are common and visually characteristic. But those are not the only common and characteristic patterns. For instance, we can have a pattern corresponding to a pair of legs, and if we can detect that, we are sure that we are looking at a person. Or we can have a pattern which doesn't correspond to a single anatomical part, say half of a face together with half of the torso and the shoulder; that is fine, it is still a pretty characteristic observation of a person. Now, the way we train face detectors, of course, is that we have images where all the faces have been marked out, so the faces can be used directly as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs, and face plus shoulder, and so on? The poselet idea is exactly to train such detectors, but without having to determine the configurations in advance. First, let me show you some examples of what poselets are. The name is "pose" plus the diminutive "-let": a poselet is a detector tuned to a small part of the human pose. So the top row here corresponds to a face, upper body, and hands in a certain configuration; the second row corresponds to two legs; the third row corresponds to the back view of a person. In fact, we can build a pretty long list of these poselets (a sketch of how the training examples for one poselet can be gathered follows below).
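To give a feel for how the keypoint annotations make this possible, here is a rough sketch of gathering training examples for one poselet: given a seed patch, rank all annotated candidate patches by how well their keypoint configurations align under a similarity transform, and keep the closest ones as positives. The alignment method and both function names are my assumptions for illustration; the actual procedure in the poselets papers differs in its details:

```python
import numpy as np

def keypoint_distance(seed_kp, cand_kp):
    """Dissimilarity between two keypoint configurations, each an (N, 2)
    array of (x, y) positions inside its patch, with NaN rows for
    keypoints that are not visible. Configurations are compared after a
    least-squares similarity alignment (Procrustes-style)."""
    vis = ~(np.isnan(seed_kp[:, 0]) | np.isnan(cand_kp[:, 0]))
    if vis.sum() < 3:
        return np.inf                      # too few shared keypoints to compare
    a, b = seed_kp[vis], cand_kp[vis]
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(b.T @ a)      # optimal rotation via SVD
    scale = s.sum() / (b ** 2).sum()       # optimal isotropic scale
    b_aligned = scale * (b @ (u @ vt))
    return np.sqrt(((a - b_aligned) ** 2).sum() / vis.sum())

def collect_poselet_examples(seed_kp, all_kps, k=200):
    """Indices of the k candidate patches whose keypoint configurations
    are most similar to the seed; these become the positive training
    examples for one poselet classifier (e.g. HOG + linear SVM)."""
    d = np.array([keypoint_distance(seed_kp, kp) for kp in all_kps])
    return np.argsort(d)[:k]
```

The point is that the annotation, not a human designer, decides which recurring pose fragments become detectors.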
Now, the value of these poselets is that they enable us to do later tasks more easily. For example, we can train gender classifiers. Say we want to distinguish men from women: that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by many women is different. So once we have this machinery for training poselet detectors, we can actually train two versions of each poselet detector, one for male faces and one for female faces, and we can do that for every poselet. Essentially this gives us a handle on how to come up with finer-grained classifications of people. I'm going to show you some results here; these are actual results from this approach. The top row is where the algorithm thinks there are men, and the bottom row is where it thinks there are women. There are some mistakes: for example, these are really women, and so are these. So there are some mistakes, but it's surprisingly good. Here is what the detector thinks are people wearing long pants, in the top row, and people not wearing long pants, in the bottom row. Notice that once we can start to do this, we gain the ability to describe people: in an image, I want to be able to say that this is a tall, blond man wearing green trousers. Here, in the top row, is what the algorithm thinks are people wearing hats, and in the bottom row, people not wearing hats. This approach applies to detecting actions as well. Here are actions as revealed in still images, so you just have a single frame: this image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework, and we can, for example, train poselets corresponding to phoning people, running people, walking people, and people riding horses. I should note that the problem of detecting actions is a much more general problem, and we obviously don't want to just make use of the static information. If we have video, we can compute optical flow vectors, and that gives us an extra handle on the problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gesture, sign language, etc. If you want, you can think of objects as the nouns of English and actions as the verbs. And it turns out that some of the techniques that have been applied to object recognition carry over to this domain. Techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video, turn out to be quite useful and give some of the best results on action recognition tasks.
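Here is a minimal sketch of a bag-of-words pipeline in that spirit: local descriptors (for video, spatio-temporal ones) are quantized against a learned vocabulary, each clip becomes a word histogram, and a classifier is trained on the histograms. The use of scikit-learn's KMeans and LinearSVC is my choice for the sketch, not the tooling of any particular paper:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bag_of_words_histograms(descriptor_sets, vocab_size=500, kmeans=None):
    """Quantize local descriptors against a visual vocabulary and return
    one L1-normalized histogram per clip.

    descriptor_sets : list of (n_i, D) arrays, one per video clip
    kmeans          : pass a fitted model to reuse the training vocabulary
    """
    if kmeans is None:
        # learn the vocabulary by clustering all training descriptors
        kmeans = KMeans(n_clusters=vocab_size, n_init=4, random_state=0)
        kmeans.fit(np.vstack(descriptor_sets))
    hists = []
    for d in descriptor_sets:
        words = kmeans.predict(d)                    # assign each descriptor a word
        h = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
        hists.append(h / (h.sum() + 1e-9))           # normalize away clip length
    return np.array(hists), kmeans

# Hypothetical usage, with train_desc / train_labels assumed given:
# hists, vocab = bag_of_words_histograms(train_desc)
# clf = LinearSVC().fit(hists, train_labels)
# test_hists, _ = bag_of_words_histograms(test_desc, kmeans=vocab)
# predictions = clf.predict(test_hists)
```

Discarding the spatial and temporal layout of the words is what makes the representation robust, at the cost of ignoring structure, the same trade-off bags of visual words make for object recognition.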
Let me conclude here. I think our community has made a lot of progress on object recognition, action recognition, and so on, but a lot remains to be done. There is this phrase that people in the multimedia information systems community use: the so-called semantic gap. Their point is that images and videos are typically represented as pixels, pixel brightness values, pixel RGB values, and so on, whereas what we are really interested in is the semantic content. What are the objects in the scene? What scene is it? What are the events taking place? That is the level at which we would like to operate, and we're not there yet; we are nowhere near human performance. But I think we have made significant progress, and more will continue to happen over the next few years. Thank you.