Hello. After having spent a lot of time on the relatively simple two-dimensional problem of handwritten digit recognition, we are now ready to tackle the general problem, which is detecting 3D objects and scenes. The setting in which we study this problem these days is most commonly the so-called PASCAL object detection challenge. This has been going on for about five years or so. What these folks have done is collect a set of about 10,000 images where, in each image, they have marked a certain set of objects, and the object categories include dining table, dog, horse, motorbike, person, potted plant, sheep, etc. In all, they have twenty different categories. For each object belonging to a category, they have marked the bounding box. So, for example, here is the bounding box corresponding to the dog in this image, and there is a bounding box corresponding to a horse here; there will also be bounding boxes corresponding to the people, because in this image we have horses and people. The goal is to detect these objects. So what a computer program is supposed to do, if, say, we are trying to find dogs, is to mark bounding boxes corresponding to where the dogs are in the image. It will then be judged by whether each dog is in the right location: the predicted bounding box has to overlap sufficiently with the correct bounding box. So this is the dominant dataset for studying 3D object recognition.

Now, let's see what techniques we can use for addressing this problem. We start, of course, with the basic paradigm of the multi-scale sliding window. This paradigm was introduced for face detection back in the 90s, and since then it has also been used for pedestrian detection and so forth. The basic idea is that we consider a window, let's say starting in the top-left corner of the image. This green box corresponds to one of those windows, and we evaluate the answer to a question: is there a face there? Is there a bus there? Then we shift the window slightly and ask the same question. And since the objects could be at a variety of sizes, we have to repeat this process for different sizes of windows, so as to detect small objects as well as large objects.

A good and standard building block is a linear support vector machine trained on Histogram of Oriented Gradients (HOG) features. This is a framework introduced by Dalal & Triggs in 2005; they give details in their paper about how they compute each of the blocks and how they normalize, and if you are interested in the details, you should read that paper.
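To make the sliding-window paradigm concrete, here is a minimal sketch of a multi-scale detector in the spirit of Dalal & Triggs, assuming a grayscale image, scikit-image for HOG features, and an already-trained scikit-learn LinearSVC passed in as `svm`. The window size, stride, scales, and threshold are illustrative assumptions, not the values from their paper, and the training step is omitted.

    # Sketch: multi-scale sliding-window detection with HOG + linear SVM.
    # Assumes a grayscale float image and a pretrained LinearSVC `svm`
    # (hypothetical; training on positive/negative windows is omitted).
    import numpy as np
    from skimage.feature import hog
    from skimage.transform import rescale

    WINDOW = (128, 64)   # detection window (rows, cols); illustrative
    STRIDE = 8           # shift the window 8 pixels at a time

    def detect(image, svm, scales=(1.0, 0.75, 0.5), threshold=0.0):
        """Slide a fixed-size window over the image at several scales;
        return boxes (x0, y0, x1, y1, score) where the SVM fires."""
        detections = []
        for s in scales:
            img = rescale(image, s, anti_aliasing=True)
            H, W = img.shape[:2]
            for y in range(0, H - WINDOW[0], STRIDE):
                for x in range(0, W - WINDOW[1], STRIDE):
                    patch = img[y:y + WINDOW[0], x:x + WINDOW[1]]
                    feat = hog(patch, orientations=9,
                               pixels_per_cell=(8, 8),
                               cells_per_block=(2, 2))
                    score = svm.decision_function(feat[None, :])[0]
                    if score > threshold:
                        # map the window back to original image coordinates
                        detections.append((x / s, y / s,
                                           (x + WINDOW[1]) / s,
                                           (y + WINDOW[0]) / s, score))
        return detections

In practice one would follow this with non-maximum suppression to remove overlapping detections of the same object, but the loop above is the essence of the paradigm.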
Now, note that the Dalal & Triggs approach was tested on pedestrians, and in the case of pedestrians a single block is enough: you try to detect the whole object in one go. When we deal with more complex objects, like people in general poses, or dogs and cats, we find that these are very non-rigid, so one single rigid template is not effective. What we really want are part-based approaches. Nowadays, there are two dominant part-based approaches. The first is the so-called deformable part models, due to Felzenszwalb et al.; the journal article on this appeared around 2010. The other approach is the so-called poselets, due to Lubomir Bourdev and other collaborators in my group. So, what's the basic idea? Let me get into Felzenszwalb's approach first.

Their basic idea is to have a root filter, which tries to find the object as a whole, and then a set of part filters, which might correspond to, say, detectors for faces or legs and so forth; but these part filters have to fire in certain spatial relationships with respect to the root filter. So the overall detector is the combination of a holistic detector and a set of part filters which have to be in a certain relationship with respect to the whole object. This requires training both the root filter and the various part filters, which can be done using the so-called latent SVM approach, and it does not require any extra annotation. And note that I said parts such as faces and legs; that's me getting carried away. The detector's parts need not correspond to anything semantically meaningful.

In the case of the poselets approach, the idea is to have semantically meaningful parts, and the way they go about doing this is by making use of extra annotation. So, suppose you have images of people; these images might be annotated with keypoints corresponding to the left shoulder, right shoulder, left elbow, right elbow, and so on, while other object categories will have other keypoints: for example, for an airplane, you might have a keypoint on the tip of the nose or the tips of the wings, and so on and so forth. This requires extra work, because somebody has to go through all the images in the training set and mark these keypoints, but the consequence is that we will be able to do a few more things afterwards.

Here's a slide which shows how object detection with discriminatively trained part-based models works. This is the DPM model of Felzenszwalb, Girshick, McAllester, and Ramanan, and here the model is illustrated on bicycle detection. In fact, you don't train just one model; you train a mixture of models. So there is a model here corresponding to the side view of a bicycle. The root filter is shown here; it is a HOG template, looking for edges of particular orientations, as might be found on the side view of a bicycle. Then we have various part filters: each of the rectangles here corresponds to a part filter, and this one might correspond to something like a template detector for wheels. To come up with the final score, we combine the score of the HOG template of the root filter with the scores of the HOG templates for each of the parts. Note that this detector for the side view of a bicycle will probably not do a good job on front views of bicycles, like here, so for those they have a different model, shown here, where the parts are somewhat different. So, overall, you have a mixture model, with multiple models corresponding to different poses, and each model consists of a root filter and various part filters. There is some subtlety in the training, because the annotations do not label where the parts are: in the learning approach, you have to guess where the parts should be as part of the training process, and you can find the details in their paper.
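As a rough sketch of the scoring idea just described: the score of a root placement is the root filter's response plus, for each part, the best nearby part-filter response minus a deformation cost that penalizes drifting from the part's ideal offset. Everything below (precomputed response maps, anchors, quadratic deformation weights, and the brute-force search that stands in for the distance transform used in the real system) is an illustrative assumption, not Felzenszwalb et al.'s implementation.

    # Sketch: deformable-part-model scoring for one root placement.
    # root_resp and part_resps[i] are assumed precomputed response maps
    # (e.g. cross-correlation of HOG features with the learned filters).
    import numpy as np

    def score_at(root_resp, part_resps, anchors, defs, y, x, radius=4):
        """anchors[i] is part i's ideal offset from the root;
        defs[i] = (wy, wx) weights the squared-displacement penalty.
        Brute force replaces the distance transform of the real code."""
        score = root_resp[y, x]
        for resp, (ay, ax), (wy, wx) in zip(part_resps, anchors, defs):
            H, W = resp.shape
            best = -np.inf
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    py, px = y + ay + dy, x + ax + dx
                    if 0 <= py < H and 0 <= px < W:
                        # part response minus quadratic deformation cost
                        best = max(best,
                                   resp[py, px] - wy * dy**2 - wx * dx**2)
            score += best
        return score

A mixture model then simply runs one such scoring pass per component (side view, front view, and so on) and takes the best-scoring component at each location.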
How well does it do? There is a standard methodology that we use in computer vision for evaluating detection algorithms, and here is how we do it for the case of, say, a motorcycle detector. One computes the so-called precision-recall curve. The idea is that the detection algorithm is going to come up with guesses of bounding boxes where a motorbike may be, and we can then evaluate, for each of these guessed bounding boxes, whether it is right or wrong: a guess is judged to be right if its intersection over union with respect to the true motorbike's bounding box is at least 50%. Then we have a choice of how strict to be in a threshold. We could pass through most of our candidate bounding boxes, and if you guess enough of them, then of course you are guaranteed to find all of the motorbikes, but that would come at the cost of many false alarms. So the way we do this is to pick a threshold, and with that threshold we can evaluate the precision and recall. Precision and recall have the following meaning: precision asks, what fraction of the detections that you declared are actually true motorcycles? Recall asks, how many of the true motorcycles did you manage to detect? Ideally, we want precision to be 100 percent and recall to be 100 percent. In reality, it doesn't work out that way; we are able to detect some fraction of the true motorbikes. Here, for example, at this point the precision is 0.7, meaning that the detections we declare are 70 percent accurate, and this point corresponds to a recall of something like 55 percent, meaning that at that threshold we recall 55 percent of the true motorbikes. As we make the threshold more lenient, we are going to get more false positives, but we will manage to detect more of the true motorbikes. So the curve goes down, and in this range, this particular detector manages to detect something like 70 to 80 percent of the true motorcycles. The curves in this figure correspond to different algorithms, and the way we compare different algorithms is by measuring the area under the curve. In the ideal case, that would be 100 percent; in fact, it is something like 50 to 60 percent for these cases. That is what we call AP, or average precision, and that is how we compare different algorithms.

Here is the precision-recall curve for a different category, namely person detection, and the different curves correspond to different algorithms; this algorithm is probably not a good one, and this algorithm is a better one. Notice that in both examples we are not able to detect all the people, and if you look through the 30 percent of the people who are not detected by any approach, usually there is heavy occlusion, or unusual poses, and so on. So there are phenomena that make life difficult for us.

Finally, the PASCAL VOC people have computed the average precision for every class, and they give two measures. Max means the best algorithm for that category; so the max for motorbike is something like 58%, meaning that the best algorithm for detecting motorbikes has an average precision of 58%. The median is, of course, the median over the different algorithms that were submitted. So, we conclude that some categories are easier than others. Motorbikes are probably the easiest; their average precision is 58%. Something like potted plant is really hard to detect, and the average precision there is 16%.
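Concretely, here is a minimal sketch of the evaluation just described: a detection counts as correct if its intersection over union with an as-yet-unmatched ground-truth box is at least 0.5, and AP is the area under the precision-recall curve traced out as the score threshold is relaxed. The matching rule and the plain numerical integral are simplifications of the official PASCAL protocol.

    # Sketch: IoU test and average precision for one image's detections.
    import numpy as np

    def iou(a, b):
        """Intersection over union of boxes (x0, y0, x1, y1)."""
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def average_precision(detections, gt_boxes):
        """detections: list of (box, score); gt_boxes: list of true boxes.
        Each ground-truth box may be matched at most once."""
        detections = sorted(detections, key=lambda d: -d[1])
        matched = [False] * len(gt_boxes)
        tp = np.zeros(len(detections))
        fp = np.zeros(len(detections))
        for i, (box, _) in enumerate(detections):
            ious = [iou(box, g) for g in gt_boxes]
            j = int(np.argmax(ious)) if ious else -1
            if j >= 0 and ious[j] >= 0.5 and not matched[j]:
                tp[i], matched[j] = 1, True
            else:
                fp[i] = 1   # false positive: wrong place or duplicate
        recall = np.cumsum(tp) / max(len(gt_boxes), 1)
        precision = np.cumsum(tp) / (np.cumsum(tp) + np.cumsum(fp))
        # area under the precision-recall curve (simple integral)
        return float(np.trapz(precision, recall))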
So, if you ask where we are doing quite well: I think it's in all the categories where the average precision is over 50 percent, and that is motorbike, bicycle, bus, airplane, horse, car, cat, and train. So, about 50%. You may like that or not; it is the classic case of the glass being half full or half empty, and since it's about 50%, you can call it either way.

Let's look a little bit at some of the difficult categories. Here is the category of boat, and the average precision here is about [inaudible]. If you look at this set of examples, you will see why this is so hard: there is so much variation in appearance from one boat to another that it's really difficult to train a detector to manage all these cases. Okay, an even more difficult example: chairs. Here we are supposed to mark bounding boxes corresponding to the chairs, and here they are. Now, imagine you are looking for a HOG template which is going to detect the characteristic edges corresponding to a chair; you can see that there is no hope of managing that. Probably, the way humans detect chairs is by making use of the fact that there is a human sitting on the chair in a certain pose; so there is a lot of contextual information which currently is not being captured by the algorithms.

I'll turn to images of people now. Analyzing images of people is very important: it enables us to build human-computer interfaces, it enables us to analyze video, recognize actions, and so on and so forth. It is made hard by the fact that people appear in a variety of poses and a variety of clothing, and can be occluded, can be small, can be large, and so on. So this is a really challenging category, even though it is perhaps the most important category for object recognition. I'm going to show you some research from an approach which is based on poselets, the other part-based paradigm that I referred to. The big idea is that we can build on the success of face detectors and pedestrian detectors. Face detection, we know, works well, and so does pedestrian detection when we are talking about a vertical standing or walking pedestrian. Essentially, both of these rely on pattern matching: they capture patterns that are common and visually characteristic. But these are not the only common and visually characteristic patterns. For example, we can have a pattern corresponding to a pair of legs, and if we can detect that, we are sure that we are looking at a person. Or we can have a pattern which doesn't correspond to a single anatomical part, say half of the face, half of the torso, and the shoulder; that's fine, it is still a pretty characteristic pattern for a person. Now, the way we train face detectors, of course, is that we have images where all the faces have been marked out, so the faces can be used as positive examples for a machine learning algorithm. But how are we going to find all these configurations corresponding to legs, and face plus shoulder, and so on? The poselet idea is exactly to train such detectors, without our having to determine the set of parts in advance.

But first, let me show you some examples of poselets. The word poselet implies a small part of a pose: consider the human pose, and a poselet is a small part of it. The top row here corresponds to the face, upper body, and hands in a certain configuration. The second row corresponds to two legs.
The third row corresponds to the back view of a person. In fact, we can have a pretty long list of these poselets. Now, the value of these is that they enable us to do later tasks more easily. For example, we can train gender classifiers. We want to distinguish men from women, and that can be done from the face, also from the back view of a person, and from the legs, because the clothing worn by many women is different. So, once we have this idea of training poselet detectors, we can actually train two versions of each poselet detector, say one for male faces and one for female faces, and we can do that for each detector. Essentially, this gives us a handle on how to come up with finer-grained classifications of people.

I'm going to show you some results here; these are actual results from this approach. The top row is where the algorithm thinks the people are men, and the bottom row is where it thinks they are women. There are some mistakes: for example, these are really women, and so are these. So there are some mistakes, but it is surprisingly good. Here is what the detector thinks are people wearing long pants, in the top row, and people not wearing long pants, in the bottom row. Notice that once we can start to do this, we get the ability to describe people: in an image, I want to be able to say that this is a tall, blond man wearing green trousers. Here, in the top row, is what the algorithm thinks are people wearing hats, and in the bottom row, people not wearing hats.

This approach applies to detecting actions as well. Here are actions as revealed in still images, so you just have a single frame: this image corresponds to a sitting person, here is a person talking on the telephone, a person riding a horse, a person running, and so on. Again, the poselet paradigm can be adapted to this framework; for example, we can train poselets corresponding to people phoning, running, walking, and riding. I should note that the problem of detecting actions is a much more general problem, and we obviously don't want to make use of just the static information. If we have video, then we can compute optical flow vectors, and that gives us an extra handle on the problem. The kinds of actions we want to be able to recognize include movement and posture change, object manipulation, conversational gestures, sign language, and so on. If you want, you can think of objects as nouns in English and actions as verbs. And it turns out that some of the techniques that have been applied to object recognition carry over to this domain: techniques such as bags of spatio-temporal words, which are generalizations of SIFT features to video, turn out to be quite useful and give some of the best results on action recognition tasks, as sketched below.
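Here is a minimal sketch of the bag-of-spatio-temporal-words idea: cluster local space-time descriptors into a visual vocabulary, represent each video clip as a normalized histogram of word counts, and train any standard classifier on those histograms. The descriptor extraction itself (e.g. cuboids around space-time interest points) is assumed as input here, and the vocabulary size is an illustrative choice.

    # Sketch: bag of spatio-temporal words for action recognition.
    # Assumes per-video descriptor arrays are already extracted.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def build_vocabulary(all_descriptors, k=200):
        """all_descriptors: (N, D) array of space-time descriptors
        pooled from the training videos; k is the vocabulary size."""
        return KMeans(n_clusters=k, n_init=10).fit(all_descriptors)

    def bag_of_words(vocab, descriptors):
        """Normalized histogram of visual-word counts for one clip."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)   # normalize for clip length

    # Usage sketch (train_descs: list of per-video descriptor arrays,
    # labels: the corresponding action classes):
    #   vocab = build_vocabulary(np.vstack(train_descs))
    #   X = np.array([bag_of_words(vocab, d) for d in train_descs])
    #   clf = LinearSVC().fit(X, labels)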
Let me conclude here. I think our community has made a lot of progress on object recognition, action recognition, and so on, but a lot remains to be done. There is this phrase that people in the multimedia information systems community use: the so-called semantic gap. Their point is that images and videos are typically presented as pixels (pixel brightness values, pixel RGB values, and so on), whereas what we are really interested in is the semantic content. What are the objects in the scene? What scene is it? What are the events taking place? That is what we would like to get at, and we're not there yet; we are nowhere near human performance. But I think we have made significant progress, and more will continue to happen over the next few years. Thank you.