
How a computer learns to recognize objects instantly

  • 0:01 - 0:02
    Ten years ago,
  • 0:02 - 0:05
    computer vision researchers
    thought that getting a computer
  • 0:05 - 0:07
    to tell the difference
    between a cat and a dog
  • 0:08 - 0:09
    would be almost impossible,
  • 0:10 - 0:13
    even with the significant advances
    in the state of artificial intelligence.
  • 0:13 - 0:17
    Now we can do it at a level
    greater than 99 percent accuracy.
  • 0:18 - 0:20
    This is called image classification --
  • 0:20 - 0:23
    give it an image,
    and it puts a label on that image --
  • 0:23 - 0:26
    and computers know
    thousands of other categories as well.
  • 0:27 - 0:30
    I'm a graduate student
    at the University of Washington,
  • 0:30 - 0:31
    and I work on a project called Darknet,
  • 0:32 - 0:33
    which is a neural network framework
  • 0:33 - 0:36
    for training and testing
    computer vision models.
  • 0:36 - 0:39
    So let's just see what Darknet thinks
  • 0:39 - 0:41
    of this image that we have.
  • 0:43 - 0:45
    When we run our classifier
  • 0:45 - 0:46
    on this image,
  • 0:46 - 0:49
    we see we don't just get
    a prediction of dog or cat,
  • 0:49 - 0:51
    we actually get
    specific breed predictions.
  • 0:51 - 0:53
    That's the level
    of granularity we have now.
  • 0:53 - 0:55
    And it's correct.
  • 0:55 - 0:57
    My dog is in fact a malamute.
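The breed-level prediction the talk describes is the standard shape of an image classifier's output: a score per class, pushed through a softmax, read off in ranked order. A minimal sketch with made-up scores (these are not Darknet's real outputs or class list):

```python
# Hypothetical classifier logits for one image (invented for illustration):
# an image classifier ends in a softmax over class scores, and we read off
# the highest-probability labels, down to specific breeds.
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["malamute", "siberian husky", "eskimo dog", "tabby cat"]
raw_scores = [7.1, 4.3, 3.9, 0.2]          # made-up logits
probs = softmax(raw_scores)

# Rank labels by probability, like the breed predictions in the talk.
ranked = sorted(zip(labels, probs), key=lambda p: p[1], reverse=True)
for label, p in ranked[:3]:
    print(f"{label}: {p:.2f}")
```

The top entry is the single label the classifier "puts on" the image; the runner-up scores are where the breed-level granularity shows.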
  • 0:57 - 1:01
    So we've made amazing strides
    in image classification,
  • 1:01 - 1:03
    but what happens
    when we run our classifier
  • 1:03 - 1:05
    on an image that looks like this?
  • 1:07 - 1:08
    Well ...
  • 1:13 - 1:17
    We see that the classifier comes back
    with a pretty similar prediction.
  • 1:17 - 1:20
    And it's correct,
    there is a malamute in the image,
  • 1:20 - 1:23
    but just given this label,
    we don't actually know that much
  • 1:23 - 1:25
    about what's going on in the image.
  • 1:25 - 1:27
    We need something more powerful.
  • 1:27 - 1:30
    I work on a problem
    called object detection,
  • 1:30 - 1:33
    where we look at an image
    and try to find all of the objects,
  • 1:33 - 1:34
    put bounding boxes around them
  • 1:34 - 1:36
    and say what those objects are.
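The definition just given pins down what a detector returns: not one label per image, but a list of boxes, each with a class and a confidence. A sketch of that result structure, with coordinates invented for illustration:

```python
# A detection is a bounding box plus a class label and a confidence score.
# The boxes below are made up, not real detector output.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    confidence: float       # 0..1
    x: int                  # top-left corner, in pixels
    y: int
    w: int                  # box width and height, in pixels
    h: int

detections = [
    Detection("dog", 0.94, x=40,  y=60, w=220, h=180),
    Detection("cat", 0.88, x=300, y=90, w=140, h=120),
]

for d in detections:
    print(f"{d.label} ({d.confidence:.0%}) at ({d.x},{d.y}) size {d.w}x{d.h}")
```

This is exactly the "relative locations and size" information the talk points to: a downstream system can compare boxes, not just labels.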
  • 1:36 - 1:40
    So here's what happens
    when we run a detector on this image.
  • 1:41 - 1:43
    Now, with this kind of result,
  • 1:44 - 1:46
    we can do a lot more
    with our computer vision algorithms.
  • 1:46 - 1:49
    We see that it knows
    that there's a cat and a dog.
  • 1:49 - 1:51
    It knows their relative locations,
  • 1:52 - 1:53
    their size.
  • 1:53 - 1:55
    It may even know some extra information.
  • 1:55 - 1:57
    There's a book sitting in the background.
  • 1:57 - 2:01
    And if you want to build a system
    on top of computer vision,
  • 2:01 - 2:04
    say a self-driving vehicle
    or a robotic system,
  • 2:04 - 2:06
    this is the kind
    of information that you want.
  • 2:07 - 2:10
    You want something so that
    you can interact with the physical world.
  • 2:11 - 2:13
    Now, when I started working
    on object detection,
  • 2:13 - 2:16
    it took 20 seconds
    to process a single image.
  • 2:16 - 2:20
    And to get a feel for why
    speed is so important in this domain,
  • 2:21 - 2:24
    here's an example of an object detector
  • 2:24 - 2:26
    that takes two seconds
    to process an image.
  • 2:26 - 2:29
    So this is 10 times faster
  • 2:29 - 2:32
    than the 20-seconds-per-image detector,
  • 2:32 - 2:35
    and you can see that by the time
    it makes predictions,
  • 2:35 - 2:37
    the entire state of the world has changed,
  • 2:38 - 2:40
    and this wouldn't be very useful
  • 2:40 - 2:42
    for an application.
  • 2:42 - 2:44
    If we speed this up
    by another factor of 10,
  • 2:44 - 2:47
    this is a detector running
    at five frames per second.
  • 2:47 - 2:49
    This is a lot better,
  • 2:49 - 2:51
    but for example,
  • 2:51 - 2:53
    if there's any significant movement,
  • 2:53 - 2:56
    I wouldn't want a system
    like this driving my car.
  • 2:57 - 3:00
    This is our detection system
    running in real time on my laptop.
  • 3:01 - 3:04
    So it smoothly tracks me
    as I move around the frame,
  • 3:04 - 3:08
    and it's robust to a wide variety
    of changes in size,
  • 3:09 - 3:11
    pose,
  • 3:11 - 3:13
    forward, backward.
  • 3:13 - 3:14
    This is great.
  • 3:14 - 3:16
    This is what we really need
  • 3:16 - 3:19
    if we're going to build systems
    on top of computer vision.
  • 3:19 - 3:23
    (Applause)
  • 3:24 - 3:26
    So in just a few years,
  • 3:26 - 3:29
    we've gone from 20 seconds per image
  • 3:29 - 3:33
    to 20 milliseconds per image,
    a thousand times faster.
  • 3:33 - 3:34
    How did we get there?
  • 3:34 - 3:37
    Well, in the past,
    object detection systems
  • 3:37 - 3:39
    would take an image like this
  • 3:39 - 3:42
    and split it into a bunch of regions
  • 3:42 - 3:45
    and then run a classifier
    on each of these regions,
  • 3:45 - 3:47
    and high scores for that classifier
  • 3:47 - 3:51
    would be considered
    detections in the image.
  • 3:51 - 3:55
    But this involved running a classifier
    thousands of times over an image,
  • 3:55 - 3:58
    thousands of neural network evaluations
    to produce detections.
  • 3:59 - 4:04
    Instead, we trained a single network
    to do all of the detection for us.
  • 4:04 - 4:08
    It produces all of the bounding boxes
    and class probabilities simultaneously.
  • 4:09 - 4:12
    With our system, instead of looking
    at an image thousands of times
  • 4:12 - 4:14
    to produce detections,
  • 4:14 - 4:15
    you only look once,
  • 4:15 - 4:18
    and that's why we call it
    the YOLO method of object detection.
  • 4:19 - 4:23
    So with this speed,
    we're not just limited to images;
  • 4:23 - 4:26
    we can process video in real time.
  • 4:26 - 4:29
    And now, instead of just seeing
    that cat and dog,
  • 4:29 - 4:32
    we can see them move around
    and interact with each other.
  • 4:35 - 4:37
    This is a detector that we trained
  • 4:37 - 4:41
    on 80 different classes
  • 4:41 - 4:44
    in Microsoft's COCO dataset.
  • 4:44 - 4:48
    It has all sorts of things
    like spoon and fork, bowl,
  • 4:48 - 4:49
    common objects like that.
  • 4:50 - 4:53
    It has a variety of more exotic things:
  • 4:53 - 4:57
    animals, cars, zebras, giraffes.
  • 4:57 - 4:59
    And now we're going to do something fun.
  • 4:59 - 5:01
    We're just going to go
    out into the audience
  • 5:01 - 5:03
    and see what kind of things we can detect.
  • 5:03 - 5:04
    Does anyone want a stuffed animal?
  • 5:06 - 5:08
    There are some teddy bears out there.
  • 5:10 - 5:15
    And we can turn down
    our threshold for detection a little bit,
  • 5:15 - 5:18
    so we can find more of you guys
    out in the audience.
  • 5:20 - 5:22
    Let's see if we can get these stop signs.
  • 5:22 - 5:24
    We find some backpacks.
  • 5:26 - 5:28
    Let's just zoom in a little bit.
  • 5:30 - 5:32
    And this is great.
  • 5:32 - 5:35
    And all of the processing
    is happening in real time
  • 5:35 - 5:36
    on the laptop.
  • 5:37 - 5:39
    And it's important to remember
  • 5:39 - 5:42
    that this is a general purpose
    object detection system,
  • 5:42 - 5:47
    so we can train this for any image domain.
  • 5:48 - 5:51
    The same code that we use
  • 5:51 - 5:53
    to find stop signs or pedestrians,
  • 5:53 - 5:55
    bicycles in a self-driving vehicle,
  • 5:55 - 5:58
    can be used to find cancer cells
  • 5:58 - 6:01
    in a tissue biopsy.
  • 6:01 - 6:05
    And there are researchers around the globe
    already using this technology
  • 6:06 - 6:10
    for advances in things
    like medicine, robotics.
  • 6:10 - 6:11
    This morning, I read a paper
  • 6:11 - 6:16
    where they were taking a census
    of animals in Nairobi National Park
  • 6:16 - 6:19
    with YOLO as part
    of this detection system.
  • 6:19 - 6:22
    And that's because Darknet is open source
  • 6:22 - 6:24
    and in the public domain,
    free for anyone to use.
  • 6:26 - 6:31
    (Applause)
  • 6:31 - 6:36
    But we wanted to make detection
    even more accessible and usable,
  • 6:36 - 6:40
    so through a combination
    of model optimization,
  • 6:40 - 6:43
    network binarization and approximation,
  • 6:43 - 6:47
    we actually have object detection
    running on a phone.
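One of the ingredients named here, network binarization, can be sketched in miniature: replace each floating-point weight with its sign, so a dot product needs only additions and subtractions (and, once the signs are packed into bits, XNOR and popcount on real hardware). This is a toy illustration of the idea, not the actual phone implementation:

```python
# Toy sketch of network binarization: each weight becomes its sign (+1/-1),
# trading a little accuracy for much cheaper arithmetic and storage.
def binarize(weights):
    return [1.0 if w >= 0 else -1.0 for w in weights]

def dot(xs, ws):
    return sum(x * w for x, w in zip(xs, ws))

weights = [0.7, -0.3, 0.1, -0.9]
inputs = [1.0, 2.0, 3.0, 4.0]

full = dot(inputs, weights)              # full-precision result
binary = dot(inputs, binarize(weights))  # cheap binarized approximation
print(full, binary)
```

The binarized result only approximates the full-precision one; the surprise behind the phone demo is that, done carefully across a whole network, the approximation still detects objects well.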
  • 6:53 - 6:58
    (Applause)
  • 6:59 - 7:04
    And I'm really excited because
    now we have a pretty powerful solution
  • 7:04 - 7:06
    to this low-level computer vision problem,
  • 7:06 - 7:10
    and anyone can take it
    and build something with it.
  • 7:10 - 7:13
    So now the rest is up to all of you
  • 7:13 - 7:16
    and people around the world
    with access to this software,
  • 7:16 - 7:20
    and I can't wait to see what people
    will build with this technology.
  • 7:20 - 7:21
    Thank you.
  • 7:21 - 7:25
    (Applause)
Title:
How a computer learns to recognize objects instantly
Speaker:
Joseph Redmon
Description:

Ten years ago, researchers thought that getting a computer to tell the difference between a cat and a dog would be almost impossible. Today, computer vision systems do it with greater than 99 percent accuracy. How? Joseph Redmon works on the YOLO (You Only Look Once) system, an open-source method of object detection that can identify objects in images and video -- from zebras to stop signs -- with lightning-quick speed. In a remarkable live demo, Redmon shows off this important step forward for applications like self-driving cars, robotics and even cancer detection.

Video Language:
English
Team:
closed TED
Project:
TEDTalks
Duration:
07:37
