
How we're teaching computers to understand pictures

  • 0:02 - 0:06
    Let me show you something.
  • 0:06 - 0:10
    (Video) Girl: Okay, that's a cat
    sitting in a bed.
  • 0:10 - 0:14
    The boy is petting the elephant.
  • 0:14 - 0:19
    Those are people
    that are going on an airplane.
  • 0:19 - 0:21
    That's a big airplane.
  • 0:21 - 0:24
    Fei-Fei Li: This is
    a three-year-old child
  • 0:24 - 0:27
    describing what she sees
    in a series of photos.
  • 0:27 - 0:30
    She might still have a lot
    to learn about this world,
  • 0:30 - 0:35
    but she's already an expert
    at one very important task:
  • 0:35 - 0:38
    to make sense of what she sees.
  • 0:38 - 0:42
    Our society is more
    technologically advanced than ever.
  • 0:42 - 0:46
    We send people to the moon,
    we make phones that talk to us
  • 0:46 - 0:51
    or customize radio stations
    that can play only music we like.
  • 0:51 - 0:55
    Yet, our most advanced
    machines and computers
  • 0:55 - 0:58
    still struggle with this task.
  • 0:58 - 1:01
    So I'm here today
    to give you a progress report
  • 1:01 - 1:05
    on the latest advances
    in our research in computer vision,
  • 1:05 - 1:10
    one of the most cutting-edge
    and potentially revolutionary
  • 1:10 - 1:13
    technologies in computer science.
  • 1:13 - 1:17
    Yes, we have prototyped cars
    that can drive by themselves,
  • 1:17 - 1:21
    but without smart vision,
    they cannot really tell the difference
  • 1:21 - 1:25
    between a crumpled paper bag
    on the road, which can be run over,
  • 1:25 - 1:29
    and a rock that size,
    which should be avoided.
  • 1:29 - 1:33
    We have made fabulous megapixel cameras,
  • 1:33 - 1:36
    but we have not delivered
    sight to the blind.
  • 1:36 - 1:40
    Drones can fly over massive areas of land,
  • 1:40 - 1:42
    but don't have enough vision technology
  • 1:42 - 1:45
    to help us to track
    the changes of the rainforests.
  • 1:45 - 1:48
    Security cameras are everywhere,
  • 1:48 - 1:53
    but they do not alert us when a child
    is drowning in a swimming pool.
  • 1:54 - 2:00
    Photos and videos are becoming
    an integral part of global life.
  • 2:00 - 2:04
    They're being generated at a pace
    that's far beyond what any human,
  • 2:04 - 2:07
    or teams of humans, could hope to view,
  • 2:07 - 2:11
    and you and I are contributing
    to that at this TED.
  • 2:11 - 2:16
    Yet our most advanced software
    is still struggling to understand
  • 2:16 - 2:20
    and manage this enormous volume of content.
  • 2:20 - 2:25
    So in other words,
    collectively as a society,
  • 2:25 - 2:27
    we're very much blind,
  • 2:27 - 2:30
    because our smartest
    machines are still blind.
  • 2:32 - 2:34
    "Why is this so hard?" you may ask.
  • 2:34 - 2:37
    Cameras can take pictures like this one
  • 2:37 - 2:41
    by converting light into
    a two-dimensional array of numbers
  • 2:41 - 2:43
    known as pixels,
  • 2:43 - 2:45
    but these are just lifeless numbers.
  • 2:45 - 2:48
    They do not carry meaning in themselves.
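
A minimal sketch in Python (NumPy assumed) of what a camera actually hands us: a grid of numbers, with nothing intrinsically "cat-like" about any of them. The values are invented for illustration.

```python
# A minimal sketch: a grayscale photo is nothing but a 2D array of numbers.
# The values below are made up for illustration; a real camera would fill
# millions of such entries, one per pixel.
import numpy as np

image = np.array([
    [ 12,  80, 143],
    [ 56, 201, 177],
    [  9,  34, 250],
], dtype=np.uint8)  # each number is a brightness from 0 (black) to 255 (white)

print(image.shape)   # (3, 3): height x width
print(image[0, 2])   # 143 -- just a number, carrying no meaning by itself
```
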
  • 2:48 - 2:52
    Just like to hear is not
    the same as to listen,
  • 2:52 - 2:57
    to take pictures is not
    the same as to see,
  • 2:57 - 3:00
    and by seeing,
    we really mean understanding.
  • 3:01 - 3:07
    In fact, it took Mother Nature
    540 million years of hard work
  • 3:07 - 3:09
    to do this task,
  • 3:09 - 3:11
    and much of that effort
  • 3:11 - 3:17
    went into developing the visual
    processing apparatus of our brains,
  • 3:17 - 3:19
    not the eyes themselves.
  • 3:19 - 3:22
    So vision begins with the eyes,
  • 3:22 - 3:26
    but it truly takes place in the brain.
  • 3:26 - 3:31
    So for 15 years now, starting
    from my Ph.D. at Caltech
  • 3:31 - 3:34
    and then leading Stanford's Vision Lab,
  • 3:34 - 3:39
    I've been working with my mentors,
    collaborators and students
  • 3:39 - 3:42
    to teach computers to see.
  • 3:43 - 3:46
    Our research field is called
    computer vision and machine learning.
  • 3:46 - 3:50
    It's part of the general field
    of artificial intelligence.
  • 3:51 - 3:56
    So ultimately, we want to teach
    the machines to see just like we do:
  • 3:56 - 4:02
    naming objects, identifying people,
    inferring 3D geometry of things,
  • 4:02 - 4:08
    understanding relations, emotions,
    actions and intentions.
  • 4:08 - 4:14
    You and I weave together entire stories
    of people, places and things
  • 4:14 - 4:16
    the moment we lay our gaze on them.
  • 4:17 - 4:23
    The first step towards this goal
    is to teach a computer to see objects,
  • 4:23 - 4:26
    the building block of the visual world.
  • 4:26 - 4:30
    In its simplest terms,
    imagine this teaching process
  • 4:30 - 4:33
    as showing the computers
    some training images
  • 4:33 - 4:37
    of a particular object, let's say cats,
  • 4:37 - 4:41
    and designing a model that learns
    from these training images.
  • 4:41 - 4:43
    How hard can this be?
  • 4:43 - 4:47
    After all, a cat is just
    a collection of shapes and colors,
  • 4:47 - 4:52
    and this is what we did
    in the early days of object modeling.
  • 4:52 - 4:55
    We'd tell the computer algorithm
    in a mathematical language
  • 4:55 - 4:59
    that a cat has a round face,
    a chubby body,
  • 4:59 - 5:01
    two pointy ears, and a long tail,
  • 5:01 - 5:02
    and that looked all fine.
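
Those early hand-built models can be caricatured as explicit rules. The toy check below illustrates that style of modeling, not any actual system; every feature name and threshold is invented.

```python
# A toy rule-based "cat model" in the spirit of early object modeling.
# All features and thresholds here are invented for illustration; real
# systems encoded shapes mathematically, but were just as brittle.
def looks_like_cat(face_roundness, ear_count, ears_pointy, tail_length):
    return (face_roundness > 0.8      # round face
            and ear_count == 2        # two ears...
            and ears_pointy           # ...that are pointy
            and tail_length > 0.5)    # and a long tail

print(looks_like_cat(0.9, 2, True, 0.7))   # True for an upright, textbook cat
print(looks_like_cat(0.9, 0, False, 0.0))  # False once the ears and tail
                                           # are hidden from view
```
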
  • 5:03 - 5:05
    But what about this cat?
  • 5:05 - 5:06
    (Laughter)
  • 5:06 - 5:08
    It's all curled up.
  • 5:08 - 5:12
    Now you have to add another shape
    and viewpoint to the object model.
  • 5:12 - 5:14
    But what if cats are hidden?
  • 5:15 - 5:17
    What about these silly cats?
  • 5:19 - 5:22
    Now you get my point.
  • 5:22 - 5:25
    Even something as simple
    as a household pet
  • 5:25 - 5:29
    can present an infinite number
    of variations to the object model,
  • 5:29 - 5:32
    and that's just one object.
  • 5:33 - 5:35
    So about eight years ago,
  • 5:35 - 5:40
    a very simple and profound observation
    changed my thinking.
  • 5:41 - 5:44
    No one tells a child how to see,
  • 5:44 - 5:46
    especially in the early years.
  • 5:46 - 5:51
    They learn this through
    real-world experiences and examples.
  • 5:51 - 5:54
    If you consider a child's eyes
  • 5:54 - 5:57
    as a pair of biological cameras,
  • 5:57 - 6:01
    they take one picture
    about every 200 milliseconds,
  • 6:01 - 6:04
    the average interval between eye movements.
  • 6:04 - 6:10
    So by age three, a child would have seen
    hundreds of millions of pictures
  • 6:10 - 6:11
    of the real world.
  • 6:11 - 6:14
    That's a lot of training examples.
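
The arithmetic behind that estimate is easy to check. A back-of-the-envelope sketch, assuming one eye movement every 200 milliseconds and an assumed 12 waking hours a day:

```python
# Back-of-the-envelope check: one "picture" per eye movement,
# about every 200 ms. Waking hours per day is an assumption.
saccades_per_second = 1 / 0.2          # one eye movement every 200 ms
waking_hours_per_day = 12              # assumed for a young child
seconds_awake_per_day = waking_hours_per_day * 3600

pictures_per_day = saccades_per_second * seconds_awake_per_day
pictures_by_age_three = pictures_per_day * 365 * 3

print(f"{pictures_by_age_three:,.0f}")  # ~236,520,000 -- hundreds of millions
```
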
  • 6:14 - 6:20
    So instead of focusing solely
    on better and better algorithms,
  • 6:20 - 6:26
    my insight was to give the algorithms
    the kind of training data
  • 6:26 - 6:29
    that a child was given through experiences
  • 6:29 - 6:33
    in both quantity and quality.
  • 6:33 - 6:35
    Once we knew this,
  • 6:35 - 6:38
    we knew we needed to collect a data set
  • 6:38 - 6:42
    that has far more images
    than we have ever had before,
  • 6:42 - 6:45
    perhaps thousands of times more,
  • 6:45 - 6:49
    and together with Professor
    Kai Li at Princeton University,
  • 6:49 - 6:54
    we launched the ImageNet project in 2007.
  • 6:54 - 6:57
    Luckily, we didn't have to mount
    a camera on our head
  • 6:57 - 6:59
    and wait for many years.
  • 6:59 - 7:01
    We went to the Internet,
  • 7:01 - 7:05
    the biggest treasure trove of pictures
    that humans have ever created.
  • 7:05 - 7:08
    We downloaded nearly a billion images
  • 7:08 - 7:14
    and used crowdsourcing technology
    like the Amazon Mechanical Turk platform
  • 7:14 - 7:16
    to help us to label these images.
  • 7:16 - 7:21
    At its peak, ImageNet was one of
    the biggest employers
  • 7:21 - 7:24
    of the Amazon Mechanical Turk workers:
  • 7:24 - 7:28
    together, almost 50,000 workers
  • 7:28 - 7:32
    from 167 countries around the world
  • 7:32 - 7:36
    helped us to clean, sort and label
  • 7:36 - 7:40
    nearly a billion candidate images.
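
Noisy crowdsourced labels are commonly cleaned by asking several workers about the same image and keeping only confident agreements. The majority-vote sketch below illustrates the idea; it is a simplification, not ImageNet's actual quality-control pipeline, and the worker answers are invented.

```python
# A simplified sketch of cleaning crowdsourced labels by consensus.
# ImageNet's real quality-control pipeline was more sophisticated;
# the worker answers below are invented for illustration.
from collections import Counter

def consensus_label(worker_answers, min_agreement=0.75):
    """Keep a label only if enough workers agree; otherwise discard."""
    label, votes = Counter(worker_answers).most_common(1)[0]
    if votes / len(worker_answers) >= min_agreement:
        return label
    return None  # too ambiguous -- drop the candidate image

print(consensus_label(["cat", "cat", "cat", "dog"]))   # "cat" (3/4 agree)
print(consensus_label(["cat", "dog", "fox", "cat"]))   # None (no consensus)
```
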
  • 7:41 - 7:43
    That was how much effort it took
  • 7:43 - 7:47
    to capture even a fraction
    of the imagery
  • 7:47 - 7:51
    a child's mind takes in
    in the early developmental years.
  • 7:52 - 7:56
    In hindsight, this idea of using big data
  • 7:56 - 8:01
    to train computer algorithms
    may seem obvious now,
  • 8:01 - 8:05
    but back in 2007, it was not so obvious.
  • 8:05 - 8:09
    We were fairly alone on this journey
    for quite a while.
  • 8:09 - 8:14
    Some very friendly colleagues advised me
    to do something more useful for my tenure,
  • 8:14 - 8:18
    and we were constantly struggling
    for research funding.
  • 8:18 - 8:20
    Once, I even joked to my graduate students
  • 8:20 - 8:24
    that I would just reopen
    my dry cleaner's shop to fund ImageNet.
  • 8:24 - 8:29
    After all, that's how I funded
    my college years.
  • 8:29 - 8:31
    So we carried on.
  • 8:31 - 8:35
    In 2009, the ImageNet project delivered
  • 8:35 - 8:39
    a database of 15 million images
  • 8:39 - 8:44
    across 22,000 classes
    of objects and things
  • 8:44 - 8:47
    organized by everyday English words.
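
Concretely, a dataset organized by class can be laid out as one folder per word. The sketch below assumes a hypothetical directory layout for illustration; ImageNet's real organization follows the WordNet hierarchy.

```python
# A minimal sketch of how an ImageNet-style dataset might be laid out on
# disk: one folder per class, named by an everyday English word. The paths
# are hypothetical.
from pathlib import Path

def count_images_per_class(root):
    """Map each class folder name to the number of images inside it."""
    return {d.name: sum(1 for _ in d.glob("*.jpg"))
            for d in Path(root).iterdir() if d.is_dir()}

# e.g. imagenet/cat/0001.jpg, imagenet/airplane/0001.jpg, ...
# print(count_images_per_class("imagenet"))
```
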
  • 8:47 - 8:50
    In both quantity and quality,
  • 8:50 - 8:53
    this was an unprecedented scale.
  • 8:53 - 8:56
    As an example, in the case of cats,
  • 8:56 - 8:59
    we have more than 62,000 cats
  • 8:59 - 9:03
    of all kinds of looks and poses
  • 9:03 - 9:08
    and across all species
    of domestic and wild cats.
  • 9:08 - 9:12
    We were thrilled
    to have put together ImageNet,
  • 9:12 - 9:16
    and we wanted the whole research world
    to benefit from it,
  • 9:16 - 9:20
    so in the TED fashion,
    we opened up the entire data set
  • 9:20 - 9:23
    to the worldwide
    research community for free.
  • 9:25 - 9:29
    (Applause)
  • 9:29 - 9:34
    Now that we have the data
    to nourish our computer brain,
  • 9:34 - 9:38
    we're ready to come back
    to the algorithms themselves.
  • 9:38 - 9:43
    As it turned out, the wealth
    of information provided by ImageNet
  • 9:43 - 9:48
    was a perfect match to a particular class
    of machine learning algorithms
  • 9:48 - 9:50
    called convolutional neural networks,
  • 9:50 - 9:55
    pioneered by Kunihiko Fukushima,
    Geoff Hinton, and Yann LeCun
  • 9:55 - 9:59
    back in the 1970s and '80s.
  • 9:59 - 10:05
    Just like the brain consists
    of billions of highly connected neurons,
  • 10:05 - 10:08
    a basic operating unit in a neural network
  • 10:08 - 10:11
    is a neuron-like node.
  • 10:11 - 10:13
    It takes input from other nodes
  • 10:13 - 10:16
    and sends output to others.
  • 10:16 - 10:21
    Moreover, these hundreds of thousands
    or even millions of nodes
  • 10:21 - 10:24
    are organized in hierarchical layers,
  • 10:24 - 10:27
    also similar to the brain.
  • 10:27 - 10:31
    In a typical neural network we use
    to train our object recognition model,
  • 10:31 - 10:35
    there are 24 million nodes,
  • 10:35 - 10:38
    140 million parameters,
  • 10:38 - 10:41
    and 15 billion connections.
  • 10:41 - 10:43
    That's an enormous model.
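
For a rough feel of the architecture, here is a toy convolutional network in PyTorch (a modern library assumed for illustration, not the tooling of the time). The layer sizes are arbitrary and orders of magnitude smaller than the 24-million-node model described above.

```python
# A minimal convolutional neural network sketch (PyTorch assumed; sizes are
# toy-scale, nothing like the model described in the talk).
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(           # hierarchical layers,
            nn.Conv2d(3, 16, 3, padding=1),      # loosely brain-inspired
            nn.ReLU(),
            nn.MaxPool2d(2),                     # 224 -> 112
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                     # 112 -> 56
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)

    def forward(self, x):                        # each node takes input from
        x = self.features(x)                     # earlier nodes and feeds later ones
        return self.classifier(x.flatten(1))

model = TinyConvNet()
scores = model(torch.randn(1, 3, 224, 224))      # one fake 224x224 RGB image
print(scores.shape)                              # torch.Size([1, 10])
```
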
  • 10:43 - 10:47
    Powered by the massive data from ImageNet
  • 10:47 - 10:52
    and modern CPUs and GPUs
    to train such a humongous model,
  • 10:52 - 10:55
    the convolutional neural network
  • 10:55 - 10:58
    blossomed in a way that no one expected.
  • 10:58 - 11:01
    It became the winning architecture
  • 11:01 - 11:06
    to generate exciting new results
    in object recognition.
  • 11:06 - 11:09
    This is a computer telling us
  • 11:09 - 11:11
    this picture contains a cat
  • 11:11 - 11:13
    and where the cat is.
  • 11:13 - 11:15
    Of course there are more things than cats,
  • 11:15 - 11:18
    so here's a computer algorithm telling us
  • 11:18 - 11:21
    the picture contains
    a boy and a teddy bear;
  • 11:21 - 11:25
    a dog, a person, and a small kite
    in the background;
  • 11:25 - 11:28
    or a picture of very busy things
  • 11:28 - 11:33
    like a man, a skateboard,
    railings, a lamppost, and so on.
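
The detector's report can be represented as a list of labels, confidence scores, and bounding boxes. The sketch below shows one plausible representation; every coordinate and score is invented.

```python
# One plausible way to represent what the detector reports: a class label,
# a confidence score, and a bounding box (x, y, width, height) in pixels.
# All values below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float          # model confidence, 0.0 to 1.0
    box: tuple            # (x, y, width, height)

detections = [
    Detection("man",        0.97, (310, 120, 180, 420)),
    Detection("skateboard", 0.91, (355, 505, 130,  45)),
    Detection("lamppost",   0.78, ( 40,  10,  35, 500)),
]
for d in detections:
    print(f"{d.label}: {d.score:.0%} at {d.box}")
```
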
  • 11:33 - 11:38
    Sometimes, when the computer
    is not so confident about what it sees,
  • 11:39 - 11:42
    we have taught it to be smart enough
  • 11:42 - 11:46
    to give us a safe answer
    instead of committing too much,
  • 11:46 - 11:48
    just like we would do,
  • 11:48 - 11:53
    but other times our computer algorithm
    is remarkable at telling us
  • 11:53 - 11:55
    what exactly the objects are,
  • 11:55 - 11:59
    like the make, model, and year of the cars.
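
That "safe answer" behavior can be sketched as simple confidence thresholding: commit to the specific label only when the model is sure enough, and otherwise back off to a generic one. The labels, scores, and threshold below are invented.

```python
# A sketch of "giving a safe answer": if the most specific label is not
# confident enough, back off to a more generic one. The labels, scores,
# and threshold are invented for illustration.
def describe(predictions, generic_label, threshold=0.6):
    """predictions: list of (label, confidence) pairs."""
    best_label, best_score = max(predictions, key=lambda p: p[1])
    if best_score >= threshold:
        return best_label                  # commit to the specific answer
    return generic_label                   # hedge, like a careful human would

print(describe([("2007 sedan", 0.88), ("truck", 0.05)], "a vehicle"))  # 2007 sedan
print(describe([("husky", 0.41), ("wolf", 0.38)], "an animal"))        # an animal
```
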
  • 11:59 - 12:04
    We applied this algorithm to millions
    of Google Street View images
  • 12:04 - 12:07
    across hundreds of American cities,
  • 12:07 - 12:10
    and we have learned something
    really interesting:
  • 12:10 - 12:14
    first, it confirmed our common wisdom
  • 12:14 - 12:17
    that car prices correlate very well
  • 12:17 - 12:19
    with household incomes.
  • 12:19 - 12:24
    But surprisingly, car prices
    also correlate well
  • 12:24 - 12:26
    with crime rates in cities,
  • 12:27 - 12:31
    or voting patterns by zip codes.
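
Checking such a relationship comes down to a correlation coefficient across regions. A minimal sketch with made-up per-city numbers, not the study's data:

```python
# A minimal sketch of checking a correlation like "car prices track
# household incomes", using made-up per-city numbers, not the study's data.
import numpy as np

avg_car_price = np.array([18_000, 24_000, 31_000, 42_000, 55_000])
median_income = np.array([38_000, 47_000, 61_000, 83_000, 102_000])

r = np.corrcoef(avg_car_price, median_income)[0, 1]
print(f"Pearson r = {r:.2f}")  # close to 1.0 for these invented values
```
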
  • 12:32 - 12:34
    So wait a minute. Is that it?
  • 12:34 - 12:39
    Has the computer already matched
    or even surpassed human capabilities?
  • 12:39 - 12:42
    Not so fast.
  • 12:42 - 12:46
    So far, we have just taught
    the computer to see objects.
  • 12:46 - 12:51
    This is like a small child
    learning to utter a few nouns.
  • 12:51 - 12:54
    It's an incredible accomplishment,
  • 12:54 - 12:56
    but it's only the first step.
  • 12:56 - 13:00
    Soon, another developmental
    milestone is hit,
  • 13:00 - 13:03
    and children begin
    to communicate in sentences.
  • 13:03 - 13:08
    So instead of saying
    this is a cat in the picture,
  • 13:08 - 13:13
    you already heard the little girl
    telling us this is a cat lying on a bed.
  • 13:13 - 13:18
    So to teach a computer
    to see a picture and generate sentences,
  • 13:18 - 13:22
    the marriage between big data
    and machine learning algorithms
  • 13:22 - 13:25
    has to take another step.
  • 13:25 - 13:29
    Now, the computer has to learn
    from both pictures
  • 13:29 - 13:32
    as well as natural language sentences
  • 13:32 - 13:35
    generated by humans.
  • 13:35 - 13:39
    Just like the brain integrates
    vision and language,
  • 13:39 - 13:44
    we developed a model
    that connects parts of visual things
  • 13:44 - 13:46
    like visual snippets
  • 13:46 - 13:50
    with words and phrases in sentences.
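
A common way to realize such a model is a convolutional encoder that summarizes the image, feeding a recurrent decoder that emits words one at a time. The PyTorch sketch below shows that generic CNN-to-RNN pattern at toy scale; it is not the lab's exact model.

```python
# A generic image-captioning sketch: a convolutional encoder summarizes the
# picture, and a recurrent decoder emits one word at a time. This is the
# common CNN-to-RNN pattern, not the lab's exact model; sizes are toy-scale.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(            # image -> feature vector
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, hidden),
        )
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_word = nn.Linear(hidden, vocab_size)

    def forward(self, image, prev_words):
        feats = self.encoder(image).unsqueeze(1)        # (B, 1, hidden)
        tokens = self.embed(prev_words)                 # (B, T, hidden)
        out, _ = self.decoder(torch.cat([feats, tokens], dim=1))
        return self.to_word(out)                        # word scores per step

model = TinyCaptioner()
scores = model(torch.randn(1, 3, 64, 64), torch.tensor([[1, 5, 9]]))
print(scores.shape)  # torch.Size([1, 4, 1000])
```
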
  • 13:50 - 13:53
    About four months ago,
  • 13:53 - 13:56
    we finally tied all this together
  • 13:56 - 13:59
    and produced one of the first
    computer vision models
  • 13:59 - 14:03
    that is capable of generating
    a human-like sentence
  • 14:03 - 14:07
    when it sees a picture for the first time.
  • 14:07 - 14:12
    Now, I'm ready to show you
    what the computer says
  • 14:12 - 14:14
    when it sees the picture
  • 14:14 - 14:17
    that the little girl saw
    at the beginning of this talk.
  • 14:20 - 14:23
    (Video) Computer: A man is standing
    next to an elephant.
  • 14:24 - 14:28
    A large airplane sitting on top
    of an airport runway.
  • 14:29 - 14:33
    FFL: Of course, we're still working hard
    to improve our algorithms,
  • 14:33 - 14:36
    and they still have a lot to learn.
  • 14:36 - 14:38
    (Applause)
  • 14:40 - 14:43
    And the computer still makes mistakes.
  • 14:43 - 14:46
    (Video) Computer: A cat lying
    on a bed in a blanket.
  • 14:46 - 14:49
    FFL: So of course, when it sees
    too many cats,
  • 14:49 - 14:52
    it thinks everything
    might look like a cat.
  • 14:53 - 14:56
    (Video) Computer: A young boy
    is holding a baseball bat.
  • 14:56 - 14:58
    (Laughter)
  • 14:58 - 15:03
    FFL: Or, if it hasn't seen a toothbrush,
    it confuses it with a baseball bat.
  • 15:03 - 15:07
    (Video) Computer: A man riding a horse
    down a street next to a building.
  • 15:07 - 15:09
    (Laughter)
  • 15:09 - 15:12
    FFL: We haven't taught Art 101
    to the computers.
  • 15:14 - 15:17
    (Video) Computer: A zebra standing
    in a field of grass.
  • 15:17 - 15:20
    FFL: And it hasn't learned to appreciate
    the stunning beauty of nature
  • 15:20 - 15:22
    like you and I do.
  • 15:22 - 15:25
    So it has been a long journey.
  • 15:25 - 15:30
    To get from age zero to three was hard.
  • 15:30 - 15:35
    The real challenge is to go
    from three to 13 and far beyond.
  • 15:35 - 15:39
    Let me remind you with this picture
    of the boy and the cake again.
  • 15:39 - 15:44
    So far, we have taught
    the computer to see objects
  • 15:44 - 15:48
    or even tell us a simple story
    when seeing a picture.
  • 15:48 - 15:52
    (Video) Computer: A person sitting
    at a table with a cake.
  • 15:52 - 15:54
    FFL: But there's so much more
    to this picture
  • 15:54 - 15:56
    than just a person and a cake.
  • 15:56 - 16:01
    What the computer doesn't see
    is that this is a special Italian cake
  • 16:01 - 16:04
    that's only served during Easter time.
  • 16:04 - 16:07
    The boy is wearing his favorite t-shirt
  • 16:07 - 16:11
    given to him as a gift by his father
    after a trip to Sydney,
  • 16:11 - 16:15
    and you and I can tell how happy he is
  • 16:15 - 16:18
    and what's exactly on his mind
    at that moment.
  • 16:19 - 16:22
    This is my son Leo.
  • 16:22 - 16:25
    On my quest for visual intelligence,
  • 16:25 - 16:27
    I think of Leo constantly
  • 16:27 - 16:30
    and the future world he will live in.
  • 16:30 - 16:32
    When machines can see,
  • 16:32 - 16:37
    doctors and nurses will have
    extra pairs of tireless eyes
  • 16:37 - 16:41
    to help them to diagnose
    and take care of patients.
  • 16:41 - 16:45
    Cars will run smarter
    and safer on the road.
  • 16:45 - 16:48
    Robots, not just humans,
  • 16:48 - 16:53
    will help us to brave the disaster zones
    to save the trapped and wounded.
  • 16:54 - 16:58
    We will discover new species,
    better materials,
  • 16:58 - 17:02
    and explore unseen frontiers
    with the help of the machines.
  • 17:03 - 17:07
    Little by little, we're giving sight
    to the machines.
  • 17:07 - 17:10
    First, we teach them to see.
  • 17:10 - 17:13
    Then, they help us to see better.
  • 17:13 - 17:17
    For the first time, human eyes
    won't be the only ones
  • 17:17 - 17:20
    pondering and exploring our world.
  • 17:20 - 17:23
    We will not only use the machines
    for their intelligence,
  • 17:23 - 17:30
    we will also collaborate with them
    in ways that we cannot even imagine.
  • 17:30 - 17:32
    This is my quest:
  • 17:32 - 17:34
    to give computers visual intelligence
  • 17:34 - 17:40
    and to create a better future
    for Leo and for the world.
  • 17:40 - 17:41
    Thank you.
  • 17:41 - 17:45
    (Applause)
Title:
How we're teaching computers to understand pictures
Speaker:
Fei-Fei Li
Description:

When a very young child looks at a picture, she can identify simple elements: "cat," "book," "chair." Now, computers are getting smart enough to do that too. What's next? In a thrilling talk, computer vision expert Fei-Fei Li describes the state of the art — including the database of 15 million photos her team built to "teach" a computer to understand pictures — and the key insights yet to come.

Video Language:
English
Team:
closed TED
Project:
TEDTalks
Duration:
17:58
