How we're teaching computers to understand pictures

0:02 - 0:06

Let me show you something.
0:06 - 0:10

(Video) Girl: Okay, that's a cat
sitting in a bed.
0:10 - 0:14

The boy is petting the elephant.
0:14 - 0:19

Those are people
that are going on an airplane.
0:19 - 0:21

That's a big airplane.
0:21 - 0:24

Fei-Fei Li: This is
a three-year-old child
0:24 - 0:27

describing what she sees
in a series of photos.
0:27 - 0:30

She might still have a lot
to learn about this world,
0:30 - 0:35

but she's already an expert
at one very important task:
0:35 - 0:38

to make sense of what she sees.
0:38 - 0:42

Our society is more
technologically advanced than ever.
0:42 - 0:46

We send people to the moon,
we make phones that talk to us
0:46 - 0:51

or customize radio stations
that can play only music we like.
0:51 - 0:55

Yet, our most advanced
machines and computers
0:55 - 0:58

still struggle at this task.
0:58 - 1:01

So I'm here today
to give you a progress report
1:01 - 1:05

on the latest advances
in our research in computer vision,
1:05 - 1:10

one of the most frontier
and potentially revolutionary
1:10 - 1:13

technologies in computer science.
1:13 - 1:17

Yes, we have prototyped cars
that can drive by themselves,
1:17 - 1:21

but without smart vision,
they cannot really tell the difference
1:21 - 1:25

between a crumpled paper bag
on the road, which can be run over,
1:25 - 1:29

and a rock that size,
which should be avoided.
1:29 - 1:33

We have made fabulous megapixel cameras,
1:33 - 1:36

but we have not delivered
sight to the blind.
1:36 - 1:40

Drones can fly over massive land,
1:40 - 1:42

but don't have enough vision technology
1:42 - 1:45

to help us to track
the changes of the rainforests.
1:45 - 1:48

Security cameras are everywhere,
1:48 - 1:53

but they do not alert us when a child
is drowning in a swimming pool.
1:54 - 2:00

Photos and videos are becoming
an integral part of global life.
2:00 - 2:04

They're being generated at a pace
that's far beyond what any human,
2:04 - 2:07

or teams of humans, could hope to view,
2:07 - 2:11

and you and I are contributing
to that at this TED.
2:11 - 2:16

Yet our most advanced software
is still struggling at understanding
2:16 - 2:20

and managing this enormous content.
2:20 - 2:25

So in other words,
collectively as a society,
2:25 - 2:27

we're very much blind,
2:27 - 2:30

because our smartest
machines are still blind.
2:32 - 2:34

"Why is this so hard?" you may ask.
2:34 - 2:37

Cameras can take pictures like this one
2:37 - 2:41

by converting light into
a two-dimensional array of numbers
2:41 - 2:43

known as pixels,
2:43 - 2:45

but these are just lifeless numbers.
2:45 - 2:48

They do not carry meaning in themselves.
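To make the "array of numbers" point concrete, here is a minimal Python sketch of what a camera hands to a computer. The 4x4 grid and its brightness values are invented for illustration; nothing in the talk specifies them.

```python
# A tiny 4x4 grayscale "image": each number is a pixel brightness
# from 0 (black) to 255 (white). These values are made up.
image = [
    [0,   0,   255, 255],
    [0,   0,   255, 255],
    [255, 255, 0,   0],
    [255, 255, 0,   0],
]

height = len(image)
width = len(image[0])

# The computer can do arithmetic on the numbers, such as averaging them,
# but nothing here says what the picture shows.
mean_brightness = sum(sum(row) for row in image) / (height * width)

print(f"{height}x{width} pixels, mean brightness {mean_brightness}")
```

The program can compute statistics over the grid, yet the numbers alone say nothing about what is depicted, which is the gap the talk describes.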
2:48 - 2:52

Just like to hear is not
the same as to listen,
2:52 - 2:57

to take pictures is not
the same as to see,
2:57 - 3:00

and by seeing,
we really mean understanding.
3:01 - 3:07

In fact, it took Mother Nature
540 million years of hard work
3:07 - 3:09

to do this task,
3:09 - 3:11

and much of that effort
3:11 - 3:17

went into developing the visual
processing apparatus of our brains,
3:17 - 3:19

not the eyes themselves.
3:19 - 3:22

So vision begins with the eyes,
3:22 - 3:26

but it truly takes place in the brain.
3:26 - 3:31

So for 15 years now, starting
from my Ph.D. at Caltech
3:31 - 3:34

and then leading Stanford's Vision Lab,
3:34 - 3:39

I've been working with my mentors,
collaborators and students
3:39 - 3:42

to teach computers to see.
3:43 - 3:46

Our research field is called
computer vision and machine learning.
3:46 - 3:50

It's part of the general field
of artificial intelligence.
3:51 - 3:56

So ultimately, we want to teach
the machines to see just like we do:
3:56 - 4:02

naming objects, identifying people,
inferring 3D geometry of things,
4:02 - 4:08

understanding relations, emotions,
actions and intentions.
4:08 - 4:14

You and I weave together entire stories
of people, places and things
4:14 - 4:16

the moment we lay our gaze on them.
4:17 - 4:23

The first step towards this goal
is to teach a computer to see objects,
4:23 - 4:26

the building block of the visual world.
4:26 - 4:30

In its simplest terms,
imagine this teaching process
4:30 - 4:33

as showing the computers
some training images
4:33 - 4:37

of a particular object, let's say cats,
4:37 - 4:41

and designing a model that learns
from these training images.
4:41 - 4:43

How hard can this be?
4:43 - 4:47

After all, a cat is just
a collection of shapes and colors,
4:47 - 4:52

and this is what we did
in the early days of object modeling.
4:52 - 4:55

We'd tell the computer algorithm
in a mathematical language
4:55 - 4:59

that a cat has a round face,
a chubby body,
4:59 - 5:01

two pointy ears, and a long tail,
5:01 - 5:02

and that looked all fine.
5:03 - 5:05

But what about this cat?
5:05 - 5:06

(Laughter)
5:06 - 5:08

It's all curled up.
5:08 - 5:12

Now you have to add another shape
and viewpoint to the object model.
5:12 - 5:14

But what if cats are hidden?
5:15 - 5:17

What about these silly cats?
5:19 - 5:22

Now you get my point.
5:22 - 5:25

Even something as simple
as a household pet
5:25 - 5:29

can present an infinite number
of variations to the object model,
5:29 - 5:32

and that's just one object.
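The brittleness of hand-written object models can be sketched in a few lines of Python. The feature names and the two example cats below are hypothetical stand-ins, not the mathematical models actually used in early object recognition.

```python
def looks_like_cat(features):
    """Hand-written rule: every textbook trait must be visible at once."""
    return (
        features.get("round_face", False)
        and features.get("chubby_body", False)
        and features.get("pointy_ears", 0) == 2
        and features.get("long_tail", False)
    )

# Hypothetical feature readings for two photos of cats.
sitting_cat = {"round_face": True, "chubby_body": True,
               "pointy_ears": 2, "long_tail": True}
curled_up_cat = {"round_face": False, "chubby_body": True,
                 "pointy_ears": 1, "long_tail": False}

print(looks_like_cat(sitting_cat))    # True
print(looks_like_cat(curled_up_cat))  # False, although it is still a cat
```

Each new pose or occlusion would need another rule, which is why this approach does not scale.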
5:33 - 5:35

So about eight years ago,
5:35 - 5:40

a very simple and profound observation
changed my thinking.
5:41 - 5:44

No one tells a child how to see,
5:44 - 5:46

especially in the early years.
5:46 - 5:51

They learn this through
real world experiences and examples.
5:51 - 5:54

If you consider a child's eyes
5:54 - 5:57

as a pair of biological cameras,
5:57 - 6:01

they take one picture
about every 200 milliseconds,
6:01 - 6:04

the average time an eye movement is made.
6:04 - 6:10

So by age three, a child would have seen
hundreds of millions of pictures
6:10 - 6:11

of the real world.
6:11 - 6:14

That's a lot of training examples.
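The estimate can be checked with a rough calculation. The 200-millisecond figure comes from the talk; the 12 waking hours per day is an assumption added here to make the arithmetic concrete.

```python
FIXATION_MS = 200           # from the talk: one "picture" about every 200 ms
WAKING_HOURS_PER_DAY = 12   # assumption, not stated in the talk
YEARS = 3

pictures_per_second = 1000 // FIXATION_MS                     # 5
seconds_awake = YEARS * 365 * WAKING_HOURS_PER_DAY * 3600
total_pictures = pictures_per_second * seconds_awake

print(f"{total_pictures:,}")  # 236,520,000
```

Under these assumptions the total lands in the hundreds of millions, consistent with the figure quoted above.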
6:14 - 6:20

So instead of focusing solely
on better and better algorithms,
6:20 - 6:26

my insight was to give the algorithms
the kind of training data
6:26 - 6:29

that a child was given through experiences
6:29 - 6:33

in both quantity and quality.
6:33 - 6:35

Once we knew this,
6:35 - 6:38

we knew we needed to collect a data set
6:38 - 6:42

that has far more images
than we have ever had before,
6:42 - 6:45

perhaps thousands of times more,
6:45 - 6:49

and together with professor
Kai Li at Princeton University,
6:49 - 6:54

we launched the ImageNet project in 2007.
6:54 - 6:57

Luckily, we didn't have to mount
a camera on our head
6:57 - 6:59

and wait for many years.
6:59 - 7:01

We went to the Internet,
7:01 - 7:05

the biggest treasure trove of pictures
that humans have ever created.
7:05 - 7:08

We downloaded nearly a billion images
7:08 - 7:11

and used the crowdsourcing technology
7:11 - 7:14

like the Amazon Mechanical Turk platform
7:14 - 7:16

to help us to label these images.
7:16 - 7:21

At its peak, ImageNet was one of
the biggest employers
7:21 - 7:24

of the Amazon Mechanical Turk workers:
7:24 - 7:28

together, almost 50,000 workers
7:28 - 7:32

from 167 countries around the world
7:32 - 7:36

helped us to clean, sort, and label
7:36 - 7:40

nearly a billion candidate images.
7:41 - 7:43

That was how much effort it took
7:43 - 7:47

to capture even a fraction
of the imagery
7:47 - 7:51

a child's mind takes in
in the early developmental years.
7:52 - 7:56

In hindsight, this idea of using big data
7:56 - 8:01

to train computer algorithms
may seem obvious now,
8:01 - 8:05

but back in 2007, it was not so obvious.
8:05 - 8:09

We were fairly alone on this journey
for quite a while.
8:09 - 8:14

Some very friendly colleagues advised me
to do something more useful for my tenure,
8:14 - 8:18

and we were constantly struggling
for research funding.
8:18 - 8:20

Once, I even joked to my graduate students
8:20 - 8:24

that I would just reopen
my dry-cleaner shop to fund ImageNet.
8:24 - 8:29

After all, that's how I funded
my college years.
8:29 - 8:31

So we carried on.
8:31 - 8:35

In 2009, the ImageNet project delivered
8:35 - 8:39

a database of 15 million images
8:39 - 8:44

across 22,000 classes
of objects and things
8:44 - 8:47

organized by everyday English words.
8:47 - 8:50

In both quantity and quality,
8:50 - 8:53

this was an unprecedented scale.
8:53 - 8:56

As an example, in the case of cats,
8:56 - 8:59

we have more than 62,000 cats
8:59 - 9:03

of all kinds of looks and poses
9:03 - 9:08

and across all species
of domestic and wild cats.
9:08 - 9:12

We were thrilled
to have put together ImageNet,
9:12 - 9:16

and we wanted the whole research world
to benefit from it,
9:16 - 9:20

so in the TED fashion,
we opened up the entire data set
9:20 - 9:23

to the worldwide
research community for free.
9:25 - 9:29

(Applause)
9:29 - 9:34

Now that we have the data
to nourish our computer brain,
9:34 - 9:38

we're ready to come back
to the algorithms themselves.
9:38 - 9:43

As it turned out, the wealth
of information provided by ImageNet
9:43 - 9:48

was a perfect match to a particular class
of machine learning algorithms
9:48 - 9:50

called convolutional neural networks,
9:50 - 9:55

pioneered by Kunihiko Fukushima,
Geoff Hinton, and [???]
9:55 - 9:59

back in the 1970s and '80s.
9:59 - 10:05

Just like the brain consists
of billions of highly connected neurons,
10:05 - 10:08

a basic operating unit in a neural network
10:08 - 10:11

is a neuron-like node.
10:11 - 10:13

It takes input from other nodes
10:13 - 10:16

and sends output to others.
10:16 - 10:21

Moreover, these hundreds of thousands
or even millions of nodes
10:21 - 10:24

are organized in hierarchical layers,
10:24 - 10:27

also similar to the brain.
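The idea of nodes arranged in layers can be sketched in plain Python. This is a small fully connected network with made-up weights, not the convolutional architecture the talk refers to; the names `node` and `layer` are illustrative.

```python
import math

def node(inputs, weights, bias):
    """One neuron-like node: a weighted sum of its inputs, squashed into (0, 1)."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is a list of nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical input values and weights, chosen only for illustration.
pixels = [0.0, 0.5, 1.0]
hidden = layer(pixels, [[0.2, -0.4, 0.6], [-0.5, 0.3, 0.1]], [0.0, 0.1])
output = layer(hidden, [[1.0, -1.0]], [0.0])

print(hidden, output)
```

Each layer's output becomes the next layer's input, which is the hierarchy described above. Training, omitted here, is the process of adjusting the weights from examples.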
10:27 - 10:31

In a typical neural network we use
to train our object recognition model,
10:31 - 10:35

it has 24 million nodes,
10:35 - 10:38

140 million parameters,
10:38 - 10:41

and 15 billion connections.
10:41 - 10:43

That's an enormous model.
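One reason a convolutional network can have far more connections than parameters is that each small filter is reused at every image position. The sketch below counts both for a single convolutional layer; the layer dimensions are hypothetical and do not reproduce the network described in the talk.

```python
def conv_layer_counts(kernel, in_channels, out_channels, out_height, out_width):
    """Count parameters, nodes, and connections for one convolutional layer."""
    weights = kernel * kernel * in_channels * out_channels
    parameters = weights + out_channels            # one bias per output channel
    nodes = out_channels * out_height * out_width  # one node per output value
    # Every output node reads a kernel x kernel patch across all input channels.
    connections = nodes * kernel * kernel * in_channels
    return parameters, nodes, connections

# Hypothetical layer: 3x3 filters, 64 input channels, 128 output channels,
# producing a 56x56 output map.
parameters, nodes, connections = conv_layer_counts(3, 64, 128, 56, 56)
print(f"{parameters:,} parameters, {nodes:,} nodes, {connections:,} connections")
```

For this made-up layer the result is 73,856 parameters, 401,408 nodes, and 231,211,008 connections, so connections outnumber parameters by more than three thousand to one.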
10:43 - 10:47

Powered by the massive data from ImageNet
10:47 - 10:52

and the modern CPUs and GPUs
to train such a humongous model,
10:52 - 10:55

the convolutional neural network
10:55 - 10:58

blossomed in a way that no one expected.
10:58 - 11:01

It became the winning architecture
11:01 - 11:06

to generate exciting new results
in object recognition.
11:06 - 11:09

This is a computer telling us
11:09 - 11:11

this picture contains a cat
11:11 - 11:13

and where the cat is.
11:13 - 11:15

Of course there are more things than cats,
11:15 - 11:18

so here's a computer algorithm telling us
11:18 - 11:21

the picture contains
a boy and a teddy bear;
11:21 - 11:25

a dog, a person, and a small kite
in the background;
11:25 - 11:28

or a picture of very basic things
11:28 - 11:33

like a man, a skateboard,
railings, a lamppost, and so on.
11:33 - 11:38

Sometimes, when the computer
is not so confident about what it sees,
11:39 - 11:42

we have taught it to be smart enough
11:42 - 11:46

to give us a safe answer
instead of committing too much,
11:46 - 11:48

just like we would do,
11:48 - 11:53

but other times our computer algorithm
is remarkable at telling us
11:53 - 11:55

what exactly the objects are,
11:55 - 11:59

like the make, model, year of the cars.
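One plausible way to give a "safe answer" is to back off from a specific label to a more general one when confidence is low. The sketch below assumes a hypothetical label hierarchy and made-up confidence scores; it is not necessarily the method used in the research described here.

```python
# Hypothetical hierarchy: each label points to a more general parent.
PARENT = {"corgi": "dog", "dog": "animal", "animal": None}

def safe_label(label, confidence, threshold=0.8):
    """Walk up the hierarchy until a label's confidence clears the threshold."""
    while label is not None:
        if confidence.get(label, 0.0) >= threshold:
            return label
        label = PARENT[label]
    return "unknown"

# Made-up scores: unsure about the breed, fairly sure it is a dog.
unsure = {"corgi": 0.55, "dog": 0.92, "animal": 0.99}
sure = {"corgi": 0.95, "dog": 0.97, "animal": 0.99}

print(safe_label("corgi", unsure))  # dog
print(safe_label("corgi", sure))    # corgi
```

The threshold trades specificity for accuracy: a higher value yields more general answers that are less often wrong.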
11:59 - 12:04

We applied this algorithm to millions
of Google Street View images
12:04 - 12:07

across hundreds of American cities,
12:07 - 12:10

and we have learned something
really interesting:
12:10 - 12:14

first, it confirmed our common wisdom
12:14 - 12:17

that car prices correlate very well
12:17 - 12:19

with household incomes.
12:19 - 12:24

But surprisingly, car prices
also correlate well
12:24 - 12:26

with crime rates in cities,
12:27 - 12:31

or voting patterns by zip codes.
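"Correlates well" is usually measured with a correlation coefficient. The sketch below computes Pearson's coefficient in plain Python; the car prices and incomes are invented toy numbers, not results from the Street View study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy data, made up for illustration: one entry per hypothetical neighborhood.
avg_car_price = [12000, 18000, 25000, 31000, 40000]
household_income = [35000, 48000, 61000, 80000, 95000]

r = pearson(avg_car_price, household_income)
print(round(r, 3))
```

A value near 1 means the two quantities rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.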
12:32 - 12:34

So wait a minute: is that it?
12:34 - 12:39

Has the computer already matched
or even surpassed human capabilities?
12:39 - 12:42

Not so fast.
12:42 - 12:46

So far, we have just taught
the computer to see objects.
12:46 - 12:51

This is like a small child
learning to utter a few nouns.
12:51 - 12:54

It's an incredible accomplishment,
12:54 - 12:56

but it's only the first step.
12:56 - 13:00

Soon, another developmental
milestone will hit,
13:00 - 13:03

and children begin
to communicate in sentences.
13:03 - 13:08

So instead of saying
this is a cat in the picture,
13:08 - 13:13

you already heard the little girl
telling us this is a cat lying on a bed.
13:13 - 13:18

So to teach a computer
to see a picture and generate sentences,
13:18 - 13:22

the marriage between big data
and machine learning algorithm
13:22 - 13:25

has to take another step.
13:25 - 13:29

Now, the computer has to learn
from both pictures
13:29 - 13:32

as well as natural language sentences
13:32 - 13:35

generated by humans.
13:35 - 13:39

Just like the brain integrates
vision and language,
13:39 - 13:44

we developed a model
that connects parts of visual things
13:44 - 13:46

like visual snippets
13:46 - 13:50

with words and phrases in sentences.
13:50 - 13:53

About four months ago,
13:53 - 13:56

we finally tied all this together
13:56 - 13:59

and produced one of the first
computer vision models
13:59 - 14:03

that is capable of generating
a human-like sentence
14:03 - 14:07

when it sees a picture for the first time.
14:07 - 14:12

Now, I'm ready to show you
what the computer says
14:12 - 14:14

when it sees the picture
14:14 - 14:17

that the little girl saw
at the beginning of this talk.
14:20 - 14:23

(Video) Computer: A man is standing
next to an elephant.
14:24 - 14:28

A large airplane sitting on top
of an airport runway.
14:29 - 14:33

FFL: Of course, we're still working hard
to improve our algorithms,
14:33 - 14:36

and it still has a lot to learn.
14:36 - 14:38

(Applause)
14:40 - 14:43

And the computer still makes mistakes.
14:43 - 14:46

(Video) Computer: A cat lying
on a bed in a blanket.
14:46 - 14:49

FFL: So of course, when it sees
too many cats,
14:49 - 14:52

it thinks everything
might look like a cat.
14:53 - 14:56

(Video) Computer: A young boy
is holding a baseball bat.
14:56 - 14:58

(Laughter)
14:58 - 15:03

FFL: Or, if it hasn't seen a toothbrush,
it confuses it with a baseball bat.
15:03 - 15:07

(Video) Computer: A man riding a horse
down a street next to a building.
15:07 - 15:09

(Laughter)
15:09 - 15:12

FFL: We haven't taught Art 101
to the computers.
15:14 - 15:17

(Video) Computer: A zebra standing
in a field of grass.
15:17 - 15:20

FFL: And it hasn't learned to appreciate
the stunning beauty of nature
15:20 - 15:22

like you and I do.
15:22 - 15:25

So it has been a long journey.
15:25 - 15:30

To get from age zero to three was hard.
15:30 - 15:35

The real challenge is to go
from three to 13 and far beyond.
15:35 - 15:39

Let me remind you with this picture
of the boy and the cake again.
15:39 - 15:44

So far, we have taught
the computer to see objects
15:44 - 15:48

or even tell us a simple story
when seeing the picture.
15:48 - 15:52

(Video) Computer: A person sitting
at a table with a cake.
15:52 - 15:54

FFL: But there's so much more
to this picture
15:54 - 15:56

than just a person and a cake.
15:56 - 16:01

What the computer doesn't see
is that this is a special Italian cake
16:01 - 16:04

that's only served during Easter time.
16:04 - 16:07

The boy is wearing his favorite t-shirt
16:07 - 16:11

given to him as a gift by his father
after a trip to Sydney,
16:11 - 16:15

and you and I can all tell how happy he is
16:15 - 16:18

and what's exactly on his mind
at that moment.
16:19 - 16:22

This is my son [???].
16:22 - 16:25

On my quest for visual intelligence,
16:25 - 16:27

I think of [???] constantly
16:27 - 16:30

and the future world he will live in.
16:30 - 16:32

When machines can see,
16:32 - 16:37

doctors and nurses will have
extra pairs of tireless eyes
16:37 - 16:41

to help them to diagnose
and take care of patients.
16:41 - 16:45

Cars will run smarter
and safer on the road.
16:45 - 16:48

Robots, not just humans,
16:48 - 16:53

will help us to brave the disaster zones
to save the trapped and wounded.
16:54 - 16:58

We will discover new species,
better materials,
16:58 - 17:02

and explore unseen frontiers
with the help of the machines.
17:03 - 17:07

Little by little, we're giving sight
to the machines.
17:07 - 17:10

First, we teach them to see.
17:10 - 17:13

Then, they help us to see better.
17:13 - 17:17

For the first time, human eyes
won't be the only ones
17:17 - 17:20

pondering and exploring our world.
17:20 - 17:23

We will not only use the machines
for their intelligence,
17:23 - 17:30

we will also collaborate with them
in ways that we cannot even imagine.
17:31 - 17:32

This is my quest:
17:32 - 17:34

to give computers visual intelligence
17:34 - 17:40

and to create a better future
for [???] and for the world.
17:40 - 17:41

Thank you.
17:41 - 17:45

(Applause)

Title:: How we're teaching computers to understand pictures
Speaker:: Fei-Fei Li
Video Language:: English
Team:: closed TED
Project:: TEDTalks
Duration:: 17:58
