How we're teaching computers to understand pictures
-
0:02 - 0:06Let me show you something.
-
0:06 - 0:10(Video) Girl: Okay, that's a cat
sitting in a bed. -
0:10 - 0:14The boy is petting the elephant.
-
0:14 - 0:19Those are people
that are going on an airplane. -
0:19 - 0:21That's a big airplane.
-
0:21 - 0:24Fei-Fei Li: This is
a three-year-old child -
0:24 - 0:27describing what she sees
in a series of photos. -
0:27 - 0:30She might still have a lot
to learn about this world, -
0:30 - 0:35but she's already an expert
at one very important task: -
0:35 - 0:38to make sense of what she sees.
-
0:38 - 0:42Our society is more
technologically advanced than ever. -
0:42 - 0:46We send people to the moon,
we make phones that talk to us -
0:46 - 0:51or customize radio stations
that can play only music we like. -
0:51 - 0:55Yet, our most advanced
machines and computers -
0:55 - 0:58still struggle at this task.
-
0:58 - 1:01So I'm here today
to give you a progress report -
1:01 - 1:05on the latest advances
in our research in computer vision, -
1:05 - 1:10one of the most cutting-edge
and potentially revolutionary -
1:10 - 1:13technologies in computer science.
-
1:13 - 1:17Yes, we have prototyped cars
that can drive by themselves, -
1:17 - 1:21but without smart vision,
they cannot really tell the difference -
1:21 - 1:25between a crumpled paper bag
on the road, which can be run over, -
1:25 - 1:29and a rock that size,
which should be avoided. -
1:29 - 1:33We have made fabulous megapixel cameras,
-
1:33 - 1:36but we have not delivered
sight to the blind. -
1:36 - 1:40Drones can fly over vast stretches of land,
-
1:40 - 1:42but don't have enough vision technology
-
1:42 - 1:45to help us to track
the changes of the rainforests. -
1:45 - 1:48Security cameras are everywhere,
-
1:48 - 1:53but they do not alert us when a child
is drowning in a swimming pool. -
1:54 - 2:00Photos and videos are becoming
an integral part of global life. -
2:00 - 2:04They're being generated at a pace
that's far beyond what any human, -
2:04 - 2:07or teams of humans, could hope to view,
-
2:07 - 2:11and you and I are contributing
to that at this TED. -
2:11 - 2:16Yet our most advanced software
is still struggling at understanding -
2:16 - 2:20and managing this enormous content.
-
2:20 - 2:25So in other words,
collectively as a society, -
2:25 - 2:27we're very much blind,
-
2:27 - 2:30because our smartest
machines are still blind. -
2:32 - 2:34"Why is this so hard?" you may ask.
-
2:34 - 2:37Cameras can take pictures like this one
-
2:37 - 2:41by converting light into
a two-dimensional array of numbers -
2:41 - 2:43known as pixels,
-
2:43 - 2:45but these are just lifeless numbers.
-
2:45 - 2:48They do not carry meaning in themselves.
-
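To make the point concrete, here is a minimal sketch (not anything from the talk itself) of what a camera actually hands a computer: a grayscale image is just a grid of brightness values. The numbers below are invented for illustration.

```python
# A tiny "photo" as a two-dimensional array of numbers (pixels).
# The values are made up; a real image would be e.g. 1920x1080 of them.
image = [
    [  0,  34,  68, 102],
    [ 34,  68, 102, 136],
    [ 68, 102, 136, 170],
    [102, 136, 170, 204],
]

height = len(image)
width = len(image[0])
print(f"{height}x{width} pixels, values {image[0][0]}..{image[-1][-1]}")
# Each entry is only a number; nothing here says "cat" or "sky".
# Meaning has to be computed from these values, not read off them.
```

Nothing in that array carries meaning by itself, which is exactly the gap between taking a picture and seeing.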
2:48 - 2:52Just like to hear is not
the same as to listen, -
2:52 - 2:57to take pictures is not
the same as to see, -
2:57 - 3:00and by seeing,
we really mean understanding. -
3:01 - 3:07In fact, it took Mother Nature
540 million years of hard work -
3:07 - 3:09to do this task,
-
3:09 - 3:11and much of that effort
-
3:11 - 3:17went into developing the visual
processing apparatus of our brains, -
3:17 - 3:19not the eyes themselves.
-
3:19 - 3:22So vision begins with the eyes,
-
3:22 - 3:26but it truly takes place in the brain.
-
3:26 - 3:31So for 15 years now, starting
from my Ph.D. at Caltech -
3:31 - 3:34and then leading Stanford's Vision Lab,
-
3:34 - 3:39I've been working with my mentors,
collaborators and students -
3:39 - 3:42to teach computers to see.
-
3:43 - 3:46Our research field is called
computer vision and machine learning. -
3:46 - 3:50It's part of the general field
of artificial intelligence. -
3:51 - 3:56So ultimately, we want to teach
the machines to see just like we do: -
3:56 - 4:02naming objects, identifying people,
inferring 3D geometry of things, -
4:02 - 4:08understanding relations, emotions,
actions and intentions. -
4:08 - 4:14You and I weave together entire stories
of people, places and things -
4:14 - 4:16the moment we lay our gaze on them.
-
4:17 - 4:23The first step towards this goal
is to teach a computer to see objects, -
4:23 - 4:26the building block of the visual world.
-
4:26 - 4:30In its simplest terms,
imagine this teaching process -
4:30 - 4:33as showing the computers
some training images -
4:33 - 4:37of a particular object, let's say cats,
-
4:37 - 4:41and designing a model that learns
from these training images. -
4:41 - 4:43How hard can this be?
-
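The teach-by-example loop described above can be sketched in a few lines. This is a toy stand-in, not the speaker's method: the two-number feature vectors below are invented for illustration, and the "model" is just a per-class average (a nearest-centroid classifier), where real systems learn features from raw pixels.

```python
# Toy sketch of "show the computer training images, let a model learn."
def train(examples):
    """Average the feature vectors of each labeled class (a centroid model)."""
    sums, counts = {}, {}
    for features, label in examples:
        s = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def predict(model, features):
    """Pick the class whose centroid is closest to the new example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], features))

# Hypothetical training examples: (feature vector, label).
training = [([0.9, 0.2], "cat"), ([0.8, 0.3], "cat"),
            ([0.1, 0.9], "dog"), ([0.2, 0.8], "dog")]
model = train(training)
print(predict(model, [0.85, 0.25]))  # → cat
```

Even this toy version hints at the problem the talk goes on to describe: a handful of hand-picked examples cannot cover a curled-up cat, a hidden cat, or a silly cat.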
4:43 - 4:47After all, a cat is just
a collection of shapes and colors, -
4:47 - 4:52and this is what we did
in the early days of object modeling. -
4:52 - 4:55We'd tell the computer algorithm
in a mathematical language -
4:55 - 4:59that a cat has a round face,
a chubby body, -
4:59 - 5:01two pointy ears, and a long tail,
-
5:01 - 5:02and that looked all fine.
-
5:03 - 5:05But what about this cat?
-
5:05 - 5:06(Laughter)
-
5:06 - 5:08It's all curled up.
-
5:08 - 5:12Now you have to add another shape
and viewpoint to the object model. -
5:12 - 5:14But what if cats are hidden?
-
5:15 - 5:17What about these silly cats?
-
5:19 - 5:22Now you get my point.
-
5:22 - 5:25Even something as simple
as a household pet -
5:25 - 5:29can present an infinite number
of variations to the object model, -
5:29 - 5:32and that's just one object.
-
5:33 - 5:35So about eight years ago,
-
5:35 - 5:40a very simple and profound observation
changed my thinking. -
5:41 - 5:44No one tells a child how to see,
-
5:44 - 5:46especially in the early years.
-
5:46 - 5:51They learn this through
real-world experiences and examples. -
5:51 - 5:54If you consider a child's eyes
-
5:54 - 5:57as a pair of biological cameras,
-
5:57 - 6:01they take one picture
about every 200 milliseconds, -
6:01 - 6:04the average time an eye movement is made.
-
6:04 - 6:10So by age three, a child would have seen
hundreds of millions of pictures -
6:10 - 6:11of the real world.
-
6:11 - 6:14That's a lot of training examples.
-
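The "hundreds of millions" figure checks out on the back of an envelope. One picture every 200 milliseconds is about five per second; assuming roughly 12 waking hours a day (an assumption, not a figure from the talk):

```python
# Rough arithmetic behind "hundreds of millions of pictures by age three".
fixations_per_second = 1000 // 200      # one picture every 200 ms → 5/sec
seconds_awake_per_day = 12 * 3600       # assumed ~12 waking hours
days = 3 * 365                          # first three years
total = fixations_per_second * seconds_awake_per_day * days
print(f"{total:,}")                     # ≈ 236 million pictures
```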
6:14 - 6:20So instead of focusing solely
on better and better algorithms, -
6:20 - 6:26my insight was to give the algorithms
the kind of training data -
6:26 - 6:29that a child receives through experience
-
6:29 - 6:33in both quantity and quality.
-
6:33 - 6:35Once we knew this,
-
6:35 - 6:38we knew we needed to collect a data set
-
6:38 - 6:42that has far more images
than we have ever had before, -
6:42 - 6:45perhaps thousands of times more,
-
6:45 - 6:49and together with Professor
Kai Li at Princeton University, -
6:49 - 6:54we launched the ImageNet project in 2007.
-
6:54 - 6:57Luckily, we didn't have to mount
a camera on our head -
6:57 - 6:59and wait for many years.
-
6:59 - 7:01We went to the Internet,
-
7:01 - 7:05the biggest treasure trove of pictures
that humans have ever created. -
7:05 - 7:08We downloaded nearly a billion images
-
7:08 - 7:14and used crowdsourcing technology
like the Amazon Mechanical Turk platform -
7:14 - 7:16to help us to label these images.
-
7:16 - 7:21At its peak, ImageNet was one of
the biggest employers -
7:21 - 7:24of Amazon Mechanical Turk workers:
-
7:24 - 7:28together, almost 50,000 workers
-
7:28 - 7:32from 167 countries around the world
-
7:32 - 7:36helped us to clean, sort and label
-
7:36 - 7:40nearly a billion candidate images.
-
7:41 - 7:43That was how much effort it took
-
7:43 - 7:47to capture even a fraction
of the imagery -
7:47 - 7:51a child's mind takes in
in the early developmental years. -
7:52 - 7:56In hindsight, this idea of using big data
-
7:56 - 8:01to train computer algorithms
may seem obvious now, -
8:01 - 8:05but back in 2007, it was not so obvious.
-
8:05 - 8:09We were fairly alone on this journey
for quite a while. -
8:09 - 8:14Some very friendly colleagues advised me
to do something more useful for my tenure, -
8:14 - 8:18and we were constantly struggling
for research funding. -
8:18 - 8:20Once, I even joked to my graduate students
-
8:20 - 8:24that I would just reopen
my dry cleaner's shop to fund ImageNet. -
8:24 - 8:29After all, that's how I funded
my college years. -
8:29 - 8:31So we carried on.
-
8:31 - 8:35In 2009, the ImageNet project delivered
-
8:35 - 8:39a database of 15 million images
-
8:39 - 8:44across 22,000 classes
of objects and things -
8:44 - 8:47organized by everyday English words.
-
8:47 - 8:50In both quantity and quality,
-
8:50 - 8:53this was an unprecedented scale.
-
8:53 - 8:56As an example, in the case of cats,
-
8:56 - 8:59we have more than 62,000 cats
-
8:59 - 9:03of all kinds of looks and poses
-
9:03 - 9:08and across all species
of domestic and wild cats. -
9:08 - 9:12We were thrilled
to have put together ImageNet, -
9:12 - 9:16and we wanted the whole research world
to benefit from it, -
9:16 - 9:20so in the TED fashion,
we opened up the entire data set -
9:20 - 9:23to the worldwide
research community for free. -
9:25 - 9:29(Applause)
-
9:29 - 9:34Now that we have the data
to nourish our computer brain, -
9:34 - 9:38we're ready to come back
to the algorithms themselves. -
9:38 - 9:43As it turned out, the wealth
of information provided by ImageNet -
9:43 - 9:48was a perfect match to a particular class
of machine learning algorithms -
9:48 - 9:50called convolutional neural networks,
-
9:50 - 9:55pioneered by Kunihiko Fukushima,
Geoff Hinton, and Yann LeCun -
9:55 - 9:59back in the 1970s and '80s.
-
9:59 - 10:05Just like the brain consists
of billions of highly connected neurons, -
10:05 - 10:08a basic operating unit in a neural network
-
10:08 - 10:11is a neuron-like node.
-
10:11 - 10:13It takes input from other nodes
-
10:13 - 10:16and sends output to others.
-
10:16 - 10:21Moreover, these hundreds of thousands
or even millions of nodes -
10:21 - 10:24are organized in hierarchical layers,
-
10:24 - 10:27also similar to the brain.
-
10:27 - 10:31A typical neural network we use
to train our object recognition model -
10:31 - 10:35has 24 million nodes,
-
10:35 - 10:38140 million parameters,
-
10:38 - 10:41and 15 billion connections.
-
10:41 - 10:43That's an enormous model.
-
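The neuron-like node and the hierarchical layers described above can be written down directly. This is a minimal sketch, not the talk's actual network: the weights and inputs are arbitrary illustrative numbers, and a real convolutional network adds convolution, pooling, and training on data.

```python
import math

def node(inputs, weights, bias):
    """One neuron-like node: weighted sum of its inputs, then a sigmoid."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """One layer in the hierarchy: each node reads the previous layer's outputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]

x = [0.5, -1.0, 0.25]                                  # outputs of an earlier layer
hidden = layer(x, [[1.0, 0.5, -0.5], [-1.0, 2.0, 0.0]], [0.1, -0.2])
output = layer(hidden, [[2.0, -1.5]], [0.0])           # final node's activation
print(output)
```

Scale this pattern up to millions of nodes and 140 million learned parameters and you have the kind of model the talk is describing.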
10:43 - 10:47Powered by the massive data from ImageNet
-
10:47 - 10:52and modern CPUs and GPUs
to train such a humongous model, -
10:52 - 10:55the convolutional neural network
-
10:55 - 10:58blossomed in a way that no one expected.
-
10:58 - 11:01It became the winning architecture
-
11:01 - 11:06to generate exciting new results
in object recognition. -
11:06 - 11:09This is a computer telling us
-
11:09 - 11:11this picture contains a cat
-
11:11 - 11:13and where the cat is.
-
11:13 - 11:15Of course there are more things than cats,
-
11:15 - 11:18so here's a computer algorithm telling us
-
11:18 - 11:21the picture contains
a boy and a teddy bear; -
11:21 - 11:25a dog, a person, and a small kite
in the background; -
11:25 - 11:28or a picture of very busy things
-
11:28 - 11:33like a man, a skateboard,
railings, a lamppost, and so on. -
11:33 - 11:38Sometimes, when the computer
is not so confident about what it sees, -
11:39 - 11:42we have taught it to be smart enough
-
11:42 - 11:46to give us a safe answer
instead of committing too much, -
11:46 - 11:48just like we would do,
-
11:48 - 11:53but other times our computer algorithm
is remarkable at telling us -
11:53 - 11:55what exactly the objects are,
-
11:55 - 11:59like the make, model and year of the cars. -
-
11:59 - 12:04We applied this algorithm to millions
of Google Street View images -
12:04 - 12:07across hundreds of American cities,
-
12:07 - 12:10and we have learned something
really interesting: -
12:10 - 12:14first, it confirmed our common wisdom
-
12:14 - 12:17that car prices correlate very well
-
12:17 - 12:19with household incomes.
-
12:19 - 12:24But surprisingly, car prices
also correlate well -
12:24 - 12:26with crime rates in cities,
-
12:27 - 12:31or voting patterns by zip codes.
-
12:32 - 12:34So wait a minute. Is that it?
-
12:34 - 12:39Has the computer already matched
or even surpassed human capabilities? -
12:39 - 12:42Not so fast.
-
12:42 - 12:46So far, we have just taught
the computer to see objects. -
12:46 - 12:51This is like a small child
learning to utter a few nouns. -
12:51 - 12:54It's an incredible accomplishment,
-
12:54 - 12:56but it's only the first step.
-
12:56 - 13:00Soon, another developmental
milestone will be hit, -
13:00 - 13:03and children begin
to communicate in sentences. -
13:03 - 13:08So instead of saying
this is a cat in the picture, -
13:08 - 13:13you already heard the little girl
telling us this is a cat lying on a bed. -
13:13 - 13:18So to teach a computer
to see a picture and generate sentences, -
13:18 - 13:22the marriage between big data
and machine learning algorithms -
13:22 - 13:25has to take another step.
-
13:25 - 13:29Now, the computer has to learn
from both pictures -
13:29 - 13:32as well as natural language sentences
-
13:32 - 13:35generated by humans.
-
13:35 - 13:39Just like the brain integrates
vision and language, -
13:39 - 13:44we developed a model
that connects parts of visual things -
13:44 - 13:46like visual snippets
-
13:46 - 13:50with words and phrases in sentences.
-
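One way to picture "connecting visual snippets with words" is scoring every (image-region, word) pair by the similarity of their embedding vectors. This sketch is only inspired by the description above: the region and word embeddings here are invented, whereas in the real work they are learned jointly from images and human-written sentences.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical 2-dimensional embeddings for illustration only.
region_embeddings = {"furry_blob": [0.9, 0.1], "flat_rectangle": [0.1, 0.9]}
word_embeddings = {"cat": [0.8, 0.2], "bed": [0.2, 0.8]}

# Align each visual snippet with its best-matching word.
for region, rvec in region_embeddings.items():
    best = max(word_embeddings, key=lambda w: cosine(rvec, word_embeddings[w]))
    print(region, "->", best)
```

A sentence generator can then stitch the aligned words and phrases into descriptions like "a cat lying on a bed."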
13:50 - 13:53About four months ago,
-
13:53 - 13:56we finally tied all this together
-
13:56 - 13:59and produced one of the first
computer vision models -
13:59 - 14:03that is capable of generating
a human-like sentence -
14:03 - 14:07when it sees a picture for the first time.
-
14:07 - 14:12Now, I'm ready to show you
what the computer says -
14:12 - 14:14when it sees the picture
-
14:14 - 14:17that the little girl saw
at the beginning of this talk. -
14:20 - 14:23(Video) Computer: A man is standing
next to an elephant. -
14:24 - 14:28A large airplane sitting on top
of an airport runway. -
14:29 - 14:33FFL: Of course, we're still working hard
to improve our algorithms, -
14:33 - 14:36and they still have a lot to learn.
-
14:36 - 14:38(Applause)
-
14:40 - 14:43And the computer still makes mistakes.
-
14:43 - 14:46(Video) Computer: A cat lying
on a bed in a blanket. -
14:46 - 14:49FFL: So of course, when it sees
too many cats, -
14:49 - 14:52it thinks everything
might look like a cat. -
14:53 - 14:56(Video) Computer: A young boy
is holding a baseball bat. -
14:56 - 14:58(Laughter)
-
14:58 - 15:03FFL: Or, if it hasn't seen a toothbrush,
it confuses it with a baseball bat. -
15:03 - 15:07(Video) Computer: A man riding a horse
down a street next to a building. -
15:07 - 15:09(Laughter)
-
15:09 - 15:12FFL: We haven't taught Art 101
to the computers. -
15:14 - 15:17(Video) Computer: A zebra standing
in a field of grass. -
15:17 - 15:20FFL: And it hasn't learned to appreciate
the stunning beauty of nature -
15:20 - 15:22like you and I do.
-
15:22 - 15:25So it has been a long journey.
-
15:25 - 15:30To get from age zero to three was hard.
-
15:30 - 15:35The real challenge is to go
from three to 13 and far beyond. -
15:35 - 15:39Let me remind you with this picture
of the boy and the cake again. -
15:39 - 15:44So far, we have taught
the computer to see objects -
15:44 - 15:48or even tell us a simple story
when seeing a picture. -
15:48 - 15:52(Video) Computer: A person sitting
at a table with a cake. -
15:52 - 15:54FFL: But there's so much more
to this picture -
15:54 - 15:56than just a person and a cake.
-
15:56 - 16:01What the computer doesn't see
is that this is a special Italian cake -
16:01 - 16:04that's only served during Easter time.
-
16:04 - 16:07The boy is wearing his favorite t-shirt
-
16:07 - 16:11given to him as a gift by his father
after a trip to Sydney, -
16:11 - 16:15and you and I can all tell how happy he is
-
16:15 - 16:18and what's exactly on his mind
at that moment. -
16:19 - 16:22This is my son Leo.
-
16:22 - 16:25On my quest for visual intelligence,
-
16:25 - 16:27I think of Leo constantly
-
16:27 - 16:30and the future world he will live in.
-
16:30 - 16:32When machines can see,
-
16:32 - 16:37doctors and nurses will have
extra pairs of tireless eyes -
16:37 - 16:41to help them to diagnose
and take care of patients. -
16:41 - 16:45Cars will run smarter
and safer on the road. -
16:45 - 16:48Robots, not just humans,
-
16:48 - 16:53will help us to brave the disaster zones
to save the trapped and wounded. -
16:54 - 16:58We will discover new species,
better materials, -
16:58 - 17:02and explore unseen frontiers
with the help of the machines. -
17:03 - 17:07Little by little, we're giving sight
to the machines. -
17:07 - 17:10First, we teach them to see.
-
17:10 - 17:13Then, they help us to see better.
-
17:13 - 17:17For the first time, human eyes
won't be the only ones -
17:17 - 17:20pondering and exploring our world.
-
17:20 - 17:23We will not only use the machines
for their intelligence, -
17:23 - 17:30we will also collaborate with them
in ways that we cannot even imagine. -
17:30 - 17:32This is my quest:
-
17:32 - 17:34to give computers visual intelligence
-
17:34 - 17:40and to create a better future
for Leo and for the world. -
17:40 - 17:41Thank you.
-
17:41 - 17:45(Applause)
- Title:
- How we're teaching computers to understand pictures
- Speaker:
- Fei-Fei Li
- Description:
-
When a very young child looks at a picture, she can identify simple elements: "cat," "book," "chair." Now, computers are getting smart enough to do that too. What's next? In a thrilling talk, computer vision expert Fei-Fei Li describes the state of the art — including the database of 15 million photos her team built to "teach" a computer to understand pictures — and the key insights yet to come.
- Video Language:
- English
- Team:
- closed TED
- Project:
- TEDTalks
- Duration:
- 17:58