
Behind the Mic: The Science of Talking with Computers

  • 0:04 - 0:07
    ♪ (Music fades in) ♪
  • 0:15 - 0:19
    (Chirping)
  • 0:25 - 0:30
    (Vocalizations, different languages)
  • 0:32 - 0:35
    (Talking overlaps in background)
  • 0:36 - 0:38
    (Laughing)
  • 0:38 - 0:44
    (Computerized beeping)
  • 0:45 - 0:47
    (Beep)
  • 0:49 - 0:54
    (Man) We come into this world with the
    innate ability to learn to interact
  • 0:54 - 0:57
    with other sentient beings.
  • 0:58 - 1:00
    (Child vocalizing)
  • 1:00 - 1:03
    (Man) Suppose you had to interact with
    other people by writing little messages.
  • 1:03 - 1:05
    (Child vocalizes)
  • 1:05 - 1:07
    (Man) It'd be a real pain.
  • 1:07 - 1:09
    (Man) And that's how we interact
    with computers.
  • 1:09 - 1:12
    It's much easier just to talk to them...
    just so much easier...
  • 1:12 - 1:13
    (Child vocalizes)
  • 1:14 - 1:16
    (Man) If the computers could understand
    what we're saying.
  • 1:18 - 1:20
    For that, you need really
    good speech recognition.
  • 1:21 - 1:24
    (Narrator) The first speech recognition
    system was developed by Bell Laboratories
  • 1:24 - 1:28
    in 1952. It could only recognize
    numbers spoken by one person.
  • 1:28 - 1:32
    In the 1970s, Carnegie Mellon
    came out with the Harpy System.
  • 1:32 - 1:37
    This was able to recognize over
    1,000 words and different pronunciations
  • 1:37 - 1:40
    (Narrator) of the same word.
    - (Man) Tomato - (Woman) Tomato
  • 1:40 - 1:43
    (Narrator) Speech recognition continued
    in the 80s with the introduction of the
  • 1:43 - 1:46
    Hidden Markov Model, which
    used a more mathematical approach
  • 1:46 - 1:50
    to analyzing sound waves that led to
    many breakthroughs we have today.
  • 1:50 - 1:53
    You're taking in very raw audio waveforms
  • 1:53 - 1:55
    like you get through a microphone
  • 1:55 - 1:56
    on your phone
  • 1:56 - 1:57
    or whatever...
  • 1:57 - 2:02
    (Woman) We chop it into small pieces
    and it tries to identify which phoneme
  • 2:02 - 2:05
    was spoken in that piece of speech.
  • 2:05 - 2:09
    - Phoneme is a primitive unit for
    expressing words.
  • 2:10 - 2:14
    (voicing phonemes shown above)
  • 2:15 - 2:20
    And then it stitches those together
    into likely words like Palo Alto.
  • 2:20 - 2:24
    - Speech recognition today is good at
    transcribing what you've said...
  • 2:24 - 2:25
    (Man, to phone) What's the weather
    like in Topeka?
  • 2:25 - 2:30
    (Man) You can talk about travels, your
    contacts, like, "Where can I get pizza?"
  • 2:30 - 2:32
    (Phone) Here are the listings for pizza.
  • 2:32 - 2:34
    (Man) "How tall is the Eiffel Tower?"
    (Phone) The Eiffel Tower is ...
  • 2:34 - 2:37
    (Woman) We've made tremendous
    improvements very quickly.
  • 2:37 - 2:39
    (Man, to phone) Who is the 21st
    President of the United States?
  • 2:40 - 2:42
    (Phone beeps)
    (Phone) Chester A. Arthur was the 21st...
  • 2:42 - 2:44
    (Man, to phone) Okay, Google,
    where is he from?
  • 2:44 - 2:47
    (Man) Years ago, you had to be an engineer
    to interact with computers.
  • 2:48 - 2:50
    Today, everybody can interact.
  • 2:50 - 2:54
    - One thing still in its
    infancy is understanding.
  • 2:54 - 2:56
    - We need a far more sophisticated
    language understanding model
  • 2:56 - 2:59
    that understands what the sentence means.
  • 2:59 - 3:01
    We're still a very long way from that.
  • 3:01 - 3:03
    (Beeping)
  • 3:04 - 3:07
    ♪ (Soft background music) ♪
  • 3:08 - 3:12
    (Woman) Our ability to use language is one
    of the things that helps us have culture.
  • 3:13 - 3:19
    It's one of the things that helps
    us pass on traditions across generations.
  • 3:20 - 3:26
    Figuring out how the system of language
    works, even though it seems easy,
  • 3:26 - 3:33
    turns out to be very hard, but is one that
    every baby understands by 2 years old.
  • 3:33 - 3:38
    (Girl) There's two of them.
    (Woman) There's two Ls, yeah (spells word)
  • 3:38 - 3:41
    - Language is extremely complex
    and sophisticated...
  • 3:41 - 3:42
    - From the semantics
  • 3:42 - 3:44
    - (Man in chair) Ironies...
    - (Woman) Strong accents...
  • 3:44 - 3:45
    - (Man) Facial expressions...
  • 3:45 - 3:47
    - Human emotions, because that's
    part of how we communicate.
  • 3:47 - 3:49
    - Humor...
  • 3:49 - 3:52
    (Aside) Do I have to be careful
    not to offend the dinosaur?
  • 3:52 - 3:54
    - Language has so many different
    layers and that's why it's
  • 3:54 - 3:56
    such a difficult problem.
  • 3:56 - 3:59
    (Man) The present human brain
    and the learning algorithms in it
  • 3:59 - 4:02
    are far, far better at things like
    language understanding
  • 4:02 - 4:05
    and they're still a lot better
    at pun recognition.
  • 4:05 - 4:09
    - Whether or not we replicate exactly
    what the brain does, to understand
  • 4:09 - 4:14
    language and speech, is still a question.
  • 4:15 - 4:17
    (Beeping)
  • 4:18 - 4:23
    (Man) For many years, we believed that
    neural networks should work better than
  • 4:23 - 4:27
    the dumb existing technology that's
    basically just "table look-up."
  • 4:27 - 4:33
    Then, in 2009, two of my students
    (with some help from me) got it
  • 4:33 - 4:37
    working better. The first time it was
    just a little better.
  • 4:37 - 4:40
    But it was obvious that this could be
    improved to work much better.
  • 4:40 - 4:44
    (Man) The brain has this system of neurons
    all computing in parallel.
  • 4:45 - 4:49
    All knowledge in the brain is in the
    strength of connection between neurons.
  • 4:49 - 4:53
    What I mean by "neural net" is something
    that is simulated on a conventional
  • 4:53 - 4:59
    computer, but is designed to work in
    roughly the same ways as the brain.
  • 5:00 - 5:04
    Until quite recently, people got features
    by hand engineering them.
  • 5:04 - 5:09
    They looked at sine waves and did Fourier
    analysis and tried to figure out
  • 5:09 - 5:12
    what features they should feed to the
    pattern recognition system.
  • 5:12 - 5:15
    The thing about neural networks is that
    they learn their own features.
  • 5:15 - 5:20
    In particular, they can learn features
    and features of features, etc.,
  • 5:21 - 5:24
    and that's led to huge improvement
    in speech recognition.
  • 5:24 - 5:27
    - But you can also use them for language
    understanding tasks.
  • 5:27 - 5:33
    How you do this is to represent words
    in very high-dimensional spaces.
  • 5:33 - 5:36
    - (Man) We can now deal with analogies
    where a word is represented as a list
  • 5:36 - 5:38
    of numbers.
  • 5:38 - 5:44
    For example, if I take 100 numbers that
    represent "Paris," and I subtract from it
  • 5:44 - 5:50
    "France" and add "Italy," if I look
    at the numbers I have, the closest
  • 5:50 - 5:53
    thing is a list of numbers that
    represents "Rome."
  • 5:53 - 5:58
    By first converting words into numbers,
    using a neural net, you can actually
  • 5:58 - 6:01
    do this analogical reasoning.
  • 6:01 - 6:06
    I predict that, in the next five years, it
    will be clear that these neural networks
  • 6:06 - 6:11
    with new learning algorithms will give us
    much better language understanding.
  • 6:13 - 6:19
    (Woman) When we started out, we thought
    things like chess or mathematics or logic
  • 6:19 - 6:21
    would be things that were really hard.
  • 6:21 - 6:26
    They're not that hard. We ended up with
    a machine that played as well as
  • 6:26 - 6:28
    a grandmaster at chess.
  • 6:28 - 6:33
    What we thought would be easy for
    a computer system, like language,
  • 6:33 - 6:37
    has turned out incredibly hard.
  • 6:37 - 6:42
    (Man) I can't even imagine the moment of
    success quite yet because there are so many
  • 6:42 - 6:47
    pieces of this puzzle that are unsolved,
    both from a science point of view
  • 6:47 - 6:51
    as well as a technical implementation
    point of view.
  • 6:51 - 6:52
    There are a lot of unknowns.
  • 6:52 - 6:56
    (Woman) Those are the great revolutions.
    Not just what we fiddle with what
  • 6:56 - 7:00
    we already know, but when we discover
    something completely new and unexpected.
  • 7:00 - 7:03
    (Man) Once you are in the area of
  • 7:03 - 7:09
    human-level performance,
    that will be pretty remarkable.
  • 7:13 - 7:14
    (Beep)
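Note on the word arithmetic described at 5:38 - 6:01: the idea of subtracting "France" from "Paris", adding "Italy", and landing nearest to "Rome" can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The four-number vectors are made up by hand for the example (the speaker describes learned lists of about 100 numbers), the names VECTORS, cosine and analogy are hypothetical, and cosine similarity is assumed as the measure of "closest", which the video does not specify.

```python
import math

# Hand-made toy vectors, for illustration only. Real systems learn
# much longer vectors from large amounts of text with a neural net.
VECTORS = {
    "Paris":   [0.9, 0.8, 0.1, 0.0],
    "France":  [0.1, 0.9, 0.0, 0.1],
    "Italy":   [0.0, 0.1, 0.9, 0.1],
    "Rome":    [0.8, 0.1, 0.9, 0.0],
    "Berlin":  [0.9, 0.0, 0.1, 0.8],
    "Germany": [0.1, 0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length lists of numbers."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three input words are skipped so the answer is a new word.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


if __name__ == "__main__":
    print(analogy("Paris", "France", "Italy"))
```

With these toy values, the result of "Paris" minus "France" plus "Italy" lies closest to the vector for "Rome". The example only shows the arithmetic; how the numbers are learned in the first place is outside its scope.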
Video Language:
English
Team:
Captions Requested
Duration:
07:19
  • Thank you so much, Michael. Your descriptions of non-verbal sounds are great!

    Claude

  • You're welcome - thank you very much. :)

  • Thank you, Michael and Claude.
