
The AES block cipher (14 min)

  • 0:00 - 0:05
    Over the years it became clear that DES
    and triple DES are simply not designed for
  • 0:05 - 0:09
    modern hardware and are too slow. As a
    result, NIST started a new process to
  • 0:09 - 0:14
    standardize a new block cipher called
    the Advanced Encryption Standard, or AES
  • 0:14 - 0:19
    for short. NIST started its effort in 1997
    when it requested proposals for a new
  • 0:19 - 0:23
    block cipher. It received fifteen
    submissions a year later. And finally in
  • 0:23 - 0:27
    the year 2000, it adopted the cipher
    called Rijndael as the Advanced Encryption
  • 0:27 - 0:32
    Standard. This was a cipher designed in
    Belgium. We already said that its block
  • 0:32 - 0:38
    size is 128 bits and it has three possible
    key sizes: 128, 192, and 256 bits. Now,
  • 0:38 - 0:45
    the assumption is that the larger the key
    size is, the more secure the block cipher
  • 0:45 - 0:50
    is as a pseudorandom permutation. But
    because a larger key also means more rounds
  • 0:50 - 0:56
    in its operation, the cipher also becomes
    slower. So the larger the key, supposedly,
  • 0:56 - 1:02
    the more secure the cipher, but also the
    slower it becomes. So for example, AES 128
  • 1:02 - 1:07
    is the fastest of these ciphers and AES
    256 is the slowest. Now AES is built as
  • 1:07 - 1:12
    what's called a substitution-permutation
    network. It's not a Feistel network.
  • 1:12 - 1:16
    Remember that in a Feistel network, half
    the bits were unchanged from round to
  • 1:16 - 1:21
    round. In a substitution-permutation
    network, all the bits are changed in every
  • 1:21 - 1:25
    round. And the network works as follows,
    so here we have the first round of the
  • 1:25 - 1:30
    substitution-permutation network, where
    the first thing we do is we XOR the
  • 1:30 - 1:34
    current state with the round key. In this
    case the first round key. Then we go through
  • 1:34 - 1:39
    a substitution layer, where blocks of data
    are replaced with other blocks based on
  • 1:39 - 1:43
    what the substitution table says. And then
    we go through a permutation layer where
  • 1:43 - 1:48
    bits are permuted and shuffled around. And
    then we do this again. We XOR with the next
  • 1:48 - 1:52
    round key, we go through a substitution
    phase, and we permute the bits around.
  • 1:52 - 1:57
    And so on, and so on, and so forth, until
    we reach the final round where we XOR
  • 1:57 - 2:02
    with the very last round key, and then
    out comes the output. Now, an important
  • 2:02 - 2:06
    point about this design is that, in fact,
    because of how it's built, every step in
  • 2:06 - 2:11
    this network needs to be reversible, so
    that the whole thing is reversible. And so
  • 2:11 - 2:16
    the way we would decrypt, essentially, is
    we would take the output and simply apply
  • 2:16 - 2:20
    each step of the network in reverse order.
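As a small illustration of this structure, here is a toy one-round substitution-permutation cipher in C. The 4-bit S box and the bit rotation used as the permutation are made up for the example and are not the AES tables; the point is only that each step is invertible, so decryption just undoes the steps in reverse order.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy one-round substitution-permutation cipher on an 8-bit block.
     * The 4-bit S box and the rotation below are made-up placeholders,
     * NOT the AES tables -- only the encrypt/decrypt structure matters. */

    static const uint8_t SBOX[16] = {
        0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
    };
    static uint8_t SBOX_INV[16];   /* filled in at startup */

    /* "Permutation layer": rotate the 8 bits left by 3, and its inverse. */
    static uint8_t perm(uint8_t x)     { return (uint8_t)((x << 3) | (x >> 5)); }
    static uint8_t perm_inv(uint8_t x) { return (uint8_t)((x >> 3) | (x << 5)); }

    static uint8_t round_enc(uint8_t s, uint8_t rk) {
        s ^= rk;                                           /* XOR the round key  */
        s = (uint8_t)(SBOX[s >> 4] << 4 | SBOX[s & 0xF]);  /* substitution layer */
        return perm(s);                                    /* permutation layer  */
    }

    static uint8_t round_dec(uint8_t s, uint8_t rk) {
        /* Apply the inverse of each step, in reverse order. */
        s = perm_inv(s);
        s = (uint8_t)(SBOX_INV[s >> 4] << 4 | SBOX_INV[s & 0xF]);
        return s ^ rk;
    }

    int main(void) {
        for (int i = 0; i < 16; i++) SBOX_INV[SBOX[i]] = (uint8_t)i;
        uint8_t c = round_enc(0xA7, 0x3D);
        printf("encrypted: 0x%02X, decrypted back: 0x%02X\n", c, round_dec(c, 0x3D));
        return 0;
    }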
    So we start with the permutation step, and
  • 2:20 - 2:24
    we have to make sure that step is
    reversible. Then we look at the
  • 2:24 - 2:28
    substitution layer, and we have to make
    sure this step is reversible. And this is
  • 2:28 - 2:33
    very different from DES. In DES, if you
    remember, the substitution tables were not
  • 2:33 - 2:37
    reversible at all. In fact, they
    map six bits to four bits. Whereas
  • 2:37 - 2:41
    here, everything has to be reversible,
    otherwise it would be impossible to
  • 2:41 - 2:45
    decrypt. And of course the XOR with the
    round key is reversible as well. Okay? So
  • 2:45 - 2:51
    inversion of a substitution-permutation
    network is simply done by applying all of
  • 2:51 - 2:56
    the steps in the reverse order. So now
    that we understand the generic
  • 2:56 - 3:02
    construction. Let's look at the specifics
    of AES. So AES operates on a 128-bit block.
  • 3:02 - 3:08
    Which is sixteen bytes. So what we do with
    AES is we write those sixteen bytes as a
  • 3:08 - 3:13
    four by four matrix. Each cell in the
    matrix contains one byte. And then we
  • 3:13 - 3:17
    start with the first round, so we XOR
    with the first round key and then apply a
  • 3:17 - 3:21
    certain round function that includes
    substitutions and permutations and other
  • 3:21 - 3:26
    operations on the state. And again, these
    three functions that are applied here have
  • 3:26 - 3:30
    to be invertible so that in fact the
    cipher can be decrypted. And then we XOR
  • 3:30 - 3:35
    with the next round key and we do that
    again. Again we apply the round function
  • 3:35 - 3:39
    and XOR the round key. And we do that
    again and again and again. We do it ten
  • 3:39 - 3:43
    times. Although interestingly in the last
    round, the MixColumn step is actually
  • 3:43 - 3:49
    missing. And then finally, we XOR with the
    last round key, and out comes the output.
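As a rough sketch of this in code, assuming the standard column-major layout of the AES state (state[row][col] = input[row + 4*col]), here is the loading of a block into the 4-by-4 matrix and the XOR-with-round-key step; the round function itself is sketched piece by piece further below.

    #include <stdint.h>

    /* The AES state: a 4x4 array of bytes, filled column by column,
     * i.e. state[row][col] = input[row + 4*col]. */
    typedef uint8_t aes_state[4][4];

    static void load_state(aes_state s, const uint8_t in[16]) {
        for (int col = 0; col < 4; col++)
            for (int row = 0; row < 4; row++)
                s[row][col] = in[row + 4 * col];
    }

    /* The XOR-with-round-key step: the round key is itself a 4x4 array
     * of bytes that is simply XORed into the current state. */
    static void add_round_key(aes_state s, aes_state rk) {
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                s[row][col] ^= rk[row][col];
    }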
  • 3:49 - 3:54
    Again, in every phase here, we always,
    always, always keep this four by four
  • 3:54 - 3:59
    array. And so the output is also four by
    four, which is sixteen bytes, which is 128
  • 3:59 - 4:05
    bits. Now the round keys themselves of
    course come from a sixteen-byte AES key
  • 4:05 - 4:11
    using key expansion. So the key expansion
    maps a sixteen-byte AES key
  • 4:11 - 4:17
    into eleven keys, each one being sixteen
    bytes. So these keys themselves are also a
  • 4:17 - 4:22
    four by four array that's XORed into the
    current state. Okay, so that's the
  • 4:22 - 4:26
    schematic of how AES works. And the only
    thing that's left to do is specify these
  • 4:26 - 4:30
    three functions: ByteSub, ShiftRow, and
    MixColumn. And those are fairly easy to
  • 4:30 - 4:34
    explain. So I'm just gonna give you the
    high level description of what they are.
  • 4:34 - 4:39
    And those interested in the details can
    look them up online. So the way byte
  • 4:39 - 4:45
    substitution works is, literally, there's one
    S box containing 256 bytes. And
  • 4:45 - 4:51
    essentially, what it does is it applies
    the S box to every byte in the current
  • 4:51 - 4:56
    state. So, let me explain what I mean by
    that. So the current state is gonna be
  • 4:56 - 5:02
    this four by four table. So here we have
    the four by four table. And to each
  • 5:02 - 5:07
    element in this table, we apply the S box.
    So let's call it the A table. And then
  • 5:07 - 5:13
    what we do is, essentially, for all four
    by four entries, the next
  • 5:13 - 5:19
    A(i,j) becomes the lookup table
    evaluated at the current A(i,j). So we use
  • 5:19 - 5:25
    the current cell as an entry, as an index,
    into the look up table. And then the value
  • 5:25 - 5:30
    of the look up table is what's output.
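In code, this byte substitution step is just sixteen table lookups; a minimal sketch, with the 256-byte S box table assumed to be provided (hard-coded or computed elsewhere) rather than listed here:

    #include <stdint.h>

    /* ByteSub: replace every byte of the 4x4 state with its S box entry,
     * i.e. A[i][j] = S[ A[i][j] ].  The 256-byte table `sbox` is assumed
     * to be the AES S box, hard-coded or computed elsewhere. */
    static void byte_sub(uint8_t state[4][4], const uint8_t sbox[256]) {
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                state[row][col] = sbox[state[row][col]];
    }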
    Okay. So, that's the first step. The next
  • 5:30 - 5:35
    step that happens is the ShiftRow step,
    which is basically just a permutation. So
  • 5:35 - 5:40
    essentially we kind of do a cyclic
    shift on each one of the rows. So you can
  • 5:40 - 5:45
    see the second row is cyclically shifted
    by one position. The third row is
  • 5:45 - 5:50
    cyclically shifted by two positions and
    the fourth row is cyclically shifted by
  • 5:50 - 5:54
    three positions. And the last thing we do
    is MixColumn, where literally we apply a
  • 5:54 - 5:59
    linear transformation to each one of these
    columns. So there's a certain matrix that
  • 5:59 - 6:03
    multiplies each one of these columns and
    it becomes the next column. So this
  • 6:03 - 6:07
    linear transformation is applied
    independently to each one of the columns.
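Here is a rough C sketch of these two steps, using the same column-major 4x4 state as above; the MixColumn matrix below is the standard AES one, and xtime is multiplication by 2 in GF(2^8).

    #include <stdint.h>

    /* ShiftRow: cyclically shift row r of the state left by r positions
     * (row 0 untouched, row 1 by one, row 2 by two, row 3 by three). */
    static void shift_rows(uint8_t s[4][4]) {
        for (int r = 1; r < 4; r++) {
            uint8_t tmp[4];
            for (int c = 0; c < 4; c++) tmp[c] = s[r][(c + r) % 4];
            for (int c = 0; c < 4; c++) s[r][c] = tmp[c];
        }
    }

    /* Multiply by x (i.e. by 2) in GF(2^8) modulo the AES polynomial 0x11B. */
    static uint8_t xtime(uint8_t a) {
        return (uint8_t)((a << 1) ^ ((a >> 7) * 0x1B));
    }

    /* MixColumn: multiply each column by the fixed AES matrix
     *   [2 3 1 1]
     *   [1 2 3 1]
     *   [1 1 2 3]
     *   [3 1 1 2]   over GF(2^8). */
    static void mix_columns(uint8_t s[4][4]) {
        for (int c = 0; c < 4; c++) {
            uint8_t a0 = s[0][c], a1 = s[1][c], a2 = s[2][c], a3 = s[3][c];
            s[0][c] = (uint8_t)(xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3);
            s[1][c] = (uint8_t)(a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3);
            s[2][c] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3));
            s[3][c] = (uint8_t)((xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3));
        }
    }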
  • 6:07 - 6:12
    Now, I should point out that
    ShiftRow and MixColumn are very easy
  • 6:12 - 6:17
    to implement in code. And I should say
    that the byte substitution itself is
  • 6:17 - 6:22
    also easily computable, so that you can
    actually write code that takes less than
  • 6:22 - 6:28
    256 bytes to write. And you can kind of
    shrink the description of AES by literally
  • 6:28 - 6:33
    storing code that computes the table
    rather than hardwiring the table into your
  • 6:33 - 6:38
    implementation. And in fact, this is kind
    of a generic fact about AES, that if you
  • 6:38 - 6:44
    allow no precomputation at all, and
    instead compute even the S box on the
  • 6:44 - 6:49
    fly. Then in fact you get a fairly small
    implementation of AES, so it could fit
  • 6:49 - 6:53
    in very constrained environments where
    there isn't enough room to hold
  • 6:53 - 6:57
    complicated code. But of course, this will
    be the slowest implementation, because
  • 6:57 - 7:01
    everything is computed now on the fly, and
    as a result, the implementation,
  • 7:01 - 7:05
    obviously, is gonna be slower than ones
    that use precomputed tables. And then there is
  • 7:05 - 7:09
    this trade off. For example if you have a
    lot of space, and you can support large
  • 7:09 - 7:13
    code. You can actually precompute quite a
    bit of the three steps that I just
  • 7:13 - 7:17
    mentioned. In fact, there are multiple
    options for precomputation: you can build
  • 7:17 - 7:21
    a table that's only four kilobytes big. You
    can build tables that are even larger,
  • 7:21 - 7:24
    maybe 24 kilobytes. So basically you will
    have these big tables in your
  • 7:24 - 7:28
    implementation. But then your actual
    performance is going to be really good,
  • 7:28 - 7:32
    because all you're doing is just table
    look-ups and XORs. You're not doing
  • 7:32 - 7:35
    any other complicated arithmetic. And as a
    result, if you can do a lot of
  • 7:35 - 7:39
    precomputation, these three steps here,
    ByteSub, ShiftRow, and MixColumn,
  • 7:39 - 7:43
    can be converted into just a small
    number of table lookups
  • 7:43 - 7:48
    and some XORs. Or, all you precompute is
    just the S box itself, so now your
  • 7:48 - 7:53
    implementation would just have 256 bytes
    hard coded. The rest would just be code
  • 7:53 - 7:57
    that's actually computing these three
    functions. The performance would be slower
  • 7:57 - 8:01
    than in the previous case, but the code
    footprint would also be smaller. So
  • 8:01 - 8:05
    overall, there's this nice tradeoff
    between code size and performance. So on
  • 8:05 - 8:09
    high-end machines, on high-end servers,
    where you can afford to have a lot of
  • 8:09 - 8:13
    code, you can precompute and store these
    big tables and get the best performance.
  • 8:13 - 8:17
    Whereas on low-end machines, like eight-bit
    smart cards, or think of like an eight-bit
  • 8:17 - 8:21
    wristwatch, you would actually have a
    relatively small implementation of AES.
  • 8:21 - 8:26
    But as a result of course it won't be so
    fast. So here's an example that's a little
  • 8:26 - 8:30
    unusual, suppose you wanted to implement
    AES in JavaScript so you can send an AES
  • 8:30 - 8:34
    library to the browser and have the
    browser actually do AES by itself. So in
  • 8:34 - 8:39
    this case what you'd like to do is you'd
    like to both shrink the code size, so that
  • 8:39 - 8:43
    on the network there's minimum traffic to
    send a library over to the browser but, at
  • 8:43 - 8:48
    the same time, you'd like the browser
    performance to be as fast as possible. And
  • 8:48 - 8:52
    so this is something that we did a while
    ago. Essentially, the idea is that the code
  • 8:52 - 8:57
    that actually gets sent to the browser
    doesn't have any precomputed tables, and as
  • 8:57 - 9:01
    a result it's fairly small code. But then
    the minute it lands on the browser, what
  • 9:01 - 9:05
    the browser will do is actually precompute
    all the tables. So in some sense
  • 9:05 - 9:10
    the code goes from just being small and
    compact. It gets bloated with all these
  • 9:10 - 9:14
    precomputed tables. But those are stored
    on the laptop, which presumably has a lot
  • 9:14 - 9:17
    of memory. And then once you have the
    precomputed tables you actually encrypt
  • 9:17 - 9:21
    using them. And that's how you get the
    best performance. Okay? So if you have to
  • 9:21 - 9:25
    stand in implementation AES over the
    network, you can kind of get the best of
  • 9:25 - 9:29
    all worlds, where the code over the
    network is small, but when it reaches the
  • 9:29 - 9:33
    target client, it can kind of inflate
    itself. And then get the best performance
  • 9:33 - 9:38
    as it's doing encryption on the clients.
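As an illustration of shipping code that computes the table rather than the table itself, here is a sketch that derives the 256-byte AES S box at startup from its definition, the multiplicative inverse in GF(2^8) followed by a fixed affine map. This is just one way such a "self-inflating" implementation might build its tables, not a description of the actual library mentioned above.

    #include <stdint.h>

    /* Multiply two elements of GF(2^8) modulo the AES polynomial 0x11B. */
    static uint8_t gf_mul(uint8_t a, uint8_t b) {
        uint8_t p = 0;
        while (b) {
            if (b & 1) p ^= a;
            a = (uint8_t)((a << 1) ^ ((a >> 7) * 0x1B));
            b >>= 1;
        }
        return p;
    }

    /* Build the 256-byte AES S box from its definition:
     * S(x) = affine(x^-1 in GF(2^8)), with 0 mapped to 0 by convention.
     * Brute-force inversion is fine here since this runs once at startup. */
    static void build_sbox(uint8_t sbox[256]) {
        for (int x = 0; x < 256; x++) {
            uint8_t inv = 0;
            for (int y = 1; y < 256; y++)
                if (gf_mul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
            /* Affine map: output bit i is
             * b_i ^ b_{i+4} ^ b_{i+5} ^ b_{i+6} ^ b_{i+7} ^ c_i (indices mod 8),
             * with the constant c = 0x63. */
            uint8_t out = 0x63;
            for (int i = 0; i < 8; i++) {
                uint8_t bit = (uint8_t)(((inv >> i) ^ (inv >> ((i + 4) & 7)) ^
                                         (inv >> ((i + 5) & 7)) ^ (inv >> ((i + 6) & 7)) ^
                                         (inv >> ((i + 7) & 7))) & 1);
                out ^= (uint8_t)(bit << i);
            }
            sbox[x] = out;
        }
    }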
    Now AES is such a popular block cipher,
  • 9:38 - 9:43
    that, essentially, when you build crypto into
    products, you're supposed to be
  • 9:43 - 9:48
    using AES, and as a result Intel actually
    put AES support into the processor itself.
  • 9:48 - 9:53
    So since Westmere there are special
    instructions in the Intel processor to
  • 9:53 - 9:58
    help accelerate AES. And so I listed these
    instructions here. They come in two groups:
  • 9:58 - 10:03
    aesenc and aesenclast, and then there's aeskeygenassist. So, let me explain
  • 10:03 - 10:08
    what they do. So, aesenc essentially
    implements one round of AES. Namely, apply
  • 10:08 - 10:13
    the three functions and the XOR with the
    round key. And aesenclast basically
  • 10:13 - 10:17
    implements the last round of AES.
    Remember, the last round didn't have the
  • 10:17 - 10:22
    MixColumn phase. It only had the ByteSub
    and ShiftRow steps. And so that's what
  • 10:22 - 10:27
    the aesenclast does. And the way you
    call these instructions is using 128-bit
  • 10:27 - 10:32
    registers which correspond to the state
    of AES. And so you would have one register
  • 10:32 - 10:37
    containing the state and one register
    containing the current round key. And then
  • 10:37 - 10:42
    when you call aesenc on these two registers,
    basically, it would run one round of AES
  • 10:42 - 10:48
    and place the result inside the xmm1
    state register. And as a result, if you
  • 10:48 - 10:53
    wanted to implement the whole of AES, all you
    would do is call aesenc nine times. Then
  • 10:53 - 10:58
    you would call aesenclast one time. These
    ten instructions are basically the entire
  • 10:58 - 11:03
    implementation of AES. That's it. It's that
    easy to implement AES on this hardware
  • 11:03 - 11:07
    and because these operations are now
    done inside the processor, using dedicated
  • 11:07 - 11:11
    instructions that are implemented
    in the processor, they claim
  • 11:11 - 11:16
    that they can get a fourteen X speedup
    over say an implementation that's running
  • 11:16 - 11:20
    in the same hardware but implementing AES without these special instructions. So
  • 11:20 - 11:24
    this is quite a significant speedup, and
    in fact there are now lots of products that
  • 11:24 - 11:28
    make use of these special instructions.
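Here is a sketch of that ten-instruction sequence using the corresponding C intrinsics (compile with -maes); it assumes the eleven 128-bit round keys have already been produced by key expansion, which is where aeskeygenassist would come in.

    #include <wmmintrin.h>   /* AES-NI intrinsics (compile with -maes) */
    #include <stdint.h>

    /* Encrypt one 16-byte block with AES-128 given the 11 expanded round keys.
     * This mirrors the sequence described above: an initial XOR, nine aesenc
     * rounds, and one aesenclast.  Key expansion is assumed to be done already. */
    static void aes128_encrypt_block(const __m128i round_keys[11],
                                     const uint8_t in[16], uint8_t out[16]) {
        __m128i state = _mm_loadu_si128((const __m128i *)in);
        state = _mm_xor_si128(state, round_keys[0]);         /* XOR with round key 0    */
        for (int r = 1; r <= 9; r++)
            state = _mm_aesenc_si128(state, round_keys[r]);  /* nine full rounds        */
        state = _mm_aesenclast_si128(state, round_keys[10]); /* last round: no MixColumn */
        _mm_storeu_si128((__m128i *)out, state);
    }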
    And let's just say that this is not
  • 11:28 - 11:32
    specific to Intel; AMD, for example,
    also implemented
  • 11:32 - 11:36
    kinda similar instructions in
    their Bulldozer architecture and in
  • 11:36 - 11:40
    future architectures. Okay, so let's
    talk about the security of AES. I wanna
  • 11:40 - 11:44
    mention just two attacks here. Obviously,
    AES has been studied quite a bit. But the
  • 11:44 - 11:49
    only two attacks on the full AES are the
    following two. So, first of all, if you
  • 11:49 - 11:53
    wanted to do key recovery, the best
    attack, basically, is only four times
  • 11:53 - 11:57
    faster than exhaustive search. Which means
    that instead of a 128-bit
  • 11:57 - 12:03
    key, really you should be thinking of AES
    as having a 126-bit key, because exhaustive search
  • 12:03 - 12:08
    really is gonna be four times faster than
    it should be. Of course, 2 to the 126 is
  • 12:08 - 12:13
    still more time than we have to compute,
    and this really does not hurt the security
  • 12:13 - 12:17
    of AES. The more significant attack is,
    actually, on AES-256. It turns out there's a
  • 12:17 - 12:23
    weakness in the key expansion design of
    AES which allows for what's called a
  • 12:23 - 12:28
    related key attack. So, what's a related
    key attack? Essentially, if you give me
  • 12:28 - 12:33
    about two to the 100 input-output pairs
    for AES, but from four related keys. So,
  • 12:33 - 12:38
    these are keys that are very closely
    related, namely key number one is just
  • 12:38 - 12:42
    one bit flip of key number two, which is just
    one bit flip of key number three, which is
  • 12:42 - 12:47
    just one bit flip of key number four. These
    are very closely related keys; if you like,
  • 12:47 - 12:52
    their Hamming distance is very short. But
    if you do that, then, in fact, there is a
  • 12:52 - 12:57
    two to the 100 attack. Now you should say,
    well, two to the 100 is still impractical.
  • 12:57 - 13:01
    This is still more time than we can
    actually run today. But nevertheless, the
  • 13:01 - 13:05
    fact that it's so much better than an
    exhaustive search attack, it's so much
  • 13:05 - 13:09
    better than two to the 256, is kind of a
    limitation of the cipher. But generally,
  • 13:09 - 13:14
    it's not a significant limitation, because
    it requires related keys. And so, in
  • 13:14 - 13:18
    practice, of course, you're supposed to be
    choosing your keys at random, so that you
  • 13:18 - 13:22
    have no related keys in your system. And
    as a result, this attack wouldn't apply.
  • 13:22 - 13:26
    But if you do have related keys, then
    there's a problem. So this is the end of
  • 13:26 - 13:30
    the segment, and in the next segment we're
    gonna talk about more provably secure
  • 13:30 - 13:32
    constructions for block ciphers.