"How NOT to Measure Latency" by Gil Tene

  • Not Synced
    Hi everyone, I'm Gil Tene.
  • Not Synced
    I'm going to be talking about this subject
    that I call "How NOT to Measure Latency".
  • Not Synced
    It's a subject that I've been talking now
    about for 3 years or so.
  • Not Synced
    I keep the title and change all
    the slides every time.
  • Not Synced
    A bunch of this stuff is new.
  • Not Synced
    So if you've seen any of my previous "How NOT to",
    you'll see only some things that are common.
  • Not Synced
    A nickname for the subject is this...
  • Not Synced
    Because I often will get that reaction
    from some people in the audience.
  • Not Synced
    Ever since I've told people that it's a
    nickname,
  • Not Synced
    They feel free to actually exclaim,
    "Oh S@%#!".
  • Not Synced
    And feel free to do that here in this talk.
  • Not Synced
    I'll prompt you in a couple of places
    where it is natural.
  • Not Synced
    But if you just have the urge, go ahead.
  • Not Synced
    So just a tiny bit about me.
  • Not Synced
    I am the co-founder of Azul Systems.
  • Not Synced
    I play around with garbage collection a lot.
  • Not Synced
    Here is some evidence of me playing around
    with garbage collection in my kitchen.
  • Not Synced
    That's a trash compactor.
  • Not Synced
    The compaction function wasn't working right,
    so I had to fix it.
  • Not Synced
    I thought it'd be funny to take a picture
    with a book.
  • Not Synced
    I've also built a lot of things.
  • Not Synced
    I've been playing with computers since
    the early 80's.
  • Not Synced
    I've built hardware.
  • Not Synced
    I've helped design chips.
  • Not Synced
    I've built software at many
    different levels.
  • Not Synced
    Operating systems, drivers...
    JVMs obviously.
  • Not Synced
    And lots of big systems at the system level.
  • Not Synced
    Built our own app server in the late 90's
    because WebLogic wasn't around yet.
  • Not Synced
    So, I've made a lot of mistakes,
    and I've learned from a few of them.
  • Not Synced
    This is actually a combination of a bunch
    of those mistakes looking at latency.
  • Not Synced
    I do have this hobby of depressing people
    by pulling the wool up from over your eyes,
  • Not Synced
    and this is what this talk is about.
  • Not Synced
    So, I need to give you a choice right here.
  • Not Synced
    There's the door.
  • Not Synced
    You can take the blue pill,
    and you can leave.
  • Not Synced
    Tomorrow you can keep believing whatever
    it is you want to believe.
  • Not Synced
    But if you stay here and take the red pill,
    I will show you a glimpse of how
  • Not Synced
    far down the rabbit hole goes,
    and it will never be the same again.
  • Not Synced
    Let's talk about latency.
  • Not Synced
    And when I say latency, I'm talking about
    latency, response time, any of those things
  • Not Synced
    where you measure time from 'here to here',
    and you're interested in how long it took.
  • Not Synced
    We do this all the time, but I see a lot
    of mish-mash in how people
  • Not Synced
    treat the data, or think about it.
  • Not Synced
    Latency is basically the time it took
    something to happen once.
  • Not Synced
    That one time, how long did it take.
  • Not Synced
    And when we measure stuff, like we did
    a million operations in the last hour,
  • Not Synced
    we have a million latencies. Not one,
    we have a million of them.
  • Not Synced
    Our actual goal is to figure out how to
    describe that million.
  • Not Synced
    How did the million behave?
  • Not Synced
    For example, 'they're all really good, and
    they're all exactly the same', would be a
  • Not Synced
    behavior that you will never see,
    but that would be a great behavior.
  • Not Synced
    So we need to talk about how things behave,
    communicate, think, evaluate,
  • Not Synced
    set requirements for, talk to other people,
    but these are all common things around that.
  • Not Synced
    To do that, we have to describe the
    distribution, the set, the behavior,
  • Not Synced
    but not the one.
  • Not Synced
    For example, the behavior that says "the
    common case was x" is a piece of
  • Not Synced
    information about the behavior,
    but it's a tiny sliver.
  • Not Synced
    Usually the least relevant one.
  • Not Synced
    Well, there's some less relevant ones,
    but not a strongly relevant one,
  • Not Synced
    and one that people often focus on.
  • Not Synced
    To take a look at what we actually do
    with this stuff, almost on a daily basis,
  • Not Synced
    this is a snapshot from a monitoring system.
  • Not Synced
    A small dashboard on a big screen
    in a monitoring system.
  • Not Synced
    Where you're watching the response time of
    a system over time.
  • Not Synced
    This is a two hour window.
  • Not Synced
    These lines that are 95th percentile,
    90, 75, 50, and 25th percentiles,
  • Not Synced
    you can look at how they behave over time.
  • Not Synced
    We're a small audience here, if you look at
    this picture, what draws your eye?
  • Not Synced
    What do you want to go investigate here
    or pay attention to?
  • Not Synced
    It's the big red spike there, right?
  • Not Synced
    So we could look at the red spike,
    'cause it's different,
  • Not Synced
    and say, "Woah, the 95th percentile shot up
    here. And look, the 90th percentile
  • Not Synced
    shot up at about the same time.
  • Not Synced
    The rest of them didn't shoot up,
    so maybe something happened here
  • Not Synced
    that affected that much, I should probably
    pay attention to it
  • Not Synced
    because it's a monitoring system, and
    I like things to be calm."
  • Not Synced
    You could go investigate the why.
  • Not Synced
    At this point, I've managed to waste
    about 90 seconds of your life,
  • Not Synced
    looking at a completely meaningless chart,
    which unfortunately you do
  • Not Synced
    every day, all the time.
  • Not Synced
    This chart is the chart you want to show
    somebody if you want to
  • Not Synced
    hide the truth from them.
  • Not Synced
    If you want to pull the wool
    over their eyes.
  • Not Synced
    This is the chart of the good stuff.
  • Not Synced
    What's not on this chart?
  • Not Synced
    The 5% worst things that happened during
    this two hours.
  • Not Synced
    They're not here.
  • Not Synced
    This is only the good things that happened
    during the things.
  • Not Synced
    And to get this spike, that 5% had to be
    so bad that it even pulled
  • Not Synced
    the 95th percentile all up.
  • Not Synced
    There is zero information here at all about
    what happened bad during this two hours,
  • Not Synced
    which makes it a bad fit for
    a monitoring system.
  • Not Synced
    It's a really good thing for
    a marketing system.
  • Not Synced
    It's a great way to get the bonus from your boss, even though you didn't do the work.
  • Not Synced
    If you want to learn how to do that,
    we can do another talk about that.
  • Not Synced
    But this is not a good way to look at latency.
  • Not Synced
    It's the opposite of good.
  • Not Synced
    Unfortunately, this is one of the most
    common tools used for
  • Not Synced
    server monitoring on earth right now.
  • Not Synced
    That's where the snapshot is from,
    and this is what people look at.
  • Not Synced
    I find this chart to be a goldmine
    of information.
  • Not Synced
    When I first showed it in another talk
    like this, I had this really cool experience.
  • Not Synced
    Somebody came up to me and said, "Hey,
    as I was sitting here, I was texting one
  • Not Synced
    of our guys, and he was saying,
  • Not Synced
    'look, we have this issue with
    our 95th percentile'."
  • Not Synced
    And I got this chart from him!
  • Not Synced
    So I went and said, "Hey, what does the
    rest of the spectrum look like?"
  • Not Synced
    This is the actual chart they got.
  • Not Synced
    And when they look at the rest of the
    spectrum, it looked like that.
  • Not Synced
    That's what was hiding.
  • Not Synced
    Notice the scales are a little different.
  • Not Synced
    That yellow line is that yellow line.
  • Not Synced
    So that's a much more representative number.
  • Not Synced
    Is it? Is that good enough?
  • Not Synced
    That's the 99th percentile.
  • Not Synced
    We still have another 1% of really bad
    stuff that's hiding above the blue line.
  • Not Synced
    I wonder how big that is?
  • Not Synced
    I don't know because he didn't have the data.
  • Not Synced
    So a common problem that we have is that
    we only plot what's convenient.
  • Not Synced
    We only plot what gives us nice,
    colorful graphs.
  • Not Synced
    And often, when we have to choose between
    the stuff that hides the rest of the data,
  • Not Synced
    and the stuff that is noise, we choose
    the noise to display.
  • Not Synced
    I like to rant about latency.
  • Not Synced
    This is from a blog that I don't write
    enough in, but the format for it was simple.
  • Not Synced
    I tweet a single tweet about latency,
    latency tip of the day,
  • Not Synced
    and then I rant about my own tweet.
  • Not Synced
    As an example, this chart is a goldmine
    of information because it has so many
  • Not Synced
    different things that are wrong in it,
    but we won't get into all of them.
  • Not Synced
    You can read it online.
  • Not Synced
    Anyway, this is one to take away from
    what we just said.
  • Not Synced
    If you are not measuring and showing the
    maximum value, what is it you are hiding?
  • Not Synced
    And from whom?
  • Not Synced
    If your job is to hide the truth from
    others, this is a good way to do it.
  • Not Synced
    But if you are actually interested in what's
    going on, the number one indicator
  • Not Synced
    you should never get rid of is the
    maximum value.
  • Not Synced
    That is not noise, that is the signal.
  • Not Synced
    The rest of it is noise.
  • Not Synced
    Okay, let's look at this chart for some
    more cool stuff.
  • Not Synced
    I'm gonna zoom in to a small part
    of the chart, and ask you what that means.
  • Not Synced
    What does the average of the 95th percentile
    over 2 hours mean?
  • Not Synced
    What is the math that does that?
  • Not Synced
    What does it do?
  • Not Synced
    Let's look at that, and I'll give you
    an example with another percentile.
  • Not Synced
    The 100th percentile. The max, right?
  • Not Synced
    Let's take a data set.
  • Not Synced
    Suppose this was the maximum every minute
    for 15 minutes.
  • Not Synced
    What does it mean to say that the average
    max over the last 15 minutes was 42?
  • Not Synced
    I specifically chose the data to
    make that happen.
  • Not Synced
    It's a meaningless statement.
  • Not Synced
    It's a completely meaningless statement.
  • Not Synced
    But when you see 95th percentile,
    average 184, you think that the 95th
  • Not Synced
    percentile for the last two hours
    was around 184.
  • Not Synced
    It makes you think that.
  • Not Synced
    Putting this on a piece of paper is not
    just noise and irrelevant,
  • Not Synced
    it's a way to mislead people.
  • Not Synced
    It's a way to mislead yourself, because
    you'll start to believe your own mistruths.
  • Not Synced
    This is true for any percentile.
  • Not Synced
    There is no percentile that you could do
    this math on.
  • Not Synced
    Another tip, you cannot average percentiles.
  • Not Synced
    That math doesn't happen.
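A minimal sketch of why the averaged number misleads, using made-up data rather than anything from the chart: twelve hypothetical intervals of latencies in milliseconds, one of which contains a burst of stalls. The `percentile` helper and all the sample values are assumptions for illustration only.

```python
import math
import random

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

random.seed(1)
# Hypothetical data: 12 intervals of 1000 samples each, around 10 ms.
intervals = [[random.uniform(5, 15) for _ in range(1000)] for _ in range(12)]
# In one interval, 20% of the requests stall for about 2 seconds.
intervals[7] = ([random.uniform(5, 15) for _ in range(800)]
                + [random.uniform(1900, 2100) for _ in range(200)])

# What a dashboard does when it averages a percentile line.
per_interval_p95 = [percentile(iv, 95) for iv in intervals]
average_of_p95 = sum(per_interval_p95) / len(per_interval_p95)

# What the samples actually say over the whole window.
all_samples = [v for iv in intervals for v in iv]
true_p95 = percentile(all_samples, 95)
true_max = max(all_samples)

print(f"average of per-interval p95: {average_of_p95:.1f} ms")
print(f"p95 over the whole window:   {true_p95:.1f} ms")
print(f"max over the whole window:   {true_max:.1f} ms")
```

The averaged figure matches neither the window's 95th percentile nor the stalls that actually happened; only the raw samples (or the max) tell you about those.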
  • Not Synced
    But percentiles do matter. You really
    want to know about them.
  • Not Synced
    And a common misperception is that we want
    to look at the main part of the spectrum,
  • Not Synced
    not those outliers and perfection stuff.
  • Not Synced
    Only people that actually bet their house
    every day, or the bank on it,
  • Not Synced
    need to know about the "five-nine's",
    and all those.
  • Not Synced
    The 99th percentile is a pretty
    good number.
  • Not Synced
    Is 99% really rare?
  • Not Synced
    Let's look at some stuff, because we can
    ask questions like, "If I were looking
  • Not Synced
    at a webpage, what is the chance of me
    hitting the 99th percentile?"
  • Not Synced
    Of things like this: a search engine node,
    or a key value store,
  • Not Synced
    or a database, or a CDN, right?
  • Not Synced
    Because they will report their 99th percentile.
  • Not Synced
    They won't tell you anything above that,
    but how many of the
  • Not Synced
    webpages that we go to
    actually experience this?
  • Not Synced
    You want to say 1%, right?
  • Not Synced
    Well, I went to some webpages and I counted
    how many "http" requests were generated
  • Not Synced
    by one click into that webpage,
    and here are the numbers.
  • Not Synced
    I did that about a year ago.
  • Not Synced
    They've probably gone up since then.
  • Not Synced
    Now that translates into this math.
  • Not Synced
    This is the likelihood of one click seeing
    the 99th percentile.
  • Not Synced
    And the only page where that is less than
    50% is the clean Google search page.
  • Not Synced
    Where only a quarter will see the
    99th percentile.
  • Not Synced
    The 99th percentile is the thing that most
    of your webpages will see.
  • Not Synced
    Most of them will be there.
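The math behind that slide can be sketched as follows, assuming each request on a page independently has a 1% chance of landing above the 99th percentile. The request counts below are hypothetical round numbers, not the counts from the slide.

```python
def chance_of_seeing_tail(requests_per_page, pct=99.0):
    """Probability that at least one of the page's requests is worse than the given percentile.

    Assumes requests are independent, which is a simplification.
    """
    p_single_ok = pct / 100.0
    return 1.0 - p_single_ok ** requests_per_page

# Hypothetical request counts per page load (not measured values).
for n in (1, 30, 69, 200):
    print(f"{n:4d} requests -> {chance_of_seeing_tail(n):.1%} of page loads see the 99th percentile")
```

Under these assumptions a page only needs about 69 requests before more than half of its loads include at least one response from beyond the 99th percentile.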
  • Not Synced
    Now, we could look at other things.
  • Not Synced
    We can pick which things to focus on.
  • Not Synced
    Let's say I had to pick between the 95th
    percentile, and the three 9's (99.9%).
  • Not Synced
    The three 9's is way into perfection mode
    for most people, or they think.
  • Not Synced
    Which one of those represents our
    community better?
  • Not Synced
    Our population?
  • Not Synced
    Our users?
  • Not Synced
    Our experience?
  • Not Synced
    Let's run a hypothetical.
  • Not Synced
    Suppose we don't have that many pages,
    and that many resources like we said before.
  • Not Synced
    We'll be much more conservative.
  • Not Synced
    A user session will only go through five
    clicks, and each click will only bring up
  • Not Synced
    up to 40 things.
  • Not Synced
    A lot less, and they're all as clean
    as the Google page.
  • Not Synced
    How many of the users will not experience
    something worse than the 95th percentile?
  • Not Synced
    Because that's what the 95th percentile
    is good for, the people who see that.
  • Not Synced
    Anybody above that, is that.
  • Not Synced
    What are the chances of not seeing it?
  • Not Synced
    That's an interesting number.
  • Not Synced
    So you're watching a number that is
    relevant to 0.003% of your users.
  • Not Synced
    99.997% of your users are going to
    see worse than this number.
  • Not Synced
    Why are you looking at it?
  • Not Synced
    Why are you spending time
    thinking about it?
  • Not Synced
    In reverse, we could say how many people
    are going to see something
  • Not Synced
    worse than the three 9's (99.9%)?
  • Not Synced
    That's going to be 18%.
  • Not Synced
    In reverse, 82% of the people will see
    the three 9's (99.9%) or better.
  • Not Synced
    That's a slightly better representation.
  • Not Synced
    Probably not good enough either.
  • Not Synced
    We could look at some more math with them,
    same kind of scenario.
  • Not Synced
    What percentile of http response time
    will be the thing that 95%
  • Not Synced
    of people experience in this scenario?
  • Not Synced
    It's the 99.97 percentile that 95%
    of people see.
  • Not Synced
    And if you want to know what 99%
    of the people see,
  • Not Synced
    that's four and a half 9's (99.995%).
  • Not Synced
    You want to know that number from Akamai
    if you want to predict what 1%
  • Not Synced
    of your users are going to experience.
  • Not Synced
    When you know the 99th percentile,
    you kind of know a tiny bit.
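The numbers in this hypothetical can be reproduced with a few lines, assuming 5 clicks of 40 requests each (200 requests per session) and independent requests. The function names are illustrative only.

```python
CLICKS_PER_SESSION = 5
REQUESTS_PER_CLICK = 40
REQUESTS_PER_SESSION = CLICKS_PER_SESSION * REQUESTS_PER_CLICK  # 200

def share_never_worse_than(pct):
    """Share of sessions in which every request is at or better than the given percentile."""
    return (pct / 100.0) ** REQUESTS_PER_SESSION

def percentile_seen_by(share_of_users):
    """The request percentile that this share of sessions stays at or below for all requests."""
    return 100.0 * share_of_users ** (1.0 / REQUESTS_PER_SESSION)

print(f"never worse than p95:   {share_never_worse_than(95):.4%} of users")
print(f"worse than p99.9:       {1 - share_never_worse_than(99.9):.0%} of users")
print(f"what 95% of users see:  p{percentile_seen_by(0.95):.2f}")
print(f"what 99% of users see:  p{percentile_seen_by(0.99):.3f}")
```

The independence assumption is a simplification; correlated slowdowns would change the exact figures but not the direction of the argument.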
  • Not Synced
    So here's another tip.
  • Not Synced
    And this is not an exaggeration,
    by the way.
  • Not Synced
    The median, which is a much smaller
    percentile, has that minuscule a chance
  • Not Synced
    of ever being the number that
    anybody experiences.
  • Not Synced
    This is the chance of getting worse
    than the median.
  • Not Synced
    Which makes the median an irrelevant
    number to look at.
  • Not Synced
    Unfortunately, it's probably the most
    common one looked at.
  • Not Synced
    When people say "the typical",
    they look at the thing that
  • Not Synced
    everything will be worse than.
  • Not Synced
    Okay, I'm sorry about that part.
  • Not Synced
    We'll do some other parts.
  • Not Synced
    Now, why is it that when we look at these
    monitoring systems, we don't see
  • Not Synced
    data with a lot of 9's?
  • Not Synced
    Why do we stop at the
    90, 95, 99th percentile?
  • Not Synced
    Why don't we look further?
  • Not Synced
    Now, some of it is because people think,
    "Well that's perfection, I don't need it."
  • Not Synced
    The other part is that it's hard.
  • Not Synced
    It's hard because you can't
    average percentiles.
  • Not Synced
    We already talked about that.
  • Not Synced
    But you also can't derive your
    five 9's (99.999%) out of a lot
  • Not Synced
    of 10 second samples of percentiles.
  • Not Synced
    And the reason for that is, "Hey, in 10
    seconds, maybe I only had 1,000 things."
  • Not Synced
    I could take all the 10 seconds in the
    world, there's no way to say what the
  • Not Synced
    hour five 9's (99.999%) were, what the
    minutes five 9's were
  • Not Synced
    if I'm collecting just this data.
  • Not Synced
    And unfortunately, the data being collected
    and reported to the back ends of monitoring
  • Not Synced
    is usually summarized at a second,
    5 seconds, 10 seconds, etc.
  • Not Synced
    Basically throwing away all the good data,
    and leaving you with absolutely no way
  • Not Synced
    to compute large 9's for longer
    periods of time.
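A minimal sketch of the difference between keeping per-interval percentile summaries and keeping per-interval counts that can be merged. This uses a plain `Counter` of exact millisecond values as a stand-in histogram; it is not the HdrHistogram API, and the data is hypothetical.

```python
import math
import random
from collections import Counter

def percentile_from_counts(counts, pct):
    """Nearest-rank percentile computed from a value -> count mapping."""
    total = sum(counts.values())
    target = max(1, math.ceil(pct / 100.0 * total))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= target:
            return value
    raise ValueError("empty histogram")

random.seed(7)
# Hypothetical hour: 360 ten-second intervals with 1000 samples each (milliseconds).
interval_histograms = []
for i in range(360):
    samples = [random.randint(1, 20) for _ in range(1000)]
    if i == 100:
        samples[:5] = [3000] * 5  # five requests stall for 3 seconds
    interval_histograms.append(Counter(samples))

# Summaries: one p99 per interval. The stall is invisible in every one of them.
per_interval_p99 = [percentile_from_counts(h, 99) for h in interval_histograms]

# Counts: merge the histograms, then ask for any percentile over the hour.
hour = Counter()
for h in interval_histograms:
    hour.update(h)  # Counter.update adds counts together
hour_p99999 = percentile_from_counts(hour, 99.999)

print("worst per-interval p99:", max(per_interval_p99), "ms")
print("hour-long p99.999:     ", hour_p99999, "ms")
```

Once the intervals are reduced to a handful of percentile values, the hour-long tail cannot be recovered; with mergeable counts it is a single lookup.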
  • Not Synced
    So, this is where you might want to look
    at HDR Histogram.
  • Not Synced
    It's an open source thing I've created
    a few years ago.
  • Not Synced
    I did it in Java, and now there are
    C, C#, Python, Erlang,
  • Not Synced
    and Go ports of this that I didn't create.
  • Not Synced
    And it lets you actually get an entire
    percentile spectrum.
  • Not Synced
    Some of you here I know are
    already using it.
  • Not Synced
    And you can look at all the percentiles.
  • Not Synced
    Any number of 9's that's in the data, if
    you just keep it right and report it right,
  • Not Synced
    it's got a log format, you can
    store things forever.
  • Not Synced
    Well, for a long time.
  • Not Synced
    Okay, so it lets you have nice things.
  • Not Synced
    Enough for that advertisement.
  • Not Synced
    Now, latency... Well, I think this is
    slightly out of order.
  • Not Synced
    Yeah, sorry.
  • Not Synced
    This is the red/blue pill part, so I warn
    you, this is your last chance.
  • Not Synced
    There's a problem I call the
    coordinated omission problem.
  • Not Synced
    The coordinated omission problem is
    basically a conspiracy.
  • Not Synced
    It's a conspiracy that we're all part of.
  • Not Synced
    I don't think anybody actually meant
    to do it, but once I've noticed it,
  • Not Synced
    everywhere I look, there it is.
  • Not Synced
    Now, I've been using a specific way of
    showing you numbers so far.
  • Not Synced
    Has anybody here noticed how
    I spell percentile?
  • Not Synced
    (Audience Member): "You put lie at the
    end of the percent sign."
  • Not Synced
    Yeah, good.
  • Not Synced
    So coordinated omission problem is the
    "lie" in %lies.
  • Not Synced
    And this is how it works.
  • Not Synced
    One common way to do this is
    to use a load generator.
  • Not Synced
    Pretty much all load generators
    have this problem.
  • Not Synced
    There are two that I know of that don't.
  • Not Synced
    What you do with a load generator,
    is you test.
  • Not Synced
    You issue requests, or send packets.
  • Not Synced
    And you measure how long something took.
  • Not Synced
    And as long as the numbers go right,
    measure them, put them in a bucket,
  • Not Synced
    study them later, and get your
    percentiles from it.
  • Not Synced
    But what if the thing that you are
    measuring took longer than the time
  • Not Synced
    it would've taken until you send
    the next thing?
  • Not Synced
    You're supposed to send something
    every second,
  • Not Synced
    but this one took a second and a half.
  • Not Synced
    Well you've got to wait before
    you send the next one.
  • Not Synced
    You just avoided measuring something
    when the system was problematic.
  • Not Synced
    You've coordinated with it.
  • Not Synced
    You weren't looking at it then.
  • Not Synced
    That's common scenario A: You've backed
    off, and avoided measuring when it was bad.
  • Not Synced
    Another way, is you measure inside your code.
  • Not Synced
    We all do this. We all have to do this,
  • Not Synced
    where we measure time, do something,
    then measure time.
  • Not Synced
    The delta between them is how long it took.
  • Not Synced
    We can then put it in a stats bucket,
    and then do the percentiles in that.
  • Not Synced
    Unfortunately, if the system freezes right
    here, for any reason,
  • Not Synced
    an interrupt, a context switch,
  • Not Synced
    a cache buffer flushed to disk,
  • Not Synced
    a garbage collection,
  • Not Synced
    a re-indexing of your database,
    this is a database.
  • Not Synced
    This is Cassandra by the way,
    measuring itself.
  • Not Synced
    In any of the above, then you will
    have one bad report
  • Not Synced
    while 10,000 things are waiting in line.
  • Not Synced
    And when they come in, they will look
    really, really good.
  • Not Synced
    Even though each one of them has had
    a really bad experience.
  • Not Synced

    It can even get worse, where maybe the
    freeze happened outside the timing,
  • Not Synced
    and you won't even know there was a freeze.
  • Not Synced
    Now these are examples of omitting data
    that is bad on a very selective basis.
  • Not Synced
    It's not random sampling.
  • Not Synced
    It's, "I don't like bad data",
  • Not Synced
    or "I couldn't handle it",
  • Not Synced
    or "I don't know about it",
  • Not Synced
    so we'll just talk about the good.
  • Not Synced
    What does that do to your data?
  • Not Synced
    Because it often makes people feel like,
  • Not Synced
    "Okay, yeah, I understand,
    but it's a little bit of noise."
  • Not Synced
    Let's run some hypotheticals,
    and I'll show you some real numbers.
  • Not Synced
    Imagine a perfect system.
  • Not Synced
    It's doing 100 requests a second,
    at exactly a millisecond each.
  • Not Synced
    But we go and freeze the system,
    after 100 seconds of perfect operation,
  • Not Synced
    for 100 seconds, and then repeat.
  • Not Synced
    Now, I'm going to describe how the system
    behaves in terms that should mean something,
  • Not Synced
    and then we'll measure it.
  • Not Synced
    If we actually wanted to describe the
    system,
  • Not Synced
    on the left we have an average
    of one millisecond by definition,
  • Not Synced
    and on the right we have an
    average of 50 seconds.
  • Not Synced
    Why 50? Because if I randomly came in
    in that 100 seconds,
  • Not Synced
    I'll get anything from 0 to 100
    with even distribution.
  • Not Synced
    The overall average over 200 seconds
    is 25 seconds.
  • Not Synced
    If I just came in here and said,
    "Surprise, how long did this take?"
  • Not Synced
    On average, it will be 25.
  • Not Synced
    I can also do the percentiles.
  • Not Synced
    50th percentile will be really good,
    and then it'll get really bad.
  • Not Synced
    The four 9's is terrible.
  • Not Synced
    This is a fair honest description of
    this system if this is what it did.
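The honest description can be worked out directly, assuming a request arrives at a uniformly random moment in the 200-second cycle and, if it arrives during the freeze, waits for the rest of the freeze before being served. The function and parameter names are illustrative.

```python
def true_latency_at_percentile(pct, run_s=100.0, freeze_s=100.0, service_s=0.001):
    """Latency (seconds) at a given percentile for a request arriving at a random time."""
    frac_running = run_s / (run_s + freeze_s)
    q = pct / 100.0
    if q <= frac_running:
        return service_s  # arrived while the system was healthy
    # Arrived during the freeze: remaining wait is uniform between 0 and freeze_s.
    return (q - frac_running) / (1.0 - frac_running) * freeze_s + service_s

def true_average(run_s=100.0, freeze_s=100.0, service_s=0.001):
    """Average latency (seconds) over the whole cycle."""
    frac_running = run_s / (run_s + freeze_s)
    return frac_running * service_s + (1.0 - frac_running) * (freeze_s / 2.0 + service_s)

print(f"average: {true_average():.3f} s")
for pct in (50, 75, 99, 99.99):
    print(f"p{pct}: {true_latency_at_percentile(pct):.3f} s")
```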
  • Not Synced
    And you can make the system do that.
  • Not Synced
    That's what Ctrl-Z is good for.
  • Not Synced
    You can make any of your systems do that.
  • Not Synced
    Now let's go measure this system with
    a load generator,
  • Not Synced
    or with a monitoring system.
  • Not Synced
    The common ones.
  • Not Synced
    The ones everybody does.
  • Not Synced
    On the left, we're going to get 10,000
    results of one millisecond each.
  • Not Synced
    Great.
  • Not Synced
    And we're going to get one result of
    100 seconds.
  • Not Synced
    Wow, really big response time.
  • Not Synced
    This is our data.
  • Not Synced
    This is OUR data.
  • Not Synced
    So now you go do math with it.
  • Not Synced
    The average of that is 10.9 milliseconds.
  • Not Synced
    A little less than 25 seconds.
  • Not Synced
    And here are the percentiles.
  • Not Synced
    Your load generator or monitoring system
    will tell you that this system is perfect.
  • Not Synced
    You could go to production with it.
  • Not Synced
    You like what you see.
  • Not Synced
    Look at that, four 9's.
  • Not Synced
    It is lying to you.
  • Not Synced
    To your face.
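A sketch of the data set such a measurement ends up with, taken straight from the description above: 10,000 results of 1 millisecond and a single result of 100 seconds. The nearest-rank `percentile` helper is an assumption; real tools may interpolate differently.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# What the coordinating load generator recorded, in milliseconds.
recorded_ms = [1.0] * 10_000 + [100_000.0]

mean_ms = sum(recorded_ms) / len(recorded_ms)
print(f"average: {mean_ms:.1f} ms")
for pct in (50, 75, 99, 99.9, 99.99):
    print(f"p{pct}: {percentile(recorded_ms, pct):.0f} ms")
print(f"max: {max(recorded_ms):.0f} ms")
```

Every percentile up to four 9's reports 1 millisecond, and the average comes out around 11 milliseconds; only the max hints that anything went wrong.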
  • Not Synced
    And you can catch it doing that with a
    Ctrl-Z test.
  • Not Synced
    But people tend to not want to do that,
    because then what are they going to do?
  • Not Synced
    If you just do that test, and calibrate
    your system, and you find it
  • Not Synced
    telling you that, about this, the next
    step should be to throw all the numbers away.
  • Not Synced
    Don't believe anything else it says.
  • Not Synced
    If it lies this big, what else did it do?
  • Not Synced
    Don't waste your time on numbers
    from uncalibrated systems.
  • Not Synced
    Now the problem here was, that if you
    want to measure the system,
  • Not Synced
    you have to measure at random rates,
    or same rates.
  • Not Synced
    If you measure 10,000 things in 100 seconds,
    there should be another 10,000 things here.
  • Not Synced
    If you measure them, you would've gotten
    all the right numbers.
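One way to put the missing measurements back, sketched under the assumption that the load generator was supposed to send a request every 10 milliseconds (100 per second): whenever a recorded latency exceeds that interval, also record the latencies the skipped requests would have seen. This is similar in spirit to the expected-interval correction that HdrHistogram offers, but the function below is a hypothetical stand-in, not that library's API.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile over a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def record_with_expected_interval(samples, value_ms, expected_interval_ms):
    """Record a latency plus the latencies of requests that should have been sent meanwhile."""
    samples.append(value_ms)
    missing = value_ms - expected_interval_ms
    while missing >= expected_interval_ms:
        samples.append(missing)  # a request that was due while we were stuck waiting
        missing -= expected_interval_ms

EXPECTED_INTERVAL_MS = 10  # 100 requests per second
corrected_ms = []
for _ in range(10_000):
    record_with_expected_interval(corrected_ms, 1, EXPECTED_INTERVAL_MS)
record_with_expected_interval(corrected_ms, 100_000, EXPECTED_INTERVAL_MS)

mean_s = sum(corrected_ms) / len(corrected_ms) / 1000.0
print(f"samples: {len(corrected_ms)}")
print(f"average: {mean_s:.1f} s")
for pct in (50, 75, 99.99):
    print(f"p{pct}: {percentile(corrected_ms, pct) / 1000.0:.2f} s")
```

With the back-filled samples the average lands near 25 seconds and the upper percentiles show the freeze, in line with the honest description of the system.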
  • Not Synced
    Coordinated omission is the simple act of
    erasing all that bad stuff.
  • Not Synced
    The conspiracy here is that we all do it
    without meaning to.
  • Not Synced
    I don't know who put that in our systems,
    but it happens to all of us.
  • Not Synced
    Now, I often get people saying,
    "Okay, I get it. All the numbers are wrong,
  • Not Synced
    but at least for my job where I tune
    performance, and I try to make things
  • Not Synced
    faster, I can use the numbers to figure
    out if I'm going in the right direction."
  • Not Synced
    Is it better, or is it worse? Let me
    dispel that for you for a second.
  • Not Synced
    Suppose I went and took this system,
    and improved it dramatically.
  • Not Synced
    Rather than freezing for 100 seconds,
    it will now answer every question.
  • Not Synced
    It'll take a little longer,
    5 milliseconds instead of one,
  • Not Synced
    but it's much better than freezing, right?
  • Not Synced
    So let's measure that system that we spent
    weeks and weeks improving,
  • Not Synced
    and see if it's better.
  • Not Synced
    That's the data.
  • Not Synced
    If we do the percentiles, it'll tell us
    that we just really hurt the four 9's.
  • Not Synced
    We made it go 5 times worse than before.
  • Not Synced
    We should revert this change, go back to
    that much better system we had before.
  • Not Synced
    So this is just to make sure that you
    don't think that you can have
  • Not Synced
    any intuition based on any of these numbers.
  • Not Synced
    They go backwards sometimes.
  • Not Synced
    You don't know which way is good or bad.
  • Not Synced
    And you'll never know which way is good
    or bad with a system that lies like that.
  • Not Synced
    The other cool technique is
    what I call "Cheating Twice".
  • Not Synced
    You have a constant load generator,
    and it needs to do 100 per second.
  • Not Synced
    When it woke up after 200 seconds,
    it says,
  • Not Synced
    "Woah, we're 9,999 behind.
    We've got to issue those requests."
  • Not Synced
    So it issues those requests.
  • Not Synced
    At this point, not only did it get rid of
    all the bad requests,
  • Not Synced
    it replaced every one of them with
    a perfect request.
  • Not Synced
    Coining the four 9's (99.99%), all the way
    to four and a half 9's (99.995%),
  • Not Synced
    it's twice as wrong as dropping them.
  • Not Synced
    So these are all cool things that
    happen to you.
  • Not Synced
    I'm not going to spend much time on how
    to fix those and avoid those.
  • Not Synced
    There's a lot of other material that you
    can find with me
  • Not Synced
    talking about that, in longer talks.
  • Not Synced
    But this is pretty bad.
  • Not Synced
    And like I said...
  • Not Synced
    That should've been up there before.
  • Not Synced
    How did this repeat itself?
  • Not Synced
    Did I create a loop in the
    presentation somehow?
  • Not Synced
    I don't know how to do that.
  • Not Synced
    Let's see if I can get through here.
  • Not Synced
    Hopefully editing later will take it out.
  • Not Synced
    So we have the cheats twice.
  • Not Synced
    There, okay.
  • Not Synced
    So, after we look at coordinated
    omission that way,
  • Not Synced
    we should also look at response time,
    and service time.
  • Not Synced
    Coordinated omission, what it really is
    achieving for you, unfortunately,
  • Not Synced
    is that it makes something that you think
    is response time, and only shows you
  • Not Synced
    the service time component of latency.
  • Not Synced
    This is a simple depiction of what service
    time and response times are.
  • Not Synced
    This guy is taking a certain amount of
    time to take payment
  • Not Synced
    or make a cup of coffee.
  • Not Synced
    That's service time.
  • Not Synced
    How long does it take to do the work?
  • Not Synced
    This person has experienced
    the response time,
  • Not Synced
    which includes the amount of time they
    have to wait before they
  • Not Synced
    get to the person that does the work.
  • Not Synced
    And the difference between those
    two is immense.
  • Not Synced
    The coordinated omission problem makes
    something that you think is
  • Not Synced
    response time, only measure the
    service time,
  • Not Synced
    and basically hide the fact that things
    stalled, waited in line,
  • Not Synced
    that this guy might've taken a lunch break,
  • Not Synced
    and now we have a line around the
    building three times.
  • Not Synced
    Service time stays the same.
  • Not Synced
    This is the backwards part...
  • Not Synced
    Now, let's look at what it
    actually looks like.
  • Not Synced
    In a load generator that I fixed,
    I measured both
  • Not Synced
    response time and service time,
  • Not Synced
    this happens to be Cassandra,
  • Not Synced
    at a very low load.
  • Not Synced
    And you can see that they're very very
    similar, at a very low load.
  • Not Synced
    Why? Because there's nobody in line.
  • Not Synced
    This thing is really fast.
  • Not Synced
    We're not asking for too much.
  • Not Synced
    Cassandra's pretty fast,
    so they're the same.
  • Not Synced
    But if I increase the load, we
    start seeing gaps.
  • Not Synced
    If I increase the load a little more,
    the gap grows.
  • Not Synced
    If I increase the load a little more,
    the gap grows.
  • Not Synced
    Now this is not the failure point yet.
  • Not Synced
    If I actually increase it all the way past
    the point where the system
  • Not Synced
    can't even do the work I want,
    service time stays the same,
  • Not Synced
    response time goes through the roof.
  • Not Synced
    This was when it was 100 and something
    milliseconds, now it's 7 and a half seconds.
  • Not Synced
    Why 7 and a half seconds?
  • Not Synced
    'Cause you're waiting in line that long
    to go around the block.
  • Not Synced
    The guy just can't serve as many people
    as are showing up in line, you fall behind.
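The coffee-line picture can be sketched as a single-server queue. This is a toy model with assumed numbers (exponentially distributed service times averaging 1 millisecond, requests arriving on a fixed schedule), not the Cassandra measurement shown in the talk. Response time is measured from when each request was supposed to start, service time from when the server actually picked it up.

```python
import random

def simulate(arrivals_per_s, n=20_000, mean_service_s=0.001, seed=42):
    """Return (mean service time, mean response time) in seconds for one load level."""
    rng = random.Random(seed)  # same seed, so every load level sees the same service times
    interval = 1.0 / arrivals_per_s
    server_free_at = 0.0
    total_service = 0.0
    total_response = 0.0
    for i in range(n):
        intended_start = i * interval                     # when the request was due
        actual_start = max(intended_start, server_free_at)  # it may have to wait in line
        service = rng.expovariate(1.0 / mean_service_s)
        server_free_at = actual_start + service
        total_service += service
        total_response += server_free_at - intended_start
    return total_service / n, total_response / n

# The server can handle roughly 1000 requests per second in this model.
for load in (200, 800, 1200):
    service, response = simulate(load)
    print(f"{load:5d}/s  service {service * 1000:.2f} ms  response {response * 1000:.2f} ms")
```

Service time is identical at every load level, while response time creeps up as the line forms and then grows without bound once the requested rate exceeds what the server can do.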
  • Not Synced
    This is a virtual world reaction to this.
  • Not Synced
    I really like this slide, it's where I came
    up with the notion of a blue/red pill.
  • Not Synced
    When you actually measure reality, people
    tend to have this reaction when
  • Not Synced
    they compare the two.
  • Not Synced
    And if we actually look at these on the
    two sides of a collapse point of a system,
  • Not Synced
    this specific system can only do 87,000
    things a second.
  • Not Synced
    No matter how hard you press it,
    that's all it'll do.
  • Not Synced
    The service time on the two sides of
    the collapse looks virtually identical,
  • Not Synced
    which it would.
  • Not Synced
    But if you compare the response time,
    you have a very different picture.
  • Not Synced
    And I'm showing this picture so you get
    a feeling for what to look at
  • Not Synced
    on whether or not you're measuring
    the right one.
Video Language:
English
Duration:
42:59