
How Uber Uses your Phone as a Backup Datacenter

  • 0:07 - 0:11
    (audience claps)
    (Presenter) Ok.
  • 0:11 - 0:15
    Alright everybody so, let's dive in.
  • 0:15 - 0:18
    So let's talk about how Uber trips
    even happen
  • 0:18 - 0:20
    before we get into the nitty gritty
    of how
  • 0:20 - 0:23
    we save them from the clutches
    of a datacenter failover.
  • 0:23 - 0:27
    So you might have heard we're all about
    connecting riders and drivers.
  • 0:27 - 0:28
    This is what it looks like.
  • 0:28 - 0:30
    You've probably at least
    seen the rider app.
  • 0:30 - 0:32
    You get it out, you see
    some cars on the map.
  • 0:32 - 0:34
    You pick where you want
    your pickup location to be.
  • 0:34 - 0:36
    At the same time,
    all these guys that
  • 0:36 - 0:39
    you're seeing on the map,
    they have a phone open somewhere.
  • 0:39 - 0:41
    They're logged in waiting for a dispatch.
  • 0:41 - 0:43
    They're all pinging
    into the same datacenter
  • 0:43 - 0:45
    for the city that you both are in.
  • 0:45 - 0:48
    So then what happens is you put
    that pin somewhere,
  • 0:48 - 0:51
    and you get ready to pick up your trip,
    you get ready to request.
  • 0:52 - 0:55
    You hit request and
    that guy's phone starts beeping.
  • 0:55 - 0:58
    Hopefully if everything works out,
    he'll accept that trip.
  • 0:59 - 1:03
    All these things that we're talking about
    here, the request of a trip,
  • 1:03 - 1:06
    the offering it to a driver,
    him accepting it.
  • 1:06 - 1:09
    That's something we call
    a state change transition.
  • 1:09 - 1:12
    From the moment that you start
    requesting the trip,
  • 1:12 - 1:16
    we start creating your trip data
    in the backend datacenter.
  • 1:16 - 1:21
    And that transaction that might live
    for anything like 5, 10, 15, 20, 30,
  • 1:21 - 1:24
    however many minutes it takes you
    to take your trip.
  • 1:24 - 1:27
    We have to consistently handle that trip
    to get it all the way through
  • 1:27 - 1:31
    to completion to get you
    where you're going happily.
  • 1:31 - 1:33
    So every time this state change happens,
  • 1:37 - 1:41
    things happen in the world, so next up
    he goes ahead and shows up to you.
  • 1:42 - 1:45
    He arrives, you get in the car,
    he begins the trip.
  • 1:45 - 1:47
    Everything's going fine.
  • 1:49 - 1:52
    So this is of course...
    some of these state changes
  • 1:52 - 1:55
    are more or less important
    to everything that's going on.
  • 1:55 - 1:58
    The begin trip and the end trip are the
    real important ones, of course.
  • 1:58 - 2:00
    The ones that we don't want
    to lose the most.
  • 2:00 - 2:02
    But all these are really important
    to keep onto.
  • 2:02 - 2:07
    So what happens in the case of
    a failure is your trip is gone.
  • 2:07 - 2:11
    You're both back to,
    "oh my god, where'd my trip go?"
  • 2:11 - 2:15
    Then you're just seeing empty cars again
    and he's back into an open state
  • 2:15 - 2:18
    like where you were when you
    first opened the application
  • 2:18 - 2:18
    in the first place.
  • 2:18 - 2:22
    So this is what used to happen for us
    not too long ago.
  • 2:22 - 2:26
    So how do we fix this,
    how do you fix this in general?
  • 2:26 - 2:28
    So classically you might try
    and say,
  • 2:28 - 2:31
    "Well, let's take all the data in that one
    datacenter and copy it,
  • 2:31 - 2:34
    replicate it to a backup datacenter."
  • 2:34 - 2:37
    This is pretty well understood, classic
    way to solve this problem.
  • 2:37 - 2:40
    You control the active data center
    and the backup data center
  • 2:40 - 2:42
    so it's pretty easy to reason about.
  • 2:42 - 2:44
    People feel comfortable with this scheme.
  • 2:44 - 2:47
    It could work more or less well depending
    on what database you're using.
  • 2:48 - 2:49
    But there's some drawbacks.
  • 2:49 - 2:51
    It gets kinda complicated
    beyond two datacenters.
  • 2:51 - 2:55
    It's always gonna be subject
    to replication lag
  • 2:55 - 2:58
    because the datacenters are separated by
    this thing called the internet,
  • 2:58 - 3:01
    or maybe leased lines
    if you get really into it.
  • 3:02 - 3:07
    So it requires a constant level of high
    bandwidth, especially if you're not using
  • 3:07 - 3:10
    a database well-suited to replication,
    or if you haven't really tuned
  • 3:10 - 3:14
    your data model to get
    the deltas really small.
  • 3:14 - 3:17
    So, we chose not to go with this route.
  • 3:17 - 3:20
    We instead said, "What if we could solve
    it all the way down at the driver phone?"
  • 3:20 - 3:24
    Because since we're already in constant
    communication with these driver phones,
  • 3:24 - 3:27
    what if we could just save the data there
    to the driver phone?
  • 3:27 - 3:31
    Then he could failover to any datacenter,
    rather than having to control,
  • 3:31 - 3:34
    "Well here's the backup datacenter for
    this city, the backup datacenter
  • 3:34 - 3:39
    for this city," and then, "Oh, no no, what
    if in a failover, we fail the wrong phones
  • 3:39 - 3:42
    to the wrong datacenter and now we lose
    all their trips again?"
  • 3:42 - 3:43
    That would not be cool.
  • 3:43 - 3:47
    So we really decided to go with this
    mobile implementation approach
  • 3:47 - 3:50
    of saving the trips to the driver phone.
  • 3:50 - 3:53
    But of course, it doesn't come without a
    trade-off, the trade-off here being
  • 3:53 - 3:57
    you've got to implement some kind of a
    replication protocol in the driver phone
  • 3:57 - 4:01
    consistently between whatever
    platforms you support.
  • 4:01 - 4:03
    In our case, iOS and Android.
  • 4:06 - 4:08
    ...But if we could, how would this work?
  • 4:08 - 4:11
    So all these state transitions
    are happening when the phones
  • 4:11 - 4:13
    communicate with our datacenter.
  • 4:13 - 4:18
    So if in response to his request to begin
    trip or arrive, or accept, or any of this,
  • 4:18 - 4:22
    if we could send some data back down to
    his phone and have him keep ahold of it,
  • 4:22 - 4:27
    then in the case of a datacenter failover,
    when his phone pings into
  • 4:27 - 4:31
    the new datacenter, we could request that
    data right back off of his phone, and get
  • 4:31 - 4:34
    you guys right back on your trip with
    maybe only a minimal blip,
  • 4:34 - 4:36
    in the worst case.
  • 4:37 - 4:41
    So at a high level, that's the idea, but
    there are some challenges of course
  • 4:41 - 4:42
    in implementing that.
  • 4:43 - 4:46
    Not all the trip information that we would
    want to save is something we want
  • 4:47 - 4:50
    the driver to have access to, like to be
    able to get your trip and end it
  • 4:50 - 4:54
    consistently in the other datacenter, we'd
    have to have the full rider information.
  • 4:54 - 4:58
    If you're fare splitting with some friends
    it would need to be all rider information.
  • 4:58 - 5:02
    So, there's a lot of things that we need
    to save here to save your trip
  • 5:02 - 5:04
    that we don't want to expose
    to the driver.
  • 5:04 - 5:07
    Also, you have to pretty much assume that
    the driver phones are more or less
  • 5:07 - 5:11
    untrustworthy, either because people are doing
    nefarious things with them,
  • 5:11 - 5:13
    or people other than the drivers
    have compromised them,
  • 5:13 - 5:16
    or somebody else between you
    and the driver. Who knows?
  • 5:16 - 5:20
    So for most of these reasons, we decided
    we had to go with the crypto approach
  • 5:20 - 5:23
    and encrypt all the data that we store on
    the phones to prevent against tampering
  • 5:23 - 5:25
    and leak of any kind of PII.
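The tamper-protection half of that crypto approach can be sketched in a few lines. This is a stdlib-only sketch, not Uber's code: the real system also encrypts the blob so the driver can't read rider PII, and real key management is elided here.

```python
import hashlib
import hmac
import os

# Hypothetical server-side key: held only in the datacenter, never on the phone.
SERVER_KEY = os.urandom(32)

def seal(blob: bytes) -> bytes:
    # Attach an HMAC tag the phone cannot forge; a real deployment would
    # also encrypt `blob` to hide rider PII from the driver.
    tag = hmac.new(SERVER_KEY, blob, hashlib.sha256).digest()
    return tag + blob

def unseal(sealed: bytes) -> bytes:
    # Recompute the tag and compare in constant time before trusting the data.
    tag, blob = sealed[:32], sealed[32:]
    expected = hmac.new(SERVER_KEY, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("replication data was tampered with")
    return blob
```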
  • 5:27 - 5:32
    So, toward all these security goals,
    and also for simple reliability of
  • 5:32 - 5:36
    interacting with these phones, you want to
    keep the replication protocol as simple
  • 5:36 - 5:40
    as possible to make it easy to reason
    about, easy to debug, remove failure cases
  • 5:40 - 5:43
    and you also want to
    minimize the extra bandwidth.
  • 5:44 - 5:46
    I kinda glossed over the bandwidth impacts
  • 5:46 - 5:48
    when I said backend replication
    isn't really an option.
  • 5:48 - 5:51
    But at least here when you're designing
    this replication protocol,
  • 5:51 - 5:55
    at the application layer you can be
    much more in tune with what data
  • 5:55 - 5:57
    you're serializing, and what
    you're deltafying or not
  • 5:57 - 6:00
    and really mind your bandwidth impact.
  • 6:01 - 6:05
    Especially since it's going over a mobile
    network, this becomes really salient.
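The "deltafying" idea amounts to serializing only the fields that changed since the last state. A rough sketch under assumed names (the field names and JSON-plus-zlib encoding are illustrative, not Uber's wire format):

```python
import json
import zlib

def delta(prev: dict, curr: dict) -> bytes:
    # Send only fields that changed since the last acknowledged state,
    # compressed to mind the mobile-network bandwidth.
    changed = {k: v for k, v in curr.items() if prev.get(k) != v}
    return zlib.compress(json.dumps(changed).encode())

def apply_delta(prev: dict, payload: bytes) -> dict:
    # The phone merges the delta onto its last known state.
    merged = dict(prev)
    merged.update(json.loads(zlib.decompress(payload)))
    return merged
```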
  • 6:06 - 6:08
    So, how do you keep it simple?
  • 6:08 - 6:12
    In our case, we decided to go with a very
    simple key value store with all your
  • 6:12 - 6:17
    typical operations: get, set, delete,
    and list all the keys please,
  • 6:17 - 6:21
    with one caveat being you can only
    set a key once so you can't
  • 6:21 - 6:25
    accidentally overwrite a key;
    it eliminates a whole class
  • 6:25 - 6:29
    of weird programming errors or
    out of order message delivery errors
  • 6:29 - 6:32
    that you might have in such a system.
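The store as described — get, set, delete, list the keys, with set-once semantics — might look like this minimal sketch (hypothetical names, not Uber's actual implementation):

```python
class SetOnceKV:
    """Key-value store where each key can be set at most once."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Set-once: rejecting overwrites eliminates a whole class of
        # out-of-order message delivery and programming errors.
        if key in self._data:
            raise KeyError(f"key {key!r} already set")
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def keys(self):
        return list(self._data)
```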
  • 6:32 - 6:35
    This did, however, force us to move
    what we'll call "versioning"
  • 6:35 - 6:37
    into the keyspace.
  • 6:37 - 6:41
    You can't just say, "Oh, I've got a key
    for this trip and please update it
  • 6:41 - 6:42
    to the new version on each state change."
  • 6:42 - 6:47
    No; instead you have to have a key for
    trip and version, and you have to do
  • 6:47 - 6:51
    a set of the new one, then delete the old
    one, and that at least gives you the nice
  • 6:51 - 6:54
    property that if that fails partway
    through, between the set and the delete,
  • 6:54 - 6:58
    you fail into having two things stored,
    rather than no things stored.
  • 6:59 - 7:02
    So there are some nice properties to
    keeping a nice simple
  • 7:02 - 7:04
    key-value protocol here.
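With the version in the key, a state change becomes "set the new key, then delete the old one," so a partial failure leaves two copies stored rather than none. A sketch with a plain dict standing in for the phone's store (names are illustrative):

```python
def advance_trip_version(store: dict, trip_id: str,
                         old_v: int, new_v: int, blob: bytes):
    key_new = f"{trip_id}:{new_v}"
    assert key_new not in store      # set-once: keys are never overwritten
    store[key_new] = blob            # step 1: write the new version
    # A crash between these two steps leaves BOTH versions stored -- never zero.
    del store[f"{trip_id}:{old_v}"]  # step 2: drop the old version
```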
  • 7:04 - 7:09
    And that makes failover resolution really
    easy because it's simply a matter of,
  • 7:09 - 7:11
    "What keys do you have?
    What trips do you store?
  • 7:11 - 7:13
    What keys do I have in the backend
    datacenter?"
  • 7:13 - 7:18
    Compare those, and come to a resolution
    between those two sets of trips.
  • 7:19 - 7:23
    So that's a quick overview
    of how we built this system.
  • 7:23 - 7:26
    My colleague here, Nikunj Aggarwal
    is going to give you a rundown
  • 7:26 - 7:30
    of some more details of how we
    really got the reliability of this system
  • 7:30 - 7:31
    to work at scale.
  • 7:33 - 7:35
    (audience claps)
  • 7:38 - 7:41
    Alright, hi! I'm Nikunj.
  • 7:41 - 7:45
    So we talked about the idea
    and the motivation behind the idea;
  • 7:45 - 7:49
    now let's dive into how we designed such
    a solution, and what kind of tradeoffs
  • 7:49 - 7:52
    we had to make
    while we were doing the design.
  • 7:52 - 7:57
    So first thing we wanted to ensure was
    that the system we built is non-blocking
  • 7:57 - 8:00
    but still provides eventual consistency.
  • 8:00 - 8:05
    So basically...any backend application
    using this system should be able to make
  • 8:05 - 8:08
    further progress, even when
    the system is down.
  • 8:08 - 8:13
    So the only trade-off the application
    should be making is that it may
  • 8:13 - 8:16
    take some time for the data
    to actually be stored on the phone.
  • 8:16 - 8:19
    However, using this system
    should not affect
  • 8:19 - 8:21
    any normal business operations for them.
  • 8:22 - 8:27
    Secondly, we wanted to have an ability
    to move between datacenters without
  • 8:27 - 8:30
    worrying about data already there.
  • 8:30 - 8:33
    So when we failover from
    one datacenter to another,
  • 8:33 - 8:37
    that datacenter still has state in there,
  • 8:37 - 8:42
    and it still has its view
    of active drivers and trips.
  • 8:42 - 8:46
    And no service in that datacenter is aware
    that a failover actually happened.
  • 8:46 - 8:53
    So at some later time, if we fail back to
    the same datacenter, then its view
  • 8:53 - 8:57
    of the drivers and trips may be actually
    different than what the drivers
  • 8:57 - 9:01
    actually have and if we trusted that
    datacenter, then the drivers may get on a
  • 9:01 - 9:04
    stale trip, which is
    a very bad experience.
  • 9:04 - 9:08
    So we need some way to reconcile that data
    between the drivers and the server.
  • 9:08 - 9:14
    Finally, we want to be able to measure
    the success of the system all the time.
  • 9:14 - 9:20
    So the system is only fully exercised
    during a failure, and a datacenter failure
  • 9:20 - 9:23
    is a pretty rare occurrence, and we don't
    want to be in a situation where
  • 9:23 - 9:27
    we detect issues with the system when we
    need it the most.
  • 9:27 - 9:32
    So what we want is an ability to
    constantly be able to measure the success
  • 9:32 - 9:37
    of the system so that we are confident in
    it when a failure actually happens.
  • 9:37 - 9:42
    So, keeping all these issues in mind,
    this is a very high level
  • 9:42 - 9:44
    view of the system.
  • 9:44 - 9:47
    I'm not going to go into details of any
    of the services,
  • 9:47 - 9:49
    since it's a mobile conference.
  • 9:49 - 9:52
    So the first thing that happens is that
    driver makes an update,
  • 9:52 - 9:56
    or as Josh called it, a state change,
    on his app.
  • 9:56 - 9:59
    For example, he may pick up a passenger.
  • 9:59 - 10:02
    Now that update comes as a request to the
    dispatching service.
  • 10:02 - 10:08
    Now the dispatching service, depending on
    the type of request, it updates the trip
  • 10:08 - 10:13
    model for that trip, and then it sends the
    update to the replication service.
  • 10:13 - 10:18
    Now the replication service will enqueue
    that request in its own datastore
  • 10:18 - 10:23
    and immediately return a successful
    response to the dispatching service,
  • 10:23 - 10:27
    and then finally the dispatching service
    will update its own datastore
  • 10:27 - 10:30
    and then return a success to mobile.
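The non-blocking flow described here — enqueue, acknowledge immediately, replicate in the background — can be sketched like this, with an in-memory queue standing in for the replication service's durable datastore (names are hypothetical):

```python
import queue
import threading

class ReplicationService:
    """Ack the dispatching service immediately; push to phones later."""

    def __init__(self, send_to_phone):
        self._q = queue.Queue()
        self._send = send_to_phone  # e.g. encrypt + hand off to messaging
        threading.Thread(target=self._worker, daemon=True).start()

    def replicate(self, driver_id, blob):
        self._q.put((driver_id, blob))  # durable enqueue in the real system
        return "OK"                     # caller never blocks on the phone

    def _worker(self):
        # Background delivery: the only cost the caller paid was one cheap
        # same-datacenter service call, never an internet round trip.
        while True:
            driver_id, blob = self._q.get()
            self._send(driver_id, blob)
```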
  • 10:30 - 10:34
    It may also return some other data to the
    mobile, for example, things might have
  • 10:34 - 10:37
    changed since the last time
    mobile pinged in, for example...
  • 10:39 - 10:42
    ...If it's an UberPool trip,
    then the driver may have to pick up
  • 10:42 - 10:44
    another passenger.
  • 10:44 - 10:46
    Or if the rider entered some destination,
  • 10:46 - 10:50
    we might have to tell
    the driver about that.
  • 10:50 - 10:54
    And in the background, the replication
    service encrypts that data, obviously,
  • 10:54 - 10:57
    since we don't want drivers
    to have access to all that,
  • 10:57 - 11:00
    and then sends it to a messaging service.
  • 11:00 - 11:04
    So messaging service is something that's
    we built as part of this system.
  • 11:05 - 11:09
    It maintains a bidirectional communication
    channel with all drivers
  • 11:09 - 11:11
    on the Uber platform.
  • 11:11 - 11:14
    And this communication channel
    is actually separate from
  • 11:14 - 11:18
    the original request response channel
    which we've been traditionally using
  • 11:18 - 11:20
    at Uber for drivers to communicate
    with the server.
  • 11:20 - 11:24
    So this way, we are not affecting any
    normal business operation
  • 11:24 - 11:26
    due to this service.
  • 11:26 - 11:30
    So the messaging service then sends the
    message to the phone
  • 11:30 - 11:33
    and gets an acknowledgement back.
  • 11:35 - 11:41
    So from this design, what we have achieved
    is that we've isolated the applications
  • 11:41 - 11:47
    from any replication latencies or failures
    because our replication service
  • 11:47 - 11:53
    returns immediately and the only extra
    thing the application is doing
  • 11:53 - 11:57
    by opting in to this replication strategy
    is making an extra service call
  • 11:57 - 12:02
    to the replication service, which is going
    to be pretty cheap since it's within
  • 12:02 - 12:05
    the same datacenter,
    not traveling through internet.
  • 12:05 - 12:12
    Secondly, now having this separate channel
    gives us the ability to arbitrarily query
  • 12:12 - 12:16
    the states of the phone without affecting
    any normal business operations
  • 12:16 - 12:20
    and we can use that phone as a basic
    key-value store now.
  • 12:24 - 12:29
    Next...Okay so now comes the issue of
    moving between datacenters.
  • 12:29 - 12:33
    As I said earlier, when we failover we are
    actually leaving states behind
  • 12:33 - 12:35
    in that datacenter.
  • 12:35 - 12:37
    So how do we deal with stale states?
  • 12:37 - 12:40
    So the first approach we tried
    was actually to do some manual cleanup.
  • 12:40 - 12:44
    So we wrote some cleanup scripts
    and every time we failed over
  • 12:44 - 12:49
    from our primary datacenter to our backup
    datacenter, somebody would run that script
  • 12:49 - 12:53
    in the primary, and it would go to the
    datastores for the dispatching service
  • 12:53 - 12:55
    and it will clean out
    all the states there.
  • 12:55 - 12:59
    However, this approach had operational
    pain because somebody had to run it.
  • 12:59 - 13:05
    Moreover, we allowed the ability
    to failover per city so you can actually
  • 13:05 - 13:10
    choose to failover specific cities instead
    of the whole world, and in those cases
  • 13:10 - 13:13
    the script started becoming complicated.
  • 13:13 - 13:19
    So then we decided to tweak our design a
    little bit so that we solve this problem.
  • 13:19 - 13:25
    So the first thing we did was...as Josh
    mentioned earlier, the key which is
  • 13:25 - 13:30
    stored on the phone contains the trip
    identifier and the version within it.
  • 13:30 - 13:36
    So the version used to be an incrementing
    number so that we can keep track of
  • 13:36 - 13:37
    any forward progress being made.
  • 13:37 - 13:42
    However, we changed that to a
    modified vector clock.
  • 13:42 - 13:47
    So using that vector clock, we can now
    compare data on the phone
  • 13:47 - 13:49
    and data on the server.
  • 13:49 - 13:53
    And if there is a mismatch, we can detect any
    causality violations using that.
  • 13:53 - 13:58
    And we can also resolve that using a very
    basic conflict resolution strategy.
  • 13:58 - 14:02
    So this way, we handle any issues
    with ongoing trips.
  • 14:02 - 14:06
    Now next came the
    issue of completed trips.
  • 14:06 - 14:12
    So, traditionally what we'd been doing is,
    when a trip is completed, we will delete
  • 14:12 - 14:14
    all the data about the trip
    from the phone.
  • 14:14 - 14:18
    We did that because we didn't want the
    replication data on the phone
  • 14:18 - 14:20
    to grow unbounded.
  • 14:20 - 14:23
    And once a trip is completed,
    it's probably no longer required
  • 14:23 - 14:25
    for restoration.
  • 14:25 - 14:29
    However, that has the side-effect that
    mobile has no idea now that this trip
  • 14:29 - 14:31
    ever happened.
  • 14:31 - 14:36
    So what will happen is if we failback to
    a datacenter with some stale data
  • 14:36 - 14:40
    about this trip, then you might actually
    end up putting the driver right back
  • 14:40 - 14:44
    on that same trip, which is a pretty
    bad experience because he's suddenly
  • 14:44 - 14:47
    now driving somebody whom he already
    dropped off, and he's probably
  • 14:47 - 14:49
    not gonna be paid for that.
  • 14:49 - 14:55
    So what we did to fix that was...
    on trip completion, we would store
  • 14:55 - 15:01
    a special key on the phone, and the
    version in that key has a flag in it.
  • 15:01 - 15:04
    That's why I called it
    a modified vector clock.
  • 15:04 - 15:08
    So it has a flag that says that this trip
    has already been completed,
  • 15:08 - 15:10
    and we store that on the phone.
  • 15:10 - 15:15
    Now when the replication service sees that
    this driver has this flag for the trip,
  • 15:15 - 15:19
    it can tell the dispatching service that
    "Hey, this trip has already been completed
  • 15:19 - 15:21
    and you should probably delete it."
  • 15:21 - 15:25
    So that way, we handle completed trips.
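That completion tombstone might be modeled like this (all names hypothetical): drop the bulky encrypted blobs and keep only a tiny marker whose version carries the completed flag.

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    clock: dict = field(default_factory=dict)
    completed: bool = False  # the extra flag in the "modified" vector clock

def mark_trip_completed(store: dict, trip_id: str, version: Version):
    # Drop the bulky encrypted trip blobs for this trip...
    for key in [k for k in store if k.startswith(trip_id + ":v")]:
        del store[key]
    # ...and keep only a near-zero-size completion marker, so weeks of
    # completed trips fit in the memory one active trip used to take.
    version.completed = True
    store[trip_id + ":completed"] = version
```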
  • 15:25 - 15:31
    So if you think about it, storing trip
    data is kind of expensive because we have
  • 15:31 - 15:37
    this huge encrypted blob, of JSON maybe.
  • 15:37 - 15:43
    But we can store a large number of
    completed trips
  • 15:43 - 15:46
    because there is almost no data
    associated with them.
  • 15:46 - 15:49
    So we can probably store weeks' worth
    of completed trips in the same amount
  • 15:49 - 15:53
    of memory as we would store one trip data.
  • 15:53 - 15:56
    So that's how we solve stale states.
  • 16:00 - 16:04
    So now, next comes the issue of ensuring
    four nines of reliability.
  • 16:04 - 16:10
    So we decided to exercise the system
    more often than a datacenter failure
  • 16:10 - 16:15
    because we wanted to get confident
    that the system actually works.
  • 16:15 - 16:19
    So our first approach was
    to do manual failovers.
  • 16:19 - 16:24
    So basically what happened was that a
    bunch of us would gather in a room
  • 16:24 - 16:28
    every Monday and then pick a few cities
    and fail them over.
  • 16:28 - 16:31
    And after we fail them over to another
    datacenter, we'll see...
  • 16:31 - 16:37
    what was the success rate
    for the restoration, and if
  • 16:37 - 16:41
    there were any failures, then try to look
    at the logs and debug any issues there.
  • 16:41 - 16:44
    However, there were several problems with
    this approach.
  • 16:44 - 16:49
    First, it was very operationally painful.
    So, we had to do this every week.
  • 16:49 - 16:53
    And for a small fraction of trips,
    which did not get restored,
  • 16:53 - 16:56
    we will actually have to do
    fare adjustment
  • 16:56 - 16:58
    for both the rider and the driver.
  • 16:58 - 17:01
    Secondly, it led to a very poor
    customer experience
  • 17:01 - 17:05
    because for that same fraction,
    they were suddenly bumped off their trip
  • 17:05 - 17:08
    and they got totally confused,
    like what happened to them?
  • 17:08 - 17:15
    Thirdly, it had a low coverage because
    we were covering only a few cities.
  • 17:15 - 17:19
    However, in the past we've seen problems
    which affected only a specific city.
  • 17:19 - 17:23
    Maybe because there was a new feature
    enabled in that city
  • 17:23 - 17:25
    which was not global yet.
  • 17:25 - 17:29
    So this approach does not help us
    catch those cases until it's too late.
  • 17:29 - 17:34
    Finally, we had no idea whether the
    backup datacenter can handle the load.
  • 17:34 - 17:39
    So in our current architecture, we have a
    primary datacenter which handles
  • 17:39 - 17:42
    all the requests and then backup
    datacenter which is waiting to handle
  • 17:42 - 17:45
    all those requests in case
    the primary goes down.
  • 17:45 - 17:47
    But how do we know that the
    backup datacenter
  • 17:47 - 17:48
    can handle all those requests?
  • 17:48 - 17:52
    So one way is, maybe you can provision
    the same number of boxes
  • 17:52 - 17:54
    and same type of hardware
    in the backup datacenter.
  • 17:54 - 17:57
    But what if there's a configuration issue
  • 17:57 - 18:00
    in some of the services?
    We would never catch that.
  • 18:00 - 18:05
    And even if they're exactly the same,
    how do you know that each service
  • 18:05 - 18:09
    in the backup datacenter can handle
    a sudden flood of requests which comes
  • 18:09 - 18:11
    when there is a failure?
  • 18:11 - 18:14
    So we needed some way to fix
    all these problems.
  • 18:16 - 18:19
    So then to understand how to get
    good confidence in the system and
  • 18:19 - 18:23
    to measure it well, we looked at
    the key concepts behind the system
  • 18:23 - 18:25
    which we really wanted to work.
  • 18:25 - 18:30
    So first thing was we wanted to ensure
    that all mutations which are done
  • 18:30 - 18:33
    by the dispatching service are actually
    stored on the phone.
  • 18:33 - 18:38
    So for example, a driver, right after he
    picks up a passenger,
  • 18:38 - 18:40
    he may lose connectivity.
  • 18:40 - 18:43
    And so replication data may not
    be sent to the phone immediately
  • 18:43 - 18:48
    but we want to ensure that the data
    eventually makes it to the phone.
  • 18:48 - 18:52
    Secondly, we wanted to make sure
    that the stored data can actually be used
  • 18:52 - 18:54
    for replication.
  • 18:54 - 18:58
    For example, there may be some
    encryption-decryption issue with the data
  • 18:58 - 19:01
    and the data gets corrupted
    and it's no longer needed.
  • 19:01 - 19:04
    So even if you're storing the data,
    you cannot use it.
  • 19:04 - 19:06
    So there's no point.
  • 19:06 - 19:10
    Or, restoration actually involves
    rehydrating the states
  • 19:10 - 19:13
    within the dispatching service
    using the data.
  • 19:13 - 19:16
    So even if the data is fine,
    if there's any problem
  • 19:16 - 19:21
    during that rehydration process,
    some service behaving weirdly,
  • 19:21 - 19:25
    you would still have no use for that data
    and you would still lose the trip,
  • 19:25 - 19:28
    even though the data is perfectly fine.
  • 19:28 - 19:32
    Finally, as I mentioned earlier,
    we needed a way to figure out
  • 19:32 - 19:35
    whether the backup datacenters
    can handle the load.
  • 19:36 - 19:41
    So to monitor the health of the system
    better, we wrote another service.
  • 19:41 - 19:48
    Every hour it will get a list of all
    active drivers and trips
  • 19:48 - 19:51
    from our dispatching service.
  • 19:51 - 19:57
    And for all those drivers, it will use
    that messaging channel to ask for
  • 19:57 - 19:59
    their replication data.
  • 19:59 - 20:02
    And once it has the replication data,
    it will compare that data
  • 20:02 - 20:05
    with the data which the application
    expects.
  • 20:05 - 20:09
    And doing that, we get a lot of good
    metrics around, like...
  • 20:10 - 20:14
    What percentage of drivers have data
    successfully stored on their phones?
  • 20:14 - 20:18
    And you can even break down metrics
    by region or by any app versions.
  • 20:18 - 20:21
    So this really helped us
    drill into the problem.
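The hourly audit loop reduces to comparing what each active driver's phone holds against what the dispatching service expects; a sketch under assumed names:

```python
def audit_replication(active_drivers, fetch_phone_data, expected_data):
    """Hourly audit: ask each phone for its replication data over the
    messaging channel and compare it with what dispatch expects."""
    ok = 0
    failures = []
    for driver in active_drivers:
        if fetch_phone_data(driver) == expected_data[driver]:
            ok += 1
        else:
            failures.append(driver)  # break these down by region/app version
    success_rate = ok / len(active_drivers) if active_drivers else 1.0
    return success_rate, failures
```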
  • 20:25 - 20:29
    Finally, to know whether the stored data
    can be used for replication,
  • 20:29 - 20:32
    and that the backup datacenter
    can handle the load,
  • 20:32 - 20:36
    what we do is we use all the data
    which we got in the previous step
  • 20:36 - 20:39
    and we send that to our backup datacenter.
  • 20:39 - 20:42
    And within the backup datacenter
    we perform what we call
  • 20:42 - 20:44
    a shadow restoration.
  • 20:44 - 20:50
    And since there is nobody else making any
    changes in that backup datacenter,
  • 20:50 - 20:54
    after the restoration is completed,
    we can just query the dispatching service
  • 20:54 - 20:56
    in the backup datacenter.
  • 20:56 - 21:01
    And say, "Hey, how many active riders,
    drivers, and trips do you have?"
  • 21:01 - 21:04
    And we can compare that number
    with the number we got
  • 21:04 - 21:06
    in our snapshot
    from the primary datacenter.
  • 21:06 - 21:11
    And using that, we get really valuable
    information around what's our success rate
  • 21:11 - 21:15
    and we can do similar breakdowns by
    different parameters
  • 21:15 - 21:17
    like region or app version.
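That final comparison is essentially a per-category ratio of the backup datacenter's restored counts to the primary's snapshot counts; a sketch:

```python
def shadow_restore_check(snapshot_counts: dict, backup_counts: dict) -> dict:
    """After restoring the snapshot into the idle backup datacenter,
    compare its counts of riders/drivers/trips with the primary's."""
    return {category: backup_counts.get(category, 0) / snapshot_counts[category]
            for category in snapshot_counts}
```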
  • 21:17 - 21:21
    Finally, we also get metrics around
    how well the backup datacenter did.
  • 21:21 - 21:25
    So did we subject it to a lot of load,
    or can it handle the traffic
  • 21:25 - 21:27
    when there is a real failure?
  • 21:27 - 21:32
    Also, any configuration issue
    in the backup datacenter
  • 21:32 - 21:35
    can be easily caught by this approach.
  • 21:35 - 21:40
    So using this service, we are
    constantly testing the system
  • 21:40 - 21:45
    and making sure we have confidence in it
    and can use it during a failure.
  • 21:45 - 21:48
    Cause if there's no confidence
    in the system, then it's pointless.
  • 21:51 - 21:54
    So yeah, that was the idea
    behind the system
  • 21:54 - 21:57
    and how we implemented it.
  • 21:57 - 22:01
    I did not get a chance to go into
    different [inaudible] of detail,
  • 22:01 - 22:05
    but if you guys have any questions,
    you can always reach out to us
  • 22:05 - 22:07
    during the office hours.
  • 22:07 - 22:09
    So thanks guys for coming
    and listening to us.
  • 22:10 - 22:14
    (audience claps)