(audience claps)

(Presenter) Okay. Alright everybody, so let's dive in.

Let's talk about how Uber trips even happen before we get into the nitty gritty of how we save them from the clutches of a datacenter failover. So you might have heard we're all about connecting riders and drivers. This is what it looks like. You've probably at least seen the rider app: you get it out, you see some cars on the map, and you pick where you want your pickup location to be. At the same time, all these drivers you're seeing on the map have a phone open somewhere. They're logged in, waiting for a dispatch, and they're all pinging into the same datacenter for the city that you're both in.

So then what happens is you put that pin somewhere and you get ready to request your trip. You hit request, and that driver's phone starts beeping. Hopefully, if everything works out, he'll accept that trip.

All these things we're talking about here, the request of a trip, the offering of it to a driver, him accepting it, are what we call state change transitions. From the moment you start requesting the trip, we start creating your trip data in the backend datacenter. And that transaction might live for 5, 10, 15, 20, 30, however many minutes it takes you to take your trip. We have to consistently handle that trip all the way through to completion to get you where you're going happily.

So every time one of these state changes happens, things happen in the world. Next up, he goes ahead and shows up. He arrives, you get in the car, he begins the trip. Everything's going fine.

Now of course, some of these state changes are more or less important to everything that's going on. The begin trip and the end trip are the really important ones, the ones we least want to lose, but all of these are really important to hold onto. So what happens in the case of a failure is that your trip is gone. You're both back to, "Oh my god, where'd my trip go?" You're just seeing empty cars again, and he's back to an open app, like where you both were when you first opened the application. So this is what used to happen for us not too long ago. So how do we fix this? How do you fix this in general?
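The trip lifecycle described above is essentially a small state machine. As a rough illustration only, with assumed names (the talk does not show Uber's actual trip model, and the real flow has more transitions such as cancellations), it might look like this:

```python
from enum import Enum

# Hypothetical state names; not Uber's actual trip model.
class TripState(Enum):
    REQUESTED = "requested"   # rider hits request
    OFFERED = "offered"       # the driver's phone starts beeping
    ACCEPTED = "accepted"     # driver accepts the trip
    ARRIVED = "arrived"       # driver arrives at the pickup point
    ON_TRIP = "on_trip"       # begin trip
    COMPLETED = "completed"   # end trip

# Each arrow here is a "state change transition" the backend records and, in
# the scheme this talk describes, replicates down to the driver phone.
NEXT = {
    TripState.REQUESTED: TripState.OFFERED,
    TripState.OFFERED: TripState.ACCEPTED,
    TripState.ACCEPTED: TripState.ARRIVED,
    TripState.ARRIVED: TripState.ON_TRIP,
    TripState.ON_TRIP: TripState.COMPLETED,
}

def advance(current: TripState) -> TripState:
    if current not in NEXT:
        raise ValueError(f"trip already completed: {current}")
    return NEXT[current]
```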
So classically you might try and say, "Well, let's take all the data in that one datacenter, copy it, and replicate it to a backup datacenter." This is a pretty well understood, classic way to solve this problem. You control the active datacenter and the backup datacenter, so it's pretty easy to reason about, and people feel comfortable with this scheme. It can work more or less well depending on what database you're using. But there are some drawbacks. It gets kind of complicated beyond two datacenters. It's always going to be subject to replication lag, because the datacenters are separated by this thing called the internet, or maybe leased lines if you really get into it. And it requires a constant level of high bandwidth, especially if you're not using a database well suited to replication, or if you haven't really tuned your data model to keep the deltas small.

So we chose not to go with that route. We instead said, "What if we could solve it by saving the data down to the driver phone?" Since we're already in constant communication with these driver phones, what if we could just save the data there, to the driver phone? Then he could fail over to any datacenter, rather than having to control, "Well, here's the backup datacenter for this city, and the backup datacenter for that city," and then, "Oh no, what if in a failover we fail the wrong phones over to the wrong datacenter and now we lose all their trips again?" That would not be cool. So we really decided to go with this mobile implementation approach of saving the trips to the driver phone.

But of course it doesn't come without a trade-off, the trade-off here being that you've got to implement some kind of replication protocol in the driver phone, consistently across whatever platforms you support. In our case, iOS and Android.

...But if it could, how would this work? All these state transitions happen when the phones communicate with our datacenter. So if, in response to his request to begin trip, or arrive, or accept, or any of these, we could send some data back down to his phone and have him keep hold of it, then in the case of a datacenter failover, when his phone pings into the new datacenter, we could request that data right back off of his phone and get you both right back on your trip, with maybe only a minimal blip in the worst case.

So at a high level, that's the idea, but there are of course some challenges in implementing it.
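To make that round trip concrete, here is a minimal sketch of the two halves of the idea. It is not Uber's code; the helper names, the JSON-plus-base64 stand-in for encryption, and the response shape are all assumptions.

```python
import json
from base64 import b64encode

def fake_encrypt(payload: dict) -> str:
    """Stand-in for real encryption; the talk only says the stored blob is encrypted."""
    return b64encode(json.dumps(payload).encode()).decode()

def build_state_change_response(trip_id: str, state: str, trip_data: dict) -> dict:
    """Server side: attach an opaque replication blob to a state-change response."""
    blob = fake_encrypt({"trip_id": trip_id, "state": state, "data": trip_data})
    return {
        "status": "ok",
        # The driver app stores this under the given key and offers it back later.
        "replication": {"key": f"{trip_id}:{state}", "value": blob},
    }

def offer_stored_keys(local_store: dict, server_keys: set) -> dict:
    """Driver app side, after pinging into a (possibly new) datacenter:
    hand back the blobs this datacenter has never seen."""
    return {k: v for k, v in local_store.items() if k not in server_keys}
```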
Not all the trip information that we would want to save is something we want the driver to have access to. To be able to take your trip and end it consistently in the other datacenter, we'd have to have the full rider information, and if you're fare splitting with some friends, it would need to be all of the riders' information. So there's a lot we need to save here to save your trip that we don't want to expose to the driver. Also, you pretty much have to assume that the driver phones are not trustworthy, either because people are doing nefarious things with them, or because people other than the drivers have compromised them, or because there's somebody else between you and the driver. Who knows? So for these reasons, we decided we had to go with the crypto approach and encrypt all the data that we store on the phones, to protect against tampering and against leaking any kind of PII.

Also, toward all these security goals, and toward the simple reliability of interacting with these phones, you want to keep the replication protocol as simple as possible, to make it easy to reason about, easy to debug, to remove failure cases, and you also want to minimize the extra bandwidth. I kind of glossed over the bandwidth impact when I said backend replication isn't really an option. But at least here, when you're designing this replication protocol at the application layer, you can be much more in tune with what data you're serializing, and what you're delta-encoding or not, and really mind your bandwidth impact. Especially since it's going over a mobile network, this becomes really salient.

So, how do you keep it simple? In our case, we decided to go with a very simple key-value store with all your typical operations: get, set, delete, and list all the keys, please. With one caveat: you can only set a key once, so you can't accidentally overwrite a key. That eliminates a whole class of weird programming errors, or out-of-order message delivery errors, that you might have in such a system. This did, however, force us to move what we'll call versioning into the keyspace. You can't just say, "Oh, I've got a key for this trip; please update it to the new version on each state change." No; instead you have to have a key for trip and version, and you have to do a set of the new one and a delete of the old one. That at least gives you the nice property that if it fails partway through, between the set and the delete, you fail into having two things stored rather than nothing stored.
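A minimal sketch of that set-once key-value store and the keyspace versioning, assuming an in-memory stand-in rather than the actual on-phone storage, might look like this:

```python
# Illustrative only; the real store lives on the driver phone.
class SetOnceKVStore:
    def __init__(self):
        self._data = {}

    def set(self, key: str, value: bytes) -> None:
        # A key can only be written once; overwriting raises instead of
        # silently clobbering data, which rules out out-of-order overwrites.
        if key in self._data:
            raise KeyError(f"key already set: {key}")
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def list_keys(self) -> list:
        return list(self._data.keys())


# Versioning therefore lives in the keyspace: write the new version first,
# then delete the old one, so a failure in between leaves two copies, not zero.
def advance_trip_version(store: SetOnceKVStore, trip_id: str,
                         old_version: int, new_version: int, blob: bytes) -> None:
    store.set(f"{trip_id}:{new_version}", blob)
    store.delete(f"{trip_id}:{old_version}")
```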
So there are some nice properties to keeping a nice, simple key-value protocol here. It makes failover resolution really easy, because it's simply a matter of, "What keys do you have? What trips do you have stored? What keys do I have in the backend datacenter?" Compare those, and come to a resolution between those sets of trips.

So that's a quick overview of how we built this system. My colleague here, Nikunj Aggarwal, is going to give you a rundown of some more details of how we really got the reliability of this system to work at scale.

(audience claps)

(Nikunj) Alright, hi! I'm Nikunj. So we talked about the idea and the motivation behind it; now let's dive into how we designed such a solution, and what kind of trade-offs we had to make while doing the design.

The first thing we wanted to ensure was that the system we built is non-blocking but still provides eventual consistency. Basically, any backend application using this system should be able to make further progress even when the system is down. The only trade-off the application should be making is that it may take some time for the data to actually be stored on the phone. However, using this system should not affect any of its normal business operations.

Secondly, we wanted the ability to move between datacenters without worrying about the data already there. When we fail over from one datacenter to another, that datacenter still has state in it; it still has its view of active drivers and trips, and no service in that datacenter is aware that a failure actually happened. So at some later time, if we fail back to that same datacenter, its view of the drivers and trips may actually be different from what the drivers have, and if we trusted that datacenter, the drivers might get put on a stale trip, which is a very bad experience. So we need some way to reconcile that data between the drivers and the server.

Finally, we want to be able to measure the success of the system all the time. The system is only fully exercised during a failure, and a datacenter failure is a pretty rare occurrence, and we don't want to be in a situation where we detect issues with the system when we need it the most. So what we want is the ability to constantly measure the success of the system, so that we are confident in it when a failure actually happens.
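Circling back to the failover resolution Josh described a moment ago, a minimal sketch of that key comparison (the categories and names are assumptions, not the production algorithm) could look like this:

```python
def resolve_failover(phone_keys: set, datacenter_keys: set) -> dict:
    """Compare the trip keys the phone reports against the keys this
    datacenter already has, and decide what to do with each group."""
    return {
        # Trips the phone knows about but this datacenter has never seen:
        # fetch the blobs from the phone and rehydrate them.
        "restore_from_phone": phone_keys - datacenter_keys,
        # Trips only this datacenter has: candidates for stale state
        # (for example, left over from before a failback) that need reconciling.
        "reconcile_or_drop": datacenter_keys - phone_keys,
        # Keys both sides agree on: nothing to do.
        "in_sync": phone_keys & datacenter_keys,
    }

# Example: resolve_failover({"trip42:3"}, {"trip42:2", "trip7:1"})
```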
So, keeping all these issues in mind, this is a very high-level view of the system. I'm not going to go into the details of any of the services, since this is a mobile conference.

The first thing that happens is that the driver makes an update, or as Josh called it, a state change, on his app. For example, he may pick up a passenger. That update comes in as a request to the dispatching service. The dispatching service, depending on the type of request, updates the trip model for that trip and then sends the update to the replication service. The replication service will enqueue that request in its own datastore and immediately return a successful response to the dispatching service, and then finally the dispatching service will update its own datastore and return a success to mobile. It may also return some other data to mobile; for example, things might have changed since the last time mobile pinged in. If it's an UberPool trip, the driver may have to pick up another passenger, or if the rider entered a destination, we might have to tell the driver about that.

And in the background, the replication service encrypts that data, obviously, since we don't want drivers to have access to all of it, and then sends it to a messaging service. The messaging service is something we built as part of this system. It maintains a bidirectional communication channel with all the drivers on the Uber platform, and this communication channel is actually separate from the original request-response channel which we've traditionally been using at Uber for drivers to communicate with the server. This way, we are not affecting any normal business operations with this service. The messaging service then sends the message to the phone and gets an acknowledgement back.

So with this design, what we have achieved is that we've isolated the applications from any replication latencies or failures, because our replication service returns immediately, and the only extra thing the application is doing by opting in to this replication strategy is making an extra service call to the replication service, which is going to be pretty cheap since it's within the same datacenter, not traveling over the internet.
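A simplified sketch of that non-blocking hand-off, with invented class names standing in for the real services, might look like this: the replication call only enqueues and returns, and a background worker does the encryption (faked here with base64) and the push over the messaging channel.

```python
import json
import queue
import threading
from base64 import b64encode

class MessagingService:
    """Stand-in for the bidirectional channel to driver phones."""
    def send(self, driver_id: str, key: str, blob: str) -> bool:
        # A real implementation would push to the phone and wait for an ack.
        print(f"push to {driver_id}: {key}")
        return True

class ReplicationService:
    def __init__(self, messaging: MessagingService):
        self._queue = queue.Queue()
        self._messaging = messaging
        threading.Thread(target=self._drain, daemon=True).start()

    def replicate(self, driver_id: str, key: str, payload: dict) -> dict:
        # Enqueue and acknowledge right away, so the dispatching service
        # never blocks on the phone being reachable.
        self._queue.put((driver_id, key, payload))
        return {"status": "accepted"}

    def _drain(self):
        while True:
            driver_id, key, payload = self._queue.get()
            blob = b64encode(json.dumps(payload).encode()).decode()  # fake "encryption"
            self._messaging.send(driver_id, key, blob)
```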
Secondly, having this separate channel gives us the ability to arbitrarily query the state of the phones without affecting any normal business operations, and we can now use the phone as a basic key-value store.

Next comes the issue of moving between datacenters. As I said earlier, when we fail over we are actually leaving state behind in that datacenter. So how do we deal with stale state? The first approach we tried was to do some manual cleanup. We wrote some cleanup scripts, and every time we failed over from our primary datacenter to our backup datacenter, somebody would run that script in the primary, and it would go to the datastores for the dispatching service and clean out all the state there. However, this approach was operationally painful, because somebody had to run it. Moreover, we allowed the ability to fail over per city, so you can actually choose to fail over specific cities instead of the whole world, and in those cases the script started getting complicated.

So then we decided to tweak our design a little bit to solve this problem. The first thing we did: as Josh mentioned earlier, the key stored on the phone contains the trip identifier and the version within it. The version used to be an incrementing number, so that we could keep track of any forward progress being made. However, we changed that to a modified vector clock. Using that vector clock, we can now compare data on the phone with data on the server, and if there is a mismatch, we can detect any causality violations and resolve them using a very basic conflict resolution strategy. So this way, we handle any issues with ongoing trips.

Next came the issue of completed trips. Traditionally, what we'd been doing is, when a trip is completed, we would delete all the data about the trip from the phone. We did that because we didn't want the replication data on the phone to grow unbounded, and once a trip is completed, it's probably no longer required for restoration. However, that has the side effect that mobile now has no idea that this trip ever happened. So what would happen is, if we failed back to a datacenter with some stale data about this trip, we might actually end up putting the driver back on that same trip, which is a pretty bad experience, because he's suddenly driving somebody he already dropped off, and he's probably not going to be paid for that.
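Before getting to the fix for completed trips, here is a toy illustration of the vector-clock comparison just described for ongoing trips. The clock format and return values are assumptions, not Uber's actual scheme:

```python
def compare_clocks(phone: dict, server: dict) -> str:
    """Each clock maps a writer id (for example a datacenter) to a counter."""
    keys = set(phone) | set(server)
    phone_ahead = any(phone.get(k, 0) > server.get(k, 0) for k in keys)
    server_ahead = any(server.get(k, 0) > phone.get(k, 0) for k in keys)
    if phone_ahead and server_ahead:
        return "conflict"        # causality violation: run conflict resolution
    if phone_ahead:
        return "phone_newer"     # server state is stale; trust the phone
    if server_ahead:
        return "server_newer"
    return "equal"

# Example: the phone saw an update the (stale) datacenter never did.
# compare_clocks({"dc1": 3, "dc2": 1}, {"dc1": 3})  ->  "phone_newer"
```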
So what we did to fix that was, on trip completion, we store a special key on the phone, and the version in that key has a flag in it. That's why I called it a modified vector clock: it has a flag that says this trip has already been completed, and we store that on the phone. Now, when the replication service sees that this driver has that flag for a trip, it can tell the dispatching service, "Hey, this trip has already been completed, and you should probably delete it." So that's how we handle completed trips.

And if you think about it, storing trip data is kind of expensive, because it's this huge encrypted blob of JSON, maybe. But we can store a large number of completed trips, because there is no data associated with them. We can probably store weeks' worth of completed trips in the same amount of memory we would use to store the data for one ongoing trip. So that's how we solve stale state.

Now, next comes the issue of ensuring four nines of reliability. We decided to exercise the system more often than an actual datacenter failure, because we wanted to become confident that the system actually works. Our first approach was to do manual failovers. Basically, a bunch of us would gather in a room every Monday, pick a few cities, and fail them over. After we failed them over to another datacenter, we'd look at the success rate for the restoration, and if there were any failures, we'd try to look at the logs and debug any issues.

However, there were several problems with this approach. First, it was very operationally painful; we had to do this every week, and for the small fraction of trips that did not get restored, we would actually have to do fare adjustments for both the rider and the driver. Secondly, it led to a very poor customer experience, because that same fraction of people were suddenly bumped off their trip and got totally confused about what had happened to them. Thirdly, it had low coverage, because we were covering only a few cities. However, in the past we've seen problems that affected only a specific city, maybe because there was a new feature enabled in that city which was not global yet. So this approach didn't help us catch those cases until it was too late. Finally, we had no idea whether the backup datacenter could handle the load.
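As a rough sketch of the completed-trip tombstone described a moment ago (the key format here is made up; the talk only says the version carries a completed flag):

```python
def completion_key(trip_id: str, clock: dict) -> str:
    """On trip completion the phone keeps only this tiny key, with no payload,
    instead of the full encrypted trip blob."""
    # e.g. "trip42:dc1=3,dc2=1;completed" takes a few bytes, not a whole blob.
    clock_part = ",".join(f"{k}={v}" for k, v in sorted(clock.items()))
    return f"{trip_id}:{clock_part};completed"

def should_delete_server_trip(phone_key: str) -> bool:
    # Replication service side: if the phone reports the completed flag for a
    # trip the (possibly stale) dispatching service still thinks is active,
    # tell dispatch to delete it rather than resurrecting the trip.
    return phone_key.endswith(";completed")
```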
So, in our current architecture, we have a primary datacenter which handles all the requests, and then a backup datacenter which is waiting to handle all of those requests in case the primary goes down. But how do we know that the backup datacenter can handle all those requests? One way is to provision the same number of boxes and the same kind of hardware in the backup datacenter. But what if there's a configuration issue in some of the services? We would never catch that. And even if they're exactly the same, how do you know that each service in the backup datacenter can handle the sudden flood of requests that comes when there is a failure? So we needed some way to fix all of these problems.

To understand how to get good confidence in the system and how to measure it well, we looked at the key things behind the system that we really needed to work. First, we wanted to ensure that all mutations done by the dispatching service are actually stored on the phone. For example, a driver may lose connectivity right after he picks up a passenger, so the replication data may not be sent to the phone immediately, but we want to ensure that the data eventually makes it there. Secondly, we wanted to make sure that the stored data can actually be used for restoration. For example, there may be some encryption or decryption issue, and the data gets corrupted and is no longer usable; then even though you're storing the data, you cannot use it, so there's no point. Also, restoration involves rehydrating the state within the dispatching service using that data, so even if the data is fine, if there's any problem during that rehydration process, some service behaving weirdly, you would still have no use for the data and you would still lose the trip, even though the data itself is perfectly fine. Finally, as I mentioned earlier, we needed a way to figure out whether the backup datacenter can handle the load.

So to monitor the health of the system better, we wrote another service. Every hour, it gets a list of all active drivers and trips from our dispatching service, and for all those drivers, it uses that messaging channel to ask for their replication data. Once it has the replication data, it compares it with the data the application expects.
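A simplified sketch of that hourly audit, where the dispatch and messaging objects stand in for the dispatching service and the messaging channel and their method names are assumptions:

```python
def hourly_audit(dispatch, messaging):
    """Compare what the dispatching service thinks each driver should have
    stored against what the phone actually reports over the messaging channel."""
    snapshot = dispatch.active_drivers_and_trips()        # assumed: {driver_id: [expected keys]}
    results = {"ok": 0, "missing": 0, "failures": []}
    for driver_id, expected_keys in snapshot.items():
        stored_keys = set(messaging.query_keys(driver_id))   # assumed phone query
        missing = set(expected_keys) - stored_keys
        if missing:
            results["missing"] += 1
            results["failures"].append((driver_id, sorted(missing)))
        else:
            results["ok"] += 1
    total = results["ok"] + results["missing"]
    results["success_rate"] = results["ok"] / total if total else 1.0
    # In practice these numbers are also broken down by region and app version.
    return results
```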
And by doing that, we get a lot of good metrics, like what percentage of drivers have their data successfully stored on their phones, and we can even break those metrics down by region or by app version. So this really helped us drill into problems.

Finally, to know whether the stored data can actually be used for restoration, and whether the backup datacenter can handle the load, what we do is take all the data we got in the previous step and send it to our backup datacenter. Within the backup datacenter, we perform what we call a shadow restoration. And since there is nobody else making any changes in that backup datacenter, after the restoration is completed we can just query the dispatching service in the backup datacenter and say, "Hey, how many active riders, drivers, and trips do you have?" We can compare those numbers with the numbers we got in our snapshot from the primary datacenter, and from that we get really valuable information about our success rate, and we can do similar breakdowns by different parameters like region or app version. Finally, we also get metrics about how well the backup datacenter did: did we subject it to a lot of load, and can it handle the traffic when there is a real failure? Also, any configuration issue in the backup datacenter can easily be caught by this approach.

So using this service, we are constantly testing the system, making sure we have confidence in it, and making sure we can use it during a failure. Because if there's no confidence in the system, then it's pointless.

So yeah, that was the idea behind the system and how we implemented it. I didn't get a chance to go into different [inaudible] of detail, but if you guys have any questions, you can always reach out to us during the office hours. So thanks, guys, for coming and listening to us.

(audience claps)
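For reference, an illustrative sketch of the shadow restoration check described above; the service interfaces and count names are invented for the example:

```python
def shadow_restore_and_compare(backup_dispatch, snapshot_blobs, primary_counts):
    """Replay the snapshotted replication data into the backup datacenter, then
    compare its active counts against the snapshot taken from the primary."""
    for driver_id, blobs in snapshot_blobs.items():
        backup_dispatch.restore(driver_id, blobs)       # exercises the real restoration path

    backup_counts = backup_dispatch.count_active()      # assumed: {"riders": n, "drivers": n, "trips": n}
    report = {}
    for kind, expected in primary_counts.items():
        restored = backup_counts.get(kind, 0)
        report[kind] = {
            "expected": expected,
            "restored": restored,
            "success_rate": restored / expected if expected else 1.0,
        }
    # The same run doubles as a load test and a configuration check for the
    # backup datacenter, since it has to absorb the full flood of restorations.
    return report
```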