(audience claps)

(Presenter) Okay. Alright everybody, so let's dive in.

Let's talk about how Uber trips even happen before we get into the nitty gritty of how we save them from the clutches of a datacenter failover. So you might have heard we're all about connecting riders and drivers. This is what it looks like. You've probably at least seen the rider app: you get it out, you see some cars on the map, and you pick where you want your pickup location to be. At the same time, all these drivers you're seeing on the map have a phone open somewhere. They're logged in, waiting for a dispatch, and they're all pinging into the same datacenter for the city that you're both in.

So then what happens is you put that pin somewhere and you get ready to request your trip. You hit request, and that driver's phone starts beeping. Hopefully, if everything works out, he'll accept that trip.

All these things we're talking about here, the request of a trip, the offering of it to a driver, him accepting it, are what we call state change transitions. From the moment you start requesting the trip, we start creating your trip data in the backend datacenter. And that transaction might live for 5, 10, 15, 20, 30, however many minutes it takes you to take your trip. We have to consistently handle that trip all the way through to completion to get you where you're going happily.

So every time one of these state changes happens, things happen in the world. Next up, he goes ahead and shows up. He arrives, you get in the car, he begins the trip. Everything's going fine.

Now of course, some of these state changes are more or less important to everything that's going on. The begin trip and the end trip are the really important ones, the ones we least want to lose, but all of these are really important to hold onto. So what happens in the case of a failure is that your trip is gone. You're both back to, "Oh my god, where'd my trip go?" You're just seeing empty cars again, and he's back to an open app, like where you both were when you first opened the application. So this is what used to happen for us not too long ago. So how do we fix this? How do you fix this in general?
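The trip lifecycle described above is essentially a small state machine. As a rough illustration only, with assumed names (the talk does not show Uber's actual trip model, and the real flow has more transitions such as cancellations), it might look like this:

```python
from enum import Enum

# Hypothetical state names; not Uber's actual trip model.
class TripState(Enum):
    REQUESTED = "requested"   # rider hits request
    OFFERED = "offered"       # the driver's phone starts beeping
    ACCEPTED = "accepted"     # driver accepts the trip
    ARRIVED = "arrived"       # driver arrives at the pickup point
    ON_TRIP = "on_trip"       # begin trip
    COMPLETED = "completed"   # end trip

# Each arrow here is a "state change transition" the backend records and, in
# the scheme this talk describes, replicates down to the driver phone.
NEXT = {
    TripState.REQUESTED: TripState.OFFERED,
    TripState.OFFERED: TripState.ACCEPTED,
    TripState.ACCEPTED: TripState.ARRIVED,
    TripState.ARRIVED: TripState.ON_TRIP,
    TripState.ON_TRIP: TripState.COMPLETED,
}

def advance(current: TripState) -> TripState:
    if current not in NEXT:
        raise ValueError(f"trip already completed: {current}")
    return NEXT[current]
```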
So classically you might try and say, "Well, let's take all the data in that one datacenter, copy it, and replicate it to a backup datacenter." This is a pretty well understood, classic way to solve this problem. You control the active datacenter and the backup datacenter, so it's pretty easy to reason about, and people feel comfortable with this scheme. It can work more or less well depending on what database you're using. But there are some drawbacks. It gets kind of complicated beyond two datacenters. It's always going to be subject to replication lag, because the datacenters are separated by this thing called the internet, or maybe leased lines if you really get into it. And it requires a constant level of high bandwidth, especially if you're not using a database well suited to replication, or if you haven't really tuned your data model to keep the deltas small.

So we chose not to go with that route. We instead said, "What if we could solve it by saving the data down to the driver phone?" Since we're already in constant communication with these driver phones, what if we could just save the data there, to the driver phone? Then he could fail over to any datacenter, rather than having to control, "Well, here's the backup datacenter for this city, and the backup datacenter for that city," and then, "Oh no, what if in a failover we fail the wrong phones over to the wrong datacenter and now we lose all their trips again?" That would not be cool. So we really decided to go with this mobile implementation approach of saving the trips to the driver phone.

But of course it doesn't come without a trade-off, the trade-off here being that you've got to implement some kind of replication protocol in the driver phone, consistently across whatever platforms you support. In our case, iOS and Android.

...But if it could, how would this work? All these state transitions happen when the phones communicate with our datacenter. So if, in response to his request to begin trip, or arrive, or accept, or any of these, we could send some data back down to his phone and have him keep hold of it, then in the case of a datacenter failover, when his phone pings into the new datacenter, we could request that data right back off of his phone and get you both right back on your trip, with maybe only a minimal blip in the worst case.

So at a high level, that's the idea, but there are of course some challenges in implementing it.
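To make that round trip concrete, here is a minimal sketch of the two halves of the idea. It is not Uber's code; the helper names, the JSON-plus-base64 stand-in for encryption, and the response shape are all assumptions.

```python
import json
from base64 import b64encode

def fake_encrypt(payload: dict) -> str:
    """Stand-in for real encryption; the talk only says the stored blob is encrypted."""
    return b64encode(json.dumps(payload).encode()).decode()

def build_state_change_response(trip_id: str, state: str, trip_data: dict) -> dict:
    """Server side: attach an opaque replication blob to a state-change response."""
    blob = fake_encrypt({"trip_id": trip_id, "state": state, "data": trip_data})
    return {
        "status": "ok",
        # The driver app stores this under the given key and offers it back later.
        "replication": {"key": f"{trip_id}:{state}", "value": blob},
    }

def offer_stored_keys(local_store: dict, server_keys: set) -> dict:
    """Driver app side, after pinging into a (possibly new) datacenter:
    hand back the blobs this datacenter has never seen."""
    return {k: v for k, v in local_store.items() if k not in server_keys}
```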
Not all the trip information that we would want to save is something we want the driver to have access to. To be able to take your trip and end it consistently in the other datacenter, we'd have to have the full rider information, and if you're fare splitting with some friends, it would need to be all of the riders' information. So there's a lot we need to save here to save your trip that we don't want to expose to the driver. Also, you pretty much have to assume that the driver phones are not trustworthy, either because people are doing nefarious things with them, or because people other than the drivers have compromised them, or because there's somebody else between you and the driver. Who knows? So for these reasons, we decided we had to go with the crypto approach and encrypt all the data that we store on the phones, to protect against tampering and against leaking any kind of PII.

Also, toward all these security goals, and toward the simple reliability of interacting with these phones, you want to keep the replication protocol as simple as possible, to make it easy to reason about, easy to debug, to remove failure cases, and you also want to minimize the extra bandwidth. I kind of glossed over the bandwidth impact when I said backend replication isn't really an option. But at least here, when you're designing this replication protocol at the application layer, you can be much more in tune with what data you're serializing, and what you're delta-encoding or not, and really mind your bandwidth impact. Especially since it's going over a mobile network, this becomes really salient.

So, how do you keep it simple? In our case, we decided to go with a very simple key-value store with all your typical operations: get, set, delete, and list all the keys, please. With one caveat: you can only set a key once, so you can't accidentally overwrite a key. That eliminates a whole class of weird programming errors, or out-of-order message delivery errors, that you might have in such a system. This did, however, force us to move what we'll call versioning into the keyspace. You can't just say, "Oh, I've got a key for this trip; please update it to the new version on each state change." No; instead you have to have a key for trip and version, and you have to do a set of the new one and a delete of the old one. That at least gives you the nice property that if it fails partway through, between the set and the delete, you fail into having two things stored rather than nothing stored.
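A minimal sketch of that set-once key-value store and the keyspace versioning, assuming an in-memory stand-in rather than the actual on-phone storage, might look like this:

```python
# Illustrative only; the real store lives on the driver phone.
class SetOnceKVStore:
    def __init__(self):
        self._data = {}

    def set(self, key: str, value: bytes) -> None:
        # A key can only be written once; overwriting raises instead of
        # silently clobbering data, which rules out out-of-order overwrites.
        if key in self._data:
            raise KeyError(f"key already set: {key}")
        self._data[key] = value

    def get(self, key: str) -> bytes:
        return self._data[key]

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def list_keys(self) -> list:
        return list(self._data.keys())


# Versioning therefore lives in the keyspace: write the new version first,
# then delete the old one, so a failure in between leaves two copies, not zero.
def advance_trip_version(store: SetOnceKVStore, trip_id: str,
                         old_version: int, new_version: int, blob: bytes) -> None:
    store.set(f"{trip_id}:{new_version}", blob)
    store.delete(f"{trip_id}:{old_version}")
```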
So there are some nice properties to keeping a nice, simple key-value protocol here. It makes failover resolution really easy, because it's simply a matter of, "What keys do you have? What trips do you have stored? What keys do I have in the backend datacenter?" Compare those, and come to a resolution between those sets of trips.

So that's a quick overview of how we built this system. My colleague here, Nikunj Aggarwal, is going to give you a rundown of some more details of how we really got the reliability of this system to work at scale.

(audience claps)

(Nikunj) Alright, hi! I'm Nikunj. So we talked about the idea and the motivation behind it; now let's dive into how we designed such a solution, and what kind of trade-offs we had to make while doing the design.

The first thing we wanted to ensure was that the system we built is non-blocking but still provides eventual consistency. Basically, any backend application using this system should be able to make further progress even when the system is down. The only trade-off the application should be making is that it may take some time for the data to actually be stored on the phone. However, using this system should not affect any of its normal business operations.

Secondly, we wanted the ability to move between datacenters without worrying about the data already there. When we fail over from one datacenter to another, that datacenter still has state in it; it still has its view of active drivers and trips, and no service in that datacenter is aware that a failure actually happened. So at some later time, if we fail back to that same datacenter, its view of the drivers and trips may actually be different from what the drivers have, and if we trusted that datacenter, the drivers might get put on a stale trip, which is a very bad experience. So we need some way to reconcile that data between the drivers and the server.

Finally, we want to be able to measure the success of the system all the time. The system is only fully exercised during a failure, and a datacenter failure is a pretty rare occurrence, and we don't want to be in a situation where we detect issues with the system when we need it the most. So what we want is the ability to constantly measure the success of the system, so that we are confident in it when a failure actually happens.
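Circling back to the failover resolution Josh described a moment ago, a minimal sketch of that key comparison (the categories and names are assumptions, not the production algorithm) could look like this:

```python
def resolve_failover(phone_keys: set, datacenter_keys: set) -> dict:
    """Compare the trip keys the phone reports against the keys this
    datacenter already has, and decide what to do with each group."""
    return {
        # Trips the phone knows about but this datacenter has never seen:
        # fetch the blobs from the phone and rehydrate them.
        "restore_from_phone": phone_keys - datacenter_keys,
        # Trips only this datacenter has: candidates for stale state
        # (for example, left over from before a failback) that need reconciling.
        "reconcile_or_drop": datacenter_keys - phone_keys,
        # Keys both sides agree on: nothing to do.
        "in_sync": phone_keys & datacenter_keys,
    }

# Example: resolve_failover({"trip42:3"}, {"trip42:2", "trip7:1"})
```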
So, keeping all these issues in mind, this is a very high-level view of the system. I'm not going to go into the details of any of the services, since this is a mobile conference.

The first thing that happens is that the driver makes an update, or as Josh called it, a state change, on his app. For example, he may pick up a passenger. That update comes in as a request to the dispatching service. The dispatching service, depending on the type of request, updates the trip model for that trip and then sends the update to the replication service. The replication service will enqueue that request in its own datastore and immediately return a successful response to the dispatching service, and then finally the dispatching service will update its own datastore and return a success to mobile. It may also return some other data to mobile; for example, things might have changed since the last time mobile pinged in. If it's an UberPool trip, the driver may have to pick up another passenger, or if the rider entered a destination, we might have to tell the driver about that.

And in the background, the replication service encrypts that data, obviously, since we don't want drivers to have access to all of it, and then sends it to a messaging service. The messaging service is something we built as part of this system. It maintains a bidirectional communication channel with all the drivers on the Uber platform, and this communication channel is actually separate from the original request-response channel which we've traditionally been using at Uber for drivers to communicate with the server. This way, we are not affecting any normal business operations with this service. The messaging service then sends the message to the phone and gets an acknowledgement back.

So with this design, what we have achieved is that we've isolated the applications from any replication latencies or failures, because our replication service returns immediately, and the only extra thing the application is doing by opting in to this replication strategy is making an extra service call to the replication service, which is going to be pretty cheap since it's within the same datacenter, not traveling over the internet.
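A simplified sketch of that non-blocking hand-off, with invented class names standing in for the real services, might look like this: the replication call only enqueues and returns, and a background worker does the encryption (faked here with base64) and the push over the messaging channel.

```python
import json
import queue
import threading
from base64 import b64encode

class MessagingService:
    """Stand-in for the bidirectional channel to driver phones."""
    def send(self, driver_id: str, key: str, blob: str) -> bool:
        # A real implementation would push to the phone and wait for an ack.
        print(f"push to {driver_id}: {key}")
        return True

class ReplicationService:
    def __init__(self, messaging: MessagingService):
        self._queue = queue.Queue()
        self._messaging = messaging
        threading.Thread(target=self._drain, daemon=True).start()

    def replicate(self, driver_id: str, key: str, payload: dict) -> dict:
        # Enqueue and acknowledge right away, so the dispatching service
        # never blocks on the phone being reachable.
        self._queue.put((driver_id, key, payload))
        return {"status": "accepted"}

    def _drain(self):
        while True:
            driver_id, key, payload = self._queue.get()
            blob = b64encode(json.dumps(payload).encode()).decode()  # fake "encryption"
            self._messaging.send(driver_id, key, blob)
```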
Secondly, having this separate channel gives us the ability to arbitrarily query the state of the phones without affecting any normal business operations, and we can now use the phone as a basic key-value store.

Next comes the issue of moving between datacenters. As I said earlier, when we fail over we are actually leaving state behind in that datacenter. So how do we deal with stale state? The first approach we tried was to do some manual cleanup. We wrote some cleanup scripts, and every time we failed over from our primary datacenter to our backup datacenter, somebody would run that script in the primary, and it would go to the datastores for the dispatching service and clean out all the state there. However, this approach was operationally painful, because somebody had to run it. Moreover, we allowed the ability to fail over per city, so you can actually choose to fail over specific cities instead of the whole world, and in those cases the script started getting complicated.

So then we decided to tweak our design a little bit to solve this problem. The first thing we did: as Josh mentioned earlier, the key stored on the phone contains the trip identifier and the version within it. The version used to be an incrementing number, so that we could keep track of any forward progress being made. However, we changed that to a modified vector clock. Using that vector clock, we can now compare data on the phone with data on the server, and if there is a mismatch, we can detect any causality violations and resolve them using a very basic conflict resolution strategy. So this way, we handle any issues with ongoing trips.

Next came the issue of completed trips. Traditionally, what we'd been doing is, when a trip is completed, we would delete all the data about the trip from the phone. We did that because we didn't want the replication data on the phone to grow unbounded, and once a trip is completed, it's probably no longer required for restoration. However, that has the side effect that mobile now has no idea that this trip ever happened. So what would happen is, if we failed back to a datacenter with some stale data about this trip, we might actually end up putting the driver back on that same trip, which is a pretty bad experience, because he's suddenly driving somebody he already dropped off, and he's probably not going to be paid for that.
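Before getting to the fix for completed trips, here is a toy illustration of the vector-clock comparison just described for ongoing trips. The clock format and return values are assumptions, not Uber's actual scheme:

```python
def compare_clocks(phone: dict, server: dict) -> str:
    """Each clock maps a writer id (for example a datacenter) to a counter."""
    keys = set(phone) | set(server)
    phone_ahead = any(phone.get(k, 0) > server.get(k, 0) for k in keys)
    server_ahead = any(server.get(k, 0) > phone.get(k, 0) for k in keys)
    if phone_ahead and server_ahead:
        return "conflict"        # causality violation: run conflict resolution
    if phone_ahead:
        return "phone_newer"     # server state is stale; trust the phone
    if server_ahead:
        return "server_newer"
    return "equal"

# Example: the phone saw an update the (stale) datacenter never did.
# compare_clocks({"dc1": 3, "dc2": 1}, {"dc1": 3})  ->  "phone_newer"
```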
So what we did to fix that was, on trip completion, we store a special key on the phone, and the version in that key has a flag in it. That's why I called it a modified vector clock: it has a flag that says this trip has already been completed, and we store that on the phone. Now, when the replication service sees that this driver has that flag for a trip, it can tell the dispatching service, "Hey, this trip has already been completed, and you should probably delete it." So that's how we handle completed trips.

And if you think about it, storing trip data is kind of expensive, because it's this huge encrypted blob of JSON, maybe. But we can store a large number of completed trips, because there is no data associated with them. We can probably store weeks' worth of completed trips in the same amount of memory we would use to store the data for one ongoing trip. So that's how we solve stale state.

Now, next comes the issue of ensuring four nines of reliability. We decided to exercise the system more often than an actual datacenter failure, because we wanted to become confident that the system actually works. Our first approach was to do manual failovers. Basically, a bunch of us would gather in a room every Monday, pick a few cities, and fail them over. After we failed them over to another datacenter, we'd look at the success rate for the restoration, and if there were any failures, we'd try to look at the logs and debug any issues.

However, there were several problems with this approach. First, it was very operationally painful; we had to do this every week, and for the small fraction of trips that did not get restored, we would actually have to do fare adjustments for both the rider and the driver. Secondly, it led to a very poor customer experience, because that same fraction of people were suddenly bumped off their trip and got totally confused about what had happened to them. Thirdly, it had low coverage, because we were covering only a few cities. However, in the past we've seen problems that affected only a specific city, maybe because there was a new feature enabled in that city which was not global yet. So this approach didn't help us catch those cases until it was too late. Finally, we had no idea whether the backup datacenter could handle the load.
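As a rough sketch of the completed-trip tombstone described a moment ago (the key format here is made up; the talk only says the version carries a completed flag):

```python
def completion_key(trip_id: str, clock: dict) -> str:
    """On trip completion the phone keeps only this tiny key, with no payload,
    instead of the full encrypted trip blob."""
    # e.g. "trip42:dc1=3,dc2=1;completed" takes a few bytes, not a whole blob.
    clock_part = ",".join(f"{k}={v}" for k, v in sorted(clock.items()))
    return f"{trip_id}:{clock_part};completed"

def should_delete_server_trip(phone_key: str) -> bool:
    # Replication service side: if the phone reports the completed flag for a
    # trip the (possibly stale) dispatching service still thinks is active,
    # tell dispatch to delete it rather than resurrecting the trip.
    return phone_key.endswith(";completed")
```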
So, in our current architecture, we have a primary datacenter which handles all the requests, and then a backup datacenter which is waiting to handle all of those requests in case the primary goes down. But how do we know that the backup datacenter can handle all those requests? One way is to provision the same number of boxes and the same kind of hardware in the backup datacenter. But what if there's a configuration issue in some of the services? We would never catch that. And even if they're exactly the same, how do you know that each service in the backup datacenter can handle the sudden flood of requests that comes when there is a failure? So we needed some way to fix all of these problems.

To understand how to get good confidence in the system and how to measure it well, we looked at the key things behind the system that we really needed to work. First, we wanted to ensure that all mutations done by the dispatching service are actually stored on the phone. For example, a driver may lose connectivity right after he picks up a passenger, so the replication data may not be sent to the phone immediately, but we want to ensure that the data eventually makes it there. Secondly, we wanted to make sure that the stored data can actually be used for restoration. For example, there may be some encryption or decryption issue, and the data gets corrupted and is no longer usable; then even though you're storing the data, you cannot use it, so there's no point. Also, restoration involves rehydrating the state within the dispatching service using that data, so even if the data is fine, if there's any problem during that rehydration process, some service behaving weirdly, you would still have no use for the data and you would still lose the trip, even though the data itself is perfectly fine. Finally, as I mentioned earlier, we needed a way to figure out whether the backup datacenter can handle the load.

So to monitor the health of the system better, we wrote another service. Every hour, it gets a list of all active drivers and trips from our dispatching service, and for all those drivers, it uses that messaging channel to ask for their replication data. Once it has the replication data, it compares it with the data the application expects.
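A simplified sketch of that hourly audit, where the dispatch and messaging objects stand in for the dispatching service and the messaging channel and their method names are assumptions:

```python
def hourly_audit(dispatch, messaging):
    """Compare what the dispatching service thinks each driver should have
    stored against what the phone actually reports over the messaging channel."""
    snapshot = dispatch.active_drivers_and_trips()        # assumed: {driver_id: [expected keys]}
    results = {"ok": 0, "missing": 0, "failures": []}
    for driver_id, expected_keys in snapshot.items():
        stored_keys = set(messaging.query_keys(driver_id))   # assumed phone query
        missing = set(expected_keys) - stored_keys
        if missing:
            results["missing"] += 1
            results["failures"].append((driver_id, sorted(missing)))
        else:
            results["ok"] += 1
    total = results["ok"] + results["missing"]
    results["success_rate"] = results["ok"] / total if total else 1.0
    # In practice these numbers are also broken down by region and app version.
    return results
```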
And by doing that, we get a lot of good metrics, like what percentage of drivers have their data successfully stored on their phones, and we can even break those metrics down by region or by app version. So this really helped us drill into problems.

Finally, to know whether the stored data can actually be used for restoration, and whether the backup datacenter can handle the load, what we do is take all the data we got in the previous step and send it to our backup datacenter. Within the backup datacenter, we perform what we call a shadow restoration. And since there is nobody else making any changes in that backup datacenter, after the restoration is completed we can just query the dispatching service in the backup datacenter and say, "Hey, how many active riders, drivers, and trips do you have?" We can compare those numbers with the numbers we got in our snapshot from the primary datacenter, and from that we get really valuable information about our success rate, and we can do similar breakdowns by different parameters like region or app version. Finally, we also get metrics about how well the backup datacenter did: did we subject it to a lot of load, and can it handle the traffic when there is a real failure? Also, any configuration issue in the backup datacenter can easily be caught by this approach.

So using this service, we are constantly testing the system, making sure we have confidence in it, and making sure we can use it during a failure. Because if there's no confidence in the system, then it's pointless.

So yeah, that was the idea behind the system and how we implemented it. I didn't get a chance to go into different [inaudible] of detail, but if you guys have any questions, you can always reach out to us during the office hours. So thanks, guys, for coming and listening to us.

(audience claps)
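For reference, an illustrative sketch of the shadow restoration check described above; the service interfaces and count names are invented for the example:

```python
def shadow_restore_and_compare(backup_dispatch, snapshot_blobs, primary_counts):
    """Replay the snapshotted replication data into the backup datacenter, then
    compare its active counts against the snapshot taken from the primary."""
    for driver_id, blobs in snapshot_blobs.items():
        backup_dispatch.restore(driver_id, blobs)       # exercises the real restoration path

    backup_counts = backup_dispatch.count_active()      # assumed: {"riders": n, "drivers": n, "trips": n}
    report = {}
    for kind, expected in primary_counts.items():
        restored = backup_counts.get(kind, 0)
        report[kind] = {
            "expected": expected,
            "restored": restored,
            "success_rate": restored / expected if expected else 1.0,
        }
    # The same run doubles as a load test and a configuration check for the
    # backup datacenter, since it has to absorb the full flood of restorations.
    return report
```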