(audience claps)
(Presenter) Ok.
Alright everybody so, let's dive in.
So let's talk about how Uber trips
even happen
before we get into the nitty gritty
of how
we save them from the clutches
of a datacenter failover.
So you might have heard we're all about
connecting riders and drivers.
This is what it looks like.
You've probably at least
seen the rider app.
You get it out, you see
some cars on the map.
You pick where you want
your pickup location to be.
At the same time,
all these guys that
you're seeing on the map,
they have a phone open somewhere.
They're logged in waiting for a dispatch.
They're all pinging
into the same datacenter
for the city that you both are in.
So then what happens is you put
that pin somewhere,
and you get ready to pick up your trip,
you get ready to request.
You hit request and
that guy's phone starts beeping.
Hopefully if everything works out,
he'll accept that trip.
All these things that we're talking about
here, the request of a trip,
the offering it to a driver,
him accepting it.
That's something we call
a state change.
From the moment that you start
requesting the trip,
we start creating your trip data
in the backend datacenter.
And that transaction might live
for anything like 5, 10, 15, 20, 30,
however many minutes it takes you
to complete your trip.
We have to consistently handle that trip
to get it all the way through
to completion to get you
where you're going happily.
So every time this state change happens,
things happen in the world, so next up
he goes ahead and shows up to you.
He arrives, you get in the car,
he begins the trip.
Everything's going fine.
So this is of course...
some of these state changes
are more or less important
to everything that's going on.
The begin trip and the end trip are the
real important ones, of course.
The ones that we don't want
to lose the most.
But all of these are really important
to hold onto.
So what happens in the case of
a failure is your trip is gone.
You're both back to,
"oh my god, where'd my trip go?"
You're just seeing empty cars again,
and he's back to an open, waiting state,
like where you both were when you
first opened the application.
So this is what used to happen for us
not too long ago.
So how do we fix this,
how do you fix this in general?
So classically you might try
and say,
"Well, let's take all the data in that one
datacenter and copy it,
replicate it to a backup datacenter."
This is pretty well understood, classic
way to solve this problem.
You control the active data center
and the backup data center
so it's pretty easy to reason about.
People feel comfortable with this scheme.
It could work more or less well depending
on what database you're using.
But there's some drawbacks.
It gets kinda complicated
beyond two datacenters.
It's always gonna be subject
to replication lag
because the datacenters are separated by
this thing called the internet,
or maybe leased lines
if you get really into it.
So it requires a constant level of high
bandwidth, especially if you're not using
a database well-suited to replication,
or if you haven't really tuned
your data model to keep
the deltas small.
So, we chose not to go with this route.
We instead said, "What if we could solve
it by saving it down to the driver phone?"
Because since we're already in constant
communication with these driver phones,
what if we could just save the data there
to the driver phone?
Then he could failover to any datacenter,
rather than having to control,
"Well here's the backup datacenter for
this city, the backup datacenter
for this city," and then, "Oh, no no, what
if in a failover, we fail the wrong phones
to the wrong datacenter and now we lose
all their trips again?"
That would not be cool.
So we really decided to go with this
mobile implementation approach
of saving the trips to the driver phone.
But of course, it doesn't come without a
trade-off, the trade-off here being
you've got to implement some kind of a
replication protocol in the driver phone
consistently between whatever
platforms you support.
In our case, iOS and Android.
...But if we could, how would this work?
So all these state transitions
are happening when the phones
communicate with our datacenter.
So in response to his request to begin
trip, or arrive, or accept, or any of these,
if we could send some data back down to
his phone and have him keep ahold of it,
then in the case of a datacenter failover,
when his phone pings into
the new datacenter, we could request that
data right back off of his phone, and get
you guys right back on your trip with
maybe only a minimal blip,
in the worst case.
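To make that exchange concrete, here is a minimal sketch, in Go, of what the two directions of such a protocol might carry. The type and field names are hypothetical, not Uber's actual messages.

```go
// A minimal sketch of the idea (hypothetical names, not Uber's actual
// protocol): replication data rides along on the normal state-change
// response, and a newly-active datacenter can ask the phone for
// everything it has been holding onto.
package protocol

// StateChangeResponse is what the datacenter sends back for a request
// like accept, arrive, begin trip, or end trip.
type StateChangeResponse struct {
	Ack             string            // the normal business response
	ReplicationData map[string][]byte // opaque encrypted blobs, keyed by trip and version
}

// RecoveryRequest is what a datacenter that has never seen this driver
// sends when his phone pings in after a failover.
type RecoveryRequest struct {
	DriverID string
}

// RecoveryResponse returns whatever the phone has stored, so the trip
// can be rehydrated in the new datacenter.
type RecoveryResponse struct {
	StoredKeys map[string][]byte
}
```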
So at a high level, that's the idea, but
there are some challenges of course
in implementing that.
Not all the trip information that we would
want to save is something we want
the driver to have access to. For example,
to be able to recover your trip and end it
consistently in the other datacenter, we'd
have to have the full rider information.
If you're fare splitting with some friends,
it would need to be all the riders' information.
So, there's a lot of things that we need
to save here to save your trip
that we don't want to expose
to the driver.
Also, you pretty much have to assume that
the driver phones are not really
trustable, either because people are doing
nefarious things with them,
or because people other than the drivers
have compromised them,
or somebody else between you
and the driver. Who knows?
So for most of these reasons, we decided
we had to go with the crypto approach
and encrypt all the data that we store on
the phones to prevent against tampering
and leak of any kind of PII.
And beyond all these security
concerns, for the simple reliability of
interacting with these phones, you want to
keep the replication protocol as simple
as possible, to make it easy to reason
about, easy to debug, to remove failure cases,
and you also want to
minimize the extra bandwidth.
I kinda glossed over the bandwidth impacts
when I said backend replication
isn't really an option.
But at least here when you're designing
this replication protocol,
at the application layer you can be
much more in tune with what data
you're serializing, and what
you're deltafying or not
and really mind your bandwidth impact.
Especially since it's going over a mobile
network, this becomes really salient.
So, how do you keep it simple?
In our case, we decided to go with a very
simple key value store with all your
typical operations: get, set, delete,
and list all the keys please,
with one caveat being you can only
set a key once so you can't
accidentally overwrite a key;
it eliminates a whole class
of weird programming errors or
out of order message delivery errors
that you might have in such a system.
This did, however, force us to move
what we'll call "versioning"
into the keyspace.
You can't just say, "Oh, I've got a key
for this trip and please update it
to the new version on each state change."
No; instead you have to have a key for
trip and version, and you have to do
a set of the new one, then delete the old one,
and that at least gives you the nice
property that if that fails partway
through, between the set and the delete,
you fail into having two things stored,
rather than no things stored.
So there are some nice properties to
keeping a nice simple
key-value protocol here.
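To make this concrete, here is a minimal sketch of what such a set-once key-value store and the set-then-delete update pattern might look like; the Go names are hypothetical, and this is not the actual Uber implementation.

```go
// A sketch of the phone-side store described here: get, set, delete,
// list, with the twist that a key can only ever be set once.
package replication

import "errors"

var ErrKeyExists = errors.New("key already set")

type PhoneStore struct {
	data map[string][]byte
}

func NewPhoneStore() *PhoneStore {
	return &PhoneStore{data: make(map[string][]byte)}
}

// Set stores a value once; overwriting an existing key is an error,
// which rules out a whole class of out-of-order delivery bugs.
func (s *PhoneStore) Set(key string, value []byte) error {
	if _, ok := s.data[key]; ok {
		return ErrKeyExists
	}
	s.data[key] = value
	return nil
}

func (s *PhoneStore) Get(key string) ([]byte, bool) {
	v, ok := s.data[key]
	return v, ok
}

func (s *PhoneStore) Delete(key string) {
	delete(s.data, key)
}

func (s *PhoneStore) ListKeys() []string {
	keys := make([]string, 0, len(s.data))
	for k := range s.data {
		keys = append(keys, k)
	}
	return keys
}

// UpdateTrip moves versioning into the keyspace: set the new
// trip:version key first, then delete the old one. A failure in
// between leaves two copies stored rather than none.
func UpdateTrip(s *PhoneStore, tripID, oldVersion, newVersion string, blob []byte) error {
	if err := s.Set(tripID+":"+newVersion, blob); err != nil {
		return err
	}
	s.Delete(tripID + ":" + oldVersion)
	return nil
}
```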
And that makes failover resolution really
easy because it's simply a matter of,
"What keys do you have?
What trips do you store?
What keys do I have in the backend
datacenter?"
Compare those, and come to a resolution
between those sets of trips.
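A rough sketch of that comparison step, again with hypothetical names: list the keys reported by the phone and the keys known to the newly-active datacenter, and work out which trips need restoring and which datacenter state is stale.

```go
// A sketch of failover resolution: compare what the phone stored with
// what this datacenter knows about, then reconcile the two sets.
package replication

func ResolveFailover(phoneKeys, datacenterKeys []string) (missing, stale []string) {
	onPhone := make(map[string]bool)
	for _, k := range phoneKeys {
		onPhone[k] = true
	}
	inDC := make(map[string]bool)
	for _, k := range datacenterKeys {
		inDC[k] = true
	}
	for k := range onPhone {
		if !inDC[k] {
			missing = append(missing, k) // restore from the phone's encrypted blob
		}
	}
	for k := range inDC {
		if !onPhone[k] {
			stale = append(stale, k) // datacenter state the phone no longer claims
		}
	}
	return missing, stale
}
```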
So that's a quick overview
of how we built this system.
My colleague here, Nikunj Aggarwal
is going to give you a rundown
of some more details of how we
really got the reliability of this system
to work at scale.
(audience claps)
Alright, hi! I'm Nikunj.
So we talked about the idea
and the motivation behind the idea;
now let's dive into how we designed such
a solution, and what kind of tradeoffs
we had to make
while we were doing the design.
So the first thing we wanted to ensure was
that the system we built is non-blocking
but still provides eventual consistency.
So basically...any backend application
using this system should be able to make
further progress, even when
the system is down.
So the only trade-off the application
should be making is that it may
take some time for the data
to actually be stored on the phone.
However, using this system
should not affect
any normal business operations for them.
Secondly, we wanted to have an ability
to move between datacenters without
worrying about data already there.
So when we fail over from
one datacenter to another,
the old datacenter still has state in it,
and it still has its view
of active drivers and trips.
And no service in that datacenter is aware
that a failure actually happened.
So at some later time, if we fail back to
the same datacenter, then its view
of the drivers and trips may actually be
different from what the drivers
actually have, and if we trusted that
datacenter, then the drivers may get put
on a stale trip, which is
a very bad experience.
So we need some way to reconcile that data
between the drivers and the server.
Finally, we want to be able to measure
the success of the system all the time.
So the system is only fully exercised
during a failure, and a datacenter failure
is a pretty rare occurrence, and we don't
want to be in a situation where
we detect issues with the system when we
need it the most.
So what we want is an ability to
constantly be able to measure the success
of the system so that we are confident in
it when a failure actually happens.
So, keeping all these issues in mind,
this is a very high level
view of the system.
I'm not going to go into details of any
of the services,
since it's a mobile conference.
So the first thing that happens is that
the driver makes an update,
or as Josh called it, a state change,
on his app.
For example, he may pick up a passenger.
Now that update comes as a request to the
dispatching service.
Now the dispatching service, depending on
the type of request, updates the trip
model for that trip, and then it sends the
update to the replication service.
Now the replication service will enqueue
that request in its own datastore
and immediately return a successful
response to the dispatching service,
and then finally the dispatching service
will update its own datastore
and then return a success to mobile.
It may also include some other updates for
the mobile; for example, things might have
changed since the last time
the mobile pinged in. For example...
...If it's an UberPool trip,
then the driver may have to pick up
another passenger.
Or if the rider entered some destination,
we might have to tell
the driver about that.
And in the background, the replication
service encrypts that data, obviously,
since we don't want drivers
to have access to all that,
and then sends it to a messaging service.
So the messaging service is something that
we built as part of this system.
It maintains a bidirectional communication
channel with all drivers
on the Uber platform.
And this communication channel
is actually separate from
the original request response channel
which we've been traditionally using
at Uber for drivers to communicate
with the server.
So this way, we are not affecting any
normal business operation
due to this service.
So the messaging service then sends the
message to the phone
and gets an acknowledgement back.
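A rough sketch of that non-blocking flow, under the assumption of hypothetical service and function names: the replication service enqueues the update and returns success immediately, and a background worker encrypts the payload and keeps pushing it over the messaging channel until the phone acknowledges.

```go
// A sketch of the non-blocking replication path described above
// (hypothetical names, not Uber's actual services).
package replication

import "time"

type Update struct {
	DriverID string
	Key      string // e.g. "tripID:version"
	Payload  []byte
}

type Service struct {
	queue   chan Update
	encrypt func([]byte) []byte                            // drivers never see plaintext trip data
	send    func(driverID, key string, blob []byte) error  // the separate messaging channel
}

// Enqueue is the only extra call dispatch makes when opting in. It stays
// within the datacenter and never blocks on the phone being reachable.
func (s *Service) Enqueue(u Update) error {
	s.queue <- u // a real system would persist this durably
	return nil   // immediate success back to the dispatching service
}

// worker drains the queue in the background, retrying until the phone
// acknowledges, which is where the eventual consistency comes from.
func (s *Service) worker() {
	for u := range s.queue {
		blob := s.encrypt(u.Payload)
		for s.send(u.DriverID, u.Key, blob) != nil {
			time.Sleep(time.Second) // back off and retry on failure
		}
	}
}
```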
So from this design, what we have achieved
is that we've isolated the applications
from any replication latencies or failures
because our replication service
returns immediately and the only extra
thing the application is doing
by opting in to this replication strategy
is making an extra service call
to the replication service, which is going
to be pretty cheap since it's within
the same datacenter,
not traveling over the internet.
Secondly, now having this separate channel
gives us the ability to arbitrarily query
the states of the phone without affecting
any normal business operations
and we can use that phone as a basic
key-value store now.
Next...Okay so now comes the issue of
moving between datacenters.
As I said earlier, when we failover we are
actually leaving states behind
in that datacenter.
So how do we deal with stale states?
So the first approach we tried
was actually to do some manual cleanup.
So we wrote some cleanup scripts,
and every time we failed over
from our primary datacenter to our backup
datacenter, somebody would run that script
in the primary, and it would go to the
datastores for the dispatching service
and clean out
all the states there.
However, this approach had operational
pain because somebody had to run it.
Moreover, we allowed the ability
to failover per city so you can actually
choose to failover specific cities instead
of the whole world, and in those cases
the script started becoming complicated.
So then we decided to tweak our design a
little bit so that we solve this problem.
So the first thing we did was...as Josh
mentioned earlier, the key which is
stored on the phone contains the trip
identifier and the version within it.
So the version used to be an incrementing
number so that we could keep track of
any forward progress being made.
However, we changed that to a
modified vector clock.
So using that vector clock, we can now
compare data on the phone
and data on the server.
And if there is a mismatch, we can detect
any causality violations using that.
And we can also resolve that using a very
basic conflict resolution strategy.
So this way, we handle any issues
with ongoing trips.
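As an illustration only, this is a generic vector-clock comparison rather than the exact modified clock described here: detecting whether one version happened before the other, or whether the two sides diverged and need conflict resolution, might look like this.

```go
// A minimal vector-clock comparison. Each participant (say, "phone"
// and "datacenter") has a counter; one version happened before another
// only if it is <= in every component.
package replication

type VectorClock map[string]uint64

// Compare returns -1 if a happened before b, 1 if b happened before a,
// 0 if they are equal, and flags a conflict when neither dominates,
// i.e. when the two sides diverged and need conflict resolution.
func Compare(a, b VectorClock) (order int, conflict bool) {
	aBehind, bBehind := false, false
	for k := range allKeys(a, b) {
		switch {
		case a[k] < b[k]:
			aBehind = true
		case a[k] > b[k]:
			bBehind = true
		}
	}
	switch {
	case aBehind && bBehind:
		return 0, true // concurrent updates: a causality violation to resolve
	case aBehind:
		return -1, false
	case bBehind:
		return 1, false
	default:
		return 0, false
	}
}

func allKeys(a, b VectorClock) map[string]struct{} {
	keys := make(map[string]struct{})
	for k := range a {
		keys[k] = struct{}{}
	}
	for k := range b {
		keys[k] = struct{}{}
	}
	return keys
}
```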
Now next came the
issue of completed trips.
So, traditionally what we'd been doing is,
when a trip is completed, we would delete
all the data about the trip
from the phone.
We did that because we didn't want the
replication data on the phone
to grow unbounded.
And once a trip is completed,
it's probably no longer required
for restoration.
However, that has the side-effect that
mobile has no idea now that this trip
ever happened.
So what will happen is if we fail back to
a datacenter with some stale data
about this trip, then you might actually
end up putting the driver back
on that same trip, which is a pretty
bad experience, because he's suddenly
now driving somebody he already
dropped off, and he's probably
not gonna be paid for that.
So what we did to fix that was...
on trip completion, we would store
a special key on the phone, and the
version in that key has a flag in it.
That's why I called it
a modified vector clock.
So it has a flag that says that this trip
has already been completed,
and we store that on the phone.
Now when the replication service sees that
this driver has this flag for the trip,
it can tell the dispatching service,
"Hey, this trip has already been completed
and you should probably delete it."
So that way, we handle completed trips.
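A small sketch of how that completed-trip flag could be carried in the version and acted on during reconciliation; the field and function names are hypothetical.

```go
// A sketch of the completed-trip tombstone: the version stored on the
// phone carries a flag, so a tiny key survives trip completion and lets
// the replication service tell dispatch to drop any stale copy.
package replication

// Version is the value encoded into each key's version portion; the
// Completed flag is the "modified" part of the vector clock.
type Version struct {
	Clock     map[string]uint64 // vector clock, as in the previous sketch
	Completed bool              // set once the trip has finished
}

// Reconcile decides what the replication service tells dispatch during
// a failback, based on what the phone reports for a trip.
func Reconcile(phoneVersion Version, datacenterHasTrip bool) string {
	if phoneVersion.Completed && datacenterHasTrip {
		return "trip already completed: delete the stale copy in this datacenter"
	}
	if !datacenterHasTrip {
		return "restore the trip from the phone's encrypted blob"
	}
	return "compare clocks and keep whichever version is newer"
}
```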
So if you think about it, storing trip
data is kind of expensive, because we have
this huge encrypted blob, of JSON maybe.
But we can store a large number of
completed trips,
because there is almost no data
associated with them.
So we can probably store weeks' worth
of completed trips in the same amount
of memory as we would store one trip's data.
So that's how we solve stale states.
So now, next comes the issue of ensuring
four nines of reliability.
So we decided to exercise the system
more often than a datacenter failure
because we wanted to get confident
that the system actually works.
So our first approach was
to do manual failovers.
So basically what happened was that
a bunch of us would gather in a room
every Monday and then pick a few cities
and fail them over.
And after we failed them over to another
datacenter, we'd see
what the success rate was
for the restoration, and if
there were any failures, we'd try to look
at the logs and debug any issues there.
However, there were several problems with
this approach.
First, it was very operationally painful.
So, we had to do this every week.
And for a small fraction of trips
which did not get restored,
we would actually have to do
fare adjustments
for both the rider and the driver.
Secondly, it led to a very poor
customer experience
because for that same fraction,
the rider and driver were suddenly bumped
off their trip and got totally confused
about what had happened to them.
Thirdly, it had a low coverage because
we were covering only a few cities.
However, in the past we've seen problems
which affected only a specific city.
Maybe because there was a new feature
enabled in that city
which was not global yet.
So this approach would not help us
catch those cases until it was too late.
Finally, we had no idea whether the
backup datacenter can handle the load.
So in our current architecture, we have a
primary datacenter which handles
all the requests and then backup
datacenter which is waiting to handle
all those requests in case
the primary goes down.
But how do we know that the
backup datacenter
can handle all those requests?
So one way is, maybe you can provision
the same number of boxes
and same type of hardware
in the backup datacenter.
But what if there's a configuration issue
in some of the services?
We would never catch that.
And even if they're exactly the same,
how do you know that each service
in the backup datacenter can handle
a sudden flood of requests which comes
when there is a failure?
So we needed some way to fix
all these problems.
So then to understand how to get
good confidence in the system and
to measure it well, we looked at
the key concepts behind the system
which we really wanted to work.
So first thing was we wanted to ensure
that all mutations which are done
by the dispatching service are actually
stored on the phone.
So for example, a driver, right after he
picks up a passenger,
he may lose connectivity.
And so replication data may not
be sent to the phone immediately
but we want to ensure that the data
eventually makes it to the phone.
Secondly, we wanted to make sure
that the stored data can actually be used
for replication.
For example, there may be some
encryption or decryption issue with the data,
and the data gets corrupted
and is no longer usable.
So even if you're storing the data,
you cannot use it,
so there's no point.
Or, restoration actually involves
rehydrating the states
within the dispatching service
using the data.
So even if the data is fine,
if there's any problem
during that rehydration process,
some service behaving weirdly,
you would still have no use for that data
and you would still lose the trip,
even though the data is perfectly fine.
Finally, as I mentioned earlier,
we needed a way to figure out
whether the backup datacenters
can handle the load.
So to monitor the health of the system
better, we wrote another service.
Every hour it will get a list of all
active drivers and trips
from our dispatching service.
And for all those drivers, it will use
that messaging channel to ask for
their replication data.
And once it has the replication data,
it will compare that data
with the data which the application
expects.
And doing that, we get a lot of good
metrics around, like...
what percentage of drivers have data
successfully stored on their phones?
And you can even break down those metrics
by region or by app version.
So this really helped us
drill into the problem.
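A sketch of that hourly audit, assuming hypothetical interfaces for the dispatch snapshot and the messaging channel: pull the active drivers and trips, ask each phone for the keys it holds, and compare against what the application expects.

```go
// A sketch of the hourly health check: compare what dispatch believes
// should be stored on each driver's phone with what the phone reports
// over the messaging channel.
package monitor

type Snapshot struct {
	DriverID     string
	ExpectedKeys []string // what the dispatching service expects the phone to hold
}

type Phones interface {
	ListKeys(driverID string) ([]string, error) // asked via the messaging channel
}

// Audit returns the fraction of drivers whose phones hold every key the
// application expects; in practice this would also be broken down by
// region and app version.
func Audit(snapshots []Snapshot, phones Phones) float64 {
	healthy := 0
	for _, s := range snapshots {
		got, err := phones.ListKeys(s.DriverID)
		if err != nil {
			continue // unreachable phones count as unhealthy
		}
		have := make(map[string]bool)
		for _, k := range got {
			have[k] = true
		}
		ok := true
		for _, k := range s.ExpectedKeys {
			if !have[k] {
				ok = false
				break
			}
		}
		if ok {
			healthy++
		}
	}
	if len(snapshots) == 0 {
		return 1
	}
	return float64(healthy) / float64(len(snapshots))
}
```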
Finally, to know whether the stored data
can be used for replication,
and that the backup datacenter
can handle the load,
what we do is we use all the data
which we got in the previous step
and we send that to our backup datacenter.
And within the backup datacenter
we perform what we call
a shadow restoration.
And since there is nobody else making any
changes in that backup datacenter,
after the restoration is completed,
we can just query the dispatching service
in the backup datacenter.
And say, "Hey, how many active riders,
drivers, and trips do you have?"
And we can compare that number
with the number we got
in our snapshot
from the primary datacenter.
And using that, we get really valuable
information around what's our success rate
and we can do similar breakdowns by
different parameters
like region or app version.
Finally, we also get metrics around
how well the backup datacenter did.
So did we subject it to a lot of load,
or can it handle the traffic
when there is a real failure?
Also, any configuration issue
in the backup datacenter
can be easily caught by this approach.
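And a sketch of that shadow-restoration check, again with hypothetical interfaces: replay the data collected in the audit into the backup datacenter, then compare its resulting counts against the snapshot taken from the primary.

```go
// A sketch of the shadow restoration: rehydrate state in the backup
// datacenter from the replication data, then compare how much of the
// primary's snapshot actually came back.
package monitor

type Counts struct {
	Riders, Drivers, Trips int
}

type BackupDatacenter interface {
	Restore(replicationData map[string][]byte) error // rehydrate dispatch state
	ActiveCounts() (Counts, error)                    // "how many active riders, drivers, trips?"
}

// ShadowRestore returns the overall restoration success rate, and in the
// process subjects the backup datacenter to a realistic flood of requests.
func ShadowRestore(primary Counts, data map[string][]byte, backup BackupDatacenter) (float64, error) {
	if err := backup.Restore(data); err != nil {
		return 0, err
	}
	restored, err := backup.ActiveCounts()
	if err != nil {
		return 0, err
	}
	expected := primary.Riders + primary.Drivers + primary.Trips
	got := restored.Riders + restored.Drivers + restored.Trips
	if expected == 0 {
		return 1, nil
	}
	return float64(got) / float64(expected), nil
}
```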
So using this service, we are
constantly testing the system
and making sure we have confidence in it
and can use it during a failure.
Cause if there's no confidence
in the system, then it's pointless.
So yeah, that was the idea
behind the system
and how we implemented it.
I did not get a chance to go into
different [inaudible] of detail,
but if you guys have any questions,
you can always reach out to us
during the office hours.
So thanks guys for coming
and listening to us.
(audience claps)