-
(audience claps)
(Presenter) Ok.
-
Alright everybody so, let's dive in.
-
So let's talk about how Uber trips
even happen
-
before we get into the nitty gritty
of how
-
we save them from the clutches
of a datacenter failover.
-
So you might have heard we're all about
connecting riders and drivers.
-
This is what it looks like.
-
You've probably at least
seen the rider app.
-
You get it out, you see
some cars on the map.
-
You pick where you want
to get picked up, your pickup location.
-
At the same time,
all these guys that
-
you're seeing on the map,
they have a phone open somewhere.
-
They're logged in waiting for a dispatch.
-
They're all pinging
into the same datacenter
-
for the city that you both are in.
-
So then what happens is you put
that pin somewhere,
-
and you get ready to request your trip.
-
You hit request and
that guy's phone starts beeping.
-
Hopefully if everything works out,
he'll accept that trip.
-
All these things that we're talking about
here, the request of a trip,
-
the offering it to a driver,
him accepting it.
-
That's something we call
a state change transition.
-
From the moment that you start
requesting the trip,
-
we start creating your trip data
in the backend datacenter.
-
And that transaction might live
for anything like 5, 10, 15, 20, 30,
-
however many minutes it takes you
to take your trip.
-
We have to consistently handle that trip
to get it all the way through
-
to completion to get you
where you're going happily.
-
So every time this state change happens,
-
things happen in the world, so next up
he goes ahead and shows up to you.
-
He arrives, you get in the car,
he begins the trip.
-
Everything's going fine.
-
So this is of course...
some of these state changes
-
are more or less important
to everything that's going on.
-
The begin trip and the end trip are the
real important ones, of course.
-
The ones that we don't want
to lose the most.
-
But all these are really important
to hold onto.
-
So what happens in the case of a
failure is your trip is gone.
-
You're both back to,
"oh my god, where'd my trip go?"
-
You're just seeing empty cars again
and he's back to an open app,
-
just like where you both were
when you opened the application
-
in the first place.
-
So this is what used to happen for us
not too long ago.
-
So how do we fix this,
how do you fix this in general?
-
So classically you might try
and say,
-
"Well, let's take all the data in that one
datacenter and copy it,
-
replicate it to a backend data center."
-
This is pretty well understood, classic
way to solve this problem.
-
You control the active data center
and the backup data center
-
so it's pretty easy to reason about.
-
People feel comfortable with this scheme.
-
It could work more or less well depending
on what database you're using.
-
But there's some drawbacks.
-
It gets kinda complicated
beyond two datacenters.
-
It's always gonna be subject
to replication lag
-
because the datacenters are separated by
this thing called the internet,
-
or maybe leased lines
if you get really into it.
-
So it requires a constant level of high
bandwidth, especially if you're not using
-
a database well-suited to replication,
or if you haven't really tuned
-
your data model to keep
the deltas really small.
-
So, we chose not to go with this route.
-
We instead said, "What if we could solve
it by saving the data down to the driver phone?"
-
Because we're already in constant
communication with these driver phones,
-
what if we could just save the data there
to the driver phone?
-
Then he could failover to any datacenter,
rather than having to control,
-
"Well here's the backup datacenter for
this city, the backup datacenter
-
for this city," and then, "Oh, no no, what
if in a failover, we fail the wrong phones
-
to the wrong datacenter and now we lose
all their trips again?"
-
That would not be cool.
-
So we really decided to go with this
mobile implementation approach
-
of saving the trips to the driver phone.
-
But of course, it doesn't come without a
trade-off, the trade-off here being
-
you've got to implement some kind of a
replication protocol in the driver phone
-
consistently between whatever
platforms you support.
-
In our case, iOS and Android.
-
...But if we could, how would this work?
-
So all these state transitions
are happening when the phones
-
communicate with our datacenter.
-
So if in response to his request to begin
trip or arrive, or accept, or any of this,
-
if we could send some data back down to
his phone and have him keep ahold of it,
-
then in the case of a datacenter failover,
when his phone pings into
-
the new datacenter, we could request that
data right back off of his phone, and get
-
you guys right back on your trip with
maybe only a minimal blip,
-
in the worst case.
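-
To make that concrete, here's a minimal sketch of that flow in Python. All the class and field names here are assumptions for illustration, not Uber's actual interfaces:

```python
# Hypothetical sketch of the flow above: state-change responses piggyback
# replication data down to the phone; after a failover, the new datacenter
# pulls it back. Names are illustrative, not Uber's API.

def encrypt(state):
    return state  # placeholder; the real system encrypts (discussed below)

def decrypt(blob):
    return blob

class DriverPhone:
    def __init__(self):
        self.stored = {}  # key -> opaque replicated blob

    def apply_response(self, response):
        # Every state-change response can carry replication data to keep.
        self.stored.update(response.get("replicate", {}))

class Datacenter:
    def __init__(self):
        self.trips = {}  # trip state this datacenter knows about

    def on_state_change(self, trip_id, new_state):
        self.trips[trip_id] = new_state
        # Piggyback the updated trip state on the normal response.
        return {"ok": True, "replicate": {trip_id: encrypt(new_state)}}

    def on_ping(self, phone):
        # Failover case: a phone pings into a datacenter that has no record
        # of its trip, so we request the data right back off the phone.
        for key, blob in phone.stored.items():
            self.trips.setdefault(key, decrypt(blob))
```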
-
So in a high level, that's the idea, but
there are some challenges of course
-
in implementing that.
-
Not all the trip information that we would
want to save is something we want
-
the driver to have access to. To be
able to get your trip and end it
-
consistently in the other datacenter, we'd
have to have the full rider information.
-
If you're fare splitting with some friends
it would need to be all rider information.
-
So, there's a lot of things that we need
to save here to save your trip
-
that we don't want to expose
to the driver.
-
Also, you pretty much have to assume that
the driver phones are more or less
-
untrustworthy, either because people are
doing nefarious things with them,
-
or because people other than the drivers
have compromised them,
-
or somebody else between you
and the driver has. Who knows?
-
So for most of these reasons, we decided
we had to go with the crypto approach
-
and encrypt all the data that we store on
the phones to protect against tampering
-
and leak of any kind of PII.
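-
As an illustration of that encrypt-before-store idea, here's a sketch using the Python cryptography library's Fernet, which gives authenticated encryption. The actual cipher and key management aren't covered in this talk, so treat those details as assumptions:

```python
# Sketch: encrypt trip data before handing it to the driver phone.
# Fernet is authenticated encryption, so tampering on the phone is
# detected at decrypt time. Key handling here is illustrative only.
import json
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in reality, kept server-side, never on the phone
cipher = Fernet(key)

trip = {"trip_id": "t-123", "rider": "full rider info the driver must not see"}
blob = cipher.encrypt(json.dumps(trip).encode())  # opaque bytes for the phone

# Later, in whichever datacenter handles the failover:
restored = json.loads(cipher.decrypt(blob))
```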
-
And also, toward all these security goals,
and for simple reliability of
-
interacting with these phones, you want to
keep the replication protocol as simple
-
as possible, to make it easy to reason
about, easy to debug, to remove failure cases,
-
and you also want to
minimize the extra bandwidth.
-
I kinda glossed over the bandwidth impacts
-
when I said backend replication
isn't really an option.
-
But at least here when you're designing
this replication protocol,
-
at the application layer you can be
much more in tune with what data
-
you're serializing, and what
you're deltafying or not
-
and really mind your bandwidth impact.
-
Especially since it's going over a mobile
network, this becomes really salient.
-
So, how do you keep it simple?
-
In our case, we decided to go with a very
simple key value store with all your
-
typical operations: get, set, delete,
and list all the keys please,
-
with one caveat being you can only
set a key once so you can't
-
accidentally overwrite a key;
it eliminates a whole class
-
of weird programming errors or
out of order message delivery errors
-
that you might have in such a system.
-
This did, however, force us to move
what we'll call "versioning"
-
into the keyspace.
-
You can't just say, "Oh, I've got a key
for this trip and please update it
-
to the new version on each state change."
-
No; instead you have to have a key for
trip and version, and you have to do
-
a set of the new one, then a delete of the
old one, and that at least gives you the nice
-
property that if that fails partway
through, between the set and the delete,
-
you fail into having two things stored,
rather than no things stored.
-
So there are some nice properties to
keeping a nice simple
-
key-value protocol here.
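-
As a sketch, a toy version of that set-once key-value contract and the set-then-delete update pattern might look like this in Python (the interface names are assumptions):

```python
# Toy sketch of the phone-side set-once key-value store described above.
class SetOnceStore:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        if key in self._data:
            raise KeyError(f"key {key!r} already set")  # keys are write-once
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    def keys(self):
        return list(self._data)

# "Updating" a trip means versioned keys: set the new version first, then
# delete the old one. Failing between the two steps leaves two copies
# stored, never zero.
store = SetOnceStore()
store.set(("trip-1", 1), "encrypted blob v1")
store.set(("trip-1", 2), "encrypted blob v2")
store.delete(("trip-1", 1))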
-
And that makes failover resolution really
easy because it's simply a matter of,
-
"What keys do you have?
What trips do you store?
-
What keys do I have in the backend
datacenter?"
-
Compare those, and come to a resolution
between those sets of trips.
-
So that's a quick overview
of how we built this system.
-
My colleague here, Nikunj Aggarwal
is going to give you a rundown
-
of some more details of how we
really got the reliability of this system
-
to work at scale.
-
(audience claps)
-
Alright, hi! I'm Nikunj.
-
So we talked about the idea
and the motivation behind the idea;
-
now let's dive into how we designed such
a solution, and what kind of trade-offs
-
we had to make
while we were doing the design.
-
So the first thing we wanted to ensure was
that the system we built is non-blocking
-
but still provides eventual consistency.
-
So basically...any backend application
using this system should be able to make
-
further progress, even when
the system is down.
-
So the only trade-off the application
should be making is that it may
-
take some time for the data
to actually be stored on the phone.
-
However, using this system
should not affect
-
any normal business operations for them.
-
Secondly, we wanted to have an ability
to move between datacenters without
-
worrying about data already there.
-
So when we fail over from
one datacenter to another,
-
that datacenter still has states in there,
-
and it still has its view
of active drivers and trips.
-
And no service in that datacenter is aware
that a failure actually happened.
-
So at some later time, if we fail back to
the same datacenter, then its view
-
of the drivers and trips may actually be
different from what the drivers
-
have, and if we trusted that
datacenter, then the drivers may get on a
-
stale trip, which is
a very bad experience.
-
So we need some way to reconcile that data
between the drivers and the server.
-
Finally, we want to be able to measure
the success of the system all the time.
-
So the system is only fully exercised
during a failure, and a datacenter failure
-
is a pretty rare occurrence, and we don't
want to be in a situation where
-
we detect issues with the system when we
need it the most.
-
So what we want is an ability to
constantly be able to measure the success
-
of the system so that we are confident in
it when a failure actually happens.
-
So, keeping all these issues in mind,
this is a very high level
-
view of the system.
-
I'm not going to go into details of any
of the services,
-
since it's a mobile conference.
-
So the first thing that happens is that
the driver makes an update,
-
or as Josh called it, a state change,
on his app.
-
For example, he may pick up a passenger.
-
Now that update comes as a request to the
dispatching service.
-
Now the dispatching service, depending on
the type of request, updates the trip
-
model for that trip, and then sends the
update to the replication service.
-
Now the replication service will enqueue
that request in its own datastore
-
and immediately return a successful
response to the dispatching service,
-
and then finally the dispatching service
will update its own datastore
-
and then return a success to mobile.
-
The response may also include some other
updates for mobile, because things might have
-
changed since the last time
mobile pinged in, for example...
-
...If it's an UberPool trip,
then the driver may have to pick up
-
another passenger.
-
Or if the rider entered some destination,
-
we might have to tell
the driver about that.
-
And in the background, the replication
service encrypts that data, obviously,
-
since we don't want drivers
to have access to all that,
-
and then sends it to a messaging service.
-
So the messaging service is something that
we built as part of this system.
-
It maintains a bidirectional communication
channel with all drivers
-
on the Uber platform.
-
And this communication channel
is actually separate from
-
the original request response channel
which we've been traditionally using
-
at Uber for drivers to communicate
with the server.
-
So this way, we are not affecting any
normal business operation
-
due to this service.
-
So the messaging service then sends the
message to the phone
-
and gets an acknowledgement from it.
-
So from this design, what we have achieved
is that we've isolated the applications
-
from any replication latencies or failures
because our replication service
-
returns immediately and the only extra
thing the application is doing
-
by opting in to this replication strategy
is making an extra service call
-
to the replication service, which is going
to be pretty cheap since it's within
-
the same datacenter,
not traveling over the internet.
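-
A minimal sketch of that enqueue-and-ack pattern, with all service and function names as stand-ins:

```python
# Sketch of the non-blocking hand-off: the replication service enqueues
# and acks immediately, so the dispatching service never waits on a
# phone being reachable. Names here are illustrative.
import queue
import threading

replication_queue = queue.Queue()

def replicate(trip_id, encrypted_blob):
    """Called by the dispatching service; cheap, same-datacenter."""
    replication_queue.put((trip_id, encrypted_blob))
    return {"ok": True}  # success before the phone even has the data

def send_via_messaging_service(trip_id, blob):
    # Stand-in for the separate bidirectional channel to the driver's
    # phone; the real service delivers and waits for an acknowledgement.
    pass

def background_sender():
    while True:
        trip_id, blob = replication_queue.get()
        send_via_messaging_service(trip_id, blob)

threading.Thread(target=background_sender, daemon=True).start()
```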
-
Secondly, now having this separate channel
gives us the ability to arbitrarily query
-
the states of the phone without affecting
any normal business operations
-
and we can use that phone as a basic
key-value store now.
-
Next...Okay so now comes the issue of
moving between datacenters.
-
As I said earlier, when we fail over, we are
actually leaving states behind
-
in that datacenter.
-
So how do we deal with stale states?
-
So the first approach we tried
was actually to do some manual cleanup.
-
So we wrote some cleanup scripts,
and every time we failed over
-
from our primary datacenter to our backup
datacenter, somebody would run that script
-
in the primary, and it would go to the
datastores for the dispatching service
-
and it would clean out
all the states there.
-
However, this approach had operational
pain because somebody had to run it.
-
Moreover, we allowed the ability
to fail over per city, so you can actually
-
choose to fail over specific cities instead
of the whole world, and in those cases
-
the script started becoming complicated.
-
So then we decided to tweak our design a
little bit so that we solve this problem.
-
So the first thing we did was...as Josh
mentioned earlier, the key which is
-
stored on the phone contains the trip
identifier and the version within it.
-
So the version used to be an incrementing
number so that we could keep track of
-
any forward progress being made.
-
However, we changed that to a
modified vector clock.
-
So using that vector clock, we can now
compare data on the phone
-
and data on the server.
-
And if there is a mismatch, we can detect any
causality violations using that.
-
And we can also resolve that using a very
basic conflict resolution strategy.
-
So this way, we handle any issues
with ongoing trips.
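-
As a sketch, comparing those clocks could look like this; the talk doesn't spell out the exact structure of the modified vector clock, so the fields here are assumptions:

```python
# Sketch: compare the phone's vector clock with the server's to detect
# causality violations after a failback.
def compare(phone_clock, server_clock):
    """Clocks are dicts of {node: counter}."""
    nodes = set(phone_clock) | set(server_clock)
    phone_ahead = any(phone_clock.get(n, 0) > server_clock.get(n, 0) for n in nodes)
    server_ahead = any(server_clock.get(n, 0) > phone_clock.get(n, 0) for n in nodes)
    if phone_ahead and server_ahead:
        return "conflict"      # concurrent histories: causality violation,
                               # fall back to the conflict resolution strategy
    if phone_ahead:
        return "phone-newer"   # server is stale: restore from the phone
    if server_ahead:
        return "server-newer"  # phone missed an update: push data back down
    return "equal"
```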
-
Now next came the
issue of completed trips.
-
So, traditionally what we'd been doing is,
when a trip is completed, we would delete
-
all the data about the trip
from the phone.
-
We did that because we didn't want the
replication data on the phone
-
to grow unbounded.
-
And once a trip is completed,
it's probably no longer required
-
for restoration.
-
However, that has the side-effect that
mobile has no idea now that this trip
-
ever happened.
-
So what will happen is, if we fail back to
a datacenter with some stale data
-
about this trip, then you might actually
end up putting the driver right back
-
on that same trip, which is a pretty
bad experience because he's suddenly
-
now driving somebody whom he already
dropped off, and he's probably
-
not gonna be paid for that.
-
So what we did to fix that was...
on trip completion, we would store
-
a special key on the phone, and the
version in that key has a flag in it.
-
That's why I called it
a modified vector clock.
-
So it has a flag that says that this trip
has already been completed,
-
and we store that on the phone.
-
Now when the replication service sees that
this driver has this flag for the trip,
-
then it can tell the dispatching service that
"Hey, this trip has already been completed
-
and you should probably delete it."
-
So that way, we handle completed trips.
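-
A sketch of how that completed flag might be checked during restoration (the names are assumptions):

```python
# Sketch: a completed trip leaves only a tiny flagged version key on the
# phone, so a stale datacenter can learn the trip is done.
def reconcile_trip(trip_id, phone_version, dispatching_service):
    if phone_version.get("completed"):
        # Never put the driver back on a finished trip.
        dispatching_service.delete_trip(trip_id)
    else:
        dispatching_service.restore_trip(trip_id)
```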
-
So if you think about it, storing trip
data is kind of expensive because we have
-
this huge encrypted blob, of JSON maybe.
-
But we can store a...
large number of completed trips,
-
because there is no data
associated with them.
-
So we can probably store weeks' worth
of completed trips in the same amount
-
of memory as we would use for one trip's data.
-
So that's how we solve stale states.
-
So now, next comes the issue of ensuring
four nines of reliability.
-
So we decided to exercise the system
more often than a datacenter failure
-
because we wanted to get confident
that the system actually works.
-
So our first approach was
to do manual failovers.
-
So basically what happened was that a
bunch of us would gather in a room
-
every Monday and then pick a few cities
and fail them over.
-
And after we failed them over to another
datacenter, we'd see...
-
what the success rate
for the restoration was, and if
-
there were any failures, we'd try to look
at the logs and debug any issues there.
-
However, there were several problems with
this approach.
-
First, it was very operationally painful.
So, we had to do this every week.
-
And for a small fraction of trips,
which did not get restored,
-
we would actually have to do
a fare adjustment
-
for both the rider and the driver.
-
Secondly, it led to a very poor
customer experience
-
because for that same fraction,
they were suddenly bumped off their trip
-
and they got totally confused,
like, what just happened to them?
-
Thirdly, it had a low coverage because
we were covering only a few cities.
-
However, in the past we've seen problems
which affected only a specific city.
-
Maybe because there was a new feature
enabled in that city
-
which was not global yet.
-
So this approach does not help us
catch those cases until it's too late.
-
Finally, we had no idea whether the
backup datacenter can handle the load.
-
So in our current architecture, we have a
primary datacenter which handles
-
all the requests and then backup
datacenter which is waiting to handle
-
all those requests in case
the primary goes down.
-
But how do we know that the
backup datacenter
-
can handle all those requests?
-
So one way is, maybe you can provision
the same number of boxes
-
and same type of hardware
in the backup datacenter.
-
But what if there's a configuration issue
-
in some of the services?
We would never catch that.
-
And even if they're exactly the same,
how do you know that each service
-
in the backup datacenter can handle
a sudden flood of requests which comes
-
when there is a failure?
-
So we needed some way to fix
all these problems.
-
So then to understand how to get
good confidence in the system and
-
to measure it well, we looked at
the key things about the system
-
which we really needed to work.
-
So first thing was we wanted to ensure
that all mutations which are done
-
by the dispatching service are actually
stored on the phone.
-
So for example, a driver, right after he
picks up a passenger,
-
he may lose connectivity.
-
And so replication data may not
be sent to the phone immediately
-
but we want to ensure that the data
eventually makes it to the phone.
-
Secondly, we wanted to make sure
that the stored data can actually be used
-
for replication.
-
For example, there may be some
encryption-decryption issue with the data
-
and the data gets corrupted
and it's no longer usable.
-
So even if you're storing the data,
you cannot use it.
-
So there's no point.
-
Or, restoration actually involves
rehydrating the states
-
within the dispatching service
using the data.
-
So even if the data is fine,
if there's any problem
-
during that rehydration process,
some service behaving weirdly,
-
you would still have no use for that data
and you would still lose the trip,
-
even though the data is perfectly fine.
-
Finally, as I mentioned earlier,
we needed a way to figure out
-
whether the backup datacenters
can handle the load.
-
So to monitor the health of the system
better, we wrote another service.
-
Every hour it will get a list of all
active drivers and trips
-
from our dispatching service.
-
And for all those drivers, it will use
that messaging channel to ask for
-
their replication data.
-
And once it has the replication data,
it will compare that data
-
with the data which the application
expects.
-
And doing that, we get a lot of good
metrics around, like...
-
What percentage of drivers have data
successfully stored on their phones?
-
And you can even break down metrics
by region or by app version.
-
So this really helped us
drill into the problem.
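-
A sketch of what that hourly check amounts to, with all the service interfaces as stand-ins:

```python
# Sketch of the hourly monitor: compare what each driver's phone actually
# stored against what the backend expects, and emit metric breakdowns.
def hourly_replication_check(dispatching, messaging, metrics):
    for driver in dispatching.active_drivers():
        expected = set(dispatching.expected_keys(driver))
        on_phone = set(messaging.query_replication_keys(driver))
        metrics.emit(
            "replication.stored_ok",
            value=expected <= on_phone,  # all expected data made it
            region=driver.region,
            app_version=driver.app_version,
        )
```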
-
Finally, to know whether the stored data
can be used for replication,
-
and that the backup datacenter
can handle the load,
-
what we do is we use all the data
which we got in the previous step
-
and we send that to our backup datacenter.
-
And within the backup datacenter
we perform what we call
-
a shadow restoration.
-
And since there is nobody else making any
changes in that backup datacenter,
-
after the restoration is completed,
we can just query the dispatching service
-
in the backup datacenter.
-
And say, "Hey, how many active riders,
drivers, and trips do you have?"
-
And we can compare that number
with the number we got
-
in our snapshot
from the primary datacenter.
-
And using that, we get really valuable
information around what's our success rate
-
and we can do similar breakdowns by
different parameters
-
like region or app version.
-
Finally, we also get metrics around
how well the backup datacenter did.
-
So did we subject it to a lot of load,
or can it handle the traffic
-
when there is a real failure?
-
Also, any configuration issue
in the backup datacenter
-
can be easily caught by this approach.
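-
Roughly, the whole shadow restoration check amounts to this sketch (the interfaces are illustrative):

```python
# Sketch: replay the phones' replication data into the idle backup
# datacenter, then compare its counts with the primary's snapshot.
def shadow_restore(phone_data, backup_dc, primary_snapshot, metrics):
    for driver_id, blobs in phone_data.items():
        backup_dc.restore(driver_id, blobs)  # exercises the real restore path

    restored = backup_dc.count_active()  # {"riders": n, "drivers": n, "trips": n}
    for kind, expected in primary_snapshot.items():
        rate = restored.get(kind, 0) / expected if expected else 1.0
        metrics.emit(f"shadow_restore.{kind}.success_rate", rate)
```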
-
So using this service, we are
constantly testing the system
-
and making sure we have confidence in it
and can use it during a failure.
-
Because if there's no confidence
in the system, then it's pointless.
-
So yeah, that was the idea
behind the system
-
and how we implemented it.
-
I did not get a chance to go into
different [inaudible] of detail,
-
but if you guys have any questions,
you can always reach out to us
-
during the office hours.
-
So thanks guys for coming
and listening to us.
-
(audience claps)