Take a look somewhere in an engineering VP or Director’s office and you’ll find a binder, untouched in quite a while, labeled “DR / BCM Plan.”
Disaster Recovery / Business Continuity Management planning is an important thing to take into consideration. But the clowns you work for have almost certainly screwed it up well into the realm of absurdity.
Why these plans exist
These plans start with the best of intentions. “What happens if our site falls over?” is absolutely the kind of thing that responsible businesses and also Facebook need to ask. In fact, when I was shopping around for The Duckbill Group’s insurance policy, one of the questions was “Do you have a DR plan?” The next sentence was “please attach a copy of it,” so you couldn’t just skate by.
Further, if your data center or cloud service provider reaches out with a “Hey, so our facility is now a smoking hole in the ground because it turns out that powering it with what is in effect a giant compressed bomb had some failure modes we didn’t fully anticipate,” you’re going to want to have at least a rough idea of what to do next.
No, not “update your résumé and look for a new job,” you coward. We’ll get to that part later.
Why they’re jokes
The problem with these plans is that they betray a severe lack of understanding about how failures work. As an environment grows and its applications become world-spanning, it’s less a question of whether the site is up or down and more a question of “How down is it?”
Knowing when to activate the DR plan is never as clear-cut as it is in tabletop exercises. If your provider fails to communicate with you about what’s going on, do you activate the plan or try to wait it out?
DR plans also suffer from the conceit that they’re able to predict the scale and scope of any given outage. “Surely if the database server fails, it won’t do so in a manner that corrupts its replica” is one expression of this, and a common one.
But there’s a darker one.
If you’re in AWS’s us-tirefire-1 and you test your plan to make the poor life decision of migrating to Ohio, that’s going to work pretty well during your DR exercise. It’s likely to work far less well in the event of a regional AWS outage because roughly half of the internet will be attempting to do the exact same thing.
Did your DR plan account for EC2 instance provisioning to take 45 minutes? Did it account for EBS latency well above normal? The “herd of elephants” problem will stampede you to death if you’re not careful, and there’s no good way to test for this in advance.
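The one thing you can do in advance is refuse to trust default timeouts. Here’s a minimal sketch of what that looks like in a failover script, assuming boto3 and entirely made-up region and AMI names; it isn’t anyone’s blessed runbook, just the shape of “budget for the stampede before the stampede.”

```python
# A minimal sketch, assuming boto3; the failover region and AMI are hypothetical.
# The point: during a real regional event, don't assume normal provisioning
# times will hold. Budget for capacity contention up front.
import boto3

FAILOVER_REGION = "us-east-2"          # hypothetical DR target region (yes, Ohio)
DR_AMI_ID = "ami-0123456789abcdef0"    # hypothetical pre-baked DR image

ec2 = boto3.client("ec2", region_name=FAILOVER_REGION)

# run_instances itself can fail with InsufficientInstanceCapacity when half
# the internet piles into the same region; a real runbook needs a fallback.
resp = ec2.run_instances(
    ImageId=DR_AMI_ID,
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# The default waiter gives up after roughly ten minutes; a regional stampede
# can blow well past that, so stretch the polling window to about an hour.
waiter = ec2.get_waiter("instance_running")
waiter.wait(
    InstanceIds=[instance_id],
    WaiterConfig={"Delay": 30, "MaxAttempts": 120},
)
```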
DR plans are also snapshots of fixed points in time. If you’re at a shop that does quarterly DR tests—spoiler: almost none do, despite what they claim in their audit attestations—what happens is you attempt to run the DR plan from last quarter and it runs into a problem and fails. You fix that, move forward another step or two, and hit a different problem. You keep iterating on your DR plan until it works, and you get to check the blessed box on the form.
And then your next commit to production breaks your DR plan again.
Unless you’re testing your DR plan continually, it’s almost certainly going to break in hilarious fashion right when you need it most.
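“Continually” just means the restore test is another scheduled job, not a quarterly ceremony. Here’s a minimal sketch of the idea, assuming boto3 and hypothetical RDS instance names; the specifics matter far less than the fact that it runs unattended and fails loudly when your last commit quietly broke it.

```python
# A minimal sketch of continuous DR testing, assuming boto3 and RDS.
# Instance names are hypothetical; run this from cron, CI, whatever.
# Any exception (no snapshots, restore never comes up) fails the job,
# which is exactly the alert you want.
import boto3

SOURCE_DB = "prod-db"              # hypothetical production instance
RESTORE_ID = "dr-drill-restore"    # throwaway instance for the drill

rds = boto3.client("rds", region_name="us-east-2")

# Grab the most recent automated snapshot of the production database.
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier=SOURCE_DB, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])

# Restore it somewhere disposable and wait for it to come up.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=RESTORE_ID,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=RESTORE_ID)

# Run whatever sanity checks prove the restore is actually usable
# (row counts, schema version, a canary query), then clean up.
rds.delete_db_instance(DBInstanceIdentifier=RESTORE_ID, SkipFinalSnapshot=True)
```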
Scope
Any DR plan that isn’t written by complete clowns is going to have to address up front exactly what the scale and scope of its applicability is. “We lost the primary database” is a common and great example of what your DR plan should cover. “Three quarters of the world is destroyed by an asteroid” is going to have different answers—and for almost all of us, our sites will be down because we’ll all have bigger problems to worry about for the foreseeable future.
Even things in the middle of these two extremes—such as “AWS loses a major region for a month”—are likely to be hilariously out of touch just because they fail to account for human behaviors.
The human element
I once worked in a regulated environment where I was a key employee with respect to the DR plan. “Here’s our offsite location well outside of San Francisco in case the city isn’t able to sustain work; in that event we’ll all rendezvous here within four hours of the disaster being declared.”
Unless this is your first encounter with my personality, you can probably guess how that conversation went.
“Yes, excuse me! One question for you folks, and it’s just a minor thing really. None of our computers live in San Francisco; they’re cloud hosted very far away in undisclosed locations managed by AWS. Can you identify a single scenario—any scenario at all—in which AWS lost a region, San Francisco was uninhabitable for work purposes, and a single employee here gave anything remotely resembling a crap about work instead of, y’know, their families? Further, let’s assume that this hit-the-lottery-jackpot-three-weeks-in-a-row scenario happens; exactly which of our employees do you believe are dumb enough to continue working for their existing salaries rather than becoming multi-million dollar a month consultants for a number of companies who suddenly have far, far, far more expensive problems than we will? I don’t recall ‘hire people who are incredibly intelligent about everything except knowing their own market worth’ as being in our charter. Did I miss that paragraph?”
And then, suddenly, I wasn’t invited to DR planning meetings anymore.
At some point, “this is ridiculous; I quit” is going to be your staff’s response—and they’ll be right.
DR plans tend to skip over this entirely and lose sight of the bigger picture. Sure, okay—you have a policy that three of your executives can’t all travel on the same plane (strangely, there’s no such policy about them riding in the same car), but half of your engineering team will quit the second you mention Azure.
Our DR policy
The Duckbill Group’s DR policy states, in effect, that we back up our data a couple of different ways. We’re fully remote, so should any employee’s internet stop working, they can presumably work from a coffee shop or tether from a phone. Should the multiple cities in which our Cloud Economists reside suddenly become unsuitable for work, we are prepared to operate on the assumption that nobody is going to care overly much about their AWS bills that month.
In effect, we take a realistic view that doesn’t depend upon our employees sacrificing themselves or their families’ well-being in extremis. We didn’t expect Pete Cheslock to keep working after I messed up drop-shipping his company car because we’re human beings. At some scale, you’ve gotta have a business continuity plan that transcends individuals—heck, we do ourselves!—but that flat out can’t come at the expense of overlooking people’s basic humanity.
If your employer’s DR plan is written by clowns and assumes you’ll prioritize them over your family, I suggest you find a new place to work.