Cloud Resilience Strategies with Seth Eliot

Episode Summary

Seth Eliot, Principal Resilience Architect at Arpio, and former Global Reliability Lead at AWS, joins Corey to discuss cloud resilience. He emphasizes that Multi-AZ setups are typically sufficient, with multi-region configurations only necessary for specific risks. Seth highlights the importance of balancing cost and resilience based on business needs, while cautioning against making resilience a mere checkbox exercise. Together, they explore disaster recovery challenges, noting that many companies fail to account for real-world complexities during testing. Seth also stresses the importance of avoiding control plane dependencies and warns that poorly designed multi-cloud setups can introduce additional risks.

Episode Show Notes & Transcript

Show Highlights

(0:00) Intro

(1:12) Backblaze sponsor read

(1:40) Seth’s involvement in the Well-Architected sphere of AWS

(4:43) Well-Architected as a maturity model

(6:46) Cost vs. resilience

(10:37) The tension between resiliency and the cost pillar

(13:26) Legitimate reasons to go multi-region

(18:31) Mistakes people make when trying to avoid an AWS outage

(24:07) The challenges of control planes

(25:04) What people are getting wrong about the resiliency landscape in 2024

(26:31) Where you can find more from Seth

About Seth Eliot

Currently Principal Resilience Architect at Arpio, and ex-Amazon, ex-AWS, ex-Microsoft… Seth has spent years knee-deep in the tech trenches, figuring out how to design, implement, and launch software that's not just fast but also bulletproof. He thrives on helping teams tackle those "make or break" technical, process, or culture challenges—then partners up to solve them. As the Global Reliability Lead for AWS Well-Architected, Seth didn’t just work with customers; he scaled his insights via workshops, presentations, and blog posts, benefiting thousands. Before that, as one of the rare AWS-dedicated Principal Solutions Architects at Amazon.com (yep, not AWS, but the mothership itself), he rolled up his sleeves with engineers to fine-tune the AWS magic powering Amazon.com’s immense stack. Earlier? He led as Principal Engineer for Amazon Fresh and International Tech, and before that, helped bring Prime Video into homes everywhere.

Links

Personal site: https://linktr.ee/setheliot
LinkedIn: https://www.linkedin.com/in/setheliot/
Twitter: https://twitter.com/setheliot

Sponsor
Backblaze: https://www.backblaze.com/

Transcript

Seth Eliot: What risks does your company need to be protected against? If it's the risk of a fire in a data center or a minor electrical outage, guess what? Multi-AZ is fine. You don't need to be multi-region because again, it's multiple data centers when you're Multi-AZ.

Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn. You may know my guest today from the re:Invent stage or from various YouTube videos or, you know, blogs, if you're more of the reading type. Seth Eliot was a principal solutions architect at AWS and is currently between roles. Seth, you're free. What's it like to suddenly not have to go to work at an enterprise-style company, which I think is where you spent your entire career?

Seth Eliot: No, instead I'm working full time doing interviews. [laughs] It's another job to have.

Corey Quinn: It's a lot less process, a lot fewer stakeholders have to sign off on these things most of the time.

Seth Eliot: Be your own boss, you know, work from home. Well, what's that like? But, you know, hey, you know, if after hearing this conversation anyone out there thinks they could use my skills as a cloud architect with a focus on resilience, I'm on LinkedIn. That's Eliot with one L and one T.

Corey Quinn: And we will, of course, toss that into the show notes.

Sponsor: Backblaze B2 Cloud Storage helps you scale applications and deliver services globally. Egress is free up to 3X of data stored, and completely free between S3-compatible Backblaze and leading CDN and compute providers like Fastly, Cloudflare, Vultr, and CoreWeave. Visit backblaze.com to learn more. Backblaze—cloud storage built better.

Corey Quinn: You've been talking for a while now about something that I've always had a strange relationship with. Specifically, the Well-Architected Framework, tool, team, process, etc. at AWS. What is your involvement with that whole Well-Architected sphere?

Seth Eliot: So, actually, I was, uh, I had a really interesting job we could talk about in another question about it as an AWS Solutions Architect embedded in Amazon.com, not working for AWS, but working for Amazon. But my first job in AWS was as Global Reliability Lead for AWS Well-Architected. So there were five pillars at the time. There are six now. Let's see if I can remember them. Operational Excellence, Performance, Security, Reliability, Cost, and the sixth is Sustainability.

Uh, cost might be of some interest to you, Corey. I'm not sure. But I joined that team as Reliability Lead.

Corey Quinn: There is. Computers tend to do interesting things in that sense, but we'll roll with it. A question that I always had when I first heard about the Well-Architected Framework coming out was, "Oh great, I might be about to learn a whole bunch of things."

And I've got to be direct, I was slightly disappointed when I first saw it because it's, well, this is just stuff that is how you're supposed to build on the cloud. Who's worked in operations and doesn't know these things? And then I started talking to a whole bunch of AWS customers, and it turns out lots of them, lots of them have not thought about these things, because not everyone is an old grumpy sysadmin like I am, where you sort of learn exactly why, through painful experience, you do things the way that you do.

Seth Eliot: Yeah, exactly. Um, you know, certainly the performance pillar would play a role here in terms of this recording and the lag issue, but yeah, all the pillars are important, and they're a really great resource. Basically, they're just a set of best practices that everyone should be knowledgeable about in building the cloud.

And if you take a conscientious decision to not do one of those best practices, we want you to make sure it's a conscientious decision and not just a decision of omission. Um, what I found, though, about it, and you mentioned this when you talk about the Well-Architected tool is that people tend to think of Well-Architected as the Well-Architected tool, and it's not.

It's- Well-Architected is the best practices, and the tool is a great resource by which you could sit down with six of your friends that you work with from all different parts of the company and do a multi-hour, sometimes multi-day deep dive review going through each of the best practices and checking the boxes.

And that actually is useful if you actually have the right people there and have the right conversations. But what I was finding was people were not doing that. They were just doing it as a checkbox exercise, ticking, ticking off in an hour, and that's not useful. So when I joined, I wanted to bring the best practices to the people, democratized.

So a couple of things that I did, I led the effort to rewrite my pillar, and all my pillar leads followed suit so that it's now on the web. It's not a PDF, not an 80 page PDF, but it's on the web with hyperlinks to each section and hyperlinks to each best practice. So I want to know about just disaster recovery, I could zip right in there. And the other thing I did is, like you said, I got on stage. I wrote blogs. I wanted to get the best practices into people's hands, how they wanted it, whether they wanted to do a review or not, and reviews are useful, don't get me wrong, but whether they wanted to review or not, I wanted to make sure they had those best practices.

Corey Quinn: It always felt like something of a maturity model, where it's easy to misinterpret the, okay, what is your maturity along a variety of axes and different arenas? The goal is not to get a perfect score on everything, it's to understand where you are with these, along these various pillars, to use your term, and figure out where you need to be on those pillars. Not everything needs to excel across the entire universe of Well-Architected. If you, if you're making that decision, it should be a conscious decision rather than, "Well, apparently cost is not something we care about here after the fact."

Seth Eliot: You'll care about it when you get your bill, right?

Um, exactly. So, that's why I, I presented the best practices as I did. That's why I would talk one-on-one with customers. I mean, basically, a Well-Architected review only works well if you're having a conversation. If you're just going through a checkbox, checkbox exercise, then don't even bother. It's not going to really tell you anything.

So I decided there's too many ways to have conversations. One is with the tool, and there's another at a whiteboard, and there's another on a stage, and there's all kinds of ways to have conversations. And I'm talking about conversations with the engineers and the business, right? Like, like, if we're going to, like, disaster recovery objectives, guess what?

That's a business decision. That's not a technical decision. It's a technical implementation, but your business better, you know, you better work with them to understand what those disaster recovery objectives are.

Corey Quinn: In the time before cloud, I was- I wound up being assigned to build out a mail server for a company that I worked at.

That's where I come from is relatively large scale email systems. And my question was, "Okay, great. What is the acceptable downtime for this thing?" And their answer was, "No downtime is acceptable." It's- okay, so we'll start with five nines and figure it out from there, which you won't get in a single facility.

So we'll have that in multiple data centers because it's before cloud. That'll start at around $20 million. Um, when can I tap the first portion of the budget, which led to a negotiation? It turned out what they meant was we kind of like it to mostly be available during business hours with an added bonus outside of that.

Okay. That, that dramatically lowered the cost, but it was a negotiation with the business.
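Editor's sketch: to put a number on the "five nines" target Corey mentions, the snippet below converts a count of nines into an allowed downtime budget per year. It is a minimal illustration, not something from the episode; the function name and the 365-day year are assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability with `nines` nines.

    For example, nines=5 means 99.999% availability.
    """
    unavailability = 10 ** -nines  # fraction of the year the system may be down
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (2, 3, 4, 5):
        print(f"{n} nines: {downtime_minutes_per_year(n):.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which is why that target tends to imply multiple facilities and a much larger budget than "mostly available during business hours."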

Seth Eliot: There is kind of this lever that, you know, and then we talked about pillars being in tension with each other. There's this lever, cost versus resilience, right? And it's not always that. You can definitely add resilience without additional cost by doing certain smart things and optimizations.

But very often it's that cost resilience lever, and I try to talk to customers about it and say, you have to decide where you want that lever to be. There's no magic formula that you get, you know, lowest cost and highest resilience at the same time.

Corey Quinn: You were embedded in an Amazon team before you wound up moving over to AWS.

Were you working on the equivalent of the responsibility model or the actual responsibility model? I confess I don't know much about the internal workings of the Amazon behemoth.

Seth Eliot: I was embedded in several teams. I actually started at Amazon in 2005 when it was this relatively small company in Seattle.

I remember in 2005, like when I moved here and I'd like go to like a gym and they say, "Oh, we have a discount for Microsoft members. Are you a Microsoft member?" I said, "No, I work for Amazon." They're like, "Nah, we don't have a discount for them." Like you're too small. [laughs] So, uh, I started out on the catalog team, deep into, you know, um, the inner workings of it, and later worked on the team that launched Prime Video.

It was called Amazon Unbox at the time. Uh, if anybody out there remembers that. I, uh, then took a break for a little period of time. If you want to call working at Microsoft a break, it was actually quite rewarding.

Corey Quinn: For a brief half decade or so. Yeah, it worked out.

Seth Eliot: [laughs] I worked at Amazon Fresh, and I worked in international technology.

And then the role I was telling you about before was the most interesting one, where I was embedded as one of only two AWS-focused solutions architects in all of Amazon.com, not counting AWS, but in Amazon.com, working across Amazon teams, obviously in Seattle, but also in Tokyo, Beijing, Luxembourg, and Germany, on their adoption and modernization on AWS.

That was a really cool and interesting job, and I got to see how, you know, teams build with AWS, a wide variety. And the number one thing, interestingly enough, is that the cool thing about internally at Amazon is that cost is a first class metric. You'd be happy to know that. Everybody I talk to when I'm talking about system design and architecture, they ask about cost.

"Should we use Lambda? Lambda seems cool. I don't want to maintain servers." Okay, how much is it going to cost? We'd have to work through those numbers and make a decision. Is it worth the cost? So cost is a first class metric inside Amazon.com.

Corey Quinn: Which makes a fair bit of sense. I mean, obviously, things like performance and durability are as- reliability are as well.

The idea of having a, in fact, you folks, I don't forget if you, I don't remember the exact timing on this, so you may have been at AWS by then, but they came out with a shared responsibility model for resilience as opposed to the one for security, which I always somewhat cynically tended to view as, well, you have to put it in a shared responsibility model because when someone gets breached, you need to drag that out for 45 minutes.

You can't just say, you left it misconfigured. It's your fault. We wash our hands of you.

Seth Eliot: Well Corey, you're looking at one of the co-authors of that shared responsibility model. So there was already a shared responsibility model for security, and Alex Livingston and myself wrote a white paper on disaster recovery, which became very, very popular in late 2020 and early 2021.

And we could talk about why, but we wrote a white paper on disaster recovery, and in there, we put the shared responsibility model for resiliency, and that since has been backported into the reliability pillar, too. And it's important, actually, to say, hey, look, you talk about myths, right? There's a myth on some people's part that says, to make it more resilient, we put it in the cloud. We put it in the cloud, it's now more resilient.

That, that, you know, that is not necessarily so. You put it in the cloud and you take these steps, being those steps, being the best practices in the reliability pillar and the operational excellence pillar, and now it's more resilient. You can't just count on the cloud to do everything for you.

Corey Quinn: One of the inherent tensions behind the way to approach resilience is it cuts, in the, it cuts against the grain of the idea of the cost pillar, where you get to make an investment decision of do you invest in saving money or do you spend the money to wind up boosting your resilience, and it's basically rarely an all-or-nothing approach. But it's always been tricky to message that from the position of AWS because it sounds an awful lot like, "We would like to see more data transfer and maybe twice as many instances in different regions or availability zones. That would be, well, that's the right way to do it." It rings a little hollow though. I have no doubt that was in no way, shape or form your intention at the time.

Seth Eliot: As I said, it's sometimes a lever. I mean, but it's sometimes not like if you're running on EC2 instances and you switch over to like Kubernetes and put yourself on pods, you could be using a lot less EC2 instances and have a lot more pods in place and you make yourself more resilient.

So it's not always more cost, but if you're talking about disaster recovery, a question I used to get a lot is, "Do I need to be multi-region?" Okay, I'm Multi-AZ, multi availability zone, and for folks that might not know, that means I'm in multiple physical data centers, but I'm still within a single region.

Uh, that makes me highly resilient if I've taken, or highly available, I should say, if I've taken the right steps to architect it. Do I need to be multi-region? And when I go multi-region, yeah, I'm going to be setting up infrastructure in another region, and that's probably going to increase my costs. So when answering that question, I always ask, what risks are you trying to protect yourself against?

That's again, it's a business question, right? What risks does your company need to be protected against? If it's the risk of a fire in a data center or a minor electrical outage, guess what? Multi-AZ is fine. You don't need to be multi-region. Because again, it's multiple data centers when you're Multi-AZ.

So then people say why do I need to be multi-region? And the first thing that comes to mind is, oh, a comet, you know, wiping out the eastern seaboard, or a nuclear bomb, or, you know, something like that. And guess what? That's never happened yet. I mean, uh, yesterday's performance is no guarantee of tomorrow's returns, but no comets, no nuclear bombs.

Corey Quinn: Right. There's also the idea that, okay, assume that you do lose the entire U. S. East Coast somehow. How much are people going to care about your specific web app? I'm going to guess not that much in that extreme example. There's a question of what level disaster are you bracing for?

Seth Eliot: Yeah, we talked about that too.

Your disaster recovery strategy is part of your business continuity plan. Do you have a plan for getting workers in front of keyboards? Do you have a plan for your supply chain if the eastern seaboard is wiped out? If not, don't worry about the tech right now. It's not important. But I was standing in front of a crowd once, teaching a course, and I gave the whole, you don't have to worry about nuclear bombs for your service, probably.

And then I realized I was like literally in Washington, D.C. talking to public sector. I'm like, well, maybe some of you do. Maybe that is on your business continuity plan.

Corey Quinn: One thing that I'll see a lot when people try to go with multiple regions is that they'll very often, "Ah, we don't want a single point of failure, so we're going to use multiple regions with a second region."

And then after a whole lot of work and expense and time, they built a second single point of failure.

Seth Eliot: That's funny. But, um, but there is a reason, there is a legitimate reason to go multi-region. And you know what it is. Do you care to venture a guess on what that is?

Corey Quinn: In many cases, there's the idea of a few things.

You can separate out, uh, different phases of what's going on in your environment at different times. You are, there is a, there's a hard control plane separation. So if there's a region-wide service outage, theoretically, you can wind up avoiding that by being in multiple regions. And, of course, there's always the question of getting closer to customers, too.

Seth Eliot: Oh, yeah, well, that's not a resilience issue. That's a performance issue, and that's a legitimate reason to go multi-region. But, yeah, the number one actual risk we have seen, and this is across all cloud providers, it's not a slam on AWS, is an event in a service or in a cloud-provider-owned network.

And if you have a hard dependency on that service, or, unfortunately, if you have a hard dependency on other services that depend on that service, then you need a plan to be able to go to another region. So, I mentioned how our Disaster Recovery White Paper became really popular in 2021. That's because in December 2020, there was a Kinesis event in US East 1.

And I don't know how many people use Kinesis, uh, but that wasn't, but a lot of AWS services apparently at the time used Kinesis. So like, I think CloudWatch, I don't, I could be wrong, double check me on that, but- So, people were affected. And so if you need to protect yourself against that, and again, it's all cloud providers.

And actually, AWS likes to show data that is objectively true that they have the least number of these events of all the major cloud providers. But if you need to protect yourself against those events, you need to be multi-region.

Corey Quinn: Well, when you do talk about multiple cloud providers, there's also the question of, okay, great. Well, Amazon themselves is a single point of failure. The credit card payment instrument that I have on file for my AWS account is in fact a single point of failure to some extent. And I'll see companies in some cases storing, uh, rehydrate the business level backups in another provider, where they're certainly not going to be "active" active, but they don't have to shut down as a going concern in the event that something catastrophic happens AWS-wide.

Seth Eliot: Yeah, again, it's about risk assessment. If you're afraid of your cloud provider going belly up for some reason, I think that's a pretty, pretty low risk. But you're right. Taking the step of just doing backups of your data and infrastructure to another cloud provider without any of the operational ability to bring it up quickly.

You're just, this is the, "Oh crap," recovery scenario. I know it's going to take a long time. My recovery time is going to be extended, but this is like, I don't care that it's a long RTO because it's protecting myself against a risk that's probably never, never going to happen. That seems legitimate, but yeah, I was just simply trying to say, not protect yourself against your cloud provider, but protect yourself against an event in a region of your cloud provider.

Corey Quinn: I know of at least one company that winds up having to rehydrate the backups level of infrastructure in other providers specifically so they don't have to call it out as strongly as a risk factor in their quarterly filings. In some cases, it's just easier to do it and stop trying to explain this to the auditor every quarter and just smile, nod, and do the easier thing. It comes down to being, again, a business decision.

Seth Eliot: And that can be a fairly low effort to implement. It's going to cost you. Data transfer is going to cost you. Data storage on the other provider is going to cost you. But, yeah, unlike a full- so, you know, when I talk about disaster recovery, I adopted the model that pre-existed before me at AWS of four strategies: backup and recovery, which is the one we're talking about now, very easy to do, but very long recovery times and longer recovery points, although not always.

And then moving towards shorter recovery times, you could have a pilot light, or you could have a warm standby, or you could have an active active, which I consider to be, uh, both a high availability and disaster recovery strategy. Some of my colleagues at AWS considered it not to be a disaster recovery strategy, but that's an argument.

I dunno if it's worth getting into.

Corey Quinn: One thing that I found when I was doing my own analysis for the stuff that I built, I have an overly complicated system that does my newsletter, publication every week, and it's exclusively in US West 2, in AWS in Oregon. And the services that I use are all, as it turns out, Multi-AZ.

So, there's really no reason for me to focus on resilience in any meaningful sense, because if Oregon is unreachable as a region for multiple days, well, that week I can write the newsletter by hand because I think I'm going to have a bigger story to talk about than they released yet another CloudFront edge pop in Dallas.

Seth Eliot: No, it's part of the plan, Corey. They're bringing you down so you can't write about the outage. But honestly, um, you are resilient, you're highly available, but you're not, you don't have a disaster recovery strategy because you don't need one. So, you know, just to be clear, you know, resilience has that high availability piece.

I'm in multiple availability zones. I could, I could tolerate component level failures versus disaster recovery where I need to stand myself up somewhere else.

Corey Quinn: One mistake I see people making, especially in the multi-cloud direction, is, okay, we're an e-commerce store, so we're going to either build entirely on Azure, or we're going to go multi-cloud, so in the event that AWS has an outage, we're not exposed to it.

In practice, what they do is they expose themselves to everyone's vulnerability, unless they're extremely careful, and if they're, I don't know, using Stripe to process payments, Stripe is all in on AWS. So great, we're now living on Azure, but our payment vendor has the dependency on AWS. So when there's an actual serious outage, you wind up with dependency issues that are several layers removed.

Some cases, your vendors know about it, and in many more, they don't. So when we see things like that giant S3 issue about seven years ago, well, that's one of those things where everyone's learning an awful lot about the various interchained dependencies as you go down this path. Though on the plus side, for most of us, on that day just the internet is having a bad day, so we don't have to spend a lot of time explaining why we alone are down.

It's, there's safety in numbers.

Seth Eliot: First of all, you know, you're still talking about S3 from over a decade ago. I need to educate you on more recent events, so you have more recent stuff to talk about. But, uh, as for your point of dependencies, that's so true, because we have customers that'll look at, we used to publish, well, we publish SLAs, but those are not, you know, guaranteed numbers. Those are just a cost agreement of what any provider is going to credit you. And we used to publish the design for reliability. They since took them away. It used to be part of the reliability white paper. And you'd see the number of nines designed for EC2 and S3, and people would like to try to take those numbers and do the availability math.

Availability math says if I have redundant ones, I could like, you know, if I have two things that are four nines and I put them redundantly, then I now have eight nines. But if I have two things that are four nines and they're in series with each other, I have to multiply the availabilities times each other and I have less than four nines.

But, and, and that availability math is well known. And, and, but you try to do that, it is the way to madness, right? Because yeah, you've done that with all the components in AWS, but how about those third party providers? How about DNS? How about the internet, you know, having problems? You know, like, you're not accounting for all those other things.

So I think you're not getting a real number when you do that.
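Editor's sketch: the availability math Seth refers to can be written down in a few lines. This is a minimal illustration, assuming component failures are independent (which, as he points out, real systems rarely honor); the function names are made up for this example. Redundant components multiply their failure probabilities, while components that all have to work (in series) multiply their availabilities.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities: float) -> float:
    """Any one component suffices, so failure probabilities multiply.

    Assumes failures are independent, which is the optimistic part.
    """
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    four_nines = 0.9999
    # Two redundant four-nines components: roughly eight nines.
    print(f"redundant: {redundant(four_nines, four_nines):.10f}")
    # Two four-nines components in series: a bit less than four nines.
    print(f"series:    {series(four_nines, four_nines):.10f}")
```

The result only covers the components fed into it. Third-party providers, DNS, and the internet itself are further series terms that usually go unmodeled, which is the point Seth makes about not getting a real number.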

Corey Quinn: And even when you're doing DR testing, you have these scenarios where, okay, we've tested our ability to fail between regions, but what you haven't done is tested your ability to fail between regions when everyone else in that region is doing something similar, because it's not just you doing this at three o'clock on a Tuesday afternoon, suddenly there is a service-wide outage. We'll, we'll avoid picking on S3 further, but when everyone is starting to evacuate, you often see, like, even an older issue, we saw that with EBS volume failures in US East 1, I want to say in 2012, where suddenly there was the herd of elephants problem that we all learned a lot from.

Seth Eliot: So Thundering Herd is more of an issue if you're in an availability zone in a region and that availability zone is having issues and there's only two more availability zones to go. So everybody's going to those two availability zones. That's a real, uh, Thundering Herd issue, especially if you're looking for EC2 availability, instance availability.

Your- the type you want, if you're not flexible, might be gone by the time you get over there. When you're talking about multi-region, it's less so because, especially if you're in the U.S., you have multiple regions, four commercial regions. So A, there's no guarantee everybody's even going to the same region.

But B, most people aren't even failing over, right? Not everybody has a multi-region strategy. So we actually haven't seen thundering herd happen with multi-region failure. What we have seen happen is control plane dependency. So, I actually added this into the reliability pillar pretty late. It was like the last two best practices I added to my, you know, to the reliability pillar, which was don't take hard dependencies on the control plane if you could help it. Because the way this works is for every service you use, if it's a regional service like S3 or EC2, there's a data plane and there's a control plane in that region. The data plane is basically the stuff running all the time to service actual requests. The control plane is the CRUD operations: create, read, update, delete.

The gotcha is, if you're using a global service, like Route 53, at the time when I last was at AWS, had a single control plane in U.S. East 1. And so what happened was, I think this was a 2021 outage, maybe event, or 2022, we saw an outage in US East One. It was a network outage and it brought down the control plane for Route 53 so that people couldn't modify the Route 53 records, which was how they planned to do a failover. So they couldn't fail over. Now there are solutions to this. The solution is choose a data plane strategy instead. Since then, AWS has come out with Application Recovery Controller, which I want to hear from you, Corey, what you think of the cost benefit of that is.

It's a little spendy, but you could also roll your own Application Recovery Controller by doing something like creating a CloudWatch alarm. Connecting that to a Route 53 health check and having that CloudWatch alarm not check for health because that's not reliable, but literally check, is there an object with this name in S3?

If not, alarm. And then you could delete that object, data plane operation, the alarm will go off, data plane operation, the Route 53 health check will go off, data plane operation, and it'll swap.
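Editor's sketch: the chain Seth describes (object in S3, then alarm, then health check, then DNS failover) can be modeled in plain Python to show why deleting one object is enough to trigger the swap. This is a toy simulation only. It makes no AWS calls, and every class, function, and endpoint name in it is hypothetical.

```python
class Bucket:
    """Toy stand-in for an S3 bucket, limited to data-plane-style operations."""

    def __init__(self, keys):
        self._keys = set(keys)

    def exists(self, key: str) -> bool:
        return key in self._keys

    def delete(self, key: str) -> None:
        self._keys.discard(key)


def alarm_state(bucket: Bucket, key: str) -> str:
    """Stand-in for the alarm: it fires when the marker object is missing."""
    return "OK" if bucket.exists(key) else "ALARM"


def health_check_healthy(bucket: Bucket, key: str) -> bool:
    """Stand-in for the health check that mirrors the alarm state."""
    return alarm_state(bucket, key) == "OK"


def resolve(bucket: Bucket, key: str, primary: str, secondary: str) -> str:
    """Stand-in for failover routing: serve primary only while it looks healthy."""
    return primary if health_check_healthy(bucket, key) else secondary


if __name__ == "__main__":
    marker = "primary-is-active"  # hypothetical object name
    bucket = Bucket({marker})
    print(resolve(bucket, marker, "app.us-east-1.example", "app.us-west-2.example"))
    bucket.delete(marker)  # the operator's only action during the event
    print(resolve(bucket, marker, "app.us-east-1.example", "app.us-west-2.example"))
```

The design point is that the operator never edits DNS records during the event. The failover records and the health check are created ahead of time, and the only action taken under stress is deleting an object.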

Corey Quinn: It's very helpful. I, I do like the application recovery controller. The challenge is it starts at two grand a month, which means for small scale experiments that, that gets a little pricey just to kick the tires on and really get a lot of hands on experience with it.

But for the large scale sites that use it, it's, it's who cares money. They're thrilled to be able to have something like that. So it's just a question of who the product is actually for. On the topic of control planes, one of the challenges I've run into in the past is, it's not just, is the control plane available, but is it latent?

At some point when you have a bunch of folks spinning up EC2 instances, yeah, the SLA on the data plane of those instances is still there, but it might take 45 minutes to get enough capacity to spin up just by the time that your request gets actioned.

Seth Eliot: Yeah, and that's taking a dependency on a control plane.

Even, even if you're Multi-AZ, if your plan is I need to use auto scaling to spin up EC2 instances in the two remaining healthy availability zones, that's control plane. If you want to avoid that, you need to be statically stable and have capacity. If you're across three availability zones, then having full capacity in two of them means you're 50 percent over provisioned in a given availability zone.

Math works, right? So that's something you have to be willing to pay for. Or take the dependency on the control plane and it'll probably work. But you're, you're, you're taking on more risk. And this, again, is driven by business need.
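Editor's sketch: the 50 percent figure follows from requiring that the surviving availability zones carry the full load without scaling up. A minimal illustration, with a hypothetical function name and capacity measured in arbitrary units:

```python
def static_stability(required_capacity: float, zones: int, zones_lost: int = 1):
    """Capacity to pre-provision so the load still fits after losing `zones_lost` zones.

    Returns (capacity per zone, total provisioned, over-provisioning fraction).
    """
    surviving = zones - zones_lost
    if surviving <= 0:
        raise ValueError("at least one zone must survive")
    per_zone = required_capacity / surviving  # survivors must carry everything
    total = per_zone * zones                  # but every zone is provisioned that way
    overprovision = total / required_capacity - 1
    return per_zone, total, overprovision


if __name__ == "__main__":
    per_zone, total, extra = static_stability(required_capacity=300, zones=3)
    print(f"per zone: {per_zone}, total: {total}, over-provisioned by {extra:.0%}")
```

With three zones and a need for 300 units, each zone holds 150, for 450 in total, which is the 50 percent Seth mentions. The same arithmetic with two zones gives 100 percent, which is one reason spreading across more zones makes static stability cheaper.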

Corey Quinn: If you were to take a look at the entire resiliency landscape, as my last question, this is something I'm deeply curious about.

What do you see people getting wrong the most that you wish they wouldn't in 2024?

Seth Eliot: I think in general, what we're looking at is people not understanding how the cloud is different, I think, when they're moving from on prem. I'm not talking about your mature folks in the cloud, but folks looking to adopt cloud for the first time.

It needs to be explained to them that an availability zone is not only a data center, it's multiple data centers. And the other availability zone, guess what? That's a completely separate set of data centers. So like if your on-prem strategy is to be in two data centers, whoo, two, that are like 400 miles apart, and that's a really far distance, so there's no chance those two are going to be affected by each other, or even a thousand miles apart. Guess what? When you move to AWS or, ya know, at least with AWS's availability zone model, your two availability zones are not 400 miles apart, they're only between 10 and 30 miles apart. But AWS has put in a lot of effort, and I've seen some of these reports. I've seen the reports include the geological survey and the floodplain analysis, so that these availability zones are not sharing the same floodplain, and that if any disaster happens, it's unlikely that it should affect more than one availability zone. So guess what? You don't have to be a thousand miles apart.

You don't have to be 400 miles apart. Being 30 miles apart is giving you almost that same benefit. Now, let's talk to your regulator, your auditor, and convince them of that too, so you don't have to set up in another region a thousand miles away.

Corey Quinn: I really want to thank you for taking the time to speak with me about all this. Uh, given that you're currently on the market, if people want to learn more or potentially realize that, "Huh, we could potentially use a cloud architect with a resiliency emphasis where we work." Where's the best place for them to find you these days?

Seth Eliot: Well, I mean, I'm, you know, just search for me on your, on your, on your, uh, search engine of choice.

Seth Eliot, E-L-I-O-T. One "L." One "T." Throw AWS on the end of that, you'll probably find stuff related to me, especially my LinkedIn account. That's a good way to reach me.

Corey Quinn: Awesome. And we will, of course, put links to that in the show notes. Thank you so much for taking the time to speak with me. I really appreciate it.

Seth Eliot: Oh, thank you. It's been a pleasure.

Corey Quinn: Seth Eliot, Principal Solutions Architect, currently between roles. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a 5 star review on your podcast platform of choice, along with an angry, insulting comment because despite being in four different regions, you didn't take all of the control plane access away from Dewey, who pushed a bad configuration change and brought you down anyway.
