Evolving, Adapting, and Staying Prepared with Brian Weber

Episode Summary

Ever wondered how Corey got to where he is today? You have Brian Weber to partially thank for that. On this episode of Screaming in the Cloud, Corey catches up with his old friend and mentor to talk about the ever-evolving world of tech. Brian’s been around the block a time or two having done significant stints at Pinterest, Facebook, and Twitter (during the Elon acquisition no less)! As Corey and Brian catch up, you’ll hear them chat about the importance of empathy, coaching the next generation of tech workers, and their conspiracies surrounding Google and Kubernetes. So grab your tinfoil hats, it’s time to go Screaming!

Episode Video

Episode Show Notes & Transcript




Show Highlights
(0:00) Intro
(0:53) The Duckbill Group sponsor read
(1:27) When Brian took Corey under his wing
(3:21) Brian's experience coming to the cloud as an engineer
(7:24) Why it's important to reinvent yourself in tech
(8:54) How Brian reacted to the industry adopting Kubernetes over Mesos Marathon
(10:31) Kubernetes conspiracy theories
(12:30) The importance of empathy in tech
(15:46) Trying to advise younger generations entering tech
(19:19) The Duckbill Group sponsor read
(20:02) Working at Twitter when jobs started getting cut and the site frequently went down
(22:41) The best way to navigate certificate expiration
(26:08) Talking about "The Golden Path"
(28:52) Why you should always plan ahead in tech (and life)
(34:21) Where you can find more from Brian


About Brian Weber
Brian is a former FedRAMP DevOps Engineer for Coralogix. He's also been a Site Reliability Engineer at Twitter, Pinterest, and Facebook, where he has maintained large installations on-premises, building reliability, security, and developer efficiency. In his spare time, Brian skis, knits, cycles, bakes, and tries to spend as much time outdoors as possible.


Links


Sponsor
The Duckbill Group: duckbillgroup.com

Transcript

Brian Weber: And that's exactly how SRE generally works in my mind as well. You're not building something for the normal day-to-day. Actually, no, that's not true. You're building stuff for the normal day-to-day. But you are also building stuff for the day when everything catches fire.

Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn, and I've been trying to get a particular person on this show since its very inception. Brian Weber, currently between jobs, was a formative influence on my early career that started to look a little bit vaguely like software engineering. Brian, thank you for your ongoing patience and willingness to subject yourself to my tomfoolery yet again.

Brian Weber: Oh, your tomfoolery is always amazing. Did you just call me a mentor?

Sponsor: This episode is sponsored in part by my day job, the Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things.

Predicting what it's going to be. Determining what it should be. Negotiating your next long-term contract with AWS. Or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillgroup.com. Remember, you can't duck the duck bill, Bill.

And my CEO informs me that is absolutely not our slogan.

Corey Quinn: I had no idea what I was doing many years ago when I was working for a large consulting firm, and you were working at Pinterest at the time. They parachuted me into this environment because I was personable, for lack of a better term, and at the time, Pinterest had a very weird technical vetting process for consultants, so they needed someone who could ostensibly do the work, but also be gregarious and talk their way through the process. This was many years ago. The consulting company no longer exists after being bought by IBM.

I don't think I'm spilling any tea here, but at the end of it, I was brought in to write a bunch of tests for Puppet code as part of a long-stalled Puppet 3 migration, if memory serves. I had no idea what I was doing. Ruby was a precious stone to me, not so much a programming language, and you took me under your wing for about a month and a half, and it resonated.

Thank you for doing that.

Brian Weber: Well, thank you so kindly. I do remember you had, I believe it was an Ansible sticker on your laptop, and you told me that you made a very clear point of not adhering a laptop sticker until you'd actually contributed to the source repo.

Corey Quinn: It would have been SaltStack then, not Ansible, because I still haven't dared to touch Ansible with my hands.

Brian Weber: Oh, the other Python.

Corey Quinn: Exactly. The one that basically was frozen in amber forever, and then achieved its final form of all software projects that have run their course, getting acquired by VMware.

Brian Weber: Yeah, that's funny. I have a very good friend who basically soft-retired when VMware got bought out by Broadcom.

Corey Quinn: Mm hmm. A lot of folks have that story.

Brian Weber: Oh, yeah. It's kind of funny how everybody takes layoffs just a little bit differently, you know? Just like me and all my various layoffs, like, you end up staying in touch with some friends. And maybe not. I don't know. And some people get angry and bitter and others are just like, "Woo hoo! I can do what I want now. I have severance."

Corey Quinn: So, I do want to talk about your technical evolution because you are something of a rarity in that you were, for years, over at Facebook, which they've since renamed to something dumb, but they'll always be Facebook to me. You were briefly at Pinterest, coincided with my time there, and then you decided to spend the next seven and a half years over at Twitter.

Yes, we're still calling it Twitter. Now, what makes that interesting is that Pinterest was sort of the departure from the other two, because neither Twitter nor Facebook, at least at the time, were large cloud shops. They weren't running Kubernetes. You, in fact, called yourself Mr. Mesos at one point, or Mr. Marathon. I forget what it was, but you were effectively responsible for the care and feeding of that particular orchestration system while you were at Twitter. So you have found yourself in this interesting scenario where, despite the fact that this is where the zeitgeist has gone, you hadn't done a whole lot of cloud work until your most recent gig over at Coralogix, where you were focusing on FedRAMP.

So one could argue that, well, is GovCloud really cloud at all? The jury is still out. But what's it like coming to cloud as someone who's very competent as an engineer, but who just has found themselves in a situation where, until recently, you never had to touch it?

Brian Weber: Well, you know, you could consider that. I don't know, what happens when you put a cloud in a bottle or a room?

It's kind of like when you go to a bar and they have the Smoked Manhattan, you know, and it's there and it's pretty. And then you open the bottle and all the tendrils... Anyway, I'm bad at metaphors today. It's early yet. By the way, Happy New Year. We're recording this shortly into the new year.

Corey Quinn: It is the second of the year. Yes.

Brian Weber: Anyway, in some ways it was just like being dropped into literally any other environment where, you know, you don't know anything. You don't know what's going on, you don't know how all the pieces glue together. But it was a lot more challenging because of a lot of the facets that I do know. Like, I know, you know, how a kernel works, how all of the modules work, how systemd works, how to strap things together.

You know, when do you need to disable SELinux permissions to make...

Corey Quinn: Always.

Brian Weber: ...things talk to one another? Oh, you're funny.

Corey Quinn: setenforce 0. It's a way to live.

Brian Weber: There you go. Well, it's a way to live. But anyway, it was a very different environment. You know, when you spin up, say, a web server on a siloed-out host, you spin it up, you access it, you see, "Oh, this is cool."

And then you start putting up walls to protect it. When you spin up an instance for the very first time in a kube cluster, in an AWS cluster, you can see that it's running, but it is very much behind the phalanx. You know, all of those protections are saying, yes, your service is running. Come and get it.

Which is often a challenge when you don't know how to do things, simple things like properly open up a port and make sure it stays open and reopens the next time the service runs. How to hack and slash your way through all of the VPC rules and whatever other rules randomly appear in the way when you don't know.

Now, I spent, you know, a good 10 months trying to figure all that out. And luckily, I was in an environment where there were parallel running environments. And you learn that the various differences basically just come down to the ARNs being different. You know, when you look at an ARN, it has "aws" right in the middle of everything, and you then have to change it to "aws-us-gov."

Corey Quinn: Yeah. That's a different partition, to use their nomenclature.

Brian Weber: Right? Because it is a common assumption in all of your templating. And so I had to go and hack and slash through so many YAML configs and Terraform configs. You know, and we can sit here and talk about, you know, how fascinating and interesting it is that all the stuff glues together, but at the end of the day, we're all just monkeys scratching our heads, looking at code saying, where the hell is this config and why doesn't it do what I want it to do?
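The partition difference discussed here can be made concrete with a small example. What follows is a minimal Python sketch, not the tooling Brian used: the helper names build_arn and with_partition and the sample account ID are hypothetical. It shows why a template that hardcodes "arn:aws:" breaks in GovCloud, where the partition segment is "aws-us-gov".

```python
# Hypothetical helpers for illustration only; these are not part of any AWS SDK.

def build_arn(partition, service, region, account_id, resource):
    """Assemble an ARN from its colon-separated parts."""
    return ":".join(["arn", partition, service, region, account_id, resource])


def with_partition(arn, partition):
    """Return the same ARN with only the partition segment replaced."""
    # Split into at most six fields so colons inside the resource survive.
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    parts[1] = partition
    return ":".join(parts)


# Sample values are made up; IAM ARNs leave the region field empty.
commercial = build_arn("aws", "iam", "", "123456789012", "role/example")
govcloud = with_partition(commercial, "aws-us-gov")
print(commercial)
print(govcloud)
```

Passing the partition in as a parameter, rather than baking "aws" into every template string, is one way to avoid the hack-and-slash pass through YAML and Terraform configs described above.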

Corey Quinn: It's the same thing that brought me to consulting in that I was always parachuted into environments where I didn't know what the hell was going on. In order to succeed in those environments, to my mind at least, you've got to have a strong grasp of fundamentals. Okay, I don't know how this particular system works. However, I know the Linux system internals well enough to know that it should be doing this. Okay, if it's doing this, that means it's making this other call.

It's not doing what I would expect, so what do I not understand fully? Then it's diving deeper and dismantling it into bite-sized problems. Which is why when people ask, "Oh, what technology should I learn?" it almost doesn't matter. If you're entering the field now as a new graduate in your early 20s, the technology you're going to be running by the time that you're my age, in my mid-to-late 40s, is no longer going to be the same thing.

You have to reinvent yourself, you have to understand how this stuff all ties together. So I like the foundational things that are likely to remain constant for, well, at least the rest of my life.

Brian Weber: Oh, I remember when there were cries and moans when the environment I was in at the time, which I'll leave nameless, was migrating from CentOS 7 to CentOS 8 because of the whole stream model. What are you doing to my RPM delivery system? How does this work? And you look under the hood and it's really just the same. It's just packaged slightly differently and branded differently, and it works the same. It's just, they figured out ways to smooth off some of the rough edges. So, if you're sitting there saying, oh my goodness, I can't handle change, then what the hell are you doing here?

Corey Quinn: That's one of the areas I wanted to dive into with you, because I wasn't kidding when I said you used to be the Mesos Marathon guy for a period of time. The industry collectively took a vote, and Mesos Marathon did not win, Kubernetes did. How did you react to that?

Brian Weber: Well, at the time, when I was still at the company formerly known as Twitter, we had talked a lot about whether we should spin up Kubernetes.

When the decision came through that we should, we did it in a very slow and piecemeal manner. And in my opinion, I felt it was a little bit too slow. We spun up sample environments in GCP. We even had acquisitions that were in AWS that we were still operating in AWS, just because the migration just didn't make sense.

It was well entrenched. It worked properly. Why the heck not? Leave it there. And so we actually had reasonable brain trust around this stuff for a while. Where we ran into a lot of trouble was spinning up Kubernetes internally on our own bare metal infrastructure. You know, not the least of that, as I'm learning now as I set up my own home lab, setting up Kubernetes on your own bare metal infrastructure is a pain in the ass.

Corey Quinn: Oh yeah. I did it a year ago, almost exactly, where I spun up a Kubernete in my spare room running on top of K3s and some Raspberries Pi, and sure enough it was, oh, okay, this makes sense. Kubernetes lets you cosplay as your own cloud provider. I sort of get it now.

Brian Weber: Right.

Corey Quinn: But yeah, I'd forgotten all the obnoxious hardware bits that the cloud has gently abstracted away in the intervening years.

Brian Weber: Oh, I don't even think it's the hardware bits.

Kubernetes makes a point of not making it easy. I wonder if they're just in collusion with the cloud providers to say, here, we're going to escort you on the way so that way you can earn all this money and then pay the CNCF a bunch of money so that way we all get rich.

Corey Quinn: My tinfoil hat conspiracy theory remains that Kubernetes is how Google decided to get the rest of the world to write software more like Google does, because without that, Google Cloud was never going to work as a cloud provider for a lot of these workloads.

So it, it works super well. They sort of lost control of it and they don't get to drive it anymore the way that they once did, but I'm not entirely convinced I'm wrong.

Brian Weber: Well, you know, that same model worked for Google and search, you know. They got everybody in the world to change how they wrote webpages, how they structured webpages.

Buying into the AMP project. All that stuff is all because Google said we want it this way, and everybody wanted some of that sweet, sweet search results and figured out how to do it. And now, as a result, when you go to a webpage to look for a recipe for peanut butter brownies, you have to read a 10 page diatribe about somebody who grew up in Oklahoma, all because they need those keywords in order to come up in the search rankings and potentially get affiliate links, which makes the experience of a human reading a web page suck.

Corey Quinn: And some of the better sites now have the "jump to recipe" button at the top because they know what's up.

But at the same time, it's why do we go through this ridiculous theater piece?

Brian Weber: Because it's what our Google overlords built for us, you know, and now we experience that in cloud factories because we get to play with Kubernetes.

Corey Quinn: How lovely of them for doing these games. It's always appreciated.

Brian Weber: Well, what can I say? It's, you know, it makes our lives relatively easier as opposed to when we had to thumb through recipe cards and when I could just, you know, bootstrap install, you know, whatever OS I felt like at the time and get something running at home.

Corey Quinn: It's a reasonable approach to take. But I guess what I'm curious about is how you perceive that shift, though. Because I've met an awful lot of technologists over the course of my career who start to identify themselves by the technology upon which they're working. And I'm not immune from this. I think of myself these days as an AWS guy, to some extent. And before that, I was an email systems guy. And reinventing the way that you perceive yourself is never easy.

Brian Weber: You know, I still perceive myself as somebody who just, like you say, and like you do, parachuted into a site, tried to figure out what was wrong, and mostly just tried to make things better for the other people running it. Because I've said this before a thousand times, and I'll say it again, software is made of people.

We are all here together, and we do what we do as a collective. You know, open source projects, yes, there's occasionally the one lone guy in Nebraska, a la XKCD, who's maintaining a very important core project, but a lot of projects out there and a lot of company and, well, all companies out there are building it as a group, as people, as many people.

If we can make that experience for our peers, for our colleagues, for, you know, whoever you're working with better, then we all get better at writing the software, at building the systems, at making things better. So, that's what I pride myself in and that one thing has never changed for me. You know, I've picked up multiple languages. I've dived into multiple different environments. I'm comfortable in multiple operating systems.

But the reality is that we're all people. We all do what people do. And if I can at least just be empathetic and be as human as I can and try and understand that you're human too, you just want to read a simple doc that tells you how to start and stop the service. You just want to read a simple dashboard that can tell you what's wrong and you don't want to get paged in the middle of the night at something stupid and pointless that had no reason to page you.

Every human wants that. Every human engineer wants that. I mean, granted, there may be exceptions to that case. I have known masochists who just want to alert on everything because they don't know what's going on and they'd rather be woken up and find out, and 90 percent of the time, they wake up. They look at the alert.

They say, oh, this is nothing. They crush the alert, they go back to sleep, and then the next person comes on call and goes what the holy hell. And I care about both of those people similarly.

Corey Quinn: I think empathy is one of those core attributes of being a competent technologist, and I have no idea how you teach it. I feel like it's something you either have or you don't.

Brian Weber: I feel like the significant bulk of us have it. We just don't often know what to do with it. You know, sometimes we learn how not to be empathetic. Sometimes we're psychopaths, and we just innately don't have it. But I believe those are the exceptions.

You know, in reality, we're all empathetic people. And if we can tap into that empathy and help make other people's lives better as a result, then that's what we should be doing.

You know, this is in part why up here in my small town, I tried to help start a tech meetup out here because there's so many people around here.

There's a local university, a local community college, and a whole lot of other people who are just career changers who are just interested in trying to learn about the technology, not only because they find it fascinating, but they see it as a career path forward. Hopefully as long as AI doesn't destroy everything.

Corey Quinn: I used to be fairly active in, I guess, helping the next generation figure out how to navigate the world of tech. And I've gotten away from it just because it's been so long since I was new to this space that I worry I would give boomer-tier advice of, "Oh, just have a strong handshake and walk in with a resume printed on nice paper, ask to speak to the owner, and you'll have a job by dark," which does not work.

I don't know how to get started in technology in the current system. I know a lot about how to get started in technology in the early 2000s, but that apparently is not a highly useful skill.

Brian Weber: No, absolutely not. Although traces of it still are. Like, you know, yes, you can't just walk in and be bold, but having a level of confidence exudes, and it shows to the other people you're talking to. When you're talking to a recruiter, when you're talking to a hiring manager, if you can say, hey, I may not know everything, but I know how to do these things well.

And I know how to figure out what I don't know. And it's funny because one other person in our little group here of my local meetup has finally achieved something that I had been hoping for. And of course I'm leaving location out, I'm leaving people nameless and all that to protect the innocent. You know, this young man had been doing hack jobs on Fiverr to try and boost his skills on top of working a simple retail job and got enough chops together after a while that he cleared an interview for a local company.

Now it's not that huge, you know, it's writing some JavaScript tests, but it's a start. And if that's what gives him the foot in the door that he needs to build a career, then I feel 100 percent vindicated in everything that I've ever done to try and build a community out here.

Corey Quinn: What worries me is the future of that story.

When I first played with ChatGPT and it spat out a quick hacked-together script to query NAT gateway prices across different AWS regions, the response that I got instantly from a couple of senior devs was, "Oh, well, this is fantastic, but it's only for junior dev work. It'll never take the place of a senior engineer." Great. Where is it that you believe senior engineers come from?

You didn't just show up one day knowing all the stuff that you know now, it was incremental. What does this mean for the next generation? And people don't really have a good answer for that yet.

Brian Weber: No, nobody has the crystal ball right now, unfortunately. And I wish we did, because I'd love to be able to say, here's what's coming.

Now, I have high hopes that we're still going to need humans in order to actually build large systems, because large systems are not easily intuited. You know, as much as other talking heads out there would like you to believe, "Oh, Twitter's just, uh, small globs of characters ordered in a timeline, right?"

Corey Quinn: Twitter sounds like the easiest problem in the world. Oh, I could build that in a weekend until you actually think about it for 30 seconds.

Brian Weber: Well, you could build it in a weekend to serve like 10 users.

Sponsor: Here at the Duckbill Group, one of the things we do with, you know, my day job, is we help negotiate AWS contracts. We just recently crossed five billion dollars of contract value negotiated. It solves for fun problems such as how do you know that your contract that you have with AWS is the best deal you can get?

How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth? To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com.

Corey Quinn: I have a question about Twitter, since you were there during the acquisition for a bit before the fall. A lot of us in this space predicted that Twitter itself would basically fall over one day and have a lot of trouble getting back up, and that never happened. And we didn't talk to you folks at the time because we didn't want to compromise any of the folks who were working there and trying to hold onto a job.

Do you have any insight into why that might have been? Like, how did we all get it wrong? I almost want to do a post mortem on how the SRE community got it wrong.

Brian Weber: I'm in a little chat group with a bunch of other former SREs from Twitter, and we have talked about this a time or two, and we attributed that a lot to the work that we had done.

Because those of us who are SREs, we don't just think about, you know, what's going on right now. We often think about what's going on in the future. How do I make sure my service doesn't completely tip over? You know, the first hammer fell and cut off half the company and then another half of the remaining company all right in November before the holidays.

And I believe it was the week between Christmas and New Year's that Elon said, "Oh, we don't need Sacramento Data Center."

Corey Quinn: And you would expect that to end as hilariously as it sounds, but somehow they pulled it off.

Brian Weber: Somehow they pulled it off. Now, granted, Twitter had for, you know, that whole year of 2023 stumbled a lot.

The down detector had been going bonkers on Twitter. Things had been falling over. You know, the site just didn't always want to work. So, I attribute it partly to the work that we had done to shore up the service for the long term. Now, the other thing that I can think of is maybe just the reduced user count, because I know people had been leaving the site in droves.

But, I don't know, I honestly haven't looked at, you know, any, whatever stat count to see what the daily active users are, what the, of course none of that stuff is public anymore because they don't have to report to the SEC anymore because they're a privately held company. Thanks, dudes!

Corey Quinn: A lot of it does make sense, in that when I was building systems, I always wanted to make sure they were well documented and the interfaces were easily understood by basically a complete fool, by which I of course mean me six months from now, in the middle of the night, having just been woken up by something unfortunate, and not really firing on all cylinders.

So you want to make these things easily understood.

The idea that Twitter learned pretty early on in the course of its life was one of graceful degradation. Instead of showing the fail whale when things started breaking, okay, maybe you just don't reload the timeline as rapidly, or you put the eventual in eventual consistency.

That tends to be a failure mode that is less noticeable, and it stops treating the service as a binary, is it up or is it down, and instead asks how down it is. Once you unlock those graceful degradation modes, that's kind of awesome. I'm still surprised there weren't a whole bunch of issues that coincided with certificate expiries and whatnot, but apparently there's still enough talent left there to keep the lights on.

Brian Weber: I'm glad you mentioned certificate expiries because that's what I worked on.

Corey Quinn: Yeah.

Brian Weber: You know, I was on that team for, I want to say, like four years, I think, where we managed the distribution of internal certificates and public PKI and all that stuff.

And we automated the shit out of that.

Corey Quinn: It's the dumbest outage in the world because it's highly visible that there's a certificate that just expired when someone can get to it with their browser. It's one of those things of, you should have known this was coming. We have this fancy technology called calendar reminders.

So the idea of automated certificate renewal is huge. I think it was a poor decision in the '90s to have a certificate that expired 15 minutes ago have the exact same failure mode as a man-in-the-middle attack, but that's a battle long since lost.

Brian Weber: Well, it was also relatively simple to just say, you know, a Java application loads the file on disk at start time.

So at that point, you can do whatever you want to the file. So we had automated systems that just went in and said, that cert is due to expire in X amount of time. Let's just snap it up. All right. So you'd have X number of days before it expired, and service owners should theoretically know: restart your service within X number of days and life's good. Now, what you can do is have a failure state that says, "Oh, I've never restarted my service. But this cert's expired. Maybe I should die." And then it dies, and then whatever container system you're using restarts the service for you because the service died.

You do that, and then voila, automation happens. These are the kinds of things that we thought of collectively at Twitter for years in order to keep things up and running smoothly. So that way, as much as possible, all of the pain in the butt things that everybody had to deal with could just be on autopilot.

Oh, I just restart my service. Cool.
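The restart-on-expiry pattern Brian describes can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not Twitter's implementation: the function names, the 30-day renewal window, and the use of plain datetimes in place of parsed certificates are all hypothetical.

```python
import sys
from datetime import datetime, timedelta, timezone

# Assumed value for illustration; the conversation only says "X amount of time".
RENEW_WINDOW = timedelta(days=30)


def needs_renewal(not_after, now, window=RENEW_WINDOW):
    """Automation side: replace the cert file on disk once expiry is within the window."""
    return not_after - now <= window


def loaded_cert_expired(loaded_not_after, now):
    """Service side: has the cert that was read at start time expired in memory?"""
    return now >= loaded_not_after


def watchdog_tick(loaded_not_after, now):
    """Exit nonzero on expiry so the container system restarts the service,
    which then loads the already-renewed file from disk."""
    if loaded_cert_expired(loaded_not_after, now):
        sys.exit(1)


# Made-up dates: the loaded cert expires 18 days after "now".
now = datetime(2025, 1, 2, tzinfo=timezone.utc)
loaded = datetime(2025, 1, 20, tzinfo=timezone.utc)
print(needs_renewal(loaded, now))        # expiry falls inside the 30-day window
print(loaded_cert_expired(loaded, now))  # the in-memory cert is still valid
```

The point of the design is that the renewed file is already on disk well before expiry, so a service owner who never restarts still recovers automatically: the process exits, the orchestrator restarts it, and the fresh certificate is loaded at start time.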

Corey Quinn: It's the right approach. It's why I love what Let's Encrypt has done, where the maximum cert validity is 90 days. Because people go through an outage like that and say, "Oh crap, let's build a cert that has a 10-year expiry," which I understand from a human perspective. This was painful.

Let's make sure we don't have to deal with this again anytime soon while we're rotating it. But when you have, like, a wildcard cert, God help you, that is good for the next 10 years, you'll never be able to trace all the places that it winds up in the next decade. So when that does hit expiry, everything is going to break and it becomes a massive issue. Whereas if you do the painful and scary things more frequently, it makes them routine. Yeah, I have a bunch of systems now that auto-roll certificates programmatically, and I never have to think about it until and unless I'm doing something clever.

Brian Weber: Yeah, well I know a lot of people have talked about "the golden path." "The golden path" being where you want everybody to go in order to get to that destination. That destination being a running service that makes us all money so that we can all pay our rent and eat food. So, if you make that golden path as easy to walk as possible, then people will naturally go there.

You know, and I say that knowing full well, that's one of those Pareto principle things you run into. Because multiple times in my career, I have run through mass migrations where I chased down large numbers of people at a large company in order to get them to do a thing, you know, here, this is going to take you two hours to do.

It's going to take you 10 minutes to do. We just need you to do it. I will show you how to do it. I will do it for you. If you're willing to let me. So on and so on. You know, the bulk of people are just like, "Oh, cool. We love it. Sure." And then you get to that last 20%. And even worse, you get to that last 3%.

And those last people are like, "Uh, you want me to restart? I'm not sure we know how."

Corey Quinn: That's one of the things I learned from my Kubernetes, because it's, okay, great. I have a bunch of Raspberries Pi plugged into the same power supply, and when that thing gets jostled and loses power, okay, how do you safely bring up an entire cluster?

We didn't think about that, because why would you ever turn off cloud instances all at once? Oh no. Oh dear. Because my first, again, this comes back to the ancient sysadmin wisdom of once I had my cluster built out, one of the first things I did is I yanked the power cord out of the back of one of the nodes like I was rip starting a lawnmower just so I could see what the recovery looked like.

And it turns out, without a lot of extra work, it just never comes back. Which, okay, that's a little disturbing. It all comes down to Longhorn, the disk system I'm using, because EBS is a marvel that people do not give enough credit to. Because managing disk volumes in a distributed fashion is super hard.

Brian Weber: And this is why people pay AWS, GCP, and Azure tons of money. Tons of money. Because managing Kubernetes sucks on its own. Managing an EBS, I 100 percent agree, sucks even worse. At least for home labbing stuff, you can do a TrueNAS, which has all the right APIs for doing that, which makes that a lot easier.

Corey Quinn: Oh yeah, there are a lot of options you have, but it's also stuff that I run that is only production-adjacent. Like, my RSS reader lives on top of this thing. My change detection bot, which winds up checking different websites to see whether things have changed and showing me what happened. I have a bunch of container stuff that I've thrown together in here, but if the entire thing blows up and falls into the sea, I still have a bunch of options that do not preclude me from getting my work done.

Brian Weber: Yes, yes, yes. I get that, you know? It's funny, I think about this in the real world too. I have a pantry full of home canned soup. No, I'm not a super prepper, I just like doing it. But it's great because, you know, where I live, it can get inclement weather. So, if the roads shut down, I have four days worth of food in the house just in case.

And this was just because of how I learned how to live growing up. You know, I grew up in another mountain town, and the roads would routinely close. So we would have, routinely, a couple weeks of food in the house, and if the power went out, we could pull out a camp stove and warm up a can of soup. I just like homemade soup better than Campbell's.

Corey Quinn: It's, it's the right answer. I wish more people thought about these things and did a little bit of planning ahead. Like, oh, they start forecasting inclement weather. You don't need to do a run to the store with everyone else necessarily.

Brian Weber: And that's exactly how SRE generally works in my mind as well. You're not building something for the normal day-to-day. Actually, no, that's not true. You're building stuff for the normal day-to-day, but you are also building stuff for the day when everything catches fire. A lot of work that I did on a lot of my different teams and products that I've worked on was not just to say, okay, everything is burning to the ground. How are we surviving?

A lot of what I have done is saying, let's make deploys easier so that we don't have to think about it. So one thing that's kind of on my brag sheet is I worked with a couple of different teams, both my own team and the core services team at Ye Olde Twitter to help build out a process for continuously deploying an RPM.

Now, this is often not something you want to do in production environments.

Corey Quinn: Not without some gating or some really great automated testing.

Brian Weber: Oh yeah, and that's what we did. We made sure that we had good process for gating and versioning, for easy push button rollback, for hard versioning, because originally my first, my first version of this was just saying yada da latest, whatever.

Which, uh, is never a good scenario, so why the hell are you doing it with your RPMs? So, we came up with this process. We pinned the version into a Hiera file for Puppet, we read that out of a config file from elsewhere, so that way another automation surface could stamp it in, and tied it all together in a Jenkins script that would then pull all the right stuff together, auto stamp a version, and then ratchet up a FQDN hash percentage number.

So that way you could say, let's roll this new version to 1 percent of the fleet and see how it does. Let's roll it to 10 percent of the fleet and see how it does. And once we got that machine well oiled and well lubricated, and mind you, this was a process that took, like, maybe three-to-six months to build on top of doing other things.

And then another three-to-six months to gain enough confidence in it that we could just pull the brakes off and just say, let's let it go. And the biggest noise that we ran into was that we would ratchet the version forward faster than all of the RPM masters could sync. So occasionally, a puppet run would go through, would talk to a YUM mast, a YUM repo that didn't have the newest version because we literally just shoved it out there, and we actually got some feedback from the team that managed that saying, "Oh yeah, we are having some problems with a couple of these." And I said, what can I do to help? And he said, "well, maybe don't roll out so fast." So I added extra steps to then say, let's now roll through and look through and see, did all the YUM repos sync?

Because you could just probe it all, you know, in a loop, and then just come back and wait a minute and probe it all, blah, blah, blah, blah. Minutia, minutia, minutia. We got it working. And again, software is made of people. I was able to do that because I had good relationships with the people on my team and the people on those other teams, so that we could talk about these things like humans.
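Editor's note: the two mechanics Brian describes above, rolling a version out to a growing percentage of hosts keyed on a hash of each FQDN and waiting until every repo has the new version before ratcheting forward, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tooling used at Twitter: the function names, the use of SHA-256, and the caller-supplied `has_version` probe are all hypothetical, and the transcript does not say how the repos were actually checked.

```python
import hashlib
import time
from typing import Callable, Iterable


def rollout_bucket(fqdn: str) -> int:
    """Map a hostname to a stable bucket in the range 0..99.

    SHA-256 is an assumption; any stable hash would do. The point is
    that the same host always lands in the same bucket.
    """
    digest = hashlib.sha256(fqdn.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(fqdn: str, percent: int) -> bool:
    """Return True if this host should receive the new version.

    Raising `percent` only ever adds hosts, so a host that was in the
    1 percent group is still included at 10 percent.
    """
    return rollout_bucket(fqdn) < percent


def wait_for_repo_sync(
    repos: Iterable[str],
    has_version: Callable[[str, str], bool],  # hypothetical probe: (repo, version) -> bool
    version: str,
    attempts: int = 10,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe every repo in a loop until all of them report `version`.

    Returns True once all repos have synced, or False if some repo is
    still lagging after `attempts` rounds of probing.
    """
    repo_list = list(repos)
    for attempt in range(attempts):
        lagging = [r for r in repo_list if not has_version(r, version)]
        if not lagging:
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # wait a minute, then probe it all again
    return False
```

In a pipeline along these lines, a deploy job would call `wait_for_repo_sync` after publishing the package and only then raise the percentage that `in_rollout` is evaluated against, which is one way to avoid a host pulling from a repo that has not yet received the newest version.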

Corey Quinn: Which is a reasonable and grown up way to approach it.

Brian Weber: Yeah, because it's one thing to walk up and say, I don't give two shits about what your job is, I have to get this done. Which is not the way.

Corey Quinn: That's not how you win friends and influence people.

Brian Weber: No, it's not. Instead, you walk up and say, "Well, I'd like to get this done, how do you think we can do this?"

You know, I'm here playing in your pool. I don't want to pee in your pool. I want to do this right.

Corey Quinn: Exactly. With the unspoken thing being, look, at some point this has to get done. And so you, you either have to, at some point, lead, follow, or get out of the way. I would love to collaborate with you on this for a better outcome for everyone.

Brian Weber: Right. And at the end of the day, this can be copy pasted out to make everybody else's life easier. You know, lots of carrots, lots of hugs and lots of, you know, golden stars and all that, you know, the stick may be back there somewhere else. But don't even think about it. You know, be people, be human. We all, we're all here to just take care of each other.

So let's do that.

Corey Quinn: I want to thank you for taking the time to chat with me about all this. If people want to learn more about what you're up to, where's the best place for them to find you these days?

Brian Weber: I feel like I should re step up my social media game because I was a lot more active on ye olde Twitter before it became something not Twitter.

Corey Quinn: I have migrated entirely to Bluesky, and it's like Twitter of old in a lot of ways. It's great.

Brian Weber: That's what it looks like. All right. Well, in the meanwhile, you can find me on the LinkedIn.

Corey Quinn: We will of course put a link to that in the show notes. Thank you so much for taking the time to speak with me. I appreciate it.

Brian Weber: More than happy to, Corey. Thank you.

Corey Quinn: Brian Weber, longtime friend and mentor. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a 5 star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a 5 star review on your podcast platform of choice, along with an angry, insulting comment telling us that we must be idiots, because clearly setting up storage for Kubernetes in a home environment couldn't possibly be that hard.
