Episode Summary
Alex Hidalgo is a principal site reliability engineer at Nobl9, makers of a robust service level objective (SLO) platform for SREs. Prior to this role, Alex worked as a senior site reliability engineer at Squarespace and a senior site reliability engineer at Google. He’s also the author of the O’Reilly book Implementing Service Level Objectives, which was released in September 2020. In 2001, Alex restored a 1964.5 Mustang for money.
Join Corey and Alex as they discuss the pros and cons of writing a book, what exactly a service-level objective is, the difference between a service-level objective and a service-level agreement, how implementing SLOs is all about finding the perfect balance of failure your users are willing to tolerate, how reliability for an SRE is defined by SLOs, what the moment was like when Alex realized he was going to write a book, how it’s difficult to bring up the fact that you’ve written a book in conversation, and more.
Episode Show Notes & Transcript
About Alex Hidalgo
Alex Hidalgo is a Site Reliability Engineer and author of the upcoming Implementing Service Level Objectives (O'Reilly Media, September 2020). During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Links Referenced:
- Buy Implementing Service Level Objectives on bookshop.org
- Buy Implementing Service Level Objectives on Amazon
- Follow Alex on Twitter
- Alex’s personal site
- Corey’s landing page for Implementing Service Level Objectives
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Cloud Economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at Linode. You might be familiar with Linode; they’ve been around for almost 20 years. They offer Cloud in a way that makes sense rather than a way that is actively ridiculous by trying to throw everything at a wall and see what sticks. Their pricing winds up being a lot more transparent—not to mention lower—their performance kicks the crap out of most other things in this space, and—my personal favorite—whenever you call them for support, you’ll get a human who’s empowered to fix whatever it is that’s giving you trouble. Visit linode.com/screaminginthecloud to learn more, and get $100 in credit to kick the tires. That’s linode.com/screaminginthecloud.
Corey: This episode is sponsored by a personal favorite: Retool. Retool allows you to build fully functional tools for your business in hours, not days or weeks. No front end frameworks to figure out or access controls to manage, just ship the tools that will move your business forward fast. Okay, let’s talk about what this really is. It’s Visual Basic for interfaces. Say I needed a tool to, I don’t know, assemble a whole bunch of links into a weekly sarcastic newsletter that I send to everyone. I can drag various components onto a canvas: buttons, checkboxes, tables, etc. Then I can wire all of those things up to queries with all kinds of different parameters: post, get, put, delete, et cetera. It all connects to virtually every database natively, or you can do what I did, and build a whole crap ton of Lambda functions, shove them behind some API Gateway and use that instead. It speaks MySQL, Postgres, Dynamo—not Route 53 in a notable oversight, but nothing’s perfect. Any given component then lets me tell it which query to run when I invoke it. Then it lets me wire up all of those disparate APIs into sensible interfaces. And I don’t know front end. That’s the most important part here: Retool is transformational for those of us who aren’t front end types. It unlocks a capability I didn’t have until I found this product. I honestly haven’t been this enthusiastic about a tool for a long time. Sure they’re sponsoring this, but I’m also a customer, and a super happy one at that. Learn more and try it for free at retool.com/lastweekinaws. That’s retool.com/lastweekinaws, and tell them Corey sent you because they are about to be hearing way more from me.
Corey: This episode has been sponsored in part by our friends at Veeam. Are you tired of juggling the cost of AWS backups and recovery with your SLAs? Quit the circus act and check out Veeam. Their AWS backup and recovery solution is made to save you money—not that that’s the primary goal, mind you—while also protecting your data properly. They’re letting you protect 10 instances for free with no time limits, so test it out now. You can even find them on the AWS Marketplace at snark.cloud/backitup. Wait? Did I just endorse something on the AWS Marketplace? Wonder of wonders, I did. Look, you don’t care about backups, you care about restores, and despite the fact that multi-cloud is a dumb strategy, it’s also a realistic reality, so make sure that you’re backing up data from everywhere with a single unified point of view. Check them out at snark.cloud/backitup.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by Alex Hidalgo, who's a site reliability engineer and, due to an escalatingly poor series of life decisions, a recently published author, specifically, the book Implementing Service Level Objectives. Alex, welcome to the show.
Alex: Thanks, Corey.
Corey: So, every person I've talked to who's written a book has given me the thousand-yard stare when I ask if I should write one, and then immediately begins screaming, “No, never do this.” I've come to the conclusion that nobody actually wants to write a book; they want to have written a book. How accurate is that?
Alex: I think there's absolutely some truth to that. It is difficult, it is tiring, it is emotionally draining, and having to have written one in the middle of a global pandemic didn't make anything any easier. That being said, it's also been an incredibly rewarding experience, especially when you end up producing something that you're truly proud of. And my book is about people. It sounds like it's about service level objectives, and I guess it kind of is, but it's mostly about how to use those to make people's lives better. And I've seen how this process can do that. And so having something out there that I hope will help people's lives is ultimately rewarding. And there were more tears of frustration than there were tears of joy, but there were both.
Corey: So, let's start at the very beginning here, I know the book is—at the time of this recording, it is in print; they are starting to ship out. I have not yet received my copy, but of course, I have ordered one, I keep an eye on whatever O'Reilly releases, and especially when it's people I know, it's a no brainer. It will, in all honesty, sit on the shelf and never get opened because I will do my actual reading on the Kindle. But having something on the shelf is, for something like this—[00:03:54 unintelligible] you know, you’re the person that wrote it—is just the right thing to do. But start at the beginning for me here because it turns out that I am a white guy in tech, which means my failure mode is a board seat and a book deal somewhere; everyone assumes that I know everything about everything, and I tend to not shatter that illusion very often, but I have no earthly idea what the hell a service level objective is. Since it's just you, me, and the thousands of people listening to this, what is an SLO?
Alex: So, an SLO, well, it's an objective for your service; you can't be perfect. The story I like to tell is, imagine you're using a streaming media service of some sort: a Netflix, a Hulu, a Disney Plus, whatever, and when you're using this, normally when you start a new video, it buffers for a few seconds, and you're fine with that. And that's technically not perfect, but it turns out these services don't have to be perfect for you because you're fine if it buffers a bit. But by the same token, if it buffers for, like, 20 seconds, you don't love that, but you're not going to abandon the service unless it buffers for 20 seconds every single time. Then you may say, “Screw this. I'm moving to a competitor.”
So, the idea is, find out what your users can tolerate, and make sure you're only failing that often. If you're not losing users, if people are still happy, in general, with your service, if you only take 20 seconds to buffer 1 in 50 times, then aim for that. Because you're going to spend too many resources, both financially and via your developers and your support engineers if you try to make everything 100 percent all the time.
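To make the "1 in 50" idea concrete, here is a minimal Python sketch of how that tolerance could be checked against measurements. Everything named here is an assumption for illustration, not something from the episode or the book: the 20-second cutoff for a "slow" start, the 49-in-50 target, and the function names are all hypothetical.

```python
from typing import Sequence

# Assumption: a video start counts as "bad" if it buffers 20 seconds or longer.
SLOW_BUFFER_SECONDS = 20.0
# Assumption: users tolerate 1 slow start in 50, so 49 in 50 must be good.
TARGET_GOOD_RATIO = 49 / 50


def buffering_sli(buffer_seconds: Sequence[float]) -> float:
    """Return the fraction of video starts that buffered acceptably."""
    if not buffer_seconds:
        raise ValueError("need at least one measurement")
    good = sum(1 for s in buffer_seconds if s < SLOW_BUFFER_SECONDS)
    return good / len(buffer_seconds)


def meets_slo(buffer_seconds: Sequence[float],
              target: float = TARGET_GOOD_RATIO) -> bool:
    """True when the measured good ratio is at or above the target."""
    return buffering_sli(buffer_seconds) >= target


if __name__ == "__main__":
    # 99 quick starts and one 20-second stall.
    sample = [3.0] * 99 + [20.0]
    print(buffering_sli(sample), meets_slo(sample))
```

The measurement (the ratio of good starts) and the target (how often it has to be good) are kept separate on purpose, so the target can be loosened or tightened without touching how the service is measured.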
Corey: It sounds, on some level, like it's a derivative of SLA, service level agreement. What's the difference?
Alex: So, the difference is, a service level agreement—and they've definitely been around much longer. Actually, in some of my research, I’ve found—
Corey: I've seen SLAs in contracts all the time when negotiating those, I have never seen the phrase ‘service level objective’ in a contract, which means that lawyers will not know what I'm talking about if I use SLO, I suspect.
Alex: Yep, exactly. An SLA is something you put into a contract, and it generally implies that you owe somebody something, whether it's credit or actual money, if you violate that. SLOs are an approach to thinking about the reliability of your service. They are promises in some sense, but definitely not contractual ones. They’re tools, they're a bit of data that helps you make decisions. Are we buffering too long, too often? Is this page not loading correctly, too often? You know these things are going to happen, and just make sure it's not too much of the time. It kind of accepts the same thing as an SLA does. SLAs are generally not 100 percent because, again, people realize something will break at some point in time. SLOs think about things in the same way, in that sense, but they're used to help you make decisions. Do we need to focus on this part of our product? Do we need to focus on this?
Corey: So, help me understand this in the context of a story that I've related from time to time on various forms of podcasts and whatnot. Years ago, I was trying to buy a pair of socks on amazon.com. And I clicked the buy button and I got one of their error pages, which of course features dogs. In all honesty, the dog page is more satisfying than any other page on amazon.com.
If I listened to the common wisdom, that would mean that during that outage that lasted about an hour or so, I would have therefore gone to a competitor to buy the pair of socks, or alternately, one day out of the week on the day that that pair of socks should have been there, I would just go without socks whatsoever. In practice, “Oh, that's weird, I never see that. Haha.” I come back an hour later, I buy the socks, and life goes on. There was no loss of revenue, in my case, for amazon.com during that outage. However, if every third time I tried to buy something at Amazon, I got the dog page instead, I'd probably spend a lot more money at Target. So, is that a naive storytelling, I guess, understanding of a much more complex concept of SLOs? Are they related, or is this completely out in the weeds and it's a boring story we should make sure we drop on the floor in post-production?
Alex: No, that's exactly it. If you're down for an hour, chances are people are going to be like, “Huh, this happens.” Stuff breaks; people are used to it; they're actually mostly okay with it. And you're probably just going to come back and check in an hour. That's exactly correct. That's the whole point. You can be down for an hour, you just can't be down for an hour too often.
So, find out what that is. Find that percentage. Can you be down for an hour, once a month? Twice a month? It's going to be different depending on your service. If you're a specialty retailer—no one else sells your stuff—then you can probably actually be a little bit more lax than if you're someone like Amazon, who's also expected to be constantly up because they're the largest company in the world, in some sense.
So, no, I think you got it right, exactly. It's just that implementing SLOs, there's a lot of math that goes into it. There's a lot of discussions you have to have, it's not easy to just pick a number. So, they're simple to talk about, not always easy to implement. That's why there's a whole book about it. But your story, that's exactly it. That's exactly the whole point.
Corey: B&H, the photo company, closes for 24 hours for Shabbat every week. And they wind up having their website up, but they say yeah, you can't actually make a purchase until Shabbat ends. On some level, it's kind of their own brand now at this point, and it seems to have worked out reasonably well. But as you say, it comes down to what your story is, as far as approaching the market. I'll go back an hour later to buy socks because I need socks. I'm not going to go back to your website an hour later to click on an ad that wasn't displaying.
Alex: Exactly. So, a meaningful SLI is a measurement of how your service is operating from your users’ perspective. And when people ask me to explain it a bit more, I'm always like, “Well, it's the same thing as a KPI, or key performance indicator, for the business side.” Or if you were to talk about this to a product manager, they would say, “Oh, it's a user journey.”
You’ve got to take everything into account. People are going to have different expectations for how different parts of the service work. As you said, there's a difference between, perhaps, something like a button not registering a click on the first try but registering a click on the second try. I think people aren't going to be too upset about that, so that can probably fail more often than just not being able to check out entirely.
Corey: When you're looking at SLOs through a lens of things we should strive to do, how does that keep from becoming a, “We'll try our best,” which sounds great, makes everyone feel good, but isn't something that is easy to represent as either having value or mattering at all to the larger business?
Alex: So, I think you really do just try your best, but I understand, I agree that that doesn't sound like a great sentence, even though it generally is true. You know, you'll try your best, and you’ll try to make sure your service is good. But they can't always be perfect, and I get that that kind of language isn't great, but that's actually exactly why I think the whole SLI, SLO, error budgets—which we haven't even talked about. It’s measurement of your SLO over time, as opposed to, kind of, right now—this is actually an example where the numbers can help, can point to things and say, “Yeah, but we were only unreliable for 4 minutes and 32 seconds last month,” or something along those lines.
And that's how you, kind of, help explain to people that yes, in a sense, this is we're just going to try our best because we cannot try our perfect; that's not a thing. People get to understand what you actually mean with that. When you're saying I'm going to try my best, you're actually saying, “Well, we're aiming to be 99.95 percent reliable, and that translates to X number of minutes per month that we may be unreliable.” And that can often help people understand, “Oh huh. Maybe trying your best is actually good enough.”
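For the "X number of minutes" Alex mentions, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 30-day month and both function names are hypothetical, and a real error budget would normally be computed from measured events over a rolling window rather than from wall-clock time alone.

```python
def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of unreliability a target allows over the window.

    Assumption: the window is a flat 30-day month unless stated otherwise.
    """
    if not 0.0 < target_percent <= 100.0:
        raise ValueError("target_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - target_percent) / 100.0


def measured_availability_percent(bad_seconds: float, window_days: int = 30) -> float:
    """Availability over the window, given total seconds of unreliability."""
    total_seconds = window_days * 24 * 60 * 60
    return 100.0 * (1.0 - bad_seconds / total_seconds)


if __name__ == "__main__":
    # A 99.95 percent target over 30 days allows roughly 21.6 minutes.
    print(error_budget_minutes(99.95))
    # The "4 minutes and 32 seconds" example is 272 seconds of unreliability.
    print(measured_availability_percent(4 * 60 + 32))
```

Under these assumptions, 4 minutes and 32 seconds of unreliability in a month sits comfortably inside a 99.95 percent target, which is the kind of statement that turns "we tried our best" into a number people can discuss.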
Corey: I really wish that more people were explicit about saying trying your best is good enough because I can't shake the feeling that that is not a well-circulated belief in far too many places.
Alex: Totally agreed. But it's the truth. It's how things actually work. Things fail, people fail, and it turns out people actually know that. When you’re running a business, the end goal, of course, is to make money and make as much money as possible—or at least for most businesses. There are absolutely outliers there—and that means your executives or whoever owns the most shares, or your shareholders if you've gone public, that's their goal, right?
At some level, they want to make money and want to make as much money as possible. And therefore, they think the way to do that is to aim for perfection, to aim for 100 percent. But you're always going to falter if you do that, and those are the people who can be most difficult to convince that it's just not prudent to aim for 100 percent, but those people can get there as well. It can take time, it can take a lot of examples, but don't let great be the enemy of the good. It's a concept that, as humans, we know so well that we have an idiom for it.
And you just got to figure out how to translate that to a business where, again, they think they want to be 100 percent because they think they need to do that to make as much money as possible, but you can actually often save in resources by not trying to be perfect because the amount of money you're going to spend, it ends up almost becoming a limit approaching infinity, that curve is going to shoot straight up in how much money it's going to cost for you to try to be as reliable as possible. You're going to have to run multi-region, and you're going to have to pay for quicker replication, and you're going to have to hire more engineers because, like, if you want to be up almost all the time, you're going to have to have people who are on call who can respond within, like, 30 seconds.
The only way you can do that is if you have, like, a follow the sun rotation where you have offices all over the world because otherwise—not everyone's going to wake up at 3 a.m. so you got to make sure it's someone's 3 p.m. instead. And you can see how quickly this escalates. It can actually cost you more money; you can actually make more money by having a more reasonable target. Think about what your users actually need from you, and often—well, again, humans expect failure. They're cool if stuff doesn't work every once in a while, as long as it doesn't fail too often.
Corey: One of the only other places I've seen SLOs discussed in any serious capacity was the SRE book that came out of Google. And there was a lot of good stuff in that book, but I had a bit of a negative reaction to it just because that came out right when I was in the middle of getting an awful lot of, “We’re Google, we’re smarter than you,” flak on other fronts. The people who wrote that book, to be very clear, are great. That is not the impression I have of those people.
But it's, “Oh good. How to be more like Google. Just what I don't want to listen to.” So, I largely ignored it for a while. But Liz Fong-Jones, now at Honeycomb, is a big advocate of SLOs. They're one of my consulting reference clients, and we've had—most of our conversations around cost optimization have centered around SLOs. So, what the expectation for their customers is and the commitments that they have made. And it was a really interesting philosophy that I haven’t seen replicated elsewhere, yet.
Alex: Yeah, it's something that's gaining a lot of traction. And actually, so I was at Google. I was on the Customer Reliability Engineering Team with Liz, and one of the things we did was we went out and we taught some of Google's largest cloud customers how to SRE. That was kind of our goal. And the beginning of every journey was, “You need to have SLOs first. This is our common language. This is how we talk about reliability. Reliability to an SRE is defined by service level objectives.”
And so while it's still, outside of Google, still kind of a growing discipline, lots of people are doing it. In some ways, this book is about the two years I spent at Squarespace. It's essentially my story there. I joined and people said, “We want to do SLOs, and we know that you know SLOs, and let's do it.” And I was like, “Okay, sure.”
Except I didn't realize what it takes to build this kind of thing from the ground up. Because suddenly, I wasn't at Google anymore. I didn't have all the tooling that Google has, I didn't have the cultural buy-in, not just by engineers but across various different organizations, and I had to do everything from the bottom up: new tooling had to be created, new software written, I had to drastically change how people measure things, how people think about things, and that's kind of how the book came to be. I was running a lot of SLO workshops just internally where teams could come and I could spend three, four hours with them, and maybe even hopefully, end up with a single defined SLI/SLO pair before they left the room. And it was just getting to be a lot because I was seeing the same thing over and over again.
And I was talking to my colleague, Gabe, and I explained this to him, you know? It's just getting tiring. And then I said, “I wish there was a whole book about this so I just point people to the book.” And Gabe said, “You should write it.” And I said, “No, no, no, no. We need, like, the expert to write it.” And he said, “You are the expert.” And so I'm pretty sure my response was, “[BLEEP],” because I knew I was now going to write a book.
Corey: It sounds on some level, like writing a book is something that is—it used to be this thing that people would do, this aspirational task. “I’m going to write a book.” Increasingly, it's starting to sound, for an awful lot of author folks, that it's more like a dead dog that has been cast into your yard by one of your neighbors, and now it's time for you to worry about it.
Alex: [laughs]. I mean, I don't know if I go quite that far with the metaphor, but yeah, absolutely. I mean, I won't lie. It's awesome that I wrote a book. That's neat. As a kid, my dream was to be an author. I thought it'd be like fantasy books and not a technical manual, but still, it's amazing. I got the first physical copy yesterday, and I bawled. I cried.
It's really, really neat having done that, and I won't pretend that some of the status that you get with that, I won't pretend I’m totally ignoring that. But absolutely, at the root of it, it's more like, yeah, someone needs to do this. I am the right person for it. I write well; I know a lot of people who can help me with this; I helped with the second SRE book, so I already understood the process just a little bit, and it was like this needs to be out there, and so, it may well be me.
Corey: This episode is sponsored by our friends at New Relic. If you’re like most environments, you probably have an incredibly complicated architecture, which means that monitoring it is going to take a dozen different tools. And then we get into the advanced stuff. We all have been there and know that pain, or will learn it shortly, and New Relic wants to change that. They’ve designed everything you need in one platform with pricing that’s simple and straightforward, and that means no more counting hosts. You also can get one user and a hundred gigabytes a month, totally free. To learn more, visit newrelic.com. Observability made simple.
Corey: I'm glad it's you because, first, I get to wind up reading a book that I don't have to write, which is great because I'm never going to write one. I don't have the attention span to write a tweet most days. But being able to read an in-depth thought in book form, where you are able to opine on the various angles of this from a whole bunch of different perspectives in a much longer form than a blog post, or this is a tweet thread. Tweet one of 487,000, and so on and so forth. It's great to have a single place for it to go. Also, you tipped me off fairly early on that—talk about bucket list items—that I am referenced in an easter egg within the book.
Alex: Yeah. So, actually got to give props to some co-authors here. I actually ended up writing only about 60 percent of the book. I was always planning on bringing in two or three people who are, like, experts at very niche parts of this, and suddenly I had people volunteering all over the place.
And so, the other 40 percent have a bunch of amazing people from across the industry. And Polina Giralt and Blake Bisset wrote a chapter about data reliability. Data reliability is—you've got to approach it in a very different way than latency, or availability, or error rates. So, we have a whole chapter about data reliability, and there's a part where they discuss what even is a database? And they ask the question, is Route 53 a database? And we were able to get a footnote in that says, “Hi, Corey.” At first, the editor wanted to take it out, and I was pretty adamant that we should leave it in.
Corey: Oh, absolutely. The fact that I am referenced in this now means that I'm getting it framed and hung on the wall. “Oh, did you write that book?” “Absolutely not. One of my stupid jokes made it into that book.”
And that's—yeah, oh, I'm absolutely going to steal credit for you in this sense. I can finally have that as my counterpoint to my business partner’s story. He wrote Practical Monitoring for O'Reilly and has it on display on his bookshelf behind him when he's on Zoom calls. And it's a fun problem because, as it turns out when you've written a book, it's very hard to bring up the fact that you have written a book—because you're proud of it, you spent a disturbing portion of your life for a while on writing that book, but you can't open the sentence with that, or people find it pretentious and ridiculous. So, my position has always been that if I know someone's written a book, I will drop that into virtually every conversation when someone who’s talking to them doesn't know that fact. I try and be the one-person promotional band for stuff like that. So, do you know that you've written a book?
Alex: It still doesn't feel real. Even though I got that physical copy and have paged through it one by one, I spent almost, like, an hour, not really reading but kind of reading, and it still doesn't feel real, to be honest. But I totally hear you. So, I've given myself the opportunity to brag occasionally, to gush, because I'm incredibly proud of this and this was literally the hardest thing I've ever done in my entire life. So, I tried to give myself the opportunity to occasionally talk about it.
But outside of Twitter, where, honestly, I don't care that much; I’m constantly self-promotional there. In the real world, you're absolutely right, it's difficult to bring up. I posted a picture of the book on an internal Slack channel at Squarespace because I let myself say, okay, cool. I haven’t talked about the book at work for several months, here's my once a quarter permitted single message about it, and there were engineers at the company that still didn't know I was writing the book because you're right, it's difficult to bring up. You don't want to sound like a bragger. But I do think you have to try to give yourself permission every once in a while. There's nothing wrong with promoting the things you've done, especially the ones that you're very proud of.
Corey: Looking back, based upon what you know now, first, would you write the book again, and secondly, what do you wish you'd known before you started?
Alex: Yes, I’d write the book again. At the end of the day, the positives outweigh how difficult it was. What would I do differently? I would take more time off. I'm not writing a book again and also working full time.
I did take a few weeks off in December but having to do essentially all of this work on weekends and evenings, that was draining; it was a lot. And I should have known that I needed to give myself more time there. Don't do it when a global pandemic is about to happen. That part was [laugh] terrible. Especially if you're someone who—I can't write at home, I need to be at a coffee shop, at a bar. It's strangely not very unheard of. Lots of writers are this way.
And when I didn't have anywhere to go anymore, that was tough. That was not easy. Of course, you can't really predict that, but I had to throw that out there. [laugh]. And the final thing I’d do is I had a bunch of co-authors and they're all wonderful and every chapter turned out great, but trying to navigate that many people's schedules, and their own commitments, and I think I just want to be more sure, ahead of time, to let these chapter authors know, this is going to be difficult; this is going to take you more time than you think.
Can you take a week or two off? Because otherwise, you're really going to be struggling with this. So, those are the things: just give yourself more time; if you're working with other people, ensure that they're giving themselves enough time. And just try to make sure you're in a good situation in general. That might be a better way to sum up the whole pandemic thing because, of course, you can't ever control that. But make sure you have time, make sure you're comfortable. Make sure you're not going through other life events. Don't do this in the middle of some other crisis, or health problems, or something like that. You need to be in the best possible mental state that you can be in before you embark on something like this.
Corey: In what you might be forgiven for mistaking for a blast from the past, today I want to talk about New Relic. They seem to be a relatively legacy monitoring company, and I would have agreed with that assessment up until relatively recently. But they did something a little out there: they reworked everything. They went open source, they made it so you can monitor your whole stack in one place and, most notably from my perspective, they simplified their pricing into something that is much more affordable for almost everyone. There's even a free tier with one user and 100 gigs per month, totally free. Check it out at newrelic.com.
Corey: That's a hard and heavy lift in 2020. And again, there is a production delay between the time that we record this and the time it goes out. Sorry listeners, when you download something from your podcast, I don’t, quick, get someone on the phone, and have that talk live. I know. Spoiling the production magic for you.
But it seems like this is getting to be such a weird year from week to week. It's, “Wow, they didn't even mention the giant meteor.” And well, here we are. It's a hard problem to solve for as far as finding time to write. I've written a few basic outlines of books I've toyed with writing, and invariably in almost every case, I find another book that's already been written that aligns closely enough that, “Oh, I'll just talk about that thing instead. It's easier.”
Alex: Yeah, that’s… [laugh] I remember, when I was first really getting into service level objectives at Google, I was on a team that was responsible for the monitoring and alerting for everything else across Google. So, we wanted to have well-defined SLOs so other internal engineers could look to our definitions and understand how reliable we were aiming to be so they could ensure that their systems were, you know, handling things correctly and knew how to retry when they had to, and things like that. And suddenly, I had this great idea of building an SLO repository, a centralized place where everyone can define their SLOs and they get some tooling or dashboarding for free, and it was a centralized place for you to discover what your dependencies were aiming for so you could set your targets correctly. And I told one of the staff engineers on my team, and he was ecstatic about it.
And he was like, “Oh, my God. Alex, this is a great idea. This is going to get you your next promotion.” And I spent a few hours starting to outline what it would look like, and then someone else on my team came to me, he’s like, “I just discovered there is an entire team staffed of ten people working on this product.” [laugh]. So, my great idea, immediately up in smoke because someone else already had that great idea first.
But in this case, I looked, I really did. When you fill out a proposal for O'Reilly, they ask you, “What are competing books to yours? We need to know, what are you comparing yourself against?” And I listed the SRE books including Seeking SRE, David Blank-Edelman’s book because they at least talk about the SLOs, but really, it was tangential.
I was like, what I'm writing is strangely new. It's not just much more expansive, but it's actually a pretty different take than how they're described in either of the Google SRE books. So, that was one of the reasons I really felt like I had to do it. Because I looked and I couldn't find what I thought needed to be out there.
Corey: That's probably the greatest sign it's time to write a book, I would imagine. When no one else is talking about the thing that you want to talk about, or they are, they're getting it all wrong across the board. My position has always been to do a snarky take on Twitter or a sarcastic blog post, but there are times you need to go deeper than that. And to be honest, I'm very glad that people like you have attention spans.
It's easy to fall into a trap, in my experience of, in the world of Twitter and things like it, it's easy to attain relative mastery—or absolutely not, but the appearance of relative mastery in the confines of 280 characters. But then you see people who are legitimate experts in things and oh, it turns out that maybe I shouldn't be reinventing all the stuff from first principles, as if I were, suddenly Hacker News come to life. There's a definite value in seeing deep exhaustive research. One of the things I find most worrying about my increased attention to short-form social media is that I'm not reading the long-form stuff that really lets you dive into a topic with anything approaching the frequency that I used to.
Alex: Yeah, and I think it's actually—it's an interesting time in tech because I think a lot of people are pivoting towards understanding that we need to have a more in-depth understanding of how everything works. People have stopped being experts or deep experts at individual things. People are always being asked to be full-stack engineers, and you've got to understand everything. If you try to do that, then you will only ever have a shallow understanding of anything.
And I think it's a really interesting time because people are starting to realize that, and one of the ways I see it manifesting right now is in the fact that people are starting to look to outside industries. We've tried to call ourselves engineers for a long time. There's a lot of debate about whether or not that's applicable or not. It's just semantics, I don't think it's actually important outside of the fact that it is, in fact, a fun debate to have. But what I am seeing is people realizing, “Oh, there are other engineering disciplines, and they've learned all these lessons already.”
Especially in my world, people talk about reliability, and unfortunately, from an industry standpoint, that's mostly come to mean availability. Those things are actually very different. Reliability means so much more than that, and reliability engineering has been around since, I think, was the late 1940s is when the term was first phrased by the US military, in terms of whether some—I can't remember the exact object it was now, but, you know, like some armament, would it function the way it was intended to? The term reliability was coined to mean, “Is this doing what it was designed to do?” And that's a heck of a lot more than just being available.
And we're seeing more people think about that, though. You're seeing more people getting academic about things. And I remember a few years ago, at work, someone was trying to solve a problem with the fact that there were only very low-resolution metrics coming in, only a few API calls per hour. They’re like, “How do we alert off of this?” Because a single error—which might be totally fine—could represent 30 percent of all traffic over this hour window, but do we want to measure over 24 hours because then we wouldn't alert until 24 hours have passed.
And a colleague came over and said, “You know, you can just use a binomial distribution to solve that, right?” And everyone was like, “Huh? What are you talking about?” And he just broke out Wikipedia and showed us and suddenly, the entire team pivoted to understanding like, holy crap, there are statistical models, some of which were developed centuries ago, that we can use to help with so much stuff. And again, I think it's neat because after years and years and years of tech being pretty egocentric, and thinking we must solve it, or thinking we are the smartest in the room, while it’s definitely not gone entirely, I feel like—and I've been doing this for a long time. I've been in the industry in some way or another for almost 20 years—and personally, I feel like I'm seeing more and more people looking outside saying, “How have other people already solved this problem before?”
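The binomial approach Alex's colleague mentions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method his team actually used: the function names, the 5 percent expected error rate, and the 1 percent alert threshold are all hypothetical. The idea is to ask how likely it would be to see at least this many errors in so few calls if the service were failing at its normal rate, and to alert only when that probability is very small.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p).

    n: total calls seen in the window
    k: errors seen in the window
    p: error rate we consider normal (an assumed value, not from the episode)
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    # Complement of P(X <= k - 1), summed term by term.
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def should_alert(total_calls: int, errors: int,
                 expected_error_rate: float, alpha: float = 0.01) -> bool:
    """Alert only if this many errors would be very unlikely at the normal rate.

    alpha is a hypothetical significance threshold chosen for illustration.
    """
    return binomial_tail(total_calls, errors, expected_error_rate) < alpha


# One error in three calls looks like a 33 percent error rate,
# but it is not surprising enough to page anyone.
print(should_alert(total_calls=3, errors=1, expected_error_rate=0.05))

# Three errors in three calls is very unlikely at a 5 percent normal rate.
print(should_alert(total_calls=3, errors=3, expected_error_rate=0.05))
```

With so few calls per hour, a raw error percentage swings wildly on a single failure, while the tail probability accounts for how small the sample is.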
Corey: One of the problems that I've always seen is that there's this tendency to not look for prior art; instead, sit down and dive right into attempting to solve it internally with the resources you have first. Well, one of those resources is Google. Take a look and see, maybe other people have solved this. One problem that I have around this space is that the term ‘service level objectives’ is not discoverable if you don't know that the term exists. How do you get this in front of people who are absolutely positioned to benefit from this, but don't know what they don't know, so they don't have the term to look for?
Alex: I don't know if I have all the answers there beyond the fact that, in my opinion, to be a good engineer, you need to understand how to market. If you have a new idea or a new service—
Corey: “Well slow down, there, hasty pudding,” says AWS. But please continue.
Alex: [laugh]. Exactly how to approach this, how to get this in front of everyone, I don't think I know, and I don't know if I'm the exact right person for that. But one thing I have learned is you just repeat yourself a lot. If you have a good idea that you think can help other people just tell them over and over again, maybe not to the point that you're actually annoying them because you do want them to listen to you in the future. But if you think you have a solution, let people know.
And they may ignore you at first, or they may think that they still have a better way to do it. But the one thing I will say, and I come back to this over and over and over again, people listen to stories. Yeah, sometimes data helps, sometimes having some numbers to put in front of someone helps, but overall, people like stories. Tell them a story about how this worked for you. Tell them a story about how it made your life easier. Tell them a story about how it saved a company a ton of money or helped them discover some very obscure bug.
And that's what people really connect to. We've been storytellers for millennia, and that's the best way to get these kinds of things across, I think. And a lot of the book is written that way. There are some chapters that are very heavy in math, and there's an entire chapter about statistics, but even that one has some great stories about dumplings. And we tried to frame the whole book that way. You need a narrative, that's how you engage people, that's how you keep people listening.
Corey: And of course, there's always the option of telling the story on wonderful podcasts like this one.
Alex: 100 percent. I don't think the medium necessarily matters. You can podcast it, you can write a book, you can write a blog post, you can go out to conferences and tell people these stories verbally, or whatever it is. Yeah, I agree. I don't think that part necessarily matters because different people consume information and consume narratives in different ways.
Corey: Well, this has been an absolutely fantastic experience and incredibly educational, at least for me. If people want to hear more about what you're up to, or learn more about SLOs, where can they find you, slash buy your book?
Alex: You can buy the book wherever you want. It's an O'Reilly published book, it's available widely. Go to your local bookstore if you can. If you're not comfortable currently leaving the house, go to bookshop.org that helps support local bookstores. But if you want to order off Amazon because you got Prime, feel free to do that. Just think about, perhaps, supporting small and local businesses. You can find me on Twitter at @ahidalgosre—that’s A-H-I-D-A-L-G-O-S-R-E—where I often pontificate about these kinds of things. And I have a website at www.alex-hidalgo.com.
Corey: Given that you do obviously care about various ways to purchase the book, let's make it easy on people. If you visit snark.cloud/slobook—that's S-L-O-B-O-O-K—we'll drop you onto a site that shows you how to go about purchasing this in a variety of different ways. That's snark.cloud/slobook. Alex, thank you so much for taking the time to speak with me. I really do appreciate it.
Alex: No, thank you, Corey. This has been a great conversation. I've had a lot of fun.
Corey: Likewise. Alex Hidalgo, site reliability engineer, and author. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on Apple Podcasts, whereas if you hated this podcast please leave a five-star review on Apple Podcasts and a published statement about exactly how many nines we should have had instead of an SLO, in the comments.
Announcer: This has been this week’s episode of Screaming in the Cloud. You can also find more Corey at ScreamingintheCloud.com, or wherever fine snark is sold.
This has been a HumblePod production. Stay humble.