Episode Summary
Episode Show Notes & Transcript
- Slack: https://slack.com
- “Infrastructure Observability for Changing the Spend Curve”: https://slack.engineering/infrastructure-observability-for-changing-the-spend-curve/
- “Right Sizing Your Instances Is Nonsense”: https://www.lastweekinaws.com/blog/right-sizing-your-instances-is-nonsense/
- Personal webpage: https://frankc.net
- Twitter: @frankc
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: It seems like there is a new security breach every day. Are you confident that an old SSH key, or a shared admin account, isn’t going to come back and bite you? If not, check out Teleport. Teleport is the easiest, most secure way to access all of your infrastructure. The open source Teleport Access Plane consolidates everything you need for secure access to your Linux and Windows servers—and I assure you there is no third option there. Kubernetes clusters, databases, and internal applications like AWS Management Console, Yankins, GitLab, Grafana, Jupyter Notebooks, and more. Teleport’s unique approach is not only more secure, it also improves developer productivity. To learn more visit: goteleport.com. And no, that is not me telling you to go away, it is: goteleport.com.
Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of “Hello, World” demos? Allow me to introduce you to Oracle’s Always Free tier. It provides over 20 free services, including infrastructure, networking, databases, observability, management, and security. And—let me be clear here—it’s actually free. There’s no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free, no asterisk. Start now. Visit snark.cloud/oci-free. That’s snark.cloud/oci-free.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Several people are undoubtedly angrily typing, and part of the reason they can do that, and the fact that I know that is because we’re all using Slack. My guest today is Frank Chen, senior staff software engineer at Slack. So, I guess, sort of… [Salesforce 00:00:53]. Frank, thanks for joining me.
Frank: Hey, Corey, I have been a longtime listener and follower, and just really delighted to be here.
Corey: It’s one of the weird things about doing a podcast is that for better or worse, people don’t respond to it in the same way that they do writing a newsletter, for example, because you receive an email, and, “Oh, well, I know how to write an email. I can hit reply and send an email back and give that jackwagon a piece of my mind,” and people often do. But with podcasts, I feel like it’s much more closely attuned to the idea of an AM radio talk show. And who calls into a radio talk show? Lunatics, and most people don’t self-describe as lunatics, so they don’t want to do that.
But then when I catch up with people one-on-one or at events in person, I find out that a lot more people listen to this show than I thought they did. Because I don’t trust podcast statistics because lies, damn lies, and analytics are sort of how I view this world. So, you’ve worked at a bunch of different companies. You’re at Slack now, which, of course, upsets some people because, “Slack is ruining the way that people come and talk to me in the office.” Or it’s making it easier for employees to collaborate internally in ways their employers wish they wouldn’t. But that’s neither here nor there.
Before this, you were at Palantir, and before this, you’re at Amazon, working on Amazon WorkDocs of all things, which is supposedly rumored to have at least one customer somewhere, but I’ve never seen them. Before that you were at Sandia National Labs, and you’ve gotten a master’s in computer science from Stanford. You’ve done a lot of things and everything you’ve done, on some level, seems like the recurring theme is someone on Twitter will be unhappy at you for a career choice you’ve made. But what is the common thread—in seriousness—between the different places that you’ve been?
Frank: One thing that’s been a driver for where I work is finding amazing people to work with and building something that I believe is valuable and fun to keep doing. The thing that brought me to Slack is I became my own Slack admin, [laugh] when I met a girl and we moved in together into a small apartment in Brooklyn. And she had a cat that, you know, is a sweetheart, but also just doesn’t know how to be social. Yes, you covered that with ‘cat.’ Part of moving in together, I became my own Slack admin and discovered, well, we can build a series of home automations to better train and inform our little command center for when the cat lies about being fed, or not fed, clipping his nails, and discovering and tracking bad behaviors. In a lot of ways this was like the human side of a lot of the data work that I had been doing at my previous role. And it was like a fun way to use the same frameworks that I use at work to better train and be a cat caretaker.
Corey: Now, at some point, you know that some product manager at Amazon is listening to this and immediately sketching notes because their product strategy is, “Yes,” and this is going to be productized and shipping in two years as Amazon Prime Meow. But until then we’ll enjoy the originality of having a Slack bot more or less control the home automation slash making your house seem haunted for anyone who didn’t write the code themselves. There's an idea of solving real world problems that I definitely understand. I mean, and again, it might not even be a fair question entirely. Just because I am… for better or worse, staggering through my world, and trying—and failing most days—to tell a narrative that, “Oh, why did I start my tech career at a university, and then spend time in ad tech, and then spend time in consulting, and then FinTech, and the rest?” And the answer is, “Oh, I get fired an awful lot, and that sucked.”
So, instead of going down that particular rabbit hole of a mess, I went in other directions. I started finding things that would pay me and pay me more money because I was in debt at the time. But that was the narrative thread that was the, “I have rent to pay and they have computers that aren’t behaving properly.” And that’s what dictated the shape of my career for a long time. It’s only in retrospect that I started to identify some of the things that aligns with it. But it’s easy to look at it with the shine of hindsight and not realize that no, no, that’s sort of retconning what happened in the past.
Frank: Yeah, I have a mentor, my former adviser, who had this way of describing building out the jankiest prototype you can to prove out an idea. And this manifested in his class in building out paper prototypes, or really, really janky ideas for what helping people through technology might look like. And I feel like in a lot of ways, even when those prototypes fail, like, in a career or some half-baked tech prototype I put together, it might succeed and great, we could keep building upon that, but when it fails, you actually discover, “Oh, this is one way that I didn’t succeed.” And even in doing so, you discover things about yourself, your way of building, and maybe a little bit about your infrastructure, or whatever it is that you build on a day-to-day basis. And wrapping that back to the original question, it’s like, well, we think we’re human beings, right, we’re static, but in a lot of ways we’re human becomings. We think we know what the future might look like with our careers, what we’re building on a day-to-day basis, and what we’re building a year from now, but oftentimes, things change if we discover things about ourselves, the people we work with, and ultimately, the things that we put out into the world.
Corey: Obviously, I’ve been aware of who Slack is, for a long time; I’ve been a paying customer for years because it basically is IRC with reaction gifs, and not having to teach someone how to sign into IRC when they work in accounting. So, the user experience alone solved the problem.
Frank: And you’ve actually worked with us in the past before. [laugh]. Slack, it’s the Searchable Log of All Conversation and Knowledge; I think that’s how the backronym works. And I was delighted when I had mentioned your jokes and your trolling of [folk 00:07:00] on Twitter and on your podcast to my former engineering manager, Chris Merrill, who was like, oh, you should search the Slack. Corey actually worked with us and he put together a lot of cool tooling and ideas for us to think about.
Corey: Careful. If we talk too much, or what I did when I was at Slack years ago, someone’s going to start looking into some of the old commits and whatnot and start demanding an apology, and we don’t want that. It’s, “Wow, you’re right. You are a terrible engineer.” “Told you.” There’s a reason I don’t do that anymore.
Frank: I think that’s all of us. [laugh]. An early-career mentor of mine, he was like, “Hey, Frank, listen. You think you’re building perfect software at any point in time? No, you’re building future tech debt.” And yeah, we should put much more emphasis on interfaces and ideas we’re putting out because the implementation is going to change over time, and likely your current implementation is shit. And that is okay.
Corey: That’s the beautiful part about this is that things grow and things evolve. And it’s interesting working with companies, and as a consultant, I tend to build my projects in such a way that I start on day one and people know that I’m leaving with usually a very short window because I don’t want to build a forever job for myself; I don’t want to show up and start charging by the hour or by the day, if I can possibly avoid it. Because then it turns into eternal projects that never end because I’m billing and nothing’s ever done. No, no, I like charging fixed fee and then getting out at a predetermined outcome, but then you get to hear about what happens with companies as they move on.
This combines with the fact that I have a persistent alert for my name, usually because I’m looking for various ineffective character assassination from enterprise marketing types because you know, I dish it out, I should certainly be able to take it. But I found a blog post on the Slack engineering blog that mentioned my name, and it’s, “Aw, crap. Are they coming after me for a refund?” No, it was not. It was you writing a fairly sizable post. Tell me more about that.
Frank: Yeah, I’m part of an organization called Developer Productivity. And our goal is to help folk at Slack deliver services to their customers, where we build, test, and release high-quality software. And a lot of our time is spent thinking about internal tooling and making infrastructure bets. As engineers, right, it’s like, we have this idea for what the world looks like, we have this idea for what our infrastructure looks like, but what we discover, using a set of techniques around observability of just asking questions—advanced questions, basic questions, and hell, even dumb questions—is that, hey, our computers aren’t actually doing what we think they’re doing. And the question is like, great. Now, what? How can we ask better questions? How can we better tune, change, and equip engineers with tooling so that they can do better work to make Slack customers have simple, pleasant, and productive experiences?
Corey: And I have to say that there’s a lot that Slack does that is incredibly helpful. I don’t know that I’m necessarily completely bought into the idea that all work should happen in Slack. It’s, well, on some level, I—like people like to debate the ‘should people work from home? Should people all work in an office?’ Discussion.
And, on some level, it seems if you look at people who are constantly fighting that debate online, it’s, “Do you ever do work at all?” on some level. But I’m not here to besmirch others; I’m here to talk about, on some level, what you alluded to in your blog post. But I want to start with a disclaimer that Slack, as far as companies go, is not small, and if you take a look around, most companies are using Slack whether they know it or not. The list of side-channel Slack groups people have tends to extend massively.
I look and I pare it down every once in a while, whenever I cross 40 signed-in Slacks on my desktop. It is where people talk for a wide variety of different reasons, and they all do different things. But if you’re sitting here listening to this and you have a $2,000-a-month AWS bill, this is not for you. You will spend orders of magnitude more money trying to optimize a small cost. Once you’re at significant points of scale, and you have scaled out to the point where you begin to have some ability to predict over months or years, that’s when a lot of this stuff starts to weigh in.
So, talk to me a bit about how you wound up—and let me quote directly from the article, which is titled, “Infrastructure Observability for Changing the Spend Curve,” and I will, of course, throw a link to this in the [show notes 00:11:38]. But you talk in this about knocking, I believe it was orders of magnitude off of various cost areas within your bill.
Frank: Yeah. The article itself describes three big-ish projects, where we are able to change the curve of the number of tests that we run, and a change in how much it costs to run any single test.
Corey: When you say test, are you talking CI/CD infrastructure test or code test, to make sure it goes out, or are you talking something higher up the stack, as far as, “Huh, let’s see how some users respond when, I don’t know, we send four notifications on every message instead of the usual one,” to give a ridiculous example?
Frank: Yeah, this is in the CI/CD pipelines. And one of these projects was around borrowing some concepts from data engineering: oversubscription and planning your capacity to have excess capacity at peak, where at peak, your engineers might have a 5% degradation in performance, while still maintaining high resiliency and reliability of your tests in order to oversubscribe either CPU or memory and keep throughput on the overall system stable and consistent and fast enough. I think, with spend in developer productivity, I think, both, like, the metrics you’re trying to move and why you’re optimizing for it at any given time are, like, this, like, calculus. Or it’s like, more art than science in that there’s no one right answer, right? It’s like, oh, yeah—very naively—like, yeah, let’s throw the biggest, most expensive machines we can at any given problem. But that doesn’t solve the crux of your problem. It’s like, “Hey, what are the things in your system doing?” And what is the right guess to capitalize around how much to spend on your CI/CD [unintelligible 00:13:39] is oftentimes not precise, nor is this blog article meant to be prescriptive.
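To make the oversubscription idea concrete for readers, here is a rough sketch of the arithmetic in Python. The instance shape, per-executor requirements, and oversubscription factors are illustrative assumptions, not Slack’s actual configuration.

```python
# Rough sketch of CI executor oversubscription math (illustrative numbers only;
# not Slack's actual instance sizes, executor footprints, or policies).

def executors_per_instance(vcpus, mem_gib, cpu_per_exec, mem_per_exec,
                           cpu_oversub=1.5, mem_oversub=1.0):
    """How many test executors fit on one instance if we oversubscribe CPU.

    cpu_oversub > 1.0 means we plan for more executors than the CPU could
    serve if every executor peaked at once; the bet is that peaks rarely
    align, and a small latency hit at true peak is acceptable.
    """
    by_cpu = int((vcpus * cpu_oversub) // cpu_per_exec)
    by_mem = int((mem_gib * mem_oversub) // mem_per_exec)
    return min(by_cpu, by_mem)  # the tighter of the two limits wins


# Example: a 36-vCPU, 72 GiB instance (roughly c5.9xlarge-shaped), and
# executors that want 4 vCPUs and 6 GiB each (made-up footprints).
strict = executors_per_instance(36, 72, cpu_per_exec=4, mem_per_exec=6, cpu_oversub=1.0)
packed = executors_per_instance(36, 72, cpu_per_exec=4, mem_per_exec=6, cpu_oversub=1.5)
print(f"strict packing: {strict} executors, oversubscribed: {packed} executors")
# More executors per instance means fewer instances for the same test
# throughput, at the cost of some degradation when everything peaks at once.
```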
Corey: Yeah, it depends entirely on what you’re doing and how because it’s, on some level, well, we can save a whole bunch of money if we slow all of our CI/CD runs down by 20 minutes. Yeah, but then you have a bunch of engineers sitting idle and I promise you, that costs a hell of a lot more than your cloud bill is going to be. The payroll is almost always a larger expense than your infrastructure costs, and if it’s not, you should seriously consider firing at least part of your data science team, but you didn’t hear it from me.
Frank: Yeah. And part of the exploration on profiling and performance and resiliency was, like, around interrogating what the boundaries and what the constraints were for our CI/CD pipelines. Because Slack has grown in engineering and in the number of tests we were running on a month-to-month basis; for a while, from 2017 to mid-2020, we were growing about 10% month-over-month in test suite execution numbers. Which means on a given year, we doubled almost two times, which is quite a bit of strain on internal resources and a lot of dependent services—and with internal systems, we oftentimes have more complexity and less-understood changes in what dependencies your infrastructure might be using, and what business logic your internal services are using to communicate with one another, than you do in production.
And so, by, like, performing a series of curiosity-driven development, we’re able to both answer, at that point in time, what our customers internally were doing, and start to put together ideas for eliminating some bottlenecks, and hell, even adding bottlenecks with circuit breakers where you keep the overall throughput of your system stable, while deferring or canceling work that otherwise might have overloaded dependencies.
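As a generic illustration of the circuit-breaker idea (a hypothetical sketch, not Slack’s implementation), here is roughly the shape of it: when a dependency keeps failing, new work gets deferred for a cool-off period instead of piling on.

```python
# Minimal circuit breaker sketch (hypothetical; not Slack's implementation).
# The idea: when a dependency is overloaded, stop sending it work for a
# cool-off period so overall throughput stays stable instead of collapsing.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown_seconds:
            # Cool-off elapsed: let traffic through again and reset.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success):
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # open: defer further work


breaker = CircuitBreaker()

def run_test_suite(dispatch, suite):
    """Run one suite via the given dispatch callable, deferring when open."""
    if not breaker.allow():
        return "deferred"  # re-queue instead of overloading the dependency
    try:
        dispatch(suite)
        breaker.record(success=True)
        return "ran"
    except Exception:
        breaker.record(success=False)
        return "failed"
```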
Corey: There’s a lot to be said for understanding what the optimization opportunities are in an environment, and understanding what it is you’re attempting to achieve. Having those tests for something like Slack makes an awful lot of sense because let’s be very clear here, when you’re building an application that acts as something people use to do expense reports—to cite one of my previous job examples—it turns out you can be down for a week and a majority of your customers will never know or care. With Slack, it doesn’t work that way. Everyone more or less has a continuous monitor that they’re typing into for a good portion of the day—angrily or otherwise—and as soon as it misses anything, people know. And if there’s one thing that I love, on some level, it’s seeing the change when I know that Slack is having a blip, even if I’m not using Slack that day for anything in particular, because Twitter explodes about it. “Slack is down. I’m now going to tweet some stuff to my colleagues.” All right. You do you, I suppose.
And credit where due, Slack doesn’t go down nearly as often as it used to because as you tend to figure out how these things work, operational maturity increases through a bunch of tests. Fixing things like durability, reliability, uptime, et cetera, should always, to some extent, take precedence priority-wise over let’s save some money. Because yeah, you could turn everything off and save all the money, but then you don’t have a business anymore. It’s focused on where to cut, where to optimize in the right way, and ideally as you go, find some of the areas in which, oh, I’m paying AWS a tax for just going about my business. And I could have flipped a switch at any point and saved—“How much money? Oh, my God, that’s more than I’ll make in my lifetime.”
Frank: Yeah, and one thing I talk about a little bit is distributed tracing as one of the drivers for helping us understand what’s happening inside of our systems. Where it helps you figure out and it’s like this… [best word 00:17:24] to describe how you ask questions of deployed code? And in a lot of ways it’s helped us understand existing bottlenecks and identify opportunities for performance or resiliency gains because your past janky Band-Aids become more and more obvious when you can interrogate and ask questions around: is it performing like it used to? Or, what has changed recently?
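Slack’s internal tracing stack isn’t described in the conversation, but as a sketch of what “asking questions of deployed code” can look like, here is a CI step instrumented with OpenTelemetry-style spans. This assumes the opentelemetry-api package; the span names, attributes, and helper steps are hypothetical.

```python
# Hypothetical OpenTelemetry-style instrumentation of a CI step.
# The point is that each unit of work records enough context to later
# answer "is this performing like it used to, and what changed?"
from opentelemetry import trace

tracer = trace.get_tracer("ci.pipeline")

# Stand-in steps so the sketch runs; a real pipeline would do actual work here.
def checkout(git_sha): ...
def build(suite): ...
def run_tests(suite): return 0

def run_suite(suite_name, git_sha, runner_type):
    with tracer.start_as_current_span("test-suite") as span:
        # Attributes make it possible to slice latency by suite, commit, runner.
        span.set_attribute("ci.suite", suite_name)
        span.set_attribute("ci.git_sha", git_sha)
        span.set_attribute("ci.runner_type", runner_type)

        with tracer.start_as_current_span("checkout"):
            checkout(git_sha)
        with tracer.start_as_current_span("build"):
            build(suite_name)
        with tracer.start_as_current_span("run-tests") as tests:
            failures = run_tests(suite_name)
            tests.set_attribute("ci.test_failures", failures)

run_suite("backend-unit", "abc123", "c5.9xlarge")
```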
Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals: having the highest-quality content in tech and cloud skills, and building a good community that is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. It’s both useful for individuals and large enterprises, but here’s what makes it new. I don’t use that term lightly. Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks you’ll have a chance to prove yourself. Compete in four unique lab challenges, where they’ll be awarding more than $2,000 in cash and prizes. I’m not kidding, first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey. C-O-R-E-Y. That’s cloudacademy.com/corey. We’re gonna have some fun with this one!
Corey: It’s also worth pointing out that as systems grow organically, it is almost impossible for any one person to have it all in their head anymore. I saw one of the most overly complicated architecture flow trees that I think I’ve seen in recent memory, and it was on the Slack engineering blog about how something was architected, but it wasn’t the Slack app itself; it was simply the [decision tree for ‘Should we send a notification?’ 00:18:17] and it is more complicated than almost anything I’ve written, except maybe my newsletter content publication pipeline. It is massive. And I’ll throw a link to that in the [show notes 00:18:31] as well, just because it is well worth people taking a look at.
But there is so much complexity at scale for doing the right thing, and it’s necessary because if I’m talking to you on Slack right now and getting notifications every time you reply on my phone, it’s not going to take too long before I turn off notifications everywhere, and then I don’t notice that Slack is there, and it just becomes useless and I use something else. Ideally, something better—which is hard to come by—or something moderately worse, like, email, or completely worse, like, Microsoft Teams.
Frank: I tell all my close collaborators about this. I typically set myself away on Slack because I like to make time for deep, focused work. And that’s very hard with a constant stream of notifications. How people use Slack and how people notify others on Slack is, like, not incumbent on the software itself, but it’s a reflection of the work culture that you’re in. The expectation for an email-driven culture is, like, oh, yeah, you should be reading your email all the time and be able to respond within 30 minutes. Peace, I have friends that are lawyers, [laugh] and that is the expectation at all times of day.
Corey: I married one of those. Oh, yeah, people get very salty. And she works with a global team spread everywhere, to the point where she wakes up and there’s just a whole flurry of angry people that have tried to reach her in the middle of the night. Like, “Why were you sleeping at 2 a.m.? It’s daytime here.” And yeah, time zones. Not everyone understands how they work, from my estimation.
Frank: [laugh]. That’s funny. My sweetheart is a former attorney. On our first international date, we spent an entire day-and-a-half hopping between WiFi spots in Prague so that she could answer a five minute question from a partner about standard deviations.
Corey: So, one thing that you link to that really is what drew my notice to this—because, again, if you talk about AWS cost optimization, I’m probably going to stumble over it, but if you mention my name, that’s sort of a nice accelerator—and you linked to my article called “Right Sizing Your Instances Is Nonsense.” And that is a little overblown, to some extent, but so many folks talk about it in the cost optimization space because you can get a bunch of metrics and do these things programmatically, and somewhat without observability into what’s going on because, “Well, I can see how busy the computers are and if it’s not busy, we could use smaller computers. Problem solved,” versus, the things that require a fair bit of insight into what is that thing doing exactly because it leads you into places of oh, turn off that idle fleet that’s not doing anything but is all labeled ‘backup,’ where you’re going to have three seconds of notice before it gets all the traffic.
There’s an idea of sometimes things are the way they are for a reason. And it’s also not easy for a lot of things—think databases—to seamlessly just restart the thing and have it scale back up and run on a different instance class. That takes weeks of planning and it’s hard. So, I find that people tend to reach for it where it doesn’t often make sense. At your level of scale and operational maturity, of course, you should optimize what instance classes things are using and what sizes they are, especially since that stuff changes over time as far as what AWS has made available. But it’s not the sort of thing that I suggest as being the first easy thing to go for. It’s just what people think is easy because it requires no judgment and computers can do it. At least that’s their opinion.
Frank: I feel like you probably have a lot more experience than me, and talked about war stories, but I recall working with customers where they want to lift-and-shift on-prem hardware to VMs on-prem. I’m like, “It’s not going to be as simple as you’re making it out to be.” Whereas, like, the trend today is probably oh, yeah, we’re going to shift on-prem VMs to AWS, or hell, like, let’s go two levels deeper and just run everything on Kubernetes. Similar workloads, right? It’s not going to be a huge challenge. Or [laugh] everything serverless.
Corey: Spare me from that entire school of thought, my God.
Frank: [laugh].
Corey: Yeah, but it’s fun, too, because this came out a month ago, and you’re talking about using—an example you gave was a c5.9xlarge instance. Great. Well, the c6i is out now as well, so are people going to look at that someday and think, “Oh, wow. That’s incredibly quaint.”
It’s, you wrote this a month ago, and it’s already out of date, as far as what a lot of the modern story instances are. From my perspective, one of the best things that AWS has done in this space has been to get away from the reserved instance story and over into savings plans, where it’s, “I know, I’m going to run some compute—maybe it’s Fargate, maybe it’s EC2; let’s be serious, it’s definitely going to be EC2—but I don’t want to tie myself to specific instance types for the next three years.” Great, well, I’m just going to commit to spending some money on AWS for the next three years because if I decide today to move off of it, it’s going to take me at least that long to get everything out. So okay, then that becomes something a lot more palatable for an awful lot of folks.
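A bit of napkin math shows why committing to spend rather than to instance types is attractive. The discount rate and spend figures below are made-up illustrations, not actual AWS pricing.

```python
# Back-of-the-envelope comparison of on-demand vs. a compute savings plan.
# The 30% discount and the spend figures are illustrative assumptions,
# not actual AWS pricing; the point is the shape of the trade-off.

on_demand_hourly = 120.00      # steady-state compute spend, $/hour (made up)
savings_plan_discount = 0.30   # assumed effective discount for a commitment

committed_hourly = on_demand_hourly * (1 - savings_plan_discount)
annual_on_demand = on_demand_hourly * 24 * 365
annual_committed = committed_hourly * 24 * 365

print(f"on-demand:  ${annual_on_demand:,.0f}/year")
print(f"committed:  ${annual_committed:,.0f}/year "
      f"(saves ${annual_on_demand - annual_committed:,.0f})")
# Unlike an instance reservation, the commitment is to dollars per hour of
# compute, so moving between instance families (c5 to c6i) or even to
# Fargate doesn't strand the commitment.
```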
Frank: One thing you brought up in the article I linked to is instance types. You might think upgrading to the newest instance type will solve all your challenges, but oftentimes it won’t, at least not all the time, and in fact, you might even see degraded resiliency and degraded performance because different packages that your software relies upon might not be optimized for the given kernel or CPU type that you’re running against. And ultimately, you go back to just asking really basic questions and performing some end-to-end benchmarking so that you can at least get a sense for what your customers are doing today, and maybe make a guess for what they’re going to do tomorrow.
Corey: I have to ask because I’m always interested in what it is that gives rise to blog posts like this—which, that’s easy; it’s someone had to do a project on these things, and while we learn things that would probably apply to other folks—like, you’re solving what is effectively a global problem locally when you go down this path. It’s part of the reason I have a consulting business is things I learned at one company apply almost identically to another company, even though that they’re in completely separate industries and parts of the world because AWS billing is, for better or worse, a bounded problem space despite their best efforts to, you know, use quantum computers to fix that. What was it that gave rise to looking at the CI/CD system from an optimization point of view?
Frank: So internally, I initially started writing a white paper about, hey, here’s a simple question that we can answer, you know, without too much effort. Let’s transition all of our C3 instances to C5 instances, and that could have been the one-and-done. But by thinking about it a little more and kind of drawing it out, we realized we could actually borrow a model for oversubscription from another field and potentially decrease our spend by quite a bit. That eventually [laugh] evolved into a 70-page white paper—no joke—that my former engineering manager said, “Frank, no one’s going to [BLEEP] read this.” [laugh].
Corey: Always. Always, always. Like, here’s a whole bunch of academic research and the rest. It’s like, “Great. Which of these two buttons do I press?” is really the question people are getting at. And while it’s great to have the research and the academic stuff, it’s also a, “Great, we’re trying to achieve an outcome, so what is the choice?” But it’s nice to know that people are doing actual research on the back end, instead of, “Eh, my gut tells me to take the path on the left because why not? Left is better; right’s tricky friend.”
Frank: Yeah. And it was like, “Oh, yeah. I accidentally wrote a really long thing because there was, like, a lot of variables to test.” I think we had spun up 16-plus auto-scaling groups. And ran something like the cross-section of a couple of representative test suites against them, as well as configurations for a number of executors per instance.
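For a sense of the shape of that experiment, here is a sketch of how such a benchmarking matrix might be enumerated. The instance types, executor counts, and suite names are placeholders, not the actual configurations from the white paper.

```python
# Sketch of enumerating a CI benchmarking matrix: auto-scaling group
# configurations crossed with representative test suites. All values are
# placeholders, not the configurations from the white paper.
from itertools import product

instance_types = ["c5.4xlarge", "c5.9xlarge", "c5.18xlarge", "m5.4xlarge"]
executors_per_instance = [4, 8, 12, 16]
test_suites = ["backend-unit", "frontend-unit", "integration"]

# One auto-scaling group per (instance type, executor count) pair...
asg_configs = list(product(instance_types, executors_per_instance))
print(f"{len(asg_configs)} auto-scaling group configurations")  # 16 here

# ...then run each representative suite against each group and record
# throughput, latency, and failure rates for comparison.
for (instance_type, executors), suite in product(asg_configs, test_suites):
    print(f"run {suite} on {instance_type} with {executors} executors")
```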
And about a year ago, I translated that into a ten-page blog article that, when I read through it, I really didn’t enjoy. [laugh]. And that ten-page blog article is ultimately, like, about a page in the article you’re reading today. And the actual kick in the butt to get this out the door was about four months ago. I spoke at o11ycon, which you’re a part of.
And it was a vendor conference by Honeycomb, and it was just so fun to share some of the things we’ve been doing with distributed tracing, and how we were able to solve internal problems using a relatively simple idea of asking questions about what was running. And the entire team there was wonderful in coaching and just helping me think through what questions people might have of this work. And that was, again, former academic. The last time I spoke at a conference was about a decade earlier, and it was just so fun to be part of this community of people trying to all solve the same set of problems, just in their own unique ways.
Corey: One of the things I loved about working with Honeycomb was the fact that whenever I asked them a question, they have instrumented their own stuff, so they could tell me extremely quickly what something was doing, how it was doing it, and what the overall impact on this was. It’s very rare to find a client that is anywhere near that level of awareness into what’s going on in their infrastructure.
Frank: Yeah, and that blog article, right, it’s like, here’s our current perspective, and here’s, like, the current set of projects we’re able to make to get to this result. And we think we know what we want to do, but if you were to ask that same question, “What are we doing for our spend a year from now?” the answer might be very different. Probably similar in some ways, but probably different.
Corey: Well, there are some principles that we’ll never get away from. It’s, “Is no one using the thing? Turn that shit off.” That’s one of those tried and true things. “Oh, it’s the third copy of that multiple-petabyte data thing? Maybe delete it or stuff it in a deep archive.” It’s maybe move data less between various places. Maybe log things fewer times, given that you’re paying 50 cents per gigabyte ingest, in some cases. Et cetera, et cetera, et cetera. There’s a lot to consider as far as the general principles go, but the specifics, well, that’s where it gets into the weeds. And at your scale, yeah, having people focus on this internally with the context and nuance to it is absolutely worth doing. Having a small team devoted to this at large companies will pay for itself, I promise. Now, I go in and advise in these scenarios, but past a certain point, this can’t just be one person’s part-time gig anymore.
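That “50 cents per gigabyte ingest” figure lends itself to quick napkin math; the daily log volume below is a made-up example.

```python
# Napkin math on log ingestion cost at $0.50/GB ingested (the rate Corey cites).
# The daily volume is a made-up example to show how quickly this adds up.

ingest_price_per_gb = 0.50
daily_log_volume_gb = 2_000  # 2 TB/day of logs, hypothetical

daily_cost = daily_log_volume_gb * ingest_price_per_gb
monthly_cost = daily_cost * 30

print(f"${daily_cost:,.0f}/day, roughly ${monthly_cost:,.0f}/month just to ingest")
# Logging the same event fewer times, or sampling noisy logs, scales this
# down linearly before storage and retention costs even enter the picture.
```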
Frank: I’m kind of curious about that. How do you think about working with a company and then deprecating yourself, and allowing your tools and, like, the frameworks you put into place to continue, like, thrive?
Corey: We’re advisory only. We make no changes to production.
Frank: Or I don’t know if that’s the right word, deprecate. I think… that’s my own word. [laugh].
Corey: No, no, it’s fair. It’s a—what we do is we go in and we are advisory. It’s less of a cost engagement, more of an architecture engagement because in cloud, cost and architecture are the same thing. We look at what’s going on, we look at the constraints of why we’ve been brought in, and we identify things that companies can do and the cost savings associated with that, and let them make their own decision. Because it’s, if I come in and say, “Hey, you could save a bunch of money by migrating this whole subsystem to serverless.”
Great, I sound like a lunatic evangelist because yeah, 18 months of work during which time the team doing that is not advancing the state of the business any further so it’s never going to happen. So, why even suggest it? Just look at things that are within the bounds of possibility. Counterpoint: when a client says, “A full re-architecture is on the table,” well, okay, that changes the nature of what we’re suggesting. But we’re trying to get away from what a lot of tooling does, which is, “Great. Here’s 700 things you can adjust and you’ll do none of them.” We come back with a, “Here’s three or four things you can do that’ll blow 20% off the bill. Then let’s see where you stand.” The other half of it, of course, is large scale enterprise contract negotiation, that’s a bit of a horse of a different color. I want to thank you so much for taking the time to speak with me today. I really do appreciate it. If folks want to hear more about what you’re up to, and how you think about these things. Where can they find you?
Corey: Oh, inviting people to yell at you at Twitter. That’s never a great plan. Yeesh. Good luck. Thanks again. We’ve absolutely got to talk more about this in-depth because I think this is one of those areas where you have the folks above a certain point of scale talk about these things semi-constantly and live in the space, whereas folks who are in relatively small-scale environments are listening to this and thinking that they’ve got to do this.
And no. No, you do not want to spend millions of dollars of engineering effort to optimize a bill that’s 80 grand a year, I promise. It’s focus on the thing that’s right for your business. At a certain point of scale, this becomes that. But thank you so much for being so generous with your time. I appreciate it.
Frank: Thank you so much, Corey.
Corey: Frank Chen, senior staff software engineer at Slack. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that seems to completely miss the fact that Microsoft Teams is free because it sucks.
Frank: [laugh].
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: It seems like there is a new security breach every day. Are you confident that an old SSH key, or a shared admin account, isn’t going to come back and bite you? If not, check out Teleport. Teleport is the easiest, most secure way to access all of your infrastructure. The open source Teleport Access Plane consolidates everything you need for secure access to your Linux and Windows servers—and I assure you there is no third option there. Kubernetes clusters, databases, and internal applications like AWS Management Console, Yankins, GitLab, Grafana, Jupyter Notebooks, and more. Teleport’s unique approach is not only more secure, it also improves developer productivity. To learn more visit: goteleport.com. And not, that is not me telling you to go away, it is: goteleport.com.
Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And—let me be clear here—it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself all while gaining the networking load, balancing and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free, you can do things like run small scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free, no asterisk. Start now. Visit snark.cloud/oci-free that's snark.cloud/oci-free.
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. Several people are undoubtedly angrily typing, and part of the reason they can do that, and the fact that I know that is because we’re all using Slack. My guest today is Frank Chen, senior staff software engineer at Slack. So, I guess, sort of… [sales force 00:00:53]. Frank, thanks for joining me.
Frank: Hey, Corey, I have been a longtime listener and follower, and just really delighted to be here.
Corey: It’s one of the weird things about doing a podcast is that for better or worse, people don’t respond to it in the same way that they do writing a newsletter, for example, because you receive an email, and, “Oh, well, I know how to write an email. I can hit reply and send an email back and give that jackwagon a piece of my mind,” and people often do. But with podcasts, I feel like it’s much more closely attuned to the idea of an AM radio talk show. And who calls into a radio talk show? Lunatics, and most people don’t self-describe as lunatics, so they don’t want to do that.
But then when I catch up with people one-on-one or at events in person, I find out that a lot more people listen to this show than I thought they did. Because I don’t trust podcast statistics because lies, damn lies, and analytics are sort of how I view this world. So, you’ve worked at a bunch of different companies. You’re at Slack now, which, of course, upsets some people because, “Slack is ruining the way that people come and talk to me in the office.” Or it’s making it easier for employees to collaborate internally in ways their employers wish they wouldn’t. But that’s neither here nor there.
Before this, you were at Palantir, and before this, you’re at Amazon, working on Amazon WorkDocs of all things, which is supposedly rumored to have at least one customer somewhere, but I’ve never seen them. Before that you were at Sandia National Labs, and you’ve gotten a master’s in computer science from Stanford. You’ve done a lot of things and everything you’ve done, on some level, seems like the recurring theme is someone on Twitter will be unhappy at you for a career choice you’ve made. But what is the common thread—in seriousness—between the different places that you’ve been?
Frank: One thing that’s been a driver for where I work is finding amazing people to work with and building something that I believe is valuable and fun to keep doing. The thing that brought me to Slack is I became my own Slack admin, [laugh] when I met a girl and we moved in together into a small apartment in Brooklyn. And she had a cat that, you know, is a sweetheart, but also just doesn’t know how to be social. Yes, you covered that with ‘cat.’ Part of moving it together, I became my own Slack admin and discovered well, we can build a series of home automations to better train and inform our little command center for when the cat lies about being fed, or not fed, clipping his nails, and discovering and tracking bad behaviors. In a lot of ways this was like the human side of a lot of the data work that I had been doing at my previous role. And it was like a fun way to use the same frameworks that I use at work to better train and be a cat caretaker.
Corey: Now, at some point, you know that some product manager at Amazon is listening to this and immediately sketching notes because their product strategy is, “Yes,” and this is going to be productized and shipping in two years as Amazon Prime Meow. But until then we’ll enjoy the originality of having a Slack bot more or less control the home automation slash making your house seem haunted for anyone who didn’t write the code themselves. There's an idea of solving real world problems that I definitely understand. I mean, and again, it might not even be a fair question entirely. Just because I am… for better or worse, staggering through my world, and trying—and failing most days—to tell a narrative that, “Oh, why did I start my tech career at a university, and then spend time in ad tech, and then spend time in consulting, and then FinTech, and the rest?” And the answer is, “Oh, I get fired an awful lot, and that sucked.”
So, instead of going down that particular rabbit hole of a mess, I went in other directions. I started finding things that would pay me and pay me more money because I was in debt at the time. But that was the narrative thread that was the, “I have rent to pay and they have computers that aren’t behaving properly.” And that’s what dictated the shape of my career for a long time. It’s only in retrospect that I started to identify some of the things that aligns with it. But it’s easy to look at it with the shine of hindsight and not realize that no, no, that’s sort of retconning what happened in the past.
Frank: Yeah, I have a mentor and my former adviser had this way of describing, building out the jankiest prototype you can to prove out an idea. And this manifested in his class in building out paper prototypes, or really, really janky ideas for what helping people through technology might look like. And I feel like it a lot of ways, even when those prototypes fail, like, in a career or some half baked tech prototype I put together, it might succeed and great, we could keep building upon that, but when it fails, you actually discover, “Oh, this is one way that I didn’t succeed.” And even in doing so, you discover things about yourself, your way of building, and maybe a little bit about your infrastructure, or whatever it is that you build on a day-to-day basis. And wrapping that back to the original question, it’s like, well, we think we’re human beings, right, we’re static, but in a lot of ways we’re human becomings. We think we know what the future might look like with our careers, what we’re building on a day-to-day basis, and what we’re building a year from now, but oftentimes, things change if we discover things about ourselves, the people we work with, and ultimately, the things that we put out into the world.
Corey: Obviously, I’ve been aware of who Slack is, for a long time; I’ve been a paying customer for years because it basically is IRC with reaction gifs, and not having to teach someone how to sign into IRC when they work in accounting. So, the user experience alone solved the problem.
Frank: And you’ve actually worked with us in the past before. [laugh]. Slack, it’s the Searchable Log for all Content and Knowledge; I think that backronym, that’s how it works. And I was delighted when I had mentioned your jokes and you’re trolling [a folk 00:07:00] on Twitter and on your podcast to my former engineering manager, Chris Merrill, who was like, oh, you should search the Slack. Corey actually worked with us and he put together a lot of cool tooling and ideas for us to think about.
Corey: Careful. If we talk too much, or what I did when I was at Slack years ago, someone’s going to start looking into some of the old commits and whatnot and start demanding an apology, and we don’t want that. It’s, “Wow, you’re right. You are a terrible engineer.” “Told you.” There’s a reason I don’t do that anymore.
Frank: I think that’s all of us. [laugh]. An early career mentor of mine, he was like, “Hey, Frank, listen. You think you’re building perfect software at any point in time? No, you’re building future tech debt.” And yeah, we should put much more emphasis on interfaces and ideas we’re putting out because the implementation is going to change over time, and likely your current implementation is shit. And that is, okay.
Corey: That’s the beautiful part about this is that things grow and things evolve. And it’s interesting working with companies, and as a consultant, I tend to build my projects in such a way that I start on day one and people know that I’m leaving with usually a very short window because I don’t want to build a forever job for myself; I don’t want to show up and start charging by the hour or by the day, if I can possibly avoid it. Because then it turns into eternal projects that never end because I’m billing and nothing’s ever done. No, no, I like charging fixed fee and then getting out at a predetermined outcome, but then you get to hear about what happens with companies as they move on.
This combines with the fact that I have a persistent alert for my name, usually because I’m looking for various ineffective character assassination from enterprise marketing types because you know, I dish it out, I should certainly be able to take it. But I found a blog post on the Slack engineering blog that mentioned my name, and it’s, “Aw, crap. Are they coming after me for a refund?” No, it was not. It was you writing a fairly sizable post. Tell me more about that.
Frank: Yeah, I’m part of an organization called Developer Productivity. And our goal is to help folk at Slack deliver services to their customers, where we build, test, and release high quality software. And a lot of our time is spent thinking about internal tooling and making infrastructure bets. As engineers, right, it’s like, we have this idea for what the world looks like, we have this idea for what our infrastructure looks like, but what we discover using a set of techniques around observability of just asking questions—advanced questions, basic questions, and hell, even dumb questions—we discover hey, the things that we think our computers are doing aren’t actually doing what they say they’re doing. And the question is like, great. Now, what? How can we ask better questions? How can we better tune, change, and equip engineers with tooling so that they can do better work to make Slack customers have simple, pleasant, and productive experiences?
Corey: And I have to say that there’s a lot that Slack does that is incredibly helpful. I don’t know that I’m necessarily completely bought into the idea that all work should happen in Slack. It’s, well, on some level, I—like people like to debate the ‘should people work from home? Should people all work in an office?’ Discussion.
And, on some level, it seems if you look at people who are constantly fighting that debate online, it’s, “Do you ever do work at all?” on some level. But I’m not here to besmirch others; I’m here to talk about, on some level, what you alluded to in your blog post. But I want to start with a disclaimer that Slack as far as companies go is not small, and if you take a look around, most companies are using Slack whether they know it or not. The list of side-channel Slack groups people have tend to extend massively.
I look and I pare it down every once in a while, whenever I cross 40 signed-in Slacks on my desktop. It is where people talk for a wide variety of different reasons, and they all do different things. But if you’re sitting here listening to this and you have a $2,000 a month AWS bill, this is not for you. You will spend orders of magnitude more money trying to optimize a small cost. Once you’re at significant points of scale, and you have scaled out to the point where you begin to have some ability to predict over months or years, that’s what a lot of this stuff starts to weigh in.
So, talk to me a bit about how you wound up—and let me quote directly from the article, which is titled, “Infrastructure Observability for Changing the Spend Curve,” and I will, of course, throw a link to this in the [show notes 00:11:38]. But you talk in this about knocking, I believe it was orders of magnitude off of various cost areas within your bill.
Frank: Yeah. The article itself describes three big-ish projects, where we are able to change the curve of the number of tests that we run, and a change in how much it costs to run any single test.
Corey: When you say test, are you talking CI/CD infrastructure test or code test, to make sure it goes out, or are you talking something higher up the stack, as far as, “Huh, let’s see how some users respond when, I don’t know, we send four notifications on every message instead of the usual one,” to give a ridiculous example?
Frank: Yeah, this is in the CI/CD pipelines. And one of these projects was around borrowing some concepts from data engineering: oversubscription and planning your capacity to have access capacity at peak, where at peak, your engineers might have a 5% degradation in performance, while still maintaining high resiliency and reliability of your tests in order to oversubscribe, either CPU or memory and keep throughput on the overall system stable and consistent and fast enough. I think, with spend in developer productivity, I think, both, like, the metrics you’re trying to move and why you’re optimizing for it at any given time are, like, this, like, calculus. Or it’s like, more art than science in that there’s no one right answer, right? It’s like, oh, yeah—very naively—like, yeah, let’s throw the biggest machines most expensive machines we can at any given problem. But that doesn’t solve the crux of your problem. It’s like, “Hey, what are the things in your system doing?” And what is the right guess to capitalize around how much to spend on your CI/CD [unintelligible 00:13:39] is oftentimes not precise, nor is this blog article meant to be prescriptive.
Corey: Yeah, it depends entirely on what you’re doing and how because it’s, on some level, well, we can save a whole bunch of money if we slow all of our CI/CD runs down by 20 minutes. Yeah, but then you have a bunch of engineers sitting idle and I promise you, that costs a hell of a lot more than your cloud bill is going to be. The payroll is almost always a larger expense than your infrastructure costs, and if it’s not, you should seriously consider firing at least part of your data science team, but you didn’t hear it from me.
Frank: Yeah. And part of the exploration on profiling and performance and resiliency was, like, around interrogating what the boundaries and what the constraints were for our CI/CD pipelines. Because Slack has grown in engineering and in the number of tests we were running on a month-to-month basis; for a while from 2017 to mid 2020, we were growing about 10% month-over-month in test suite execution numbers. Which means on a given year, we doubled almost two times, which is quite a bit of strain on internal resources and a lot of dependent services where—and internal systems, we oftentimes have more complexity and less understood changes in what dependencies your infrastructure might be using, what business logic your internal services are using to communicate with one another than you do your production.
And so, by, like, performing a series of curiosity-driven development, we’re able to both answer, at that point in time, what our customers internally were doing, and start to put together ideas for eliminating some bottlenecks, and hell, even adding bottlenecks with circuit breakers where you keep the overall throughput of your system stable, while deferring or canceling work that otherwise might have overloaded dependencies.
Corey: There’s a lot to be said for understanding what the optimization opportunities are, in an environment and understanding what it is you’re attempting to achieve. Having those test for something like Slack makes an awful lot of sense because let’s be very clear here, when you’re building an application that acts as something people use to do expense reports—to cite one of my previous job examples—it turns out you can be down for a week and a majority of your customers will never know or care. With Slack, it doesn’t work that way. Everyone more or less has a continuous monitor that they’re typing into for a good portion of the day—angrily or otherwise—and as soon as it misses anything, people know. And if there’s one thing that I love, on some level, seeing change when I know that Slack is having a blip, even if I’m not using Slack that day for anything in particular, because Twitter explodes about it. “Slack is down. I’m now going to tweet some stuff to my colleagues.” All right. You do you, I suppose.
And credit where due, Slack doesn’t go down nearly as often as it used to because as you tend to figure out how these things work, operational maturity increases through a bunch of tests. Fixing things like durability, reliability, uptime, et cetera, should always, to some extent, take precedence priority-wise over let’s save some money. Because yeah, you could turn everything off and save all the money, but then you don’t have a business anymore. It’s focused on where to cut, where to optimize in the right way, and ideally as you go, find some of the areas in which, oh, I’m paying AWS a tax for just going about my business. And I could have flipped a switch at any point and saved—“How much money? Oh, my God, that’s more than I’ll make in my lifetime.”
Frank: Yeah, and one thing I talk about a little bit is distributed tracing as one of the drivers for helping us understand what’s happening inside of our systems. Where it helps you figure out and it’s like this… [best word 00:17:24] to describe how you ask questions of deployed code? And there a lot of ways it’s helped us understand existing bottlenecks and identify opportunities for performance or resiliency gains because your past janky Band-Aids become more and more obvious when you can interrogate and ask questions around what is it performing like it used to? Or what has changed recently?
Corey: This episode is sponsored in part by something new. Cloud Academy is a training platform built on two primary goals. Having the highest quality content in tech and cloud skills, and building a good community the is rich and full of IT and engineering professionals. You wouldn’t think those things go together, but sometimes they do. Its both useful for individuals and large enterprises, but here's what makes it new. I don’t use that term lightly. Cloud Academy invites you to showcase just how good your AWS skills are. For the next four weeks you’ll have a chance to prove yourself. Compete in four unique lab challenges, where they’ll be awarding more than $2000 in cash and prizes. I’m not kidding, first place is a thousand bucks. Pre-register for the first challenge now, one that I picked out myself on Amazon SNS image resizing, by visiting cloudacademy.com/corey. C-O-R-E-Y. That’s cloudacademy.com/corey. We’re gonna have some fun with this one!
Corey: It’s also worth pointing out that as systems grow organically, that it is almost impossible for any one person to have it all in their head anymore. I saw one of the most overly complicated architecture flow trees that I think I’ve seen in recent memory, and it was on the Slack engineering blog about how something was architected, but it wasn’t the Slack app itself; it was simply the [decision tree for ‘Should we send a notification?’ 00:18:17] and it is more complicated than almost anything I’ve written, except maybe my newsletter content publication pipeline. It is massive. And I’ll throw a link to that in the [show notes 00:18:31] as well, just because it is well worth people taking a look at.
But there is so much complexity at scale for doing the right thing, and it’s necessary because if I’m talking to you on Slack right now and getting notifications every time you reply on my phone, it’s not going to take too long before I turn off notifications everywhere, and then I don’t notice that Slack is there, and it just becomes useless and I use something else. Ideally, something better—which is hard to come by—or something moderately worse, like email, or completely worse, like Microsoft Teams.
Frank: I tell all my close collaborators about this. I typically set myself to away on Slack because I like to make time for deep, focused work. And that’s very hard with a constant stream of notifications. How people use Slack and how people notify others on Slack is, like, not incumbent on the software itself; it’s a reflection of the work culture that you’re in. The expectation in an email-driven culture is, like, oh yeah, you should be reading your email all the time and be able to respond within 30 minutes. I have friends that are lawyers, [laugh] and that is the expectation at all times of day.
Corey: I married one of those. Oh, yeah, people get very salty. And she works with a global team spread everywhere, to the point where she wakes up and there’s just a whole flurry of angry people that have tried to reach her in the middle of the night. Like, “Why were you sleeping at 2 a.m.? It’s daytime here.” And yeah, time zones. Not everyone understands how they work, from my estimation.
Frank: [laugh]. That’s funny. My sweetheart is a former attorney. On our first international date, we spent an entire day-and-a-half hopping between WiFi spots in Prague so that she could answer a five minute question from a partner about standard deviations.
Corey: So, one thing that you linked to is really what drew my notice to this—because, again, if you talk about AWS cost optimization, I’m probably going to stumble over it, but if you mention my name, that’s sort of a nice accelerator—and you linked to my article, “Right Sizing Your Instances Is Nonsense.” And that is a little overblown, to some extent, but so many folks talk about it in the cost optimization space because you can get a bunch of metrics and do these things programmatically, and somewhat without observability into what’s going on because, “Well, I can see how busy the computers are, and if they’re not busy, we could use smaller computers. Problem solved,” versus the things that require a fair bit of insight into what that thing is doing exactly, because it leads you into places of, oh, turn off that idle fleet that’s not doing anything—it’s all labeled ‘backup,’ and you’re going to have three seconds of notice before it gets all the traffic.
There’s an idea of sometimes things are the way they are for a reason. And it’s also not easy for a lot of things—think databases—to seamlessly just restart the thing and have it scale back up and run on a different instance class. That takes weeks of planning and it’s hard. So, I find that people tend to reach for it where it doesn’t often make sense. At your level of scale and operational maturity, of course, you should optimize what instance classes things are using and what sizes they are, especially since that stuff changes over time as far as what AWS has made available. But it’s not the sort of thing that I suggest as being the first easy thing to go for. It’s just what people think is easy because it requires no judgment and computers can do it. At least that’s their opinion.
Frank: I feel like you probably have a lot more experience than me, and more war stories, but I recall working with customers who wanted to lift-and-shift on-prem hardware to VMs on-prem. I’m like, “It’s not going to be as simple as you’re making it out to be.” Whereas, like, the trend today is probably, oh yeah, we’re going to shift on-prem VMs to AWS, or hell, let’s go two levels deeper and just run everything on Kubernetes. Similar workloads, right? It’s not going to be a huge challenge. Or [laugh] everything serverless.
Corey: Spare me from that entire school of thought, my God.
Frank: [laugh].
Corey: Yeah, but it’s fun, too, because this came out a month ago, and an example you gave was a c5.9xlarge instance. Great. Well, the c6i is out now as well, so people are going to look at that someday and think, “Oh, wow. That’s incredibly quaint.”
You wrote this a month ago, and it’s already out of date as far as what the current instances are. From my perspective, one of the best things that AWS has done in this space has been to get away from the reserved instance story and over to savings plans, where it’s, “I know I’m going to run some compute—maybe it’s Fargate, maybe it’s EC2; let’s be serious, it’s definitely going to be EC2—but I don’t want to tie myself to specific instance types for the next three years.” Great, well, I’m just going to commit to spending some money on AWS for the next three years, because if I decide today to move off of it, it’s going to take me at least that long to get everything out. So okay, then that becomes something a lot more palatable for an awful lot of folks.
Frank: One thing you brought up in the article I linked to is instance types. It seems like upgrading to the newest instance type will solve all your challenges, but oftentimes it won’t, and in fact, you might even see degraded resiliency and degraded performance, because different packages that your software relies upon might not be optimized for the given kernel or CPU type that you’re running against. And ultimately, you go back to asking really basic questions and performing some end-to-end benchmarking, so that you can at least get a sense for what your customers are doing today, and maybe make a guess at what they’re going to do tomorrow.
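To make the end-to-end benchmarking point concrete, here is a minimal sketch of the kind of harness you might run unchanged on each candidate instance type and then compare. The workload function and run count are placeholders, not Frank’s actual test suites:

```python
# Minimal end-to-end benchmark sketch: run the same representative workload
# on each candidate instance type and compare the percentiles.
import statistics
import time


def run_workload() -> None:
    # Placeholder for a representative job, e.g. one CI test suite.
    sum(i * i for i in range(1_000_000))


def benchmark(runs: int = 20) -> None:
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_workload()
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = statistics.median(samples)
    p95 = samples[int(0.95 * (len(samples) - 1))]
    print(f"p50={p50:.3f}s p95={p95:.3f}s over {runs} runs")


if __name__ == "__main__":
    benchmark()
```

The point of keeping it this boring is that the only variable between runs is the instance underneath; if the percentiles move, the hardware moved them.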
Corey: I have to ask, because I’m always interested in what it is that gives rise to blog posts like this—which, that’s easy; it’s that someone had to do a project on these things and learned things along the way that would probably apply to other folks—you’re solving what is effectively a global problem locally when you go down this path. It’s part of the reason I have a consulting business: things I learned at one company apply almost identically to another company, even though they’re in completely separate industries and parts of the world, because AWS billing is, for better or worse, a bounded problem space, despite their best efforts to, you know, use quantum computers to fix that. What was it that gave rise to looking at the CI/CD system from an optimization point of view?
Frank: So internally, I initially started writing a white paper about, hey, here’s a simple question that we can answer, you know, without too much effort: let’s transition all of our C3 instances to C5 instances, and that could have been the one-and-done. But by thinking about it a little more and kind of drawing it out—well, we can actually borrow a model for oversubscription from another field, and we could potentially decrease our spend by quite a bit. That eventually [laugh] evolved into a 70-page white paper—no joke—that my former engineering manager said, “Frank, no one’s going to [BLEEP] read this.” [laugh].
Corey: Always. Always, always. Like, here’s a whole bunch of academic research and the rest. It’s like, “Great. Which of these two buttons do I press?” is really the question people are getting at. And while it’s great to have the research and the academic stuff, it’s also a, “Great, we’re trying to achieve an outcome, so what is the choice?” But it’s nice to know that people are doing actual research on the back end, instead of, “Eh, my gut tells me to take the path on the left because why not? Left is better; right’s tricky friend.”
Frank: Yeah. And it was like, “Oh, yeah. I accidentally wrote a really long thing because there were, like, a lot of variables to test.” I think we had spun up 16-plus auto-scaling groups and ran something like the cross-section of a couple of representative test suites against them, as well as configurations for the number of executors per instance.
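As a rough illustration of the test matrix Frank describes—instance types crossed with executors-per-instance and representative suites—here is a minimal sketch of how you might enumerate the configurations and the cost side of the oversubscription trade-off. The instance prices, executor counts, and suite names are hypothetical, not Slack’s actual configuration:

```python
# Sketch of a benchmark matrix: instance type x executors-per-instance x suite.
# All numbers below are illustrative placeholders.
from itertools import product

instance_types = {"c5.4xlarge": 0.68, "c5.9xlarge": 1.53}  # USD/hour, illustrative
executors_per_instance = [8, 16, 32]
suites = ["backend-unit", "integration"]  # hypothetical suite names

for (itype, price), n_exec, suite in product(
    instance_types.items(), executors_per_instance, suites
):
    # Cost per executor-hour falls as you oversubscribe, but wall-clock time
    # per suite may rise; the benchmark has to measure both sides.
    print(
        f"{suite:14s} {itype:12s} x{n_exec:3d} executors -> "
        f"${price / n_exec:.4f} per executor-hour"
    )
```

Pairing a table like this with measured suite durations is what lets you pick the cheapest configuration that still meets the latency bar, rather than the biggest box by default.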
And about a year ago, I translated that into a ten-page blog article that, when I read through it, I really didn’t enjoy. [laugh]. And that ten-page blog article is ultimately, like, about a page in the article you’re reading today. And the actual kick in the butt to get this out the door came about four months ago, when I spoke at o11ycon, which you were a part of.
And it was a vendor conference put on by Honeycomb, and it was just so fun to share some of the things we’ve been doing with distributed tracing, and how we were able to solve internal problems using the relatively simple idea of asking questions about what was running. And the entire team there was wonderful in coaching me and just helping me think through what questions people might have about this work. And that was—again, I’m a former academic; the last time I spoke at a conference was about a decade earlier—and it was just so fun to be part of this community of people all trying to solve the same set of problems, just in their own unique ways.
Corey: One of the things I loved about working with Honeycomb was the fact that whenever I asked them a question, they have instrumented their own stuff, so they could tell me extremely quickly what something was doing, how it was doing it, and what the overall impact on this was. It’s very rare to find a client that is anywhere near that level of awareness into what’s going on in their infrastructure.
Frank: Yeah, and that blog article, right, it’s like, here’s our current perspective, and here’s, like, the current set of projects we were able to do to get to this result. And we think we know what we want to do, but if you were to ask that same question—“What are we doing for our spend a year from now?”—the answer might be very different. Probably similar in some ways, but probably different.
Corey: Well, there are some principles that we’ll never get away from. It’s, “Is no one using the thing? Turn that shit off.” That’s one of those tried-and-true things. “Oh, it’s the third copy of that multi-petabyte data thing? Maybe delete it or stuff it in a deep archive.” It’s maybe move data between places less. Maybe log things fewer times, given that you’re paying 50 cents per gigabyte of ingest, in some cases. Et cetera, et cetera, et cetera. There’s a lot to consider as far as the general principles go, but the specifics, well, that’s where it gets into the weeds. And at your scale, yeah, having people focus on this internally with the context and nuance to do it is absolutely worth doing. Having a small team devoted to this at large companies will pay for itself, I promise. Now, I go in and advise in these scenarios, but past a certain point, this can’t just be one person’s part-time gig anymore.
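As one concrete example of the “stuff it in a deep archive” principle, here is a minimal boto3 sketch that adds an S3 lifecycle rule moving old objects to Glacier Deep Archive instead of keeping a third hot copy. The bucket name, prefix, and 90-day window are assumptions for illustration only:

```python
# Sketch: add a lifecycle rule that transitions cold backup copies to
# Glacier Deep Archive. Bucket name, prefix, and days are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-backup-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-cold-copies",
                "Status": "Enabled",
                "Filter": {"Prefix": "backups/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    },
)
```

A rule like this is the kind of one-time switch Corey alludes to: it keeps paying for itself every month without anyone having to remember to clean up.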
Frank: I’m kind of curious about that. How do you think about working with a company and then deprecating yourself, allowing your tools and, like, the frameworks you put into place to continue to, like, thrive?
Corey: We’re advisory only. We make no changes to production.
Frank: Or I don’t know if that’s the right word, deprecate. I think… that’s my own word. [laugh].
Corey: No, no, it’s fair. What we do is we go in and we are advisory. It’s less of a cost engagement and more of an architecture engagement, because in cloud, cost and architecture are the same thing. We look at what’s going on, we look at the constraints of why we’ve been brought in, and we identify things that companies can do and the cost savings associated with them, and let them make their own decision. Because if I come in and say, “Hey, you could save a bunch of money by migrating this whole subsystem to serverless.”
Great, I sound like a lunatic evangelist, because yeah, that’s 18 months of work, during which time the team doing it is not advancing the state of the business any further, so it’s never going to happen. So, why even suggest it? Just look at things that are within the bounds of possibility. Counterpoint: when a client says, “A full re-architecture is on the table,” well, okay, that changes the nature of what we’re suggesting. But we’re trying to get away from what a lot of tooling does, which is, “Great. Here’s 700 things you can adjust, and you’ll do none of them.” We come back with a, “Here’s three or four things you can do that’ll blow 20% off the bill. Then let’s see where you stand.” The other half of it, of course, is large-scale enterprise contract negotiation; that’s a bit of a horse of a different color. I want to thank you so much for taking the time to speak with me today. I really do appreciate it. If folks want to hear more about what you’re up to and how you think about these things, where can they find you?
Frank: You can find me at frankc.net, or at @frankc on Twitter.
Corey: Oh, inviting people to yell at you on Twitter. That’s never a great plan. Yeesh. Good luck. Thanks again. We’ve absolutely got to talk more about this in depth, because I think this is one of those areas where folks above a certain point of scale talk about these things semi-constantly and live in the space, whereas folks who are in relatively small-scale environments are listening to this and thinking that they’ve got to do this.
And no. No, you do not want to spend millions of dollars of engineering effort to optimize a bill that’s 80 grand a year, I promise. Focus on the thing that’s right for your business. At a certain point of scale, this becomes that. But thank you so much for being so generous with your time. I appreciate it.
Frank: Thank you so much, Corey.
Corey: Frank Chen, senior staff software engineer at Slack. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that seems to completely miss the fact that Microsoft Teams is free because it sucks.
Frank: [laugh].
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.