Shifting from Observability 1.0 to 2.0 with Charity Majors

Episode Summary

This week on Screaming in the Cloud, Corey is joined by good friend and colleague, Charity Majors. Charity is the CTO and Co-founder of Honeycomb.io, the wildly popular observability platform. Corey and Charity discuss the ins and outs of observability 1.0 vs. 2.0, why you should never underestimate the power of software to get worse over time, and the hidden costs of observability that could be plaguing your monthly bill right now. The pair also shares secrets on why speeches get better the more you give them and the basic role they hope AI plays in the future of computing. Check it out!

Episode Show Notes & Transcript



Show Highlights:

(00:00) - Reuniting with Charity Majors: A Warm Welcome
(03:47) - Navigating the Observability Landscape: From 1.0 to 2.0
(04:19) - The Evolution of Observability and Its Impact
(05:46) - The Technical and Cultural Shift to Observability 2.0
(10:34) - The Log Dilemma: Balancing Cost and Utility
(15:21) - The Cost Crisis in Observability
(22:39) - The Future of Observability and AI's Role
(26:41) - The Challenge of Modern Observability Tools
(29:05) - Simplifying Observability for the Modern Developer
(30:42) - Final Thoughts and Where to Find More


About Charity

Charity is an ops engineer and accidental startup founder at honeycomb.io. Before this she worked at Parse, Facebook, and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of O'Reilly's Database Reliability Engineering, and loves free speech, free software, and single malt scotch.

Links:

Transcript

Charity Majors: And a lot of it has to do with the 1.0 vs the 2.0 stuff, you know, like the more sources of truth that you have, and the more you have to dance between them, the less value you get out, because the less you can actually correlate, the more the actual system is held in your head, not your tools.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn, and I am joined, after a long hiatus from the show, by my friend and yours, Charity Majors, who probably needs no introduction, but we can't assume that. Charity is and remains the co-founder and CTO of Honeycomb.io. Charity, it's been a few years since we spoke in public.

How are ya?

Charity Majors: It really has been a few years since we bantered publicly. I'm doing great, Corey!

Corey: We are both speaking, a few weeks from this recording, at SREcon San Francisco, which is great. And I think we both found out after the fact that, wait a minute, this is like a keynote plenary opening session.

Oh dear. That means I won't be able to phone it in. Given that you are at least as peculiar as I am when it comes to writing processes, I have to ask: have you started building your talk yet? 'Cause I haven't.

Charity Majors: That's funny. No, I have not.

Corey: Good. Good. I always worry it's just me. There was some Twitter kerfuffle years ago about how it's so rude when speakers get up and say they didn't build their talk until recently.

Like, we're all sitting here listening to you; we deserve better than that. That doesn't mean it hasn't been weighing on me, or that I haven't been thinking about it for months, but I'm not gonna sit down and write slides until I'm forced to.

Charity Majors: I've been angsting about it. I've had nightmares.

Does that count? I feel like that should count for something. You know, I feel like that gets lumped in the same bucket as, speakers should never give the same talk more than once, that's just rude, we are paying to be here. And it's like, calm down, kids. Okay, everyone's a critic, but it's a lot of work.

We also have our day jobs, and we all have a process that works for us. And, you know, if you don't like the product that I am here delivering to you, you don't have to come see me ever again. That's cool. You don't have to invite me to your conference ever again. I would completely understand if you made those choices, but leave me to my life choices.

Corey: My, my presenter notes are generally bullet points of general topics to talk about, so I don't think I've ever given the same talk twice. Sure. The slides might be the same, but I'll at least try and punch up the title a smidgen.

Charity Majors: Well, I've given the same talk, not, not verbatim, but several times, and actually, I think it gets better every time, because you, you lean into the material, you learn it more, you learn how an audience likes to interact with it more.

I've had people request that I give the same talk again, and they're like, oh god, I love this one, and I'm like, oh cool, I've got some updated material there, and they're like, I really want my team to see this, and so, you know, I think you can do it in a lazy way, um, and you can, but just because you're doing it doesn't mean you don't care.

Which I think is the root of the criticism. Oh, you don't care.

Corey: Yeah, this is a new talk for me. It's about the economics of on-prem versus cloud, which I assure you I've been thinking about for a long time and answering questions on for the past seven years, but I never put it together into a talk. And I'm somewhat annoyed, as I'm finally starting to put it together, that I hear reports that VMware-slash-Broadcom is now turning someone's $8 million renewal into a $100 million renewal.

It's like, well, suddenly that just throws any nuanced take out the window. Yeah, when you're 11x-ing your bill to run on-prem, yeah, suddenly move to the cloud. You can do it the dumbest possible way and still come out financially ahead. Not something I usually get to say.

Charity Majors: Well, you know, the universe provides.

Corey: You have been talking a fair bit lately around the concept of going from observability 1.0 to observability 2.0. It's all good. Well, if nothing else, at least you people are using decent forms of semantic versioning. Good for you. But what does that mean here for the rest of us who are not drowning in the world of understanding what our applications are doing at any given moment?

Charity Majors: You know, it was kind of an offhand comment that I made. You spit a bunch of shit out into the world, and every now and then people pick up on a strand and they start really pulling on it. This is a strand that I feel like people have been really pulling on, and the source is, of course, the mass confusion and delusion that has been everyone's idea of what observability means over the past, like, 10 years. I feel like, you know, Christine and I started talking about observability in 2016, and we were lucky enough to be the only ones at the time.

And we had this whole idea that, you know, there was a technical definition of observability, you know, high cardinality, high dimensionality, it's about unknown unknowns. And that lasted for a couple few years, but 2020-ish, all the big players started paying attention and flooding in their money, and they're like, well, we do observability too.

There's three pillars. Now it can mean literally anything. So the definition of observability that I've actually been kind of homing in on recently is that it's a property of complex systems, just like reliability. You can improve your observability by adding metrics, if you don't have any. You can improve your observability by improving your code.

You can improve your observability by educating your team or sharing some great dashboard links, right? But it remains the fact that there's kind of a giant, discontinuous step function in capabilities and usability and a bunch of other things that we've experienced and that our users report experiencing.

And so we kind of need some way to differentiate between the three-pillars world and what I think, and hope, is the world of the future. And they're very discontinuous, because it starts at the very bottom. It starts with how you collect and store data. With Observability 1.0, you've got three pillars, at least.

You're probably collecting and storing your data way more than three times. You've got your APM, you've got your web monitoring stuff, you know, you might be collecting and storing your telemetry half a dozen different times, paying every time, and nothing really connects them except you sitting in the middle, eyeballing these different graphs and trying to correlate them in your head based on past scar tissue, for the most part.

And 2.0 is based on a single source of truth: these arbitrarily wide structured data blobs. You know, you can call them canonical logs, you can call them spans or traces, but you can derive metrics from those, you can derive traces from those, you can derive logs from those, and the thing that connects them is that it's data.

So you can slice and dice, you can dive down, you can zoom out, you can treat it just like fucking data. And so we've been starting to refer to these as Observability 1.0 and 2.0, and I think a lot of people have found this very clarifying.
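To make the idea of an arbitrarily wide structured event concrete, here is a minimal sketch in Python of what one such record might look like. The field names, the checkout service, and the print-to-stdout shipping step are illustrative assumptions, not Honeycomb's SDK or any particular vendor's format.

    import json
    import time
    import uuid

    def handle_request(request, db, flags):
        """Build one wide, structured event per request (a canonical log line).

        Everything about the request lives on the same record, so metrics,
        traces, and logs can all be derived from it later instead of being
        collected and stored separately.
        """
        event = {
            "timestamp": time.time(),
            "trace.trace_id": str(uuid.uuid4()),  # lets the event double as a span
            "service.name": "checkout",           # illustrative service name
            "http.route": request["route"],
            "user.id": request["user_id"],
            "build.id": "abc123",                 # hypothetical deploy metadata
            "feature_flag.new_pricing": flags.get("new_pricing", False),
        }
        start = time.monotonic()
        try:
            event["db.rows_returned"] = db.query(request)
            event["http.status_code"] = 200
        except Exception as exc:
            event["http.status_code"] = 500
            event["error.message"] = str(exc)
            raise
        finally:
            event["duration_ms"] = (time.monotonic() - start) * 1000
            print(json.dumps(event))  # ship to whatever event store you use

Because every attribute rides on the same event, the later sketches in this transcript (grouping, sampling, and the inside-versus-baseline comparison) can all operate on the same records.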

Corey: What is the boundary between 1.0 and 2.0? Because, you know, with vendors doing what vendors are going to do, if the term Observability 2.0 catches on, they're just going to drape themselves in its trappings. But what is it that does the differentiation?

Charity Majors: There's a whole bunch of things that sort of collect around both. Like for 1.0, you know, the observability tends to be about MTTR, MTTD, reliability. And for 2.0, it's very much about what underpins the software development life cycle.

But the filter that you can apply to tell whether something is 1.0 or 2.0 is how many times you store your data. If it's greater than one, you're in observability 1.0 land. But the reason that I find this so helpful is that there's a lot of stuff that clusters around each. In 1.0, you tend to page yourself a lot, because you're paging on symptoms, because you rely on those pages to help you debug your systems.

In 2.0 land, you typically have SLOs. So I feel like to get to observability 2.0, you need both a change in tools and a change in mentality and practices. Because it's about hooking up these tight feedback loops, the ones where you're instrumenting your code as you go, getting it into production, and then looking at your code through the lens of the instrumentation you just wrote, asking yourself, is it doing what I expected it to do, and does anything else look weird?

And you can get those loops so tight and so fast that you're reliably finding things before your users can find them in production, you're finding bugs before your users find them. And you can't do that without the practices, and you can't do that without the tools either.

Because if you're in 1.0 land, you know, you're trying to predict which custom metrics you might need. You're logging out a crapload of stuff. But there's a lot of guesswork involved, and there's a lot of pattern matching involved. And in 2.0 land, it doesn't require a lot of knowledge up front about how your system's going to behave, because you can just slice and dice.

You can break down by build ID, break down by feature flag, break down by device type, break down by, you know, canary. Anything you need, you can just explore and see exactly what the data tells you, every time. You don't need all this prior knowledge of your systems. And so you can ask yourself, is my code doing what I expect it to do?

Does anything else look weird? Even if you honestly have no idea what the code is supposed to be doing, it's that whole known-unknowns versus unknown-unknowns thing.
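As a rough illustration of that slice-and-dice-without-prior-knowledge point, here is a small Python sketch over a handful of wide events like the one above, using pandas. The sample events, column names, and thresholds are invented for the example; this is not a Honeycomb query, just the shape of the idea.

    import pandas as pd

    # A couple of toy wide events; in practice you would load these from
    # wherever your events are stored.
    events = [
        {"build.id": "abc123", "feature_flag.new_pricing": True,
         "http.route": "/export", "http.status_code": 200, "duration_ms": 1840.0},
        {"build.id": "abc123", "feature_flag.new_pricing": False,
         "http.route": "/checkout", "http.status_code": 500, "duration_ms": 95.0},
    ]
    df = pd.DataFrame(events)

    # "Is my code doing what I expect?" -- break p99 latency down by build ID
    # and feature flag, even though neither was pre-declared as a metric.
    latency = (
        df.groupby(["build.id", "feature_flag.new_pricing"])["duration_ms"]
          .quantile(0.99)
          .sort_values(ascending=False)
    )
    print(latency.head(10))

    # "Does anything else look weird?" -- same raw data, a different pivot.
    errors_by_route = (
        df[df["http.status_code"] >= 500]
          .groupby("http.route")  # or device type, canary, region...
          .size()
          .sort_values(ascending=False)
    )
    print(errors_by_route.head(10))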

Corey: Yeah, I've been spending the last few months running, uh, Kubernetes locally. I've built myself a Kubernetes of my very own, given that I've been fighting against that tide for far too long, and I have a conference talk coming up where I committed to making fun of it, and turns out that's not going to be nearly as hard as I was worried it would be.

It's something of a target-rich environment. But one of the things I'm seeing is that this cluster isn't doing a whole heck of a lot, but it's wordy, and figuring out why a thing is happening requires ongoing levels of instrumentation in strange and different ways. Some things work super well because of what they're designed for, and what their imagined use cases are.

For example, I instrumented this with Honeycomb, among other things. And I've yet to be able to get it to reliably spit out the CPU temperature of the nodes, just because that's not something OTel usually thinks about in a cloud-first world. I've also been spending fun times figuring out why the storage subsystem would just eat itself from time to time, with what appears to be no rhyme or reason to it.

And I was checking with Axiom, where I was just throwing all of my logs from the beginning on this thing, and in less than two months it's taken in 150 gigs of data. I'm thinking of that in terms of just data egress charges. That is an absurd amount of data/money for a cluster that frankly isn't doing anything.

So it's certainly wordy about it, but it's not doing anything.

Charity Majors: Yeah, this is why the logs topic is so fraught. You can spend so much money doing f*** all. You know, and the log vendors are like, I always get the Monty Python, every sperm is sacred, when they're like, every log line must be kept. And I'm like, yeah, I bet it does, you know?

Because, like, people are so afraid of sampling, but this is the shit you sample, the shit that means absolutely f*** all, right? Or health checks. You know, in a microservices environment, 25 percent of your requests might be health checks. So when we come out and say, sample, we're not saying sample your billing API requests; it's sample the bullshit that's just getting spat out for no reason, and log with intent the stuff that you care about, and keep that.

But the whole logging mindset is spammy and loud, and it's full of trash, frankly. When you're just emitting everything you might possibly think about, you know, then you can't really correlate anything, you can't really do anything with that data, it doesn't really mean anything. But when you take sort of the canonical logging/tracing approach, you know, you can spend very little money but get a lot of very rich, intentful data.
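A rough sketch of what "sample the bullshit, log with intent" could look like in code follows. The route names, rates, and the send callback are made up for illustration; this is not any vendor's sampler, just one way a head-sampling decision might be expressed.

    import random

    # Illustrative policy: keep everything you would grieve losing, and
    # aggressively sample the noise. All routes and rates here are invented.
    SAMPLE_RATES = {
        "/healthz": 100,  # keep roughly 1 in 100 health checks
        "/metrics": 100,
        "default": 10,    # keep roughly 1 in 10 of ordinary traffic
    }

    def should_keep(event):
        """Return (keep, sample_rate) for one wide event."""
        # Never sample what matters: errors and billing traffic stay at 1:1.
        if event.get("http.status_code", 200) >= 500:
            return True, 1
        if event.get("http.route", "").startswith("/billing"):
            return True, 1
        rate = SAMPLE_RATES.get(event.get("http.route"), SAMPLE_RATES["default"])
        return random.randrange(rate) == 0, rate

    def emit(event, send):
        keep, rate = should_keep(event)
        if keep:
            event["sample_rate"] = rate  # so counts can be re-weighted later
            send(event)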

Corey: I also find that, with it spewing out these logging events in a bunch of different places, I have no idea where it's storing any of this internally to the cluster. I'm sure I'll find out if a disk fills up, if that alarm can get through or anything else. But the painful piece that I keep smacking into is that all of this wordiness, all this verbosity, is occluding anything that could actually be important signal, and there have been some of those during some of the experiments I'm running.

I love the fact, for example, that by default you can run kubectl events or kubectl get events, and those are not the same thing, because why would they be? And kubectl get events loves to just put them in apparently non-deterministic order, and the reasoning behind that is, well, if we've seen something a bunch of times, do we show it at the beginning or at the end of that list?

It's a hard decision to make. That's great. I thought a bunch of things happened in the last 30 seconds. Why is that all hidden by stuff that happened nine days ago? It's obnoxious.

Charity Majors: That’s a great question. Boy, you should be a, you should be a Kubernetes designer, Corey.

Corey: No, because apparently I still have people who care what I have to say, and I want people to think well of me.

That's apparently a non-starter for doing these things. It's awful. I don't mean to attack people for the quality of their work, but the output is not good. Things that I thought Kubernetes was supposed to do out of the box. Like, alright, let's fail a node. I'll yank the power cord out of the thing.

Because why not? And it just sort of sits there forever, because I didn't take additional steps on a given workload to make sure that it would time out itself and then respawn on something else. I wasn't aware every single workload needed to do that. The fact that it does is more than a little disturbing.

Charity Majors: Yes, it is.

Corey: So things are great over here in Kubernetes land, as best I can tell, but I've been avoiding it for a decade, and I'm coming here and looking at all of this, and it's, what exactly have you people been doing? Because these seem like the basic problems that I was dealing with back when I worked on servers, when we were just running VMs and pushing those around in the days before containers.

Now I keep thinking there's something I'm missing, but I'm getting more and more concerned that I'm not sure that there is.

Charity Majors: You know, never underestimate software's ability to get worse.

Corey: I will say that instrumenting it with Honeycomb was a heck of a lot easier than it was when I tried to use Honeycomb to instrument a magically bespoke-architected serverless thing running on Lambda and some other stuff.

Because it turns out, when you're running an architecture that a sane company might actually want to deploy, then, yeah, okay, suddenly you're back on the golden path of where most observability folks are going. A lot less of solving it myself. And, let's be fair, you folks have improved the onboarding experience; the documentation begins to make a lot more sense now.

And what it says it's going to do is generally what happens in the environment. So, gold star for you.

Charity Majors: That is high praise. Thank you, Corey. Yeah, we put some effort, some real muscle and elbow grease, into Kubernetes last fall, right before KubeCon. Because, like you said, it's the golden path. It's the path everyone's going down.

And for a long time, we kind of avoided that because, honestly, we're not an infrastructure tool. We are for understanding your applications, from the perspective of your applications, for the most part. But a very compelling argument was made that, you know, Kubernetes is kind of an application itself.

It's your distributed system, so it actually does kind of matter when you need to like, pull down an artifact, or you need to do a rolling, you know, restart, or when all these things are happening. So, we tried to make that, you know, pretty, pretty easy, and I'm glad to hear things have gotten better. Uh, you mentioned the cost thing, I'd like to circle back to that briefly.

I recently wrote an article about observability, "The Cost Crisis in Observability" is what it's called, because a lot of people have been kind of hot under the collar about their, we won't name specific vendors, but their bills lately when it comes to observability. The more I listened, the more I realized they aren't actually upset about the bill itself.

I mean maybe they're upset about the bill itself. What they're really upset about is the fact that the value that they're getting out of these tools has become radically decoupled from the amount of money that they're paying. And as their bill goes up, the value they get out does not go up. In fact, as the bill goes up, the value they get out often goes down.

And so I wrote a blog post about why that is. And a lot of it has to do with the 1.0 versus 2.0 stuff. You know, like the more sources of truth that you have, and the more you have to dance between them, the less value you get out, because the less you can actually correlate, the more the actual system is held in your head, not your tools.

With logs, as you just talked about, the more you're logging, the more spans you have, the higher your bill is, the harder it is to get shit out, the slower your full-text search becomes, and the more you have to know what you're looking for before you can search for the thing that you need to find.

Corey: And where they live. I will name a vendor name: CloudWatch is terrible in this sense. Fifty cents per gigabyte ingested, though at re:Invent they just launched a 25-cent-a-gigabyte ingest option with a lot less utility. Great. And the only way to turn some things off from logging to CloudWatch at 50 cents a gigabyte is to remove the ability for whatever it is to talk to CloudWatch at all.

That is absurd. That is one of the most ridiculous things I've seen. And I've got to level with you, CloudWatch is not that great of an analysis tool. It's just not. I know they're trying with CloudWatch Logs Insights and all the other stuff, but they're failing.

Charity Majors: Wow, yeah, I mean, you can't always just solve it at the network level.

That's a solution you can't always reach for. You can also solve it at the power-socket level, but most of us prefer other levels of solving our distributed systems problems.

Corey: It's awful. It's one of those areas where I have all this data going to all these different places, and even when I trace it, it still gets very tricky to understand what that is.

When I work on client environments, there's always this question of, okay, there's an awful lot of cross AZ traffic, an awful lot of egress traffic. What is that? And very often the answer is tied to observability in some form.

Charity Majors: Yeah, yeah, yeah. No, for sure. For sure. That, like, in our systems nowadays, like, the problem is very rarely debugging the code.

It's very often where in the system is the code that I need to debug?

Corey: It's the murder mystery aspect to it.

Charity Majors: It's the murder mystery for sure. But metrics are even worse than logs when it comes to this shit, because, number one, your bill is so opaque. In my article, I used the example of a friend of mine who was going through a bill and realized that he had individual metrics that were costing him $30,000 a month.

Because they were getting hit too much, or, you know, there are all kinds of things that can do this, and it's not visible in your bill; you have to really take a microscope to it. And I was repeating this to someone last Friday, and he was like, oh, I wish it only cost us $30,000 a month. He's like, over the weekend, some folks deployed some metrics and they were costing us $10,000 apiece.

And I was just like, Oh my God. And there's no way to tell in advance before you deploy one of these, you just have to deploy and hope that you're watching closely. And that isn't even the worst of it. You have to predict in advance every single custom metric that you need to collect. Every single combination, permutation of attributes, you have to predict in advance.

And then you can never connect any two metrics again, ever. Because you've discarded all of that at write time. You can't correlate anything from one metric to the next. You can't tell if this spike in this metric is the same as that spike in that metric. You can't tell. You have to predict them up front.

You have no insight into how much they're going to cost. You have barely any insight into how much they did cost. And you have to constantly sort of reap them, because your cost goes up at minimum linearly with the number of new custom metrics that you create. Which means there's a real hard cap on the number that you're willing to pay for.
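To see why the bill climbs the way Charity describes, here is a back-of-the-envelope sketch of metrics-style billing. Every number, including the per-series price, is invented purely to show the shape of the math, not any vendor's actual pricing.

    # Most metrics-style billing charges per unique time series, and every
    # combination of tag values on a metric becomes its own series.
    endpoints = 200        # distinct http.route values (invented)
    status_codes = 10
    regions = 6
    customer_tiers = 4

    series_for_one_metric = endpoints * status_codes * regions * customer_tiers
    print(series_for_one_metric)            # 48,000 series from a single metric

    price_per_series_month = 0.05           # hypothetical $ per series per month
    print(series_for_one_metric * price_per_series_month)  # $2,400 per month

    # Add one more tag -- say a build ID that churns with every deploy -- and
    # the series count multiplies again, which is how a single metric quietly
    # turns into the kind of $10,000-$30,000 surprise described above.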

And some poor fucker has to just sit there manually combing through them every day and picking out which ones they think they can afford to sacrifice. And when it comes to using metrics, all you have, what it feels like honestly, is the command line with grep and bc. You can do math on the metrics and you can search on the tags, but never the twain shall meet.

And you can't combine them, you can't use any other data types, and it's just like, this is a fucking mess. So, talking about the bridge from Observability 1.0 to 2.0, that bridge is just the log. So while I have historically said some very mean things about logs, the fact is that that is the continuum that people need to be building the future on.

Metrics are the past. Metrics are so hobbled. They got us here, right? As my therapist would say, they're what got us here, but they won't get us to where we need to go. Logs will get us to where we need to go, you know, as we structure them, as we make them wider, as we make them less messy, as we stitch together all of the things that happen over the course of a run, as we add IDs so that we can trace with them and use spans with them.

That is the bridge that we have to the future. And what the cost model looks like in the future is very different. It's not exactly cheap, I'm not gonna lie and say that it's ever cheap, but the thing about it is that as you pay more, as your bill goes up, the value you get out of it goes up too. Well, for Honeycomb, at least; I can't speak for all observability 2.0 vendors.

But it doesn't matter how wide the event is, say hundreds of dimensions per request. We don't care. We encourage you to, because the wider your events and your logs are, the more valuable they will be to you. And if you instrument your code more, adding more spans, presumably that's because you have decided that those spans are valuable to have in your observability tool.

And when it comes to really controlling costs, your levers are much more powerful, because you can do dynamic sampling. You can say, okay, all the health checks, all the Kubernetes spam, sample that at a rate of, you know, one in a hundred. And then one of the things we do at Honeycomb is you can attach a sample rate to every event.

So we do the math for you in the UI: if you're sampling at one in a hundred, we multiply every event by a hundred when it comes to counting it, so you can still see what the actual traffic rate is.
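A minimal sketch of that re-weighting math, under the same assumed field names as the earlier sketches: each kept event carries the sample rate it survived, and counts are multiplied back up. This is the general idea, not Honeycomb's internals.

    def weighted_count(events):
        """Estimate true traffic volume from sampled events.

        Each kept event carries the sample_rate it was kept at (1 = unsampled,
        100 = one-in-a-hundred), so it stands in for that many real requests.
        """
        return sum(e.get("sample_rate", 1) for e in events)

    def weighted_error_rate(events):
        errors = sum(e.get("sample_rate", 1)
                     for e in events
                     if e.get("http.status_code", 200) >= 500)
        total = weighted_count(events)
        return errors / total if total else 0.0

    # Three health checks kept at 1:100 plus forty unsampled requests count as
    # 3 * 100 + 40 = 340 requests, not 43.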

Corey: And that's the problem: people view this as all or nothing, where you've either got to retain everything, or, as soon as you start sampling, people think you're going to start sampling things like transaction data.

And that doesn't work. So it requires a little bit of tuning and tweaking at scale.

Charity Majors: But it doesn't require constant tuning and tweaking the way it does in Observability 1.0, and you don't have to sit there and scan your list of metrics after every weekend, and you don't have to do dumb sampling like the sampling story in Observability 1.0. I understand why people hate that, because you're usually dealing with consistent hashes, or log levels, or you're just commenting out blocks of code. Yeah, I agree, that sucks a lot.

Corey: Now, I have to ask the difficult question here. Do you think there's an AI story down this path? Because on some level, when something explodes back in Observability 1.0 land, for me at least, it's, okay, at 3 o'clock everything went nuts. What spiked or changed right around then that might be a potential leading indicator of where to start digging? And naively, I feel like that's the sort of thing that something with a little bit of smarts behind it might be able to dive into.

But again, math and generative AI don't seem to get along super well yet.

Charity Majors: You know, I do think that there are a lot of really interesting AI stories to be told, but I don't think we need it for that. You've seen BubbleUp: if there's anything on any of your graphs, any of your heatmaps, that you think is interesting in Honeycomb, you draw a little bubble around it, and it computes all the dimensions inside the bubble versus the baseline outside it.

You're like, ooh, what's this? You draw a little thing, and it's like, oh, these are all requests that are going to /export, they all have a two-meg blob size, they're coming from these three customers, they're going to this region, this language pack, this build ID, with this feature flag turned on.
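As a rough sketch of the idea behind that inside-the-bubble versus baseline comparison, and not Honeycomb's actual BubbleUp algorithm, something like the following ranks which attribute values are overrepresented in the events you selected. The attribute names and the threshold in the usage note are assumptions carried over from the earlier sketches.

    from collections import Counter

    def compare_selection(events, in_selection, attrs):
        """Rank attribute values that are overrepresented inside a selection.

        `in_selection` is a predicate for the weird events you drew the bubble
        around (say, slow requests); everything else is the baseline.
        """
        inside = [e for e in events if in_selection(e)]
        outside = [e for e in events if not in_selection(e)]
        results = []
        for attr in attrs:
            in_counts = Counter(e.get(attr) for e in inside)
            out_counts = Counter(e.get(attr) for e in outside)
            for value, n in in_counts.items():
                in_frac = n / max(len(inside), 1)
                out_frac = out_counts[value] / max(len(outside), 1)
                results.append((in_frac - out_frac, attr, value))
        # The biggest gaps are the "these are all /export requests from three
        # customers with this feature flag on" style answers.
        return sorted(results, key=lambda r: r[0], reverse=True)[:10]

    # Usage sketch:
    # compare_selection(events, lambda e: e["duration_ms"] > 2000,
    #                   ["http.route", "user.id", "build.id"])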

And so much of debugging is that: here's the thing I care about, because it's paging me; how is it different from everything else I don't care about? And you can do this with SLOs, too. This is how people start debugging with Honeycomb: you get paged about an SLO violation, you go straight to the SLO, and it shows you, oh, the events that are violating the SLO are all the ones going to this cluster, to this read replica, using this client.

This is easy shit, and we're still in a place where people need to do this. People probably do need to use AI in Observability 1.0 land, just because they're dealing with so much noise, and again, the connective tissue that they need to just get this out of the data, they don't have anymore, because they threw it away at write time.

But when you have the connective tissue, you can just do this with normal numbers. Now, I do think there are a lot of interesting things we can do with AI here. I hesitate to use it for core debugging tasks, because false positives are so expensive and so common, and you're changing your system every day if you're deploying.

Like, I just think it's a bit of a mismatch. But we use generative AI for, like, our query system. If you've shipped some code and you want to use natural language to ask, did something break? Did something get slow? You just ask in English, did something break, did something get slow, and it pops you into the query builder with that question.

I think there are really interesting things that we can do when it comes to using AI. Oh god, the thing that I really can't wait to get started with is, so Honeycomb, I have to keep explaining Honeycomb and I apologize, because I don't like to be one of those vendors, but one of the things that we do with Honeycomb is you can see your history, all the queries that you've run. Because so often when you're debugging, you run into a wall and you're like, oh, I've lost the plot, so you need to rewind to where you last had the plot, right?

So you can see all the queries you've run, the shapes of them and everything, so you can scroll back, but you also have access to everyone else's history. So if I'm getting paged about MySQL, maybe I don't know all about MySQL, but I know that Ben and Emily on my team really know MySQL well. And I feel like they were debugging a problem like this, like, last Thanksgiving.

So I just go and I ask, what were Ben and Emily doing around MySQL last Thanksgiving? What did they think was valuable enough to put in a postmortem? What did they put in, you know, a ClipDoc? What did they put a comment on? What did they post to Slack? And then I can jump to that, and that's the shit where I can't wait to get generative AI processing: what is your team talking about?

What is your team doing? Because in a distributed system, everybody's an expert in their corner of the world, right? But you have to debug things that span the entire world. So how can we derive wisdom from everybody's deep knowledge of their part of the system and expose that to everyone else?

That's the shit I'm excited about. I think it's hella basic. You shouldn't have to use AI just to see what paged you in the middle of the night.

Corey: And yet somehow it is. There's no good answer here. It's the sort of thing we're all stuck with on some level.

Charity Majors: It's a chicken and egg problem, right? But the thing is that, like, that's because people are used to using tools that were built for known unknowns.

They were built in the days of the LAMP stack era, when all of the complexity was bound up inside the code you were writing. So if all else failed, you could just jump into GDB or something. But your systems failed in predictable ways. You can size up a LAMP stack and go, okay, I know how to write exactly 80 percent of all the monitoring checks that are ever going to go off, right?

Queues are going to fill up, you know, things are going to like start 500ing, fine. But those days are like long gone, right? Now we have these like vast, far flung, distributed architectures. You're using, you know, yes, you're using containers, you're using Lambda, you're using third party platforms, you're using all this shit, right?

And that means that every time you get paged, honestly, it could be something radically different. We've also gotten better at building resilient systems. We've gotten better at fixing the things that break over and over. And every time you get paged, it should be something radically different. But our tools are still built for this world where you have to kind of know what's going to break before it breaks.

And that's really what Honeycomb was built for: a world where you don't have prior knowledge. Every time it breaks, it is something new, and you should be able to follow the trail of breadcrumbs from "something broke" to the answer, every time, in a very short amount of time, without needing a lot of prior knowledge.

And the thing is, you know, I was on a panel with the amazing Ms. Amy Tobey a couple weeks ago, and at the end they asked us, what's the biggest hurdle that people have in their observability journey? And I was like, straight up, they don't believe a better world is possible. They think that the stuff we're saying is just vendor hype.

Because all vendors hype their shit. The thing about Honeycomb is we have never been able to adequately describe to people the enormity of how their lives are going to change. We underpitch the ball, because we can't figure out how, or because people don't believe us, right? But better tools are better tools, and they can make your life so much better.

And the other thing is that not only do people not believe a better world is possible, but when they hear us talk about a better world, they're like, oh God, yeah, that sounds great, but it's probably going to be hard, it's going to be complicated, we're going to have to shift everyone's way of thinking.

Corey: Instrumenting all of your code manually has historically been the way that you wind up getting information out of this. It's like, that's great, I'll just do that out of the unicorn paddock this weekend. It doesn't happen.

Charity Majors: It doesn't happen, but in fact it's so much easier. It's easier to get data in, it's easier to get data out.

Corey: I'm liking a lot of the self-instrumentation stuff: we're going to go ahead and start grabbing the easy stuff, and then we're going to figure out for ourselves where to change some of that. It's doing the right things. The trend line is positive.

Charity Majors: Yeah, when you don't have to, like, predict in advance every single custom metric that you're going to have to be able to use and everything, it's so much easier.

It's like the difference between using the keyboard and using the mouse, which is not something that people like you and I typically reach for, but having to know every single Unix command versus drag and drop, it's that kind of leap forward.

Corey: Yeah, we're not going to teach most people to wind up becoming command line wizards.

And frankly, in 2024, we should not have to do that.

Charity Majors: No, we should not have to do that.

Corey: I remember inheriting a curriculum for a LAMP course I was teaching about 10 years ago, and the first half of it was how to use vi. It's like, step one: rip that out. We're using nano today. The end. And people should not have to learn a new form of text editing in order to make changes to things.

These days, VS Code is the right answer for almost all of it, but I digress.

Charity Majors: Especially, and this is really true for observability, because nobody is sitting around like, okay, on every team there's one person who's like, okay, I'm a huge geek about graphs. I've never been that person.

Typically, when people are using these tools, it's because something else is wrong and they are on a hot path to try and figure it out. And that means that every brain cycle they have to spare for their tool is a brain cycle they're not devoting to the actual problem that they have. And that's a huge problem.

Corey: It really is. I want to thank you for taking the time to speak with me today about, well, what you've been up to and how you're thinking about these things. If people want to learn more, where's the best place for them to go find you?

Charity Majors: Uh, I write periodically at charity.wtf and on the Honeycomb blog; we actually write a lot about OpenTelemetry and stuff that isn't super Honeycomb-related. And I'm still on Twitter @mipsytipsy.

Corey: And we will of course include links to all of those things in the show notes. Thank you so much for taking the time to speak with me. I really appreciate it.

Charity Majors: Thanks for having me, Corey. It's always a joy and a pleasure.

Corey: Charity Majors, CTO and co-founder of Honeycomb, I'm cloud economist Corey Quinn, and this is Screaming in the Cloud.

If you enjoyed this podcast, please leave a five star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five star review on your podcast platform of choice. That platform, of course, being the one that is a reference customer for whatever observability vendor you work for.
