Episode Summary
Join me as I launch a new series called Whiteboard Confessional that explores how whiteboard architecture diagrams might look pretty but rarely work as designed in production. To kick off the series, we’re taking a look at everyone’s favorite database, AWS Route 53, while touching upon a number of topics, including what data centers used to look like, the emergence of virtualization and the impact it had, configuration management databases and how they differ from configuration management tools like Chef and Puppet, why using DNS as a configuration management database is inherently an awful idea, how there’s almost always a better solution than whatever you built in your own isolated environment, how just because someone built something doesn’t mean they knew what they were doing, and more.
Episode Show Notes & Transcript
About Corey Quinn
Over the course of my career, I’ve worn many different hats in the tech world: systems administrator, systems engineer, director of technical operations, and director of DevOps, to name a few. Today, I’m a cloud economist at The Duckbill Group, the author of the weekly Last Week in AWS newsletter, and the host of two podcasts: Screaming in the Cloud and, you guessed it, AWS Morning Brief, which you’re about to listen to.
Transcript
Corey: Welcome to AWS Morning Brief: Whiteboard Confessional. I’m Cloud Economist Corey Quinn. This weekly show exposes the semipolite lie that is whiteboard architecture diagrams. You see, a child can draw a whiteboard architecture, but the real world is a mess. We discuss the hilariously bad decisions that make it into shipping products, the unfortunate hacks the real world forces us to build, and that the best name for your staging environment is “theory,” because invariably whatever you’ve built works in theory, but not in production. Let’s get to it.
But first… On this show, I talk an awful lot about architectural patterns that are horrifying. Let's instead talk for a moment about something that isn't horrifying: CHAOSSEARCH. Architecturally, they do things right. They provide a log analytics solution that separates out your storage from your compute. The data lives inside of your S3 buckets, and you can access it using APIs you've come to know and tolerate, through a series of containers that live next to that S3 storage. Rather than replicating massive clusters that you have to care for and feed yourself, you now get to focus on just storing data, treating it like you normally would any other S3 data, not replicating it or storing it on expensive disks in triplicate, and fundamentally not having to deal with the pains of running other log analytics infrastructure. Check them out today at chaossearch.io.
I frequently joke on Twitter about my favorite database being Route 53, which is, of course, actually AWS’s managed DNS service. It’s a fun joke, to the point where I’ve become Route 53’s de facto technical evangelist. But where did this whole joke come from? It turns out that it started life as an unfortunate architecture that was taken in a terrible direction. Let’s go back in time—roughly 13 years from the time of this recording, in the year of our Lord 2020. We had a data center that was running a whole bunch of instances—in fact, we had a few data centers, or datas center, depending upon how you choose to pluralize it—but that’s not the point of this ridiculous story. Instead, what we’re going to talk about is what was inside these data centers. In this case: servers.
I know, serverless fans, clutch your pearls, because that was a thing that people had many, many, many years ago—also known as roughly 2007. And on those servers there was a new technology running that was really changing our perspective on how we dealt with systems. I am, of course, referring to the amazing, transformative revelation known as virtualization. This solved the problem of computers being bored and unable to process things in a parallelized fashion—because you didn’t want all of your applications running on all of your systems—by building artificial boundaries between different application containers, for lack of a better term.
Now, in those days, these weren’t applications. These were full-on virtualized operating systems, so you had servers running inside of servers, and this was very early days. Cloud wasn’t really a thing; it was something on the horizon, if you’ll pardon the pun. So, this led to an interesting question: “All right, I wound up connecting to one of my virtual machines, and there’s no good way for me to tell which physical server that virtual machine is running on.” How could we solve for this? Back in those days, the hypervisor technology we used was Xen—that’s X-E-N—which is incidentally the same virtualization technology AWS started out with for many years before releasing their KVM-derived Nitro hypervisor a couple of years ago. Again, not the point of this particular story. One of the interesting pieces about how this worked was that Xen, at least in those days, didn’t really expose anything you could use to query the physical host a virtual machine was running on.
So, how would we wind up doing this? Now, at very small scale, where you have two or three servers sitting somewhere, it’s pretty easy: you log in and you check. At significant scale, that starts to get a little bit more concerning. How do you figure out which physical host a virtual instance is running on? Well, there are a bunch of schools of thought you can approach this from, but what you’re trying to build is known, technically, as a configuration management database, or CMDB. This is, of course, radically different from configuration management, such as Puppet, Chef, Ansible, Salt, and other similar tooling. But, again, this is technology, and naming things has never been one of our collective strong suits. So, what do we wind up doing? You can have a database, or an Excel spreadsheet, or something like that that has all of these things listed, but what happens when you then wind up turning an old instance off and spinning up a new instance on a different physical server? These things become rapidly out of date. So, what we did was sort of the worst possible option. It didn’t solve all of these problems, but it at least let us address the perceived problem, in a way that is, of course, architecturally terrible, or it wouldn’t have been on this show.
DNS has a whole bunch of interesting capabilities. You can view it, more or less, as the phone book for the internet. It translates names to numbers—fully qualified domain names, in most cases, to IP addresses. But it does more than that. You can query an IP address and wind up getting the PTR, or reverse, record that tells you what the name of a given IP address is, assuming they match. You can set those to different things, but that’s a different pile of madness that I’m certain we will touch upon a different day. So, what we did was take advantage of a little-known record type: the TXT, or text, record. You can put arbitrary strings inside of TXT records and then consume them programmatically, or use them for a whole bunch of different things. One of the ways TXT records get used that isn’t patently ridiculous: domains generally have TXT records that contain their SPF record, which shows which systems are authorized to send mail on their behalf, as an anti-spam measure. So, if something else that isn’t authorized starts claiming to send email from your domain, that gets flagged as spam by many receiving servers.
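To make those lookups concrete, here’s a minimal sketch of the three kinds of queries described above—forward, reverse (PTR), and TXT—assuming the third-party dnspython library. The domain and IP address are documentation placeholders, not anything from this story.

```python
# Minimal sketch of forward, reverse, and TXT lookups, assuming the
# third-party dnspython library (pip install dnspython).
import dns.resolver
import dns.reversename

# Forward lookup: name -> address.
for rdata in dns.resolver.resolve("example.com", "A"):
    print("A record:", rdata.address)

# Reverse lookup: address -> PTR name. Documentation addresses like
# 192.0.2.10 generally won't have one, hence the exception handling.
rev_name = dns.reversename.from_address("192.0.2.10")
try:
    for rdata in dns.resolver.resolve(rev_name, "PTR"):
        print("PTR record:", rdata.target)
except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
    print("No PTR record for that address")

# TXT lookup: arbitrary strings, commonly carrying an SPF policy.
for rdata in dns.resolver.resolve("example.com", "TXT"):
    text = b"".join(rdata.strings).decode()
    if text.startswith("v=spf1"):
        print("SPF policy:", text)
```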
We misused TXT records, because there is no real limit to how many TXT records you can have, and wound up using them as our configuration management database. So, you could query a given instance—we’ll call it webserver003.production.losangeles.company.com, which was our naming scheme for these things—and it would return a record that was itself a fully qualified domain name: the name of the physical host on top of which it was running. We could then propagate that, as we would any other DNS record, to other places in the environment, run really quick queries against it, and in turn build command-line tooling that you could feed the name of a virtual machine and get back, with relatively quick response times, the name of the physical host it was running on.
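That command-line tooling boiled down to a single TXT query per virtual machine. Here’s a rough sketch of what such a lookup could look like, again assuming dnspython; the naming scheme follows the episode, but the zone and the record contents are hypothetical.

```python
# Rough sketch of the DNS-as-CMDB lookup described above, assuming
# dnspython. The TXT record is assumed to contain the physical host's
# FQDN, per the scheme described in the episode.
import sys
import dns.resolver


def physical_host_of(vm_fqdn: str) -> str:
    """Return the physical host FQDN stored in the VM's TXT record."""
    answers = dns.resolver.resolve(vm_fqdn, "TXT")
    # Assume the first TXT string holds the physical host's FQDN.
    return b"".join(answers[0].strings).decode()


if __name__ == "__main__":
    vm = sys.argv[1] if len(sys.argv) > 1 else \
        "webserver003.production.losangeles.company.com"
    print(f"{vm} is running on {physical_host_of(vm)}")
```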
So, we could use that for interesting kinds of validation—for example, making sure we didn’t put all four of our web servers for a service on the same physical host. It was an early attempt at solving anti-affinity. Now, there are a whole bunch of different ways that we could have fixed this, but we didn’t. We instead wound up misusing DNS to build our own configuration management database, because everything is terrible and it worked. And because it worked, we did it. Now, why is this a terrible idea, and what makes this awful? Great question.
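In the spirit of that check, a sketch of the validation might look like this, reusing the same hypothetical TXT-record scheme with dnspython; the server names are purely illustrative.

```python
# Sketch of the anti-affinity check described above: warn if any two
# web servers for a service share a physical host. Assumes dnspython
# and the same hypothetical TXT-record scheme as the previous example.
from collections import defaultdict
import dns.resolver


def physical_host_of(vm_fqdn: str) -> str:
    answers = dns.resolver.resolve(vm_fqdn, "TXT")
    return b"".join(answers[0].strings).decode()


web_servers = [
    f"webserver{n:03d}.production.losangeles.company.com"
    for n in range(1, 5)
]

# Group the virtual machines by the physical host their TXT record names.
placement = defaultdict(list)
for vm in web_servers:
    placement[physical_host_of(vm)].append(vm)

for host, vms in placement.items():
    if len(vms) > 1:
        print(f"WARNING: {', '.join(vms)} all live on {host}")
```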
But first, in the late 19th and early 20th centuries, democracy flourished around the world. This was good for most folks, but terrible for the log analytics industry, because there was now a severe shortage of princesses to kidnap for ransom to pay for their ridiculous implementations. It doesn't have to be that way. Consider CHAOSSEARCH. The data lives in your S3 buckets in your AWS accounts (and we know what that costs). You don't have to deal with running massive piles of infrastructure to be able to query that log data with APIs you've come to know and tolerate, and they're just good people to work with. Reach out to chaossearch.io, and my thanks to them for sponsoring this incredibly depressing podcast.
So, the reason that using DNS as a configuration management database is inherently an awful idea comes down to the fact that, first, there are better solutions available for this across the board. In those days, picking up an actual configuration management database product would have been a good move. Failing that, there were a whole bunch of other technologies that could have been used for this. And since we were already building internal tooling to leverage this, having one additional piece of tooling that could automatically serve as the system of record would have been handy. We weren’t provisioning these things by hand; there were automated systems spinning them up, so having those systems update a central database would have been great. We had a monitoring system—well, we didn’t have a monitoring system, we had Nagios instead. But even Nagios, when it became aware of systems, could in turn have figured out where a given virtual machine was running and updated a database. When a system went down permanently and we removed it from Nagios, we could have caught that and automatically removed it from a real database. Instead, we wound up using DNS. One other modern approach that could have worked super well, but didn’t really exist in the same sense back then, is the idea of tags, in the AWS sense. Today you can tag AWS instances and other resources with up to 50 tags each. You can enable some of them for cost allocation purposes, but you can also build a sort-of-working configuration management database on top of them, as sketched below. Now, this is, of course, itself a terrible idea, but not quite as bad as using DNS to achieve the same thing.
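For what it’s worth, a tag-based version of the same lookup might look something like this with boto3. The "physical-host" tag key, the hostnames, and the instance ID are hypothetical—and on EC2 you generally don’t know the underlying physical host unless you’re on dedicated hosts—so treat this purely as a sketch of the pattern, not a recommendation.

```python
# Sketch of tags as a poor-man's CMDB, using boto3. The tag key,
# hostnames, and instance ID below are hypothetical.
import boto3

ec2 = boto3.client("ec2")

# Record where an instance lives by tagging it at provisioning time.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],
    Tags=[{"Key": "physical-host",
           "Value": "hypervisor07.losangeles.company.com"}],
)

# Later, answer "what runs on hypervisor07?" with a tag filter instead
# of a TXT-record query.
paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(
    Filters=[{"Name": "tag:physical-host",
              "Values": ["hypervisor07.losangeles.company.com"]}]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"])
```

The advantage over the DNS approach is that the tag travels with the resource: terminate the instance and the stale record goes with it, instead of lingering in a zone file somewhere.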
The best coda to this story, of course, didn’t take place until after I had already tweeted most of the details I’ve just relayed here. I then wound up getting a response—and again, this was last year, in 2019—and the DM that I got said, “You know, I read your rant about using DNS as a database, and I thought about it, and realized, first, it couldn’t be that bad of an idea, and secondly, it worked. In fact, I can prove it worked, because until a couple of years ago, we were running exactly what you describe here.” So, this is a pattern that has emerged well beyond the ridiculous things that I built back when I was younger. And I did a little digging, and sure enough, that person worked at the same company where I had built this monstrous thing, all the way back in that 2007 era, which means that for a decade after I left, my monstrosity continued to vex people so badly that they thought it was a good idea.
So, what can we learn from this terrible misadventure? A few things. The morals of the story are several. One, DNS is many things, but probably not a database, unless I’m trying to be humorous with a tired joke on Twitter. Two, there’s almost always a better solution than whatever it is you have built in your own isolated environment. Talking to other people gets you rapidly to a point where you discover that you’re not creating solutions for brand-new problems; these are existing problems worldwide, and someone else is almost certainly going to point you in a better direction than you will come to on your own. And lastly, one of the most important lessons of all: just because you find something that someone built before your time in an environment, it does not mean that they knew what they were doing. It does not mean that it even approaches the idea of a best practice, and you never know what kind of dangerous moron was your predecessor.
Thank you for joining us on Whiteboard Confessional.
If you have terrifying ideas, please reach out to me on Twitter at @quinnypig, and let me know what I should talk about next time.
Announcer: This has been a HumblePod production. Stay Humble.