S3 and the Evolution of Storage with Andy Warfield

Episode Summary

Andy Warfield joins Corey in this episode to discuss the evolution of storage technology at Amazon, including S3's growth from archival storage to supporting modern AI and analytics workloads. As Vice President and Distinguished Engineer at AWS, Andy explains performance-enhancing innovations like S3 Tables and the Common Runtime (CRT), along with challenges like compaction and namespace structuring. Reflecting on his journey from the Xen hypervisor to AWS, Andy shares insights into scaling S3, including buckets whose objects span over a million hard disks.

Show Highlights
(0:00) Intro
(1:09) The Duckbill Group sponsor read
(1:43) Andy’s background
(3:38) How AWS envisioned services being used vs. what customers actually do with them
(6:54) The frustration of legacy applications not keeping up with the times
(10:14) Why S3 is so accurate
(15:29) S3 as a role model for how a service should be run
(18:04) The Duckbill Group sponsor read
(18:46) Why AWS made Iceberg into a native offering
(23:50) Why S3 Tables is slightly more expensive
(28:23) How Andy handled the transition from Xen to Nitro
(32:22) What Andy is currently excited about 


About Andy Warfield
Andrew Warfield is a VP / Distinguished Engineer at Amazon. As a senior technical leader at one of the world's largest technology companies, he plays a crucial role in shaping Amazon's engineering strategies and initiatives. 


Sponsor
The Duckbill Group: duckbillgroup.com 

Transcript

Andy Warfield: Like three or four years ago at re:Invent, when I had customer meetings, probably half of my meetings were with customers building data lakes on top of S3, talking about Parquet performance. And there's all sorts of optimizations you can make to clients and to how readers work with Parquet, and some customers do really, really sophisticated things.

The next year, so I guess this is probably three years ago at this point, we started to see some of our most sophisticated Parquet analytics customers talking about Iceberg in particular and open table formats generally. And then in 2024, it was in every analytics conversation, right? Customers were voting with their feet on it and asking questions about Iceberg and OTFs, and so that was really what pulled forward the development of the, sort of, um, S3 Tables product.

Corey Quinn: Welcome to Screaming in the Cloud. I'm Corey Quinn. And I've been angling to have a conversation like this one for a very long time. Andy Warfield is a vice president and distinguished engineer at AWS. Andy, thank you for finally agreeing to suffer my slings and arrows.

Andy Warfield: Thanks for having me.

Sponsor: This episode is sponsored in part by my day job, the Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things.

Predicting what it's going to be. Determining what it should be. Negotiating your next long-term contract with AWS. Or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillgroup.com. Remember: you can't duck the Duckbill Bill.

And my CEO informs me that is absolutely not our slogan.

Corey Quinn: So you have been working on storage for, well, to say it's a long time feels like it's doing it a bit of a disservice.

You were a senior principal engineer at AWS for a while before, I dunno, the term is ascending, but that's the one I'm going to use, to the distinguished engineer side of the universe.

And you've been focusing primarily on S3 based upon your public works.

Andy Warfield: I get to work across all of the storage teams at Amazon. So I work with the EBS folks and the file folks as well. And I spend a fair bit of time working with all of the internal teams that use storage, which is a lot of teams. So especially the analytics and machine learning folks.

Corey Quinn: It always feels like there are interesting expressions of storage, but at least in the way that I played around in data centers, what actually was backing that was all more or less the same substrate, just presented differently. Think of a NetApp filer presenting, in some cases, an iSCSI mount or an NFS target or an SMB share. That was back in the days before object store, because I'm old.

Uh, is that how it works in your view of the world or is everything highly specialized and divergent?

Andy Warfield: That's a fun question. I can kind of answer that one at a few different levels. And so, maybe to start with the S3 stuff: when I was initially working more at the OS layer and on hypervisors and stuff, I was doing a lot of work at that point in my career with traditional enterprise setups.

And so we saw a lot of Fibre Channel and iSCSI and NAS attach and stuff. And in all of those things, even the startup that I did before joining Amazon on the storage side, you're kind of hamstrung by the APIs that were there and the storage attach that folks had. Whereas in the time that I get to spend with S3, we have a ton more flexibility to change that stuff.

And so I think one way to answer it from an AWS perspective is that there are a bunch of storage services that are kind of defined by the protocols that they talk over. But they have enormous flexibility on how they implement behind the scenes. And then there's sort of, you know, S3 in particular, where there's even flexibility at the API level.

Corey Quinn: From my perspective, it always seemed like it was, okay, you have hard drives you need to worry about and start playing games with, and how those get expressed is sort of an implementation detail. That's from the days you were spending in the data center. Then you come out of it and suddenly you care very much about those details.

You basically created a problem for yourself in the past, moving forward. In that respect, how have you found that the way AWS envisioned a lot of these services being used has been challenged by what customers have actually been doing with them?

Andy Warfield: This thing is kind of, you know, in a lot of senses, the coolest bit of working at Amazon for me, right?

Like, every time you look at something, you find out that there's some workload that is using it in a way that you didn't anticipate. And it's often quite, I mean, this is your Route 53 as a database example. And in some senses, I've seen S3 used as, like, an RPC and IPC mechanism to do message passing across applications.

That thing was surprising seven years ago, and now, you know, there's libraries built around it to do stuff like that.

Corey Quinn: The read after write consistency change made a lot of those cases a little more defensible.

Andy Warfield: I don't know. It's a thing that factors in interestingly to a lot of what we have to do with setting pricing for new products: making sure that we're anticipating all the ways that something might be used that we don't expect, to make sure that we're still happy, you know, running the service in a way that's sustainable.

Corey Quinn: I think that on some level, the use cases, I guess the misuse cases, have changed. In the early days when it first came out, no one knew what an object store was, so people would quick write some stuff in FUSE to go ahead and mount it as a file system, and no, don't do that; and then the request charges started being a good way to dissuade that behavior. And two decades later, near enough, okay, now you offer a Mountpoint option for S3 as an open source project to explicitly empower that for AI workloads and analytics workloads. Which, on some level, I have to confess, is a little frustrating. Rather than learning how to use something properly, it feels like if I dig my heels in and wait long enough, something will come along that'll just meet the use case that I have for it going in.

Stubbornness is actually a virtue in some of these cases.

Andy Warfield: To maybe even fill in some of the points in between the things that you mentioned, the early S3 build was very much, like, archival, right? It was very, you know, not optimized for performance or parallel throughput or things like that.

And as you say, even where things like s3fs as a FUSE driver existed, it was often used for folks to do backup, basically, right? Like, you would set up s3fs on your Linux home directory as a way of making sure that it was effectively rsyncing stuff into the cloud. Interestingly, you know, as we watched folks use that stuff, and we did a lot of performance and throughput work on S3, we found that two things were true.

One was the thing that you're saying, which is, it'd be awesome if people worked against the APIs that we build and develop, to get the best possible experience. But at the same time, there's like 40 years of file-based applications out there that, you know, people in some cases don't even have source code for.

And it's valuable to be able to support that stuff. And so we've been kind of continuing to pursue how to make that integration seamless on a bunch of fronts. So Mountpoint's been pretty cool to see in that way.

Corey Quinn: All right. The idea of legacy applications getting a conception of object store is great.

And even then, I still have laggards where I have to use legacy IAM credentials rather than ephemeral things for a couple of desktop clients that I have. It works well enough for what I'm trying to achieve, but it's frustrating: can you please just inch a little bit further into modernity?

And that's challenging. Like, "legacy" code is, you know, condescending engineering speak for "it makes money."

Andy Warfield: Right. Well, there's a neat thing with Mountpoint. I believe you have a commit on Mountpoint, actually. So you know where the GitHub is on that stuff. The team, when we put it together, had a pile of super interesting design discussions on it.

And one of the things we talked about was that when we looked at other FUSE connectors to S3, which there are loads of, they often kind of lean into engineering around some of the shortcomings of using an object API for file. And it leads to, you know, zero-byte file directory markers and a whole bunch of things that kind of complicate the structure of your bucket and make it difficult to use with non-file clients and stuff. And so inside the Mountpoint distro, there's actually this design file where we talked about some of our tenets and the way that we wanted to engineer for it. And the team took the decision that we were intentionally going to break stuff that wasn't easily supported at an API level by S3. And so Mountpoint doesn't support directory moves or object moves, because you have to fake them out in really nasty ways with, like, forwarding pointers and stuff. And so the decision that we've kind of taken with that is that the Mountpoint team internally acts as a point of tension that's driving a bunch of API changes and namespace improvements inside S3, because they want to present direct APIs.

And so an example of that is the recent append API changes on S3 Express. We launched S3 Express One Zone, and we found that there were loads and loads of customers that were using that thing as, like, a lower-latency ingest target. They're accumulating small batches of logs, and then they would go and append all those things together and then write them out to S3. And so adding an append API is a way to sort of facilitate that, and it ties back up into Mountpoint, where as long as you're doing sequential writes to the end of the file, we, like, brought in our file support. And that's a trend that will continue.
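
For illustration, here's roughly what that append pattern can look like from the SDK side. This is a minimal sketch, assuming a boto3 version recent enough to expose the write-offset parameter for directory buckets; the bucket and key names are made up.

    import boto3

    s3 = boto3.client("s3")
    # Hypothetical S3 Express One Zone directory bucket and key.
    bucket = "my-logs--usw2-az1--x-s3"
    key = "ingest/batch.log"

    # Create the object with the first batch of records.
    s3.put_object(Bucket=bucket, Key=key, Body=b"first batch of records\n")

    # Append a later batch by writing at the current end of the object,
    # i.e. a sequential write to the tail, as described above.
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=b"next batch of records\n",
        WriteOffsetBytes=size,
    )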

Corey Quinn: It feels like a lot of the old things that you took as chapter and verse no longer might apply. For example, I always tended to view S3 objects as inherently immutable. Once you start being able to append that, that does change the game somewhat.

Andy Warfield: We spent long hours talking about that one. And I think that the team still lands very much on, there's a ton of value in the sort of like immutable view of an object and the sort of consistency of an object moving from one complete state to another.

Which is different than a file API, where, like, you know, file applications are used to going in and leaving the file in completely inconsistent states as they do random writes around it. And so if you look at what we launched with the append support, it's very anchored on multipart. And it's kind of an optimization of, like, copy parts being used to sneak in appends on stuff.

And so we're trying to, like, hold ourselves to a point where we're at least supporting an object moving from one consistent state to another, instead of just becoming this completely, like, inconsistent, sort of transient thing.

Corey Quinn: S3 is probably one of the best examples I can come up with of a service where, when people learn about it, they don't go back and revisit things that they once learned to validate that they're still accurate.

The canonical example of this is having to hash the initial path of a key in a bucket, because otherwise you wind up with hot spots and performance issues at scale. You don't have to do that anymore; at least that's my understanding of a lot of those changes that came through about 2019 or so.

Andy Warfield: Uh huh. We did a ton of work on the S3 index to scale up performance for those things. There are still cases, under really high TPS that's localized, or with certain types of, uh, key name updates, where that hashing is an optimization that works. We don't like that, and so we're still working to improve it.

One of the things that we actually did with the recent launch of S3 Tables was an optimization internally that's exactly around this: because we know the structure of Iceberg's naming of Parquet files, we can make optimizations on the index side to pre-partition the bucket, get a bunch more performance, and also manage the naming internally in a way that further scales up the TPS that you can get.
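
The old workaround Corey describes looked something like the sketch below: salt each key with a short hash so that writes spread across index partitions instead of piling onto one lexicographic range. This is historical context, not current advice; the index work described above made it unnecessary for most workloads.

    import hashlib

    def hashed_key(key: str) -> str:
        """Prepend a short hash so keys spread across S3 index partitions."""
        prefix = hashlib.md5(key.encode()).hexdigest()[:4]
        return f"{prefix}/{key}"

    # "logs/2019/01/01/host1.gz" -> something like "52f1/logs/2019/01/01/host1.gz"
    print(hashed_key("logs/2019/01/01/host1.gz"))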

Corey Quinn: Our timing on this is funny. Yesterday I was poking around on the AWS subreddit, because, you know, it's not like I have anything better to do with my time. Someone was asking about throughput challenges when putting data into S3 from an EC2 instance; they weren't approaching anything nearing the bandwidth limits inherent to that instance type.

And someone suggested that, oh, there's a flag to use the Common Runtime that you can pass to the AWS CLI, which I'd forgotten existed. Yeah, it massively improves performance. Why isn't that the default these days?

Andy Warfield: We're getting there. It's always the path of, you know, making sure that we're making defaults that don't break anything and that we're super, super confident in them.

This has been totally a lesson for me, working on, I don't know what to call it, a gold-standard service like S3, relative to more research systems or startups and things, with the CRT work. So CRT, as you mentioned, is a bunch of changes to the AWS SDKs that, instead of just directly exposing the S3 REST APIs, actually add some smarts on the client side to get a pile of performance off of them.

And so when I, holy smokes, I've been at Amazon now for, I guess, about seven years. And one of the first things I was tasked with when I joined was to really dig in and understand performance for S3.

Corey Quinn: Have you gotten there yet?

Andy Warfield: I think I'm closer. We had a big leadership meeting on it in, like, my second week at Amazon, where the meeting was set before I joined.

And so it was, like, a terrifying meeting to walk into and be told that I was on the hook to talk about this stuff.

Corey Quinn: Here, catch. It's the best intro.

Andy Warfield: One of the things we did when we first started was we went and looked at kind of the most aggressive customers in terms of driving performance, as a test suite.

And the two super interesting realizations were, number one, that S3 for throughput-oriented applications is actually remarkable. I don't even think we'd really realized it at the time. Customers were kind of moving there fast with a lot of analytics APIs, but the width of the storage fleet and the width of the web server fleet allows you to drive throughput like no other storage service anywhere.

Like, it's remarkable. However, the second thing we observed was that it was really, really finicky. And there was a lot of, kind of, folklore and earned experience in how to drive that performance. And so we sat down with a whole bunch of customers that were doing everything from using S3 as a CDN, to actually putting S3 under a legit CDN like CloudFront, to, you know, doing analytics work and things like that.

And we started to collect this set of best practices. We actually published in the S3 docs this best-practices thing around, you know, this is how you deal with potentially slow front-end servers or slow network links, right? Doing retries. And this is how you should set your part sizes. And this is how you should do parallelization of transfers. And you should be monitoring for connection health, and here's how to set retries, and stuff like that. And we published this thing. And then immediately, you know, at that point, people were like, well, great, I'm glad you've written this down, but why isn't this the default?

Like, why isn't this automatic? And so that's when we sat down with the SDK team and ended up building this extension to the Common Runtime, which is basically a bunch of native code. It's largely in C at this point, although we're doing a whole bunch of work on it in Rust right now. And that thing is kind of, you know, like an event scheduler that drives S3 transfers super, super fast, and, like, goes and solicits IP addresses from DNS proactively to get you a big width of access into the S3 fleet and stuff. And so we're progressively rolling that thing out, not just into all of the SDKs, but also into a lot of the other connectors. So Mountpoint uses it, S3A uses it for Spark and Hadoop, the Iceberg FileIO client is moving to use it. And so we kind of realized with S3 that, to get the performance that S3 is capable of, we would have to work in a really hands-on way with open source to drive changes closer to the clients.
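
To make that best-practices list concrete, here's a minimal boto3 sketch of the knobs customers used to tune by hand: retries, part sizes, and transfer parallelism. The values are illustrative, not recommendations. With the CRT extra installed (pip install "boto3[crt]"), or with preferred_transfer_client = crt set in the AWS CLI's s3 configuration, much of this tuning is handled automatically, which is the flag Corey ran into earlier.

    import boto3
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    # Retry behavior: the docs' advice about retrying slow or failed requests.
    s3 = boto3.client(
        "s3",
        config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
    )

    # Part sizing and parallelism, expressed as boto3's managed-transfer settings.
    cfg = TransferConfig(
        multipart_threshold=16 * 1024 * 1024,  # switch to multipart at 16 MiB
        multipart_chunksize=16 * 1024 * 1024,  # 16 MiB parts
        max_concurrency=32,                    # parallel connections into the fleet
    )

    s3.upload_file("big-file.bin", "my-bucket", "big-file.bin", Config=cfg)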

Corey Quinn: S3 has always been sort of a role model for how a service at significant scale and extreme longevity can and should be run.

Jeff Barr was talking at one point about S3 having to be a generational service. You, by design, have no idea what data a customer has in their account, and as a result, you have to treat every last bit of it as if it were precious. Even the deprecations have been very well considered and communicated clearly.

The two I can think of off the top of my head are that you can't use BitTorrent for new buckets anymore; objects aren't just automatically a seed without some extra work. And S3 Select seems to have been replaced by Athena.

Andy Warfield: Yeah, that's right. Obviously, the first order of business is the customer.

And so we're super, super sensitive to not changing the service in a way that's going to break stuff. And what's going to be impacted by any change we make is kind of one of the biggest sources of friction in building everything. It's remarkable how many conversations that sort of care and custodianship steps into. But at the same time, like you say, everything that we build ends up being a point of friction for everything that we build in the future.

And so with BitTorrent, you know, we did a lot of looking at what was still using BitTorrent, and had a bunch of conversations with folks, and so on, and decided that that was a thing that we could probably let go of.

Corey Quinn: Even today, when I talk about that, people are convinced I'm making it up in service of a joke. And same with S3 Select: no one really seemed to use that in any meaningful sense. Whenever I mentioned it could do that, people thought I was confused and talking about a different service.

Andy Warfield: Select is interesting. I mean, BitTorrent kind of had its day, and then didn't. Select's an interesting one to me because it's actually a really cool feature.

Right? Like, it's the idea of doing, you know, a degree of pushdown queries and optimizing that bit of the data path, avoiding copies, especially when you really just want to do a needle-in-a-haystack type interaction: you want to get a small bit of data out of a really large object. The challenge with Select was, uh, I don't think that we got the interfaces or the integrations exactly right.

And so there were absolutely a bunch of folks that used it and loved it, and we had to work really hard to think about that. It was a painful one, and we had to figure out how to move folks on to Athena, which, like you say, largely seems to have gone okay for folks. But that's a place where I think we may revisit, right?

Like, that bit of functionality is a thing that may end up, you know, coming up in other forms with some of the analytics integrations for things like S3 Tables.

Corey Quinn: Which is what I want to talk about next.

Sponsor: Here at the Duckbill Group, one of the things we do, with, you know, my day job, is we help negotiate AWS contracts. We just recently crossed five billion dollars of contract value negotiated. We solve fun problems such as: how do you know that the contract you have with AWS is the best deal you can get?

How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth? To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com.

Corey Quinn: Iceberg is absolutely something that has done well in the industry, and it's something a lot of customers are using, backed by S3. Why was that the use case that you folks decided to go ahead and turn into a native offering of the service?

Why does that exist, instead of an S3 solution that's, oh, here's how to deploy Iceberg on top of S3 the way people have been doing it until now? What changed?

Andy Warfield: This bit's neat. And we're going to do other stuff in this direction in the future, so I'm going to try and talk specifically about Iceberg, but, you know, maybe hint at stuff that's coming up.

There is a pattern that I think we have seen where, for a narrow but important set of data types, the object API is insufficient, but everything else about S3 is really, really attractive. For tables, as an example, we were seeing customers initially put loads and loads of Parquet on top of S3.

I'm trying to remember the number. I think that we have something in the neighborhood of 15 million requests a second to Parquet on S3, all the time, right now. So it's a very, very popular data type. Like three or four years ago at re:Invent, when I had customer meetings, probably half of my meetings were with customers building data lakes on top of S3, talking about Parquet performance.

And there's all sorts of optimizations you can make to clients and to how readers work with Parquet, and some customers do really, really sophisticated things. The next year, so I guess this is probably three years ago at this point, we started to see some of our most sophisticated Parquet analytics customers talking about Iceberg in particular and open table formats generally.

And then in 2024, it was in every analytics conversation, right? Customers were voting with their feet on it and asking questions about Iceberg and OTFs. And so that was really what pulled the development of the, sort of, S3 Tables product. The thing that we saw on it, I think, was motivated by a couple of things.

One was that, if you think about how Iceberg is structured, right, the reason that people are moving from plain Parquet to these open table formats is they want to be able to mutate the data, right? And there's a bunch of other reasons in terms of snapshots and stuff like that. But ultimately, people were putting logs and warehouse-style data into S3 as Parquet. They would have to either rewrite the whole object or add extra Parquet files.

Corey Quinn: We're going to store the same thing in three different formats. Yeah.

Andy Warfield: Yeah. And so the OTFs, and Iceberg in particular, give you this first-class table construct. It gives you a way, basically by adding a layer of metadata, also as objects, on top of the Parquet files, to reflect the fact that if I have a gigabyte Parquet file that's, like, a massive, massive table, and I want to do an update to one row of it, I can write that one row as a new Parquet file and update the metadata to say most of the table's over here, except for this one row that's changed over here.

Right? And by extension, it gives you this mutability. The fact that the top of that metadata ends up being like a file system superblock, it's like a root node on the tree, means that there's a place to atomically move the view of the table from one state to the next, kind of like a git commit. And so now you've got this more transactional primitive for representing tables. And so that was kind of the thing that was pulling customers toward it, along with the fact that it was getting integrated into a lot of the analytics frameworks. However, in all of the customer meetings that we were having about Iceberg, we were hearing excitement, and, like, a low sort of tenor of post-traumatic stress disorder, right?

Like, a little bit of, and so one of the things folks said was, it works great on day one. But as I do more and more updates, I either have to be really vigilant about running compaction, which is to fold everything back down, or performance actually falls off. And so now I have to do this very storage-y maintenance task of running compaction.
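
A toy illustration of the "superblock" idea, in the spirit of what Andy describes rather than Iceberg's actual file layout: the table is a root pointer to immutable snapshot metadata, and a commit atomically swaps that pointer only if nobody else committed first, like a git commit.

    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Snapshot:
        # Immutable metadata: which Parquet files make up the table right now.
        data_files: tuple = ()

    @dataclass
    class Table:
        root: Snapshot = field(default_factory=Snapshot)  # the "superblock"

        def commit(self, expected: Snapshot, new: Snapshot) -> bool:
            # Atomic compare-and-swap of the root pointer: succeeds only
            # if nobody else committed since we read `expected`.
            if self.root is expected:
                self.root = new
                return True
            return False

    table = Table()
    base = table.root
    # "Update one row" = write a small new Parquet file and keep the old ones.
    updated = Snapshot(data_files=base.data_files + ("row-update-0001.parquet",))
    assert table.commit(base, updated)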

Corey Quinn: The beautiful part of storage for a lot of us is once it's there, it has the capacity, we don't have to think about it. That's the dream. Turns out it doesn't work that way for everyone.

Andy Warfield: And the customers that were the most sophisticated on Iceberg had often suffered enough with that that they would wrap an engineering team around it, and then they were building their own, you know, catalogs and compaction things. Which, again, they had a mixed experience with: this is awesome, because it's better than what's in the existing open source code, and we're pushing stuff back; but we've made mistakes with it, and potentially had events where we've messed up reference counting, or, you know, deleted a snapshot we didn't want to delete. And folks were like, why don't you guys just do this?

And so that's kind of been the thing that's driven it. We really felt that where you're building a table construct as a richer data type on top of an object construct, that translation and maintenance task is ultimately a storage thing.

And it's a thing that we have the durability experience to do well on, and that we have the deep understanding of the underlying storage system to make performance really scream on. And so that's kind of been the motivation.

Corey Quinn: When I was first introduced to S3 Tables before its launch, my default, and as it turns out naive, assumption was: oh, since you know exactly what the use case looks like, it will probably have a cost advantage over a standard S3 storage approach. In fact, it's slightly more expensive. Why is that? Because the easy and lazy answer is: oh, you decided to nickel-and-dime on this. But I have been informed that is not the case.

Andy Warfield: The pricing and cost structure of Tables has actually been, I don't know, I probably don't get to say this about pricing a ton, but one of the more surprisingly interesting technical aspects of working on the thing. And so the data path end of it was, we did a whole bunch of work with the S3 index, and open source contributions to Iceberg, to make sure that we were naming and structuring namespace partitions to maximize TPS for Iceberg tables.

And so we actually are spending more resources on the S3 namespace for S3 Tables to get performance for them, because Iceberg has this pattern of accessing lots of objects in parallel. So that's kind of the simple bit. The more complicated bit, which is absolutely fascinating, is this challenge of doing compaction.

It's remarkable, and it's incredibly difficult to actually price. What compaction is doing is, like I said, you've got one giant Parquet file, possibly, as an initial table, and over time you're adding additional Parquet files. Each one of those adds a bunch of metadata files.

And so you're fragmenting your data up like crazy. And the simple task of compaction is to take all of those changes, throw away the stuff that was deleted, keep the stuff that's alive, and fold it into a single, or a small number of, very large files, so that you can get back to doing large reads of just the columns of the database that you care about.

Right? Maximize the utilization of your request path, because that gets you huge performance, and it also gets you the most usable bytes read per bytes used. The challenge to it is that the way the customer workload updates the data in the table completely changes the complexity of compaction, workload to workload.

So a read-only database, right, like a table that never changes, obviously doesn't need compaction. It just kind of sits as it is. The one exception to that is you might decide to restructure that table in the background over time if you notice that the queries are accessing it in a way that the table is not well laid out for.

Right? Like, so you might want to do, like, sort compaction to reorder the rows in the table in a way that speeds up queries later. That's not a thing we do now, but that's kind of an interesting direction. Going back to just the deletion reclamation: if I am putting in, you know, four megabytes of Parquet at a time in chunks, and at a trailing horizon of, like, 60 days,

I'm deleting all of the data that's aged out at that point, that's a very inexpensive workload to maintain from a compaction perspective, because you can just throw away those old files. If I am writing little, little updates all the time, it's a bit more expensive, because I have to frequently go and grab all those little updates.

Say I'm writing one row per second, all the time: I have to fold those in and convert them into something larger. And if I am continuously updating fragmented rows through all of my table, right, if I'm deleting one out of every hundred rows at random through the whole table, it's kind of the pathological worst case, because now compaction has to read everything to get the storage-cost reclamation of folding it all back together. And so there's this crazy tension between storage utilization and performance. And then there's another crazy tension between the work you do to put the data in a good place versus the access to the data to amortize that work.
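
In spirit, the maintenance task looks like the hedged pyarrow sketch below: read the live fragments, drop the deleted rows, and fold everything into one large file so reads can go wide again. Real Iceberg compaction also rewrites the metadata tree; the "deleted" flag column here is an illustrative stand-in for delete tracking.

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    def compact(fragment_paths, out_path):
        """Fold many small Parquet fragments into one large file."""
        live = []
        for path in fragment_paths:
            table = pq.read_table(path)  # the read cost scales with fragmentation
            keep = pc.invert(table.column("deleted"))  # boolean mask of live rows
            live.append(table.filter(keep).drop_columns(["deleted"]))
        pq.write_table(pa.concat_tables(live), out_path)

    compact(["part-0001.parquet", "part-0002.parquet"], "compacted.parquet")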

Corey Quinn: There's also an amortization story that's very different on S3 Standard. I used to work at a company that did expense reports. We would upload receipts, and they would get looked at either zero or one times, and that was it. And that sort of offsets me misusing S3 as a database, as I basically can misuse anything like that if I hold it wrong.

Andy Warfield: Yeah, totally. And so this is an area where we actually wrote a pile of code on modeling compaction for all these different workloads when we were building the system and pricing it out. I think that we've ended up in a good initial spot. I think that we are going to make a pile of improvements to it over time, but this is an absolutely remarkable bit of the service, and we're learning a ton.

Corey Quinn: One last topic I want to delve into before we call this an episode, and it's a complete departure from the storage side of the universe. It turns out that people don't emerge into the world fully formed, already working on the thing that they're working on now. Before you worked on storage, you worked on hypervisors.

You were one of the founders of the Xen project, to my understanding, which is historically what AWS used as its hypervisor. And now it doesn't; it uses Nitro, which historically was KVM-derived. Was that, I guess, emotionally challenging? As far as, this is this thing you made, it's your baby, and now they're walking away from it. How did you handle that transition?

Andy Warfield: I don't think anybody's asked me a Xen question, uh, in quite a while. I was one of the early folks working on Xen. I did it when I was in grad school, living in the UK, on that team. I don't know. We worked on that thing and we open-sourced it. And it was actually a point of pride that we were a bunch of university researchers and that we were building something that we were actually maintaining in open source and that other folks could use, right?

Like, that was sort of a thing, you know; we were critical of other projects for not releasing their code at the time. And so it was incredibly rewarding at that point, I think, for all of us working on it, to see folks pick it up and use it. It was remarkable to see AWS in particular pick it up and really prove it out at scale.

It was incredible. I probably made the wrong choice at the time, doing an enterprise shrink-wrap software startup around it instead of going and working on a cloud system. But then eventually coming to AWS and seeing all of the stuff that had been done with Nitro, and, I mean, over the, whatever it was, probably like 20 years in the middle there, the way that all of the CPU architectures evolved, you know, to be much more accommodating of virtualization. Nitro really made a lot of sense. I don't think I had a lot of grief about seeing Xen move out. I've always been more interested in what I'm working on right now.

Corey Quinn: That's uncommon, just because so many people fall into the trap of identifying themselves with the technology they're working on.

Like, I still talk to folks occasionally who are sort of the Maytag repairman, hanging onto Perl with two hands and a death grip for the three companies that are still actively developing with it. And they don't want to move on to something that's more broadly adopted or, in many cases, is a better technical fit for whatever challenge they're trying to overcome.

Yeah. It's admirable and laudable that you were able to, I guess, let go, if that makes sense.

Andy Warfield: That's funny. You know, I think I actually have the opposite reaction on some stuff, which is that, you know, it's, whatever, it's January 7th today. It's usually the time of the year where I wonder if I've been in storage for too long.

You know, as an example, right, that I've worked in one section of technology for a bunch of years now; I worked on, you know, security and distributed systems and hypervisors at earlier points. And I think the thing that has been really, really remarkable to me about getting to work on S3 in particular is I spend so much time talking to customers about their workloads, and I've learned more about, like, databases and machine learning and, you know, time series systems and all this stuff than I ever expected to learn working on a storage system.

Over the past bunch of years, like, every day there's some surprising new thing.

Corey Quinn: Storage is a way of touching almost everything. And historically for me, it was something I tried to touch as little as possible, because I'm both unlucky and have an aura when it comes to breaking things in production. When you can blow away something stateless, we all laugh, restore the web server, and have a fun discussion.

Do that to the data warehouse, and there might not be a company here tomorrow for us to have that conversation at.

Andy Warfield: Yeah, that's true. That's probably the sentiment that I've heard the most from strong engineers that have moved into working on storage and then moved on.

Corey Quinn: "Everything I do here suddenly matters; how do we make that not happen?" Yeah. No, it's been an incredible evolution. Any chance you can give us a peek at what you're thinking about next? What's hard for you these days? What's exciting?

Andy Warfield: Well, I mean, scale is always exciting, and the reality of working on S3 is that, you know, something is always needing to scale up on some dimension of the system.

And so a ton of my time is on those aspects of things. There's this stat you've probably heard from me before, but I can't get over it, having been used to working on enterprise storage prior to this: the investigation that we did last year where we were looking at what some of the largest-scale S3 customers were, and finding out that there were actually, like, 10,000 customers that had buckets where the buckets had objects spanning over a million physical hard disks.

I had to have a moment on that one and just go, like, I don't actually know how to visualize this, for the first time.

Corey Quinn: Yeah, that is beyond my ability to wrap my head around.

Andy Warfield: So, I mean, scale is a big thing. The adoption of S3 Tables has been really exciting to watch over the last month. I think, you know, one thing that I'm personally really excited about, and this is a totally personal take, but inside of S3, one of the bits of sentiment that I've seen following the last re:Invent is people are commenting about velocity, and how there's a bunch of motion with features and stuff. And I think when you look at Tables as an example, there's a bit of an old-school AWS sentiment about it. Like, in S3, I think we have such a high bar for correctness that sometimes it bleeds into also having a sort of desire for perfection on feature launches.

And I think that we're starting to, you know, regain the perspective that it's okay to launch something that is usable but not yet complete, as a way of listening and improving, although there still needs to be a high bar on the way that you operate it. And that's kind of where we are with Tables.

I think Tables is early and it has a bunch of sharp edges, but we proved out that for a bunch of customers it was a meaningful starting point, and we're seeing that happen. And so I'm excited to see how it goes. Sorry.

Corey Quinn: No, please. I will challenge the reticence on it, in that this is exactly what I like seeing from AWS.

I don't want five new storage services; I want feature expansion of the existing ones. And you talk about sharp edges. I won't deny that they're there, but they don't take the form of "and then it accidentally just drops all of your data and you'll never get it back." It doesn't have the disastrous failure modes.

They're UX sharp edges. It's, okay, it's challenging to load a bunch of data in if we already have a functional Iceberg setup; this is probably better today for net new. That sort of rough edge feels a lot more addressable, and the sort of thing that customers understand and can empathize with, as opposed to, well, we just got sloppy because we had to get it over the line for re:Invent and ran out of time. There are companies that do that; I don't see AWS being one of them.

Andy Warfield: That bit of discipline was one of my favorite things over the past year, right? Like, the team really made quick but careful compromises on stuff. And I think, you know, especially with S3 Tables, we entered re:Invent with velocity, and I'm excited looking at what's coming out over the next, like, month-on-month plan for the thing.

And so I'm super, you know, just, like, excited about how that feature is going to evolve over the next year.

Corey Quinn: As am I. I really want to thank you for being so generous with your time. If people want to learn more, where's the best place for them to find you?

Andy Warfield: They can drop me a note. I'm on LinkedIn or they can just email Warfield at, uh, at Amazon.

Corey Quinn: And we'll put links to that, at least the LinkedIn part into the show notes. Who knows what spam you'll get if we put the actual email address into something where it can get scraped. Thank you so much for your time. I appreciate it.

Andy Warfield: Thanks a lot for having me, Corey. It's super fun to talk.

Corey Quinn: Andy Warfield, Vice President and Distinguished Engineer at AWS.

I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that tells me where I can download it via a BitTorrent endpoint.
