Stepping Onto the AWS Commerce Platform with James Greenfield

Episode Summary

Corey has been angling to get someone from a particular department at AWS for a long while now. In the halls of AWS one may see “Commerce Platform” on a few of the doors. A point of interest for Corey, for sure. His tenacity has paid off as he is joined by James Greenfield, VP of AWS Commerce Platform, who has decided to step into the “Screaming” lineup. James defines Commerce Platform as owning all the infrastructure, processes and software that takes what you’ve stored, and turns it into a number, and in turn makes that number as easy to pay as possible. James discusses moving from EC2 to Commerce Platform, how they’re constantly listening to their customers, the caliber and commitment of the Commerce Platform team, and more!

Episode Show Notes & Transcript

About James
James has been part of AWS for over 15 years. During that time he's led software engineering for Amazon EC2 and more recently leads the AWS Commerce Platform group that runs some of the largest systems in the world, handling volumes of data and request rates that would make your eyes water. And AWS customers trust us to be right all the time so there's no room for error.


Links Referenced:

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.


Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It’s time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. “Screaming in the Cloud” listeners can try Vultr for free today with $150 in credit when they visit getvultr.com/screaming. That’s G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast.


Corey: Finding skilled DevOps engineers is a pain in the neck! And if you need to deploy a secure and compliant application to AWS, forgettaboutit! But that’s where DuploCloud can help. Their comprehensive no-code/low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks, while automating the full DevSecOps lifestyle. Get started with DevOps-as-a-Service from DuploCloud so that your cloud configurations are done right the first time. Tell them I sent you and your first two months are free. To learn more visit: snark.cloud/duplo. That’s snark.cloud/D-U-P-L-O-C-L-O-U-D.


Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. And I’ve been angling to get someone from a particular department at AWS on this show for nearly its entire run. If you were to find yourself in an Amazon building and wander through the various dungeons and boiler rooms and subterranean basements—I presume; I haven’t seen nearly as much of the inside of those buildings as people might think—you pass interesting departments labeled things like ‘Spline Reticulation,’ or whatnot. And then you come to a very particular group called Commerce Platform.


Now, I’m not generally one to tell other people’s stories for them. My guest today is James Greenfield, the VP of Commerce Platform at AWS. James, thank you for joining me and suffering the slings and arrows I will no doubt be hurling at you.


James: Thanks for having me. I’m looking forward to it.


Corey: So, let’s start at the very beginning—because I guarantee you, you’re going to do a better job of giving the chapter and verse answer than I would from a background mired deeply in snark—what is Commerce Platform? It sounds almost like it’s the retail website that sells socks, books, and underpants.


James: So, Commerce Platform actually spans a bunch of different things. And so, I’m going to try not to bore you with a laundry list of all of the things that we do—it’s a much longer list than most people assume even internal to AWS—at its core, Commerce Platform owns all of the infrastructure and processes and software that takes the fact that you’ve been running an EC2 instance, or you’re storing an object in S3 for some period of time, and turns it into a number at the end of the month. That is what you asked for that service and then proceeds to try to give you as many ways to pay us as easily as possible. There are a few other bits in there that are maybe less obvious. One is we’re also responsible for protecting the platform and our customers from fraudulent activity. And then we’re also responsible for helping collect all of the data that we need for internal reporting to support some of the back-end services that a business needs to do things like revenue recognition and general financial reporting.


Corey: One of the interesting aspects about the billing system is just how deeply it permeates everything that happens within AWS. I frequently say that when it comes to cloud, cost and architecture are foundationally and fundamentally the same exact thing. If your entire service goes down, a few interesting things happen. One, I don’t believe a single customer is going to complain other than maybe a few accountants here and there because the books aren’t reconciling, but also you’ve removed a whole bunch of constraints around why things are the way that they are. Like, what is the most efficient way to run this workload?


Well, if all the computers suddenly become free, I don’t really care about efficiency, so much as, “Oh, hey. There’s a fly, what do I have as a flyswatter? That’s right, I’m going to drop a building on it.” And those constraints breed almost everything. I’ve said, for example, that S3 has infinite storage because it does.


They can add drives faster than we’re able to fill them—at least historically; they added some more replication services—but they’re going to be able to buy hard drives faster than the rest of us are going to be able to stretch our budgets. If that constraint of the budget falls away, all bets are really off, and more or less, we’re talking about the destruction of the cloud as a viable business entity. No pressure or anything.


James: [laugh].


Corey: You’re also a recent transplant into AWS billing as a whole, Commerce Platform in general. You spent 15 years at the company, the vast majority of that over in EC2. So, either it was you’ve been exiled to a basically digital Siberia or it was one of those, “Okay, keeping all the EC2 servers up, this is easy. I don’t see what people stress about.” And they say, “Oh, ho ho, try this instead.” How did you find yourself migrating over to the Commerce Platform?


James: That’s actually one I’ve had a lot from folks that I’ve worked with. You’re right, I spent the first 15 or so years of my career at AWS in EC2, responsible for various things over there. And when the leadership role in Commerce Platform opened up, the timing was fortuitous, and part of it, I was in the process of relocating my family. We moved to Vancouver in the middle of last year. And we had an opening in the role and started talking about, potentially, me stepping into that role.


The reason that I took it—there’s a few reasons, but the primary reason is that if I look back over my career, I’ve kind of naturally gravitated towards owning things where people only really remember that they exist when they’re not working. And for some reason, you know, I enjoy the opportunity to try to keep those kinds of services ticking over to the point where people don’t notice them. And so, Commerce Platform lands squarely in that space. I’ve always been attracted to opportunities to have an impact, and it’s hard to imagine having much more of an impact than in the Commerce Platform space. It underpins everything, as you said earlier.


Every single one of our customers depends on the service, whether they think about it or realize it. Every single service that we offer to customers depends on us. And so, that really is the sort of nexus within AWS. And I’m a platform guy, I’ve always been a platform guy. I like the force multiplier nature of platforms, and so Commerce Platform, you know, as I kind of thought through all of those elements, really was a great opportunity to step in.


And I think there’s something to be said for, I’ve been a customer of Commerce Platform internally for a long time. And so, a chance to cross over and be on the other side of that was something that I didn’t want to pass up. And so, you know, I’m digging in, and learning quickly, ramping up. By no means an expert, very dependent on a very smart, talented, committed group of people within the team. That’s kind of the long and short of how and why.


Corey: Let’s say that I am taking on the role of an AWS product team, for the sake of argument. I know, keep the cringe down for a second, as far as oh, God, the wince is just inevitable when the idea of me working there ever comes up to anyone. But I have an idea for a service—obviously, it runs containers, and maybe it does some other things as well—going from idea to six-pager to MVP to barely better than MVP day-one launch, and at some point, various things happen to that service. It gets staffed with a team, objectives and a roadmap get built, a P&L and budget, and a pricing model and the rest. One of the last things that happens, apparently, is someone picks the worst name off of a list of candidates, slaps it on the product, and ships it off there.


At what point does the billing system and figuring out the pricing dimensions for a given service tend to factor in? Is that a last-minute story? Is that almost from the beginning? Where along that journey does, “Oh, by the way, we’re building this thing. Maybe we should figure out, I don’t know, how to make money from it,” factor into the conversation?


James: There are two parts to that answer. Pretty early on as we’re trying to define what that service is going to look like, we’re already typically thinking about what are the dimensions that we might charge along. The actual pricing discussions typically happen fairly late, but identifying those dimensions and, sort of, the right way to present it to customers happens pretty early on. The thing that doesn’t happen early enough is actually pulling the Commerce Platform team in, but it is something that we’re going to work this year to try to get a little bit more in front of.


Corey: Have you found historically that you have a pretty good idea of how a service is going to be priced, everything is mostly thought through, a service goes to either private preview or you’re discussing about a launch, and then more or less, I don’t know, someone like me crops up with a, “Hey, yeah, let’s disregard 90% of what the service does because I see a way to misuse the remaining 10% of it as a database.” And you run some mental math and realize, “Huh. We’re suddenly giving, like, eight petabytes of storage per customer away for free. Maybe we should guard against that because otherwise, it’s rife with misuse.” It used to be that I could find interesting ways to sneak through the cracks of various services—usually in pursuit of a laugh—those are getting relatively hard to come by and invariably a lot more trouble than they’re worth. Is that just better comprehensive diligence internally, is that learning from customers, or am I just bad at this?


James: No, I mean, what you’re describing is almost a variant of the Defender’s Dilemma. There are way more ways to abuse something than you can imagine, and so defending against that is pretty challenging. And it’s important because, you know, if you turn the economics of something upside down, then it just becomes harder for us to offer it to customers who want to use it legitimately. I would say 90% of that improvement is us learning. We make plenty of mistakes, but I think, you know, one of the things that I’ve always been impressed by over my time here is how intentional we are about trying to learn from those mistakes.


And so, I think that’s what you’re seeing there. And then we try very hard to listen to customers, talk to folks like you, because one of the best ways to tackle anything that smells of the Defender’s Dilemma is to harness that collective creativity of a large number of smart people because you really are trying to cover as much ground as possible.


Corey: There was a fun joke going around a while back of what is the most expensive environment you can get running on a free tier account before someone from AWS steps in, and I think I got it to something like half a billion dollars in the first month. Now, I haven’t actually tested this for reasons that mostly have to do with being relatively poor compared to, you know, being able to buy Guam. And understanding as well the fraud protections built into something like AWS are largely built around defending against getting service usage for free that in some way, shape or form, benefits the attacker. The easy example of that would be mining cryptocurrency, which is just super-economic as long as you use someone else’s AWS account to do it. Whereas a lot of my vectors are, “Yeah, ignore all of that. How do I just make the bill artificially high? What can I do to misuse data transfer? And passing a single gigabyte through, how much can I make that per gigabyte cost be?” And, “Oh, circular replication and the Lambda invokes itself pattern,” and basically every bad architectural decision you can possibly make only this time, it’s intentional.


And that shines some really interesting light on it. And I have to give credit where due, a lot of that didn’t come from just me sitting here being sick and twisted nearly so much as it did having seen examples of that type of misconfiguration—by mistake—in a variety of customer accounts, most commonly my own because it turns out that the way I learn things is by screwing them up first.


James: Yeah, you’ve touched on a couple of different things in there. So, you know, maybe the first one is, I typically try to draw a line between fraud and abuse. And fraud is essentially trying to spend somebody else’s money to get something for free. And we spent a lot of time trying to shut that down, and we’re getting really good at catching it. And then abuse is either intentional or unintentional. There’s intentional abuse: You find a chink in our armor and you try to take advantage of it.


But much more commonly is unintentional abuse. It’s not really abuse, you know. Abuse has very negative connotations, but it’s unintentionally setting something up so that you run up a much larger bill than you intended. And we have a number of different internal efforts, and we’re working on a bunch more this year, to try to catch those early on because one of my personal goals is to minimize the frequency with which we surprise customers. And the least favorite kind of surprise for customers is a [laugh] large bill. And so, what you’re talking about there is, in a sufficiently complex system, there’s always going to be weaknesses and ways to get yourself tied up in knots.


We’re trying both at the service team level, but also within my teams to try to find ways to make it as hard as possible to accidentally do that to yourself and then catch when you do so that we can stop it. And even more on the intentional abuse side of things, if somebody’s found a way to do something that’s problematic for our services, then you know, that’s pretty much on us. But we will often reach out and engage with whoever’s doing it and try to understand what they’re trying to do and why. Because often, somebody’s trying to do something legitimate, they’ve got a problem to solve, they found a creative way to solve it, and it may put strain on the service because it’s just not something we designed for, and so we’ll try to work with them to use that to feed into either new services, or find a better place for that workload, or just bolster what they’re using. And maybe that’s something that eventually becomes a fully-fledged feature that we offer to customers. We’re always open to learning from our customers. They have found far more creative ways to get really cool things done with our services than we’ve ever imagined. And that’s true today.


Corey: I mean, most of my service criticisms come down to the fact that you have more-or-less built a very late model, high performing iPad, and I’m out there complaining about, “What a shitty hammer this thing is, it barely works at all, and then it breaks in my hand. What gives?” I would also challenge something you said a minute ago that the worst day for some customers is to get a giant surprise bill, but [unintelligible 00:13:53] to that is, yeah, but, on some level, that’s kind of only money; you do have levers on your side to fix those issues. A worse scenario is you have a customer that exhibits fraud-like behavior, they’re suddenly using far more resources than they ever did before, so let’s go ahead and turn them off or throttle them significantly, and you call them up to tell them you saved them some money, and, “Our Super Bowl ad ran. What exactly do you think you’re doing?” Because they don’t get a second bite at that kind of apple.


So, there’s a parallel on both sides of this. And those are just two examples. The world is full of nuances, and at the scale that you folks operate at, the one-in-a-million events happen multiple times a second, the corner cases become common cases, and I’m surprised—to be direct—how little I see you folks dropping the ball.


James: Credit to all of the teams. I think our secret sauce, if anything, really does come down to our people. Like, a huge amount of what you see as hopefully relatively consistent, good execution comes down to people behind the scenes making sure. You know, like, some of it is software that we built and made sure it’s robust and tested to scale, but there’s always an element of people behind the scenes, when you hit those edge cases or something doesn’t quite go the way that you planned, making sure that things run smoothly. And that, if anything, is something that I’m immensely proud of and is kind of amazing to watch from the inside.


Corey: And, on some level, it’s the small errors that are the bigger concern than the big ones. Back a couple years ago, when they announced GP3 volumes at re:Invent, well, great, we’ll spin up a test volume and kick the tires on it for an hour. And I think it was 80 or 100 gigs or whatnot, and the next day in the bill, it showed up as about $5,000. And it was, “Okay, that’s not great. Not great at all.” And it turned out that it was a mispricing error by I think a factor of a million.


And okay, at least it stood out. But there are scenarios where we were prepared to pay it because, oops, you got one over on us. Good job. That’s never been the mindset I’ve gotten about AWS’s philosophy for pricing. The better example that I love because no one took it seriously, was a few years before that when there was a Lightsail bug in the billing system, and it made the papers because people suddenly found that for their Lightsail instance, they were getting predicted bills of $4 billion.


And the way I see it, you really only had to make that work once and then you’ve made your numbers for the year, so why not? Someone’s going to pay for it, probably. But that was such out-of-the-world numbers that no one saw that and ever thought it was anything other than a bug. It’s the small pernicious things that creep in. Because the billing system is vast; I had no idea when I started working with AWS bills just how complicated it really was.


James: Yeah, I remember both of those, and there’s something in there that you touched on that I think is really important. That’s something that I realized pretty early on at Amazon, and it’s why customer obsession is our flagship leadership principle. It’s not because it’s love and butterflies and unicorns; customer obsession is key to us because that’s how you build a long-term sustainable business is your customers depend on you. And it drives how we think about everything that we do. And in the billing space, small errors, even if there are small errors in the customer’s favor, slowly erode that trust.


So, we take any kind of error really seriously and we try to figure out how we can make sure that it doesn’t happen again. We don’t always get that right. As you said, we’ve built an enormous, super-complex business that’s growing really quickly, and really quick growth like that always acts as kind of a multiplier on top of complexity. And on the pricing points, we’re managing millions of pricing points at the moment.


And our tools that we use internally, there’s always room for improvement. It’s a huge area of focus for us. We’re in the beginning of looking at applying things like formal methods to make sure that we can make very hard guarantees about the correctness of some of those. But at the end of the day, people are plugging numbers in and you need as many belts and braces as possible to make sure that you don’t make mistakes there.


Corey: One of the things that struck me by surprise when I first started getting deep into this space was the fact that the finalized bill was—what does it mean to have this be ‘finalized?’ It can hit the Cost and Usage Report in an S3 bucket and it can change retroactively after the month closed periodically. And that’s when I started to have an inkling of a few things: Not just the sheer scale and complexity inherent to something like the billing system that touches everything, but the sheer data retention stories where you clearly have to be able to go back and reconstruct a bill from the raw data years ago. And I know what the output of all of those things are in the form of Cost and Usage Reports and the billing data from our client accounts—which is the single largest expense in all of our AWS accounts; we spent thousands and thousands and thousands of dollars a year just on storing all of that data, let alone the processing piece of it—the sheer scale is staggering. I used to wonder why does it take you a day to go from recording me using something to it showing up in the bill? And the more I learned the more it became a how can you do that in only a day?


James: Yes, the scale is actually mind-boggling. I’m pretty sure that the core of our billing system is—I’m reasonably confident it’s the largest or one of the largest data processing systems on the planet. I remember pretty early on when I joined Commerce Platform and was still starting to wrap my head around some of these things, Googling the definition of quadrillion because we measured the number of metering events, which is how we record usage in services, on a daily basis in the quadrillions, which is a billion billions. So, it’s just an absolutely staggering number. And so, the scale here is just out of this world.


That’s saying something because it’s not like other services across AWS are small in their own right. But I’m still reasonably sure that being one of a handful of services that is kind of at the nexus of AWS and kind of deals with the aggregate of AWS’s scale, this is probably one of the biggest systems on the planet. And that shows up in all sorts of places. You start with that input, just the sheer volume of metering events, but that has to produce as an output pretty fine-grained line item detailed information, which ultimately rolls up into the total that a customer will see in their bill. But we have a number of different systems further down the pipeline that try to do things like analyze your usage, make sensible recommendations, look for opportunities to improve your efficiency, give you the ability to slice and dice your data and allocate it out to different parts of your business in whatever way it makes sense for your business. And so, those systems have to deal with anywhere from millions to billions to recently, we were talking about trillions of data points themselves. And so, I was tangentially aware of some of the scale of this, but being in the thick of it having joined the team really just does underscore just how vast the systems are.
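
[Editor's sketch: a rough back-of-the-envelope calculation to put the metering volume described above in per-second terms. It assumes a round figure of exactly one quadrillion events per day, taking a quadrillion in its short-scale sense of 10^15, and an evenly spread load; neither assumption comes from the conversation, and the names below are illustrative only.]

```python
# Back-of-the-envelope only: assumes exactly one quadrillion (10**15,
# short scale) metering events per day, spread evenly over the day.
SECONDS_PER_DAY = 24 * 60 * 60
QUADRILLION = 10**15


def events_per_second(events_per_day: int) -> float:
    """Average event rate for a given daily event count."""
    return events_per_day / SECONDS_PER_DAY


rate = events_per_second(QUADRILLION)
print(f"about {rate:.2e} metering events per second")
```

[Under those assumptions the average works out to roughly eleven and a half billion events every second, before counting any of the downstream systems mentioned above.]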


Corey: I think it’s, on some level, more than a little unfortunate that that story isn’t being more widely told, more frequently. Because when Commerce Platform has job postings that are available on the website, you read it and it’s very vague. It doesn’t tend to give hard numbers about a lot of these things, and people who don’t play in these waters can easily be forgiven for thinking the way that you folks do your job is you fire up one of those 24 terabyte of RAM instances that—you know, those monstrous things that you folks offer—and what do you do next? Well, Microsoft Excel. We have a special high memory version that we’ve done some horse-trading with our friends over at Microsoft for.


It’s, yeah, you’re several steps beyond that, at this point. It’s a challenging problem that every one of your customers has to deal with, on some level, as well. But we’re only dealing with the output of a lot of the processing that you folks are doing first.


James: You’re exactly right. And a big focus for some of my teams is figuring out how to help customers deal with that output. Because even if you’re talking about a couple of orders of magnitude reduction, you’re still talking about very large numbers there. So, to help customers make sense of that, we have a range of tools that exist, we’re investing in.


There’s another dimension of complexity in the space that I think is one that’s also very easy to miss. And I think of it as arbitrary complexity. And it’s arbitrary because some of the rules that we have to box within here are driven by legislative changes. As you operate in more and more countries around the world, you want to make sure that we’re tax compliant, that we help our customers be tax compliant. Those rules evolve pretty rapidly, and Country A may sit next to Country B, but that doesn’t mean that they’re talking to one another. They’ve all got their own ideas. They’re trying to accomplish—


Corey: A company is picking up and relocating from India to Germany. How do we—


James: Exactly.


Corey: —change that on the AWS side and the rest? And it’s, “Hoo boy, have you considered burning it all down and filing an insurance claim to start over?” And, like, there’s a lot of complexity buried underneath that that just doesn’t rise to the notice of 99% of your customers.


James: And the fact that it doesn’t rise to the notice is something that we strive for. Like, these shouldn’t be things that customers have to worry about. Because it really is about clearing away the things that, as far as possible, you don’t want to have to spend time thinking about so that you can focus on the thing that your business does that differentiates you. It’s getting rid of that undifferentiated heavy lifting. And there’s a ton of that in this space, and if you’re blissfully unaware of it, then hopefully that means that we’re doing our job.


Corey: What I’m, I think, the most surprised about, and I have been for a long time. And please don’t take this as an insult to various other folks—engineers, the rest, not just in other parts of AWS but throughout the other industry—but talking to the people who work within Commerce Platform has always been just a fantastic experience. The caliber of people that you have managed to attract and largely retain—we don’t own people, they do matriculate out eventually—but the caliber of people that you’ve retained on your teams has just been out of this world. And at first, I wondered, why are these awesome people working on something as boring and prosaic as billing? And then I started learning a little bit more as I went, and, “Oh, wow. How did they learn all the stuff that they have to hold in their head in tension at once to be able to build things like this?” It’s incredibly inspiring just watching the caliber of the people that you’ve been able to bring in.


James: I’ve been really, really excited joining this team, as I’ve gotten to know other folks on the team because there’s some super-smart people here. But what’s really jumped out to me is how committed the team is. This is, for the most part, a team that has been in the space for many years. Many of them have—we talk about boomerangs, folks who leave AWS, go spend some time somewhere else and come back, and there’s a surprisingly high proportion of folks in Commerce Platform who have spent time somewhere else and then come back because they enjoy the space, they find that challenging, folks are attracted to the ability to have an impact because it is so foundational. But yeah, there’s a super-committed core to this team. And I really enjoy working with teams where you’ve got that because then you really can take the long view and build something great. And I think we have tons of opportunities to do that here.


Corey: It sounds ridiculous, but I’ve reached out to team members before to explain two-cent variances in my bill, and never once have I been confronted with a, “It’s two cents. What do you care?” They understand the requirement that these things be accurate, not just, “Eh, take our word for it.” And also, frankly, they understand that two cents on a $20 bill looks a little different on a $20 million bill. So yeah, let us figure out if this is systemic or something I have managed to break.
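
[Editor's sketch: the point about scale can be made concrete with a little arithmetic. It assumes the variance is systemic, meaning the same proportional error applies to every bill; that assumption and the function name are illustrative and are not taken from the conversation.]

```python
from decimal import Decimal


def scaled_variance(variance: Decimal, bill: Decimal, larger_bill: Decimal) -> Decimal:
    """Scale a variance seen on one bill to a larger bill.

    Assumes the error is proportional to the bill total (a systemic error).
    """
    error_rate = variance / bill
    return error_rate * larger_bill


small = scaled_variance(Decimal("0.02"), Decimal("20"), Decimal("20"))
large = scaled_variance(Decimal("0.02"), Decimal("20"), Decimal("20000000"))
print(f"${small} on a $20 bill, ${large} on a $20 million bill")
```

[Two cents on $20 is an error rate of 0.1%; if that rate held across a $20 million bill, the same discrepancy would amount to $20,000.]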


It turns out the Cost and Usage Report processing systems don’t love it when there’s a cost allocation tag whose name contains an emoji. Who knew? It’s the little things in life that just have this fun way of breaking when you least expect it.


James: They’re also a surprisingly interesting problem. So like, it turns out something as simple as rounding numbers consistently across a distributed system at this scale, is a non-trivial problem. And if you don’t, then you do get small seventh or eighth decimal place differences that add up to something that then shows up as a two-cent difference somewhere. And so, there’s some really, really interesting problems in the space. And I think the team often takes these kinds of things as a personal challenge. It should be correct, and it’s not, so we should go make sure it is correct. The interesting problems abound here, but at the end of the day, it’s the kind of thing that any engineering team wants to go and make sure it’s correct because they know that it can be.
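
[Editor's sketch: a minimal illustration of how the order of rounding alone can produce a two-cent difference. It is not a description of how AWS computes bills; the line-item values, the half-even rounding mode, and the function names are all assumptions chosen for the example.]

```python
from decimal import Decimal, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def round_cents(amount: Decimal) -> Decimal:
    """Round an amount to whole cents using banker's rounding."""
    return amount.quantize(CENT, rounding=ROUND_HALF_EVEN)


def total_rounding_each_item(items: list[Decimal]) -> Decimal:
    """Round every line item to cents first, then add the rounded values."""
    return sum((round_cents(item) for item in items), Decimal("0"))


def total_rounding_once(items: list[Decimal]) -> Decimal:
    """Add the exact line items first, then round the total to cents."""
    return round_cents(sum(items, Decimal("0")))


# Five hypothetical line items of four-tenths of a cent each.
line_items = [Decimal("0.004")] * 5

per_item = total_rounding_each_item(line_items)
at_end = total_rounding_once(line_items)
print(f"round each item: ${per_item}; round once: ${at_end}; gap: ${at_end - per_item}")
```

[Each item rounds down to zero on its own, yet the exact sum is two cents. Two components that both round "correctly" but at different stages of a pipeline will therefore disagree, which is why consistency matters more than any one rounding rule.]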


Corey: This episode is sponsored in part by our friends at EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL: on premises, private cloud, and they just announced a fully managed service on AWS and Azure called BigAnimal, all one word.


Don't leave managing your database to your cloud vendor because they're too busy launching another half dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications, including Oracle, to the cloud.


To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.


Corey: On the one hand, I love people who just round and estimate—we all do that, let’s be clear; I sit there and I back-of-the-envelope everything first. But then I look at some of your pricing pages and I count the digits after the zeros. Like, you’re talking about trillionths of a dollar on some of your pricing points. And you add it up in the course of a given hour and it’s like, oh, it’s $250 a month, most months. And you work backwards to way more decimal places of precision than is required, sometimes.


I’m also a personal fan of the bill that counts, for example, number of Route 53 zones. Great. And it counts them to four decimal places of precision. Like, I don’t even know what half of a Route 53 zone is at this point, let alone something to, like, ah, the thousandth of the zone is going to cause this. It’s all an artifact of what the underlying systems are.


Can you by any chance shed a little light on what the evolution of those systems has been over a period of time? I have to imagine that anything you built in the early days, 16 years ago or so from the time of this recording when S3 launched to general availability, you probably didn’t have to worry about this scope and scale of what you do, now. In fact, I suspect if you tried to funnel this volume through S3 back then, the whole thing would have collapsed under its own weight. What’s evolved over the time that you had the billing system there? Because changes come slowly to your environment. And frankly, I appreciate that as a customer. I don’t like surprising people in finance.


James: Yeah, you’re totally right. So, I joined the EC2 team as an engineer myself, some 16 years ago, and the very first thing that I did was our billing integration. And so, my relationship with the Commerce Platform organization—what was the billing team way back when—it goes back over my entire career at AWS. And at the time, the billing team was similar, you know, [unintelligible 00:28:34] eight people. And that was everything. There was none of the scale and complexity; it was all one system.


And much like many of our biggest, oldest services—EC2 is very similar, S3 is as well—there’s been significant growth over the last decade-and-a-half. A lot of that growth has been rapid, and rapid growth presents its own challenges. And you live with decisions that you make early on that you didn’t realize were significant decisions that have pretty deep implications 15 years later. We’re still working through some of those; they present their own challenges. Evolving an existing system to keep up with the growth of business and a customer base that’s as varied and complex as ours is always challenging.


And also harder but I also think more fun than a clean sheet redo at this point. Like, that’s a great thought exercise for, well, if we got to do this again today, what would we do now that we’ve learned so much over the last 15 years? But there’s this—I find it personally fascinating challenge with evolving a live system where it’s like, “No, no, like, things exist, so how do we go from there to where we want to be next?”


Corey: Turn the billing system off for 18 months, rebuild—


James: Yeah. [laugh].


Corey: The whole thing from first principles. Light it up. I’m sure you’d have a much better billing system, and also not a company left anymore.


James: [laugh]. Exactly, exactly. I’ve always enjoyed that challenge. You know, even prior to AWS, my previous careers have involved similar kinds of constraints where you’ve got a live system, or you’ve got an existing—in the one case, it was an existing SDK that was deployed to tens of thousands of customers around the world, and so backwards compatibility was something that I spent the first five years of my career thinking about in way more detail than I think most people do. And it’s a very similar mindset. And I enjoy that challenge. I enjoy that: How do I evolve from here to there without breaking customers along the way?


And that’s something that we take pretty seriously across AWS. I think SimpleDB is the poster child for we never turn things off. But that applies equally to the services that are maybe less visible to customers, and billing is definitely one of them. Like, we don’t get to switch stuff off. We don’t get to throw things away and start again. It’s this constant state of evolution.


Corey: So, let’s say that I were to find a way to route data through a series of two Managed NAT Gateways and then egress to internet, and the sheer density of the expense of that traffic tears a hole in the fabric of space-time, it goes back 15 years ago, and you can make a single change to how the billing system was built. What would it be? What pisses you off the most about the current constraints that you have to work within or around?


James: I think one of the biggest challenges we’ve got, actually, is the concept of an account. Because an account means half-a-dozen different things. And way back, when it seemed like a great idea, you just needed an account; an account was your customer, and it was the same thing as the boundary that you put all your resources inside. And of course, it’s the same thing that you’re going to roll all of your usage up and issue a bill against. And that has been one of the areas that’s seen the most evolution and probably still has a pretty long way to go.


And what’s interesting about that is, that’s probably something we could have seen coming because we watched the retail business go through, kind of, the same evolution because they started with, well, a customer is a customer is a customer and had to evolve to support the concept of sellers and partners. And then users are different than customers, and you want to log in and that’s a different thing. So, we saw that kind of bifurcation of a single entity into a wide range of different related but separate entities, and I think if we’d looked at that, you know, thought out 15 years, then yeah, we could probably have learned something from that. But at the same time, when AWS first kicked off, we had wild ambitions for it, but there was no guarantee that it was going to be the monster that it is today. So, I’m always a little bit reluctant to—like, it’s a great thought exercise, but it’s easy to end up second-guessing a pretty successful 15 years, so I’m always a little bit careful to walk that line. But I think account is one of the things that we would probably go back and think about a little bit more.


Corey: I want to be very clear with this next question that it is intentionally setting up a question I suspect you get a lot. It does not mirror my own thinking on the matter even slightly, but I get a version of it myself all the time. “AWS bills, that sounds boring as hell. Why would you choose to work on such a thing?” Now, I have a laundry list of answers to that that aren’t nearly as interesting as I suspect yours are going to be. What makes working on this problem space interesting to you?


James: There’s a bunch of different things. So, first and foremost, the scale that we’re talking about here is absolutely mind-blowing. And for any engineer who wants to get stuck into problems that deal with mind-blowingly large volumes of data, incredibly rich dimensions, problems where, honestly, applying techniques like statistical reasoning or machine learning is really the only way to chip away at it, that exists in spades in the space. It’s not always immediately obvious, and I think from the outside, it’s easy to assume this is actually pretty simple. So, the scale is a huge part of that.


Corey: “Oh, petabytes. How quaint.”


James: [laugh]. Exactly. Exactly. I mean, it’s mind-blowing every time I see some of the numbers in various parts of the Commerce Platform space. I talked about quadrillions earlier. Trillions is a pretty common unit of measure.


The complexity that I talked about earlier, that’s a result of external environments is another one. So, imposed by external entities, whether it’s a government or a tax authority somewhere, or a business requirement from customers, or ourselves. I enjoy those as well. Those are different kinds of challenge. They really keep you on your toes.


I enjoy thinking of them as an engineering problem, like, how do I get in front of them? And that’s something we spend a lot of time doing in Commerce Platform. And when we get it right, customers are just unaware of it. And then the third one is, I personally am always attracted to the opportunity to have an impact. And this is a space where we get to hopefully positively impact every single customer every day. And that, to me is pretty fulfilling.


Those are kind of the three standout reasons why I think this is actually a super-exciting space. And I think it’s often an underestimated space. I think once folks join the team and sort of start to dig in, I’ve never heard anybody after they’ve joined, telling me that what they’re doing is boring. Challenging, yes. Frustrating, sometimes. Hard, absolutely, but boring never comes up.


Corey: There’s almost no service, other than IAM, that I can think of that impacts every customer simultaneously. And it’s easy for me to sit in the cheap seats and say, “Oh, you should change this,” or, “You should change that.” But every change you have is so massive in scale that it’s going to break a whole bunch of companies’ automations around the bill processing in different ways. You have an entire category of user persona who is used to clicking a certain button in this certain place in the console to generate the report every month, and if that button moves or changes color, or has a different font, suddenly that renders their documentation invalid, and they’re scrambling because it’s not their core competency—nor should it be—and every change you make is so constricted, just based upon all the different concerns that you’ve got to be juggling with. How do you get anything done at all? I find that to be one of the most impressive aspects about your organization, bar none.


James: Yeah, I’m not going to lie and say that it isn’t a challenge, but a lot of it comes down to the talent that we have on the team. We have a super-motivated, super-smart, super-engaged team, and we spend a lot of time figuring out how to make sure that we can keep moving, keep up with the business, keep up with a world that’s getting more complicated [laugh] with every passing day. So, you’ve kind of hit on one of the core challenges there, which is, how do we keep up with all of those different dimensions that are demanding an increasing amount of engineering and new support and new investment from us, while we keep those customers happy?


And I think you touched on something else a little bit indirectly there, which is, a lot of our customers are actually pretty technical across AWS. The customers that Commerce Platform supports are often the least technical of our customers, and so often need the most help understanding why things are the way they are, where the constraints are.


Corey: “A big bill from Amazon. How many books did you people buy last month?”—


James: [laugh]. Exactly.


Corey: —is still very much the level of understanding in some cases. And it’s not because they’re dumb; far from it. It’s just, imagine that some people view there as being more to life than understanding the nuances and intricacies of cloud computing. How dare they?


James: Exactly. Who would have thought?


Corey: So, as you look now over all of your domain, such as it is, what sucks the most? What are you looking to fix as far as impactful changes that the rest of the world might experience? Because I’m not going to accept one of those questions like, “Oh, yeah, on the back-end, we have this storage subsystem for a tertiary thing that just annoys me because it wakes us up once in a whi”—no, no, I want something customer-facing. What’s the painful thing you’re looking at fixing next?


James: I don’t like surprising customers. And free tier is, sort of, one of those buckets of surprises, but there are others. Another one that’s pretty squarely in my sights is, whether we like it or not, customer accounts get compromised. Usually, it’s a password got reused somewhere or was accidentally committed into a GitHub repository somewhere.


And we have pretty established, pretty effective mechanisms for finding all of those, we’ll scan for passwords and credentials, and alert customers to those, and help them correct that pretty quickly. We’re also actually pretty good at detecting when an account does start to do something that suggests that it’s been compromised. Usually, the first thing that a compromised account starts to do is cryptocurrency mining. We’re pretty quick to catch those; we catch those within a matter of hours, much faster most days.


What we haven’t really cracked and where I’m focused at the moment is getting back to the customer in a way that’s effective. And by that I mean specifically, we detect an account compromise super-quickly, we reach out automatically. And so, you know, a customer has got some kind of contact from us usually within a couple of hours. It’s not having the effect that we need it to. Customers are still being surprised a month later by a large bill. And so, we’re digging into how much of that is because they never saw the contact, or they didn’t know what to do with the contact.


Corey: It got buried with all the other, “Hey, we saw you spun up an S3 bucket. Have you heard of what S3 is?” Again, that’s all valuable, but you have 300-some-odd services. If you start doing that for every service, you’re going to hit mail sending limits for Gmail.


James: Exactly. It’s not just enough that we detect those and notify customers; we have to reduce the size of the surprise. It’s one thing to spend 100 bucks a month on average, and then suddenly find that your spend has jumped $250 because you reused the password somewhere and somebody got ahold of it and it’s cryptocurrency-mining your account. It’s a whole different ballgame to spend 100 bucks a month and then at the end of the month discover that your bill is suddenly $2,000 or $20,000. And so, that’s something that I really wanted to make some progress on this year. 


Corey: I’ve really enjoyed our conversation. If people want to learn more about how you view these things, how you’re approaching some of these problems, or potentially are just the right kind of warped to consider joining up, where’s the best place for them to go?


James: They should drop me an email at [email protected]. That is the most direct way to get hold of me, and I promise I will get back to you. I try to stay on top of my email as much as possible. But that will come straight to me, and I’m always happy to talk to folks about the space, talk to folks about opportunities in this team, opportunities across AWS, or just hear what’s not working, make sure that it’s something that we’re aware of and looking at.


Corey: Throughout Amazon, but particularly within Commerce Platform, I’ve always appreciated the response of, whenever I report something, no matter how ridiculous it is—and I assure you there’s an awful lot of ridiculousness in my bug reports—the response has always been the same: “Tell me more. Help me understand what it is you’re trying to achieve—even if it is ridiculous—so we can look at this and see what is actually going on.” Every Amazonian team has been great about that or you’re not at Amazon very long, but you folks have taken that to an otherworldly level. I just want to thank you for doing that.


James: I appreciate you for calling that out. We try, you know, we really do. We take listening to our customers very seriously because, at the end of the day, that’s what makes us better, and that’s how we make sure we’re in it for the long haul.


Corey: Thanks once again for being so generous with your time. I really appreciate it.


James: Yeah, thanks for having me on. I’ve enjoyed it.


Corey: James Greenfield, VP of Commerce Platform at AWS. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment—possibly on YouTube as well—about how you aren’t actually giving this five stars at all; you have taken three trillionths of a star off of the rating.


Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.


Announcer: This has been a HumblePod production. Stay humble.


Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It’s time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. “Screaming in the Cloud” listeners can try Vultr for free today with a $150 in credit when they visit getvultr.com/screaming. That’s G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast.

Corey: Finding skilled DevOps engineers is a pain in the neck! And if you need to deploy a secure and compliant application to AWS, forgettaboutit! But that’s where DuploCloud can help. Their comprehensive no-code/low-code software platform guarantees a secure and compliant infrastructure in as little as two weeks, while automating the full DevSecOps lifestyle. Get started with DevOps-as-a-Service from DuploCloud so that your cloud configurations are done right the first time. Tell them I sent you and your first two months are free. To learn more visit: snark.cloud/duplo. Thats’s snark.cloud/D-U-P-L-O-C-L-O-U-D.

Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. And I’ve been angling to get someone from a particular department at AWS on this show for nearly its entire run. If you were to find yourself in an Amazon building and wander through the various dungeons and boiler rooms and subterranean basements—I presume; I haven’t seen nearly as many of you inside of those buildings as people might think—you pass interesting departments labeled things like ‘Spline Reticulation,’ or whatnot. And then you come to a very particular group called Commerce Platform.

Now, I’m not generally one to tell other people’s stories for them. My guest today is James Greenfield, the VP of Commerce Platform at AWS. James, thank you for joining me and suffering the slings and arrows I will no doubt be hurling at you.

James: Thanks for having me. I’m looking forward to it.

Corey: So, let’s start at the very beginning—because I guarantee you, you’re going to do a better job of giving the chapter and verse answer than I would from a background mired deeply in snark—what is Commerce Platform? It sounds almost like it’s the retail website that sells socks, books, and underpants.

James: So, Commerce Platform actually spans a bunch of different things. And so, I’m going to try not to bore you with a laundry list of all of the things that we do—it’s a much longer list than most people assume even internal to AWS—at its core, Commerce Platform owns all of the infrastructure and processes and software that takes the fact that you’ve been running an EC2 instance, or you’re storing an object in S3 for some period of time, and turns it into a number at the end of the month. That is what you asked for that service and then proceeds to try to give you as many ways to pay us as easily as possible. There are a few other bits in there that are maybe less obvious. One is we’re also responsible for protecting the platform and our customers from fraudulent activity. And then we’re also responsible for helping collect all of the data that we need for internal reporting to support some of the back-ends services that a business needs to do things like revenue recognition and general financial reporting.

Corey: One of the interesting aspects about the billing system is just how deeply it permeates everything that happens within AWS. I frequently say that when it comes to cloud, cost and architecture are foundationally and fundamentally the same exact thing. If your entire service goes down, a few interesting things happen. One, I don’t believe a single customer is going to complain other than maybe a few accountants here and there because the books aren’t reconciling, but also you’ve removed a whole bunch of constraints around why things are the way that they are. Like, what is the most efficient way to run this workload?

Well, if all the computers suddenly become free, I don’t really care about efficiency, so much is, “Oh, hey. There’s a fly, what do I have as a flyswatter? That’s right, I’m going to drop a building on it.” And those constraints breed almost everything. I’ve said, for example, that S3 has infinite storage because it does.

They can add drives faster than we’re able to fill them—at least historically; they added some more replication services—but they’re going to be able to buy hard drives faster than the rest of us are going to be able to stretch our budgets. If that constraint of the budget falls away, all bets are really off, and more or less, we’re talking about the destruction of the cloud as a viable business entity. No pressure or anything.

James: [laugh].

Corey: You’re also a recent transplant into AWS billing as a whole, Commerce Platform in general. You spent 15 years at the company, the vast majority of that over an EC2. So, either it was you’ve been exiled to a basically digital Siberia or it was one of those, “Okay, keeping all the EC2 servers up, this is easy. I don’t see what people stress about.” And they say, “Oh, ho ho, try this instead.” How did you find yourself migrating over to the Commerce Platform?

James: That’s actually one I’ve had a lot from folks that I’ve worked with. You’re right, I spent the first 15 or so years of my career at AWS in EC2, responsible for various things over there. And when the leadership role in Commerce Platform opened up, the timing was fortuitous, and part of it, I was in the process of relocating my family. We moved to Vancouver in the middle of last year. And we had an opening in the role and started talking about, potentially, me stepping into that role.

The reason that I took it—there’s a few reasons, but the primary reason is that if I look back over my career, I’ve kind of naturally gravitated towards owning things where people only really remember that they exist when they’re not working. And for some reason, you know, I enjoy the opportunity to try to keep those kinds of services ticking over to the point where people don’t notice them. And so, Commerce Platform lands squarely in that space. I’ve always been attracted to opportunities to have an impact, and it’s hard to imagine having much more of an impact than in the Commerce Platform space. It underpins everything, as you said earlier.

Every single one of our customers depends on the service, whether they think about it or realize it. Every single service that we offer to customers depends on us. And so, that really is the sort of nexus within AWS. And I’m a platform guy, I’ve always been a platform guy. I like the force multiplier nature of platforms, and so Commerce Platform, you know, as I kind of thought through all of those elements, really was a great opportunity to step in.

And I think there’s something to be said for, I’ve been a customer of Commerce Platform internally for a long time. And so, a chance to cross over and be on the other side of that was something that I didn’t want to pass up. And so, you know, I’m digging in, and learning quickly, ramping up. By no means an expert, very dependent on a very smart, talented, committed group of people within the team. That’s kind of the long and short of how and why.

Corey: Let’s say that I am taking on the role of an AWS product team, for the sake of argument. I know, keep the cringe down for a second, as far as oh, God, the wince is just inevitable when the idea of me working there ever comes up to anyone. But I have an idea for a service—obviously, it runs containers, and maybe it does some other things as well—going from idea to six-pager to MVP to barely better than MVP day-one launch, and at some point, various things happen to that service. It gets staff with a team, objectives and a roadmap get built, a P&L and budget, and a pricing model and the rest. One the last thing that happens, apparently, is someone picks the worst name off of a list of candidates, slaps it on the product, and ships it off there.

At what point does the billing system and figuring out the pricing dimensions for a given service tend to factor in? Is that a last-minute story? Is that almost from the beginning? Where along that journey does, “Oh, by the way, we’re building this thing. Maybe we should figure out, I don’t know, how to make money from it.” Factor into the conversation?

James: There are two parts to that answer. Pretty early on as we’re trying to define what that service is going to look like, we’re already typically thinking about what are the dimensions that we might charge along. The actual pricing discussions typically happen fairly late, but identifying those dimensions and, sort of, the right way to present it to customers happens pretty early on. The thing that doesn’t happen early enough is actually pulling the Commerce Platform team in. but it is something that we’re going to work this year to try to get a little bit more in front of.

Corey: Have you found historically that you have a pretty good idea of how a service is going to be priced, everything is mostly thought through, a service goes to either private preview or you’re discussing about a launch, and then more or less, I don’t know, someone like me crops up with a, “Hey, yeah, let’s disregard 90% of what the service does because I see a way to misuse the remaining 10% of it as a database.” And you run some mental math and realize, “Huh. We’re suddenly giving, like, eight petabytes of storage per customer away for free. Maybe we should guard against that because otherwise, it’s rife with misuse.” It used to be that I could find interesting ways to sneak through the cracks of various services—usually in pursuit of a laugh—those are getting relatively hard to come by and invariably a lot more trouble than they’re worth. Is that just better comprehensive diligence internally, is that learning from customers, or am I just bad at this?

James: No, I mean, what you’re describing is almost a variant of the Defender’s Dilemma. They are way more ways to abuse something than you can imagine, and so defending against that is pretty challenging. And it’s important because, you know, if you turn the economics of something upside down, then it just becomes harder for us to offer it to customers who want to use it legitimately. I would say 90% of that improvement is us learning. We make plenty of mistakes, but I think, you know, one of the things that I’ve always been impressed by over my time here is how intentional we are trying to learn from those mistakes.

And so, I think that’s what you’re seeing there. And then we try very hard to listen to customers, talk to folks like you, because one of the best ways to tackle anything it smells of the Defender’s Dilemma is to harness that collective creativity of a large number of smart people because you really are trying to cover as much ground as possible.

Corey: There was a fun joke going around a while back of what is the most expensive environment you can get running on a free tier account before someone from AWS steps in, and I think I got it to something like half a billion dollars in the first month. Now, I haven’t actually tested this for reasons that mostly have to do with being relatively poor compared to, you know, being able to buy Guam. And understanding as well the fraud protections built into something like AWS are largely built around defending against getting service usage for free that in some way, shape or form, benefits the attacker. The easy example of that would be mining cryptocurrency, which is just super-economic as long as you use someone else’s AWS account to do it. Whereas a lot of my vectors are, “Yeah, ignore all of that. How do I just make the bill artificially high? What can I do to misuse data transfer? And passing a single gigabyte through, how much can I make that per gigabyte cost be?” And, “Oh, circular replication and the Lambda invokes itself pattern,” and basically every bad architectural decision you can possibly make only this time, it’s intentional.

And that shines some really interesting light on it. And I have to give credit where due, a lot of that didn’t come from just me sitting here being sick and twisted nearly so much as it did having seen examples of that type of misconfiguration—by mistake—in a variety of customer accounts, most confidently my own because it turns out that the way I learn things is by screwing them up first.

James: Yeah, you’ve touched on a couple of different things in there. So, you know, maybe the first one is, I typically try to draw a line between fraud and abuse. And fraud is essentially trying to spend somebody else’s money to get something for free. And we spent a lot of time trying to shut that down, and we’re getting really good at catching it. And then abuse is either intentional or unintentional. There’s intentional abuse: You find a chink in our armor and you try to take advantage of it.

But much more commonly is unintentional abuse. It’s not really abuse, you know. Abuse has very negative connotations, but it’s unintentionally setting something up so that you run up a much larger bill than you intended. And we have a number of different internal efforts, and we’re working on a bunch more this year, to try to catch those early on because one of my personal goals is to minimize the frequency with which we surprise customers. And the least favorite kind of surprise for customers is a [laugh] large bill. And so, what you’re talking about there is, in a sufficiently complex system, there’s always going to be weaknesses and ways to get yourself tied up in knots.

We’re trying both at the service team level, but also within my teams to try to find ways to make it as hard as possible to accidentally do that to yourself and then catch when you do so that we can stop it. And even more on the intentional abuse side of things, if somebody’s found a way to do something that’s problematic for our services, then you know, that’s pretty much on us. But we will often reach out and engage with whoever’s doing and try to understand what they’re trying to do and why. Because often, somebody’s trying to do something legitimate, they’ve got a problem to solve, they found a creative way to solve it, and it may put strain on the service because it’s just not something we designed for, and so we’ll try to work with them to use that to feed into either new services, or find a better place for that workload, or just bolster what they’re using. And maybe that’s something that eventually becomes a fully-fledged feature that we offer the customers. We’re always open to learning from our customers. They have found far more creative ways to get really cool things done with our services than we’ve ever imagined. And that’s true today.

Corey: I mean, most of my service criticisms come down to the fact that you have more-or-less built a very late model, high performing iPad, and I’m out there complaining about, “What a shitty hammer this thing is, it barely works at all, and then it breaks in my hand. What gives?” I would also challenge something you said a minute ago that the worst day for some customers is to get a giant surprise bill, but [unintelligible 00:13:53] to that is, yeah, but, on some level, that’s kind of only money; you do have levers on your side to fix those issues. A worse scenario is you have a customer that exhibits fraud-like behavior, they’re suddenly using far more resources than they ever did before, so let’s go ahead and turn them off or throttle them significantly, and you call them up to tell them you saved them some money, and, “Our Super Bowl ad ran. What exactly do you think you’re doing?” Because they don’t get a second bite at that kind of apple.

So, there’s a parallel on both sides of this. And those are just two examples. The world is full of nuances, and at the scale that you folks operate at, the one-in-a-million events happen multiple times a second, the corner cases become common cases, and I’m surprised—to be direct—how little I see you folks dropping the ball.

James: Credit to all of the teams. I think our secret sauce, if anything, really does come down to our people. Like, a huge amount of what you see as hopefully relatively consistent, good execution comes down to people behind the scenes making sure. You know, like, some of it is software that we built and made sure it’s robust and tested to scale, but there’s always an element of people behind the scenes, when you hit those edge cases or something doesn’t quite go the way that you planned, making sure that things run smoothly. And that, if anything, is something that I’m immensely proud of and is kind of amazing to watch from the inside.

Corey: And, on some level, it’s the small errors that are the bigger concern than the big ones. Back a couple years ago, when they announced GP3 volumes at re:Invent, well, great, we’ll spin up a test volume and kick the tires on it for an hour. And I think it was 80 or 100 gigs or whatnot, and the next day in the bill, it showed up as about $5,000. And it was, “Okay, that’s not great. Not great at all.” And it turned out that it was a mispricing error by I think a factor of a million.

And okay, at least it stood out. But there are scenarios where we were prepared to pay it because, oops, you got one over on us. Good job. That’s never been the mindset I’ve gotten about AWS’s philosophy for pricing. The better example that I love because no one took it seriously, was a few years before that when there was a Lightsail bug in the billing system, and it made the papers because people suddenly found that for their Lightsail instance, they were getting predicted bills of $4 billion.

And the way I see it, you really only had to make that work once and then you’ve made your numbers for the year, so why not? Someone’s going to pay for it, probably. But that was such out-of-the-world numbers that no one saw that and ever thought it was anything other than a bug. It’s the small pernicious things that creep in. Because the billing system is vast; I had no idea when I started working with AWS bills just how complicated it really was.

James: Yeah, I remember both of those, and there’s something in there that you touched on that I think is really important. That’s something that I realized pretty early on at Amazon, and it’s why customer obsession is our flagship leadership principle. It’s not because it’s love and butterflies and unicorns; customer obsession is key to us because that’s how you build a long-term sustainable business: your customers depend on you. And it drives how we think about everything that we do. And in the billing space, small errors, even if they are small errors in the customer’s favor, slowly erode that trust.

So, we take any kind of error really seriously and we try to figure out how we can make sure that it doesn’t happen again. We don’t always get that right. As you said, we’ve built an enormous, super-complex business that’s growing really quickly, and really quick growth like that always acts as kind of a multiplier on top of complexity. And on the pricing points, we’re managing millions of pricing points at the moment.

And our tools that we use internally, there’s always room for improvement. It’s a huge area of focus for us. We’re in the beginning of looking at applying things like formal methods to make sure that we can make very hard guarantees about the correctness of some of those. But at the end of the day, people are plugging numbers in and you need as many belts and braces as possible to make sure that you don’t make mistakes there.

Corey: One of the things that struck me by surprise when I first started getting deep into this space was the fact that the finalized bill was—what does it mean to have this be ‘finalized?’ It can hit the Cost and Usage Report in an S3 bucket and it can change retroactively after the month closed periodically. And that’s when I started to have an inkling of a few things: Not just the sheer scale and complexity inherent to something like the billing system that touches everything, but the sheer data retention stories where you clearly have to be able to go back and reconstruct a bill from the raw data years ago. And I know what the output of all of those things is in the form of Cost and Usage Reports and the billing data from our client accounts—which is the single largest expense in all of our AWS accounts; we spent thousands and thousands and thousands of dollars a year just on storing all of that data, let alone the processing piece of it—the sheer scale is staggering. I used to wonder, why does it take you a day to go from me using something to it showing up in the bill? And the more I learned, the more it became, how can you do that in only a day?

James: Yes, the scale is actually mind-boggling. I’m pretty sure that the core of our billing system is—I’m reasonably confident it’s the largest or one of the largest data processing systems on the planet. I remember pretty early on when I joined Commerce Platform and was still starting to wrap my head around some of these things, Googling the definition of quadrillion because we measured the number of metering events, which is how we record usage in services, on a daily basis in the quadrillions, which is a billion billions. So, it’s just an absolutely staggering number. And so, the scale here is just out of this world.

That’s saying something because it’s not like other services across AWS are small in their own right. But I’m still reasonably sure that being one of a handful of services that is kind of at the nexus of AWS and kind of deals with the aggregate of AWS’s scale, this is probably one of the biggest systems on the planet. And that shows up in all sorts of places. You start with that input, just the sheer volume of metering events, but that has to produce as an output pretty fine-grained line item detailed information, which ultimately rolls up into the total that a customer will see in their bill. But we have a number of different systems further down the pipeline that try to do things like analyze your usage, make sensible recommendations, look for opportunities to improve your efficiency, give you the ability to slice and dice your data and allocate it out to different parts of your business in whatever way it makes sense for your business. And so, those systems have to deal with anywhere from millions to billions to recently, we were talking about trillions of data points themselves. And so, I was tangentially aware of some of the scale of this, but being in the thick of it having joined the team really just does underscore just how vast the systems are.

Corey: I think it’s, on some level, more than a little unfortunate that that story isn’t being more widely told, more frequently. Because when Commerce Platform has job postings that are available on the website, you read it and it’s very vague. It doesn’t tend to give hard numbers about a lot of these things, and people who don’t play in these waters can easily be forgiven for thinking the way that you folks do your job is you fire up one of those 24 terabyte of RAM instances that—you know, those monstrous things that you folks offer—and what do you do next? Well, Microsoft Excel. We have a special high memory version that we’ve done some horse-trading with our friends over at Microsoft for.

It’s, yeah, you’re several steps beyond that, at this point. It’s a challenging problem that every one of your customers has to deal with, on some level, as well. But we’re only dealing with the output of a lot of the processing that you folks are doing first.

James: You’re exactly right. And a big focus for some of my teams is figuring out how to help customers deal with that output. Because even if you’re talking about a couple of orders of magnitude reduction, you’re still talking about very large numbers there. So, to help customers make sense of that, we have a range of tools that exist and that we’re investing in.

There’s another dimension of complexity in the space that I think is one that’s also very easy to miss. And I think of it as arbitrary complexity. And it’s arbitrary because some of the rules that we have to box within here are driven by legislative changes. As you operate in more and more countries around the world, you want to make sure that we’re tax compliant, that we help our customers be tax compliant. Those rules evolve pretty rapidly, and Country A may sit next to Country B, but that doesn’t mean that they’re talking to one another. They’ve all got their own ideas. They’re trying to accomplish—

Corey: A company is picking up and relocating from India to Germany. How do we—

James: Exactly.

Corey: —change that on the AWS side and the rest? And it’s, “Hoo boy, have you considered burning it all down and filing an insurance claim to start over?” And, like, there’s a lot of complexity buried underneath that that just doesn’t rise to the notice of 99% of your customers.

James: And the fact that it doesn’t rise to the notice is something that we strive for. Like, these shouldn’t be things that customers have to worry about. Because it really is about clearing away the things that, as far as possible, you don’t want to have to spend time thinking about so that you can focus on the thing that your business does that differentiates you. It’s getting rid of that undifferentiated heavy lifting. And there’s a ton of that in this space, and if you’re blissfully unaware of it, then hopefully that means that we’re doing our job.

Corey: What I’m, I think, the most surprised about, and I have been for a long time. And please don’t take this as an insult to various other folks—engineers, the rest, not just in other parts of AWS but throughout the other industry—but talking to the people who work within Commerce Platform has always been just a fantastic experience. The caliber of people that you have managed to attract and largely retain—we don’t own people, they do matriculate out eventually—but the caliber of people that you’ve retained on your teams has just been out of this world. And at first, I wondered, why are these awesome people working on something as boring and prosaic as billing? And then I started learning a little bit more as I went, and, “Oh, wow. How did they learn all the stuff that they have to hold in their head in tension at once to be able to build things like this?” It’s incredibly inspiring just watching the caliber of the people that you’ve been able to bring in.

James: I’ve been really, really excited joining this team, as I’ve gotten other folks on the team because there’s some super-smart people here. But what’s really jumped out to me is how committed the team is. This is, for the most part, a team that has been in the space for many years. Many of them have—we talk about boomerangs, folks who leave AWS, go spend some time somewhere else and come back and there’s a surprisingly high proportion of folks in Commerce Platform who have spent time somewhere else and then come back because they enjoy the space, they find that challenging, folks are attracted to the ability to have an impact because it is so foundational. But yeah, there’s a super-committed core to this team. And I really enjoy working with teams where you’ve got that because then you really can take the long view and build something great. And I think we have tons of opportunities to do that here.

Corey: It sounds ridiculous, but I’ve reached out to team members before to explain two-cent variances in my bill, and never once have I been confronted with a, “It’s two cents. What do you care?” They understand the requirement that these things be accurate, not just, “Eh, take our word for it.” And also, frankly, they understand that two cents on a $20 bill looks a little different on a $20 million bill. So yeah, let us figure out if this is systemic or something I have managed to break.

It turns out the Cost and Usage Report processing systems don’t love it when there’s a cost allocation tag whose name contains an emoji. Who knew? It’s the little things in life that just have this fun way of breaking when you least expect it.

James: They’re also a surprisingly interesting problem. So like, it turns out something as simple as rounding numbers consistently across a distributed system at this scale, is a non-trivial problem. And if you don’t, then you do get small seventh or eighth decimal place differences that add up to something that then shows up as a two-cent difference somewhere. And so, there’s some really, really interesting problems in the space. And I think the team often takes these kinds of things as a personal challenge. It should be correct, and it’s not, so we should go make sure it is correct. The interesting problems abound here, but at the end of the day, it’s the kind of thing that any engineering team wants to go and make sure it’s correct because they know that it can be.
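The rounding point above can be illustrated with a minimal Python sketch. This is not AWS's billing code: the function names and the amounts are hypothetical, and the sub-cent values are exaggerated (third decimal place rather than the seventh or eighth James mentions) so the effect is visible with only five line items. It shows how two components that each round "correctly", but at different stages, can disagree on the same usage.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_cents(amount: Decimal) -> Decimal:
    """Round a dollar amount to whole cents, half-up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def total_round_each(line_items):
    """Hypothetical component A: round every line item, then sum."""
    return sum((round_cents(x) for x in line_items), Decimal("0"))


def total_round_once(line_items):
    """Hypothetical component B: sum at full precision, then round once."""
    return round_cents(sum(line_items, Decimal("0")))


# Five illustrative line items of $0.004 each (made-up values).
line_items = [Decimal("0.004")] * 5

each = total_round_each(line_items)  # every item rounds down to 0.00
once = total_round_once(line_items)  # 0.020 rounds to 0.02

print(each, once, once - each)
```

With these inputs the two totals differ by two cents, which is the kind of variance discussed above. One common way to avoid it, offered here only as an assumption about good practice, is to carry full precision through every stage and round at a single agreed point.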

Corey: This episode is sponsored in part by our friends at EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL: on premises, private cloud, and they just announced a fully managed service on AWS and Azure called BigAnimal, all one word.

Don't leave managing your database to your cloud vendor because they're too busy launching another half dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications, including Oracle, to the cloud.

To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.

Corey: On the one hand, I love people who just round and estimate—we all do that, let’s be clear; I sit there and I back-of-the-envelope everything first. But then I look at some of your pricing pages and I count the digits after the zeros. Like, you’re talking about trillionths of a dollar on some of your pricing points. And you add it up in the course of a given hour and it’s like, oh, it’s $250 a month, most months. And it’s you work backwards to way more decimal places of precision than is required, sometimes.

I’m also a personal fan of the bill that counts, for example, number of Route 53 zones. Great. And it counts them to four decimal places of precision. Like, I don’t even know what half of a Route 53 zone is at this point, let alone something to, like, ah the 1,000th of the zone is going to cause this. It’s all an artifact of what the underlying systems are.

Can you by any chance shed a little light on what the evolution of those systems has been over a period of time? I have to imagine that anything you built in the early days, 16 years ago or so from the time of this recording when S3 launched to general availability, you probably didn’t have to worry about this scope and scale of what you do, now. In fact, I suspect if you tried to funnel this volume through S3 back then, the whole thing would have collapsed under its own weight. What’s evolved over the time that you had the billing system there? Because changes come slowly to your environment. And frankly, I appreciate that as a customer. I don’t like surprising people in finance.

James: Yeah, you’re totally right. So, I joined the EC2 team as an engineer myself, some 16 years ago, and the very first thing that I did was our billing integration. And so, my relationship with the Commerce Platform organization—what was the billing team way back when—it goes back over my entire career at AWS. And at the time, the billing team was similar, you know, [unintelligible 00:28:34] eight people. And that was everything. There was none of the scale and complexity; it was all one system.

And much like many of our biggest, oldest services—EC2 is very similar, S3 is as well—there’s been significant growth over the last decade-and-a-half. A lot of that growth has been rapid, and rapid growth presents its own challenges. And you live with decisions that you make early on that you didn’t realize were significant decisions that have pretty deep implications 15 years later. We’re still working through some of those; they present their own challenges. Evolving an existing system to keep up with the growth of business and a customer base that’s as varied and complex as ours is always challenging.

And also harder but I also think more fun than a clean sheet redo at this point. Like, that’s a great thought exercise for, well, if we got to do this again today, what would we do now that we’ve learned so much over the last 15 years? But there’s this—I find it personally fascinating challenge with evolving a live system where it’s like, “No, no, like, things exist, so how do we go from there to where we want to be next?”

Corey: Turn the billing system off for 18 months, rebuild—

James: Yeah. [laugh].

Corey: —the whole thing from first principles. Light it up. I’m sure you’d have a much better billing system, and also not a company left anymore.

James: [laugh]. Exactly, exactly. I’ve always enjoyed that challenge. You know, even prior to AWS, my previous careers have involved similar kinds of constraints where you’ve got a live system, or you’ve got an existing—in the one case, it was an existing SDK that was deployed to tens of thousands of customers around the world, and so backwards compatibility was something that I spent the first five years of my career thinking about in way more detail than I think most people do. And it’s a very similar mindset. And I enjoy that challenge. I enjoy that: How do I evolve from here to there without breaking customers along the way?

And that’s something that we take pretty seriously across AWS. I think SimpleDB is the poster child for we never turn things off. But that applies equally to the services that are maybe less visible to customers, and billing is definitely one of them. Like, we don’t get to switch stuff off. We don’t get to throw things away and start again. It’s this constant state of evolution.

Corey: So, let’s say that I were to find a way to route data through a series of two Managed NAT Gateways and then egress to internet, and the sheer density of the expense of that traffic tears a hole in the fabric of space-time, it goes back 15 years ago, and you can make a single change to how the billing system was built. What would it be? What pisses you off the most about the current constraints that you have to work within or around?

James: I think one of the biggest challenges we’ve got, actually, is the concept of an account. Because an account means half-a-dozen different things. And way back, when it seemed like a great idea, you just needed an account; an account was your customer, and it was the same thing as the boundary that you put all your resources inside. And of course, it’s the same thing that you’re going to roll all of your usage up and issue a bill against. And that has been one of the areas that’s seen the most evolution and probably still has a pretty long way to go.

And what’s interesting about that is, that’s probably something we could have seen coming because we watched the retail business go through, kind of, the same evolution because they started with, well, a customer is a customer is a customer and had to evolve to support the concept of sellers and partners. And then users are different than customers, and you want to log in and that’s a different thing. So, we saw that kind of bifurcation of a single entity into a wide range of different related but separate entities, and I think if we’d looked at that, you know, thought out 15 years, then yeah, we could probably have learned something from that. But at the same time, when AWS first kicked off, we had wild ambitions for it, but there was no guarantee that it was going to be the monster that it is today. So, I’m always a little bit reluctant to—like, it’s a great thought exercise, but it’s easy to end up second-guessing a pretty successful 15 years, so I’m always a little bit careful to walk that line. But I think account is one of the things that we would probably go back and think about a little bit more.

Corey: I want to be very clear with this next question that it is intentionally setting up a question I suspect you get a lot. It does not mirror my own thinking on the matter even slightly, but I get a version of it myself all the time. “AWS bills, that sounds boring as hell. Why would you choose to work on such a thing?” Now, I have a laundry list of answers to that that aren’t nearly as interesting as I suspect yours are going to be. What makes working on this problem space interesting to you?

James: There’s a bunch of different things. So, first and foremost, the scale that we’re talking about here is absolutely mind-blowing. And for any engineer who wants to get stuck into problems that deal with mind-blowingly large volumes of data, incredibly rich dimensions, problems where, honestly, applying techniques like statistical reasoning or machine learning is really the only way to chip away at it, that exists in spades in the space. It’s not always immediately obvious, and I think from the outside, it’s easy to assume this is actually pretty simple. So, the scale is a huge part of that.

Corey: “Oh, petabytes. How quaint.”

James: [laugh]. Exactly. Exactly. I mean, it’s mind-blowing every time I see some of the numbers in various parts of the Commerce Platform space. I talked about quadrillions earlier. Trillions is a pretty common unit of measure.

The complexity that I talked about earlier, that’s a result of external environments is another one. So, imposed by external entities, whether it’s a government or a tax authority somewhere, or a business requirement from customers, or ourselves. I enjoy those as well. Those are different kinds of challenge. They really keep you on your toes.

I enjoy thinking of them as an engineering problem, like, how do I get in front of them? And that’s something we spend a lot of time doing in Commerce Platform. And when we get it right, customers are just unaware of it. And then the third one is, I personally am always attracted to the opportunity to have an impact. And this is a space where we get to hopefully positively impact every single customer every day. And that, to me is pretty fulfilling.

Those are kind of the three standout reasons why I think this is actually a super-exciting space. And I think it’s often an underestimated space. I think once folks join the team and sort of start to dig in, I’ve never heard anybody after they’ve joined, telling me that what they’re doing is boring. Challenging, yes. Frustrating, sometimes. Hard, absolutely, but boring never comes up.

Corey: There’s almost no service, other than IAM, that I can think of that impacts every customer simultaneously. And it’s easy for me to sit in the cheap seats and say, “Oh, you should change this,” or, “You should change that.” But every change you have is so massive in scale that it’s going to break a whole bunch of companies’ automations around the bill processing in different ways. You have an entire category of user persona who is used to clicking a certain button in this certain place in the console to generate the report every month, and if that button moves or changes color, or has a different font, suddenly that renders their documentation invalid, and they’re scrambling because it’s not their core competency—nor should it be—and every change you make is so constricted, just based upon all the different concerns that you’ve got to be juggling with. How do you get anything done at all? I find that to be one of the most impressive aspects about your organization, bar none.

James: Yeah, I’m not going to lie and say that it isn’t a challenge, but a lot of it comes down to the talent that we have on the team. We have a super-motivated, super-smart, super-engaged team, and we spend a lot of time figuring out how to make sure that we can keep moving, keep up with the business, keep up with a world that’s getting more complicated [laugh] with every passing day. So, you’ve kind of hit on one of the core challenges there, which is, how do we keep up with all of those different dimensions that are demanding an increasing amount of engineering and new support and new investment from us, while we keep those customers happy?

And I think you touched on something else a little bit indirectly there, which is, a lot of our customers are actually pretty technical across AWS. The customers that Commerce Platform supports, are often the least technical of our customers, and so often need the most help understanding why things are the way they are, where the constraints are.

Corey: “A big bill from Amazon. How many books did you people buy last month?”—

James: [laugh]. Exactly.

Corey: —is still very much the level of understanding in some cases. And it’s not because they’re dumb; far from it. It’s just, imagine that some people view there as being more to life than understanding the nuances and intricacies of cloud computing. How dare they?

James: Exactly. Who would have thought?

Corey: So, as you look now over all of your domain, such as it is, what sucks the most? What are you looking to fix as far as impactful changes that the rest of the world might experience? Because I’m not going to accept one of those questions like, “Oh, yeah, on the back-end, we have this storage subsystem for a tertiary thing that just annoys me because it wakes us up once in a whi”—no, no, I want something customer-facing. What’s the painful thing you’re looking at fixing next?

James: I don’t like surprising customers. And free tier is, sort of, one of those buckets of surprises, but there are others. Another one that’s pretty squarely in my sights is, whether we like it or not, customer accounts get compromised. Usually, it’s a password got reused somewhere or was accidentally committed into a GitHub repository somewhere.

And we have pretty established, pretty effective mechanisms for finding all of those, we’ll scan for passwords and credentials, and alert customers to those, and help them correct that pretty quickly. We’re also actually pretty good at detecting when an account does start to do something that suggests that it’s been compromised. Usually, the first thing that a compromised account starts to do is cryptocurrency mining. We’re pretty quick to catch those; we catch those within a matter of hours, much faster most days.

What we haven’t really cracked and where I’m focused at the moment is getting back to the customer in a way that’s effective. And by that I mean specifically, we detect an account compromised super-quickly, we reach out automatically. And so, you know, a customer has got some kind of contact from us usually within a couple of hours. It’s not having the effect that we need it to. Customers are still being surprised a month later by a large bill. And so, we’re digging into how much of that is because they never saw the contact, they didn’t know what to do with the contact.

Corey: It got buried with all the other, “Hey, we saw you spun up an S3 bucket. Have you heard of what S3 is?” Again, that’s all valuable, but you have 300-some-odd services. If you start doing that for every service, you’re going to hit mail sending limits for Gmail.

James: Exactly. It’s not just enough that we detect those and notify customers; we have to reduce the size of the surprise. It’s one thing to spend 100 bucks a month on average, and then suddenly find that your spend has jumped to $250 because you reused the password somewhere and somebody got ahold of it and it’s cryptocurrency-mining your account. It’s a whole different ballgame to spend 100 bucks a month and then at the end of the month discover that your bill is suddenly $2,000 or $20,000. And so, that’s something that I really wanted to make some progress on this year.

Corey: I’ve really enjoyed our conversation. If people want to learn more about how you view these things, how you’re approaching some of these problems, or potentially are just the right kind of warped to consider joining up, where’s the best place for them to go?

James: They should drop me an email at [email protected]. That is the most direct way to get hold of me, and I promise I will get back to you. I try to stay on top of my email as much as possible. But that will come straight to me, and I’m always happy to talk to folks about the space, talk to folks about opportunities in this team, opportunities across AWS, or just hear what’s not working, make sure that it’s something that we’re aware of and looking at.

Corey: Throughout Amazon, but particularly within Commerce Platform, I’ve always appreciated the response of, whenever I report something, no matter how ridiculous it is—and I assure you there’s an awful lot of ridiculousness in my bug reports—the response has always been the same: “Tell me more. Help me understand what it is you’re trying to achieve—even if it is ridiculous—so we can look at this and see what is actually going on.” Every Amazonian team has been great about that or you’re not at Amazon very long, but you folks have taken that to an otherworldly level. I just want to thank you for doing that.

James: I appreciate you for calling that out. We try, you know, we really do. We take listening to our customers very seriously because, at the end of the day, that’s what makes us better, and that’s how we make sure we’re in it for the long haul.

Corey: Thanks once again for being so generous with your time. I really appreciate it.

James: Yeah, thanks for having me on. I’ve enjoyed it.

Corey: James Greenfield, VP of Commerce Platform at AWS. I’m Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you’ve hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment—possibly on YouTube as well—about how you aren’t actually giving this five-stars at all; you have taken three trillionths of a star off of the rating.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
