My Route53 database is humming along nicely, my podcast interview backlog is full, and I’ve outsourced my thinking to ChatGPT, so I have some unprecedented free time to build a side project. Awesome! What cloud provider should I use?
The obvious and correct answer is “the one you’re already familiar with,” which for me and many others is AWS (due in no small part to their 5-year head start). But if AWS isn’t an option for whatever reason, and we turn the decision into an open field, a whole mess of questions arise.
Reliability
To start, I’m going to care if the potential provider’s offering is down for large swaths of time. Predicting that is deceptively challenging. Even with AWS, if you search for common outage terms, you’ll see complaining on various forums, dire warnings to avoid us-east-1, and news articles that make it sound like the entire environment is held together with spit and baling wire. I assure you, there’s no data center you’re going to build with anything approaching cloud provider availability—and that’s unlikely to be a design goal for most use cases. But if you’re picking a new provider for which you have no track record, you can’t reasonably make that assertion.
Provisioning Time
One of the best things about AWS is that it made rapid provisioning the expectation rather than the exception. I was recently reminded of this when paying for a cloud-hosted gaming PC through another company (because it’s Starfield quarter for me): spinning up the computer took roughly an hour. “Oh, right,” I said. “This used to be the norm.” For a long time, Akamai had what I’ve dismissively referred to as a “Jason API,” since changes apparently were turned into a ServiceNow ticket for some guy named Jason to take action on. That’s kind of a problem when it comes to rapid iteration, autoscaling, and having to jump through hoops to reconfigure things when you realize you’ve gone down a wrong path and need to change some stuff. If setting up a new database instance takes three hours, I’m not going to be making a whole lot of changes, so the database I start with is the one I’m keeping. If the RDS team is suddenly feeling uncomfortable about how long it takes to restore a database from a snapshot, good: it’s been the single longest downtime-causing part of any VPC migration project I’ve ever done.
Pricing and Billing
The problem with cloud pricing isn’t necessarily the dimensions they charge for, but rather that there are so many of them—and how they interact isn’t easy to predict. Over time, you gain a familiarity with them where you can expect a small EC2 instance to cost you $7 or so per month, and you know that 10GB of storage will cost you about a quarter in S3. You can sort of assume that the other providers are pricing reasonably similarly, but there’s a big difference between “safe assumption” and “bet the financial viability of your new business.”
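To make that back-of-the-napkin math concrete, here is a minimal sketch of the estimate I'm doing in my head. The rates are assumptions for illustration (roughly what a small on-demand instance and standard object storage have cost in a US region), not quotes from any provider's price list, and the `monthly_cost` helper is a hypothetical name of my own. Real bills have many more dimensions than these two, which is the whole problem.

```python
# Rough monthly estimate from two billing dimensions.
# All rates below are illustrative assumptions, not published prices.

HOURS_PER_MONTH = 730  # common billing approximation: 8,760 hours / 12


def monthly_cost(hourly_rate: float, storage_gb: float, gb_month_rate: float) -> dict:
    """Return compute, storage, and total cost in dollars for one month."""
    compute = hourly_rate * HOURS_PER_MONTH
    storage = storage_gb * gb_month_rate
    return {
        "compute": round(compute, 2),
        "storage": round(storage, 2),
        "total": round(compute + storage, 2),
    }


# Assumed: a small instance at about $0.0104/hour and
# object storage at about $0.023 per GB-month.
estimate = monthly_cost(hourly_rate=0.0104, storage_gb=10, gb_month_rate=0.023)
print(estimate)
```

With those assumed rates, the instance lands in the neighborhood of $7 a month and the 10GB of storage at about a quarter. Swap in another provider's numbers and the answer is only as good as your guess about which dimensions they actually bill for.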
How It Fails
When an AWS service takes an outage, the way it manifests is pretty well known. You start seeing things not work, the AWS status page remains green, Down Detector shows a sharp spike in error rates, social media starts buzzing, seasons pass as summer becomes winter becomes summer again, the AWS status page shows “increased error rates,” and so on. It’s frustrating, but you know what’s going on. Services don’t assure you that they have your transactions safely recorded and then drop them on the floor, and AWS’s recovery processes don’t turn your production environment into a pristine parking lot. The key to safely running something is to know how it degrades when it breaks, and the only way to learn that is over long periods of time.
Trust and Safety
If your provider thinks something suspicious is going on, how are they going to handle it? Are they going to turn your entire environment off if one of your hosts gets compromised? If a payment doesn’t go through because a credit card expires, are they going to reach out so you can fix it or disable your site after the first transaction fails to clear? If user-generated content violates their rules, are they going to reach out to you about your bad actor customer, or are they going to assume that all of your users are de facto “you and your company” and treat you as the problem?
Their Own Providers
Not only do you have to run this gauntlet of evaluating a provider, but the provider has to evaluate their own vendors through the same lens. For example, Wasabi took a thirteen-hour outage due to choosing GoDaddy for domain services instead of a real company that understands how business works. A Wasabi customer uploaded ToS-violating content, and rather than following a communication process with the customer, GoDaddy decided to turn off their entire domain for thirteen hours. This could have been avoided via the simple rule of not doing enterprise-scale business with a company whose name contains the word “Daddy,” but that’s only obvious the second time.
Community
One of the greatest evaluation criteria to use is quite simply the strength of the provider’s community. That encompasses a lot, but can be distilled down to “if I’m attempting to do something, can I find blog posts and guidance from this provider’s customers on how to do this thing?” Eventually you’re going to encounter something where the answer is “no, I cannot.” If it’s attempting to use their DNS service as a globally distributed database, it’s probably a good sign that you Should Not Be Doing whatever the hell it is you’re attempting to do, because there’s almost certainly a better way. If it’s “put three web servers behind a load balancer,” you’re going to start wondering if the “cloud” you’re evaluating has any customers whatsoever. You don’t want to spend your time trailblazing things that have long since become undifferentiated heavy lifting via other providers.
More importantly, you want to make sure that there are lots of other customers using the provider you choose. Yes, I trust that my $450 a month in AWS spend is valued by the company, but there’s also something reassuring about having giant, multinational banks depending upon the same services that I’m using when it comes to lighting a fire under a provider’s urgency about service disruption. This also unlocks community knowledge, the sorts of things that “everybody knows” about a provider, or at least, they sure do the second time. Asking “I’m starting out with AWS, what should I be aware of?” means you’re about to get absolutely firehosed off of your chair by a deluge of tips. The one thing you won’t hear is “what’s AWS?”
Conclusion
Collectively, these points create a very high bar for a relatively new provider to surmount. Some providers have cleared this bar. Google Cloud has, after a shaky first few years. So has Azure if you don’t give a single damn about security. But other providers remain a significant question mark in many of these categories. Most of us don’t want to have to think about all of these categories all over again; we’d rather save our energy for our own unique business challenges than conduct vendor evaluations that are, frankly, exhausting. And so the big cloud providers get bigger, and the gap widens between the hyperscalers and everyone else.