Years ago, I founded a company that built iPhone apps. And those apps needed web services.
Back in 2009, that meant you ordered rack-mounted servers at Dell, carried them into your local data center for colocation, and managed them by hand or through automation tools like Ansible. I vividly remember spending nights restoring crashed servers and scouring web shops for the right kind of replacement power unit.
Then I was introduced to AWS.
I was blown away by its pervasive infrastructure-as-code mindset. Suddenly, spinning up a server was an API call. Restoring a backup was an API call. Deploying a new subnet was … well, you get the point. My software developer brain was overjoyed.
I’m still a big fan of infrastructure-as-code and Amazon’s API mandate. However, my relationship with AWS APIs has soured a bit. Years of intensive interaction have brought to the surface some details I wasn’t aware of when I started out. And one of those details has become supremely annoying: inconsistent APIs.
How inconsistency hurts developers — and the bottom line
Why bother ranting about inconsistent APIs? Why not accept them the way they are, read the docs, write the tests, and implement the APIs as they come?
It comes down to the developer experience. As a developer, I want to focus on the business problem I’m working to solve. That means everything else needs to operate without taking up conscious thought. I want my OS and IDE to be predictable, so they don’t get in my way. I want my keyboard shortcuts to always work, so I can blindly and efficiently write my code. And I want my APIs to be consistent, so I can build muscle memory to seamlessly integrate them.
If I need to look up the docs for every single API, or even just feel like I might have to look them up, it detracts from my actual work: solving business problems.
Inconsistent APIs cost more time to implement, and they increase the mental load on developers. But they might also reduce the quality of the applications using them. Worst case scenario, this can lead to a bug in production, with serious time and effort required to fix it. And that cost adds up.
Calling every single API reveals inconsistency in every layer
For a recent project, I collected a daily snapshot of every resource in every AWS account in my AWS Organization. Existing services like Config, Cost and Usage Reports, and Resource Groups don’t have full coverage (this frustrating fact is worthy of a blog post of its own). To collect my resources, I built a state machine that looped over all APIs and called every `List` and `Describe` API. I thought I’d get a consistently structured body of responses. Boy, was I wrong.
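The collection loop can be sketched roughly as follows. This is a hypothetical illustration, not my actual state machine: the hard-coded operation lists below are made-up samples (in practice you could pull the real operation names from botocore's service model).

```python
# Hypothetical sketch: pick out each service's read-only List*/Describe*
# operations before calling them. The operation lists are made-up samples.
SAMPLE_OPERATIONS = {
    "dynamodb": ["ListTables", "DescribeTable", "CreateTable", "DeleteTable"],
    "rds": ["DescribeDBInstances", "DescribeDBClusters", "ModifyDBInstance"],
}

def read_only_operations(operations):
    """Keep only the List*/Describe* calls, which should have no side effects."""
    return [op for op in operations if op.startswith(("List", "Describe"))]

inventory_calls = {
    service: read_only_operations(ops)
    for service, ops in SAMPLE_OPERATIONS.items()
}
```

The naive assumption baked into this loop is that every `List` and `Describe` response looks roughly the same. As the rest of this post shows, it doesn't.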
What I found was that there’s no consistency between different AWS services, between related AWS services, and even within a single service’s API. Let me give you an example of each.
Inconsistency between AWS services
There are many ways the various service APIs differ. It starts with the meaning of `List` and `Describe` calls. Logically, `ListSomething` would return a list of all resources of that type, and `Describe` would describe a single resource in detail. But there’s no such consistency.
IAM, for example, only has `List` operations: `ListRoles`, `ListUsers`, `ListPolicies`, and so on. These contain lists of resources, including their details.
Then there’s DynamoDB. It also has `List` operations, like `ListTables`, but this call only returns a list of plain table names. Nothing more. If you want to retrieve the details of those tables, you need to call `DescribeTable` for each of the results.
Next up: RDS. This service has no `List` operations at all! Instead, there are only `Describe` calls, such as `DescribeDBInstances`, `DescribeDBClusters`, and `DescribeDBSnapshots`. These calls return a list of resources, including all of their details. So functionally, they behave like the `List` operations in IAM.
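To make the three patterns concrete, here is a small sketch using trimmed sample payloads instead of live AWS calls. The field names (`Roles`, `TableNames`, `DBInstances`) match the real responses; the values are made up.

```python
# Trimmed sample responses illustrating the three shapes (values are made up).
iam_list_roles = {"Roles": [{"RoleName": "admin", "Arn": "arn:aws:iam::123456789012:role/admin"}]}
dynamodb_list_tables = {"TableNames": ["orders", "users"]}  # bare names only
rds_describe_instances = {"DBInstances": [{"DBInstanceIdentifier": "db1", "Engine": "postgres"}]}

def resource_names(response):
    """Extract resource names despite the three different layouts."""
    if "TableNames" in response:          # DynamoDB style: list of strings
        return response["TableNames"]
    for key in ("Roles", "DBInstances"):  # IAM / RDS style: list of objects
        if key in response:
            return [r.get("RoleName") or r.get("DBInstanceIdentifier")
                    for r in response[key]]
    raise KeyError("unrecognized response shape")
```

Three services, three layouts, and a dispatch function just to answer the question "what resources exist?"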
Inconsistency between related AWS services
Kinesis is a great example of inconsistent APIs within the same family of services. By looking at the APIs, you can tell the same people designed them or were at least inspired by one another. This is a trap that can lead you to falsely believe the APIs are consistent.
For example, Kinesis Data Streams (KDS) has `ListStreams`, Kinesis Data Firehose has `ListDeliveryStreams`, and Kinesis Data Analytics has `ListApplications`. These calls all follow the DynamoDB API design, where the resources’ names are returned in the `List` operation and you need to retrieve their details with a `Describe` call. Fine.
These `Describe` calls all return an ARN. Their respective fields are `StreamARN`, `DeliveryStreamARN`, and `ApplicationARN`. OK.
They also all return a name. Their fields are `StreamName`, `DeliveryStreamName`, and `ApplicationName`. Still good.
Then we get to the creation date. For KDS, this field is `StreamCreationTimestamp`, but for both Firehose and Analytics, it’s `CreateTimestamp`. They all return a Unix timestamp. Why does KDS have a different key?
When we include Kinesis Video Streams (KVS), it gets even worse. This service also has the `StreamARN` and `StreamName` fields, but you can get them through `ListStreams` and `DescribeStream`. And for KVS, the creation timestamp is stored under yet another name: `CreationTime`.
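In practice this means any code that treats the Kinesis family uniformly needs a per-service key map. A minimal sketch, using the key names listed above on made-up sample payloads:

```python
# Key names are those listed in the text; sample payloads are made up.
CREATION_KEYS = {
    "kinesis": "StreamCreationTimestamp",   # Kinesis Data Streams
    "firehose": "CreateTimestamp",          # Kinesis Data Firehose
    "kinesisanalytics": "CreateTimestamp",  # Kinesis Data Analytics
    "kinesisvideo": "CreationTime",         # Kinesis Video Streams
}

def creation_time(service, description):
    """Return the creation timestamp regardless of which key the service uses."""
    return description[CREATION_KEYS[service]]
```

Four services in the same family, three different names for the same Unix timestamp.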
Inconsistency within an AWS service
Inconsistency within a service is the worst type of inconsistency because, ostensibly, a single service is maintained by a single team.
It’s a real problem for developers. You can make a mental note that when talking to a different API, you might need to use a different syntax. But when various calls within the same service are inconsistent, you can never really rely on muscle memory, ever.
One of the more painful examples of this can be found working with REST APIs in API Gateway. This is a complex service, so it can be forgiven for having a complex service API. It has dozens of resources, including API Keys, Methods, Resources, Usage Plans, and Documentation Parts. Each of these resource types can be retrieved in singular form with a request like `GetAuthorizer` and in plural form with a request like `GetAuthorizers`. (Side note: That’s a whole new variation on the `Describe` and `List` standard.)
All of the list calls, including `GetApiKeys`, `GetAuthorizers`, `GetBasePathMappings`, and `GetRestApis`, return a JSON dictionary that has an `items` field containing the list of resources. It looks like this:

```json
{
  "items": [ <list of resources> ]
}
```
All of them do. Dozens of calls. They all return `items`. Except for `GetStages`, which returns the `item` key. What?

```json
{
  "item": [ <list of resources> ]
}
```
This `items` versus `item` example can lead to bigger issues. If you’ve seen 20 examples of an `items` response, you’d be forgiven for assuming the 21st also returns `items`. An automated check or test deployment might surface this bug, but it might also miss it. Fixing it after it’s live can cost significant time as well as your reputation.
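The defensive workaround is to accept both spellings. A minimal sketch, assuming only the two key names described above:

```python
# Sample responses are made up; the key names ("items"/"item") are the
# two variants described in the text.
def extract_items(response):
    """Return the resource list whether the call spelled the key
    "items" (most API Gateway calls) or "item" (GetStages)."""
    for key in ("items", "item"):
        if key in response:
            return response[key]
    return []
```

It's a two-line fix, but you only write it after the inconsistency has bitten you once.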
An AWS API is a promise — even if it’s a bad one
When AWS releases an API into general availability, it makes a promise. It says: You can use this API like this, and you will continue to be able to use this API exactly like this until the end of time.
And that’s a good thing! It builds trust with developers. It allows systems using those APIs to work into perpetuity as well. No maintenance, no forced migrations, no deprecations.
However, this also means that when AWS releases a bad or inconsistent API, it has no way of changing it. The API is here to stay, for better or worse.
Outside of AWS, we use API versioning to deal with bad or inconsistent APIs while keeping the API promise.
API versioning has been around for decades, and it’s even supported out of the box by Amazon API Gateway. You can use path-based versioning (`https://api.mydomain.com/v1/something`), query string versioning (`https://api.mydomain.com/something?v=1`), or HTTP headers. AWS APIs use none of these, yet we’ve seen API versioning in a few services.
For example, there’s KinesisAnalytics and KinesisAnalyticsV2. The documentation for the first states:
> This documentation is for version 1 of the Amazon Kinesis Data Analytics API, which only supports SQL applications. Version 2 of the API supports SQL and Java applications.
Apparently, the new Java features couldn’t be integrated into the API without breaking backwards compatibility. We can see why in the `StartApplication` call, which takes the following input in V1:

```python
response = client.start_application(
    ApplicationName='string',
    InputConfigurations=[
        {
            'Id': 'string',
            'InputStartingPositionConfiguration': {
                'InputStartingPosition': 'NOW'|'TRIM_HORIZON'|'LAST_STOPPED_POINT'
            }
        },
    ]
)
```
In V2, the call looks like this:

```python
response = client.start_application(
    ApplicationName='string',
    RunConfiguration={
        'FlinkRunConfiguration': {
            'AllowNonRestoredState': True|False
        },
        'SqlRunConfigurations': [
            {
                'InputId': 'string',
                'InputStartingPositionConfiguration': {
                    'InputStartingPosition': 'NOW'|'TRIM_HORIZON'|'LAST_STOPPED_POINT'
                }
            },
        ],
        'ApplicationRestoreConfiguration': {
            'ApplicationRestoreType': 'SKIP_RESTORE_FROM_SNAPSHOT'|'RESTORE_FROM_LATEST_SNAPSHOT'|'RESTORE_FROM_CUSTOM_SNAPSHOT',
            'SnapshotName': 'string'
        }
    }
)
```
Fair enough. Backwards compatibility was not possible, so AWS released a new version. Applications using the old API could continue to function as they did, and the promise was kept.
This raises the question: Are there more V2 APIs? Turns out, there are:
- ApiGatewayV2
- CloudHSMV2
- GreengrassV2
- WAFV2
- ElasticLoadBalancingv2 (note the inconsistent lowercase “v”)
But these aren’t necessarily new APIs on existing services. Rather, they’re actual new services with their own API.
- ApiGatewayV2 is the API for HTTP API Gateways
- CloudHSMV2 is a different service from CloudHSM
- ElasticLoadBalancingv2 covers Application Load Balancers and Network Load Balancers, while ElasticLoadBalancing is the API for Classic Load Balancers
It’s quite amazing on how many levels the AWS APIs can be inconsistent.
The top 4 worst service APIs I’ve worked with
Inconsistency is always a pain, but some service APIs are worse offenders than others. I’ve rounded up examples of the most inconsistent APIs I’ve encountered, starting with our favorite content delivery network.
Amazon CloudFront
The worst service API I’ve encountered so far is CloudFront. This API is inconsistent compared to other services, is inconsistent within its own domain, and returns responses in such a broken way that it should be considered a bug.
First, the method naming and response format. CloudFront has `Get`, `List`, and `Describe` calls. The `List` operations all return a dictionary with an `Items` key, which is a list:

```json
{
  "CachePolicyList": {
    "NextMarker": "string",
    "MaxItems": 123,
    "Quantity": 123,
    "Items": [
      {
        "Type": "managed"|"custom",
        "CachePolicy": { ... }
      }
    ]
  }
}
```
This is a divergence from every other API, all of which have the resources on the top level. For example, here’s S3 `ListBuckets`:

```json
{
  "Buckets": [
    {
      "Name": "string",
      "CreationDate": "string"
    },
  ],
  "Owner": {
    "DisplayName": "string",
    "ID": "string"
  }
}
```
Then there are the APIs within CloudFront itself. The service API offers different ways to get a list of distributions:
- `ListDistributions`
- `ListDistributionsByCachePolicyId`
- `ListDistributionsByKeyGroup`
- `ListDistributionsByOriginRequestPolicyId`
- `ListDistributionsByRealtimeLogConfig`
- `ListDistributionsByWebACLId`
From these APIs, `ListDistributions`, `ListDistributionsByRealtimeLogConfig`, and `ListDistributionsByWebACLId` return the object => list => object structure, as seen above. But `ListDistributionsByCachePolicyId`, `ListDistributionsByKeyGroup`, and `ListDistributionsByOriginRequestPolicyId` only return distribution identifiers in string format (object => list => string), like below:

```json
{
  "DistributionIdList": {
    "Marker": "string",
    "NextMarker": "string",
    "MaxItems": 123,
    "IsTruncated": True|False,
    "Quantity": 123,
    "Items": [
      "string",
    ]
  }
}
```
As if this isn’t enough, all of these calls return empty output when no resources are found. Not an empty list, not a dictionary stating `"Quantity": 0`, no, nothing, nada. Zero bytes. `Content-Length: 0`.
This is bad. This will break any parser because an empty string is not valid JSON. It also means that all of these calls have two different types of output: a dictionary or an empty string. As a developer, it means you have to inspect the result length before passing it on to the parser. You can’t even write a simple try-catch block because then you wouldn’t be able to differentiate between a failing API call and zero results. This is not only bad design, but once again, it’s inconsistent with all the other services that at least offer reliable output.
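The inspect-before-parsing dance looks something like this. A minimal sketch of the workaround, assuming the response body arrives as a string:

```python
import json

def parse_cloudfront_body(body):
    """Treat a zero-byte CloudFront response body as "no resources"
    instead of feeding it to the JSON parser, where it would raise
    a decode error."""
    if not body:          # Content-Length: 0 -> no resources found
        return {}
    return json.loads(body)  # anything else must be valid JSON
```

Note that the length check can't be replaced by a try-except around `json.loads`: that would also swallow genuinely malformed responses, which is exactly the failure-versus-zero-results ambiguity described above.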
The kicker is that because this bad API is a promise, it’s guaranteed to stay bad forever.
Amazon WorkSpaces
The response keys for all `Describe` calls in the WorkSpaces API are in CamelCase, like this:

```json
{
  "Images": [
    {
      "Created": number,
      "Description": "string",
      "ErrorCode": "string",
      "ErrorMessage": "string",
      "ImageId": "string",
      "Name": "string",
      "OperatingSystem": {
        "Type": "string"
      },
      "OwnerAccountId": "string",
      "RequiredTenancy": "string",
      "State": "string"
    }
  ],
  "NextToken": "string"
}
```
Except `DescribeIpGroups`, which uses lowerCamelCase. Additionally, it uses `Result` as the result key, while all other calls have a field like `Images`, `Workspaces`, or `Bundles`.

```json
{
  "NextToken": "string",
  "Result": [
    {
      "groupDesc": "string",
      "groupId": "string",
      "groupName": "string",
      "userRules": [
        {
          "ipRule": "string",
          "ruleDesc": "string"
        }
      ]
    }
  ]
}
```
The `DescribeWorkspaceBundles` call has a `CreationTime` field, for `DescribeWorkspaceImages` it’s `Created`, and the `DescribeWorkspaces` call returns no launch or creation time at all.
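Papering over the `DescribeIpGroups` oddity means rewriting its keys into the CamelCase convention the other calls use. A hedged sketch of one way to do that, tested only against the made-up sample below:

```python
def upper_camel(key):
    """groupId -> GroupId; keys that are already CamelCase pass through."""
    return key[0].upper() + key[1:] if key else key

def normalize(obj):
    """Recursively rewrite dict keys so lowerCamelCase output matches
    the CamelCase convention of the other WorkSpaces calls."""
    if isinstance(obj, dict):
        return {upper_camel(k): normalize(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize(v) for v in obj]
    return obj
```

It's glue code that exists for no reason other than one call disagreeing with its siblings.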
Amazon Redshift
The following Redshift calls have these strings as the top level key in the response dictionary:
- `DescribeClusterParameterGroups`: `ParameterGroups`
- `DescribeClusters`: `Clusters`
- `DescribeClusterSnapshots`: `Snapshots`
So `DescribeClusterSecurityGroups` should return `SecurityGroups`, right? No. It’s `ClusterSecurityGroups`.
Amazon Cognito
Cognito has always been “different,” and its API is no exception. Where every other API provides sensible defaults for optional parameters like `MaxResults`, Cognito forces you to supply `MaxResults` for some calls: `ListUserPools` and `ListResourceServers`. Cognito wouldn’t be Cognito if those values were consistent; the range for `MaxResults` is 1-60 for `ListUserPools` and 1-50 for `ListResourceServers`.
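So instead of one shared page size, you end up clamping per operation. A small sketch using the two ranges quoted above:

```python
# The (min, max) ranges are the MaxResults limits quoted in the text.
MAX_RESULTS_RANGE = {
    "ListUserPools": (1, 60),
    "ListResourceServers": (1, 50),
}

def page_size(operation, preferred=100):
    """Clamp a preferred page size into the range Cognito accepts
    for this specific operation."""
    low, high = MAX_RESULTS_RANGE[operation]
    return max(low, min(preferred, high))
```

A required parameter with per-call limits: two inconsistencies stacked on top of each other.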
The best AWS service API
But it’s not all doom and gloom. There are a few APIs that are consistent and have very thoughtful designs. I specifically want to call out Amazon FSx, which has the following JSON response keys:
- `DescribeBackups`: `Backups`
- `DescribeDataRepositoryTasks`: `DataRepositoryTasks`
- `DescribeFileSystems`: `FileSystems`
Very consistent and predictable!
The details of these calls all contain a `CreationTime` and `ResourceARN` field, as well as an identifier, which is found respectively under `BackupId`, `TaskId`, and `FileSystemId`. Also consistent and predictable. Thank you, FSx team!
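This kind of consistency is what makes generic code possible: for FSx, the response key is simply the operation name minus its `Describe` prefix. A sketch on made-up sample data:

```python
def response_key(operation):
    """For FSx, the response key is the operation name without "Describe"."""
    return operation[len("Describe"):]  # DescribeBackups -> Backups

# Made-up sample data standing in for live FSx responses.
sample = {
    "DescribeBackups": {"Backups": [{"BackupId": "b-1", "CreationTime": 1}]},
    "DescribeFileSystems": {"FileSystems": [{"FileSystemId": "fs-1", "CreationTime": 2}]},
}
resources = {op: body[response_key(op)] for op, body in sample.items()}
```

No per-operation special cases, no key maps. If every service worked like this, most of the glue code in this post wouldn't need to exist.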
API inconsistencies are fixable, if AWS chooses to fix them. It probably won’t.
Building services and innovating at Amazon’s scale is hard. Really hard. AWS has impressively solved this problem with its two-pizza team approach, which allows service teams to autonomously design and implement their solutions. But while these isolated development teams can go fast, it’s obviously a trade-off with consistency. We see that in the AWS console, in CloudFormation definitions and coverage, and in the APIs.
I don’t believe this problem is unsolvable.
AWS could have a central API review board, it could have consistent guidelines, it could have API versioning, it could design its APIs in the open, it could ask for partner feedback under NDA, it could first deliver APIs internally and have another team build a consistent interface on top of that … the list goes on. Solutions galore, but I’m not holding my breath.
Despite the problems that inconsistent APIs create for AWS customers, the burden will continue to be on us developers for the foreseeable future.