Why do cloud hosting costs fly off the rails? by Chris Shaffer

Let’s do a thought experiment, envisioning two alternate realities.

  1. I deploy my app, and opt for a simple infrastructure. Let’s say I’m just rolling a VM and an RDS.

  2. I opt for a complex but optimized infrastructure, tailored to the app I’ve launched. Maybe some of that complexity is hidden from me, but let’s say I have some combination of serverless API and serverless data lake with message buses.

In which one does my code perform better and with lower costs?


It’s complicated

In the short term, the optimized infrastructure will almost certainly be faster. Of course it will be: you’ve tailored the hosting to your use case!

But what happens over the “long” term (which could be quite short indeed for a smaller company or newer project)? What happens when, in Sprint 4, one of your assumptions turns out to be wrong? Or a business change doesn’t fit neatly into your architecture? The more optimized your architecture is to your use case, the more sensitive it is to a small change in that use case.

Here are some examples that even the most waterfall-y discovery and design process will struggle to anticipate. What happens when…

  • A manager asks for a categorized report of your job queue?

  • It turns out the Dec 31 file from that vendor takes longer to download than Lambda’s max run time allows?

  • That seamless federated query needs to apply row-level permissions to keep a new client’s data confidential from your data science team?

  • You need a third-party library that’s a few megabytes, enough to mess with your cold starts?

The answer is that you’re refactoring or re-architecting a lot.

In contrast, with a more basic architecture, none of those are a big deal at all:

  • An intern wrote that query in 20 minutes and just had to ask someone with production access to review and run it

  • The file took 20 minutes to download at midnight on New Year’s Day, and no one noticed because there were no downstream implications

  • You just added an RLS policy to the table and called it a day

  • Your next deployment takes 15 seconds longer but then the library is there


Auto-tune the queues

Broadly speaking, I have two options when it comes to performance tuning:

  • Tune the code

  • Tune the infrastructure

While these options are both available in either scenario, I can bias myself toward one or the other.

If I’ve opted for the simple infrastructure, I’m biased toward tuning my code - I could re-architect my infrastructure around the code that I have, but then I’d be transitioning into that second scenario. If I’ve opted for the specialized infrastructure, I’m biased toward tuning the infrastructure around the code, not only because that’s the path I started down but also because that’s more likely to yield short-term bang-for-buck.

I could always hypothetically do both, but most organizations without a very strong engineering culture and oodles of time to throw at performance optimization will, most of the time, just do what they’re biased towards and stop when the fire is put out.



Aside about infrastructure-as-code

IaC is a great innovation, and it can make your life easier. But the fact that your infrastructure is (almost) as easy to change as your code doesn’t make it not infrastructure.

Code still runs on top of infrastructure, and IaC changes still have downstream effects on your actual code in the same way Infrastructure-as-VMs-or-hardware does, and in a way that doesn’t run in reverse. (In the same way that writing your API in JavaScript, too, doesn’t mean the server no longer affects the client in ways the client can’t affect the server.)


Slippery slopes

There are clear paths to getting our costs and performance into a bad place no matter which broad option we choose.

With an overly simple architecture, we can easily end up stuck with no better options than to vertically scale our hardware. This can get very expensive very fast, especially if we’re unable to dedicate time to improving the performance of our code because of other priorities, but also if we’ve exhausted those options and the business is growing. The biggest common mistake here is to preclude ourselves from horizontal scaling (say, by stashing non-serializable state in a static variable that needs to survive across entire sessions).
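To make that mistake concrete, here’s a minimal TypeScript sketch (the names and data shapes are invented) of per-process state that quietly rules out a second instance, next to the boring shared-store version that doesn’t:

```typescript
// Anti-pattern: per-process state that makes a second instance unsafe.
// Once a load balancer sends the user's next request to another instance,
// this Map is empty there and the "session" silently vanishes.
const sessionCache = new Map<string, { userId: string; cart: string[] }>();

export function rememberCart(sessionId: string, userId: string, cart: string[]) {
  sessionCache.set(sessionId, { userId, cart });
}

// Horizontally-scalable version: the same data, behind a shared store
// (Redis, the database, anything every instance can reach). The interface
// here is hypothetical; the point is that nothing lives only in this
// process's memory.
export interface SharedSessionStore {
  get(sessionId: string): Promise<{ userId: string; cart: string[] } | null>;
  set(sessionId: string, value: { userId: string; cart: string[] }): Promise<void>;
}

export async function rememberCartScalably(
  store: SharedSessionStore,
  sessionId: string,
  userId: string,
  cart: string[]
) {
  await store.set(sessionId, { userId, cart });
}
```

Running two instances in a test environment, as the recipe below suggests, is what catches the first version before a load balancer does.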

With an overly complex one, we can quickly end up in a situation where responsibility for the performance of the software and responsibility for developing it are separated. You have an SRE / DevOps team that doesn’t understand the code, and an engineering team that doesn’t understand the infrastructure. One person uses an array when they meant to use a dictionary and the next thing you know, you’re migrating to a new queuing system or just auto-scaling unnecessarily. You optimize infrastructure around code until there’s too much built on top of it, and then you have infrastructure tightly tailored to … code that’s long since changed. Alternately, you have a proliferation of stacks that are individually optimized, but the mess of coordinating and communicating between them is shunted over to yet another team.
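To put a face on “an array when they meant a dictionary”, here’s a hedged TypeScript sketch; the records are invented, but this is the kind of code that tends to get answered with a bigger instance type instead of a Map:

```typescript
type Order = { id: string; customerId: string };
type Customer = { id: string; name: string };

// O(n * m): for every order, scan the whole customer array.
// On a few thousand rows nobody notices; on a few million, someone
// proposes a new queuing system or more aggressive auto-scaling.
function attachNamesSlowly(orders: Order[], customers: Customer[]) {
  return orders.map((order) => ({
    ...order,
    customerName: customers.find((c) => c.id === order.customerId)?.name,
  }));
}

// O(n + m): build the dictionary once, then every lookup is constant time.
function attachNames(orders: Order[], customers: Customer[]) {
  const byId = new Map(customers.map((c) => [c.id, c] as const));
  return orders.map((order) => ({
    ...order,
    customerName: byId.get(order.customerId)?.name,
  }));
}
```

Neither version needs new infrastructure; the second just stops generating the load that made new infrastructure look necessary.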


The right answer is usually in the middle

The closest thing I can give as a recipe for engineering success is:

  1. Start with the simplest infrastructure possible

  2. Developers need to be able to run and debug the software locally, in the same manner a user would [*]

  3. Make sure you have CI/CD and a test environment that accurately represents production

  4. Make sure you can scale horizontally early on; use 2 instances in your test environment whether or not you need it

  5. The expectation is that everything goes onto the simple “core” infrastructure until otherwise necessary

  6. Introduce specialized infrastructure to target specific use cases when all of the following hold:

    • There’s been a senior-level review to make sure the code can’t be optimized on the existing architecture

    • There’s a big difference in performance

    • The use case, or at least its core bits, is reasonably finalized (how might this change or evolve? would that invalidate our infrastructure choice?)

    • Clearly circumscribe what may use this infrastructure in the future. If something doesn’t fit, see #5 before #6 (with a different specialized tool).

  7. Set up monitoring and benchmarks in your test environment and make sure you can explain any sudden jumps in scaling costs or execution times - and encourage your QA team to complain when “it seems slow”

  8. Never get out of the habit of looking at the code!

  9. Remember that there’s a multi-trillion dollar industry, not to mention resume and blog cachet, behind titles like “How We Solved Our Problems by Buying an Advanced Proprietary Technology and Learning a New Tech Stack,” and that Big “Do You Really Need Those Nested Loops, Pal?” simply can’t match its firepower.

[*] Automated unit and integration tests, even with 100% code coverage, are not a substitute for this. If you don’t have the time to figure out how to get a local Athena and SQS running with docker-compose in development, you don’t have the resources to run them in production. (A cloud resource shared by the whole development team might also suffice, but if you’re “mocking” integral parts of your infrastructure to develop against, you’re going to have a bad time.)
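As one possible version of that setup (the specifics here are assumptions, not a prescription): SQS can be emulated locally, for example with LocalStack brought up by your docker-compose file, and the application only needs an endpoint override in development; everything else stays identical to production. A sketch using the AWS SDK v3 client, assuming LocalStack’s default port:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// In development, point the SDK at the emulator docker-compose brings up
// (LocalStack's default edge port 4566 is assumed here, with its dummy
// credentials); in production, the default endpoint and credential chain
// are used. Nothing else in the code has to know the difference.
const sqs = new SQSClient(
  process.env.NODE_ENV === "production"
    ? { region: "us-east-1" }
    : {
        region: "us-east-1",
        endpoint: "http://localhost:4566",
        credentials: { accessKeyId: "test", secretAccessKey: "test" },
      }
);

export async function enqueueJob(queueUrl: string, payload: unknown) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: queueUrl,
      MessageBody: JSON.stringify(payload),
    })
  );
}
```

The point isn’t the specific emulator; it’s that the queue your code talks to in development is a real queue, not a mock.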


Look at the code!

Organizations of all sizes need people who really know what they’re doing looking at the code with both a tactical and a strategic eye.

That means both experts in the chosen technology to ensure you’re using it correctly, and people who are experts in not the chosen technology to sniff out when something must be wrong because they’ve seen a “less advanced” tech stack handle a similar task more gracefully.

Otherwise, it’s just a matter of time before an engineer writes an O(n!) algorithm to solve an O(n) problem, and no amount of AWS/GCP/Azure gimmicks will outweigh it.

I’ve come into organizations - with a product-oriented technology leader missing a strong engineering counterpart, engineering leaders who didn’t have time to get into the details amid their other responsibilities, or a lack of outside experience and expertise - and accomplished a lot with this One Simple Life Hack.


Names and details changed to protect the guilty

  • A company had sketched out and gotten C-level approval for a 6-month project to move some cron jobs into a cloud-based infinitely-scalable solution. They had a monthly scheduled job that took 20 days to run. Using some basic caching techniques, I was able to guide them to a solution in 6 days that trimmed it to 20 minutes, and they punted the re-architecture project.

  • A company had two teams responsible for two databases. They spun up a third team and a cross-functional task force to synchronize data between the two, and come up with clever solutions to handle conflicts and race conditions. A 3-week investigation yielded a 12-week plan to, table-by-table, eliminate dual-sources-of-truth, delete the synchronization code and costly infrastructure, and free those 5 FTEs up to work on other things.

  • An engineer was ready to send a project back to the drawing board, and force three downstream projects back to the drawing board, because they’d all chosen a database that “couldn’t load the data”. We were able to solve this in a day or two by splitting the ingestion job into batches.

  • I was personally considering migrating some long-running reports code to a background serverless system when I realized it didn’t have to be long-running. Memoizing one (unexpectedly) ugly computation that was called in a loop obviated the need … for now (the pattern is sketched below). When that code is eventually moved into such an architecture, it will start from a radically better performance position than it otherwise would have.
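For what that memoization looked like in spirit (the computation and its inputs here are invented stand-ins), a minimal TypeScript sketch: cache the ugly call by its inputs and leave the loop alone.

```typescript
// The expensive call: imagine something unexpectedly ugly (a re-parse,
// a network hop, a quadratic aggregation) invoked once per report row.
function uglyComputation(accountId: string, month: string): number {
  // ...placeholder for the real work...
  return accountId.length * month.length;
}

// Memoized wrapper: each distinct (accountId, month) pair is computed
// once per report run instead of once per row.
function memoize<A extends unknown[], R>(
  fn: (...args: A) => R,
  key: (...args: A) => string
): (...args: A) => R {
  const cache = new Map<string, R>();
  return (...args: A) => {
    const k = key(...args);
    if (!cache.has(k)) cache.set(k, fn(...args));
    return cache.get(k)!;
  };
}

const cachedComputation = memoize(uglyComputation, (id, month) => `${id}|${month}`);

// The loop stays exactly as it was; only the call site changes.
function buildReport(rows: { accountId: string; month: string }[]) {
  return rows.map((r) => cachedComputation(r.accountId, r.month));
}
```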

You never know until you look!


Epistemological postscript

You could also say the answer to my original question is “because we have no benchmarks”.

No two software companies are building the same software. You might have a handful of competitors that are serving the exact same market in the exact same way … but you’d still need to know how many customers they had, how often they used the software, what volume of data they processed, and how much they were spending (trade secret, trade secret, trade secret, and trade secret) to even begin to make a fair comparison.

And even then! Perhaps (assuming the aforementioned are all equal) Company B’s hosting costs twice as much as Company A’s because they made worse engineering choices. But perhaps Company B made excellent engineering choices and their hosting costs twice as much because they process data in real time, whereas Company A has no realistic path beyond daily updates, and that’s the biggest weapon B has to pry clients away from A.

You never know until you look!