Two bad strategies and three good ones for mitigating cloud vendor lock-in, by Chris Shaffer

  1. "Hedge your bets" by having a presence on multiple clouds (bad)

  2. Use a third-party service that sits between you and AWS/Azure/GCP (bad)

  3. Write code that runs on top of open-source technologies (good)

  4. Create abstraction layers rather than interfacing with specialized tech directly (good)

  5. Simplify, contain, and minimize "infrastructure" (good)

A common theme to keep in mind: the biggest way to get locked into anything in the software world is to depend on a proprietary language/framework/infrastructure in such a way that what you built on top of it isn't portable to a competitor. A "physical presence" is easy to move; rewriting large swathes of your IP is a massive undertaking.


1) "Hedge your bets"

This relies largely on the misconceived notion that code and data are analogous to real estate: "If I have toeholds in two or three places, I can easily move to any one of them." In reality, this doesn't pass the "talk to your coders" test - if I have an Azure Cognitive Search instance and I want to migrate to Elasticsearch on AWS, the fact that I already "have a presence" on AWS saves me only as much time as it takes to type in my credit card number; I still have to rewrite any code that does search.

There might be some value in having your engineers be familiar with both cloud providers, but that's still a fraction of the IP migration cost; having separate teams for each cloud provider will, if anything, *increase* the migration cost - it's much easier for an Azure gal to become an AWS gal than it is for an Azure guy and an AWS guy to learn each other's code.

Outcome: You're now chained just as strongly to *two* providers. Financially, you've doubled the chances of something half as bad happening (kinda cool), but (depending on design choices) may have doubled the chances of an outage (not cool).

That's not to say multi-cloud is never justified - different providers might be better for different things - but it's not going to contain lock-in and it's not going to de-risk you.

2) Outsource risk management to Netlify / DataBricks / etc.

You might have some success in the short term, but "de-risking" is not about the short term. The idea that "by not interfacing with a cloud provider directly, you can avoid lock-in and move if need be" is tempting, but consider:

  • The middleman has its own proprietary stuff that you're locked into.

  • Most of these middlemen are loss-making businesses; the cloud providers themselves are wildly profitable.

When you're using third-party tools, you've shifted your lock-in, not eliminated it. There's a strong case to be made that this lock-in is stronger, as it's easier to find expertise in one of the primary cloud vendors than in a secondary vendor.

There's also a strong case to be made that the risk is more acute - a company that's lost money every quarter for a decade has a fairly strong incentive to eventually raise prices; a company that's profitable has an incentive not to cook the golden goose. If I'm offering "SQL but cheaper", you have to ask yourself: what percentage of that is due to my brilliantly efficient engineering, and what percentage is due to me simply subsidizing the price in order to build market share? If I'm a private company, that might be exceedingly difficult to answer.

Again, that's not to say that you shouldn't use these services - but make your decision based on their merits, not the fiction that they'll decrease risks.

3) Open-source technologies

I’d urge you to go back to basics and always ask, “what would it actually take, from a programmer’s perspective, to switch to a different service provider?” If you can’t put that question to the staff engineers who would be responsible for the work and get a satisfying answer (or at least the components of one) - if the answer comes from non-technical managers in more than a break-it-down-and-coordinate-between-teams sense - then it’s likely glossing over key details and isn't credible.

Some examples:

  • Migrating from Oracle to Redshift is pretty tricky, and is going to be time-consuming to the point of “probably won’t happen”. This isn't just a vendor thing; the technologies are very different under the hood, too. Migrating from PostgreSQL on AWS to PostgreSQL on Azure can be as simple as a backup/restore and changing a connection string (see the sketch after this list).

  • Migrating an Excel add-in to a Google Sheets add-on would take a lot of coding, but any functionality that’s part of your server-side API written in TypeScript doesn’t come into play.

  • A Python script running on Ubuntu is a Python script on Ubuntu no matter who owns the hardware.

  • Converting a GitHub Actions file to a Bitbucket Pipeline is a lot easier and more reliable if the only step is “deploy Docker image”.
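
To make the PostgreSQL example concrete, here's a minimal sketch in TypeScript using the node-postgres (pg) package; the DATABASE_URL variable, hostnames, and table are hypothetical. None of the application code knows which cloud is hosting the database - only the connection string does.

```typescript
// Minimal sketch: application code against a managed PostgreSQL instance.
// Only the connection string differs between providers, e.g. (hypothetical hosts):
//   AWS:   postgres://user:pass@mydb.xxxx.us-east-1.rds.amazonaws.com:5432/app
//   Azure: postgres://user:pass@mydb.postgres.database.azure.com:5432/app
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function findUserByEmail(email: string) {
  // Plain SQL - portable across any PostgreSQL host.
  const result = await pool.query(
    "SELECT id, email FROM users WHERE email = $1",
    [email]
  );
  return result.rows[0] ?? null;
}
```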

As always, there are pros and cons to be weighed: writing a bash script is harder for most of us than using a point-and-click alternative, but I’d only bet on the former still working on the cloud of a vendor that won’t be founded until 2040. Sometimes the proprietary technology *is* just better; though it's worth footnoting that proprietary licensing is less deep of a lock-in than a single gatekeeper for a hardware/software stack (e.g., people run SQL Server outside of Azure, they don't run Glue outside of AWS).

4) Abstraction layers

Rather than write code against a vendor’s API or SDK directly, build a “wrapper” around it and have the rest of your code reference that wrapper. Only one file/module/package/library/service in your software should reference each external service, and everything else should be able to use that interface in a vendor-agnostic way.

If two services do “basically the same thing”, then encapsulate them inside the same wrapper! So, one LLM interface, one text search interface, one queuing interface, etc.
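
In TypeScript terms, the shape is roughly this - a sketch with hypothetical names, and with the actual vendor SDK calls elided:

```typescript
// Vendor-agnostic search wrapper: the rest of the codebase imports only this interface.
export interface SearchIndex {
  indexDocument(id: string, document: Record<string, unknown>): Promise<void>;
  search(query: string, limit?: number): Promise<Array<{ id: string; score: number }>>;
}

// One adapter per vendor lives behind the interface.
export class ElasticsearchIndex implements SearchIndex {
  async indexDocument(id: string, document: Record<string, unknown>): Promise<void> {
    // ... call the Elasticsearch client here ...
  }
  async search(query: string, limit = 10): Promise<Array<{ id: string; score: number }>> {
    // ... call the Elasticsearch client here ...
    return [];
  }
}

export class AzureCognitiveSearchIndex implements SearchIndex {
  async indexDocument(id: string, document: Record<string, unknown>): Promise<void> {
    // ... call the Azure SDK here ...
  }
  async search(query: string, limit = 10): Promise<Array<{ id: string; score: number }>> {
    // ... call the Azure SDK here ...
    return [];
  }
}

// Swapping vendors is then a one-line change at the composition root.
export const searchIndex: SearchIndex = new ElasticsearchIndex();
```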

This has a few advantages when it comes to combating lock-in:

  • Most engineers most of the time don’t need to think about (or even know) which vendor is responsible for carrying out their request - interfaces are implementation-agnostic

  • Only one component in your code needs to change if and when you decide to swap out vendors (or split responsibilities among multiple vendors)

  • Since all of the code for a particular vendor is in one place, it’s easier to estimate the cost of migrating

Perhaps more importantly, this is just good engineering, whether or not you care about vendor lock-in.

“But what if we decide to stop using SQL?” was pretty far down the list of priorities 20 years ago when we all learned not to write select * from in our view layer, and to reference repositories from our controllers. When the algorithm for fetching user sessions became “check MemoryCache then Redis then check SQL” the change took ten lines of code, rather than ten thousand.
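
A sketch of what that change looks like (hypothetical names; the three stores are themselves thin wrappers around an in-process cache, a Redis client, and the database):

```typescript
interface Session { id: string; userId: string; }

interface SessionStore {
  get(id: string): Promise<Session | null>;
}

class LayeredSessionRepository {
  constructor(
    private memory: SessionStore, // in-process cache
    private redis: SessionStore,  // wrapper around a Redis client
    private sql: SessionStore     // wrapper around the SQL database
  ) {}

  async getSession(id: string): Promise<Session | null> {
    // Check each layer in order; controllers never learn which one answered.
    return (await this.memory.get(id))
        ?? (await this.redis.get(id))
        ?? (await this.sql.get(id));
  }
}
```

The controllers keep calling getSession; only the repository changes.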

5) Simplify and minimize "infrastructure"

I have a more generic post about this (https://www.scoutcorpsllc.com/blog/2019/10/1/true-costs-of-architecture-complexity), but I want to focus specifically on the infrastructure and the vendor risk/lock-in aspects of the equation. This is the most nuanced of the choices I’ve laid out, has the most trade-offs, and needs to be approached the most thoughtfully.

Once upon a time, the “infrastructure” in the typical company’s data center was the most commoditized part of their tech stack. No longer. When we ran our own data centers, our software neither knew nor cared whether it was running on Dell or HP, Intel or Nvidia - swapping out a SanDisk hard drive for a Western Digital one need not concern your DBAs.

While that’s still mostly true in a narrow sense of hardware, cloud computing has created a new layer of middlemen between you and your hardware. Today, some bits of our infrastructure are vendor-agnostic commodities (e.g., virtual machines, RDS) but others are among the least vendor-agnostic parts of our stack. Among them:

  • Each cloud vendor’s store-brand queue, notification service, message bus, Lucene-based search, key-value store, blob storage, etc.

  • Infrastructure-as-code and a lot of “serverless”

  • Any data pipeline / integration tools in which you’re not writing SQL

  • Anything related to machine learning that’s not managed purely through requirements.txt

The first set can be managed well with encapsulation (see previous); the others involve real tradeoffs.

I’m an advocate for infrastructure-as-code. It makes your deployments and environments reproducible and testable in a way they weren’t in the past. Serverless brings a ton of cost and scaling advantages. I’m not suggesting we should all deploy one monolith to one EC2 instance. But it is a fact that every step you take away from that increases your dependence on a single vendor and opens you up to financial and operational risks that are hard to quantify. My concrete recommendation here is that a simple code review doesn’t cut it for infrastructure - this is still a strategic decision and should be treated as such, mechanical ease notwithstanding.
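
To illustrate where that dependence creeps in, here is roughly what a trivial infrastructure-as-code definition looks like with AWS's CDK (the resource names are illustrative). The concepts - object storage, a function-as-a-service handler - are generic, but every import and construct is AWS-specific, so this part of the lock-in lives in your source tree rather than in anyone's rack:

```typescript
// Illustrative AWS CDK stack - every import and construct here is AWS-specific.
import * as cdk from "aws-cdk-lib";
import { aws_s3 as s3, aws_lambda as lambda } from "aws-cdk-lib";

const app = new cdk.App();
const stack = new cdk.Stack(app, "ReportsStack");

// Generic ideas, vendor-specific declarations.
new s3.Bucket(stack, "ReportsBucket", { versioned: true });
new lambda.Function(stack, "ReportGenerator", {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: "index.handler",
  code: lambda.Code.fromAsset("dist"),
});

app.synth();
```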

I’m more inclined toward a “just learn SQL” approach when it comes to data pipelines. Though I acknowledge that this might not be for everyone … I can’t overstate how many players have come and gone over the past three decades that promised to “demystify databases into a GUI” or some such, and how many companies have had to launch desperate searches for an expert in Arcane-Thing-That-Was-Supposed-To-Be-Easier-Than-SQL so they could migrate off of a soon-to-be-discontinued product.

In contrast, I’m not at all inclined to say “just learn how to develop and train neural nets yourself using TensorFlow”. If you don’t use machine-learning models built and trained by the big players, you’re putting yourself in a straitjacket. LLMs, speech-to-text, image recognition, etc. are very heavy on the special sauce, and there’s simply no avoiding it if you want those features in your product. Encapsulation (see previous) is still a worthwhile endeavor here - if you want to switch from GPT to Llama, the programming interface won’t be your top concern, but it might be the most avoidable one.
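
Concretely, that encapsulation can be as thin as one interface (hypothetical names), with one adapter per provider behind it - the same pattern as the search wrapper above:

```typescript
// Hypothetical text-generation wrapper: features depend on this interface,
// never on a particular provider's SDK. Moving from GPT to a hosted or
// self-hosted Llama then means adding one adapter, not rewriting every feature.
export interface TextGenerator {
  complete(prompt: string, maxTokens?: number): Promise<string>;
}
```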