Super-Concise Review of Traditional DevOps
There are two main components to traditional software:
Code: There's some process by which it goes from your test environment to your production environment ... repeat
Data: Gets pulled down from your production environment to your test environment, where you mess with it using your non-release-ready code ... on some schedule you wipe out your changes and replace your test data with a fresh copy of production ... repeat
There are a zillion caveats I'm glossing over, of course. You might have sensitive data, and thus test on a simulation of production data. You might have additional levels / stages of testing with different levels of access and review processes built in. But, for now: code is handled one way, data is handled another.
Machine Learning from 1,000 Feet
Again, let's stay super generic:
You have some data
You generate or "train" a model from that data
You send new data as questions to code that takes the question and that model as inputs and spits out an answer
On some schedule, you may re-train your model using new data (repeat #2)
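To make that loop concrete, here's a minimal sketch in Python. The use of scikit-learn and the made-up data are illustrative assumptions, not a prescription:

```python
# Minimal sketch of the generic ML loop above, assuming scikit-learn
# and a small tabular dataset; names and features are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train(features: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """#2: generate ("train") a model from data."""
    model = LogisticRegression()
    model.fit(features, labels)
    return model

def answer(model: LogisticRegression, question: np.ndarray) -> np.ndarray:
    """#3: code that takes new data plus the model and spits out an answer."""
    return model.predict(question)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # #1: some data
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # fake labels for the sketch
    model = train(X, y)
    # #4: on some schedule, re-run train() with fresh data and swap the model in
    print(answer(model, rng.normal(size=(5, 3))))
```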
What's code and what's data?
Let's expand this:
Data (duh)
You write a piece of code that trains a model; the input is the data from #1 and the output is #3 ...
A model
Code that uses the model to answer questions
A process by which you re-train your model using new data (repeat #2)
The first point here is that #2 and #4 are code and have to be tested / DevOp-ed as such. A lot of shops tend to miss that, especially with #2. The code that generates the model is usually incredibly simple (a few parameters to a function that's part of a package you got from elsewhere), so those parameters tend to get (wrongly) tweaked in production without the new model-generation code ever being executed in a test environment.
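To illustrate (the config file name and parameters here are hypothetical), even a tiny training script like this is code, and its parameters belong in version control and your test pipeline rather than being hand-tweaked in production:

```python
# Sketch only: the "training code" is often just a handful of parameters
# passed to someone else's library. Keeping those parameters in a
# version-controlled config (the filename is made up) means a tweak is a
# code change that flows through the test environment before production.
import json
from sklearn.ensemble import RandomForestClassifier

def build_model(config_path: str, X, y) -> RandomForestClassifier:
    with open(config_path) as f:
        params = json.load(f)   # e.g. {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params)
    model.fit(X, y)
    return model
```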
The same goes for the model, as it's trained and retrained - it's frequently treated without the respect that we'd traditionally afford to either data or code. The right answer here is actually more nuanced - the model is "sort of code, sort of data." You want to treat it as one or the other depending on where you're at in the software life cycle, your particular use case, and cost/benefit considerations.
Of course, as with anything testing- and/or DevOps-related, nothing that follows is dogma. You're allowed to take shortcuts; they typically become dangerous when you don't know they're shortcuts.
Handling a Model
I'd consider this a minimum: "Treat it like data."
Generate/train model in test environment using whatever training set you have; just get the code running
Generate model in production using a training set from production
Copy that model to test
Change your code that uses the model (maybe you're changing the format or pre-processing of what goes in, maybe you're changing how you interpret what comes out)
Push that code to production
Regenerate production model with new data, repeat from #2
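A rough sketch of that "treat it like data" flow, assuming a scikit-learn model and a shared artifact path (both are just assumptions for the example):

```python
# Sketch of "treat the model like data": the artifact is produced in
# production from production data, then copied down to test alongside the
# regular data refresh. Paths and environment wiring are assumptions.
import joblib
from sklearn.linear_model import LogisticRegression

def train_and_export(X, y, artifact_path: str) -> None:
    """Runs in production: train on production data, write the artifact."""
    model = LogisticRegression()
    model.fit(X, y)
    joblib.dump(model, artifact_path)  # a path your refresh job copies to test

def load_in_test(artifact_path: str):
    """Runs in test: load the copied-down artifact and iterate on the code
    that consumes it (steps #4 and #5 in the list above)."""
    return joblib.load(artifact_path)
```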
Again, this is a minimum. A more standard situation could replace #1 with:
Generate/train model in test environment using data that's copied from production
Iterate on that process as necessary
Push that code to production
If you don't have real data in your test environment, consider getting some there. If that doesn't make sense for you (privacy, cost, etc.), then you might want to hook up something a little bespoke.
Maybe you have a process that generates and trains models in production using production data and different parameters, but then instead of deploying them, ships them to test - you can test different models in your test environment with a configuration setting.
Maybe you point the test model generation/training code at production data, but then leave everything else pointing at the test environment. This is simpler and cheaper, but less secure.
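As a sketch of the configuration-setting approach (the environment variable name and paths are made up for this example), the test environment just picks which production-trained candidate it loads:

```python
# Sketch: test can point at any of several candidate models that were
# trained in production, selected via configuration. Names are hypothetical.
import os
import joblib

MODEL_PATH = os.environ.get("MODEL_PATH", "models/candidate_a.joblib")

def get_model():
    return joblib.load(MODEL_PATH)
```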
That's the answer to our previous question: your model is code, but it's code with data baked in. That means that you have to treat it with the respect of both if you're really serious about this AI thing.
Analogy to Tradition
There are things like this outside of the AI/ML world: a query plan cache, for example. Imagine for a moment that you had a lot of control over how the query plan cache was built. Would you change that without doing it in a test environment first? Would you generate a query plan using fake data and expect it to work with real data? You monster.
Side Note
You can ship your test models up to production as part of your build process, if they've been trained on real data. This shouldn't be a fundamental departure, just a performance optimization, or else you're doing it wrong.
Training and Re-Training
This is less likely to break functionality than changing the parameters of the model, or how it's used, but a model that evolves as new data comes in can get dumber, too. We can imagine a few scenarios in which that happens:
You sign up a large client for whom a rest-of-the-world edge case is common
You gather some unrepresentative data - a fat finger, a mistake by a user, or intentionally malicious behavior
Your previously working image recognition software gets pointed at Reddit and all of a sudden thinks that floating text in the Impact font is part of what defines a cat
Most of your safeguards against this are going to be custom-tailored to your use case, mixing and matching from some of these ideas (a sketch combining the first two follows the list):
Have an "original" training set or hand-crafted "quiz" with know correct results and ensuring that the new model handles those better (or not worse).
If you don't have correct answers for everything, then look for deviations from the output of the previous model.
Actively ask “what might mistakes look like?” and test for them. Categorize based on things the algorithm shouldn’t look at and make sure it doesn’t. If you find that 10% of apples the software marks rotten are false positives, but 50% of the “rotten” Gala apples are actually fresh, you’re clearly pulling over too many Galas and might want to look more closely at what you put into your training set. Note the focus on false positives rather than total numbers or percentages—a bug may be masked by a factual correlation (some apples keep longer, or ship from farther).
Random samples are good, the larger the better. If you have a small enough data set for the "sample" to be "everything," do it.
Review discrepancies individually if practical, statistically if not.
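Here's what a re-training gate combining the first two ideas might look like; the drift threshold and function names are assumptions, not a standard:

```python
# Sketch of a re-training gate: a hand-crafted "quiz" with known answers,
# plus a check for drift against the previous model's output on data where
# we don't have correct answers. Thresholds are illustrative assumptions.
import joblib
import numpy as np
from sklearn.metrics import accuracy_score

def retrain_gate(new_model, old_model_path, quiz_X, quiz_y,
                 unlabeled_X, drift_limit=0.05):
    """Fail the deploy if the new model regresses on the quiz,
    or drifts too far from the previous model elsewhere."""
    old_model = joblib.load(old_model_path)

    # 1. Known-answer quiz: the new model must do at least as well.
    old_acc = accuracy_score(quiz_y, old_model.predict(quiz_X))
    new_acc = accuracy_score(quiz_y, new_model.predict(quiz_X))
    if new_acc < old_acc:
        raise RuntimeError(
            f"New model regressed on the quiz set: {new_acc:.3f} < {old_acc:.3f}")

    # 2. No correct answers here, so look for deviations from the old model.
    disagreement = np.mean(new_model.predict(unlabeled_X)
                           != old_model.predict(unlabeled_X))
    if disagreement > drift_limit:
        raise RuntimeError(
            f"Models disagree on {disagreement:.1%} of cases; review before deploying")
```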
In the early stages, or for an app with few customers that everyone knows is a beta, re-testing every model is probably okay to skimp on. But once it becomes a serious part of a serious business... it's a matter of graduating from high school to adult life.
Auditability / Explainability
Sometimes, we expect our algorithms to be auditable and explainable.
In these cases, you can’t just build an algorithm and deploy it to production; you build an algorithm that tells you which human-comprehensible factors are important, how important, and in what combinations… and then turn that into code and deploy it. You don’t have an AI model make underwriting decisions directly, you use it to build a set of criteria for underwriting decisions, which your CFO can read. You don’t have AI decide, on a case-by-case basis, who is guilty of a crime (a la Minority Report), but rather what are the likely hotspots and where to patrol based on prior reports and patterns.
The way to think about this from a DevOps perspective is that you have a piece of code that generates another piece of code, that you then deploy to production. The first piece of code is really more of a tool for your development process — more like your IDE. It’s not crucial that the first piece of code is tested in a rigorous way (though doing so might make developing/iterating easier), but it’s absolutely crucial that the second piece is, because that’s what you’re deploying.
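As a sketch of that split (the cutoff value and function names are placeholders, not real underwriting advice): the opaque tool runs at development time, and the thing you test and deploy is the readable rule.

```python
# -- development-time tool: more like your IDE than the deployed artifact --
def suggest_dti_cutoff(historical_loans) -> float:
    """Use whatever opaque model you like here; its only job is to recommend
    a cutoff that a CFO or a regulator can read."""
    ...

# -- what actually gets tested and deployed --
MAX_DEBT_TO_INCOME = 0.43   # chosen with the tool above; the number is a placeholder

def approve_loan(debt: float, income: float) -> bool:
    """The deployed rule: simple, auditable, explainable."""
    return (debt / income) <= MAX_DEBT_TO_INCOME
```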
(When) Does the Algorithm Need to Explain Itself?
Whether your use case falls into one category or another is fundamental from a technical perspective, but should generally be based on non-technical considerations: How serious is an individual bad decision? Is there a review process? An appeals process? Who needs to know how it works and why?
An example of something that probably doesn't meet this threshold is fraud detection on credit card purchases (assuming that when a transaction gets flagged, the action you take is to call the customer and ask whether it's legitimate). If the customer asks, "why did this get flagged and not this [similar transaction]?" and you don't know, that's fine. It's just a phone call. You probably want to have an idea of what causes something to be flagged, but you don't need to know exactly.
An example of something that does is a loan application. If someone asks why they got denied for a loan while someone with similar-looking credit got approved, you'd better know. In fact, it's the law. If someone gets denied for a loan, you'd better be able to tell regulators that "we have a debt-to-income cutoff of x%" - even if deciding on the right x for the business is handled by an opaque algorithm.
This isn't necessarily a case of luddite-ism; here are a few good reasons:
Reproducibility: Investors in a lending scheme won't want to just take your word for it; they'll want to take your decision criteria and run them through their own simulations to come up with their own opinion of how it'll perform in the real world with their real money.
Ethics: some might laugh at the idea that a race-blind algorithm could be racist, but it's actually pretty easy when you consider how many variables can be proxies for other variables in real data sets. Take the lending example: a lot of machine learning will very quickly find that zip code correlates with creditworthiness (a quick check for this kind of proxy is sketched after this list).
Missing information: See the prior example. I wonder if that correlation has anything to do with a history of redlining? Did you feed an understanding of US history into the machine?
Self-fulfilling prophecy: When you put a product farther down the page, you make it less likely that a user finds and buys it. That doesn't necessarily mean you were right to declare it unwanted.
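For the ethics point above, a quick proxy check might look something like this sketch (pandas and the column names are assumptions for the example); you're looking for outcomes that differ by protected group within each value of a feature the algorithm is allowed to see:

```python
# Sketch of a proxy check: even if a protected attribute isn't a model
# input, an included feature (zip code here) can stand in for it.
# Column names are hypothetical.
import pandas as pd

def proxy_report(df: pd.DataFrame, feature: str, protected: str, outcome: str) -> pd.DataFrame:
    """Outcome rate broken out by the candidate proxy and the protected
    attribute; large gaps within a row suggest the feature is acting
    as a stand-in for the protected attribute."""
    return (df.groupby([feature, protected])[outcome]
              .mean()
              .unstack(protected))
```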