The EU data privacy law (GDPR) will probably have a painful and messy rollout. It’s likely to be over-complex, too specific in some places and too vague in others. Flaws aside, though, we need that law or something like it. Some of the most important principles:
- Individual control over what data is used for, and which partners get access to which data
- Anonymizing data when possible
- Protection from access scope creep (“we need your call logs and we can’t possibly make our quiz app without them”) **
- Right to delete
- Right to port/download
Separately, we need a few laws on security standards, as well — something like building codes for cybersecurity.
** Yes, I do realize the irony that I’m posting this on Medium, which desperately “needs” your friends list in order to post to Facebook.
One of tech’s dirty secrets is that once my software looks at data, it’s basically impossible to make me delete it. Between backups, caches, replication, and test environments, deleting data that has worked its way into complex software is non-trivial even when I’m honestly trying. A malicious actor has no shortage of places to hide a database, from a thumb drive at someone’s house to an encrypted machine image at a cloud provider.
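To make that concrete, here’s a toy sketch (every class and store name is made up, not any particular system) of how one record fans out the moment ordinary software touches it, and how many places an honest purge has to reach:

```python
# Toy sketch of why "just delete it" is hard: one ingest call quietly fans
# the same record out into several stores, and a purge is only complete if
# it tracks down every one of them.

import copy
import json


class ToyDataPlatform:
    def __init__(self):
        self.primary_db = {}        # the store everyone remembers to delete from
        self.cache = {}             # request cache, often forgotten
        self.analytics_copy = []    # denormalized copy kept for reporting
        self.debug_log = []         # raw payloads logged "temporarily"
        self.nightly_backups = []   # snapshots that outlive the primary record

    def ingest(self, user_id, record):
        self.primary_db[user_id] = record
        self.cache[user_id] = copy.deepcopy(record)
        self.analytics_copy.append({"user": user_id, **record})
        self.debug_log.append(json.dumps({"user": user_id, "payload": record}))

    def nightly_backup(self):
        self.nightly_backups.append(copy.deepcopy(self.primary_db))

    def purge(self, user_id):
        # An honest purge has to reach every store above -- and this toy
        # version still misses anything copied to a laptop or test environment.
        self.primary_db.pop(user_id, None)
        self.cache.pop(user_id, None)
        self.analytics_copy = [r for r in self.analytics_copy if r["user"] != user_id]
        self.debug_log = [entry for entry in self.debug_log
                          if json.loads(entry)["user"] != user_id]
        self.nightly_backups = [
            {uid: rec for uid, rec in snapshot.items() if uid != user_id}
            for snapshot in self.nightly_backups
        ]


platform = ToyDataPlatform()
platform.ingest("alice", {"quiz_answers": [1, 2, 3], "friends": ["bob"]})
platform.nightly_backup()
platform.purge("alice")  # five stores to clean for one record, in a *simple* system
```

Even this simplified version has five stores to clean for a single record, and it says nothing about the copy someone moved to a test box last year.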
We talk about whether Facebook should have done a better job auditing Cambridge Analytica to make sure they deleted the data, but really that diligence has to happen before the data ever changes hands. If you’re playing cleanup with an unwilling entity, you’ve already lost.
It’s easy to say “Facebook doesn’t make privacy a priority,” so let’s look at how companies that most certainly do care approach this. I’m talking about institutional financial data, where a company’s entire business might be selling a data feed and the dollar amounts climb into eight figures. Even the credit-rating agencies at least pay lip service to taking these steps.
- Usually, there’s a requirement that you’re able to “purge” the system of all their data
- Sometimes this is audited, but the audit rarely gets to the point of reading code (it’s prohibitively expensive for an auditor to read your entire code base and database structure). What auditors are really looking for is evidence that you’ve put serious thought into how you would complete the purge, using ability as an indicator of willingness
- A big factor in these contracts is what you plan on doing with the data — only certain things are allowed
- Reputation and relationships are hugely important — a startup often needs a board member or executive who’s known in the industry to even be allowed the privilege of paying millions of dollars for a data set
- Having a US footprint can make or break a deal. Even though these things almost never end up in court, if there’s no one important who can be arrested and no assets that can be seized in a jurisdiction that’s likely to enforce a court judgment, the deal probably won’t be signed in the first place
- For academic research, you’ll often see a “data room”: you physically go into the provider’s office to run queries, but can’t access the data remotely or from your own equipment. You might also see a situation where a researcher is allowed to see a schema (think an empty database in your format) but can only run queries on real data by emailing the code to a data scientist at the provider’s company, who reviews it and decides whether to run it and send back the results (a minimal sketch of that review-and-run workflow follows this list)
- Cutting off a malicious actor’s access doesn’t do anything about the data they’ve already accessed, so the first indication of trouble can’t be “when they do something bad”
- Preventing bad actors from getting the data in the first place is where 99% of the effort goes. The enforcement mechanisms after the fact are almost entirely legal and reputational, since it’s basically impossible to force you to delete data once you have it: it comes down to “we’ll sue you,” “no one will sell you data again if you violate the terms,” and “big banks won’t buy a product made from stolen data”
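Here’s a minimal sketch of that “data room” review-and-run workflow, with invented names and an in-memory SQLite table standing in for the provider’s real data. The point is that the researcher only ever sees results, never the rows:

```python
# Minimal sketch of the review-and-run workflow: researchers never touch the
# data directly; a reviewer at the provider decides whether a query runs and
# returns only the results.

import sqlite3


class QueryGate:
    def __init__(self, connection):
        self.conn = connection

    def submit(self, query, reviewer_approves):
        """reviewer_approves is a human decision, not something the researcher controls."""
        if not reviewer_approves(query):
            return None  # rejected, e.g. it asks for raw rows instead of aggregates
        return self.conn.execute(query).fetchall()


# Provider side: the real data lives only here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (account TEXT, symbol TEXT, notional REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("acct1", "XYZ", 1_000_000.0), ("acct2", "XYZ", 250_000.0)],
)

gate = QueryGate(conn)

# A crude stand-in for the data scientist's judgment: allow aggregates,
# reject anything that would hand back row-level records.
def reviewer_approves(query):
    return "GROUP BY" in query.upper() and "SELECT *" not in query.upper()

print(gate.submit("SELECT symbol, SUM(notional) FROM trades GROUP BY symbol", reviewer_approves))
print(gate.submit("SELECT * FROM trades", reviewer_approves))  # rejected -> None
```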
Blockchain enthusiasts and others like to believe that there are computational solutions to these problems, but it’s a false hope. Data on a blockchain may be safe, but if, at any point, a user is trusting my code to put their data on that blockchain or read something from it, that’s an opportunity for me to make an off-chain copy and send us straight back to square one. The existence of data on a blockchain doesn’t preclude or in any way affect copies of that data in other places.
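A toy sketch of that gap, with a made-up FakeChain standing in for whatever chain you prefer. Nothing on the chain can see, or stop, the copy my code keeps before it ever writes:

```python
# Sketch of why on-chain storage doesn't help if users go through my code:
# the copy happens before the data ever reaches the chain.

import hashlib


class FakeChain:
    """Stand-in for a real blockchain client: append-only, content-addressed."""
    def __init__(self):
        self.blocks = {}

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        self.blocks[key] = payload
        return key


class MyHelpfulApp:
    """The code the user actually runs. The chain can't see what happens here."""
    def __init__(self, chain):
        self.chain = chain
        self.shadow_copies = []  # nothing on-chain prevents this list from existing

    def store_for_user(self, data: bytes) -> str:
        self.shadow_copies.append(data)   # square one, before anything is "on chain"
        return self.chain.put(data)


app = MyHelpfulApp(FakeChain())
receipt = app.store_for_user(b"alice's call logs")
print(receipt)             # the user sees a tidy on-chain receipt
print(app.shadow_copies)   # ...and I still have a plain copy
```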
Think of DRM for movies: the technology to prevent someone from pointing a camcorder at a screen will never exist. You do your best to stop the high-quality copies from getting out to people you don’t trust, legal protections stop the top distributors from selling copies obtained illegally, and you generally try to make it cheaper and easier to buy than to steal.
I’m sorry, coders: the only solutions here are societal, institutional, and legal. Data access decisions have to be made before any data transfer (technology can help us there), some onus has to be put on data purchasers to do diligence on where data came from and not to use it if it was obtained illegally, and protections after the fact have to be enforced the old-fashioned way (technology is only useful here if the data holders play along). We’ll never be able to stop every data leak, but we can set up norms that limit the damage.