Something we're always taught as developers, usually by tutorials or via the defaults in various ORM tools, is every SQL table needs an auto-incrementing ID. This is a weirdly common fallacy, enforced by old tutorials, new tutorials and half-arsed tooling in various forms. Why are auto-incrementing IDs a problem? Because it means people can download your database.
This is not a new problem, it's been covered well over the last twenty years, but it is a problem I tell people about in my API talks and it seems new to many. People want to know how to avoid it, and how API data can ever be safe.
A Case Study
I contracted for a startup going through an accelerator, where one of the teams were building a unique database of user-generated photographs, and these photos were of artwork. Artwork like awesome graffiti, unknown installations, etc., and this data was not available anywhere else on the Internet.
They had these two URLs on their API:
What are the problems here?
Leaking "Success" to Competitors
As a startup, you often have competitors. Letting those competitors know the size of your data set, and how many users you have in your system, could lead to some pretty embarrassing situations.
Looking at the URL
GET /users/6138, I can go to
GET /users/1. If that shows me data, I can reasonably assume they have at least 6138 users, but to find the total I can easily make a script that
++$id and counts every HTTP status
200, and also counts things like 404 or 410, to give a accurate number of how many active users there are, and how many closed their account.
That could embarrass a larger company, or utterly destroy a small startup.
Leaking Datasets to Competitors (or Assholes)
Leaking statistics to your enemies might not be something that keeps you up at night, but what about giving away your actual data?
In our case study, these people had unique data, available nowhere else. If other people want this data, they can use the same
++$id approach to grab a hold of that data, and populate their own app, making a new competitor quite easily, and with a slightly better app could potentially put them out of business.
There are some alternatives to this, which can be used in pretty much any scenario.
If you're stuck with 1, 2, 3 in your DB due to a picky ORM, or due to not wanting to change an existing schema and all the data that goes with it, you can keep your IDs and just "hide" them.
Obfuscation allows you to turn a
1 into a
"df234FSafd" and a
"p363fdte7" into a
2364. Your IDs stay the same, but the user cannot
++$id anymore because
"df234FSafd" + 1 isn't a thing.
They use a hidden secret key, without which the strings are not reversible, so they're relatively safe.
- obfuscate_id - Rails only sorry other people, but looks good
- Opaque ID - Obfuscation for integer IDs
- HashIDs - Apparently they're not so secure and the salt can be guessed from about 30 hashes, so probably don't use them, but if you're stuck for anything else...
Universally Unique IDs
UUID is a cool way to make some really unique identifiers for your records, without relying on the DB to auto-increment to make it unique.
These can be used in place of usual IDs. They look a little bit like this:
That, again, is not
On top of that, two records made at the exact same split second will not have the same UUID, which is where the "universally" bit comes in. You can use this for offline support, meaning you make an item locally, give it an UUID, then sync up with the database when your user gets out of the subway and you've maintained integrity of your IDs and your data at large.
Update: Performance for UUIDs was brought up on Twitter, and there is a common misconception that UUIDs are slow or terrible. Firstly, do not store UUIDs as a string (varchar or char). Store them as binary, and if you use something that prefers PKs to be ordered then you can use ordered UUIDs.
- ramsey/uuid - PHP 5.4+ library for generating RFC 4122 version 1, 3, 4, and 5 UUIDs
- SecureRandom.uuid - Built in!
- uuid - Built in!
But if your app can get it, so can the user
Maybe, but how do your users get to the data?
In the case of pageable content, like a "Recent Events" list, simply limit the history of the data you can search through to a few days.
In the case of geo data searched for by proximity (stuff near you), set a limit on the number of records that can be returned, and how far out they can zoom. They'd have to script a bot to get it all which would take a while and is something you can probably program to detect.
Also rate limiting. If you require folks to register an app, you can limit their credentials on the free tier to stop them downloading everything.
Then, if they do start trying to download your whole data set, it'll at least take them a long time, and again, hopefully give you a chance to notice suspect activity.
How? Look for apps regularly hitting their rate limit. Look for stuff that is randomly search geo boxes moving across the map in an unnatural way (people don't move in a sweeping left to right grid pattern).
Automate those detections, and fire off an email to the bastards trying to download your dataset. You'll have their email because they signed up, and if they're really daft they'll use their work email. I've seen that happen, and we had a nice little chat.