Several years ago I parked my car in a car park. When I collected my car and drove out, I was surprised that I didn’t have to insert the parking token I had just used to pay my parking fee. The barrier opened automatically, as if it magically knew I had paid. I quickly understood the ‘magic’ involved: there was a camera scanning the license plates of all cars entering and leaving the car park. The paper parking token I was given upon entry actually had the license plate of my car printed on it: that’s how my payment was linked to my car, and that’s how the barrier could tell I had paid. Is the company responsible for maintaining the car park correct when it claims it is not collecting personal data?
(This is the first myth discussed in my book Privacy Is Hard and Seven Other Myths. Achieving Privacy through Careful Design, which will be published on October 5, 2021 by MIT Press. The image is courtesy of Gea Smidt.)
According to the General Data Protection Regulation (also known as the GDPR, the European ‘privacy’ law), license plates are personal data. And so are IP addresses, phone numbers, cookies, the MAC addresses transmitted over Bluetooth or WiFi, etc. This is because the GDPR defines personal data as:
any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.
In other words, data also counts as personal if it can be linked to someone indirectly, for example by looking up and linking several pieces of information scattered over different databases. This rather broad definition makes sense because it is often easy to make the connection between such random-looking identifiers and the actual person using them. Phone companies keep records of their subscribers. IP addresses are often static and linked to their owner by the Internet Service Provider.
Once you start paying attention, look under the hood and try to understand how they actually work, you realise that many digital services (electronic payment, travelling by public transport, using a mobile phone, even parking your car, reading an ebook or a newspaper online) collect personal data as well. As explained at length in the book, this creates significant privacy problems: information on parked cars is used to detect tax fraud, your financial transactions are used to determine your credit score and for targeted advertising, and your reading patterns are used to influence how novelists and journalists write their stories. These famous sportsmen were right:
“You can observe a lot by just watching.” (Yogi Berra)
“You only start to see it once you understand it.” (Johan Cruijff, originally in Dutch: “Je gaat het pas zien als je het door hebt.”)
Unfortunately, many organisations are unaware of this broad definition of personal data, and therefore believe they are not collecting personal data at all. As far as they are concerned, they are only collecting some uninteresting, random-looking identifiers. For many it’s also a form of denial: many online services necessarily process IP addresses, which means they fall under the GDPR (which they would rather not). This explains, and busts, this common privacy myth.
The question is: can such organisations avoid collecting personal data at all, or is there a more privacy friendly way to deal with such personal data? In general, this is what privacy by design is all about. In the context of the car park example above the question is whether a barrier that automatically opens for all cars for which the parking fee has been paid can be implemented without collecting license plates. The answer is a qualified yes, if we use pseudonyms.
The use of a pseudonym is an age-old method to conceal one’s true identity from the general public, used by artists, activists, criminals, celebrities, lovers, terrorists, and others alike. A pseudonym is any identifier (like an email address, a nickname, or a random string of letters and/or digits) that uniquely belongs to some person and allows others to single out that person, while preventing anyone from recovering or determining the true identity of that person. (Note that although this is a good protective measure, pseudonymous data is typically still considered personal data.) The question then becomes how to generate “good” pseudonyms. A common method is to use hashing.
The book goes into much more detail, but essentially a hash function is a function that is easy to compute but hard to invert. Given a document, it computes a unique hash code summarising the contents of the document; given only this hash code, the document itself cannot be recovered. What is possible, though, is to take a particular document and test whether it hashes to a known hash code. Hash functions accept arbitrary inputs: not only large documents, but also shorter inputs like phone numbers, passwords, or license plates. For example, the German license plate F PC 1313 could hash to >5a39!xv, while the license plate HB T 184 could hash to vjs8?42@. But for short inputs like these, the ability to test inputs and see whether they hash to a particular hash code poses a problem. There are only a relatively small number of license plates in circulation in any given country, so it is quite trivial to compute the corresponding hash code for each issued license plate and put all of these into a large reverse-lookup table (like a phone directory or dictionary) that would allow you to look up the license plate that belongs to a particular hash code.
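To make this concrete, here is a small sketch in Python of both the hashing and the reverse-lookup attack it enables. It uses SHA-256 (the book does not prescribe a particular hash function, so this choice is illustrative), and a tiny list of candidate plates standing in for the full set of issued license plates:

```python
import hashlib

def plate_hash(plate: str) -> str:
    """Hash a license plate with SHA-256; the hex digest serves as the pseudonym."""
    return hashlib.sha256(plate.encode("utf-8")).hexdigest()

# The pseudonym itself reveals nothing about the plate directly...
pseudonym = plate_hash("F PC 1313")

# ...but because the space of valid plates is small, an attacker can
# precompute a reverse-lookup table and then invert any hash instantly.
candidate_plates = ["F PC 1313", "HB T 184", "M AB 42"]  # in reality: every issued plate
reverse_lookup = {plate_hash(p): p for p in candidate_plates}

recovered = reverse_lookup[pseudonym]  # recovers "F PC 1313"
```

Building the full table for all plates in a country is a one-time effort, after which every plain hash of a license plate is effectively as identifying as the plate itself.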
This can be made more cumbersome by making the hash code context specific: by adding the unique name or location of the car park before or after the license plate before hashing it, the hash code corresponding to Q-Park-Alexanderplatz/F PC 1313 will be totally different from the hash code corresponding to IKEA-München/F PC 1313, and therefore such a reverse-lookup table would have to be constructed from scratch for every different car park. By further increasing the time necessary to compute the hash for a given input, using key derivation functions like scrypt, the time needed to build such a dictionary can be made prohibitively expensive: not something anyone is willing to do unless very motivated. Moreover, context-specific hash codes prevent visits of the same car to different car parks from being linked.
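A sketch of this context-specific approach, using Python’s built-in scrypt (the key derivation function mentioned above) with the car park name as the salt. The cost parameters below are illustrative; a real deployment would tune them to the available hardware:

```python
import hashlib

def car_park_pseudonym(plate: str, car_park: str) -> str:
    """Derive a context-specific pseudonym for a license plate.

    The car park name acts as the salt, so reverse-lookup tables built for
    one car park are useless for another; scrypt's cost parameters make
    building such a table slow in the first place.
    """
    return hashlib.scrypt(
        plate.encode("utf-8"),
        salt=car_park.encode("utf-8"),
        n=2**14, r=8, p=1,  # illustrative cost parameters; tune for real use
        dklen=16,
    ).hex()

p1 = car_park_pseudonym("F PC 1313", "Q-Park-Alexanderplatz")
p2 = car_park_pseudonym("F PC 1313", "IKEA-München")
# The same plate yields unlinkable pseudonyms at different car parks,
# while the same car park always derives the same pseudonym for matching.
```

The same function is deterministic within one car park, which is exactly what the barrier needs: the pseudonym computed when the car leaves matches the one recorded when it entered.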
In the case of the car park, such a context-specific pseudonym can be generated on the fly by the license plate camera scanning cars entering and leaving the car park, without storing the license plate itself. Instead of adding the car park name, the license plate scanner could add a random string that changes every day. This would make the pseudonyms even more resilient against such dictionary attacks, especially if the string is destroyed at the end of the day. (Note that this assumes the car park is used only for daily parking, and closes at night.) This is better than storing the actual license plates in plaintext in the car park database, even if they are deleted from the database as soon as you drive out of the car park (something that should also be done when only hash codes are stored in the database). No matter how you look at it, though, in the end people parking their cars in the car park simply have to trust the car park to process this information exactly as it claims to do.
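The daily rotating random string can be sketched as a keyed hash: the scanner holds a fresh random key each day and discards it at closing time. The function name and key handling below are hypothetical, for illustration only:

```python
import hashlib
import hmac
import secrets

# Hypothetical sketch: a fresh random key is generated each morning and
# destroyed when the car park closes, so pseudonyms from different days
# (and different car parks) cannot be linked or precomputed.
daily_key = secrets.token_bytes(32)

def todays_pseudonym(plate: str) -> str:
    """Keyed hash of the plate under today's key; never stores the plate."""
    return hmac.new(daily_key, plate.encode("utf-8"), hashlib.sha256).hexdigest()

entry = todays_pseudonym("F PC 1313")  # recorded when the car enters
leave = todays_pseudonym("F PC 1313")  # recomputed when the car drives out
# entry == leave, so the barrier can match a paid entry to the exiting car
# without ever writing the license plate itself to the database.
```

Without the key, no dictionary of plate hashes can be built in advance, and once the key is destroyed the stored pseudonyms can no longer be tied back to plates at all; of course, as the text notes, drivers still have to trust the operator to actually handle the key this way.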
The car park example shows that although the use of pseudonyms does reduce the privacy risk, their use is fraught with possible pitfalls in practice and should therefore only be used with care and proper consideration for the residual risks. This is in fact a general observation that can be made about many of the privacy-enhancing technologies and privacy by design approaches to be discussed in the book. They are by no means a silver bullet magically solving all privacy problems. As such, they should be applied in practice with appropriate care and sensitivity to the particular problem at hand. The subtitle says it all: “Achieving Privacy through Careful Design”.
(For all other posts related to my book, see here.)