The VID (VerkeersInformatieDienst), a private company in the Netherlands, uses Bluetooth signals to provide traffic information. To this end a network of thousands of road side sensors scan for Bluetooth signals emitted by smart phones, car radios etc. in passing cars. Each such Bluetooth device has unique MAC address that it regularly broadcasts. The sensors transmit that MAC address together with the time it was observed to a central server. Using this data, and matching MAC addresses observed at different sensors, this central server can estimate travel times and hence congestion. It turns out that VID stores the raw sensor data, and that law enforcement agencies in the Netherlands have used this data to link suspects to a crime scene.
VID claims it does not store personally identifiable information, because it hashes the MAC address before processing and storing it. Given the fact that Bluetooth MAC addresses are taken from a 48-bit address space, this is a fair statement (although 48 bits of entropy do not outright rule out brute forcing a particular hash). The problem is that the police can still use this database to check whether MAC addresses of Bluetooth enabled devices that belong to a suspect have been seen by a sensor close to the scene of a crime, at the time the crime was committed. They simply apply the same hash to the MAC addresses and lookup the results in the database kept by VID. To obtain the MAC addresses of a suspect the police doesn’t even need to confiscate the equipment of the suspect; they can simply use the same technique used by VID to scan the MAC addresses remotely (without the suspect noticing).
A suggested solution to the problem
In the story it is suggested that a better way to protect the MAC addresses is to mix in a random salt value with MAC address before hashing, and to use a fresh salt each day while throwing away the previously used salt. This still allows VID to analyse the data in the database for patterns, but makes it impossible for the police to test for matches of suspected MAC addresses: they would need the corresponding salt-of-the-day for that.
This trick works in theory, but may be a bit harder to apply in practice. The problem is that all sensors need to use the same salt value on a particular day. Broadcasting this to the devices is not a good idea: the police can easily intercept and store this salt value. Moreover, it introduces two-way communication where the original design only requires the sensor to sent its data to the central server. An alternative approach is to seed all sensors initially with the same salt and let them compute a new salt value by hashing the old value at the end of the day. This does introduce the risk that some sensors fail to update their salt (for example because they did not receive the trigger to do so). As a result, sensors get out of sync and their data becomes useless forever. Moreover, it doesn’t help to prevent the police from learning the salt-of-the-day. They simply need to ‘steal’ one sensor, keep it in sync, and store all salts in a database.
A more fundamental approach
From a theoretical point of view the interesting question is whether a practical yet secure solution to this problem exists. This requires us to be a bit more precise in formulating the problem, along the following lines.
Suppose a central server operates a set of sensors. The sensors can sense values from a certain large (non-brute-forceable) domain. Sensors have some state , that can contain an initial secret (that differs among sensors). Whenever a sensor senses a value , it applies a publicly known function to this value and its local state , to produce a tag . It may also record some associated data (like the time the value was sensed) with the tag. It sends this tag and associated data to the central server.
Time is divided into epochs (e.g. days). At the end of each epoch, sensors update their state using a publicly known function and then discard the old state. The central server also updates its state using a public function .
The central server stores the received tags (along with the identity of the actual sensor) in its database. The central server also has some secret state. The central server must be able to efficiently determine for two tags and whether provided and are states of sensors from the same epoch. Let be the predicate implementing this test. I.e. if and only if and and belong to the same epoch. is public. Moreover, the central server must be able to quickly find all tags in its database that correspond to the same (unknown) value in a certain epoch. (Preferably, tags for equal values within an epoch should be equal to allow sorting of tags to quickly find all matches.) It should be unfeasible, however, to compute given .
The adversary (law enforcement, or other parties) can subpoena the central server to hand over the contents of the database. It can also request the secret state of the central server (so it is of no actual use to assume this is secret). The adversary can also corrupt sensors. However, once a sensor is corrupted, and its secret state revealed to the adversary, the central server can detect this and no longer accepts input from this sensor. In practice this could be done by making the sensors tamper proof or tamper evident. Also, the adversary cannot recover states of sensors (and the central server) of past epochs.
Suppose the adversary subpoenas the central database at epoch . Goal of the adversary is to determine for a value of its choice whether a tag corresponding to that value occurs in the database for some sensor state in an epoch before .
Given this more precise definition of the problem, the only feasible way for the central server to prevent the adversary from reaching his goal seems to be to erase the database at the end of each epoch. To see why, consider the predicate that returns true for two tags that correspond to the same sensed value . All law enforcement has to do to always be able to test whether a tag for some value occurs in the database is the following. At the start of the ‘game’ they corrupt one sensor, and store its state. They create a clone of this sensor using this stored state, advance its state to that corresponding to the desired epoch, and make it sense the desired value to create a tag . It then tries to find a tag in the database for which . If such exists, is in the database.
The only escape from this line of reasoning is when the transition to a new state corresponding to the next epoch by the sensors can never be simulated or performed after the fact. For example by working in the bounded-storage model introduced by Ueli Maurer and letting the state update depend on a too-large-too-store-for-later sequence of bits. (This also assumes that corrupted nodes refuse to faithfully update their state and still allow the new state to be observed by the adversary too; again this requires some form of tamper proofing)
The only way that VID can truly say it is not storing personally identifiable information, is by properly erasing its entire database regularly.
Post scriptum (22-4-2015)
Several people personally commented on this blog and provided some additional insights.
First of all, the problem would disappear if the mobile devices themselves would randomise their MAC address once in a while (again say each day).
Second of all, instead of salting the hash at the sensor, the hash can also be salted at the central server. This avoids the problem of having to distribute the daily salt to the sensors. Of course, to prevent the central server from learning the original MAC address this MAC address needs to be hashed also at the sensor node before sending it to the central server. The final tag stored at the central server then becomes something like .