Last Thursday I attended the 2nd privacy round table organised by Google Netherlands in their Amsterdam offices. Alma Whitten, Google’s global privacy engineering lead, chaired this session. She explained what type of personal data Google stores, for how long, and for what purpose.

The main distinction she made was between data stored for authenticated users (data associated with Gmail, Calendar, Docs etc.) and data stored for unauthenticated users (mainly search logs). After the observation that all such authenticated ‘account’ data really is deleted (after some grace period) if you close your Google account, the main discussion revolved around the use of search logs and the balance between privacy and search quality.

For every Google query you submit, Google stores your IP address, your query, your OS and browser, the date and time of the query, and a Google search cookie (which stores your search preferences and contains a more-or-less unique id). They also monitor which links you clicked on the result pages that you received in answer to your query. Anonymisation of these search logs occurs in two steps. The last octet of the IP address is stripped from each entry in the log after 9 months. The cookie is removed after 18 months. Google claims that it needs to keep that data un-anonymised for this long to improve the quality of search results. In particular, the 18-month term is chosen to ensure that seasonal influences on search behaviour are properly measured. IP addresses and cookies are used to link separate entries in the search logs, to create so-called ‘stories’ that describe the search experience of a particular person over a prolonged period of time.
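To make the two-step anonymisation described above concrete, here is a minimal sketch in Python. It is not Google's implementation, which was not shown at the round table. The `LogEntry` fields, the function names and the approximation of a month as 30 days are all my own assumptions for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import Optional

# Assumption: a month is approximated as 30 days. The only figures given
# at the round table were "9 months" and "18 months".
IP_RETENTION = timedelta(days=9 * 30)
COOKIE_RETENTION = timedelta(days=18 * 30)


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical search log entry; field names are illustrative."""
    ip: str                   # IPv4 address in dotted notation
    query: str
    user_agent: str           # OS and browser
    timestamp: datetime       # date and time of the query
    cookie_id: Optional[str]  # more-or-less unique id from the search cookie


def strip_last_octet(ip: str) -> str:
    """Replace the last octet of an IPv4 address with 0."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not a dotted IPv4 address: {ip!r}")
    return ".".join(parts[:3] + ["0"])


def anonymise(entry: LogEntry, now: datetime) -> LogEntry:
    """Apply the two anonymisation steps, depending on the entry's age."""
    age = now - entry.timestamp
    if age >= IP_RETENTION:
        entry = replace(entry, ip=strip_last_octet(entry.ip))
    if age >= COOKIE_RETENTION:
        entry = replace(entry, cookie_id=None)
    return entry
```

Note that between 9 and 18 months the cookie id is still present in such an entry, so entries can still be linked to each other even though the IP address has been truncated.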

Interestingly enough, at the 1st privacy round table, which I also attended, Google’s chief privacy officer Peter Fleischer more-or-less refused to discuss the privacy implications of Google’s search logs, dismissing it as an old, uninteresting, yesteryear debate. I guess Google changed its mind (making it the subject of this round table)…

The main problem is the following. Google will not tell how much their search engine improves because of the use of this data. We pretty much have to take them at their word that this long retention of personal data really improves the quality a lot. For example, when I pressed Alma on why Google only removes the last octet from an IP address (effectively limiting my privacy in the Google logs to a mere anonymity set of 256), she replied that the first three octets are used to improve the estimate of the geographic origin of the request (in cases where that data was inaccurate at the time the search log entry was created). But that argument does not make much sense, because most of that data is accurate when the log entry is created, and surely Google is able to handle a little bit of noise in their data… Also, the fact that Google has to keep out the bad guys who commit click fraud has nothing to do with search quality. That only improves the quality of their syndicated system for advertisement placement.
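The anonymity set of 256 mentioned above is simple arithmetic: one octet is 8 bits, so removing it leaves 2^8 possible addresses that a log entry could have come from. A small sketch (the function name is my own, and the address range is a reserved documentation range, not a real one):

```python
import ipaddress


def anonymity_set_size(octets_removed: int) -> int:
    """Number of IPv4 addresses that share the remaining prefix."""
    if not 0 <= octets_removed <= 4:
        raise ValueError("an IPv4 address has exactly 4 octets")
    return 2 ** (8 * octets_removed)


# Removing only the last octet leaves a /24 network.
network = ipaddress.ip_network("192.0.2.0/24")
print(network.num_addresses)   # 256
print(anonymity_set_size(1))   # 256
print(anonymity_set_size(2))   # 65536
```

This is an upper bound: it counts every address in the range, whether or not it was actually used to submit queries.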

There was also some discussion about the privacy threat posed by such a database. One can argue about whether such data are personal; often, IP addresses can be linked to individuals. But not always. And certainly not by Google itself. Then again, the question is how much that really matters: if you are identified on the Internet by your IP address, your real name is no longer needed to make all kinds of decisions about you.

Google’s view on this issue is pretty much that they “do no evil”. They see themselves as the biggest privacy advocates, protecting that data with the highest standards, and not giving in to frivolous requests by law enforcement agencies. But I am afraid that their focus on the Google “do no evil” mantra has made them at least partially blind to the potential evil they could do. Maybe we should not focus on the possible use of Google’s logs by governments and law enforcement. Big Brother is not the issue here. Maybe we should focus on what Google does (and can do) with that data. Kafka is the issue here. The potential for abuse is huge. And who can guarantee that Google will do no evil in the future?

Clearly, the more data Google and other search engines collect, the better their search results will be. But how much better? And at what privacy cost? It would be interesting to have some proper, independently collected and verified data to answer this question. Also, we should have a debate on how much data such companies should be allowed to collect for such a purpose. Perhaps new regulations are in order that limit the time that personal data can be stored to improve the overall service levels (I am definitely not talking about personalised services here). This would create a level playing field for all while increasing our privacy.