Summary of Day #4 of the Interdisciplinary Summerschool on Privacy

July 14, 2016

Please find below a summary of the lectures given on day #4 of the Interdisciplinary Summerschool on Privacy (ISP 2016), held at Berg en Dal this week. There were lectures by Solon Barocas on fairness in machine learning, and Stefania Milan on privacy from the point of view of (organized) collective action.

Solon Barocas: Fairness in machine learning

The rise of big data has led to questions about how to regulate it, or how to control its consequences. In the EU, the focus is more on transparency, while the US has given up on this idea and is moving towards fairness instead. Hence the topic of this talk: fairness by design.

Data is not only a privacy issue but also a discrimination issue, see e.g. "It's Discrimination, stupid" (Gandy 1995). Data enables differential treatment. We can use it to set opportunities, access, eligibility, price, attention (scrutiny), exposure.

Machine learning is the means by which this data is put to use for differential treatment.

In machine learning, a system is trained to classify data (for example: is this email spam?) using a set of examples with known properties (in the example: a set of known spam mails and a set of known regular mails). Conceptually, the system internals synthesize a set of rules to classify new incoming data. These rules are never explicit in the system; the full set of rules is unknown, even to the engineer.
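To make the spam example concrete, here is a minimal sketch of such a training step. It is not taken from the talk: the tiny naive Bayes-style word counter, the function names `train` and `classify`, and the four example mails are all assumptions made purely for illustration.

```python
import math
from collections import Counter

def train(examples):
    """Learn word counts per label from (text, label) pairs; labels are 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest log-probability score for the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total)  # prior from label frequencies
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing avoids log(0) for words unseen in this label
                score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training examples with known properties.
EXAMPLES = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(EXAMPLES)
print(classify(model, "free money"))
print(classify(model, "monday meeting"))
```

Note that nobody wrote a rule like "mails containing 'free' are spam"; whatever rules exist are implicit in the counts derived from the examples, so a skew in the examples becomes a skew in the classifier.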

This involves a couple of steps.

First the target variables must be defined. They capture the thing you are trying to predict. In practice this step involves the most creativity, e.g. turning the question "help me figure out who is the ideal customer" into some specific attribute(s). This is what a 'data miner' does. The outcome depends on what it means to be the best customer (the person most likely to click an ad, the person who will generate the most profit, etc.). The way the original question is turned into target variables is far more consequential than whether the machine learning model accurately classifies against the target variables. Even if this model has very low false accept and/or false reject ratios, you still have a large error when the target variables do not really represent what you want to classify. For example, if you have a set of target variables that very broadly classifies terrorist traits, then even if the model is very accurate, many people are wrongly classified as terrorists. The choice of target variables is thus massively important. Also, certain variables may correlate strongly with a certain question yet contain a bias, so that certain groups of people are classified disproportionately. Formalization can make the beliefs, values and goals that motivate a project explicit.
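A small worked example may help to see why an accurate model does not rescue a badly chosen target. The numbers below are entirely made up, and the sketch assumes that everyone of actual interest also matches the broad target definition and that the model errs equally often in both directions; `flagged_breakdown` is a hypothetical helper, not something shown in the talk.

```python
def flagged_breakdown(population, n_real, n_proxy, accuracy):
    """Count who gets flagged when a model predicts a broad proxy target.

    population: total number of people screened
    n_real:     people who are actually of interest (assumed to all match the proxy)
    n_proxy:    people who match the broad target definition
    accuracy:   fraction of correct decisions, assumed symmetric for both classes
    """
    flagged_proxy = n_proxy * accuracy                       # proxy matches correctly flagged
    flagged_other = (population - n_proxy) * (1 - accuracy)  # non-matches flagged by mistake
    flagged = flagged_proxy + flagged_other
    real_flagged = n_real * accuracy                         # people of actual interest among the flagged
    return flagged, real_flagged, real_flagged / flagged

# Hypothetical numbers: 1 million people, 100 of actual interest,
# 50,000 matching the broad target, and a model that is 99% accurate.
flagged, real_flagged, share = flagged_breakdown(
    population=1_000_000, n_real=100, n_proxy=50_000, accuracy=0.99)
print(f"flagged: {flagged:.0f}, of actual interest: {real_flagged:.0f} ({share:.2%})")
```

With these made-up numbers roughly 59,000 people are flagged, of whom only about 99 are of actual interest. Almost all of the error comes from the target definition, not from the model.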

One danger is that for complexity reasons you may want to reduce the number of variables, whereas if you want to make the classifications more nuanced (and hence better capture what we as a society find valuable) you need to include many more variables. Also, a certain selection of variables may be more accurate for certain subgroups and less accurate for others. Of course there is no such thing as perfectly individualised decisions... Also, aiming to improve accuracy may lead to massive privacy problems. And it will induce a cost that may be significant. This cost by itself should not be seen as a valid argument against an investment to increase the accuracy. There may be obligations to make the investment anyway, for example because the cost is the result of existing biases, or because the reduced accuracy imposes a cost on society.

On the other hand, perfect accuracy is not always desirable, for example in the context of health insurance. If you can perfectly predict a person's health and treatment costs, you can make the insurance premium exactly equal to these costs, totally defeating solidarity and the purpose of insurance...

After defining the target variables, the system must be trained. You teach the system by giving it examples. But the examples may be skewed (containing a bias). Or the examples may be bad. Solon continued to explain different ways such bias is introduced.

Convenience samples: instead of properly creating a set of examples using the rigor of social scientific data collection, you 'conveniently' take an already existing data set, or one that is easy to collect. For example, the city of Boston wanted to use an app called StreetBump to crowdsource the location of potholes in its roads. This creates several skews: the data only come from people with smartphones (and this may correlate with neighbourhood) and from people that install the app (people that are tech savvy and know about the app in the first place). Interestingly enough, the developers knew this skew existed, and the primary use actually was by city-run garbage collection cars that by definition cover all parts of the city.

There is also a feedback effect. An initial bias may create an outcome that diverts attention to certain groups of people, creating less contact with people outside this group, increasing the bias in future samples.

You have to account for drift (explained using Netflix as an example). Netflix started as a DVD-by-mail rental service. It collected data to build a recommendation service for other movies to watch. This system was trained using data from the mail-rental period, but it was also used when Netflix turned into the online service it is now. It turns out that what people watch when they can decide immediately is different from what they want to watch when they have to decide days in advance. This is called concept drift. You also have population drift. The demographics from the early Netflix days are different from the current demographics, meaning that the recommendations are less accurate.

Under- and overrepresentation of certain groups is not always evident, so correcting for bias is hard. It assumes that analysts have some independent mechanisms and/or sources to test for this, i.e. we must somehow be able to 'know' the actual proportions. Standard validation does not help to detect this, because the validation data actually come from the same underlying data set.

The more samples we have, the smaller the error is. But then, for a minority group, the consequence is that the error is larger than for a larger group. This can be countered by oversampling the minority group when training the system.
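The effect of sample size can be sketched with the textbook standard error of an estimated proportion, sqrt(p(1-p)/n), which shrinks as the number of samples n grows. The code below is an illustration under assumptions of my own (the group sizes, the `oversample` helper and the record layout are hypothetical); it also shows the simplest form of oversampling, namely duplicating randomly chosen minority records until the groups are equally large. Keep in mind that duplication only rebalances the training set and adds no new information; collecting more real samples from the minority group is the stronger remedy.

```python
import math
import random

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

def oversample(records, group_key, seed=0):
    """Duplicate randomly chosen records of smaller groups until all groups are equally large."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # draw the missing records with replacement from the same group
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

# Hypothetical data set: 10,000 majority samples versus 100 minority samples.
se_major = standard_error(0.5, 10_000)
se_minor = standard_error(0.5, 100)
print(f"majority: {se_major:.3f}, minority: {se_minor:.3f}")

records = [{"group": "majority"}] * 90 + [{"group": "minority"}] * 10
balanced = oversample(records, "group")
print(len(balanced))
```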

Note: even if the accuracy for two groups is the same, it can be the case that for one group the false accept ratio is high while for the other group the false reject ratio is similarly high: persons from one group may wrongly be classified to be criminals, while persons from the other group are wrongly classified not to be criminals.
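The point of this note is easy to check by splitting the error counts per group. The sketch below uses invented counts and hypothetical helper names (`make_records`, `group_rates`); 'false accept' is taken to mean wrongly classified as criminal and 'false reject' wrongly classified as not criminal, following the wording above.

```python
def make_records(group, tp, fp, tn, fn):
    """Build (group, actual, predicted) tuples from confusion-matrix counts."""
    return ([(group, True, True)] * tp + [(group, False, True)] * fp
            + [(group, False, False)] * tn + [(group, True, False)] * fn)

def group_rates(records):
    """Compute accuracy, false accept rate and false reject rate per group."""
    counts = {}
    for group, actual, predicted in records:
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        if predicted:
            c["tp" if actual else "fp"] += 1
        else:
            c["fn" if actual else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]  # people who actually have the property
        negatives = c["fp"] + c["tn"]  # people who actually do not
        rates[group] = {
            "accuracy": (c["tp"] + c["tn"]) / (positives + negatives),
            "false_accept_rate": c["fp"] / negatives if negatives else 0.0,
            "false_reject_rate": c["fn"] / positives if positives else 0.0,
        }
    return rates

# Invented counts: both groups have 50 actual positives and 50 actual negatives.
records = (make_records("A", tp=50, fp=20, tn=30, fn=0)
           + make_records("B", tp=30, fp=0, tn=50, fn=20))
rates = group_rates(records)
for group, r in rates.items():
    print(group, r)
```

In this made-up data set both groups see the same overall accuracy, but all errors for group A are false accepts while all errors for group B are false rejects. A single accuracy figure hides this difference.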

It is difficult to correct for past injustices. It is even harder to correct for current prejudice. The only thing you can do here is ex-post outcome-based testing and rebalancing.

With respect to discrimination, there is no easy way to decide whether to include or exclude a certain variable that strongly correlates with the question at hand. Economic inequality is not random; it is often divided along the lines of social categories. As a result, there are sets of factors (so-called proxies) not directly related to economic status that nevertheless make it possible to classify this status accurately, especially when the models become more accurate. This creates the so-called fairness/accuracy trade-off. A way to resolve this might be to adjudicate decisions to see whether the disparate impact is acceptable. Yet this will require the plaintiff to prove that the same business objective could have been achieved with a fairer system. This is, of course, almost impossible.

Masking is the process of using machine learning to create systems that intentionally discriminate, by using the same (unintentional) factors explained above, but maliciously using them to achieve the intended effect. But Solon believes this is a much smaller threat than the threat of unintentional biases introduced in the same way.

To summarise: we have seen how data mining discriminates through target variables, training data, skewed samples, tainted examples, feature selection, limited and coarse features, proxies and masking.

Stefania Milan: Privacy from the point of view of (organized) collective action

Stefania brightened up her talk with cool graphics from the Beehive Design Collective.

The topic of her talk is the study of privacy through the lens of collective action. But what does this mean? Some examples: How do you organise when you don't know each other? Trust is difficult in this case, and quick secure communications may be hard. Also, what does the lack of privacy (e.g. after the Snowden revelations) mean for the decision to engage in concrete political action? Or privacy as a collective action problem (like Tor: you only have privacy if the crowd is large enough).

Stefania's talk covered a critical assessment of the toolbox for collective action, and two understandings of collective action: political activism and multistakeholder governance.

The social movement toolbox

Social movement studies (less interdisciplinary than Stefania would like them to be) are concerned with why and how mobilizations occur (their forms, dynamics, discourses). There is a huge bias (90%!) towards studying 'nice' movements, i.e. left-wing movements. This is partially due to the fact that these studies require engagement, which is easier with groups that are more open and that you can relate to.

What is a social movement? It's a long term "network of informal interactions between a plurality of individuals, groups and/or organizations, engaged in a political or cultural conflict, on the basis of a shared collective identity". The long term aspect sets it apart from collaborations. Political movement and social movement are seen, by Stefania, as equivalent.

For many years people have focused on movements within nation states, and have compared similar movements in different states. In the last ten years there has been a shift towards transnational research.

Approaches studied are (in chronological order):

  • collective behaviour: studies social movements as a negative, irrational thing; something too difficult to deal with.
  • rational choice: economic approach, assumes people participate if there is a potential gain, and studies incentives that make people participate or prevent people from joining. Example: strikes.
  • resource mobilisation theory: studies the influence of resources like money and time, but also leadership, on social movements (especially popular in the US; doesn't work as well in Europe).
  • political process approach: looks at things around and outside the movements themselves; studies the influence of the environment on the movement. Social movements for the most part exist to interact with the state; hence the way the state responds or acts matters.
  • 'new' social movements (the latest approach, which emerged in the 1980s): these movements did not revolve around economic issues, but around symbolic and cultural issues like peace, environment, and other immaterial values. The main aspect studied is collective identity.

What are the tools at our disposal?

  • organizational forms and dynamics: non-governmental organisations, movements turning into political parties, ...
  • collective identity: culturally bounded, one of the main forces that allow the formation of movements.
  • and: institutions and norms, political opportunity structures, networking dynamics, and tech as tool: unfortunately studied in a blackboxed manner, with no agency.

The research methods used are: interviews & focus groups, participant observation, surveys, and also social network analysis and protest event analysis. There are software tools that automate or support these traditional methods (e.g. SurveyMonkey), but no new methods enabled by technology have emerged yet.

Stefania is especially interested in the epistemology of movements, as 'bearers' of 'new ways of seeing the world'.

Privacy in relation to political activism

Social movements have, historically, had to deal with surveillance.

Mass surveillance kills dissent in a deeper and more important place: in the mind, where the individual trains him- or herself to think only in line with what is expected and demanded. (Glenn Greenwald, 2014)

Surveillance has several effects. Asymmetric power relations between the monitored and the surveillers are detrimental to the forming of social movements. After the Snowden revelations we have seen a decline in 'privacy-sensitive' searches on Google. (Not surprising given the fact that the UK tracked visitors to Wikileaks.)

How can these issues be solved? Through the use of encryption and Tor, and by creating awareness of such tools through crypto-parties. Other approaches are resource collections like EFF's Surveillance Self-Defense guide, or Tactical Tech's Security in a Box. Finally there are projects that monitor and rank service providers for their respect for human rights (including privacy).

The other way around, privacy itself has become the topic of a social movement. It has entered the advocacy agenda. In Amsterdam (where Stefania works) this is framed as 'data activism': a heuristic tool to think about privacy. The relationship with other movements like the hacker scene or the open source movement is explored.

Thirdly, privacy is a concern for movements themselves. There is a tension between visibility and privacy of the people in the movement. And there are few if any tools available to collaborate privately (beyond secure messaging). Also there is a tension between transparency and privacy, both in terms of requesting transparency of government (and the resulting privacy infringements on those people included in the related data), and in terms of information sharing within the movement itself. Social movements are about sharing, but now the movements hide everything they do. This also raises interesting issues in terms of accountability.

Policy advocacy in multistakeholder governance

This work tries to enter the control room, for example by working on technical standards and protocols. To this end Stefania joined ICANN. ICANN (responsible for DNS and distributing IP addresses, but it also provides a logical layer of internet governance) is at a crucial turning point (transitioning from US stewardship to a global multistakeholder community).

One particular item she studies there is the whois database. Whois is an enormous breach of privacy: it contains phone numbers and addresses of anybody owning an Internet domain name. This is a concern for social movements, where the person in the whois database becomes the target for hate campaigns. It is also loved by law enforcement and security agencies.

Stefania tries, together with Niels ten Oever, to bring human rights within the remit of ICANN. This is an interesting struggle because cyber libertarians are totally against this, yet governments like the idea because human rights are the remit of nation states and hence give governments a way into ICANN.
