As governments explore how they can use machine learning in their operations, Burkhard Schafer discusses the importance of transparency and where some of the other potential dangers lie.

One of the great potentials of machine learning is its promise to handle heterogeneous data from disparate sources much better than previous approaches. This will make it easier to incorporate data from sources that have historically been neglected, or which official state agencies have found difficult to access. Combining citizen science with machine learning, for instance, could be one way of enabling wider participation and contribution to decision making in the democratic process.

This potential also highlights one of the greatest dangers. While initiatives such as citizen science can contribute new, and sometimes better, forms of data, they won't necessarily do so in an unbiased way. Participants in such projects more often than not come from specific socio-economic groups and have specific educational backgrounds. As data becomes more heterogeneous and sources more varied, keeping track of whose voices are not heard becomes more difficult. This can become dangerous when decision-makers are misled by the apparent objectivity of the algorithm, and by the (mis)perception that computers are not affected by human biases.

We have come a long way in understanding hidden cognitive biases and the role of inequalities in political and legal decision-making, and we have developed at least some strategies to prevent discrimination. These strategies and legal constraints will not become less relevant when we assist, or even replace, decision-makers with machine learning tools. On the contrary, identifying and avoiding inherent biases and unjustified inequalities will become even more challenging. "Discrimination aware computing" is part of the solution, but this is not just a technological issue. It also needs to reflect the way in which organisations rely on computer analysis, and how algorithms are perceived, understood and interpreted.
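To make the idea of discrimination-aware computing a little more concrete, here is a minimal, hypothetical sketch of one such check: comparing a model's positive-decision rates across demographic groups (the "demographic parity" gap). The function name, the example data, and the choice of metric are illustrative assumptions, not a prescribed method; real audits use a range of fairness metrics and much larger samples.

```python
def demographic_parity_gap(decisions, groups):
    """Return the largest difference in positive-decision rate between groups.

    decisions: list of 0/1 outcomes (1 = favourable decision)
    groups: list of group labels, aligned with decisions
    """
    rates = {}
    for g in set(groups):
        # Positive-decision rate for each group
        members = [d for d, grp in zip(decisions, groups) if grp == g]
        rates[g] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values())

# Invented example: approval decisions for applicants from two groups.
decisions = [1, 0, 1, 1, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

gap = demographic_parity_gap(decisions, groups)
print(f"Demographic parity gap: {gap:.2f}")  # prints 0.50 for this data
```

A gap near zero suggests the groups are treated similarly on this one measure; a large gap is a signal to investigate, not proof of discrimination. As the post argues, such checks are only part of the solution: the metric says nothing about why the gap exists or how the organisation acts on it.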

This also has implications for the regulatory framework within which machine learning techniques in the public sector will operate. An important point will be to ensure, wherever possible, transparency, replicability of results by "outsiders", and accountability of use. Data protection law will play a role in this task where the data in question is about citizens, but of at least equal importance is copyright law and similar instruments that give or deny certain groups control over data. This matters for the input data from which the algorithm learns, for the algorithm itself, and for any new data generated as a result of the learning process. Maximising openness of all three parts of the learning process will increase replicability and, with that, provide checks for inadvertent biases and inequalities. It also creates the transparency needed to allow meaningful challenge against machine-assisted decisions by the citizens who are affected. This will inevitably create tensions with proprietary and commercial interests, but also raise concerns for security, confidentiality and data protection.

This blog is part of the Shadow of the Smart Machine series, looking at issues in the ethics and regulation of machine learning technologies, particularly in government.