Background

Big Data analytics such as credit scoring and predictive analytics offer numerous opportunities but also raise considerable concerns, among which the most pressing is the risk of discrimination. Although this issue has been examined before, a comprehensive study on this topic is still lacking. This literature review aims to identify studies on Big Data in relation to discrimination in order to (1) understand the causes and consequences of discrimination in data mining, (2) identify barriers to fair data-mining and (3) explore potential solutions to this problem.

Methods

Results

Most of the articles addressed the potential risk of discrimination of data mining technologies in numerous aspects of daily life (e.g. employment, marketing, credit scoring). The majority of the papers focused on instances of discrimination related to historically vulnerable categories, while others expressed the concern that scoring systems and predictive analytics might introduce new forms of discrimination in sectors like insurance and healthcare. Discriminatory consequences of data mining were mainly attributed to human bias and shortcomings of the law; therefore suggested solutions included comprehensive auditing strategies, implementation of data protection legislation and transparency enhancing strategies. Some publications also highlighted positive applications of Big Data technologies.

Conclusion

This systematic review primarily highlights the need for additional empirical research to assess how discriminatory practices are both voluntarily and accidentally emerging from the increasing use of data analytics in our daily life. Moreover, since the majority of papers focused on the negative discriminatory consequences of Big Data, more research is needed on the potential positive uses of Big Data with regard to social disparity.

Big Data has been described as a “one-size-fits-all (so long as it’s triple XL) answer” [24] to solve some of the most challenging problems in the fields of climate change, healthcare, education and criminology. This may explain why it has become the buzzword of the decade. Big Data is a very complex and extensive phenomenon that has had fluctuating meanings since its appearance in the early 2010s [86]. Traditionally it has been defined in terms of four dimensions (the four V’s of Big Data): volume, velocity, variety, and veracity, although some scholars also include other characteristics such as complexity [63] and value [5]. It consists of capturing, storing, analyzing, sharing and linking huge amounts of data created through computer-based technologies and networks, such as smartphones, computers, cameras and sensors [40]. As we live in an increasingly networked world, where new forms of data sources and data creation abound (e.g., video sharing, online messaging, online purchasing, social media, smartphones), the amount and variety of data that is collected from individuals has increased exponentially, ranging from structured numeric data to unstructured text documents, email, video, audio and financial transactions (SAS-Institute) [72].

Because traditional computational systems are unable to process Big Data, scholars have described its characteristics in close relation to the technical challenges they raise. Volume and velocity present the most immediate challenge to traditional IT structures, since companies do not have the necessary infrastructure to collect, store and process the vast amount of data that is created at increasingly higher speeds. Variety refers to the heterogeneity of both structured and unstructured data collected from very different sources, which makes storage and processing even more complex. Finally, since Big Data technologies deal with high volume, high velocity and a great variety of qualitatively heterogeneous data, it is highly improbable that the resulting data set will be completely accurate or trustworthy, which creates issues of veracity [5].

Despite the aforementioned issues, we should not forget that Big Data analytics, understood here as the plethora of advanced digital techniques (e.g. data mining, neural networks, deep learning, profiling, automatic decision making and scoring systems) designed to analyze large datasets with the aim of revealing patterns, trends and associations related to human behavior, play an increasingly important role in our everyday life: decisions to accept or deny a loan, to grant or deny parole, or to accept or decline a job application are influenced by machines and algorithms rather than by individuals. Data analysis technologies are thus becoming more and more entwined with people’s sensitive personal characteristics, their daily actions and their future opportunities. Hence it should not come as a surprise that many scholars have started to scrutinize Big Data technologies and their applications to analyze and grasp the novel ethical and societal issues of Big Data. The most common concerns regard privacy and data anonymity [26, 29], informed consent [41], epistemological challenges [28], and more conceptual concerns such as the mutation of the concept of personal identity due to profiling [27] or the analysis of surveillance in an increasingly “datafied” society [7].

One of the most worrying but still under-researched aspects of Big Data technologies is the risk of potential discrimination. Although “there is no universally accepted definition of discrimination” [82], the term generally refers to acts, practices or policies that impose a relative disadvantage on persons because of their membership of a salient social or recognized vulnerable group based on gender, race, skin color, language, religion, political opinion, ethnic minority etc. [61]. For the scope of our study we adhere to the aforementioned general conception of discrimination and only distinguish between direct discrimination (i.e. procedures that discriminate against minorities or disadvantaged groups on the basis of sensitive discriminatory attributes related to group membership such as race, gender or sexual orientation) and indirect discrimination (i.e. procedures that might intentionally or accidentally discriminate against a minority, while not explicitly mentioning discriminatory attributes) [32]. We also acknowledge the close connection between discrimination and inequality, since a disadvantage caused by discrimination necessarily leads to inequality between the considered groups [75].

Although research on discrimination in data mining technologies is far from new [69], it has gained momentum recently, in particular after the publication of the White House report of 2014, which firmly warned that discrimination might be the inadvertent outcome of Big Data technologies [65]. Since then, possible discriminatory outcomes of profiling and scoring systems have increasingly come to the attention of the general public. In the United States, for example, a system used to assess the future risk of re-offending among defendants was found to discriminate against black people [23]. Likewise, in the United Kingdom, an algorithm used to make custodial decisions was found to discriminate against people with lower incomes [15]. But more citizen-centered applications, such as Boston’s Street Bump app, which was developed to detect potholes on roads, are also potentially discriminatory. By relying on the use of a smartphone, the app risks increasing the social divide between neighborhoods with a higher number of older or less affluent citizens and wealthier areas with more young smartphone owners [67].

The proliferation of these cases explains why discrimination in Big Data technologies has become a hot topic in a wide range of disciplines, ranging from computer science and marketing to philosophy, resulting in a scattered and fragmented multidisciplinary corpus that makes it difficult to fully access the core of the issue. Our literature review therefore aims to identify relevant studies on Big Data in relation to discrimination from different disciplines in order to (1) understand the causes and consequences of discrimination in data analytics, (2) identify barriers to fair data mining and (3) explore suggested solutions to this problem.

A systematic literature review was performed by searching the following six databases: PsycINFO, SocINDEX, PhilPapers, CINAHL, PubMed and Web of Science (see Table 1).

Table 1

Search terms

| No. | Search terms | PsycINFO | PhilPapers | SocINDEX | CINAHL | PubMed | Web of Science |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | “Big data” OR “digital data” OR “data mining” OR “data linkage” | 2,385 | 179 | 507 | 944 | 13,214 | 23,740 |
| 2 | Discriminat* OR *equality OR vulnerab* OR *justice OR ethic* OR exclusion | 69,435 | 46,349 | 46,624 | 38,096 | 245,604 | 414,661 |
| 3 | 1 AND 2 | 156 | 67 | 88 | 55 | 769 | 1,177 |

The following search terms were used: “big data”, “digital data”, “data mining”, “data linkage”, “discriminat*”, “*equality”, “vulnerab*”, “*justice”, “ethic*” and “exclusion”. The terms were combined using Boolean logic (see Table 1). The inclusion criteria were: (1) papers published between 2010 and December 2017 and (2) written in English. A relatively narrow publication window was chosen because “Big Data” has become a buzzword in academic circles only over the last decade and because we wanted to target only those articles that focus on the latest digital technologies for profiling and predictive analysis. In order to obtain a broader understanding of discrimination and inequality related to Big Data, no restriction was placed on the discipline of the papers (medicine, psychology, sociology, computer science, etc.), or on the type of methodology (quantitative, qualitative, mixed methods or theoretical). Books (monographs and edited volumes), conference proceedings, dissertations, literature reviews and posters were omitted.
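The Boolean combination of the two term groups can be sketched as follows. This is only an illustration of the logic in Table 1: the helper name `any_of` is hypothetical, and the exact query syntax accepted by each database differs and is not specified in this review.

```python
# Term groups taken from the search strategy described above.
data_terms = ['"big data"', '"digital data"', '"data mining"', '"data linkage"']
discrimination_terms = [
    "discriminat*", "*equality", "vulnerab*", "*justice", "ethic*", "exclusion",
]


def any_of(terms):
    """Join the terms of one group with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(terms) + ")"


# Search 3 in Table 1: search 1 AND search 2.
query = any_of(data_terms) + " AND " + any_of(discrimination_terms)
print(query)
```

In practice such a string would still have to be adapted to the field tags and truncation rules of each database.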

The search protocol from the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method [57] was followed and resulted in 2312 papers (see Fig. 1). Two papers were added that were identified through other sources. The results were scanned for duplicates (609) and 1705 remained. In this phase, we included all articles that mentioned, discussed, enumerated or described discrimination, the digital divide or social inequality related to Big Data (from data mining and predictive analysis to profiling). Therefore, papers that focused mainly on issues of autonomy, privacy and consent were excluded, together with those that merely described means to recognize or classify individuals using digital technologies without acknowledging the risk of discrimination. Disagreements between the first and second authors were evaluated by a third reviewer who determined which articles were eligible based on their abstracts. In total, 1559 records were excluded.
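As a simple consistency check, the identification and de-duplication counts reported above can be reproduced from the combined search results in Table 1. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Hits for the combined query (search 3 in Table 1), per database.
hits = {
    "PsycINFO": 156,
    "PhilPapers": 67,
    "SocINDEX": 88,
    "CINAHL": 55,
    "PubMed": 769,
    "Web of Science": 1177,
}

database_total = sum(hits.values())  # records retrieved from the six databases
other_sources = 2                    # papers identified through other sources
duplicates = 609                     # duplicate records removed

records_after_dedup = database_total + other_sources - duplicates

print(database_total)       # 2312
print(records_after_dedup)  # 1705
```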

Fig. 1

PRISMA flowchart

The first author subsequently scanned the references of the remaining 91 articles to identify additional relevant studies. Twelve papers were added through this process. The final sample included 103 articles. During the next phase, the first author read the full texts. After thorough evaluation, 42 articles were excluded because (1) they did not or only superficially referred to discrimination or inequality in relation to Big Data technologies and focused more on risks related to privacy or consent; (2) they discussed discrimination but not in relation to the development of Big Data analytic technologies; (3) they focused on the growing divide between organizations that have the power and resources to access, analyze and understand Big Datasets (“the Big Data rich”) and those that do not (“the Big Data poor”) [4] instead of on the concept of the Digital Divide, which is defined as the gap between individuals who have easy access to internet-based technologies and those who do not; or (4) they assessed disparities affecting participation in social media. The subsequent phase of the literature review involved the analysis of the remaining 61 articles. The following information was extracted from the papers: year of publication, country, discipline, methodology, type of discrimination/inequality fostered by data mining technologies, suggested solutions to the discrimination/inequality issue, beneficial applications of Big Data to counter discrimination/inequality, reference to the digital divide, reference to the concept of the Black Box as an aggravator of discrimination, evaluation of the human element in data mining, mention of the shift from individual to group harm, reference to conceptual challenges introduced by Big Data, and mention of legal shortcomings when confronted with Big Data technologies.

Among the 61 papers included in our analysis, 38 were theoretical papers that critically discussed the relation between discrimination, inequality and Big Data technologies. Of the remaining 23 articles, 7 employed quantitative methods, 3 qualitative methods and 13 computer science methodologies that used a theory to combat or analyze discrimination in data mining and then empirically tested this theory on a data set. To distinguish the latter approach from the more traditional empirical research methods, we classified such studies as “other” (experimental) methods. Most of the papers were published after 2014 (n = 44), the year of the publication of the White House report on the promises and challenges of Big Data [65]. More than one-third of the studies (n = 22) were from the United States, 6 came from the Netherlands, 3 from the United Kingdom and the remaining ones were from Belgium, Spain, Germany, France, Australia, Ireland, Italy, Canada, or Israel. Ten papers were from more than one country (see table). Regarding the scientific discipline, 20 papers were published in journals from the field of Social Sciences, 14 from Computer Science, 14 from Law, 9 from Bioethics and only 2 from Philosophy and Ethics. As to the field of application, a considerable number of papers (n = 24) discussed discriminatory practices in relation to various aspects of daily living such as employment, advertisement, housing, insurance and credit scoring, while others focused on one specific area.

The majority of the studies (n = 38) did not provide a definition of discrimination, but instead treated the word as self-explanatory and frequently linked it to other concepts such as inequality, injustice and exclusion. A few defined discrimination as “disparate impact”, “disparate treatment”, “redlining” or “statistical discrimination”, while others gave a more “juridical” definition and referred to the unequal treatment of “legally protected classes”, or directly referred to existing national or international legislation. Only one article discussed the difference between direct and indirect discrimination (see Table 2).

In order to explore whether and how Big Data analysis and/or data mining techniques can have discriminatory outcomes, we decided to divide the studies according to (a) the possible discriminatory outcomes of data analytics and (b) some of the most commonly identified causes of discrimination or inequality in Big Data technologies.

Forms, targets and consequences of discrimination

Numerous papers assessed the various discriminatory and unfair outcomes that might result from data technologies (see Table 3).

Among these, a considerable number of papers highlighted the two main forms of discrimination introduced by data mining. In this context, some authors stressed the fact that the aforementioned algorithmic mechanisms might result in involuntary and accidental discrimination [8, 14, 17, 21, 25, 39, 45, 54, 73, 93]. Barocas and Selbst [8], for example, claimed that “when it comes to data mining, unintentional discrimination is the more pressing concern because it is likely to be far more common and easier to overlook” [8] and expressed concern about the possibility that classifiers in data mining could contain unlawful and harmful discrimination towards protected classes and/or vulnerable groups. Holtzhausen, along the same lines, argued that “algorithms can have unintended consequences” [39] and might cause real harm to individuals, ranging from differences in pricing, to employment practices, to police surveillance. Some other studies instead highlighted that data mining technologies could result in direct and voluntary discrimination [32, 39, 46]. Here we follow the aforementioned definition of direct discrimination [32], which describes it as discrimination against minorities or disadvantaged groups on the basis of sensitive discriminatory attributes related to group membership such as race, gender or sexual orientation. Holtzhausen, for instance, warned against the discriminatory use of ethnic profiling in housing and surveillance [39], while another paper [1] discussed potentially oppressive and discriminatory outcomes of data mining on migration and profiling that impose an automatic and arbitrary classification and categorization upon supposedly risky travelers.

Some papers also defined the potential targets of data mining technologies. Two papers [46, 58] discussed the increased exploitation of the vulnerable as one of the most worrying consequences of data mining; they claimed that algorithms might identify those who are less capable, such as elderly individuals with gambling habits, and prey on them with targeted advertisements or by persuading them “to take out risky loans, or high-rate instant credit options, thereby exploiting their vulnerability” [58]. Leese [48] claimed that discrimination is one of the harms that derives from the massive scale of the profiling of society and that the risk is even higher for vulnerable populations. Four of the reviewed papers also noticed how profiling and data mining technologies are causing a shift in harm from single profiled and classified individuals to larger groups. The papers argued that decisions taken on the aggregation of collected information might have harmful consequences (a) for the entire collectivity of the people involved in the data set [53], (b) for people who were not in the original analyzed dataset [30], and (c) for the general public, due to the penetration of data mining practices into each of our everyday activities thanks to big companies like Facebook, Twitter and Google [44]. de Vries [27] has taken this concept a step further and argued that the increased use of machine profiling and automatic classification could lead to a general increase of discrimination in many sectors, to a level that might make discrimination be perceived as a legitimate practice in a constitutional democracy.

Regarding the consequences of the use of Big Data technologies, social exclusion, marginalization and stigmatization were mentioned in 11 articles. Lupton [51] argued that the disclosure of sensitive data, specifically sexual preference and health data related to fertility and sexual activity, could result in stigma and discrimination. Ploug [63] described how health registries for sexually transmitted diseases risk singling out and excluding minorities, while Barocas and Selbst [8], Pak et al. [59], and Taylor [78] argued that some individuals will be marginalized and excluded from social engagement due to the digital divide.

According to the literature, Big Data technologies might also perpetuate existing social and geographical historical disparities and inequalities, for example by increasing the exclusion of ethnic minorities from social engagement, worsening the living conditions of the economically disadvantaged, widening the economic gap between poor and rich countries, excluding some minorities from healthcare [13, 14, 60, 79, 80, 85], and/or delivering a fragmented and incomplete picture of the population through data mining technologies [13].

Some papers also highlighted how new means of automated decision making and personalization could create novel forms of discrimination that transcend the historical concept of unlawful discrimination and that are not related to historically protected classes or vulnerable categories. According to Newell and Marabelli [58], individuals could be inexplicably and unexpectedly excluded from certain opportunities, exploited on the basis of their lack of capacities, and be unfairly treated through targeted advertisement and profiling. The reviewed literature pinpointed two main new forms of discrimination: first, economic or marketing discrimination, that is, the unequal treatment of different consumers based on their purchasing habits, or inequality in the pricing and offers given to customers based on profiling, for example in insurance or housing [35, 62, 81]; second, discrimination based on health prediction, that is, the unequal treatment or discrimination of individuals based on predictive, and not actual, health data [2, 22, 37, 38].

Causes of discrimination

Many papers highlighted the main elements that might cause discrimination or inequality in Big Data technologies (see Table 4).

Algorithmic causes of discrimination

Ten papers focused on how algorithmic and classificatory mechanisms might make data mining, classification and profiling discriminatory. These studies underlined that data mining technologies always involve a form of statistical discrimination, so that adverse outcomes against protected classes might occur involuntarily due to the classification system. Barocas and Selbst [8] and d’Alessandro et al. [25], for example, pointed out that while the process of locating statistical relationships in a dataset is automatic, computer scientists still have to personally set both the target variable or outcome of interest (“what data miners are looking for”) and the “class labels” (“that divides all the possible outcomes of the target variable in binary and mutually exclusive categories”) [8]. Insofar as the data scientist needs to translate a problem into formal computer code, deciding on the target variable and the class labels is a subjective process.

Another algorithmic cause of discrimination is related to biased data in the model. Data mining models need datasets to train on, since they learn to make classifications on the basis of given examples. Schermer [73] argued that if the training data is contaminated with discriminatory or prejudiced cases, the system will assume them as valid examples to learn from and reproduce discrimination in its own outcomes. This contamination could derive from historically biased datasets [14] or from the manual assignment of class labels by data miners [8]. An additional issue with the training data might be data collection bias [8] or sample bias [25]. Bias in the data collection can present itself as an underrepresentation of specific groups and/or protected classes in the data set, which might result in unfair or unequal treatment, or as an overrepresentation in the data set, which might result in a “disproportioned attention to a protected class group, and the increased scrutiny may lead to a higher probability of observing a target transgression” [25]. Within this context, Kroll and colleagues mentioned the phenomenon of “overfitting”, where “models may become too specialized or specific to the data used for training” and, instead of finding the best possible decision rule overall, simply learn the rule best suited to the training data, thus perpetuating its bias [45].

Another possible algorithmic cause of discriminatory outcomes is the presence of proxies for protected characteristics such as race and gender. A historically recognized proxy for race, for example, is the ZIP or post code, and “redlining” is defined as the systematic disadvantaging of specific, often racially associated, neighborhoods or communities [73]. On this note, Zliobaite and Custers [95] highlighted how, in data mining, the elimination of sensitive attributes from the data set does not help to avoid discriminatory outcomes, as the algorithm could automatically identify unpredictable proxies for protected attributes.

Two papers discussed the feedback loop and the systematic loop as possible causes of unfair predictions [14, 25]. These involve the creation of a vicious cycle where certain inputs in the data set induce statistical deviations that are learned and perpetuated by the algorithm in a self-fulfilling loop of cause and consequence. An example might help to clarify this mechanism: police crime notifications in certain urban areas will increase police patrol activity, since crime notification is considered predictive of increased criminal activity. However, intensive patrolling will result in an increasingly higher rate of criminal activity reports in that area, irrespective of the true crime rate of that neighborhood with respect to others.

“Feature selection” is another possible cause of discrimination identified by Barocas and Selbst [8]. This is a process that is used by those who collect and analyze the data to decide what kind of attributes or features they want to observe and take into account in their decision making processes. The authors argued that the selection of attributes always involves a reductive representation of the more complex real world object, person, or phenomenon that it aims to portray, insofar as it cannot take into account all the attributes and all the social or environmental factors related to that individual [8].
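The feedback loop in the policing example above can be made concrete with a small deterministic simulation. This is a minimal sketch under assumptions that are not taken from the reviewed papers: two areas with identical true incident rates, a hypothetical "hot spot" policy that sends a fixed share of patrol hours to the area with the most recorded reports, and recorded reports that grow in proportion to patrol presence. All names and parameter values are illustrative.

```python
def simulate_patrols(initial_reports, true_rate, rounds,
                     patrol_hours=100.0, hot_spot_share=0.8):
    """Return cumulative recorded reports per area after the given number of rounds.

    Each round, the area with the most recorded reports receives
    `hot_spot_share` of the patrol hours; the rest is split evenly.
    Newly recorded reports equal patrol hours times the true incident rate.
    """
    if len(initial_reports) < 2:
        raise ValueError("at least two areas are needed")
    reports = list(initial_reports)
    for _ in range(rounds):
        hot = max(range(len(reports)), key=lambda i: reports[i])
        rest_share = (1.0 - hot_spot_share) / (len(reports) - 1)
        for i in range(len(reports)):
            share = hot_spot_share if i == hot else rest_share
            reports[i] += patrol_hours * share * true_rate[i]
    return reports


# Two areas with the same true rate; area 0 merely starts with two more reports.
final_reports = simulate_patrols([12.0, 10.0], [0.1, 0.1], rounds=10)
print(final_reports)
```

Under these assumptions the first area ends with roughly three times as many recorded reports as the second, although the true rates are identical: the small initial difference determines where patrols go, and patrol presence in turn determines what is recorded.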

d’Alessandro et al. identified two additional possible causes of discrimination linked to model misspecification, that is “the functional form of feature set of a model under study not being reflective of the true model” [25]. These are “cost function” misspecification and “error by omission”. “Cost function” misspecification is defined as the failure to consider the additional weight given to the event or attribute of interest (e.g. criminal record) by the data scientist. d’Alessandro et al. argued that since “discrimination is enforced when a protected class receives an unwarranted negative action”, if a “false positive error could cause significant harm to an individual in a protected class”, the weight of the attribute, namely its asymmetry with respect to others, has to be taken into account [25]. “Error by omission” is another form of cost function misspecification that occurs when terms that penalize discrimination are ignored or left out from the model. Simply put, it means that the model does not take into account the differences in how the algorithm classifies protected and non-protected classes [25].
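One possible reading of these two ideas is sketched below. It is not the formulation used in [25]: the function `penalized_cost`, the false positive weight `fp_weight`, and the penalty on the gap in mean predicted score between groups (weighted by `fairness_weight`) are all illustrative assumptions. Setting `fp_weight` to 1 ignores the asymmetry of errors, and setting `fairness_weight` to 0 corresponds to leaving the discrimination term out of the model.

```python
import math


def penalized_cost(y_true, y_prob, group, fp_weight=1.0, fairness_weight=0.0):
    """Mean log loss with up-weighted errors on true negatives,
    plus a squared penalty on the score gap between the two groups."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithm finite
        if y == 1:
            total += -math.log(p)
        else:
            # A high score on a true negative is a false-positive-type error.
            total += -fp_weight * math.log(1.0 - p)
    mean_loss = total / len(y_true)

    protected = [p for p, g in zip(y_prob, group) if g == 1]
    others = [p for p, g in zip(y_prob, group) if g == 0]
    gap = sum(protected) / len(protected) - sum(others) / len(others)
    return mean_loss + fairness_weight * gap ** 2


# Hypothetical labels, scores and group membership (1 = protected class).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.1]
group = [1, 1, 0, 0]

plain = penalized_cost(y_true, y_prob, group)
asymmetric = penalized_cost(y_true, y_prob, group, fp_weight=3.0)
fairness_aware = penalized_cost(y_true, y_prob, group, fairness_weight=2.0)
print(plain, asymmetric, fairness_aware)
```

A model fitted against the plain cost has no incentive to reduce either false positives or the score gap between groups; the two extra terms make those concerns visible to the optimization.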

Finally, the reviewed articles also highlighted how algorithmic analysis can become an excellent and innovative tool for direct voluntary discrimination. This practice, defined as “masking”, involves the intentional exploitation of the mechanisms described above to perpetrate discrimination and unfairness. The most common practice of masking is the intentional use of proxies as indicators of sensitive characteristics [8, 45, 62, 93, 95].
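The role of proxies, whether exploited intentionally or picked up accidentally, can be illustrated with a toy example. The records below are invented for illustration only: group membership is never shown to the decision rule, which is "trained" on postcode alone, yet the resulting approval rates differ sharply by group because postcode and group are strongly associated.

```python
from collections import defaultdict

# Hypothetical historical records: (group, postcode, approved).
records = (
    [("A", "N", 1)] * 8 + [("A", "N", 0)] * 1 + [("A", "S", 0)] * 1
    + [("B", "S", 0)] * 7 + [("B", "S", 1)] * 2 + [("B", "N", 1)] * 1
)


def fit_postcode_rule(rows):
    """Approve a postcode if its historical approval rate is at least 50%.
    The group attribute is deliberately ignored."""
    totals, approved = defaultdict(int), defaultdict(int)
    for _group, postcode, label in rows:
        totals[postcode] += 1
        approved[postcode] += label
    return {pc: approved[pc] / totals[pc] >= 0.5 for pc in totals}


def approval_rate_by_group(rows, rule):
    """Share of each group that the postcode-only rule would approve."""
    totals, approved = defaultdict(int), defaultdict(int)
    for group, postcode, _label in rows:
        totals[group] += 1
        approved[group] += rule[postcode]
    return {g: approved[g] / totals[g] for g in totals}


rule = fit_postcode_rule(records)
rates = approval_rate_by_group(records, rule)
print(rule)   # {'N': True, 'S': False}
print(rates)  # {'A': 0.9, 'B': 0.1}
```

Removing the sensitive attribute therefore does not remove the disparity in this sketch; the postcode carries the same information.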

Digital divide

We identified nine papers that discussed the digital divide, that is, the gap between those who have continuous and ready access to the internet, computers and smartphones and those who do not, as a possible cause of inequality, injustice or discrimination. Lack of resources or computational skills, older age, geographical location, and low income were identified as possible causes of this digital divide [8, 18, 60]. Two papers [49, 74] discussed the “big data exclusions”, referring to those individuals “whose information is not regularly collected or analyzed because they do not routinely engage in data-generating practices” [49]. On the same note, Bakken and Reame [6] argued that data is mainly gathered from white, educated people, leaving out racial minorities such as Latinos. Boyd and Crawford discussed the creation of new digital divides, arguing that discrimination may arise due to (1) differences in information access and processing skills (the Big Data rich and the Big Data poor) and (2) gender differences, insofar as most researchers with computational skills are men [12]. Lastly, Cohen et al. [22] described how the commercialization of predictive models will leave out vulnerable categories such as people with disabilities or limited decision-making capacities and high-risk patients.

Data linkage and aggregation

Four papers discussed data linkage, that is, the possibility of automatically obtaining, linking, and disclosing personal and sensitive information, as an important cause of discrimination. Two articles [19, 91] described how the use of electronic health records could result in the automatic disclosure of sensitive data without the patient’s explicit agreement, or in re-identification. Others [64, 74] also highlighted that discrimination is not created by a data collection system (such as social and health registries) in itself, but is made easier by the linkage and aggregation potential embedded in the data.

The literature has suggested several different strategies to prevent discrimination and inequality in data analytics, ranging from computer based and algorithmic solutions to the incorporation of human involvement and supervision (see Table 5).

Practical computer science and technological solutions

Some articles authored by IT specialists suggested practical computer science solutions, namely the development of discrimination-aware methods to be applied during the development of the algorithmic models. These techniques include: pre-processing methods that involve the sanitization or distortion of the training data set to remove possible bias, in order to prevent the new model from learning discriminatory behaviors (e.g. [33, 43]); in-processing techniques that modify the learning algorithm, for example through the application of regularization to probabilistic discriminative models [43], the inclusion of sensitive attributes to avoid discriminatory predictions [66, 95] or the addition of randomness to avoid overfitting or hidden model bias [45]; and post-processing methods that involve the auditing of the extracted data mining models for discriminatory patterns and eventually their sanitization [34]. Along these lines, [25] suggested the implementation of an overall discrimination-aware auditing process that involves the coherent combination of all pre-, in-, and post-processing methods to avoid discrimination.

Many papers indicated how the implementation of transparency of data mining processes could help avoid injustice and harm. Practical suggestions to reinforce transparency in data mining include the development of interpretable algorithms that explain the logical steps behind a certain classification [45, 73], and the creation of transparent models that allow individuals to see in advance how their behavior and choices will be interpreted by the algorithm or the infrastructure [21, 35]. Another solution was the enhancement of proper privacy-preserving strategies, since it is impossible to eradicate the likelihood of discriminatory practices in data mining if discrimination-preventing data mining is not integrated with privacy-preserving data mining models [34]. Lastly, one paper suggested the promotion of exploratory fairness analysis that could be used to build up knowledge of the mechanisms and logics behind machine learning decisions [84].
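As an illustration of what a simple post-processing audit might look like, the sketch below compares the rate of favourable decisions between a protected group and a reference group. It is not a method taken from the reviewed papers: the function names are hypothetical, the data are invented, and the default threshold of 0.8 is an assumption borrowed from the "four-fifths" rule of thumb used in US employment guidance.

```python
def selection_rates(decisions, groups):
    """Share of favourable decisions (1) per group."""
    totals, positives = {}, {}
    for decision, g in zip(decisions, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + decision
    return {g: positives[g] / totals[g] for g in totals}


def audit(decisions, groups, protected, reference, threshold=0.8):
    """Return the ratio of selection rates and whether it falls below the threshold."""
    rates = selection_rates(decisions, groups)
    if rates[reference] == 0:
        raise ValueError("reference group has no favourable decisions")
    ratio = rates[protected] / rates[reference]
    return ratio, ratio < threshold


# Hypothetical model outputs: 1 = favourable decision.
decisions = [1, 0, 0, 0, 1, 1, 1, 0, 1, 1]
groups = ["protected"] * 5 + ["reference"] * 5

ratio, flagged = audit(decisions, groups, "protected", "reference")
print(ratio, flagged)
```

A flagged result does not by itself establish discrimination; it only indicates where the extracted model deserves closer inspection before any sanitization step.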

Legal solutions

Implementation of legislation on data protection and discrimination was another common suggestion among the papers from the USA. Kuempel [46] suggested that the harmonization of stronger data protection legislation across different sectors in the US could help counter discrimination in under-regulated areas, such as online marketing and data brokering. One author [62] argued that policies to constrain data use should be put into place. Such constraints should limit or deny the disclosure of sensitive data in specific contexts (e.g. health data in employment) or even deny specific uses of data in contexts where sensitive data is already disclosed, if such use might cause harm to the individual (e.g. the use of health data to increase premiums in insurance). Finally, one article [35] suggested the idea of “code as law”, that is, a transition from written law to computational law, implying the articulation of specific legal norms in digital technologies through the use of software.

Human-centered solutions

Keeping the human in the loop of data mining was another recommendation. According to some papers, human oversight and supervision are critical to improve fairness, since humans could notice where important factors are unexpectedly overlooked or sensitive attributes are improperly correlated [11, 25]. Other solutions that include human involvement were: (a) the participation of trusted third parties to either store sensitive data and rule on their disclosure to companies [84] or supervise and assess suspicious data mining and classification practices [54]; (b) the engagement of all relevant stakeholders involved in a decision-making or profiling process (such as health care institutions, physicians, researchers, subjects of research, insurance companies, and data scientists) in a multidisciplinary discussion towards the creation of a theoretical overarching framework to regulate data mining and promote the implementation of fair algorithms [22]; (c) the implementation of strategies to educate data scientists in building proper models, such as the creation of a knowledge base platform for fairness in data mining that data scientists could consult if they stumbled upon problematic correlations; and (d) the implementation of flexibility and discretion in EHR disclosure systems to avoid stigma from the disclosure of personal and private information [37].

Many papers described algorithmic decision making as a black box system where the input and the output of the algorithm are visible but the inner process remains unknown [13, 21, 25], resulting in lack of transparency regarding the methods and the logic behind scoring and predictive systems [35, 48, 54, 92]. Reasons behind the opacity of automated decision making are multiple: first, algorithms might use enormous and very complex data sets that are uninterpretable to regulators [25], who frequently lack the required computer science knowledge to understand algorithmic processes [73]; second, automatic decision making might intrinsically transcend human comprehension since algorithms do not make use of theories or contexts as in regular human-based decision making [58]; and finally, algorithmic processes of firms or companies might be subject to intellectual property rights or covered by trade secret provisions [35]. If there is no transparent information on how algorithms and processes work, it is almost impossible to evaluate the fairness of the algorithms [44] or discover discriminatory patterns in the system [45].

Human bias was identified as another main obstacle to fair data mining. Human subjectivity is at the very core of the design of data mining algorithms since the decisions regarding which attributes will be taken into account and which will be ignored are subject to human interpretation [12], and will inevitably reflect the implicit or explicit values of their designers [1].

Algorithmic data mining also poses considerable conceptual challenges. Many papers claimed that automatic decision making and profiling are reshaping the concept of discrimination, beyond legally accepted definitions. In the United States (US), for example, Barocas and Selbst [8] claimed that algorithmic bias and automatization are blurring notions of motive, intention and knowledge, making it difficult for the US doctrine on disparate impact and disparate treatment to be used to evaluate and prosecute causes of algorithmic discrimination. One article [48], discussing European Union (EU) regulation, argued that it is necessary to rethink discrimination in the context of data-driven profiling, since the production of arbitrary categories in data mining technologies and the automatic correlation of the individual’s attributes by the algorithm differ from traditional profiling, which is based on the establishment of a causal chain developed by human logic. Some articles have also pointed out that concepts like “identity” and “group” are being transformed by data mining technologies. de Vries argued that individual identity is increasingly shaped by profiling algorithms and ambient intelligence in terms of increased grouping created in accordance with algorithms’ arbitrary correlations, which sort individuals into a virtual, probabilistic “community” or “crowd” [27]. This typology of “group” or “crowd” differs from the traditional understanding of groups, since the people involved in the “group” might not be aware of (1) their membership of that group, (2) the reasons behind their association with that group and, most importantly, (3) the consequences of being part of that group [54]. Two other concepts are being reshaped by data technologies.
The first is the concept of border [1], which is no longer a physical and static divider between countries but has become a pervasive and invisible entity embedded in bureaucratic processes and the administration of the state due to Big Data surveillance tools such as electronic passports and airport security measures. The second is the concept of disability, which needs to be broadened to include all diseases and health conditions, such as obesity, high blood pressure and minor cardiac conditions, which might result in discriminatory outcomes from automatic classifiers through algorithmic correlation with more serious diseases [37, 38].

The final barrier that was pinpointed in the literature is of a legal nature. According to some authors, current antidiscrimination and data protection legislation, both in the EU and in the US, is not well equipped to address cases of discrimination stemming from digital technologies [8]. Kroll et al. [45] claimed that current antidiscrimination laws might legally prevent users of algorithms from revising or inspecting algorithms after the discriminatory fact has happened, making the development of ex-ante anti-discriminatory models even more pressing. Kuempel [46] argued that data protection legislation is too sectoral and does not provide sufficient safeguards from discrimination in sectors like marketing. Some papers focused on the implications of the implementation of European data protection regulations, specifically the new General Data Protection Regulation (GDPR) of May 2018. The authors emphasized that data protection requirements, such as data gathering minimization and the limitation of use of personal data, might result in barriers to the development of antidiscrimination models that demand the inclusion of sensitive data in order to avoid discriminatory outcomes [35, 95] (see Table 6).

Data mining is said to promote objectivity in classification and profiling because decisions are made by a formal, objective and constant algorithmic process with a more reliable empirical foundation than human decision-making [8]. This feature of objectivity could limit human error and bias. According to some of the literature, automatic data mining could also be used to discover and assess discriminatory practices in classification and data mining. Through the construction of discrimination-aware algorithmic models (e.g. [10, 71]), individuals who suspect that they are being discriminated against could be helped to identify and assess direct/indirect discrimination, favoritism or affirmative action, and decision makers (such as employers, insurance company managers and so on) could be protected against wrongful discrimination allegations. Some of the papers also highlighted that the potential of Big Data technologies to integrate socioeconomic data, mobile data and geographical data could promote equitable and beneficial implementations in various sectors. In healthcare, for example, the integration of healthcare data with spatial contextual information might help identify areas and groups that require health promotion [47]; moreover, the use of Big Data, profiling and classification could foster equity with regard to health disparities in research, since it could promote the implementation of tailored strategies that take into account an individual’s ethnicity, living conditions and general lifestyle [6]. Economic and urban development is another area in which data mining could help foster equity. The integration of analysis from mobile phone activity and socio-economic factors within geographical data could help monitor and assess social structural inequalities to promote the implementation of more equitable city development and growth [55, 83, 85]. Migration could also benefit from the use of Big Data technologies, as they can provide scholars and activists with more accurate data regarding migration flows and thus help prepare and enhance humanitarian processes [1]. Finally, two papers discussed the positive influence of social media. One [59] analyzed how text mining could be used to assess the level and diffusion of discrimination against people affected by Human Immunodeficiency Virus Infection (HIV) and Acquired Immune Deficiency Syndrome (AIDS) in popular social media like Facebook and, at the same time, to implement awareness-raising campaigns to spread tolerance. The other [18] claimed that social media could be used to enhance the participation of people receiving pediatric palliative care, a particularly vulnerable group, in research.

The majority of the reviewed papers (49 out of 61) date from the last 5 years. This shows that although Big Data has been a trending buzzword in the scientific literature since 2011 [16], the problem of algorithmic discrimination has become of prime interest only recently, in conjunction with the publication of the White House report of 2014 [65]. Hence, scholarly reflection on this issue has appeared rather late, leaving potentially discriminatory outcomes of data mining unaddressed for a long time. Moreover, in line with other studies [56], our review indicates that while a theoretical discussion on this topic is finally emerging, empirical studies on discrimination in data mining, both in the field of law and social sciences, are largely lacking. This is highly problematic, especially in light of the new forms of disparate treatment that arise with the increased “datafication” of society. Price and health prediction discrimination (e.g. in insurance policies), for example, are not illegal but might become ethically problematic if persons are denied access to essential goods or services based on their income or lifestyle. More evidence-based studies on the possible harmful use of these practices are urgently needed if we want to understand the complexity of this problem in depth. In addition, it is interesting to notice that no paper examined discrimination in relation to the four V’s of Big Data, as they focused more on the classificatory and algorithmic issues of data analytics. It is thus important that future studies also take into account harmful discrimination arising from the unique characteristics of Big Data, such as the veracity of the data sets, the constraints related to the high volume of data and the velocity of their production.

Although the majority of papers were theoretical in nature, the term discrimination was presented as self-explanatory and linked to other notions such as injustice, inequality and unequal treatment, with the exception of some papers in law and computer science. This overall lack of a working definition in the literature is highly problematic, for several reasons.

First, given that data mining technologies are purposely created to classify, discern, divide and separate individuals, groups or actions [8], discussing the problem of unfair discrimination in the absence of a clear definition creates confusion. The discrimination operated in data-mining, in fact, is not in itself illegal or ethically wrong as long as it limits itself to making a distinction between people with different characteristics [35]. For example, distinguishing between minors and adults is a socially and legally accepted practice of “neutral discrimination”; based on a straightforward distinction of age (in most countries set at 18 years old), individuals are treated differently: adults have different rights and duties than minors, they can drive and vote, they are judged differently in a court of law and so on. Moreover, even efforts to achieve social equality sometimes imply a sort of differential treatment; for example, in the case of gender equality, divergent treatment of individuals based on gender is allowed if such treatment is adopted with the long-term goal of evening out social disparities [87]. Hence, if researchers want to discuss the problem of discrimination in data-mining, a distinction between harmful and unfair versus neutral or fair discrimination is of utmost importance.

Second, without an adequate definition of discrimination, it is difficult for computer scientists and programmers to appropriately implement algorithms. In fact, to avoid unfair practices, measure fairness and quantify illegal discrimination [43], they need to translate the notion of discrimination into a formal statistical set of operations. The need for this expert knowledge may explain why, compared to other researchers in the field, computer scientists have been at the forefront of the search for a viable definition.
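As a hedged illustration of what translating discrimination into a formal statistical set of operations might look like, the sketch below computes two commonly used group-level measures: the statistical parity difference (the gap in positive-decision rates between a protected group and a reference group) and the disparate impact ratio (the quotient of those rates). These measures are only one possible formalization and are not claimed to be the definitions adopted in the reviewed papers; the function names and the toy data are hypothetical.

```python
def positive_rate(decisions, groups, group):
    """Share of positive decisions (decision == 1) within one group."""
    selected = [d for d, g in zip(decisions, groups) if g == group]
    if not selected:
        raise ValueError(f"no records for group {group!r}")
    return sum(selected) / len(selected)


def statistical_parity_difference(decisions, groups, protected, reference):
    """Positive rate of the protected group minus that of the reference group.

    A value of 0 means both groups receive positive decisions at the same rate.
    """
    return positive_rate(decisions, groups, protected) - positive_rate(
        decisions, groups, reference
    )


def disparate_impact_ratio(decisions, groups, protected, reference):
    """Positive rate of the protected group divided by that of the reference group.

    A value of 1 means both groups receive positive decisions at the same rate.
    """
    reference_rate = positive_rate(decisions, groups, reference)
    if reference_rate == 0:
        raise ValueError("reference group has no positive decisions")
    return positive_rate(decisions, groups, protected) / reference_rate


# Toy decisions (invented for illustration): 1 = favorable outcome.
toy_decisions = [1, 1, 1, 0, 1, 0, 0, 0]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(statistical_parity_difference(toy_decisions, toy_groups, "b", "a"))
print(disparate_impact_ratio(toy_decisions, toy_groups, "b", "a"))
```

Both functions need the sensitive attribute of every record as input, which illustrates why detecting disparate outcomes is difficult when such attributes cannot be collected. Measures of this kind capture differences in treatment between groups but cannot by themselves decide whether a given difference is unfair.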

Still, despite the need for a working definition of discrimination, we should not forget that it remains an elusive ethical and social notion which cannot and should not be reduced to a “petrified” statistical measurement. As seen in our review, data-mining has given rise to novel forms of differential treatment. To properly understand the implications of these new discriminatory practices, a reconceptualization of the notion of fair and unfair discrimination might be needed. To keep the debate on discrimination in Big Data open, it is important to keep humans in the loop.

Practices of automatic profiling, sorting and decision making through data mining have been introduced with the prima facie concept that Big Data technologies are objective tools capable of overcoming human subjectivity and error resulting in increased fairness [3]. However, data mining can never be fully human-free, not only because humans always risk undermining the presumed fairness and objectivity of the process with subconscious bias, personal values or inattentiveness, but also because they are crucial in order to avoid improper correlations and thus to ensure fairness in data mining. It thus seems that Big Data technologies are deeply tied to this dichotomous dimension where humans are both the cause of its flaws and the overseers of its proper functioning.

One way of keeping the human in the loop is through legislation. Our results, however, show that although legal scholars have tried to address possible unfair discriminatory outcomes of new forms of profiling, Big Data poses important challenges to “traditional” antidiscrimination and privacy protection legislation because core notions, such as motive and intention, are no longer in place [8]. A recurring theme in many papers was that legislation always lags behind technological developments and that, while gaps in legal protection are somehow systemic [35], an overarching legal solution to all unfair discriminatory outcomes of data mining is not feasible [45].

In our review, very few papers offered a pragmatic legal solution to the problem of unfair discrimination in data-mining: for example, one study advocated a generally applicable rule [46], while another suggested the production of a set of precedents built over time through case-by-case adjudication [36]. Both solutions are incompatible with the reality and needs of data management because they are either too rigid [46] or too specialized and protracted [36].

This poor outcome is probably the result of the technically complex nature of data mining and the intrinsically tricky legal designation of what represents unfair discrimination that should be prohibited by law. The new European General Data Protection Regulation (GDPR) is illustrative in this regard. Two key features of the GDPR are data minimization (i.e. data collection and processing should be kept to a minimum) and purpose limitation (i.e. data should be analysed and processed only for the purpose it was collected for). Since both these principles are inspired by data privacy regulations established in the 1970s, they fail to take into account two crucial points that have been reiterated by many computer science, technical and legal scholars in the past few years [31]: first, with Big Data technologies, information is not collected for a specific and limited purpose; rather, it is gathered to discover new and unpredictable patterns and correlations [53]; second, antidiscrimination models require the inclusion of sensitive data in order to detect and avoid discriminatory outcomes [95].

The difficulties encountered in adequately regulating discrimination in Big Data, especially from a legal point of view, could be partly related to a diffuse lack of dialogue among disciplines. The reviewed literature in fact pinpointed that, on the one hand, unfair discrimination is a complex philosophical and legal concept that poses difficulties even for trained data scientists [20], while, on the other, Big Data is a highly technical field, so philosophers, social scientists and lawyers do not always fully understand the implications of algorithmic modelling for discrimination [73].

This mutual lack of understanding highlights the urgent need for multidisciplinary collaboration between fields such as philosophy, social science, law, computer science and engineering. The idea of collaboration between disciplines prompted by the spread of digital technologies is not new. An example of this can be found in the conception of “code as law” first proposed by both Reidenberg and Lessig in the late 1990s, which implies the design of digital technologies to support specific norms and laws such as privacy and antidiscrimination [50, 68]. As shown by our results (e.g. [25, 42, 43]), the “code as law” proposal has been steadily implemented in computer science practice by many scholars who want to implement antidiscrimination rules in algorithmic models to avoid unfair harmful outcomes. Some papers, however, recommended a broader and overarching dialogue among disciplines [22, 31, 45]. Nonetheless, concrete means to put this multidisciplinarity into practice were lacking in the literature.

Finally, a few studies highlighted that Big Data technologies may tackle discrimination and promote equality in various sectors, such as healthcare and urban development [6, 18, 47]. Such interventions, however, might have the opposite effect and create other types of social disparities by widening the divide between people who have access to digital resources and those who do not, on the basis of income, ethnicity, age, skills, and geographical location. The significant number of papers that identified the digital divide as a major cause of inequality indicates how, despite all the efforts made to enhance digital participation across the globe [89, 90], social disparities due to lack of access to digital technologies are increasing in many sectors including health [88], public participation/engagement [9] and public infrastructure development [60, 79]. Scholars are rather sceptical about finding a solution to this problem due to the ever-changing technological landscape that creates new inclusion difficulties [89, 90]. Still, given the promising potential applications of Big Data technologies, more studies should focus on the analysis and implementation of such fair uses of data-mining while considering and avoiding the creation of new divides.

In conclusion, more research is needed on the conceptual challenges that Big Data technologies raise in the context of data mining and discrimination. The lack of adequate terminology regarding digital discrimination and the possible presence of latent bias might mask persistent forms of disparate treatment as normalized practices. Although a few papers tackled the subject of a possible conceptual revision of discrimination and fairness [79], no study has done so in an exhaustive way.

A total of 61 peer-reviewed articles in English qualified for inclusion and were further assessed. It might thus be possible that studies in other languages and relevant grey literature have been overlooked. Aside from these limitations, this is the first study to comprehensively explore the relation between Big Data and discrimination from a multidisciplinary perspective.

Big Data offers great promise but also poses considerable risks. The literature review highlights that unfair discrimination is one of the most pressing but, at the same time, most often underestimated issues in data mining. A wide range of papers proposed solutions on how to avoid discrimination in the use of data technologies. Though most of the suggested strategies were practical computational/algorithmic methods, numerous papers recommended human solutions. Transparency was a commonly suggested solution to enhance algorithmic fairness. Improving algorithmic transparency and resolving the black box issue might thus be the best course to undertake when dealing with discriminatory issues in data analytics. However, our study results identify a considerable number of barriers to the proposed strategies, such as technical difficulties, conceptual challenges, human bias and shortcomings of legislation, all of which hamper the implementation of such fair data mining practices. Due to the risk of discrimination in data mining and predictive analytics and the striking shortage of empirical studies on the topic that our review has brought to light, we argue that more empirical research is needed to assess how discriminatory practices are deliberately and accidentally emerging from their increased use in numerous sectors such as healthcare, marketing and migration. Moreover, since most studies focused on the negative discriminatory consequences of Big Data, more research is needed on how data mining technologies, if properly implemented, could also be an effective tool to prevent unfair discrimination and promote equality. As more reports from the press are emerging on the positive use of data technologies to assist vulnerable groups, future research should focus on the diffusion of similar beneficial applications.
However, since even such practices are creating new forms of disparity between those who can access digital technologies and those who cannot, research should also focus more on the implementation of practical strategies to mitigate the Digital Divide.

Authors’ contributions

MF collected the data, performed the analysis and drafted the manuscript. EDC supported with data analysis, contributed in writing the manuscript and revised the initial versions of the manuscript. BE provided general guidance, proof-read the manuscript, suggested necessary amendments and helped in revising the paper. All authors read and approved the final manuscript.

Acknowledgements

We thank Dr. David Shaw for his valuable contribution to the project.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

The datasets used for the current study are available from the corresponding author on reasonable request.

Funding

The funding for this study was provided by the Swiss National Science Foundation in the framework of the National Research Program “Big Data”, NRP 75 (Grant-No: 407540_167211).

Open Access

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.