In 1948, when Claude Shannon was inventing information science [pdf] (and, I’d say, information itself), he took as an explanatory example a simple algorithm for predicting the next element of a sentence. For example, treating each letter as equiprobable, he came up with sentences such as:

XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.

If you instead use the frequencies with which letters follow one another in English, taking triplets of letters as the unit, you come up with sentences that seem more language-like:

IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.

Then Shannon changed his units from letters to words, choosing each word based on the word before it, and got:

THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.

Pretty good! But still gibberish.
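For the technically curious, here is a minimal sketch of that kind of word-level generator. It is not Shannon's own procedure (he worked by hand from printed books); it is just a lookup table of which word follows each short run of words. The names `build_model`, `generate`, and `order` are mine, and the sample text is made up. `order` sets how many preceding words serve as context: 1 looks at pairs of words, 2 at triplets.

```python
import random
from collections import defaultdict

def build_model(text, order=2):
    """Map each run of `order` words to the list of words that followed it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context].append(words[i + order])
    return model

def generate(model, length=20, seed=0):
    """Walk the table: pick a follower at random, slide the window, repeat."""
    rng = random.Random(seed)
    context = rng.choice(sorted(model))
    output = list(context)
    for _ in range(length - len(context)):
        followers = model.get(context)
        if not followers:          # dead end: this context only ended the text
            break
        # Repeated followers appear repeatedly in the list, so a uniform pick
        # is automatically weighted by how often each follower occurred.
        output.append(rng.choice(followers))
        context = tuple(output[-len(context):])
    return " ".join(output)

# Made-up training text, just large enough to show the mechanics.
sample_text = (
    "the head of the attack on the writer is the character of the point "
    "and the point of the attack is the problem of the writer"
)
model = build_model(sample_text, order=2)
print(generate(model, length=15))
```

Every three-word run in the output occurred somewhere in the training text, which is why the result sounds locally plausible and globally aimless.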

Now jump ahead seventy years and try to figure out which pieces of the following story were written by humans and which were generated by a computer:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

The answer: The first paragraph was written by a human being. The rest was generated by a machine learning system trained on a huge body of text. You can read about it in a fascinating article (pdf of the research paper) by its creators at OpenAI. (Those creators are: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.)

There are two key differences between this approach and Shannon’s.

First, the new approach analyzed a very large body of documents from the Web. It ingested 45 million pages linked in Reddit comments that got more than three upvotes. After removing duplicates and some other cleanup, the data set was reduced to 8 million Web pages. That is a lot of pages. Of course the use of Reddit, or any one site, can bias the dataset. But one of the aims was to compare this new, huge, dataset to the results from existing sets of text-based data. For that reason, the developers also removed Wikipedia pages from the mix since so many existing datasets rely on those pages, which would smudge the comparisons.

(By the way, a quick Google search for any page from before December 2018 mentioning both “Jorge Pérez” and “University of La Paz” turned up nothing. The AI is constructing, not copy-pasting.)

The second distinction from Shannon’s method: the developers used machine learning (ML) to create a neural network, rather than relying on a table of frequencies of words in triplet sequences. ML creates a far, far more complex model that can assess the probability of the next word based on the entire context of its prior uses.

The results can be astounding. While the developers freely acknowledge that the examples they feature are somewhat cherry-picked, they say:

When prompted with topics that are highly represented in the data (Brexit, Miley Cyrus, Lord of the Rings, and so on), it seems to be capable of generating reasonable samples about 50% of the time. The opposite is also true: on highly technical or esoteric types of content, the model can perform poorly.

There are obviously things to worry about as this technology advances. For example, fake news could become the Earth’s most abundant resource. For fear of its abuse, its developers are not releasing the full dataset or model weights. Good!

Nevertheless, the possibilities for research are amazing. And, perhaps most important in the long term, one by one the human capabilities that we take as unique and distinctive are being shown to be replicable without an engine powered by a miracle.

That may be a false conclusion. Human speech does not consist simply of the utterances we make but of the complex intentional and social systems in which those utterances are more than just flavored wind. But ML intends nothing and appreciates nothing. Nothing matters to ML. Nevertheless, knowing that sufficient silicon can duplicate the human miracle should shake our confidence in our species’ special place in the order of things.

(FWIW, my personal theology says that when human specialness is taken as conferring special privilege, any blow to it is a good thing. When that specialness is taken as placing special obligations on us, then at its very worst it’s a helpful illusion.)

For the past six months I’ve been a writer in residence embedded in a machine learning research group — PAIR (People + AI Research) — at the Google site in Cambridge, MA. I was recently renewed for another 6 months.

No, it’s not clear what a “writer in residence” does. So, I’ve been writing occasional posts that try to explain and contextualize some basic concepts in machine learning from the point of view of a humanities major who is deeply lacking the skills and knowledge of a computer scientist. Fortunately the developers at PAIR are very, very patient.

Here are three of the posts:

Machine Learning’s Triangle of Error: “…machine learning systems ‘think’ about fairness in terms of three interrelated factors: two ways the machine learning (ML) can go wrong, and the most basic way of adjusting the balance between these potential errors.”

Confidence Everywhere!: “… these systems are actually quite humble. It may seem counterintuitive, but we could learn from their humility.”

Hashtags and Confidence: “…in my fever dream of the future, we routinely say things like, “That celebrity relationship is going to last, 0.7 for sure!” …Expressions of confidence probably (0.8) won’t take exactly that form. But, then, a decade ago, many were dubious about the longevity of tagging…”

I also wrote about five types of fairness, which I posted about earlier: “…You appoint five respected ethicists, fairness activists, and customer advocates to figure out what gender mix of approved and denied applications would be fair. By the end of the first meeting, the five members have discovered that each of them has a different idea of what’s fair…”

I’ve also started writing an account of my attempt to write my very own machine learning program using TensorFlow.js, which lets you train a machine learning system in your browser. (TensorFlow.js is a PAIR project.) This project is bringing me face to face with the details of implementing even a “Hello, world”-ish ML program. (My project aims at suggesting tags for photos, based on a set of tagged images (Creative Commons-ed) from Flickr. It’s a toy, of course.)

I have a bunch of other posts in the pipeline, as well as a couple of larger pieces on larger topics. Meanwhile, I’m trying to learn as much as I possibly can without becoming the most annoying person in Cambridge. But it might be too late to avoid that title…

Markets and institutions are parts of a complex ecosystem, Neil says. His research looks at data from satellites that show how the Earth is changing: crops, water, etc. Once you’ve gathered the data, you can use machine learning to visualize the changes. There are ecosystems, including of human behavior, that are affected by this. It affects markets and institutions. E.g., a drought may require an institutional response, and affect markets.

Traditional markets, financial markets, and gig economies all share characteristics. Farmers markets are complex ecosystems of people with differing information and different amounts of it, i.e. asymmetric info. Same for financial markets. Same for gig economies.

Indian markets have been failing; there have been 300,000 suicides in the last 30 years. Stock markets have crashed suddenly due to blackbox marketing; in some cases we still don’t know why. And London has banned Uber. So, it doesn’t matter which markets or institutions we look at, they’re losing our trust.

An article in New Scientist asked what we can do to regain this trust. For black box AI, there are questions of fairness and equity. But what would human-machine collaboration be like? Are there design principles for markets?

Neil stops for us to discuss.

Q: How do you define justice?

A: Good question. Fairness? Freedom? The designer has a choice about how to define it.

Q: A UN project created an IT platform that put together farmers and direct consumers. The pricing seemed fairer to both parties. So, maybe avoid intermediaries, as a design principle?

Neil continues. So, what is the concept of justice here?

1. Rawls and Kant: Transcendental institutionalism. It’s deontological: follow a principle for perfect justice. Use those principles to define a perfect institution. The properties are defined by a social contract. But it doesn’t work, as in the examples we just saw. What is missing? People and society. [I.e., you run the institution according to principles, but that doesn’t guarantee that the outcome will be fair and just. My example: Early Web enthusiasts like me thought the Web was an institution built on openness, equality, creative anarchy, etc., yet that obviously doesn’t ensure that the outcome will share those properties.]

2. Realization-focused institutionalism (Sen 2009): How to reverse this trend. It is consequentialist: what will be the consequences of the design of an institution? It’s a comparative assessment of different forms of institutions. Instead of asking for the perfectly just society, Sen asks how justice can be advanced. The most critical tool for evaluating any institution is to look at how it actually changes people’s lives.

Sen argues that principles are important. They can be expressed by “niti,” Sanskrit for rules and institutions. But you also need nyaya: a form of social arrangement that makes sure that those rules are obeyed. These rules come from social choice, not social contract.

Example: Gig economies. The data comes from Mechanical Turk, Upwork, CrowdFlower, etc. This creates employment for many people, but it’s tough. E.g., identifying images. Use supervised learning for this. The Turkers, etc., do the labelling to train the image recognition system. The Turkers make almost no money at this. This is the wicked problem of market design: The worker can have identifications rejected, sometimes with demeaning comments.

“The Market for Lemons” (Akerlof, 1970): all the cars started to look alike and now all gig-workers look alike to those who hire them: there’s no value given to bringing one’s value to the labor.

So, who owns the data? Who has a stake in the models? In the intellectual property?

If you’re a gig worker, you’re working with strangers. You don’t know the reputation of the person giving you data. Or renting you the Airbnb apartment. So, let’s put a rule: reputation is the backbone. In sharing economies, most of the ratings are the highest. Reputation inflation. So, can we trust reputation? This happens because people have no incentive to rate honestly. There’s social pressure to give a positive rating.

So, thinking about Sen, can we think about an incentive for honest reputation? Neil’s group has been thinking about a system [I thought he said Boomerang, but I can’t find that]. It looks at the workers’ incentives. It looks at the workers’ ratings of each other. If you’re a requester, you’ll see the workers you like first.

Does this help AI design?

MoralMachine has had 1.3M voters and 18M pairwise comparisons (i.e., people deciding to go straight or right). Can this be used as a voting based system for ethical decision making (AAAI 2018)? You collect the pairwise preferences, learn the model of preference, come to a collective preference, and have voting rules for collective decision.
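Here is a toy sketch of that four-step pipeline, to make the shape of it concrete. It is not the method from the AAAI 2018 paper: I've swapped in the simplest stand-ins I could think of, counting wins as each voter's "preference model" and using a Borda count as the voting rule. The voters, the options A, B, and C, and the function names are all hypothetical.

```python
from collections import Counter

def voter_scores(comparisons):
    """Stand-in 'preference model': count how often this voter picked each option."""
    wins = Counter()
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0      # make sure losers appear with a score of zero
    return wins

def borda(rankings):
    """Borda count: an option in position p of an n-item ranking earns n-1-p points."""
    totals = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, option in enumerate(ranking):
            totals[option] += n - 1 - position
    return totals

# Step 1: hypothetical pairwise data, as (chosen, rejected) pairs per voter.
votes = {
    "voter_1": [("A", "B"), ("A", "C"), ("B", "C")],
    "voter_2": [("B", "A"), ("B", "C"), ("A", "C")],
    "voter_3": [("A", "B"), ("C", "B"), ("A", "C")],
}

# Step 2: turn each voter's comparisons into a ranking.
rankings = []
for comparisons in votes.values():
    scores = voter_scores(comparisons)
    # Sort by wins, breaking ties alphabetically so the result is deterministic.
    rankings.append(sorted(scores, key=lambda o: (-scores[o], o)))

# Steps 3 and 4: combine the rankings with a voting rule.
totals = borda(rankings)
collective = sorted(totals, key=lambda o: (-totals[o], o))
print(collective)   # ['A', 'B', 'C']
```

Note that the choice of voting rule is itself a design decision with ethical weight: a different rule can produce a different collective ranking from the same votes.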

Q: Aren’t you collecting preferences, not normative judgments? The data says people would rather kill fat people than skinny ones.

A: You need the social behavior but also rules. For this you have to bring people into the loop.

Q: How do we differentiate between what we say we want and what we really want?

A: There are techniques, such as “Bayesian Truth Serum.”

Conclusion: The success of markets, institutions, or algorithms is highly dependent on how they actually affect people’s lives. This thinking should be central to the design and engineering of socio-technical systems.

Patrick Sharkey [twitter: patrick_sharkey] uses a Twitter thread to evaluate the evidence about a possible relationship between exposure to lead and crime. The thread is a bit hard to get unspooled correctly, but it’s worth it as an example of:

1. Thinking carefully about complex evidence and data.

2. How Twitter affects the reasoning and its expression.

3. The complexity of data, which will only get worse (= better) as machine learning can scale up their size and complexity.

Note: I lack the skills and knowledge to evaluate Patrick’s reasoning. And, hat tip to David Lazer for the retweet of the thread.

A new research paper, published Jan. 24 with 34 co-authors and not peer-reviewed, claims better accuracy than existing software at predicting outcomes like whether a patient will die in the hospital, be discharged and readmitted, and their final diagnosis. To conduct the study, Google obtained de-identified data of 216,221 adults, with more than 46 billion data points between them. The data span 11 combined years at two hospitals,

That’s from an article in Quartz by Dave Gershgorn (Jan. 27, 2018), based on the original article by Google researchers posted at arXiv.org.

…Google claims vast improvements over traditional models used today for predicting medical outcomes. Its biggest claim is the ability to predict patient deaths 24-48 hours before current methods, which could allow time for doctors to administer life-saving procedures.

Dave points to one of the biggest obstacles to this sort of computing: the data are in such different formats, from hand-written notes to the various form-based data that’s collected. It’s all about the magic of interoperability … and the frustration when data (and services and ideas and language) can’t easily work together. Then there’s what Paul Edwards, in his great book A Vast Machine calls “data friction”: “…the costs in time, energy, and attention required simply to collect, check, store, move, receive, and access data.” (p. 84)

On the other hand, machine learning can sometimes get past the incompatible expression of data in a way that’s so brutal that it’s elegant. One of the earlier breakthroughs in machine learning came in the 1990s when IBM analyzed the English and French versions of Hansard, the bi-lingual transcripts of the Canadian Parliament. Without the machines knowing the first thing about either language, the system produced more accurate results than software that was fed rules of grammar, bilingual dictionaries, etc.

Indeed, the abstract of the Google paper says “Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each patient’s record. We propose a representation of patients’ entire, raw EHR records based on the Fast Healthcare Interoperability Resources (FHIR) format.” It continues: “We demonstrate that deep learning methods using this representation are capable of accurately predicting multiple medical events from multiple centers without site-specific data harmonization.”

The paper also says that their approach affords clinicians “some transparency into the predictions.” Some transparency is definitely better than none. But, as I’ve argued elsewhere, in many instances there may be tools other than transparency that can give us some assurance that AI’s outcomes accord with our aims and our principles of fairness.

I found this article by clicking on Dave Gershgorn’s byline on a brief article about the Wired version of the paper of mine I referenced in the previous paragraph. He does a great job explaining it. And, believe me, it’s hard to get a writer (well, me, anyway) to acknowledge that without having to insert even one caveat. Thanks, Dave!

Ulla has been working on the Jyväskylä Longitudinal Study of Dyslexia (JLD). Globally, one third of people can’t read or have poor reading skills. One fifth of Europe also. About 15% of children have learning disabilities.

One issue: knowing which sound goes with which letters. GraphoLearn is a game to help students with this, developed by a multidisciplinary team. You learn a word by connecting a sound to a written letter. Then you can move to syllables and words. The game teaches by trial and error. If you get it wrong, it immediately tells you the correct sound. It uses a simple adaptive approach to select the wrong choices that are presented. The game aims at being entertaining, and also motivates with points and rewards. It’s a multi-modal system: visual and audio. It helps dyslexics by training them on the distinctions between sounds. Unlike human beings, it never displays any impatience.

It adapts to the user’s skill level, automatically assessing performance and aiming at 80% accuracy so that it’s challenging but not too challenging.
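I don't know how GraphoLearn actually implements this, but the general idea of steering toward a target accuracy can be sketched in a few lines. Everything here is my own assumption: the class name, the ten-answer window, the ten difficulty levels, and the rule of moving one level at a time.

```python
from collections import deque

class AdaptiveDifficulty:
    """Nudge a difficulty level so recent accuracy hovers near a target."""

    def __init__(self, target=0.8, window=10, levels=10):
        self.target = target
        self.recent = deque(maxlen=window)   # last `window` answers, True/False
        self.levels = levels
        self.level = 1

    def accuracy(self):
        """Share of correct answers in the current window, or None if empty."""
        return sum(self.recent) / len(self.recent) if self.recent else None

    def record(self, correct):
        """Store one answer and adjust the level once the window is full."""
        self.recent.append(bool(correct))
        if len(self.recent) < self.recent.maxlen:
            return self.level
        acc = self.accuracy()
        if acc > self.target and self.level < self.levels:
            self.level += 1          # too easy: make it harder
            self.recent.clear()      # judge the new level on fresh answers
        elif acc < self.target and self.level > 1:
            self.level -= 1          # too hard: ease off
            self.recent.clear()
        return self.level

# A player who gets ten in a row right is moved up one level.
tutor = AdaptiveDifficulty()
for _ in range(10):
    tutor.record(True)
print(tutor.level)   # 2
```

A player who answers exactly eight of ten correctly stays put, which is the "challenging but not too challenging" zone.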

13,000 players have played in Finland, and more in other languages. Ulla displays data that shows positive results among students who use GraphoLearn, including when teaching English where every letter has multiple pronunciations.

There are some difficulties analyzing the logs: there’s great variability in how kids play the game, how long they play, etc. There’s no background info on the students. [I missed some of this.] There’s an opportunity to come up with new ways to understand and analyze this data.

Q&A

Q: Your work is amazing. When I was learning English I could already read Finnish, so I made natural mispronunciations of ape, anarchist, etc. How do you cope with this?

A: Spoken and written English are like separate languages, especially if Finnish is your first language where each letter has only one pronunciation. You need a bigger unit to teach a language like English. That’s why we have the Rime approach where we show the letters in more context. [I may have gotten this wrong.]

There’s a triennial worldwide study by the OECD, PISA, to assess students. Usually, people are only interested in its ranking of education by country. Finland does extremely well at this. This is surprising because Finland does not do particularly well in the factors that are taken to produce high quality educational systems. So Finnish ed has been studied extensively. PISA augments this analysis using learning analytics. (The US does at best average in the OECD ranking.)

Traditional research usually starts with the literature, develops a hypothesis, collects the data, and checks the result. PISA’s data mining approach starts with the data. “We want to find a needle in the haystack, but we don’t know what the needle looks like.” That is, they don’t know what type of pattern to look for.

Results of 2012 PISA: If you cluster all 24M students with their characteristics and attitudes without regard to their country you get clusters for Asia, developing world, Islamic, western countries. So, that maps well.

For Finland, the most salient factor seems to be its comprehensive school system that promotes equality and equity.

In 2015 for the first time there was a computerized test environment available. Most students used it. The logfile recorded how long students spent on a task and the number of activities (mouse clicks, etc.) as well as the score. They examined the Finnish log file to find student profiles, related to student’s strategies and knowledge. Their analysis found five different clusters. [I can’t read the slide from here. Sorry.] They are still studying what this tells us. (They purposefully have not yet factored in gender.)
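The talk didn't say which clustering method the researchers used, so take this as an illustration of the general idea only. It runs plain k-means (my choice, not theirs) over made-up log records of time on task, number of actions, and score, each already scaled to the 0-1 range. With real logs you would need to normalize the columns first, since minutes and mouse clicks live on very different scales.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest center, then re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centers

# Hypothetical records: (time on task, number of actions, score), scaled to 0-1.
records = [
    (0.10, 0.20, 0.30), (0.15, 0.25, 0.35), (0.12, 0.18, 0.28),   # quick, few actions, low score
    (0.80, 0.90, 0.85), (0.85, 0.80, 0.90), (0.78, 0.88, 0.80),   # slow, many actions, high score
]
labels, centers = kmeans(records, k=2)
print(labels)
```

The algorithm only groups the records. Deciding what a cluster means (a strategy? a knowledge gap?) is still the researchers' job, which is presumably why they are "still studying what this tells us."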

Nov. 2017 results showed that girls did far better than boys. The test was done in a chat environment which might have been more familiar for the girls? Is the computerization of the tests affecting the results? Is the computerization of education affecting the results? More research is needed.

I’m at the STEAM ed Finland conference in Jyväskylä. Harri Ketamo is giving a talk on “micro-learning.” He recently won a prestigious prize for the best new ideas in Finland. He is interested in the use of AI for learning.

We don’t have enough good teachers globally, so we have to think about ed in new ways, Harri says. Can we use AI to bring good ed to everyone without hiring 200M new teachers globally? If we paid teachers equivalent to doctors and lawyers, we could hire those 200M. But we are apparently not willing to do that.

One challenge: Career coaching. What do you want to study? Why? What are the skills you need? What do you need to know?

His company does natural language analysis — not word matches, but meaning. As an example he shows a shareholder agreement. Such agreements always have the same elements. After being trained on law, his company’s AI can create a map of the topic and analyze a block of text to see if it covers the legal requirements…the sort of work that a legal assistant does. For some standard agreements, we may soon not need lawyers, he predicts.

The system’s language model is a mess of words and relations. But if you zoom out from the map, the AI has clustered the concepts. At the Slush Shanghai conference, his AI could develop a list of the companies a customer might want to meet based on a text analysis of the companies’ web sites, etc. Likewise if your business is looking for help with a project.

Finland has a lot of public data about skills and openings. Universities’ curricula are publicly available.[Yay!] Unlike LinkedIn, all this data is public. Harri shows a map that displays the skills and competencies Finnish businesses want and the matching training offered by Finnish universities. The system can explore public information about a user and map that to available jobs and the training that is required and available for it. The available jobs are listed with relevancy expressed as a percentage. It can also look internationally to find matches.

The AI can also put together a course for a topic that a user needs. It can tell what the core concepts are by mining publications, courses, news, etc. The result is an interaction with a bot that talks with you in a WhatsApp-like way. (See his paper “Agents and Analytics: A framework for educational data mining with games based learning”). It generates tests that show what a student needs to study if she gets a question wrong.

His newest project, in process: Libraries are the biggest collections of creative, educational material, so the AI ought to point people there. His software can find the common sources among courses and areas of study. It can discover the skills and competencies that materials can teach. This lets it cluster materials around degree programs. It can also generate micro-educational programs, curating a collection of readings.

A: Yes. We’ve found that people get 20-40% better performance when our software is used in a blended model, i.e., with a human teacher. It helps motivate people if they can see the areas they need to work on disappear over time.

Q: The sw only found male authors in the example you put up of automatically collated materials.

A: Small training set. Gender is not part of the metadata in Finland.

Q: Don’t you worry that your system will exacerbate bias?

A: Humans are biased. AI is a black box. We need to think about how to manage this.

Q: [me] Are the topics generated from the content? Or do you start off with an ontology?

A: It creates its ontology out of the data.

Q: [me] Are you committing to make sure that the results of your AI do not reflect the built in biases?

A: Our news system on the Web presents a range of views. We need to think about how to do this for gender issues with the course software.

I’ve been at a two-day workshop sponsored by Michigan State University and the National Science Foundation: “Workshop on Trustworthy Algorithmic Decision-Making.” After multiple rounds of rotating through workgroups iterating on five different questions, each group presented its findings: questions, insights, areas of future research.

Conduct of Data Science

Who defines and how do we ensure good practice in data science and machine learning?

Why is the topic important? Because algorithms are important. And they have important real-world effects on people’s lives.

Why is the problem difficult?

Wrong incentives.

It can be difficult to generalize practices.

Best practices may be good for one goal but not another, e.g., efficiency but not social good. Also: Lack of shared concepts and vocabulary.

How to mitigate the problems?

Change incentives

Increase communication via vocabularies, translations

Education through MOOCs, meetups, professional organizations

Enable and encourage resource sharing: an open source lesson about bias, code sharing, data set sharing

Accountability group

The problem: How to integratively assess the impact of an algorithmic system on the public good? “Integrative” = the impact may be positive and negative and affect systems in complex ways. The impacts may be distributed differently across a population, so you have to think about disparities. These impacts may well change over time.

We aim to encourage work that is:

Aspirationally causal: measuring outcomes causally but not always through randomized control trials.

The goal is not to shut down algorithms but to make positive contributions that generate solutions.

This is a difficult problem because:

Lack of variation in accountability, enforcements, and interventions.

It’s unclear what outcomes should be measured and how. This is context-dependent.

It’s unclear which interventions are the highest priority

Why progress is possible: There’s a lot of good activity in this space. And it’s early in the topic so there’s an ability to significantly influence the field.

What are the barriers for success?

Incomplete understanding of contexts. So, think of it in terms of socio-cultural approaches, and make it interdisciplinary.

The topic lies between disciplines. So, develop a common language.

High-level triangulation is difficult. Examine the issues at multiple scales, multiple levels of abstraction. Where you assess accountability may vary depending on what level/aspect you’re looking at.

Handling Uncertainty

The problem: How might we holistically treat and attribute uncertainty through data analysis and decision systems? Uncertainty exists everywhere in these systems, so we need to consider how it moves through a system. This runs from choosing data sources to presenting results to decision-makers and people impacted by these results, and beyond that its incorporation into risk analysis and contingency planning. It’s always good to know where the uncertainty is coming from so you can address it.

Why difficult:

Uncertainty arises from many places

Recognizing and addressing uncertainties is a cyclical process

End users are bad at evaluating uncertain info and incorporating uncertainty in their thinking.

Many existing solutions are too computationally expensive to run on large data sets

Progress is possible:

We have sampling-based solutions that provide a framework.

Some app communities are recognizing that ignoring uncertainty is reducing the quality of their work
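The group didn't name specific sampling-based methods, so this is my own pick of a familiar one: the percentile bootstrap, which estimates how uncertain a statistic is by recomputing it on many resampled copies of the data. The function name, the defaults, and the measurements below are all hypothetical.

```python
import random
import statistics

def bootstrap_interval(data, stat=statistics.mean, resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap: resample with replacement, collect the statistic,
    and read the interval off the sorted results."""
    rng = random.Random(seed)
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(resamples)
    )
    lower_index = int((1 - level) / 2 * resamples)
    upper_index = int((1 + level) / 2 * resamples) - 1
    return estimates[lower_index], estimates[upper_index]

# Hypothetical measurements.
data = [4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.1, 5.8, 4.4, 5.2]
low, high = bootstrap_interval(data)
print(f"mean = {statistics.mean(data):.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

It also illustrates the cost complaint in the list above: two thousand resamples means two thousand passes over the data, which gets expensive quickly on large data sets.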

How to evaluate and recognize success?

A/B testing can show that decision making is better after incorporating uncertainty into analysis

Statistical/mathematical analysis

Barriers to success

Cognition: Train users.

It may be difficult to break this problem into small pieces and solve them individually

Gaps in theory: many of the problems cannot currently be solved algorithmically.

The presentation ends with a note: “In some cases, uncertainty is a useful tool.” E.g., it can make the system harder to game.

Adversaries, workarounds, and feedback loops

Adversarial examples: add a perturbation to a sample and it disrupts the classification. An adversary tries to find those perturbations to wreck your model. Sometimes this is used not to hack the system so much as to prevent the system from, for example, recognizing your face during a protest.
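To make the adversarial idea concrete, here is a minimal sketch against the simplest possible model, a linear classifier. This is my own illustration, not something presented at the workshop; the weights, the input, and the function names are invented. The attack nudges every feature by a small amount in whichever direction pushes the score toward the other label.

```python
def predict(weights, bias, x):
    """Linear classifier: label 1 if the weighted sum is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def perturb(weights, x, label, epsilon):
    """Shift every feature by at most epsilon in the direction that hurts `label`."""
    direction = -1 if label == 1 else 1
    sign = lambda w: (w > 0) - (w < 0)
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]

# Hypothetical model and input, chosen only for illustration.
weights, bias = [2.0, -1.0, 0.5], -0.2
x = [0.3, 0.2, 0.1]

original = predict(weights, bias, x)
x_adv = perturb(weights, x, original, epsilon=0.1)
print(original, predict(weights, bias, x_adv))   # 1 0
```

No feature moved by more than 0.1, yet the label flipped. Attacks on deep networks follow the same logic, using the model's gradient in place of the fixed weights.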

Feedback loops: A recidivism prediction system says you’re likely to commit further crimes, which sends you to prison, which increases the likelihood that you’ll commit further crimes.

What is the problem: How should a trustworthy algorithm account for adversaries, workarounds, and feedback loops?

Who are the stakeholders?

System designers, users, non-users, and perhaps adversaries.

Why is this a difficult problem?

It’s hard to define the boundaries of the system

From whose vantage point do we define adversarial behavior, workarounds, and feedback loops?

Unsolved problems

How do we reason about the incentives users and non-users have when interacting with systems in unintended ways?

How do we think about oversight and revision in algorithms with respect to feedback mechanisms?

How do we monitor changes, assess anomalies, and implement safeguards?

How do we account for stakeholders while preserving rights?

How to recognize progress?

Mathematical model of how people use the system

Define goals

Find stable metrics and monitor them closely

Proximal metrics. Causality?

Establish methodologies and see them used

See a taxonomy of adversarial behavior used in practice

Likely approaches

Security methodology for anticipating unintended behaviors and adversarial interactions. Monitor and measure.

Algorithms and trust

The problem: What are the processes through which different stakeholders come to trust an algorithm?

Multiple processes lead to trust.

Procedural vs. substantive trust: are you looking at the weights of the algorithms (e.g.), or what were the steps to get you there?

Social vs personal: did you see the algorithm at work, or are you relying on peers?

These pathways are not necessarily predictive of each other.

Stakeholders build trust through multiple lenses and priorities

the builders of the algorithms

the people who are affected

those who oversee the outcomes

Mini case study: a child services agency that does not want to be identified. [All of the following is 100% subject to my injection of errors.]

The agency uses a predictive algorithm. The stakeholders range from the children needing a family, to NYers as a whole. The agency knew what went into the model. “We didn’t buy our algorithm from a black-box vendor.” They trusted the algorithm because they staffed a technical team who had credentials and had experience with ethics…and who they trusted intuitively as good people. Few of these are the quantitative metrics that devs spend their time on. Note that FAT (fairness, accountability, transparency) metrics were not what led to trust.

Temporality:

Processes that build trust happen over time.

Trust can change or maybe be repaired over time.

“The timescales to build social trust are outside the scope of traditional experiments,” although you can perhaps find natural experiments.

Barriers:

Assumption of reducibility or transfer from subcomponents

Access to internal stakeholders for interviews and process understanding

Some elements are very long term

What’s next for this workshop

We generated a lot of scribbles, post-it notes, flip charts, Slack conversations, slide decks, etc. They’re going to put together a whitepaper that goes through the major issues, organizing them, and tries to capture the complexity while helping to make sense of it.

There are weak or no incentives to set appropriate levels of trust

Key takeaways:

Trust is irreducible to FAT metrics alone

Trust is built over time and should be defined in terms of the temporal process

Isolating the algorithm as an instantiation misses the socio-technical factors in trust.