We’ve all heard by now about AI-based algorithms being used for risk assessment in pretrial bail decisions. Doaa thinks this is a good place to start using algorithms, although it’s not easy.

The pre-trial stage is supposed to be very short. The court has to determine if the defendant, presumed innocent, will be released on bail or jailed. The sole considerations are supposed to be whether the defendant is likely to harm someone else or flee. Preventive detention has many effects, mostly negative for the defendant.
(The US is a world leader in pre-trial detainees. Yay?)

Risk assessment tools have been used for more than 50 years. Actuarial tools have shown greater predictive power than clinical judgment, and can eliminate some of the discretionary powers of judges. Use of these tools has long been controversial: What types of factors should be included in the tool? Is the use of demographic factors to make predictions fair to individuals?

Existing tools use regression analysis. Now machine learning can learn from much more data. Mechanical predictions [= machine learning] are more accurate than statistical predictions, but may not be explicable.

We think humans can explain their decisions and we want machines to be able to as well. But look at movie reviews. Humans can tell if a review is positive. We can teach the machine which words are positive or negative, getting 60% accuracy. Or we can have a human label the reviews as positive or negative and let the machine figure out what the factors are (via machine learning), in which case we get 80% accuracy but may lose explicability.
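To make the contrast concrete, here is a minimal sketch of the two approaches. Everything in it is an assumption for illustration: the word lists, the four toy reviews, and the function names are made up, and the learned model is a bare-bones naive Bayes that ignores class priors. It does not reproduce the 60% and 80% figures from the talk.

```python
import math
from collections import Counter

# Approach 1: a human supplies the factors (hypothetical word lists).
POSITIVE = {"great", "wonderful", "moving"}
NEGATIVE = {"dull", "boring", "awful"}

def lexicon_predict(review):
    """Count hand-picked positive and negative words; ties count as positive."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "pos" if score >= 0 else "neg"

# Approach 2: a human only labels reviews; the machine infers the factors.
def train_word_weights(labeled_reviews):
    """Return a log-odds weight per word, using add-one smoothing."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in labeled_reviews:
        counts[label].update(text.lower().split())
    vocab = set(counts["pos"]) | set(counts["neg"])
    totals = {label: sum(c.values()) for label, c in counts.items()}
    weights = {}
    for word in vocab:
        p_pos = (counts["pos"][word] + 1) / (totals["pos"] + len(vocab))
        p_neg = (counts["neg"][word] + 1) / (totals["neg"] + len(vocab))
        weights[word] = math.log(p_pos / p_neg)
    return weights

def learned_predict(weights, review):
    """Sum the learned weights; unseen words contribute nothing."""
    score = sum(weights.get(w, 0.0) for w in review.lower().split())
    return "pos" if score >= 0 else "neg"

# Made-up training data, purely for illustration.
TOY_REVIEWS = [
    ("a great and moving film", "pos"),
    ("wonderful cast and a gripping plot", "pos"),
    ("dull plot and a forgettable cast", "neg"),
    ("awful pacing and a forgettable script", "neg"),
]
weights = train_word_weights(TOY_REVIEWS)

print(lexicon_predict("forgettable script"))           # word list has no opinion
print(learned_predict(weights, "forgettable script"))  # learned from the labels
```

The learned model picks up "forgettable" as a negative signal even though nobody told it to, which is the gain. The cost is that its "reasons" are a table of numeric weights rather than a rule a person wrote down.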

With pretrial situations, what is the automated task that the machine should be performing?

There’s a tension between accuracy and fairness. Computer scientists are trying to quantify these questions: What does a fair algorithm look like? Jon Kleinberg and colleagues did a study of this [this one?]. Their algorithms reduced violent crime by 25% with no change in jailing rates, without increasing racial disparities. In short, the algorithm seems to have done a more accurate job with less bias.

Doaa goes through questions that should be asked of these tools, beginning with: Which factors are considered in each? [She dives into the details for all four tools. I can’t capture it. Sorry.]

What are the sources of data? (3 out of 4 rely on interviews and databases.)

What is the quality of the data? “This is the biggest problem jurisdictions are dealing with when using such a tool.” “Criminal justice data is notoriously poor.” And, of course, if a machine learning system is trained on discriminatory data, its conclusions are likely to reflect those biases.

The tools need to be periodically validated using data from each district’s own population. Local data matters.

There should be separate scores for flight risk and public safety. All but the PSA provide only a single score. This is important because there are separate remedies for the two concerns. E.g., you might want to lock up someone who is a risk to public safety, but take away the passport of someone who is a flight risk.
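A small sketch of why one combined number loses information. The 0-to-1 scale, the 0.5 threshold, the averaging, and the condition names are all hypothetical. None of this is taken from the PSA or any other real tool.

```python
from dataclasses import dataclass

@dataclass
class RiskScores:
    flight: float         # hypothetical 0..1 score for failing to appear
    public_safety: float  # hypothetical 0..1 score for harming someone

def combined(scores):
    """A single blended score, as a one-number tool might report (assumed average)."""
    return (scores.flight + scores.public_safety) / 2

def suggested_conditions(scores, threshold=0.5):
    """Map each concern to its own remedy instead of one blended decision."""
    conditions = []
    if scores.flight >= threshold:
        conditions.append("surrender passport")
    if scores.public_safety >= threshold:
        conditions.append("consider detention")
    return conditions or ["release without conditions"]

flight_risk = RiskScores(flight=0.8, public_safety=0.2)
safety_risk = RiskScores(flight=0.2, public_safety=0.8)

# Same blended score, different remedies once the scores are kept apart.
print(combined(flight_risk) == combined(safety_risk))
print(suggested_conditions(flight_risk))
print(suggested_conditions(safety_risk))
```

Both hypothetical defendants get the same blended score, so a judge who sees only that number cannot tell which remedy fits.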

Finally, the systems should discriminate among reasons for flight risk. E.g., because the defendant can’t afford the cost of making it to court or because she’s fleeing?

Conclusion: Pretrial is the front door of the criminal justice system and affects what happens thereafter. Risk assessment tools should not replace judges, but they bring benefits. They should be used, and should be made as transparent as possible. There are trade offs. The tool will not eliminate all bias but might help reduce it.

Q&A

Q: Do the algorithms recognize the different situations of different defendants?

A: Systems do recognize this, but not in sophisticated ways. That’s why it’s important to understand why a defendant might be at risk of missing a court date. Maybe we could provide poor defendants with a Metro card.

Q: Could machine learning be used to help us be more specific in the types of harm? What legal theories might we draw on to help with this?

A: [The discussion got too detailed for me to follow. Sorry.]

Q: There are different definitions of recidivism. What do we do when there’s a mismatch between the machines and the court?

A: Some states give different weights to different factors based on how long ago the prior crimes were committed. I haven’t seen any difference in considering how far ahead the risk of a possible next crime is.

Q: [me] While I’m very sympathetic to allowing machine learning to be used without always requiring that the output be explicable, when it comes to the justice system, do we need explanations so not only is justice done, but we can have trust that it’s being done?

A: If we can say which factors are going into a decision — and it’s not a lot of them — if the accuracy rate is much higher than manual systems, then maybe we can give up on always being able to explain exactly how it came to its decisions. Remember, pre-trial procedures are short and there’s usually not a lot of explaining going on anyway. It’s unlikely that defendants are going to argue over the factors used.

Q: [me] Yes, but what about the defendant who feels that she’s being treated differently than some other person and wants to know why?

A: Judges generally don’t explain how they came to their decisions anyway. The law sets some general rules, and the comparisons between individuals are generally within the framework of those rules. The rules don’t promise to produce perfectly comparable results. In fact, you probably can’t easily find two people with such similar circumstances. There are no identical cases.

Q: Machine learning, multilevel regression level, and human decision making all weigh data and produce an outcome. But ML has little human interaction, statistical analysis has some, and the human decision is all human. Yet all are in fact algorithmic: the judge looks at a bond schedule to set bail. Predictability as fairness is exacerbated by the human decisions since the human cannot explain her model.

Q: Did you find any logic about why jurisdictions picked which tool? Any clear process for this?

A: It’s hard to get that information about the procurement process. Usually they use consultants and experts. There’s no study I know of that looks at this.

Q: In NZ, the main tool used for risk assessment for domestic violence is a Canadian tool called ODARA. Do tools work across jurisdictions? How do you reconcile data sets that might be quite different?

A: I’m not against using the same system across jurisdictions — it’s very expensive to develop one from scratch — but they need to be validated. The federal tool has not been, as far as I know. (It was created in 2009.) Some tools do better at this than others.

Q: What advice would you give to a jurisdiction that might want to procure one? What choices did the tools make in terms of what they’re optimized for? Also: What about COMPAS?

A: (I didn’t talk about COMPAS because it’s notorious and not often used in pre-trial, although it started out as a pre-trial tool.) The trade off seems to be between accuracy and fairness. Policy makers should define more strictly where the line should be drawn.

Q: Who builds these products?

A: Three out of the four were built in house.

Q: PSA was developed by a consultant hired by the Arnold Foundation. (She’s from Luminosity.) She has helped develop a number of the tools.

Q: Why did you decide to research this? What’s next?

A: I started here because pre-trial is the beginning of the process. I’m interested in the fairness question, among other things.

Q: To what extent are the 100+ factors that the Colorado tool considers available publicly? Is their rationale for excluding factors public? Because they’re proxies for race? Because they’re hard to get? Or because back then 100+ seemed like too many? And what’s the overlap in factors between the existing systems and the system Kleinberg used?

A: Interviewing defendants takes time, so 100 factors can be too much. Kleinberg only looked at three factors. Another tool relied on six factors.

Q: Should we require private companies to reveal their algorithms?

A: There are various models. One is to create an FDA for algorithms. I’m not sure I support that model. I think private companies need to expose at least to the govt the factors that they’re including. Others would say I’m too optimistic about the government.

Q: In China we don’t have the pre-trial part, but there’s an article saying that they can make the sentencing more fair by distinguishing among crimes. Also, in China the system is more uniform so the data can be aggregated and the system can be made more accurate.

A: Yes, states are different because they have different laws. Exchanging data between states is not very common and may not even be possible.

The screen next to a patient’s hospital bed that displays the heart rate, oxygen level, and other moving charts is the definition of a dumb display. How dumb is it, you ask? If the clip on a patient’s finger falls off, the display thinks the patient is no longer breathing and will sound an alarm…even though it’s displaying outputs from other sensors that show that, no, the patient isn’t about to die.

The problem, as explained by David Arney at an open house for MD PnP, is that medical devices do not share their data in open ways. That is, they don’t interoperate. MD PnP wants to fix that.

The small group was founded in 2004 as part of MIT’s CIMIT (Consortia for Improving Medicine with Innovation and Technology). Funded by grants, including from the NIH and CRICO Insurance, it currently has 6-8 people working on ways to improve health care by getting machines talking with one another.

The one aspect of hospital devices that manufacturers have generally agreed on is that they connect via serial ports. The FDA encourages this, at least in part because serial ports are electrically safe. So, David pointed to a small connector box with serial ports in and out and a small computer in between. The computer converts the incoming information into an open industry standard (ISO 11073). And now the devices can play together. (The “PnP” in the group’s name stands for “plug ‘n’ play,” as we used to say in the personal computing world.)

David then demonstrated what can be done once the data from multiple devices interoperate.

You can put some logic behind the multiple signals so that a patient’s actual condition can be assessed far more accurately: no more sirens when an oxygen sensor falls off a finger.

You can create displays that are more informative and easier to read — and easier to spot anomalies on — than the standard bedside monitor.

You can transform data into other standards, such as the HL7 format for entry into electronic medical records.

If there is more than one sensor monitoring a factor, you can do automatic validation of signals.

You can record and perhaps share alarm histories.

You can create what is functionally an API for the data your medical center is generating: a database that makes the information available to programs that need it via publish and subscribe.

You can aggregate tons of data (while following privacy protocols, of course) and use machine learning to look for unexpected correlations.
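Two of the items above, cross-checking signals and publish and subscribe, can be sketched together. This is an illustrative toy, not MD PnP’s code: the `Bus` and `BreathingMonitor` classes, the topic names, and the thresholds are all assumptions, and the thresholds are not clinical guidance.

```python
from collections import defaultdict

# Illustrative thresholds only; NOT clinical guidance.
SPO2_MIN = 90  # percent oxygen saturation
RESP_MIN = 6   # breaths per minute

class Bus:
    """A tiny in-process publish/subscribe hub (hypothetical)."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, value):
        for handler in self._handlers[topic]:
            handler(topic, value)

class BreathingMonitor:
    """Alarms only when two independent sensors both look abnormal."""
    def __init__(self, bus):
        self.latest = {}
        self.events = []
        bus.subscribe("spo2", self.on_reading)
        bus.subscribe("resp_rate", self.on_reading)

    def on_reading(self, topic, value):
        self.latest[topic] = value  # None means "no signal from this sensor"
        if len(self.latest) < 2:
            return  # wait until both sensors have reported at least once
        spo2 = self.latest["spo2"]
        resp = self.latest["resp_rate"]
        spo2_abnormal = spo2 is None or spo2 < SPO2_MIN
        resp_abnormal = resp is None or resp < RESP_MIN
        if spo2_abnormal and resp_abnormal:
            self.events.append("ALARM")
        elif spo2 is None:
            self.events.append("ADVISORY: check finger sensor")

bus = Bus()
monitor = BreathingMonitor(bus)
bus.publish("resp_rate", 14)
bus.publish("spo2", 97)
bus.publish("spo2", None)    # the finger clip falls off
bus.publish("resp_rate", 3)  # a second sensor now also looks abnormal
print(monitor.events)
```

When the clip falls off but the respiration sensor still looks normal, the monitor logs an advisory instead of sounding the siren. A real system would need clinically validated rules. For instance, this toy stays silent on a low oxygen reading if respiration looks fine.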

MD PnP makes its stuff available under an open BSD license and publishes its projects on GitHub. This means, for example, that while PnP has created interfaces for 20-25 protocols and data standards used by device makers, you could program its connector to support another device if you need to.

Presumably not all the device manufacturers are thrilled about this. The big ones like to sell entire suites of devices to hospitals on the grounds that all those devices interoperate amongst themselves — what I like to call intraoperating. But beyond corporate greed, it’s hard to find a down side to enabling more market choice and more data integration.

J. Nathan Matias is giving a talk at the weekly AI session held by MIT Media Lab and Harvard’s Berkman Klein Center for Internet & Society. The title is: Testing the social impact of real-time algorithm decisions. (SPOILER: Nate is awesome.) Nathan will be introducing CivilServant.io to us, a service for researching the effects of tech and how it can be better directed toward the social outcomes we (the civil society “we”) desire. (That’s my paraphrase.)

In 2008, the French government approved a law against Web sites that encourage anorexia and bulimia. In 2012, Instagram responded to pressure to limit hashtags that “actively promote self-harm.” Instagram had 40M users, almost as many as France’s 55M active Net users. Researchers at Georgia Tech several years later found that some self-harm sites on Instagram had higher engagement after Instagram’s actions. If your algorithm reliably detects people who are at risk of committing suicide, what next? If the intervention isn’t helpful, your algorithm is doing harm.

Nathan shows a two-axis grid for evaluating algorithms: fair-unfair and benefits-harms. Accuracy should be considered to be on the same axis as fairness because it can be measured mathematically. But you can’t test the social impact without putting it into the field. “I’m trying to draw attention to the vertical axis [harm-benefit].”

We often have in mind a particular pipeline: training > model > prediction > people. Sometimes there are rapid feedback loops where the decisions made by people feed back into the model. A judicial system’s prediction risk scores may have no such loop. But the AI that manages a news feed is probably getting the readers’ response as data that tunes the model.

We have organizations that check the quality of items we deal with: UL for electrical products, etc. But we don’t have that sort of consumer protection for social tech. The results are moral panics, bad policies, etc. This is the gap Nate is trying to fill with CivilServant.io, a project supported by the Media Lab and GlobalVoices.

Here’s an example of one of CivilServant’s projects:

Managing fake news is essential for democracy. The social sciences have been dealing with this for quite a while by doing research on individual perception and beliefs, on how social context and culture influence beliefs … and now on algorithms that make autonomous decisions that affect us as citizens, e.g., newsfeeds. Newsfeeds work this way: someone posts a link. People react to it, e.g. upvote, discuss, etc. The feed service watches that behavior and uses it to promote or demote the item. And then it feeds back in.

We’ve seen lots of examples of pernicious outcomes of this. E.g., at Reddit an early upvote can have a dramatic impact on an item’s rating over time.

What can we do to govern online misinfo? We could surveil and censor. We could encourage counter-speech. We can imagine some type of algorithmic governance. We can use behavioral nudges, e.g. Facebook tagging articles as “disputed.” But all of these assume that these interventions change behaviors and beliefs. Those assumptions are not always tested.

Nate was approached by /r/worldnews at Reddit, a subreddit with 14M subscribers and 70 moderators. At Reddit, moderating can be a very time-consuming effort. (Nate spoke to a Reddit mod who had stopped volunteering at a children’s hospital in order to be a mod because she thought she could do more good that way.) This subreddit’s mods wanted to know if they could question the legitimacy of an item without causing it to surge on the platform. Fact-checking a post could nudge Reddit’s AI to boost its presence because of the increased activity.
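The mods’ worry can be sketched with a toy ranking score. This is emphatically not Reddit’s algorithm, which the talk did not describe. The `toy_score` formula, the comment weight, and the numbers are all invented to show how extra discussion could, in principle, lift a post.

```python
from dataclasses import dataclass

@dataclass
class Post:
    upvotes: int = 0
    downvotes: int = 0
    comments: int = 0

def toy_score(post, comment_weight=0.5):
    """Hypothetical score: net votes plus a bonus for discussion activity."""
    return (post.upvotes - post.downvotes) + comment_weight * post.comments

before = Post(upvotes=10, downvotes=2, comments=4)

# Six people respond to a request to fact-check by leaving comments.
fact_checked = Post(upvotes=10, downvotes=2, comments=10)

# The same six comments, but four of the commenters also downvote.
checked_and_downvoted = Post(upvotes=10, downvotes=6, comments=10)

print(toy_score(before))
print(toy_score(fact_checked))
print(toy_score(checked_and_downvoted))
```

Under this made-up formula the fact-checking comments alone raise the score, and only the added downvotes pull it back down. Whether a real platform behaves this way is an empirical question, which is why the mods wanted an experiment.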

So, they did an experiment asking people to fact check an article, or fact check and downvote if you can’t verify it. They monitored the ranking of the articles by Reddit for 3 months. [Nate now gives some math. Sorry I can’t capture (or understand) it.] The result: to his surprise, encouraging fact checking reduced the average rank position of an article. Encouraging fact checking and down-voting reduced the spread of inaccurate news by Reddit’s algorithms. [I’m not confident I’m getting that right.]

Why did encouraging fact checking reduce rankings, but fact checking and voting did not? The mods think this might be because it gave users a constructive way to handle articles from reviled sources, reducing the number of negative comments about them. [I hope I’m getting this right.] Also, “reactance” may have nudged people to upvote just to spite the instructions. Also, users may have mobilized friends to vote on the articles. Also, encouraging two tasks (fact check and then vote) rather than one may have influenced the timing of the algorithm, making the down-votes less impactful.

This is what Nate calls an “AI-Nudge”: a “second-order effect of influencing human behavior on the behavior of an algorithmic system.” It means you have to think about how humans interact with AI.

Often when people are working on AI, they’re starting from computer science and math. The question is: how can we use social science methods to research the effect of AI? Paluck and Cialdini see a cycle of Pilot/Lab experiments > qualitative methods > field experiences > theory / policy / design. In the Reddit example, Nathan spent considerable time with the community to understand their issues and how they interact with the AI.

Another example of a study: identifying and reducing side-effects of automated copyright law enforcement on Twitter. When people post something to Twitter, bots monitor it to see if it violates copyright, resulting in a DMCA takedown notice being issued. Twitter then takes it down. The Lumen Project from BKC archives these notices. The CivilServant project observes those notices in real time to study the effects. E.g., a user’s tweets per day tends to drop after they receive a takedown notice, and then continues dropping throughout the 42-day period they researched. Why this long-term decrease in posting? Maybe fear and risk. Maybe awareness of surveillance.

So, how can these chilling effects be reduced? The CivilServant project automatically sends users info about their rights and about surveillance. The results of this intervention are not in yet. The project hopes to find ways to lessen the public’s needless withdrawal from social media. The research can feed empirical legal studies. Policymakers might find it useful. Civil rights orgs as well. And the platforms themselves.

In the course of the Q&As, Nathan mentions that he’s working on ways to explain social science research that non-experts can understand. CivilServant’s work is with user communities and it’s developed a set of ways for communicating openly with the users.

Q: You’re trying to make AI more fair…

A: I’m doing consumer protection, so as experts like you work on making AI more fair, we can see the social effects of interventions. But there are feedback loops among them.

Q: What would you do with a community that doesn’t want to change?

A: We work with communities that want our help. In the 1970s, Campbell wrote an essay: “The Experimenting Society.” He asked if by doing behavioral research we’re becoming an authoritarian society because we’re putting power in the hands of the people who can afford to do the research. He proposed enabling communities to do their own studies and research. He proposed putting data scientists into towns across the US, pooling their research, and challenging their findings. But this was before the PC. Now it’s far more feasible.

Q: What sort of pushback have you gotten from communities?

A: Some decide not to work with us. In others, there’s contention about the shape of the project. Platforms have changed how they view this work. Three years ago, the platforms felt under siege and wounded. That’s why I decided to create an independent organization. The platforms have a strong incentive to protect their reputations.

[Disclosure: Typical conversations about JP, when he’s not present, attempt (and fail) to articulate his multi-faceted awesomeness. I’ll fail at this also, so I’ll just note that JP is directly responsible for my affiliation with the BKC and for my co-directorship of the Harvard Library Innovation Lab…and those are just the most visible ways in which he has enabled me to flourish as best I can.]

Also, at the end of this post I have some reflections on rules vs. models, and the implicit vs. explicit.

John begins by framing the book as an attempt to find a balance between diversity and free expression. Too often we have pitted the two against each other, especially in the past few years, he says: the left argues for diversity and the right argues for free expression. It’s important to have both, although he acknowledges that there are extremely hard cases where there is no reconciliation; in those cases we need rules and boundaries. But we are much better off when we can find common ground.

“This may sound old-fashioned in the liberal way. And that’s true,” he says. But we’re having this debate in part because young people have been advancing ideas that we should be listening to. We need to be taking a hard look.

Our institutions should be deeply devoted to diversity, equity and inclusion. Our institutions haven’t been as supportive of these as they should be, although they’re getting better at it, e.g. getting better at acknowledging the effects of institutional racism.

The diversity argument pushes us toward the question of “safe spaces.” Safe spaces are crucial in the same way that every human needs a place where everyone around them supports them and loves them, and where you can say dumb things. We all need zones of comfort, with rules implicit or explicit. It might be a room, a group, a virtual space… E.g., survivors of sexual assault need places where they know there are rules and they can express themselves without feeling at risk.

But, John adds, there should also be spaces where people are uncomfortable, where their beliefs are challenged.

Spaces of both sorts are experienced differently by different people. Privileged people like John experience spaces as safe that others experience as uncomfortable.

The examples in his book include: trigger warnings, safe spaces, the debates over campus symbols, the disinvitation of speakers, etc. These are very hard to navigate and call out for a series of rules or principles. Different schools might approach these differently. E.g., students from Gann Academy, a local Jewish high school, are here tonight. They well might experience a space differently than students at Andover. Different schools well might need different rules.

Now John turns it over to students for comments. (This is very typical JP: A modest but brilliant intervention and then a generous deferral to the room. I had the privilege of co-teaching a course with him once, and I can attest that he is a brilliant, inspiring teacher. Sorry to be such a JP fanboy, but I am at least an evidence-based fanboy.) [I have not captured these student responses adequately, in some cases simply because I had trouble hearing them. They were remarkable, however. And I could not get their names with enough confidence to attempt to reproduce them here. Sorry!]

Student Responses

Student: I graduated from Andover and now I’m at Harvard. I was struck by the book’s idea that we need to get over the dichotomy between diversity and free expression. I want to address Chapter 5, about hate speech. It says each institution ought to assess its own values to come up with its principles about speech and diversity, and those principles ought to be communicated clearly and enforced consistently. But, I believe, we should in fact be debating what the baseline should be for all institutions. We don’t all have full options about what school we’re going to go to, so there ought to be a baseline we all can rely on.

JP: Great critique. Moral relativism is not a good idea. But I don’t think one size fits all. In the hardest cases, there might be sharper limits. But I do agree there ought to be some sort of baseline around diversity, equity, and inclusion. I’d like to see that be a higher baseline, and we’ve worked on this at Andover. State universities are different. E.g., if a neo-Nazi group wants to demonstrate on a state school campus and they follow the rules laid out in the Skokie case, etc., they should be allowed to demonstrate. If they came to Andover, we’d say no. As a baseline, we might want to change the regulations so that the First Amendment doesn’t apply if the experience is detrimental to the education of the students; that would be a very hard line to draw. Even if we did, we still might want to allow local variations.

Student: Brave spaces are often built from safe spaces. E.g., at Andover we used Facebook to build a safe space for women to talk, in the face of academic competitions where misogyny was too common. This led to creating brave places where open, frank discussion across differences was welcomed.

JP: Yes, giving students a sense of safety so they can be brave is an important point. And, yes, brave spaces do often grow from safe spaces.

Andover student: I was struck by why diversity is important: the cross-pollination of ideas. But from my experience, a lot of that hasn’t occurred because we’re stuck in our own groups. There’s also typically a divide between the students and the faculty. Student activists are treated as if they’re just going through a phase. How do we bridge that gap?

JP: How do we encourage more cross-pollination? It’s a really hard problem for educators. I’ve been struck by the difference between teaching at Harvard Law and Andover in terms of the comfort with disagreeing across political divides; it was far more comfortable at the Law School. I’ve told students if you present a paper that disagrees with my point of view and argues for it beautifully, you’ll do better than parroting ideas back to me. Second, we have to stop using demeaning language to talk about student activists. BTW, there is an interesting dynamic, as teachers today may well have been activists when they were young and think of themselves as the reformers.

Student: [hard to hear] At Andover, our classes were seminar-based, which is a luxury not all students have. Also: Wouldn’t encouraging a broader spread of ideas create schisms? How would you create a school identity?

JP: This echoes the first student speaker’s point about establishing a baseline. Not all schools can have 12 students with two teachers in a seminar, as at Andover. We need to find a dialectic. As for schisms: we have to communicate values. Institutions are challenged these days but there is a huge place for them as places that convey values. There needs to be some top down communication of those values. Students can challenge those values, and they should. This gets at the heart of the problem: Do we tolerate the intolerant?

Student: I’m a graduate of Andover and currently at Harvard. My generation has grown up with the Internet. What happens when what is supposed to be a safe space becomes a brave space for some but not all? E.g., a dorm where people speak freely thinking it’s a safe space. What happens when the default values override what someone else views as comfortable? What is the power of an institution to develop, monitor, and mold what people actually feel? When communities engage in groupthink, how can an institution construct safe spaces?

JP: I don’t have an easy answer to this. We do need to remember that these spaces are experienced differently by different people, and the rules ought to reflect this. Some of my best learning came from late night bull sessions. It’s the duty of the institution to do what it can to enable that sort of space. But we also have to recognize that people who have been marginalized react differently. The rule sets need to reflect that fact.

Student: Andover has many different forum spaces available, from hallways to rooms. We get to choose when and where these conversations will occur. For a more traditional public high school where you only have a 30-person classroom as a forum, how do we have the difficult conversations that students at Andover choose to have in more intimate settings?

JP: The size and rule-set of the group matters enormously. Even in a traditional HS you can still break a class into groups. The answer is: How do you hack the space?

Student: I’m a freshman at Harvard. Before the era of safe spaces, we’d call them friends: people we can talk with and have no fear that our private words will be made public, and where we will not be judged. Safe spaces may exclude people, e.g., a safe space open only to women.

JP: Andover has a group for women of color. That excludes people, and for various reasons we think that’s entirely appropriate and useful.

Q&A

Q [Terry Fisher]: You refer frequently to rule sets. If we wanted to have a discussion in a forum like this, you could announce a set of rules. Or the organizer could announce values, such as: we value respect, or we want people to take the best version of what others say. Or, you could not say anything and model it in your behavior. When you and I went to school, there were no rules in classrooms. It was all done by modeling. But this also meant that gender roles were modeled. My experience of you as a wonderful teacher, JP, is that you model values so well. It doesn’t surprise me that so many of your students talk with the precision and respectfulness that you model. I am worried about relying on rule sets, and doubt their efficacy for the long term. Rather, the best hope is people modeling and conveying better values, as in the old method.

JP: Students, Terry Fisher was my teacher. My answer will be incredibly tentative: It is essential for an institution to convey its values. We do this at Andover. Our values tell us, for example, that we don’t want gender-based balance and are aware that we are in a misogynist culture, and thus need reasonable rules. But, yes, modeling is the most powerful.

Q [Dorothy Zinberg]: I’ve been at Harvard for about 70 yrs and I have seen the importance of an individual in changing an institution. For example, McGeorge Bundy thought he should bring 12 faculty to Harvard from non-traditional backgrounds, including Erik Erikson who did not have a college degree. He had been a disciple of Freud’s. He taught a course at Harvard called “The Lifecycle.” Every Harvard senior was reading The Catcher in the Rye. Erikson was giving brilliant lectures, but I told him it was from his point of view as a man, and had nothing to do with the young women. So, he told me, a grad student, to write the lectures. No traditional professor would have done that. Also: for forming groups, there’s nothing like closing the door. People need to be able to let go and try a lot of ideas.

Q: I am from the Sudan. How do you create a safe space in environments that are exclusive? [I may have gotten that wrong. Sorry.] How do you acknowledge the Native American tribes whose land this institution is built on, or the slaves who did the building?

JP: We all have that obligation. [JP gives some examples of the Law School recently acknowledging the slave labor, and the money from slave holders, that helped build the school.]

Q: You used a kitchen as an example of a safe space. Great example. But kitchens are not established or protected by any authority. It’s a new idea that institutions ought to set these up. Do you think there should be safe spaces that are privately set up as well as by institutions? Should some be permitted to exclude people or not?

(JP asks a student to respond): Institutional support can be very helpful when you have a diversity of students. Can institutional safe spaces supplement private ones? I’m not sure. And I do think exclusive groups have a place. As a consensus forms, it’s important to allow the marginalized voices to connect.

Q [head of Gann]: I’m a grad of Phillips Academy. As head of a religious school, I see us struggling with all these questions. Navigating these spaces isn’t just a political or intellectual activity. It is a work of the heart. If the institution thinks of this only as a rational activity and doesn’t tend to the hearts of our students, and is not explicit about the habits of heart we need to navigate these sensitive waters, only those with natural emotional skills will be able to flourish. We need to develop leaders who can turn hard conversations into generative ones. What would it look like to take on the work of social and emotional development?

JP: I’ve been to Gann and am confident that’s what you’re doing. And you can see evidence of Andover’s work on it in the students who spoke tonight. Someone asked me if a student became a Nazi, would you expel him? Yes, if it were apparent in his actions, but probably not for his thoughts. Ideally, our students won’t come to have those views because of the social and emotional skills they’re learning. But people in our culture do have those views. Your question brings it back to the project of education and of democracy.

[This session was so JP!]

A couple of reactions to this discussion without having yet read the book.

First, about Prof. Fisher’s comment: I think we are all likely to agree that modeling the behavior we want is the most powerful educational tool. JP and Prof. Fisher are both superb, well, models of this.

But, as Prof. Fisher noted in his question, the dominant model of discourse for our generation silently (and sometimes explicitly) favored males, white middle class values, etc. Explicit rules weren’t as necessary because we had internalized them and had stacked the deck against those who were marginalized by them. Now that diversity has thankfully become an explicit goal, and now that the Internet has thrown us into conversations across differences, we almost always need to make those rules explicit; a conversation among people from across divides of culture, economics, power, etc. that does not explicitly acknowledge the different norms under which the participants operate is almost certainly going to either fragment or end in misunderstanding.

(Clay Shirky and I had a collegial difference of opinion about this about fifteen years ago. Clay argued for online social groups having explicit constitutions. I argued for the importance of the “unspoken” in groups, and the damage that making norms explicit can cause.)

Second, about the need for setting a baseline: I’m curious to see what JP’s book says about this, because the evidence is that we as a culture cannot agree about what the baseline is: vociferous and often nasty arguments about this have been going on for decades. For example, what’s the baseline for inviting (or disinviting) people with highly noxious views to a private college campus? I don’t see a practical way forward for establishing a baseline answer. We can’t even get Texas schools to stop teaching Creationism.

So, having said that modeling is not enough, and having despaired at establishing a baseline, I think I am left being unhelpfully dialectical:

1. Modeling is essential but not enough.

2. We ought to be appropriately explicit about rules in order to create places where people feel safe enough to be frank and honest…

3. …But we are not going to be able to agree on a meaningful baseline for the U.S., much less internationally — “meaningful” meaning that it is specific enough that it can be applied to difficult cases.

4. But modeling may be the only way we can get to enough agreement that we can set a baseline. We can’t do it by rules because we don’t have enough unspoken agreement about what those rules should be. We can only get to that agreement by seeing our leading voices in every field engage across differences in respectful and emotionally truthful ways. So at the largest level, I find I do agree with Prof. Fisher: we need models.

5. But if our national models are to reflect the values we want as a baseline, we need to be thoughtful, reflective, and explicit about which leading voices we want to elevate as models. We tend to do this not by looking for rules but by looking for Prof. Fisher’s second alternative: values. For example, we say positively that we love John McCain’s being a “maverick” or Kamala Harris’ careful noting of the evidence for her claims, and we disdain Trump’s name-calling. Rules derive from values such as those. Values come before rules.

I just wish I had more hope about the direction we’re going in…although I do see hopeful signs in some of the model voices who are emerging, and most of all, in the younger generation’s embrace of difference.

Sandra gives an introduction to the BKC Youth and Media project. She points out that their projects are co-designed with the groups that they are researching. From the AI folks they’d love ideas and better understanding of AI, for they are just starting to consider the importance of AI to education and youth. They are creating a Digital Media Literacy Platform (which Sandra says they hope to rename).

They show an intro to AI designed to be useful for a teacher introducing the topic to students. It defines, at a high level, AI, machine learning, and neural networks. They also show “learning experiences” (= “XP”) that Berkman Klein summer interns came up with, including AI and well-being, AI and news, autonomous vehicles, and AI and art. They are committed to working on how to educate youth about AI not only in terms of particular areas, but also privacy, safety, etc., always with an eye towards inclusiveness.

They open it up for discussion by posing some questions. 1. How to promote inclusion? How to open it up to the most diverse learning communities? 2. Did we spot any errors in their materials? 3. How to reduce the complexity of this topic? 4. Should some of the examples become their own independent XPs? 5. How to increase engagement? How to make it exciting to people who don’t come into it already interested in the topic?

I am, surprisingly, at the first PAIR (People + AI Research) conference at Google, in Cambridge. There are about 100 people here, maybe half from Google. The official topic is: “How do humans and AI work together? How can AI benefit everyone?” I’ve already had three eye-opening conversations and the conference hasn’t even begun yet. (The conference seems admirably gender-balanced in audience and speakers.)

The great Martin Wattenberg (half of Wattenberg – Fernanda Viégas) kicks it off, introducing John Giannandrea, a VP at Google in charge of AI, search, and more.

John says that every vertical will be affected by this. “It’s important to get the humanistic side of this right.” He says there are 1,300 languages spoken world wide, so if you want to reach everyone with tech, machine learning can help. Likewise with health care, e.g. diagnosing retinal problems caused by diabetes. Likewise with social media.

PAIR intends to use engineering and analysis to augment expert intelligence, i.e., professionals in their jobs, creative people, etc. And “how do we remain inclusive? How do we make sure this tech is available to everyone and isn’t used just by an elite?”

He’s going to talk about interpretability, controllability, and accessibility.

Interpretability. Google has replaced all of its language translation software with neural network-based AI. He shows an example of Hemingway translated into Japanese and then back into English. It’s excellent but still partially wrong. A visualization tool shows a cluster of three strings in three languages, showing that the system has clustered them together because they are translations of the same sentence. [I hope I’m getting this right.] Another example: a photo of integrated gradients shows that the system has identified a photo as a fire boat because of the streams of water coming from it. “We’re just getting started on this.” “We need to invest in tools to understand the models.”
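[For readers who want a feel for what “integrated gradients” does, here is a minimal sketch. It is my own toy illustration, not Google’s tool and nothing John showed: the three-feature logistic model, its weights, and the all-zeros baseline are made up for the example. The idea is to average the model’s gradients along a straight path from a neutral baseline input to the actual input, then scale by how far each feature moved. Features with large attributions are the ones the model leaned on, the way the streams of water marked the fire boat.]

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, weights, bias):
    """Toy classifier (hypothetical): probability from a weighted sum of features."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def gradient(x, weights, bias):
    """Gradient of the toy model's output with respect to each input feature."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # A point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = gradient(point, weights, bias)
        totals = [t + g for t, g in zip(totals, grad)]
    # Average gradient along the path, scaled by each feature's displacement.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

# Made-up numbers: feature 0 matters a lot, feature 1 barely, feature 2 counts against.
WEIGHTS = [2.0, 0.1, -0.5]
BIAS = -1.0
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, WEIGHTS, BIAS)
print(attributions)
```

[A handy sanity check on the method: the attributions should add up to roughly the difference between the model’s output on the input and its output on the baseline.]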

Controllability. These systems learn from labeled data provided by humans. “We’ve been putting a lot of effort into using inclusive data sets.” He shows a tool that lets you visually inspect the data to see the facets present in them. He shows another example of identifying differences to build more robust models. “We had people worldwide draw sketches. E.g., draw a sketch of a chair.” In different cultures people draw different stick-figures of a chair. [See Eleanor Rosch on prototypes.] And you can build constraints into models, e.g., male and female. [I didn’t get this.]

Accessibility. Internal research from YouTube built a model for recommending videos. Initially it just looked at how many users watched a video. You get better results if you look not just at the clicks but the lifetime usage by users. [Again, I didn’t get that accurately.]

Google open-sourced TensorFlow, Google’s AI tool. “People have been using it for everything from sorting cucumbers to tracking the husbandry of cows.” Google would never have thought of these applications.

AutoML: learning to learn. Can we figure out how to enable ML to learn automatically? In one case, it looks at models to see if it can create more efficient ones. Google’s AIY lets DIY-ers build AI in a cardboard box, using Raspberry Pi. John also points to an Android app that composes music. Also, Google has worked with Geena Davis to create sw that can identify male and female characters in movies and track how long each speaks. It discovered that movies that have a strong female lead or co-lead do better financially.

He ends by emphasizing Google’s commitment to open sourcing its tools and research.

Fernanda and Martin talk about the importance of visualization. (If you are not familiar with their work, you are leading deprived lives.) When F&M got interested in ML, they talked with engineers. “ML is very different. Maybe not as different as software is from hardware. But maybe. We’re just finding out.”

M&F also talked with artists at Google. He shows photos of imaginary people by Mike Tyka created by ML.

This tells us that AI is also about optimizing subjective factors. ML for everyone: Engineers, experts, lay users.

Fernanda says ML spreads across all of Google, and even across Alphabet. What does PAIR do? It publishes. It’s interdisciplinary. It does education. E.g., TensorFlow Playground: a visualization of a simple neural net used as an intro to ML. They open sourced it, and the Net has taken it up. Also, a journal called Distill.pub aimed at explaining ML and visualization.

She “shamelessly” plugs deeplearn.js, tools for bringing AI to the browser. “Can we turn ML development into a fluid experience, available to everyone?” What experiences might this unleash, she asks.

They are giving out faculty grants. And expanding the Brain residency for people interested in HCI and design…even in Cambridge (!).

It’s actually the first essay in the book, which obviously is not arranged in order of preference, but probably means at least the editors didn’t hate it.

The next day: Thanks to a tweet by Siva Vaidhyanathan, I and a lot of people on Twitter have realized that all but one of the authors in this volume are male. I’d simply said yes to the editors’ request to re-publish my article. It didn’t occur to me to ask to see the rest of the roster even though this is an issue I care about deeply. LARB seems to feature diverse writers overall, but apparently not so much in tech.

Ethan Zuckerman brilliantly frames the public’s distrust of institutional journalism in a whitepaper he is writing for Knight. (He’s posted it both on his blog and at Medium. Choose wisely.)
As he said at an Aspen event where he led a discussion of it:

…I think mistrust in civic institutions is much broader than mistrust in the press. Because mistrust is broad-based, press-centric solutions to mistrust are likely to fail. This is a broad civic problem, not a problem of fake news,

The whitepaper explores the roots of that broad civic problem and suggests ways to ameliorate it. The essay is deeply thought, carefully laid out, and vividly expressed. It is, in short, peak Ethanz.

The best news is that Ethan notes that he’s writing a book on civic mistrust.

In the early 2000’s, some of us thought that journalists would blog and we would thereby get to know who they are and what they value. This would help transparency become the new objectivity. Blogging has not become the norm for reporters, although it does occur. But it turns out that Twitter is doing that transparency job for us. Jake Tapper (@jaketapper) at CNN is one particularly good example of this; he tweets with a fierce decency. Maggie Haberman (@maggieNYT) and Glenn Thrush (@glennThrush) from the NY Times, too. And many more.

This, I think, is a good thing. For one thing, it increases trust in at least some news media, while confirming our distrust of news media we already didn’t trust. But we are well past the point where we are ever going to trust the news media as a generalization. The challenge is to build public trust in news media that report as truthfully and fairly as they can.

Lionel Brossi recounts growing up in Argentina and the assumption that all boys care about football. He moved to Chile which is split between people who do and do not watch football. “Humans are inherently biased.” So, our AI systems are likely to be biased. Cognitive science has shown that the participants in its studies tend to be WEIRD: western, educated, industrialized, rich and democratic. Also straight and white. He references Kate Crawford’s “AI’s White Guy Problem.” We need not only diverse teams of developers, but also to think about how data can be more representative. We also need to think about the users. One approach is to work on goal-centered design.

If we ever get to unbiased AI, Borges’ statement, “The original is unfaithful to the translation” may apply.

Chelsea: What is an inclusive way to think of cross-border countries?

Lionel: We need to co-design with more people.

Madeline Elish is at Data and Society and an anthropology of technology grad student at Columbia. She’s met designers who thought it might be a good idea to make a phone run faster if you yell at it. But this would train children to yell at things. What’s the context in which such designers work? She and Tim Hwang set about to build bridges between academics and businesses. They asked what designers see as their responsibility for the social implications of their work. They found four core challenges:

She and Tim wrote An AI Pattern Language [pdf] about the frameworks that guide design. She notes that none of them were thinking about social justice. The book argues that there’s a way to translate between the social justice framework and, for example, the accuracy framework.

Ethan Zuckerman: How much of the language you’re seeing feels familiar from other hype cycles?

Madeline: Tim and I looked at the history of autopilot litigation to see what might happen with autonomous cars. We should be looking at Big Data as the prior hype cycle.

Yarden Katz is at the BKC and at the Dept. of Systems Biology at Harvard Medical School. He talks about the history of AI, starting with a 1958 claim about a translation machine. 1966: Minsky. Then there was an AI funding winter, but now it’s big again. “Until recently, AI was a dirty word.”

Today we use it schizophrenically: for Deep Learning or in a totally diluted sense as something done by a computer. “AI” now seems to be a branding strategy used by Silicon Valley.

“AI’s history is diverse, messy, and philosophical.” If complexity is embraced, “AI” might not be a useful category for policy. So we should go back to the politics of technology:

1. Who controls the code/frameworks/data?
2. Is the system inspectable/open?
3. Who sets the metrics? Who benefits from them?

The media are not going to be the watchdogs because they’re caught up in the hype. So who will be?

Q: There’s a qualitative difference in the sort of tasks now being turned over to computers. We’re entrusting machines with tasks we used to only trust to humans with good judgment.

Yarden: We already do that with systems that are not labeled AI, like “risk assessment” programs used by insurance companies.

Madeline: Before AI got popular again, there were expert systems. We are reconfiguring our understanding, moving it from a cognition frame to a behavioral one.

Chelsea: I’ve been involved in co-design projects that have backfired. These projects have sometimes been somewhat extractive: going in, getting lots of data, etc. How do we do co-design projects that are not extractive but that also aren’t prohibitively expensive?

Nathan: To what degree does AI change the dimensions of questions about explanation, inspectability, etc.?

Yarden: The promoters of the Deep Learning narrative want us to believe you just need to feed in lots and lots of data. DL is less inspectable than other methods. DL is not learning from nothing. There are open questions about its inductive power.

Amy Zhang and Ryan Budish give a pre-alpha demo of the AI Compass being built at BKC. It’s designed to help people find resources exploring topics related to the ethics and governance of AI.