Shared Task 2019

Fast Facts

The problem. The shared task tracks this year all involve predicting someone’s degree of suicide risk using posts they’ve made on Reddit. The tasks are all based on a four-way scale of no, low, moderate, and severe risk. These have different definitions but are similar in spirit to the green, amber, red, and crisis categories in previous CLPsych shared tasks using ReachOut data.

Getting sample data. Please fill in the form here and we will email you a small sample of data so that you can see the format and start writing code, e.g. for preprocessing.

Getting the training data. Unlike last year’s shared task, no formal legally binding data use agreement needs to be signed to obtain the dataset this year. But you do need to submit a participant application for review where you affirm that you will follow the shared task’s rules, and you will need to get your organization’s ethical review board (IRB or equivalent) to agree you can use the data, by the time you submit your system results. Because your organization’s IRB process could take a little time, we recommend you get started with this as quickly as possible. That said, getting such IRB approval (most likely a finding that your research with the data is in the Exempt category) should be straightforward, because we are using publicly available, anonymous data (extracted from Reddit). We provide example wording for your IRB submission, which may be useful to those with less IRB-related experience.

Annotations. You’ll be able to do supervised learning with annotated data. We will be dividing a set of approximately 1,200 users (including both positive examples and controls) into annotated training and test data. You’ll also have the option of using a larger quantity of relevant but unannotated data from approximately 10,000 other users.

Tasks. You can participate in one, two, or all three of these tasks.

Task A is about risk assessment: this task simulates a scenario where you already have online evidence that a person might be in need of help (because they have posted to a relevant online forum or discussion, in this case r/SuicideWatch), and the goal is to assess their level of risk from what they posted. This task uses the smallest amount of data, with each user typically having no more than a few SuicideWatch posts.

Task B is the same risk assessment problem as task A, but in addition to the SuicideWatch posts (which tell you they may need help), you can also use the user’s posts elsewhere on Reddit (which might tell you more about them or their mental state). On average each user we collected data for has more than 130 posts on Reddit, and the subreddit categories are wildly diverse, from Accounting to mylittlepony to SkincareAddiction to zombies.

Task C is about screening. This task simulates a scenario in which someone has opted in to having their social media monitored (e.g. a new mother at risk for postpartum depression, a veteran returning from a deployment, a patient whose therapist has suggested it) and the goal is to identify whether they are at risk even if they have not explicitly presented with a problem. Here predictions will be made only from users’ posts that are not on SuicideWatch (or any other mental health related forum).

Publication. All shared task participants will have the opportunity to contribute a short paper for inclusion in the official workshop proceedings. Note that the timeline, particularly for paper writing/reviewing/revision, is quite compressed, because we want to provide shared task participants with an official publication in the workshop proceedings and we have been given a strict, unmovable deadline by the conference organizers for sending final camera-ready shared task papers.

Timeline

Now. Availability of sample data, for initial development of preprocessing etc.

Mar 2. Availability of training data. (You will need to submit the participant application form, including documentation that you have started your organization’s IRB review process or equivalent, to get the data.)

Detailed Task Description

Important Note

This task involves research that may ultimately help
to prevent suicides. Before deciding to participate, though, please note that
some of the materials we will be providing are from people in real distress,
and they can be difficult or upsetting to read. If you believe that you or
someone on your team might be affected personally in a negative way by doing
this task, please err on the side of caution and do not participate. If you
start working on it and you or a member of your team finds that it’s upsetting,
please stop. If you’re feeling like you (or someone you know) could use some
support or assistance, please take advantage of one of the following resources:

https://www.reddit.com/r/SuicideWatch/wiki/hotlines – This page provides information about phone and chat hotlines and online resources in the U.S. and worldwide.

Please note that all the materials we are providing are anonymous — not even the researchers who collected the dataset know who these people are, and the posts were made over a period of years. Although it’s tragic that there is no direct way for us to help the people who have written these posts who may be at risk of suicide, you are contributing to an effort aimed at better understanding the factors connected with suicide attempts, using that information to do a better job assessing risk, and hopefully contributing to more effective ways of getting people help.

Motivation

Researchers have recently observed
a troubling long-term lack of progress in predicting suicide risk. McHugh et al. (2019), reviewing more than 70
studies, conclude that suicidality can’t be predicted effectively using the
standard practice of clinicians asking people in person about suicidal thoughts:
80% of patients who were not already undergoing psychiatric treatment and who
died of suicide reported not having
suicidal thoughts when asked by a general practitioner. After carefully
reviewing more than three hundred studies, Franklin et al. (2015) conclude that
predictive ability for suicidal thoughts and behaviors has not improved across 50
years of research (!), and they suggest a change in focus from the traditional study
of risk factors to approaches based on machine learning. Similarly, Nock et al. (2019) argue for
addressing the lack of progress in suicide prevention by shifting from
theory-driven models to a more inductive approach.
At the same time, Nock et al. note that suicidal thoughts and behaviors “rarely occur in a research laboratory”, and they identify a need to pursue new technologies that provide continuous monitoring “in situ”. As one powerful example of capturing people’s experiences in situ, Coppersmith et al. (2018) have found that for many people the “clinical whitespace” (the long intervals between healthcare encounters) is occupied by frequent use of social media, which can be tapped to build binary risk classifiers. This raises a new problem, though: when such systems are deployed, the number of people flagged as “at risk” will far exceed clinical capacity for intervention. So rather than a binary classification, a finer-grained assessment of degree of risk is needed to support decisions about intervention priority. This motivates a shared task on multi-level risk assessment for suicidality based on online postings.

Data

In this year’s shared task we look at several
variations on assessment of suicide risk from online postings. Participants
will use data in the University of Maryland Reddit Suicidality Dataset (Shing
et al., 2018), which has been constructed using data from Reddit, an online site for anonymous discussion on a wide
variety of topics. Specifically, the UMD dataset was extracted from the 2015 Full Reddit Submission Corpus,
using postings in the r/SuicideWatch subreddit (henceforth simply SuicideWatch or SW)
to identify (anonymous) users who might represent positive instances of
suicidality and including a comparable number of non-SuicideWatch controls.
(See Gaffney and Matias 2018 for some recent caveats regarding the use of the
2015 Reddit Corpus related to missing data.)
As reported in Shing et al. (2018),
users’ SW posts were assessed using a four point scale, including no risk, low risk, moderate risk,
and severe risk, summarized as
follows:

(a) No Risk (or “None”): I don’t see evidence that this person is at risk for suicide.

(b) Low Risk: There may be some factors here that could suggest risk, but I don’t really think this person is at much of a risk of suicide.

(c) Moderate Risk: I see indications that there could be a genuine risk of this person making a suicide attempt.

(d) Severe Risk: I believe this person is at high risk of attempting suicide in the near future.

The dataset includes posts from more than 11,000 users who posted at least once on SuicideWatch and a comparable number who did not. For each post, we have:

post_id

A unique identifier for the post.

user_id

A unique numeric identifier for the user who authored the post.

Note that although Reddit is a site for anonymous discussion, Reddit user
IDs have been replaced with these numeric IDs as an additional layer of
protection.

For easy human readability, a somewhat nonstandard convention has been
adopted for user IDs: all control instances have negative numbers for their
unique user IDs.

timestamp

Time the post was created, encoded as a Unix epoch.

subreddit

The name of the subreddit (discussion forum) where the post appeared.

post_title

Title of the post.

post_body

Contents of the post.

Post data is provided in comma-separated values (CSV) files. For example, here is a control post on the engineering subreddit from user ID -107 at Unix epoch timestamp 1361910133, which translates to Tuesday, February 26, 2013 8:22:13 PM GMT:

"19a1p7","-107","1361910133","engineering","Where to start for rocket design fundamentals ","Hello. I'm a mechanical engineer with about a year left. I've been focusing primarily on structural design with an emphasis in composite materials. I've recently become enamored with with possibility of working in the commercial space industry and can't stop thinking about how much I need to learn. I'm planning on getting involved with my university's rocket design team, but I was wondering if reddit had some other tips and sources to acquire some rocket knowledge. Thanks!"
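
As a starting point for preprocessing, the record above can be read with Python's standard csv module. This is only a sketch under stated assumptions: it assumes the six fields appear in the order listed above, that every field is double-quoted, and that the file has no header row, none of which has been checked against the released files. The function name parse_posts and the added keys created and is_control are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

# Field order as listed in the task description (assumed to match the files).
FIELDS = ["post_id", "user_id", "timestamp", "subreddit", "post_title", "post_body"]


def parse_posts(csv_text):
    """Yield one dict per post from CSV text holding the six fields above."""
    for row in csv.reader(io.StringIO(csv_text)):
        post = dict(zip(FIELDS, row))
        post["user_id"] = int(post["user_id"])
        # Timestamps are Unix epoch seconds; interpret them as UTC (GMT).
        post["created"] = datetime.fromtimestamp(int(post["timestamp"]), tz=timezone.utc)
        # By the dataset's convention, control users have negative IDs.
        post["is_control"] = post["user_id"] < 0
        yield post


# The example record, with the post body shortened for this sketch.
sample = (
    '"19a1p7","-107","1361910133","engineering",'
    '"Where to start for rocket design fundamentals ",'
    '"Hello. (body shortened for this example)"\n'
)
posts = list(parse_posts(sample))
print(posts[0]["created"].isoformat())  # 2013-02-26T20:22:13+00:00
```

The csv module handles commas and line breaks inside quoted fields, which matters here because post bodies are free text.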

Labels (at the user level)
are provided in a separate two-column CSV file where the first column is
user_id and the second column is one of {a,b,c,d}.
Please fill out the sample data request form if you would like to see sample data for a small number of users ahead of the training dataset release.
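
The two-column label file described above could be loaded along the following lines. This is a sketch, not the official loader: the numeric encoding 0 to 3 is our own assumption for treating the labels as an ordinal scale, the user IDs in the example are made up, and load_labels is a hypothetical helper name.

```python
import csv
import io

# Ordinal encoding of the four risk levels (a = no risk ... d = severe risk).
# The numbers 0-3 are an assumption made for this sketch.
RISK_LEVEL = {"a": 0, "b": 1, "c": 2, "d": 3}


def load_labels(csv_text):
    """Map user_id -> ordinal risk level from two-column CSV text."""
    labels = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2 or row[1] not in RISK_LEVEL:
            continue  # skip blank lines, or a header row if one is present
        labels[int(row[0])] = RISK_LEVEL[row[1]]
    return labels


# Made-up user IDs, for illustration only.
example_labels = '"101","a"\n"202","d"\n"303","c"\n'
labels = load_labels(example_labels)
print(labels)
```

Because labels are given per user rather than per post, a system would typically group a user's posts first and then look up the user's label.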

Tasks

Task A: Risk Assessment for SW posters based on their SW postings

Risk assessment is the problem of determining degree of risk for someone when you already have reason to believe they might be at risk. This corresponds to a common real-world use case where the priority for intervention needs to be determined when a friend, family member, or the person themselves has identified that there is or might be a problem. Note that this is particularly relevant given recent work showing that suicide can’t be predicted by asking about suicidal thoughts (McHugh et al. 2019).

Given a user that posted on SW, the primary task is to assess their degree of risk (a, b, c, or d) from their SW postings. “Ground truth” is based on a consensus of human annotators (Shing et al., 2018).

Task B: Risk Assessment for SW posters based on their SW postings and other Reddit postings

This will be the same as Task A, but systems will be permitted to also use the data we have about users from everything else they posted on Reddit.

For participants who do both this and Task A, a comparison between the two will help us better understand the value of collecting more comprehensive information in risk assessment scenarios.

Task C: Screening

Screening is the problem of identifying people who might be at risk, in the absence of other information that might already suggest there is a problem. This corresponds to real-world use cases where there are populations that could be monitored, e.g. clinicians keeping track of new mothers in case of postpartum depression, the VA keeping an eye on returning veterans, or schools looking for early warnings for students (e.g. see https://www.goguardian.com/beacon.html).

The task is to predict suicide risk for a user given their Reddit posts excluding their posts from SuicideWatch, i.e. from an approximation of their general social media activity.
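
The three tasks differ mainly in which of a user's posts a system may look at, and that difference can be sketched as a simple filter. The code below is an illustration only: it assumes posts are dicts with user_id and subreddit keys, that the subreddit field for SuicideWatch posts reads "SuicideWatch" (compared case-insensitively here), and that any additional forums to exclude for Task C are supplied by the caller through the hypothetical other_excluded parameter, since no such list is given here.

```python
from collections import defaultdict

SW = "suicidewatch"  # compared in lower case; exact spelling in the data is assumed


def posts_for_task(posts, task, other_excluded=()):
    """Group the posts each task may use by user_id.

    task "A": SuicideWatch posts only.
    task "B": all of the user's posts.
    task "C": everything except SuicideWatch and `other_excluded`.
    """
    if task not in ("A", "B", "C"):
        raise ValueError(f"unknown task: {task!r}")
    excluded = {SW} | {name.lower() for name in other_excluded}
    by_user = defaultdict(list)
    for post in posts:
        subreddit = post["subreddit"].lower()
        if task == "A" and subreddit != SW:
            continue
        if task == "C" and subreddit in excluded:
            continue
        by_user[post["user_id"]].append(post)
    return dict(by_user)


# Toy posts with made-up user IDs, for illustration only.
toy_posts = [
    {"user_id": 1, "subreddit": "SuicideWatch"},
    {"user_id": 1, "subreddit": "engineering"},
    {"user_id": -2, "subreddit": "zombies"},
]
task_a = posts_for_task(toy_posts, "A")
task_b = posts_for_task(toy_posts, "B")
task_c = posts_for_task(toy_posts, "C")
```

Note that under this filter a user whose only posts are on SuicideWatch has nothing left for Task C and simply does not appear in the result.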

Evaluation metrics are still under discussion, to be finalized by the time the training data is released. Our primary focus will be to treat the four-way classification as an ordinal scale (e.g. Spearman rank-order correlation as a summary statistic) in order to rank systems, but we may also provide precision/recall or sensitivity/specificity for individual categories. In addition, we may provide evaluations mapping the finer-grained ground truth to a binary task, e.g. distinguishing severe (d) from non-severe (a-c), similarly to what was introduced in CLPsych 2017 on ReachOut data.
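
Since the metrics are not yet final, the following is only a sketch of the two ideas mentioned above, not the official scorer: a Spearman rank-order correlation over the ordinal labels (computed here as the Pearson correlation of tie-averaged ranks) and a mapping of the four-way labels to severe versus non-severe. The function names and the example predictions are our own assumptions.

```python
ORDINAL = {"a": 0, "b": 1, "c": 2, "d": 3}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(gold, predicted):
    """Spearman correlation as the Pearson correlation of the ranks.

    Undefined (division by zero) if either list is constant.
    """
    rg, rp = average_ranks(gold), average_ranks(predicted)
    n = len(rg)
    mg, mp = sum(rg) / n, sum(rp) / n
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_g * var_p) ** 0.5


def is_severe(label):
    """Binary view of the labels: severe (d) versus non-severe (a-c)."""
    return label == "d"


# Made-up gold labels and system predictions, for illustration only.
gold = ["a", "b", "c", "d", "d"]
pred = ["a", "a", "c", "d", "c"]
rho = spearman([ORDINAL[g] for g in gold], [ORDINAL[p] for p in pred])
binary_gold = [is_severe(g) for g in gold]
```

With only four distinct label values, ties are the norm rather than the exception, which is why the ranks are averaged over ties instead of using the shortcut formula that assumes distinct ranks.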

How to Sign Up

Optionally, fill out the sample data request form if you would like to see sample data for a small number of users ahead of the training dataset release. This also helps us with advance notice that you might participate.

Use Part 2 to create an application to your organization’s IRB or equivalent. You don’t need approval before you get the training data and start working on your system, just evidence you’ve gotten the process started; however, IRB approval, exemption, or equivalent will be needed before your system output can be evaluated.

By the task signup deadline, submit Part 1 plus a copy of your IRB application (or equivalent) to clpsych-shared-task-organizers@googlegroups.com as a PDF attachment, with subject line “Shared task signup”.