StudentLife Dataset

Introduction

The whole StudentLife dataset is distributed as one large file (full dataset), which contains all the sensor data, EMA data, survey responses, and educational data.

For privacy reasons, we removed data that may reveal participants' identities. For example, Bluetooth device names may contain participants' real names because people often name their computers after themselves. Browser logs are also removed from the dataset. The SSIDs of WiFi APs have been removed from the dataset because Dartmouth College Network Services does not allow us to disclose any information on campus WiFi AP deployment.

We recommend importing the whole dataset into a centralized datastore (e.g. MongoDB, Apache Cassandra) first. It will make the data processing much easier.

Citation

Please cite the following paper if the dataset is used in a publication:

Data Directory Organization

The dataset directories are organized by data type. The StudentLife dataset contains four types of data: sensor data, EMA data, pre and post survey responses, and educational data. The top level directory is shown below. The following subsections introduce the structure of each directory, and the later sections describe the data formats.

The data files under each data type subdirectory are organized by participants. For example, you can find all physical activity inferences for u01 in sensing/activity/activity_u01.csv. Similarly, you can find u01's conversation inferences in sensing/conversation/conversation_u01.csv.
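The naming convention in the two examples above can be captured in a small helper. This is a minimal sketch: the function name sensing_file and its root argument are hypothetical, and the pattern <type>/<type>_<uid>.csv is only confirmed here for the activity and conversation files, so other data types may use different file names.

```python
from pathlib import PurePosixPath


def sensing_file(root: str, data_type: str, uid: str) -> PurePosixPath:
    """Build the path of one participant's file for a sensing data type.

    Assumes the pattern sensing/<data_type>/<data_type>_<uid>.csv, which
    matches the activity and conversation examples in this document.
    """
    return PurePosixPath(root) / "sensing" / data_type / f"{data_type}_{uid}.csv"


print(sensing_file("dataset", "activity", "u01"))
print(sensing_file("dataset", "conversation", "u01"))
```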

EMA Data

You can find EMA question definitions in EMA/EMA_definition.json. Participants' responses are stored in EMA/responses. The names of the subdirectories under EMA/responses correspond to the EMA questions' names. For example, EMA/responses/Stress contains all participants' responses to the Stress EMA. As with the sensor data, each EMA's responses are organized by participant uid. The detailed EMA file format is described in the EMA section.

Pre and Post Surveys

All pre and post survey responses are stored in corresponding files under dataset/survey. The directory is organized by survey name. For example, you can find participants' pre and post responses to the PHQ-9 depression scale in survey/PHQ-9.csv. All files are in CSV format, which is defined in the Survey section.

Educational Data

Educational data, which includes classes taken during the 2013 Spring term, deadlines for each participant, grades, and Piazza usage for CS65, is stored under dataset/education. A detailed description is in the Educational Data section.

Automatic Sensing

This section introduces the data format of automatic sensor data that resides under dataset/sensing.

Physical Activity Inferences

The first few lines of a participant's physical activity inferences file look like this:

timestamp,activity inference
1364356853,0
1364356856,0
1364356858,0

The first row is the header row, which defines that there are two fields in activity data files: timestamp and activity inference id. The timestamp is the Unix time when the inference was collected. The timezone is Eastern Time Zone.

The activity classifier runs 24/7 with duty cycling. To avoid draining the battery, it makes activity inferences continuously for 1 minute, then pauses for 3 minutes before it starts collecting activity inferences again. It generates one activity inference every 2 to 3 seconds, depending on the smartphone's accelerometer sampling rate. The meaning of each activity inference is described in the following table.

Inference ID    Description
0               Stationary
1               Walking
2               Running
3               Unknown
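To make the ID-to-label mapping concrete, the sketch below counts inferences per activity label. It is an illustration rather than part of the dataset tooling: the function name count_activities is hypothetical, the sample rows are copied from this section, and the comma-separated layout is an assumption about the exact file format.

```python
import csv
import io
from collections import Counter

# Inference IDs as listed in the table above.
ACTIVITY_LABELS = {0: "Stationary", 1: "Walking", 2: "Running", 3: "Unknown"}

# Sample rows copied from this section; a real file would be opened instead.
# The exact delimiter and header spelling in the files are assumptions.
SAMPLE = """timestamp,activity inference
1364356853,0
1364356856,0
1364356858,0
"""


def count_activities(lines):
    """Count how many inferences of each activity label appear in a file."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    counts = Counter()
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # IDs outside the table are counted as "Unknown" (a choice of this sketch).
        counts[ACTIVITY_LABELS.get(int(row[1]), "Unknown")] += 1
    return counts


print(count_activities(io.StringIO(SAMPLE)))
```

The same approach should carry over to the audio inference files by swapping in the audio label table.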

Audio

The first few lines of a participant's audio inferences file look like this:

timestamp,audio inference
1364356875,0
1364356876,0
1364356877,0

The first row is the header row, which defines that there are two fields in audio data files: timestamp and audio inference type id. The timestamp is the Unix time when the inference was collected. The timezone is Eastern Time Zone.

The audio classifier runs 24/7 with duty cycling. It makes audio inferences for 1 minute, then pauses for 3 minutes before it restarts. If the conversation classifier detects that a conversation is going on, it keeps running until the conversation is finished. It generates one audio inference every 2 to 3 seconds. The meaning of each audio inference is described in the following table.

Inference ID    Description
0               Silence
1               Voice
2               Noise
3               Unknown

Conversation

The first few lines of a participant's conversation inferences file look like this:

start_timestamp,end_timestamp
1364425656,1364425727
1364427639,1364427780
1364428051,1364428485

There are two fields in conversation data files: the conversation start timestamp and the conversation end timestamp. For example, the first row shown above records that the participant was around a conversation from Unix timestamp 1364425656 to Unix timestamp 1364425727. The timezone is Eastern Time Zone.
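Because each row is a start and end pair, the length of a conversation is simply the difference between the two timestamps. The following sketch computes it for the sample rows in this section; the function name conversation_durations is hypothetical and the comma-separated layout is an assumption about the file format.

```python
import csv
import io

# Sample rows copied from this section; the delimiter is an assumption.
SAMPLE = """start_timestamp,end_timestamp
1364425656,1364425727
1364427639,1364427780
1364428051,1364428485
"""


def conversation_durations(lines):
    """Return the length of each conversation in seconds."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    # Each data row is (start, end); blank rows are ignored.
    return [int(row[1]) - int(row[0]) for row in reader if len(row) >= 2]


durations = conversation_durations(io.StringIO(SAMPLE))
print(durations)       # seconds per conversation
print(sum(durations))  # total seconds around conversation
```

The light, phone lock, and phone charge files also consist of start and end timestamps, so the same subtraction should apply to them.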

GPS Location

The first few lines of a participant's GPS location file look like this:

time,provider,network_type,accuracy,latitude,longitude,altitude,bearing,speed,travelstate
1364357009,network,wifi,67.993,43.7066671,-72.2890974,0.0,0.0,0.0,stationary
1364358209,network,wifi,23.0,43.706637,-72.2890664,0.0,0.0,0.0,moving
1364359405,gps,,16.0,43.70667831,-72.28901794,136.300003052,96.2,0.25,

GPS coordinates were collected every 10 minutes. Important data fields are shown as follows:

Field Name      Description
time            The Unix time of when it was collected (EST)
provider        The source of GPS coordinates: GPS or network
network_type    Which network was used to obtain GPS fix when the provider is network
latitude        Latitude
longitude       Longitude

Bluetooth

The first few lines of a participant's Bluetooth scan log file look like this:

time,MAC,class_id,level
1364359421,00:26:08:C9:80:E2,3670284,-79
1364359421,68:A8:6D:24:D9:8F,3801356,-92
1364360622,68:A8:6D:24:D9:8F,3801356,-94
1364388221,00:26:08:D2:B5:E9,3670284,-80
1364393027,00:26:08:B8:D2:CF,3801356,-86
1364393027,44:2A:60:FB:B7:59,3801356,-93

Bluetooth scans run every 10 minutes. We removed device names because of privacy concerns. The important data fields are as follows:

Field Name    Description
time          The Unix time of when it was collected
MAC           The MAC address of surrounding Bluetooth device
class_id      Describes general characteristics and capabilities of a device, see android.bluetooth.BluetoothClass
level         Signal strength

Note: rows that share the same timestamp belong to a single Bluetooth scan.
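Since rows with the same timestamp form one scan, a common first step is to group the rows by their time field. The sketch below does this for the sample rows in this section; the function name group_scans is hypothetical, and it assumes the file is comma-separated with the header names spelled exactly as shown.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from this section; delimiter and header names are assumptions.
SAMPLE = """time,MAC,class_id,level
1364359421,00:26:08:C9:80:E2,3670284,-79
1364359421,68:A8:6D:24:D9:8F,3801356,-92
1364360622,68:A8:6D:24:D9:8F,3801356,-94
1364388221,00:26:08:D2:B5:E9,3670284,-80
1364393027,00:26:08:B8:D2:CF,3801356,-86
1364393027,44:2A:60:FB:B7:59,3801356,-93
"""


def group_scans(lines):
    """Map each scan timestamp to the list of (MAC, level) pairs seen in it."""
    scans = defaultdict(list)
    for row in csv.DictReader(lines):
        scans[int(row["time"])].append((row["MAC"], int(row["level"])))
    return dict(scans)


scans = group_scans(io.StringIO(SAMPLE))
for timestamp, devices in sorted(scans.items()):
    print(timestamp, len(devices), "device(s)")
```

The WiFi scan logs follow the same one-row-per-device layout, so the grouping should work there with BSSID in place of MAC.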

WiFi

The first few lines of a participant's WiFi AP scan log file look like this:

time,BSSID,freq,level
1364356944,d0:57:4c:57:58:00,2437,-68
1364356944,dc:7b:94:87:29:b0,2462,-87
1364357187,d0:57:4c:57:58:00,2437,-68
1364357187,dc:7b:94:87:29:b0,2462,-87
1364357514,d0:57:4c:57:58:00,2437,-68
1364357514,dc:7b:94:87:46:f2,2412,-89

WiFi scans run frequently. We removed the SSIDs because of privacy concerns. The important data fields are as follows:

Field Name    Description
time          The Unix time of when it was collected
BSSID         AP's MAC address
freq          AP's working channel frequency
level         Signal strength

Note: rows that share the same timestamp belong to a single WiFi scan.

WiFi Location

We acquired Dartmouth College's WiFi AP deployment information from Dartmouth Network Services, which allows us to calculate a participant's rough on-campus location. However, we are not allowed to release the Dartmouth WiFi AP deployment information to the public, so we release the location inferences we calculated from participants' WiFi scan logs. You can use the locations inferred from WiFi scans together with the GPS Location data to infer the GPS coordinates of each Dartmouth building.

The first few lines of a participant's WiFi location file look like this:

time,location
1364357009,near[north-main; cutter-north; kemeny; ]
1364358209,in[kemeny]
1364359102,in[kemeny]
1364359163,in[kemeny]
1364359223,in[kemeny]
1364359409,in[kemeny]
1364359508,near[kemeny; cutter-north; north-main; ]
1364359793,near[kemeny; cutter-north; north-main; ]
1364360078,near[kemeny; cutter-north; north-main; ]

Each field is defined as follows:

Field Name    Description
time          The Unix time of when it was collected
location      On-campus location inferred from WiFi scans.

There are two kinds of location inferences: in a building (e.g. in[kemeny]) and near some buildings (near[kemeny; cutter-north; north-main;]).
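The two kinds of location inference can be separated with a small parser. The following is a minimal sketch based only on the in[...] and near[...] strings shown in this section; the function name parse_location is hypothetical, and other formats that may occur in the real files are not handled.

```python
import re

# Matches "in[kemeny]" or "near[kemeny; cutter-north; north-main; ]".
LOCATION_PATTERN = re.compile(r"^(in|near)\[(.*)\]$")


def parse_location(value):
    """Split a location inference into its kind and a list of building names."""
    match = LOCATION_PATTERN.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognized location inference: {value!r}")
    kind, body = match.groups()
    # Buildings are separated by ';' and the list may end with an empty item.
    buildings = [name.strip() for name in body.split(";") if name.strip()]
    return kind, buildings


print(parse_location("in[kemeny]"))
print(parse_location("near[north-main; cutter-north; kemeny; ]"))
```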

Light

The light data files record when the phone was in a dark environment for a significantly long time (>=1 hour). There are two fields in each data file: start timestamp and end timestamp.

The first few lines of a participant's light sensor file look like this:

start,end
1364359112,1364387807
1364397153,1364400889
1364402955,1364418088
1364423980,1364432230

Phone Lock

The phone lock data files record when the phone was locked for a significantly long time (>=1 hour). There are two fields in each data file: start timestamp and end timestamp.

The first few lines of a participant's phone lock file look like this:

start,end
1364359161,1364387080
1364395185,1364402754
1364402806,1364409439
1364427062,1364432230

Phone Charge

The phone charge data files record when the phone was plugged in and charging for a significantly long time (>=1 hour). There are two fields in each data file: start timestamp and end timestamp.

The first few lines of a participant's phone charge file look like this:

The name field defines the EMA question's name (i.e. Sleep in the above example). The questions field defines the questions that participants need to answer for this EMA. Each item in the questions array has three fields: question_text, question_id and options. question_text is the text of the question. question_id is the id of the question. options defines the candidate responses. For example, if a participant answered 6.5 for the first Sleep EMA question "How many hours did you sleep last night?", you will find hour:8 in their corresponding response record, because the recorded value is the index of the chosen option rather than the option text.

EMA Responses

EMA responses are in JSON array format. Each item in the JSON array is one response. As mentioned in EMA Definitions, the keys of each response are the EMA question names defined in the EMA definitions. The value is the participant's response to the question. It corresponds to the index of the options defined in the EMA definitions.

We can learn from this response that the participant responded at Unix time 1364359545 (EST), and that the participant's GPS coordinates were 43.70705013,-72.28730277 when answering the EMA question. The participant slept 6 hours according to the hour field. According to the rate and social fields respectively, the participant's sleep quality was Fairly good, and the participant had trouble staying awake yesterday while in class, eating meals or engaging in social activity Three or more times.

Seating Position

You can find seating position data files under the folder dataset/EMA/response/QR_Code. There are two fields in each seating position data file: a timestamp and a QR code corresponding to a seating position.

The mapping between the seating position and the QR code is as follows:

Seating Position Mapping

Survey Responses

The survey response files contain participants' responses to both the pre and post mental health measures. The following shows u01's pre and post responses to the Flourishing Scale. The first column shows which participant answered the survey, and the second column indicates whether the response is from the pre or post measurement. The remaining columns correspond to the survey questions.

uid,type,I lead a purposeful and meaningful life,My social relationships are supportive and rewarding,I am engaged and interested in my daily activities,I actively contribute to the happiness and well-being of others,I am competent and capable in the activities that are important to me,I am a good person and live a good life
u01,pre,4,6,6,6,7,6
u01,post,5,5,6,5,7,6
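As an illustration of working with this layout, the sketch below sums the item responses in each row so that pre and post rows can be compared. It rests on several assumptions: the function name total_scores is hypothetical, the file is taken to be comma-separated with numeric responses, and only the six question columns shown in this excerpt are included, so the result is not an official score for any scale.

```python
import csv
import io

# Question columns as shown in the excerpt above.
QUESTIONS = [
    "I lead a purposeful and meaningful life",
    "My social relationships are supportive and rewarding",
    "I am engaged and interested in my daily activities",
    "I actively contribute to the happiness and well-being of others",
    "I am competent and capable in the activities that are important to me",
    "I am a good person and live a good life",
]

# Sample rows copied from this section; the delimiter is an assumption.
SAMPLE = "\n".join([
    ",".join(["uid", "type", *QUESTIONS]),
    "u01,pre,4,6,6,6,7,6",
    "u01,post,5,5,6,5,7,6",
]) + "\n"


def total_scores(lines):
    """Sum the numeric item responses of each row, keyed by (uid, type)."""
    totals = {}
    for row in csv.DictReader(lines):
        items = [
            value for key, value in row.items()
            if key is not None and key not in ("uid", "type")
        ]
        # Empty or missing cells are skipped rather than counted as zero.
        totals[(row["uid"], row["type"])] = sum(
            int(value) for value in items if value and value.strip()
        )
    return totals


print(total_scores(io.StringIO(SAMPLE)))
```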

You can find detailed information about the mental health surveys from the following references:

Spitzer, R., Kroenke, K., & Williams, J. (1999). Validation and utility of a self-report version of PRIME-MD: the PHQ Primary Care Study. Journal of the American Medical Association, 282, 1737-1744.

Mount, M. K., & Barrick, M. R. (1995). The Big Five personality dimensions: Implications for research and practice in human resources management. Research in Personnel and Human Resources Management, 13(3), 153-200.

Education

There are four types of educational data: classes which participants took during the 2013 Spring term, number of class deadlines per day, GPA and Piazza usage.

Class

class.csv records classes which participants took during the 2013 Spring term.

You can find the lecture time periods and locations in class_info.json. All classes are stored in a JSON array. The following shows the location and class periods for COSC 065. The class location corresponds to the WiFi location. The periods field defines all class meeting periods in a JSON array. day is the weekday on which the lecture takes place, where Monday is 1 and Friday is 5.