Data+

Data+ is a 10-week summer research experience that welcomes Duke undergraduates interested in exploring new data-driven approaches to interdisciplinary challenges. Students join small project teams, collaborating with other teams in a communal environment. They learn how to marshal, analyze, and visualize data, while gaining broad exposure to the modern world of data science.

Data+ 2019

"The Data+ program was filled with intelligent people from all different fields, so it was a great learning experience. Since we worked in teams, it taught me how to work with others in a more efficient, collaborative, and overall better level. Working to meet our clients' needs, I feel as if I gained real-world work experience in a classroom-like atmosphere (project mentor as my teacher, my group as the students). It is a great transition for people who are unsure of what they want to do with their careers or feel under-qualified to pursue a real internship."

—Sonia Xu, Statistical Science

"I feel I've gained an ability to connect my abstract knowledge to an actual problem, and use it to deliver something useful and productive to a client."

– Austin Ferguson, Physics and Mathematics

"Participating in Data+ definitely changed my perception of Data Science research. It was more interdisciplinary than I expected, and the opportunity to work with experts across different fields (Medicine, Civil Engineering, Statistics) was a defining aspect of my Data+ experience."

"Before Data+, data science research sounded like a non-collaborative job involving PhD-level statistical concepts. Data+, however, showed me that there is a place for collaborative workers from all different backgrounds (and of all skill levels) in data science research. Participating in Data+ has enriched my technical skills as a coder; I am now able to navigate software and employ coding languages that I was not at all familiar with before the start of the program. Even more valuable, however, are the "soft" skills I have gained -- specifically, the ability to approach collaboration with an open mind."

"My participation in the Data+ program has shown me how to successfully work with a dynamic team. Each of my team members were fundamentally different in course interests and background, yet we came together to create a polished product in which we each were a point person for a specific portion. I have also gained confidence in my ability to learn new skills, as I basically taught myself (through Google and asking teammates) how to program in R over this summer."
—Devri Adams, Environmental Science

Projects

Social and environmental contexts are increasingly recognized as factors that impact health outcomes of patients. This team will have the opportunity to collaborate directly with clinicians and medical data in a real-world setting. They will examine the association between social determinants and risk prediction for hospital admissions, and assess whether social determinants bias that risk in a systematic way. Applied methods will include machine learning, risk prediction, and assessment of bias. This Data+ project is sponsored by the Forge, Duke's center for actionable data science.

Producing oil and gas in the North Sea, off the coast of the United Kingdom, requires a lease to extract resources from beneath the ocean floor, and companies bid for those rights. This team will consult with professionals at ExxonMobil to understand why these leases are acquired and who benefits. This requires historical data on bid history to investigate what leads to an increase in the number of (a) leases acquired and (b) companies participating in auctions. The goal of this team is to create a well-structured dataset based on company bid history from the U.K. Oil and Gas Authority. These data will come from many different file structures and formats (tabular, PDF, etc.). The team will curate these data to create a single, tabular database of U.K. bid history and work programs.

Producing oil and gas in the Gulf of Mexico requires rights to extract these resources from beneath the ocean floor, and companies bid into the market for those rights. The top bids are sometimes significantly larger than the next highest bids, but it’s not always clear why this differential exists, and some companies seemingly overbid by large margins. This team will consult with professionals at ExxonMobil to curate and analyze historical bid data from the Bureau of Ocean Energy Management that contains information on company bid history, infrastructure, wells, and seismic survey data as well as data from the companies themselves and geopolitical events. The stretch goal of the team will be to see if they can uncover the rationale behind historic bidding patterns. What do the highest bidders know that other bidders do not (if anything)? What characteristics might incentivize overbidding to minimize the risk of losing the right to produce (i.e., ambiguity aversion)?

In this project, we are interested in creating a cohesive data pipeline for generating, modeling and visualizing basketball data. In particular, we are interested in understanding how to extract data from freely available video, how to model such data to capture player efficiency, strength and leadership, and how to visualize such data outcomes. We will have four separate teams as part of this project working on interrelated but separate goals:

Team 1: Video data extraction

This team will explore different video data extraction techniques with the goal of identifying player locations, ball location and events at any given time during a basketball game. The software developed as part of this project will be able to generate a usable dataset of time-stamped basketball plays that can be used to model the game of basketball.

Teams 2 & 3: Modeling basketball data: offense and defense

The two teams will explore different models for the game of basketball. The first team will concentrate on modeling offensive plays and try to answer questions such as: How does the ball advance? What leads to successful plays? The second team will concentrate on defensive plays: What is an optimal strategy for minimizing opponent scoring opportunities? How should we evaluate defensive plays?

Team 4: Visualizing basketball data

This team will work on dynamic and static visualization of elements of a basketball game. The goal of the visualization is to capture information about how players and the ball move around the court. They will develop tools to represent average trajectories in these settings that can also capture uncertainty about this information.

A team of students led by researchers in the Energy Access Project will develop means to evaluate non-technical electricity losses (theft) in developing countries through machine learning techniques applied to smart meter electricity consumption data. Students will use data from smart meters installed at transformers and households through a randomized control trial. Students will develop algorithms that can be used to detect anomalies in the electricity consumption data and create a dataset of such indicators. This project will provide researchers with new ways of incorporating electricity consumption data and applications for electricity utilities in developing country settings.

A team of students, in conjunction with Duke’s Office of Information Technology, will use Duke’s network traffic data to perform IoT device behavioral fingerprinting that can be employed to identify device types. The data will be used to analyze trends and risks, develop security best practices, and build machine learning models that can be used to detect similar device types. Students will work directly with the network data, have access to the analytics tools used in OIT, and have a great opportunity to explore the data in consultation with OIT network, security and data analytics professionals.

Interested in understanding the types of attacks targeting Duke and other universities? Led by OIT and the IT Security Office, students will learn to analyze threat intelligence data to identify trends and patterns of attacks. Duke blocks an average of 1.5 billion malicious connection attempts/day and is working with other universities to share the attack data. One untapped area is research into the types of attacks and learning how universities are targeted. Students will collaborate with the security and IT professionals in analyzing the data with the intent to discern patterns.

Saltwater intrusion and sea level rise are issues of serious concern for people throughout the coastal plain. Our Data+ team will collaborate with researchers to create an interactive data visualization platform that compiles remotely sensed estimates of vegetation change throughout the coastal plain and links this data with field salinity estimates. The team will have the opportunity to build educational website content that a) explains how saltwater incursion occurs; b) describes the consequences for coastal forests; c) links this understanding with likely scenarios of coastal climate for the next decade. In each case, we would like to illustrate this content with interactive data graphics.

A team of students led by faculty and researchers from the School of Medicine, the Center for Global Reproductive Health at the Duke Global Health Institute, and the Duke Evidence Lab will collaborate on the user interface for a tool developed to help advocates and policymakers target family planning resources to key populations in low-resource settings. Team members will traverse the app development lifecycle while contributing to a tool that can improve global reproductive health.

A team of students led by Drs. Aquino (Engineering) and Routh (Urologic Surgery) will develop objective algorithms in order to guide data interpretation from a urology test, known as urodynamics, which is used in children with spina bifida in order to define a patient’s risk of debilitating bladder and kidney complications. Urodynamics involves dynamic pressure monitoring as the bladder is filled with fluid. This project is part of a 21-institution collaboration coordinated and funded by the U.S. Centers for Disease Control and Prevention (CDC), with the long-term goal of defining optimal management strategies for children with spina bifida. The short-term goal of this Data+ application is to define initial features of urodynamics that can be applied to increasingly complex future algorithms in order to guide clinical interpretations that determine whether, for example, children need reconstructive surgery in order to avoid complications of their disease.

This team will explore how to develop machine learning techniques for analyzing satellite imagery data for identifying energy infrastructure that can be trained once and applied almost anywhere in the world. Led by researchers from the Energy Data Analytics Lab and the Sustainable Energy Transitions Initiative, the team will design two datasets: the first containing satellite imagery from diverse geographies with all energy infrastructure labeled, and the second a synthetic version of the same imagery. These data will enable research into whether synthetic imagery may be used to adapt algorithms to new domains. The better these techniques adapt to new geographies, the more information can be provided to researchers and policymakers to design sustainable energy systems and understand the impact of electrification on the welfare of communities.

A team of students led by Jim Heffernan, Nick Bruns, and partners at UNC and EPA will create interactive data visualizations of water quality data in rivers and lakes of the United States. These tools will aid environmental scientists, managers, policy-makers, and students who want to investigate patterns of water pollution across broad scales of space and time. Students will gain experience with manipulation of large data sets, geospatial analysis, and remote sensing of water quality parameters. Opportunities include developing visualization tools to represent spatial and temporal coverage of water quality parameters, georeferencing field observations and remote sensing satellite overpasses with field observations, and assessing spatial and temporal gaps in observations for a variety of water quality parameters.

Duke must reduce its energy footprint as it strives for carbon neutrality by 2024. To help this cause, a team of students will review troves of utility usage data and attempt to build an attractive and practical monthly energy use report for every building and school at Duke. This report will not only show historical usage but also provide an energy benchmark for comparison and conservation tips that local administrators can act on. Duke Energy has been using a similar report to encourage conservation at the residential level for years. It is time to bring energy use transparency to the broader Duke community and inspire action.

A team of students partnering with Duke University Libraries will explore the complicated decision space of electronic journal licensing. Electronic resources like journal articles are a major service provided by academic libraries, but the choice of what journal subscriptions to purchase can be costly and time consuming, and journal distribution companies like Elsevier manipulate their journal bundles to maximize their own profits. This team will build a model for journal purchasing by combining several years of journal usage data (including views, downloads, authorship, citations, and impact) with journal cost data. The team will work on software to improve the data cleaning and analysis process and will create visualizations and dashboards to assist the library in its decision-making efforts. Because many libraries have the same concerns about journal bundles and use the same kinds of data to make these decisions, this project may have far-reaching impacts among academic libraries.

Neuroscience evidence (e.g., brain scans, mental-illness diagnosis) is increasingly being used in criminal cases to explain criminal behavior and lessen responsibility. A team of students led by researchers within the Science, Law, and Policy Lab will explore a national set of criminal cases in which neuroscience evidence is used to see what aspects of the criminal trial (e.g., offense, age of offender) may predict the outcome of future cases. Additionally, with the use of our comprehensive 10-year judicial opinion data set (2005-2015), the team will collaborate on creating a computer algorithm to assist in locating and coding online judicial opinions to build upon our comprehensive list of opinions. This tool will provide a strong foundation for the work of understanding neuroscience’s role within a criminal court setting.

A team of students will use a variety of data sets and mapping technologies to determine a feasible location for a deep-sea memorial to the transatlantic slave trade. While scholars have studied the overall mortality of the slave trade, little is known about where these deaths occurred. New mapping technologies can begin to supply this data. Led by English professor Charlotte Sussman, in association with the Representing Migrations Humanities Lab, this team will create a new database that combines previously-disparate data and archival sources to discover where on their journeys enslaved persons died, and then to visualize these journeys. This project will employ the resources of digital technologies as well as the humanistic methods of history, literature, philosophy, and other disciplines. The project welcomes students from a broad range of disciplines: computer science; mathematics; English and literature; history; African and African American studies; philosophy; art history; visual and media studies; geography; climatology; and ocean science.

Students will collaborate with staff at DataWorks NC and the Eviction Diversion Program to explore and develop means of using evictions data to drive meaningful policy change that helps Durham residents stay in their homes. Students will clean and assess the quality of evictions data, look for seasonal and geographic variation in eviction rates, analyze the relationship between evictions, rents, wages and other economic indicators, develop metrics for the real financial cost of evictions, and build static visualizations or a data dashboard to communicate their results. This project will help housing advocates in Durham assess the impact of their current work, and understand which future interventions will be most impactful.

The American public first encountered the term “genocide” in a Washington Post op-ed published in 1944; since then, the word’s meaning has been circulated, debated, and shaped by numerous forces, especially by words and images in newspapers. With the support of Dr. Priscilla Wald (English), a team of students led by Nora Nunn (English graduate student) and Astrid Giugni (English and ISS) will analyze how U.S. mass media—particularly newspapers—enlist text and imagery such as press photographs to portray genocide, human rights, and crimes against humanity from World War II to the present. From the Holocaust to Cambodia, from Rwanda to Myanmar, such representation has political consequences. If time allows, students will also study the representation of collective violence in Hollywood film, querying the relationship between human rights and genre. The implications of these findings could inform future coverage of human rights-related issues at home and abroad.

A team of students led by history professor Sarah Deutsch will do data mining in newspaper and Congressional databases to investigate the dynamics behind the excess profits tax laws Congress passed between 1918 and 1948 and the concept of price gouging, which continues to shape legislation today. As of 2018, numerous states have price gouging laws. Why? How did they define what was excessive? How did this critique of profit-making become mainstream without endangering capitalism? By searching extant newspaper and Congressional databases for the frequency and context of particular words and phrases, the project will begin to uncover the logic and language, and the partisanship or lack of it, used to critique profits at three moments in U.S. history that resulted in government action to limit profit-making.

A team of students led by researchers from the Michael W. Krzyzewski Human Performance Laboratory (K-Lab) will develop an analytic and report-generating web-based application to help the K-Lab reduce musculoskeletal injuries in student-athletes at Duke University. This tool will produce actionable, student-athlete-specific reports that incorporate the analysis of previous injury history and current capabilities (K-Lab assessments) in order to identify injury risk and develop individualized recommendations for injury prevention. Students will develop analytic tools and scoring criteria to assess injury risk through profiling of data based on minimally clinically important differences, injury profiles, peer group analysis, and injury risk scoring strategies based on a comprehensive set of performance metrics. Injury risk identification will be further enhanced by clustering data analysis around joint or tissue specific injury risk, previous injury history, and athlete capabilities (strength, flexibility, and postural stability). The final deliverable will enhance injury prevention strategies for student-athletes and other populations by bridging the analytic gap between injury risk screening and actionable injury prevention strategies.

Data-enabled approaches present new opportunities to analyze responses of aquatic ecosystems to stressors and to illustrate scientific findings in new formats that are more widely accessible. Our goal is to create a web-based storytelling platform that illustrates the results of freshwater ecosystem studies conducted at the IISD-Experimental Lakes Area in Canada (https://www.iisd.org/ela/). Students on our team will process historical datasets and develop interactive data visualization tools for public outreach on freshwater ecology and conservation. This project is led by water resources professor Kateri Salk (Nicholas School of the Environment) and staff at the IISD-Experimental Lakes Area.

A team of students led by faculty and students in Duke's River Center will manipulate, model and visualize time series data derived from hundreds of rivers throughout the world. Students will gain experience working with large datasets derived from environmental sensors and will be able to direct the data project based on their learning interests. Opportunities include developing machine learning tools for data processing and pattern recognition, building software and web interfaces to enable cloud computing, and creating interactive graphics aimed at explaining scientific concepts using Big Data. Tools developed through this project will be hosted on the StreamPulse web platform (streampulse.org).

This team will collaborate with Durham’s Crisis Intervention Team, a group of law enforcement, fire, and EMS personnel who are specially trained to interact with citizens in mental health crisis. We will analyze data from the Durham County Jail to track repeat arrests by persons with or without mental illness, along with their use of mental health and other services in the Duke Health System. By the end of the summer, we will report findings and recommendations to the Crisis Intervention Team and Durham’s Stepping Up Initiative.

Have you ever read a book or watched a movie and realized that you have seen the same story before? How do you know if you are watching an adaptation? A team of students led by UNC-Chapel Hill graduate student Grant Glass will develop means to track the movement of adaptations within contemporary culture through machine learning techniques. Drawing upon a variety of textual information from historical and digital sources, the project team will have the opportunity to work with many different types of data. Students will identify features of different master narratives, which will be used to demonstrate how certain stories are modified and retold over and over again. After creating this training dataset, the team will use algorithms to identify adaptations in previously unidentified works. This will allow scholars to better understand at scale how certain narratives are adapted into new stories and forms.

A team of students led by researchers in the Global Financial Markets Center at Duke Law will collect and analyze home mortgage market data that was publicly available during the run-up to the Financial Crisis (1997 – 2007), including (i) size of the market, (ii) composition of the market (conforming v. non-conforming), (iii) home ownership rates, (iv) originators (depository institutions v. non-depositories), (v) default and foreclosure rates, (vi) assessments of the market by supervisory and regulatory agencies, (vii) press coverage of the mortgage market, and (viii) public statements by governmental leaders about home mortgages. Analyzing and presenting this data will allow the team to understand what information was publicly available to policymakers preceding the Crisis. The data will also be used to inform the oral histories of key policymakers that will be collected during a Bass Connections project that will begin in the fall of 2019.

Modern Energy Group (MEG) finances and operates various distributed energy resources operating in wholesale energy markets, ranging from solar panels to residential smart thermostats. MEG also does financial trading when it identifies arbitrage opportunities in these markets. One of MEG's main operational risks is the very high volatility in wholesale real time (or spot) energy prices. Where stock markets consider a 30% change in price large, energy markets routinely face changes in price on the order of 300%. This high volatility comes from three main "shocks": 1. power demand changes, due to unpredictable weather, industrial patterns, or human consumption; 2. fuel shortages, driven by trade, extraction/exploration, and gathering/transportation economics; 3. electrical transmission outages, driven by operational failure, extreme weather events, and human behavior.

First, this project team will identify what should be considered an "extreme" price shock from 5-10 years of historical data in PJM. Second, the team will work to automatically identify potential causes for the rare events from news articles, public filings, and MEG's own structured data. Third, the team will build reasonable priors for the occurrences of these rare events, and incorporate potential covariance between the events using copulas or similar methods. Finally, the team will create a simple classifier such as logistic regression to predict the likelihood of a price shock on a given day. The model needs to be evaluated with a walk-forward backtest, training on about 3 years of data at a time, and shifting forward the training window in approximately one-month increments, to smooth out potential bias and overfitting in the model.
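To illustrate the walk-forward backtest described above, here is a minimal sketch of how the rolling windows might be generated. The function names, the 36-month training window, the one-month test window, and the example date range are all assumptions chosen for illustration; the project description specifies only "about 3 years" of training data and "approximately one-month" shifts.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date forward by a whole number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def walk_forward_windows(start: date, end: date,
                         train_months: int = 36, test_months: int = 1):
    """Yield (train_start, train_end, test_end) tuples.

    Intervals are half-open: train on [train_start, train_end) and
    evaluate on [train_end, test_end). Window lengths are illustrative
    assumptions, not values fixed by the project description.
    """
    train_start = start
    while True:
        train_end = add_months(train_start, train_months)
        test_end = add_months(train_end, test_months)
        if test_end > end:
            break
        yield train_start, train_end, test_end
        # Slide the whole window forward by one test period.
        train_start = add_months(train_start, test_months)


# Hypothetical five-year history, January 2012 through December 2016.
windows = list(walk_forward_windows(date(2012, 1, 1), date(2017, 1, 1)))
for train_start, train_end, test_end in windows[:3]:
    print(f"train {train_start} to {train_end}, test up to {test_end}")
```

Each window would be used to fit a fresh model (for example, the logistic regression mentioned above) on the training interval and score it only on the month that follows, so that no evaluation day is ever seen during training.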

A team of students led by statistics professor Jie Ding from the University of Minnesota will develop algorithms to recognize human emotions (e.g., calm, happy, angry) from audio speech data, and to incorporate new emotions into existing speech. By applying machine learning techniques to various speech datasets, students will identify features of human speech that can represent emotions, develop software to perform emotion recognition, and synthesize emotional speech data. Students will also have the opportunity to create their own dataset and apply their methods to it for training and testing. This work will allow further research in the direction of speech emotion analysis, and may result in new designs of human-computer interfaces.