Reflection: 2017 BC Data Science Workshop

Summary

This is a brief summary and some incomplete reflections from my time as the workshop TA for the 2017 BC Data Science Workshop.

About

The BC Data Science Workshop was hosted August 9 – 25 in downtown Vancouver at UBC Robson Square. It was co-organized by me, Brian Wetton and Lee Rippon from the IAM, with support from PIMS and SFU Mathematics.

The workshop featured a wide array of material from experts across the west coast, structured into three sections: a 3-day pre-workshop intensive; a week of introductory material; and a week-long data science project led by industry mentors.

First week

The mornings of the first week consisted of introductory lectures by Isabell Konrad (UC Berkeley), Yinshan Zhao (BC Health) and Michael Reid (Amazon). Topics covered elements of machine learning and data science; hypothesis testing and experimental design; and modern software tools (SQL, Hadoop, Flink, Kafka, ElasticSearch, etc.). In the afternoons, participants completed mini-projects related either to the morning material or to tools that would be integral to a project in the second week. The mini-projects covered regression, neural networks, matrix completion, data wrangling and exploration, and distributed and parallel computing with Apache Spark and TensorFlow on GPU.

Second week

The mornings of the second week consisted of “advanced topics”: professors from UBC or SFU spoke on elements of their research that rely on or develop tools from data science. The rest of the time was devoted to group projects. Teams of 5 – 7 worked on projects brought by industry mentors whose solutions would feature elements of data science. These included two kinds of image processing; nonlinear function approximation for video compression bitrate analysis; deep neural networks for genetics research; and analytics from vehicle time series messages.

Opportunities and challenges

My role as TA

It was most rewarding to learn how to think in different modes with
teams of diverse individuals working on very diverse projects. One
evening, I got to help create an interpretable way to balance highly
imbalanced data; the next morning I had to help design a way to train
a deep model using images on a remote server. After that, I
contributed to a brainstorming session on how to best cluster data
that was too big for memory. In general, I got to contribute ideas for
project strategies, available software tools, and design/algorithm
troubleshooting.

Designing mini-projects

One unplanned outcome of the workshop was that I designed three of the five
mini-projects in the first week.

Designing these consumed a lot of effort and time, too much given the other
organizational duties. Consequently, there are places where the work remains
unpolished. Nevertheless, designing these was a great opportunity to explore how
to communicate new concepts in an immersive, interactive way.

Project duration

We may have given the teams too little time to work on their industry projects.
Effectively, they received their projects Monday afternoon, had advanced
lectures through the week, and had to present on Friday afternoon. A longer
duration would clearly have allowed significantly greater progress on the teams’
projects. That being said, each of the teams made an impressive amount of
progress, which is discussed below.

About the projects

Data-driven modelling of video compression

This project used aggregate camera data to determine a function that could
accurately predict the bitrate for given camera settings. The group tried
several convex optimization approaches and settled on a particular flavour of
random forest, boasting accuracy well above what is considered “industry
standard”.

Risk-based platform for accident prevention

I helped to design a skeleton for this project, wherein the team would take raw
photo data from BC Safety Authority inspections and use it to predict compliant
and non-compliant objects using a combination approach of active learning and
transfer learning. The training would have to be performed by downloading batches
of images from an AWS S3 bucket, since there were too many images to fit on the
VM disk we were using. It is likely that such a model will require more complex
structure than a standard image recognition ConvNet, such as topic modelling or
LDA.
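
As a rough illustration of the batch-wise training idea, the sketch below
streams images in fixed-size batches so that only one batch is held locally at a
time. It is not the code the team used. The names iter_batches and fake_fetch
are hypothetical, and fake_fetch is only a stand-in for whatever call would
actually download an object from the bucket.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(keys: Iterable[str], batch_size: int,
                 fetch_image: Callable[[str], bytes]) -> Iterator[List[bytes]]:
    """Yield lists of at most `batch_size` images, fetched lazily."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    key_iter = iter(keys)
    while True:
        batch_keys = list(islice(key_iter, batch_size))
        if not batch_keys:
            return
        # Only this batch is materialised; earlier batches can be discarded.
        yield [fetch_image(key) for key in batch_keys]


def fake_fetch(key: str) -> bytes:
    # Stand-in for a real download from remote storage (assumption).
    return key.encode("utf-8")


if __name__ == "__main__":
    keys = [f"img_{i}.jpg" for i in range(7)]
    for batch in iter_batches(keys, batch_size=3, fetch_image=fake_fetch):
        print(len(batch))  # a training step on this batch would go here
```

Because the generator pulls keys lazily, the full list of images never has to
exist on disk; a training loop would consume one batch, update the model, and
move on.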

The team reduced the complexity of this task by classifying images of flowers
using a transfer learning approach. In this approach, they used bottlenecking
to generate feature vectors by running the images through the bottom several
layers of a VGG16 network. Then, they trained a binary classifier on this
markedly smaller feature space to discriminate between species of flowers. This
particular approach has immediate generalization potential to the more complex
problem of discerning compliant and non-compliant objects from inspection
photos.
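
To make the bottlenecking idea concrete, here is a minimal sketch under heavy
assumptions. A fixed random projection with a ReLU stands in for the frozen
lower layers of VGG16, the "images" are synthetic vectors, and the classifier is
a plain logistic regression trained by gradient descent. None of this is the
team's code; all names and sizes are made up for illustration.

```python
import numpy as np


def frozen_features(x: np.ndarray, w_frozen: np.ndarray) -> np.ndarray:
    """Stand-in for the frozen lower layers of a pretrained network."""
    return np.maximum(x @ w_frozen, 0.0)  # linear map + ReLU, never updated


def train_logistic(feats, labels, lr=0.02, steps=2000):
    """Fit a binary logistic-regression head on precomputed features."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        grad = p - labels  # gradient of the log-loss w.r.t. the logits
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b


def predict(feats, w, b):
    return (feats @ w + b > 0).astype(int)


rng = np.random.default_rng(0)
# Synthetic "images": two classes of 64-dimensional vectors (made-up data).
n, d, k = 200, 64, 8
labels = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, d)) + 2.0 * labels[:, None]

w_frozen = rng.normal(size=(d, k)) / np.sqrt(d)  # random, not pretrained
bottleneck = frozen_features(x, w_frozen)        # computed once, then reused
w, b = train_logistic(bottleneck, labels)
accuracy = (predict(bottleneck, w, b) == labels).mean()
print(f"training accuracy on bottleneck features: {accuracy:.2f}")
```

The point of the design is that the expensive feature extractor runs once per
image, and only the small classifier on the 8-dimensional features is trained.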

Elucidating enhancer-promoter gene expression using ConvNets

This group developed a convolutional neural network whose first layer filters,
after [agnostic] training, were composed of significant and known gene promoter
regions. This convolutional neural network was designed to predict the efficacy
of gene expression for particular enhancer-promoter pairs. This team made
significant progress in the time they had, and made steps toward a
reverse-complement invariant machine learning model.
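
For readers unfamiliar with the term, a model is reverse-complement invariant if
it gives the same output for a DNA sequence and for its reverse complement. The
toy sketch below shows one common way to obtain this, by scoring both strands
and keeping the larger score. It is not the team's network: the motif scanner
stands in for a single convolutional filter, and the motif and sequence are
illustrative only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def motif_score(seq: str, motif: str) -> int:
    """Best number of matching bases of `motif` over all windows of `seq`."""
    width = len(motif)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(a == b for a, b in zip(seq[i:i + width], motif))
        for i in range(len(seq) - width + 1)
    )


def rc_invariant_score(seq: str, motif: str) -> int:
    """Score both strands and keep the larger value."""
    return max(motif_score(seq, motif),
               motif_score(reverse_complement(seq), motif))


if __name__ == "__main__":
    seq = "GGGTATAAAGGC"   # illustrative sequence
    motif = "TATAAA"       # illustrative motif
    print(rc_invariant_score(seq, motif))
    print(rc_invariant_score(reverse_complement(seq), motif))
```

Since taking the reverse complement twice returns the original sequence, the two
printed scores are always equal, whichever strand is supplied.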

Data insights from vehicle time series messages

The goal of this project was to learn novel insights from vehicle time series
messages, logged by in-car devices that record the car’s state, position,
etc. This team made significant progress toward the problem of discriminating
multiple drivers of the same vehicle. For this task, the team used Bayesian
methods and k-means.

Land type classification from drone image data

This project sought to classify land type from photogrammetric drone image
data. The data was in the format of an unstructured sparse point cloud, listing
colour as well as latitude, longitude and elevation. The team took the approach
of looking at local variation in elevation, as well as point colour. The team
used an out-of-memory mini-batch k-means algorithm which clustered features
generated from a nearest neighbours tree. My own approach to this problem can be
seen here.
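
For context, the core of a mini-batch k-means update can be sketched in a few
lines. This is a generic illustration, not the team's implementation and not the
approach linked above. The synthetic 2-D points stand in for the features
generated from the nearest neighbours tree, and the function names and initial
centers are hypothetical.

```python
import random
from typing import Iterable, List, Sequence

Point = Sequence[float]


def nearest(point: Point, centers: List[List[float]]) -> int:
    """Index of the center closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, ctr)) for ctr in centers]
    return dists.index(min(dists))


def minibatch_kmeans(batches: Iterable[List[Point]],
                     init_centers: List[Point]) -> List[List[float]]:
    """Update centers one mini-batch at a time; only one batch is in memory."""
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for batch in batches:
        # Assign the whole batch first, then move the centers.
        assignments = [nearest(p, centers) for p in batch]
        for point, j in zip(batch, assignments):
            counts[j] += 1
            step = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - step) * c + step * p
                          for c, p in zip(centers[j], point)]
    return centers


def synthetic_batches(n_batches: int, batch_size: int, seed: int = 0):
    """Generate made-up 2-D feature batches around (0, 0) and (5, 5)."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        yield [
            [rng.gauss(mu, 0.5), rng.gauss(mu, 0.5)]
            for mu in (rng.choice((0.0, 5.0)) for _ in range(batch_size))
        ]


if __name__ == "__main__":
    centers = minibatch_kmeans(synthetic_batches(20, 50),
                               [[1.0, 1.0], [4.0, 4.0]])
    print(centers)
```

Because the batches come from a generator, the full point cloud never has to be
loaded at once; in practice each batch would be read from disk instead.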