April 12, 2015

Last weekend I attended the CodeNeuro NYC
conference, a gathering for neuroscientists, programmers, and everyone in
between. It was a unique experience because it had both the excitement and
energy of a tech conference and the pure, unadulterated zest for knowledge of
a scientific conference.

Day 1: Talks and panel

The first day of the conference started after the workday on Friday at 5pm at
the New Museum in Little Italy, Manhattan, NYC.
This was the first time I’d been to this place and it was really awesome, even
from the outside! Check it out, plus some other photos I took from the conference:

(FYI, if you want to look through all the tweets from the entire event, check out this Storify board.)

The conference kicked off with some mingling and beer. I gravitated towards
the three people I knew coming into the conference: Nick Sofroniew, with whom
I did an undergrad summer program and who invited me to talk; Jeremy Freeman,
the main organizer of the whole freakin’ thing; and Ben Sussman, who co-taught
the gitgoing tutorial with me and is an all-around awesome person. The space
was rad, we had a balcony and a nice view of the city, and I got to meet a
bunch of people doing super interesting things in neuroscience.

We got started with the talks, which ran in blocks of three 10-minute (ish) talks followed by a half-hour (ish) break. It was a nice format: between blocks we could chat, not only to get a mental break but also to discuss the concepts presented by the previous talks in the context of neuroscience. Here are all the talks:

As you can imagine from the caliber of all the other talks, I was super nervous for mine, because I’m not doing things at exabyte scale, I’m not doing streaming data, and I’m not even doing neuroscience! I’m just analyzing static datasets with the usual machine learning algorithms. But as Jeremy pointed out, we need better tools for analyzing the existing non-big-data datasets that we have, and I got great questions and feedback after the talk.

My top three favorite talks were:

NYT R&D demo. This was a fantastic talk about streaming data analytics. The
speaker and his team had created an interactive way to analyze real-time data.
They had a GUI where you could select boxes that represented an action on
data (extract, transform, load aka
ETL-like things), and
you could string these boxes together to create an analysis pipeline. It
reminded me a lot of the Galaxy interface for
reproducible bioinformatics pipelines. Check out the tweet below for a
screenshot:

Optogenetics. This was a really interesting talk about optogenetics, a way to control the brain with light. Researchers genetically insert channelrhodopsin, a light-sensitive protein related to the rhodopsin that makes our eyes work, into neurons, which makes those neurons light sensitive. As a result, you can turn specific neurons on and off with light! This talk used optogenetics to stimulate/inhibit specific neurons, and calcium imaging to record their responses.

Spark. This was an awesome talk by Paco Nathan
on using big data tools. Paco had worked with both Hadoop and Spark and gave a
great overview of the history of Big Data (which he pinpointed to Q3 1997),
the different papers about it and tools for it, and where Spark fits in that
ecosystem. He gave a few use cases and was overall just a very entertaining
presenter.

Great talk by @pacoid at @CodeNeuro just now - apparently rotational bearings on planes create 12 exabytes of data / day… wow #bigdata

After all the talks, there was a panel led by Jeremy with some giants of
neuroscience: Tony Movshon, Eve Marder, and Larry Abbott. A recurring theme
was that the fundamental questions of how the brain works haven’t changed, but
the tools have drastically changed.

One story that stuck with me was Eve describing how, when she was doing
experiments, she used an
oscilloscope to listen (yes, with
her ears! I thought this was amazing!) to the measurements from neurons, and
there was a direct, physical, visceral connection: when you designed an
experiment and heard specific tones, you knew exactly what they meant,
because you were intimately familiar with the exact tones and sounds produced
by the device. But now, she said, the high-throughput methods, while capable
of measuring many things at once, are so detached from the physicality of
neurons firing that she feels people aren’t connected to what they’re
measuring anymore. I found that interesting because in my field of RNA
biology, we’ll do these newfangled high-throughput sequencing
experiments to measure RNA expression as a proxy for protein
abundance, and study the alternative
splicing of these RNA
transcripts, but every time, we always have to validate our findings using an
older, low-throughput method like RT-PCR. I don’t know what the equivalent
experiments are in neuroscience, because it seems all very magical to me that
they’re able to manipulate individual neurons in mice, and I also wonder
whether the neuron in the exact same position in one animal and in another
does the exact same thing. So far, biology has relied on laws of averages: if
you do a high-throughput experiment on a bunch of cells and get an average
signal, you should get the same result if you do a low-throughput version of
that experiment. I don’t know what the equivalent of that is in neuroscience,
but it was refreshing to hear that neuro has had its own share of growing
pains as it adopted high-throughput techniques, and that biology wasn’t the
only one.

It was interesting to hear about the challenges of neuroscience: how excited
people are about optogenetics, how many people are apparently
skeptical of functional magnetic resonance
imaging
(fMRI) data, and how nobody knows what the cerebral
cortex does!

Day 2: Tutorials + Hackathon

The next morning we were in the New Inc coworking
space, and started off with some kickass bagels and lox/salmon:

Then, people split up about 50/50 for the tutorials and the hackathon.

gitgoing tutorial

Ben Sussman and I taught a tutorial called
gitgoing to quickly teach scientists
the version control system
git and code testing via
py.test and continuous integration via Travis-CI. We had about 20-30 attendees. The goal was to get the
scientists acquainted with common tools in open source software so that they
could contribute themselves. It was kind of like a mini Software
Carpentry workshop, but we assumed our target
audience had some coding experience, and we didn’t take the time to explain
computer science fundamentals like variables, loops, flow control, etc.

Our class was structured exactly as laid out in the README.md file. First we
set up their computers so they had git and Python 2.7, which took about an
hour to get everyone done. Some people finished faster and started moving on
to the git section. Then Ben gave an awesome explanation of git, and I learned
a bunch of stuff! I didn’t realize that when you git clone a repo, you’re
getting the ENTIRE history of the project, which explains why downloading all
of IPython/Jupyter takes forever. It was
also a really helpful analogy to describe the entire repository as an “ocean
of code,” and a branch as a single window into that ocean. We also talked
about merge conflicts, and how easy they are to create if, say, someone
renamed one of the arguments of a function, someone else added an argument,
and now git doesn’t know what to do anymore. They picked up on the concepts
pretty quickly, and someone asked, “Well, git thinks it’s okay, but how do
you know the code will run?” Which brought us directly to testing!
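
To make the merge-conflict scenario concrete, here is a hypothetical sketch in Python (the function and argument names are made up, not from the tutorial): two branches edit the same line of the same function in incompatible ways, and git has no way to choose between them.

```python
# Hypothetical merge-conflict scenario (names invented for illustration).

# Both branches start from this function on master:
def normalize(data):
    return (data - data.mean()) / data.std()

# Branch A renames the argument:
def normalize(values):
    return (values - values.mean()) / values.std()

# Branch B instead adds a new argument:
def normalize(data, axis=0):
    return (data - data.mean(axis=axis)) / data.std(axis=axis)

# Merging branch A into branch B touches the same lines in incompatible
# ways, so git stops with a conflict and asks a human to resolve it.
```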

Next, I talked about testing and why it’s important. I wrote some simple Python
code with functions like mean_plus_one, std_plus_one, and cv (coefficient
of variation). They were just slight variations on the true
numpy functions so the learners couldn’t just use the
numpy version. We looked at the test file, test_gitgoing.py, which used
py.test’s fixtures, which take care of the setUp and tearDown steps that
some other testing frameworks have. We saw a simple example of a fixture that
creates a 20x10 matrix of normally distributed random numbers. The values
could have just been integers; I wanted to illustrate how you can create new
fixtures from existing ones.
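
For flavor, here is a minimal sketch of what that setup might have looked like; the exact code lives in the gitgoing repo, so treat the function body and the derived fixture name here as illustrative assumptions rather than the repo’s actual code.

```python
# test_gitgoing.py -- illustrative sketch, not the repo's exact code
import numpy as np
import pytest


def mean_plus_one(data):
    # Slight variation on numpy's mean, so learners can't just call numpy
    return np.mean(data) + 1


@pytest.fixture
def data():
    # 20x10 matrix of normally distributed random numbers
    return np.random.randn(20, 10)


@pytest.fixture
def integer_data(data):
    # A new fixture built from an existing one (here, by casting to ints)
    return data.astype(int)


def test_mean_plus_one(data):
    # py.test injects the fixture by matching the argument name
    np.testing.assert_allclose(mean_plus_one(data), data.mean() + 1)
```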

There is a commented-out broken test in the repo which they had to fix. This was a great formative assessment because they had to use their newly formed git and testing knowledge to fix the test, commit the changes, make a branch for their “feature” of fixing the test, push the branch, and make a pull request to the master branch. It was really rewarding for the students to see their pull requests on Github, and to see their commits in the network of contributors to the gitgoing repo.

We unfortunately didn’t advertise the #gitgoing hashtag before the tutorial, so we didn’t get any live-tweets, but Ben and I made up for it and took a picture afterwards:

Spark tutorial

After the gitgoing tutorial we had a break and mingled. Then, I went to the Spark tutorial, taught by Paco Nathan from DataBricks (slides here). I’ve seen Spark demos before but I haven’t put in the time to play around with the tools, so this was a great way to get exposed!

I was a little late to the tutorial, so I missed the initial setup. I was handed a slip of paper with a URL, username, and password that was my personal login to the DataBricks cloud. Paco was doing a “preflight check,” explaining different Spark concepts before we dove in. The key things I took away were:

RDD: Resilient Distributed Dataset. This is the core unit of a Spark analysis: you load in data and tell Spark that you want it parallelized.

sc: “Spark context.” This is the object that Spark provides in its Python interface, and the one you will be using to create and operate on datasets.
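
Putting those two together, here is a minimal PySpark sketch, assuming a local SparkContext rather than the DataBricks cloud (where sc is created for you):

```python
from pyspark import SparkContext

# On the DataBricks cloud, `sc` already exists; locally we create our own
sc = SparkContext("local", "codeneuro-demo")

# parallelize() turns a plain Python list into an RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations like map() are lazy; actions like collect() run the job
squares = rdd.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```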

We used an IPython notebook-style REPL interface, which was really nice because we could see the output from our commands right away, and continued with learning the basic RDD operations.

In the next part of the tutorial, we learned how to use flatMap, and then how to join several RDDs on their (key, value) pairs. After the break, we had a mini lecture about “Computational thinking,” where we had to use what we had learned so far to find all the instances of the word “Spark” in two files; a sketch of that exercise is below. If you’re interested, you can see my full notes here.
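
Here is roughly what that exercise could look like in PySpark; the file names are placeholders I made up, and the join at the end is only there to illustrate (key, value) joins, not part of the word count itself.

```python
from pyspark import SparkContext

sc = SparkContext("local", "spark-word-exercise")

# Placeholder file names -- the tutorial used files on the DataBricks cloud
lines_a = sc.textFile("file_a.txt")
lines_b = sc.textFile("file_b.txt")


def count_word(lines, word="Spark"):
    # flatMap splits each line into words and flattens them into one RDD
    return (lines.flatMap(lambda line: line.split())
                 .filter(lambda w: w == word)
                 .count())


print(count_word(lines_a) + count_word(lines_b))

# Joining two RDDs on their (key, value) pairs: join() pairs up the values
# of every key that appears in both RDDs
a = sc.parallelize([("spark", 1), ("hadoop", 4)])
b = sc.parallelize([("spark", 2)])
print(a.join(b).collect())  # [('spark', (1, 2))]

sc.stop()
```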

Hackathon

The hackathon was centered around the neurofinder challenge, which is to try and extract neuronal signals from calcium imaging data. The goals were to:

Work on evaluation metrics for algorithms and agree on ground truth definition

Work on incorporating existing algorithms into this API

Work on the frontend/backend of a website that would automatically run submitted algorithms on the test data, get the results and upload them to a leaderboard

First, everyone introduced themselves and the group organically split into a few groups: the website, API input/output formatting, implementing algorithms, and designing metrics.

The website team was composed of five people with web development frontend/backend experience. They generated some prototype websites and code, all available on github, and hopefully launching in beta soon!

Another group of about 25 worked on the API and the input/output formats, with their final notes added to this wiki.

Another group of 10-20 split off to work on incorporating existing algorithms into this API. The rest worked on defining evaluation metrics for algorithms and the “ground truth” definitions of “what is a neuron” for these methods. They implemented some evaluation metrics, and developed an initial ground truth definition based on manual centers and morphological boundaries. They discussed that this initial “ground truth” risks circularity, but is still a solid start.
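
To give a feel for what such an evaluation metric could look like, here is a toy sketch that matches detected neuron centers to manually annotated ones; the greedy matching rule and the 5-pixel threshold are my assumptions, not what the group actually implemented.

```python
import numpy as np


def match_centers(true_centers, found_centers, max_dist=5.0):
    # Greedily match each ground-truth center to the nearest unclaimed
    # detection within max_dist pixels (the threshold is an assumption)
    matched = 0
    remaining = [np.asarray(f, dtype=float) for f in found_centers]
    for t in true_centers:
        t = np.asarray(t, dtype=float)
        dists = [np.linalg.norm(t - f) for f in remaining]
        if dists and min(dists) <= max_dist:
            matched += 1
            remaining.pop(int(np.argmin(dists)))
    precision = matched / float(len(found_centers)) if found_centers else 0.0
    recall = matched / float(len(true_centers)) if true_centers else 0.0
    return precision, recall


# Toy example: two of three detections fall near the manual centers
truth = [(10, 10), (40, 25)]
found = [(11, 9), (41, 26), (90, 90)]
print(match_centers(truth, found))  # (0.666..., 1.0)
```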