At the beginning of April, I attended Big Data TechCon, a conference focused on teaching big data tools. I definitely learned a lot, in particular about the pitfalls of big data analytics. It was good to meet people from vastly different industries, all united by the fact that they have some kind of ‘big data’ in their field. I attended the meeting with a free pass from KDnuggets, which I won by participating in a data science poll organized by Gregory Piatetsky-Shapiro – thanks a lot!

I was recently asked to become a columnist for Dataconomy, a big data website that targets European big data practitioners. I wrote my first Dataconomy contribution about my experiences at Big Data TechCon. If you’re interested, check it out here!

They even gave me a certificate of completion for the meeting!

Grad student descent (January 25, 2014)

On January 24, I attended a 1-day data science symposium at Harvard University with the fun title ‘Weathering the Data Storm’. I imagine being in a tiny boat on the endless, beautiful sea of data, and then a big data storm comes up! Numbers and pieces of text fly through the air… they hit me hard in the face like hail and pile up in my boat… and I’m in dire need of some clever algorithms to take care of all that data, so that I won’t get hurt and my boat won’t sink!

In line with the fun title, there were lots of fun talks. The funniest quote of the day clearly goes to Ryan Adams from Harvard University, who introduced a new name for a common machine learning ‘method’: grad student descent. He talked about a ‘meta-problem’ of machine learning: most machine learning algorithms are complex enough to give great results – if they are run with parameters that are adapted to the problem at hand. For example, to work with a neural network you have to choose the number of layers, the weight regularization, the layer size, which non-linearity to use, the batch size, the learning rate schedule, the stopping conditions… How do people choose these parameters? Mostly with ad hoc, black-magic methods. One method, common in academia, is ‘grad student descent’ (a pun on gradient descent), in which a graduate student fiddles around with the parameters until the algorithm works. It’s kind of sad, but it’s so true! Of course, Ryan Adams then went on to discuss better solutions (‘meta-algorithms’ that automatically find the parameters), but it was ‘grad student descent’ that stuck in everyone’s mind.
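
Those meta-algorithms automate the search over parameters. Even the simplest of them, random search, already beats hand-fiddling surprisingly often. As a toy sketch (nothing from the talk – the loss function and parameter ranges below are made up for illustration), automating grad student descent could look like this:

```python
import random

def validation_loss(learning_rate, layer_size):
    """Toy stand-in for 'train a model, return validation loss'.
    In reality this would be an expensive training run."""
    return (learning_rate - 0.01) ** 2 + (layer_size - 64) ** 2 / 1e4

def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best ones --
    the simplest alternative to hand-tuning."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, 0),  # log-uniform in [1e-4, 1]
            "layer_size": rng.randint(8, 256),
        }
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search(n_trials=200)
```

The grad student is replaced by a loop; the expensive part, of course, is that every trial means training the model once.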

Rachel Schutt from News Corp mused on the perennial question ‘What is a data scientist?’ She cited the well-known definition by Josh Wills from Cloudera, which I really like:

Data scientist = “Person who is better at statistics than any software engineer and better at software engineering than any statistician.”

But I hadn’t yet heard the clever rephrasing by Will Cukierski of Kaggle:

Data scientist = “Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.”

Both quotes capture the interdisciplinary nature of the field of data science (and are really funny). This interdisciplinarity is something I really like. Whenever I go to data science meetings, I meet people from so many different backgrounds – it is very enriching, and the melting pot of so many different ideas and ways of thinking is enticing. It also matches my own diverse background, with lots of math, physics, statistics, biology, and programming thrown together…

It was also great to see some data science tool celebrities. Fernando Perez, who started IPython in 2001, talked about its great features – for example, I didn’t know that it also supports other languages like R, Julia, or SQL. And Jeff Heer, co-creator of D3, showed some awesome D3 visualizations, including the funniest alternative-visualizations sequence I have ever seen (the first 15 seconds of this video by Mike Bostock).

Just got my machine learning certificate from coursera – yay! I got 100% on all review questions and programming assignments, but because I started this course late, I received a late penalty for the first few weeks, so I ended up with only 91% counting toward the certificate (80% was required to get it). I wish the certificate listed the topics covered in the class. Here they are:

Linear regression (univariate/multivariate)

Logistic regression

Neural networks

Support vector machines

K-means clustering

Dimensionality reduction with PCA

Anomaly detection

Recommender systems

… and a lot of advice on how and when to apply these algorithms, and how to check their performance!
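
To give a flavor of the first topic on that list: univariate linear regression is fit by batch gradient descent on the squared-error cost. The course exercises are in Octave; here is a rough Python analogue of the same update rule, with made-up data (not from the course):

```python
def gradient_descent(xs, ys, alpha=0.1, n_iters=1000):
    """Fit y ~ theta0 + theta1 * x by batch gradient descent
    on the cost J = 1/(2m) * sum((h(x) - y)^2)."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(n_iters):
        # Partial derivatives of J with respect to theta0 and theta1
        grad0 = sum(theta0 + theta1 * x - y for x, y in zip(xs, ys)) / m
        grad1 = sum((theta0 + theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        # Simultaneous update, scaled by the learning rate alpha
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Noise-free toy data on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = gradient_descent(xs, ys)
```

On this toy data the parameters converge to theta0 ≈ 1 and theta1 ≈ 2, recovering the line exactly.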

Big data for librarians (January 16, 2014)

Sands Fish at ‘Big Data & You’ talked about the challenges and new tools for librarians in the age of ‘Big data’

On Jan 14, I attended a Big Data event for librarians, Big Data & You: Preparing Current and Future Information Specialists, organized by NEASIST (New England Chapter of the Association for Information Science & Technology). I hadn’t really thought about it before, but it’s obvious that ‘Big data’ is hitting the field of library science, too. As a researcher, I search for relevant research papers in huge literature databases like ISI Web of Knowledge, PubMed, or the arXiv server almost every day!

Machine learning with Andrew Ng @ coursera (January 12, 2014)

I just finished the machine learning class by Andrew Ng at coursera, a provider of Massive Open Online Courses (MOOCs). I finally understand why this class is so famous in the data science community – it’s truly awesome. Andrew Ng is a great teacher who captures attention with his engaging and intuitive explanations. He brings in real-world examples that show why the algorithms are relevant. And he explains what pitfalls to avoid, how to check that your algorithm works reasonably, and which parameters to adjust when it doesn’t. The programming exercises are well set up to let you play around with an algorithm. A class that’s interesting, useful, and fun.

And, of course, machine learning is just fantastic! It’s like magic (science fiction?): you write a few simple lines of code, run them on some training data, and suddenly your program can do useful stuff without you telling it exactly how to do it! And sometimes even without you having a clue how to do it yourself! For example, I trained a neural network to recognize hand-written digits. Of course I can do that myself, but I don’t really know how I do it. And I don’t really know how the neural network does it, either (although I can get some idea by looking at the output of internal nodes), but it works!
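
Under the hood, the trained network is just matrix multiplications with non-linearities in between. If I remember right, the exercise network mapped 400 input pixels (20×20 images) through 25 hidden units to 10 digit classes; here is a bare-bones sketch of that forward pass in Python (the weights below are random placeholders – in the course they come out of training):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(weights, biases, inputs):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def predict_digit(pixels, W1, b1, W2, b2):
    """Forward pass: 400 input pixels -> 25 hidden units -> 10 digit scores."""
    hidden = layer(W1, b1, pixels)
    scores = layer(W2, b2, hidden)
    return scores.index(max(scores))   # most likely digit, 0-9

# Random placeholder weights (a trained network would supply these)
rng = random.Random(0)
W1 = [[rng.gauss(0, 1) for _ in range(400)] for _ in range(25)]
b1 = [0.0] * 25
W2 = [[rng.gauss(0, 1) for _ in range(25)] for _ in range(10)]
b2 = [0.0] * 10

pixels = [rng.random() for _ in range(400)]   # a fake 20x20 image
digit = predict_digit(pixels, W1, b1, W2, b2)
```

Peeking at the `hidden` list is exactly the ‘looking at the output of internal nodes’ mentioned above – that’s about as much insight as you get into how the network does it.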

What I didn’t like so much about this class was that it skipped the theoretical underpinnings of the algorithms. It stopped at an intuitive explanation and some reassurances that it’s okay if we don’t understand the underlying math. I think it’s not okay. You don’t have to prove every mathematical theorem the algorithm relies on, but you should have a solid grasp of how and why it works. Otherwise you’ll never be able to understand what’s going on when the algorithm fails. Or, even worse, you might think that your algorithm works just fine, although it really doesn’t! But I guess it also depends on how much math background you have – for people who don’t know much math, explaining these things would take too long, and there’s only finite time in every class!

There was one thing about this class that I totally didn’t expect. It was my first MOOC, and when I started I was concerned that I wouldn’t be able to ask questions of a knowledgeable person, like the professor or a TA. Well, there was an online discussion forum, but everyone there was also a student, so they wouldn’t have any more of a clue than I did, right? It turned out that the discussion forum was way more helpful than I expected. For almost every question I looked up or asked myself, someone had given a good answer. Admittedly, there were also some answers that didn’t make sense, but luckily, in math and science it’s usually possible to figure that out pretty quickly! It seems that peer-to-peer help works great for science and engineering classes!

Paper published in PNAS (January 9, 2014)

In this paper we investigate the effect of species interactions when two species have to expand into new territory. Such territorial expansions are happening a lot right now, because many species are forced to shift their ranges in response to climate change – for example, if it gets too hot for them in their current habitat, they move north. Most species interact with other species; for example, flowering plants and their pollinators (bees and other insects) help each other. An interaction that benefits both partners is called a mutualistic interaction. The take-home message of our paper is: if two species have a mutualistic interaction, then it’s difficult for them to shift their habitat. They need a mechanism to coordinate their dispersal, otherwise they’re in trouble! This means that special care should be taken to protect such mutualistic species, e.g. flowering plants and their pollinators, from the effects of climate change.

We didn’t investigate these effects with plants and pollinators, though, but with budding yeast. That’s just more convenient: Instead of doing experiments with plants over several years on many acres of land, we did experiments for several months on cm-sized Petri dishes. I analyzed more than 1,000 microscopy images of yeast colonies! For this, I developed an image analysis pipeline in MATLAB to extract features that characterize the interaction and the dispersal. I then performed statistical analyses of these features that allowed me to estimate quantitatively, for example, when a mutualistic interaction is so important that territorial expansion becomes difficult. I could do this with MATLAB, too – yay for MATLAB’s versatility!
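
The pipeline itself was MATLAB-specific, but the flavor of the feature extraction is easy to sketch. As a toy Python illustration (a made-up 5×5 ‘image’ standing in for a real microscopy image, and simplified features that are only examples of the kind extracted): threshold the image, flood-fill the connected colony region, and measure it.

```python
from collections import deque

def colony_features(image, threshold=0.5):
    """Threshold a grayscale image (list of rows of floats in [0, 1]),
    then flood-fill from the brightest pixel to measure the connected
    colony region. Returns its area in pixels and its bounding box."""
    h, w = len(image), len(image[0])
    mask = [[px > threshold for px in row] for row in image]
    # Start the flood fill from the brightest pixel
    sr, sc = max(((r, c) for r in range(h) for c in range(w)),
                 key=lambda rc: image[rc[0]][rc[1]])
    area, seen, queue = 0, {(sr, sc)}, deque([(sr, sc)])
    rmin = rmax = sr
    cmin = cmax = sc
    while queue:
        r, c = queue.popleft()
        if not mask[r][c]:
            continue                      # below threshold: not colony
        area += 1
        rmin, rmax = min(rmin, r), max(rmax, r)
        cmin, cmax = min(cmin, c), max(cmax, c)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return {"area": area, "bounding_box": (rmin, cmin, rmax, cmax)}

# Toy 5x5 'image' with a bright 2x2 colony in one corner
image = [[0.9 if r < 2 and c < 2 else 0.1 for c in range(5)]
         for r in range(5)]
features = colony_features(image)
```

Run over a thousand images, features like these become the raw material for the statistical analysis.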

You don’t know what you don’t know about the database! (January 8, 2014)

On Jan 7, I attended the Critical Data conference at MIT (an event coupled to the Critical Data hackathon the weekend before). It was all about big data in healthcare, with speakers from both the medical and the data science communities, and from both academia and industry. Everyone agreed that there is great potential in the enormous amounts of data that can be – and are being – collected to improve the current healthcare system. Medical data can be used to help doctors make decisions in ways that are not only more efficient but also better founded in facts. It can be used to improve the lives of patients. Non-medical data (like data on hospital admissions and prescriptions) could help make the healthcare system more efficient.

The question is, how? The challenges are great: the data is messy, and it is entered by many different people in different hospitals. This not only means that there are different types of data from different sources (a common big data problem). It also means that, because procedures can vary a lot between doctors and between hospitals, it can be hard to join, or even to interpret, data from different sources. As Omar Badawi, a panelist who works at Philips Healthcare, said: ‘You don’t know what you don’t know about the database!’

But nevertheless, healthcare big data holds big promise. Many speakers lamented the untrustworthiness of current medical studies: most published results are not reproducible! And this is true even though they use ‘clean’ data from randomized clinical trials. One big problem is that these studies are very expensive, and therefore small. This increases the likelihood of false positives and of effects that look larger than they turn out to be in bigger studies. In addition, the high cost severely limits the number of studies that can be done. In contrast, healthcare ‘big data’ is collected from patients ‘in the wild’ (a neat description by Josh Gray from athenahealth), so it’s comparatively cheap, and there’s lots of it. Many speakers showed examples where this data could be used to (retrospectively) predict whether patients would die within the next 24h, whether a patient would be re-admitted to the ICU, etc. These first successes suggest that big data will be able to keep its promise to improve healthcare. However, these results have yet to impact actual medical care. Convincing doctors and healthcare providers to access, trust, and implement them is another typical big data challenge: you don’t just have to make sense of the data, you have to transform it into actionable recommendations, and then convince people to act!

Critical Data Hackathon at MIT (January 6, 2014)

Team Oxygenators at the Critical Data Hackathon at MIT: Luciana, Nupur, Yousuf, and me. Photo by Jennifer Joe from medtechboston; see her report of the hackathon here.

The weekend of Jan 3-5, I participated in the Critical Data Hackathon at MIT – a weekend that brought together clinicians and data scientists to make use of an awesome medical database called MIMIC (Multiparameter Intelligent Monitoring in Intensive Care). It contains 200 GB of clinical data from 40,000 ICU stays, with matched physiological signals for 7,000 patients – a lot of data to play with!

I joined team ‘Oxygenators’, who set out to mine the MIMIC database for quantitative predictors of the oxygenation state of patients with different types of diseases. Our clinical experts Luciana and Nupur had lots of ideas about which lab tests and chart events to look at, and the data analysts Yousuf and I set out to pull this data from the database using SAP HANA (SAP was one of the sponsors of the hackathon). This turned out to be quite a challenge! We had to find the relevant data within 45 tables in a database that neither of us had seen before. For example, an important quantity for oxygenation is the FiO2, the fraction of inspired oxygen. We quickly found that this value was listed in the chartevents table – but not just once: there were entries like FiO2 (Analyzed), FiO2 Set, FiO2/O2 Delivered, FIO2, FIO2 Alarm [Lo/Hi], FIO2 [Meas]. Which one is relevant? Only a medical doctor can know! (It turns out we needed FiO2 Set.)
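
The label ambiguity is easy to reproduce in miniature. As a toy illustration (sqlite3 instead of SAP HANA, and a drastically simplified schema that is not the real MIMIC layout – the real chartevents table keys on item IDs, subject IDs, timestamps, units, and more), filtering chartevents down to the one relevant label might look like:

```python
import sqlite3

# Toy stand-in for MIMIC's chartevents table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chartevents (label TEXT, value REAL)")
conn.executemany(
    "INSERT INTO chartevents VALUES (?, ?)",
    [
        ("FiO2 (Analyzed)", 0.35),
        ("FiO2 Set", 0.40),          # the one we actually needed
        ("FiO2/O2 Delivered", 0.38),
        ("FIO2", 40.0),              # same quantity, different units!
        ("FIO2 [Meas]", 0.36),
    ],
)

# Several look-alike labels, but only 'FiO2 Set' is the clinically
# relevant one -- a fact only a domain expert could tell us.
rows = conn.execute(
    "SELECT value FROM chartevents WHERE label = ?", ("FiO2 Set",)
).fetchall()
```

Multiply this little puzzle by 45 tables and thousands of labels, and you get a feel for the weekend.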

The real fun began once we had pulled data from the database and started actually exploring it (using MATLAB). We looked for correlated variables that would change for patients with a specific disease, for example patients with pulmonary embolism. Unfortunately, the hackathon ended too soon – before we could play with the data long enough to actually make quantitative predictions from our model.

All in all, the hackathon was a lot of fun – not just playing with the database and the data, but also meeting the other participants. It was a pleasure to get to know these smart, motivated people, all enthusiastic about improving healthcare with the help of data!

The algorithm and the knife (December 30, 2013)

I have always found that good science is also aesthetically appealing. The same is true for algorithms, as expressed beautifully in CLRS:

“A good algorithm is like a sharp knife – it does exactly what it is supposed to do with a minimum amount of applied effort. Using the wrong algorithm to solve a problem is like trying to cut a steak with a screwdriver: you may eventually get a digestible result, but you will expend considerably more effort than necessary, and the result is unlikely to be aesthetically pleasing.”