The Washington Post recently published an article on rising rates of colon cancer in young adults. I'm sure the article was quite interesting, but I was too distracted by the figure below to finish reading. I had to read the caption several times to work out what the graph is showing. I believe it's depicting an increase in the proportion of incident rectal cancer cases that are diagnosed in people 55 and younger. Remarkably little information is conveyed with this. How many people are diagnosed with cancer each year? How has the overall rate changed over time? Why are there only two data points - is this surveillance system trustworthy?

What would make it better? I would prefer to see an incidence rate per age group over time, like the graph below from this paper. It gives a better sense of the annual incidence, and the distribution of cases by age group.

If the authors are determined to show the proportion of new cases in each age group, a bar graph would be more effective, similar to this CDC figure on chronic disease prevalence.

This post is written by Eric Bakota, epidemiologist at the Houston Department of Health. You might remember Eric from my interview with him earlier this year. He's leading an interest group for epidemiologists who use R. The tl;dr is if you use R and want to join, sign up here!

The International Society of Disease Surveillance (ISDS) is committed to improving population health by advancing the science and practice of disease surveillance. One way it does this is through the R Group for Biosurveillance (RGBS). This group's goal is to help new and inexperienced users gain the skills required for data management, analysis, and visualization in the R language.

R is a free, open-source, and popular statistical programming language. Because the language is powerful yet free, I suspect more and more health departments, epidemiologists, and data scientists in general will migrate to R (or Python). A recent analysis conducted by Bob Muenchen of r4stats.com indicates that R is now used in scholarly articles more frequently than any other statistical language except SPSS.

by Robert A. Muenchen at http://r4stats.com/articles/popularity/

To support epidemiologists exploring R as a tool to improve public health, RGBS holds monthly webinars. These webinars facilitate the sharing of programming techniques (and code!) among current R users. They generally focus on beginners, but some advanced topics are presented to give viewers an idea of what is possible with R in the hands of an expert. Webinar topics range from introductions to a single package (‘surveillance’, ‘dplyr’, ‘RShiny’) to current research that used R for the analysis. Most webinars have an open-forum format; attendees are encouraged to create a dialogue with the presenter. In addition to the monthly webinars, we have a Google discussion forum. The forum allows new R users to plug into an existing network of peers who can provide assistance or feedback on any problem. The forum also links to the materials from each previous webinar, including a recording, slides, and code.

Finally, ISDS is devoting two of the pre-conference workshops at the annual conference to building R capacity among public health practitioners. The Introduction to R for Biosurveillance will provide hands-on training in R for epidemiology, biosurveillance, and high-quality data visualization. The course will be led by Dr. Jared Niemi at Iowa State University. Intermediate R: Practical Tools for the R User, which I will lead, will introduce R users to several packages from the “Hadleyverse”, including dplyr, ggplot2, and tidyr. Participants will be placed into groups, each given a dataset, and will work on a project as a team.

If you are interested in joining the R group, you can sign up here. If you have any additional questions, or if you are interested in presenting at a webinar, please don’t hesitate to contact me at Eric.Bakota@gmail.com.

Today's interview is with MIT grad student Maia Majumder, a long time Twitter friend and collaborator. Be sure to check out her blog.

Tell us a little about your background. Why is the work you do important to you?

I'm currently an Engineering Systems SM/PhD student at MIT and a computational epidemiology research fellow at HealthMap. Before coming to MIT, I earned a Bachelor of Science in Engineering Science and a Master of Public Health in Epidemiology and Biostatistics at Tufts University. My research interests involve probabilistic modeling, Bayesian statistics, and “systems epidemiology” in the context of emerging infectious diseases.

I also enjoy exploring novel techniques for data procurement, writing about data for the general public, and creating meaningful data visualizations. By working at the intersection of two separate but surprisingly similar disciplines (engineering and public health), I hope to (in whatever way I can!) improve the way we respond to outbreaks as they happen.

Your background is in engineering. How did you become interested in public health? How does public health differ from engineering?

When I started as an undergraduate at Tufts in 2008, our freshman year required reading was Dr. Robert D. Morris' The Blue Death: Disease, Disaster, and the Water We Drink. Because of this book, I learned how Dr. John Snow - one of the fathers of modern-day epidemiology - and his study of the 1854 cholera outbreak in London changed the way we understood disease transmission. Given that cholera is endemic in Bangladesh (my family's home country), I became fascinated with public health - and thankfully, there were several classes in epidemiology offered through the Civil & Environmental Engineering department. Something that I learned very quickly at Tufts Engineering - which also reflects what Dr. Morris emphasized repeatedly in The Blue Death - is that engineers are critical to the public health of any given society. This ultimately translated into a university-funded, 4-year-long research project in Bangladesh. Between 2009 and 2013 (when I received my MPH from Tufts Medical), I spent several months each year at the ICDDR,B (International Center for Diarrheal Disease Research, Bangladesh) working on cholera in the field... And because of these experiences, I perceive the cultures of engineering and public health to be fundamentally interwoven.

Outreach has been a big part of your work so far. Why is that important to you?

In my opinion, science communication is an integral part of public health practice. This is especially true for novel pathogens research. Unknowns are scary, and emerging infectious diseases are chock-full of unknowns! As is the case with other sexy topics, EIDs are frequently the subject of sensationalized (and often scientifically inaccurate) news media reports. By making myself and my work accessible to the general public (via Twitter, email, and my blog), I hope to mitigate some of the fear and confusion that tends to ensue during an EID outbreak. I've found that most of the people who choose to engage with me (and other scientists!) on these topics respond very positively to having their questions and concerns addressed one-on-one... And that's precisely what makes outreach worthwhile!

Why do you think computational methodologies are a good way to approach epidemiology? Why did you choose that over more "traditional" public health?

In my view, digital disease surveillance and traditional disease surveillance should be practiced hand-in-hand. The former can provide critical, near real-time insights in under-resourced regions where the latter simply isn't an option. That said, traditional disease surveillance generally provides a far deeper understanding of a given disease of interest; however, it requires infrastructure for data collection that isn't always available. Given that novel pathogens often emerge from under-resourced regions, digital disease surveillance is extremely useful in my particular line of work - especially in the context of rapid response in the early days of an outbreak.

What do you see as the future of epidemiology? What are your hopes for public health in the next 10 years?

As I mentioned above, my hope is that digital and traditional disease surveillance methods will eventually be utilized simultaneously during an ongoing EID outbreak, with an increased emphasis on public participation. In reality, this is happening already... But I believe a formalized protocol that maximizes the advantages of both kinds of surveillance would likely aid in reducing costs and improving efficiency.

Today's interview is with Dr. Bryan Lewis, who was my PhD advisor at Virginia Tech! Bryan is the reason I learned to code (in Python!), which ultimately convinced me to switch from liberal arts to computational epidemiology. I thought if he convinced me, he should probably be able to convince you.

1. How did you come to be a computational epidemiologist? It's not a very common field!

Because it's awesome! Fighting infectious disease with computers, what could be better? Seriously though, I was fortunate to have the opportunity to do research while I was getting my MPH with some very gifted infectious disease modelers (Joe Eisenberg and Travis Porco). I took a couple of classes and worked on a couple of papers with them, and enjoyed it greatly. Travis was a great mentor and spent a lot of time helping me work on my master’s thesis, which was a small modeling project. After working on that and spending a couple of years doing research with Joe, I was pretty hooked on the field, and increasingly became convinced that it was a growth area where I could best make a contribution.

2. Before coming to Virginia Tech you worked as a TB epidemiologist at the California Department of Health. How did your background in computer science fit in with that role?

It was a pretty good fit, since traditional public health departments need good epidemiologists, and good epidemiology requires good data analysis. As a person who could actually write some code and think more “computationally”, I was able to help on projects where the more “classically” trained epis couldn’t analyze the data in a particular way. I recall “realizing” this at some point while I was still just an intern, when a senior epidemiologist in the TB epi department gave me the task of analyzing just a couple of questions from a survey. Once I understood what she wanted with that batch of questions, I asked about the rest of the survey and she leveled with me: “I don’t know how to convert those data into the format I need for the analysis I want to do. The rest is easy.” To me, analyzing and interpreting the whole survey was the daunting part; converting the data with a small program was the easy part. So it was a good fit, I learned a lot, and I think that as a “coding-oriented” person I was able to help a lot. There were multiple projects where that was my role, and the more senior classically trained epis did the interpretation and “standard” set of analyses.

3. What competencies do you encourage in your students? What challenges do you see them face when acquiring those skills?

Data analysis is crucial in so many disciplines that it’s hard to go wrong. So if you are talking about classes, I’d steer folks towards stats, programming of some kind (I'm partial to Python), and a quantitative science class or two in a field of interest to which these skills can be applied. For many students there is an obstacle to tackling some of these classes because the material can be dry when not in the context of a juicy problem. I think programs that incorporate a “find a problem based project” phase, or a general “project class” that can act as the test bed for applying some of these techniques, can be quite instructive. After you’ve been confronted with a problem you’ve wanted to solve and then faced a limitation due to not having a technique in your toolbox, your hunger for more techniques will grow. Then, when faced with the dry presentation of a technique, a deeper understanding can occur. This problem-based learning is, in my opinion, the best way for students to learn, and at its heart it is what most PhD programs try to foster.

4. What do you see as the future of epidemiology? How does coding fit into that future?

I think coding and more computational techniques are going to play a huge role in the future of epidemiology. At its core, epidemiology is about collecting and analyzing data about disease; computational techniques can make this both more efficient and more complete. Historically, the data available for analysis have been constrained by the labor of collecting them. As we enter a world where more and more non-traditional data are available, coding will be essential for gathering, organizing, and analyzing them. I absolutely think there will be a lot that “digital epidemiology” can contribute to the field, and coding skills, along with a keen understanding of epidemiology and appropriate techniques for analysis, will be essential.

5. These days you do a lot of modeling. What is the role of modeling in public health, and what role do you hope it plays in the future?

Again, I am very heartened by the trends I’ve seen in the 15 years I’ve been “in” the field. I certainly don’t think modeling will supplant traditional epidemiological analyses. However, modeling methods are becoming more commonplace and are more embraced by the public health community every year. The variety of studies I see now where some form of mechanistic (rather than statistical) modeling has been applied has grown significantly. I think of it as an increasingly useful tool. Much like surveys and logistic regression now, I think that in the future mechanistic models will add significantly to our understanding of diseases, and offer insight into ameliorating their effects.

Tell us a little about the work you do, and why it is important to you.

I’m a second-year MPH student studying hospital and molecular epidemiology. My main research focus is antibiotic-resistant hospital-acquired infections, although I’ve also done some work with vaccine surveillance and lead emissions from residential demolitions.

My work is important to me for two reasons. Firstly, I’m intrigued by the narrative we have around antibiotic use. We talk about “conserving antibiotics” and financial incentives to encourage new drug development, but we rarely discuss actual eradication of antibiotic-resistant organisms. I was lucky enough to find a lab that’s doing prospective surveillance of antibiotic resistance – trying to predict what’s going to evolve and under what conditions – and I think that’s one of the first steps toward asking a question even more preventative than finding new drugs: how do we stop needing new drugs?

Secondly, my academic program combines molecular biology with epidemiological studies. In many instances, this lets us do more with data: we can connect molecular markers to clinical relevance, and look at epidemiological data with a much finer lens. Molecular tools are rapidly becoming invaluable for both infectious and chronic disease.

What programming languages or tools do you use, and how do they fit into your work now?

My primary tool is SAS, which I learned as part of my graduate school coursework. I’ve started using MATLAB – also taught through coursework – for mathematical modeling. This year I took a Software Carpentry course, which aims to teach coding for the express use of researchers. That introduced me to R, as well as tools like GitHub.

As I’ve progressed through my education, I’ve come to see, more and more, the value in knowing an actual programming language. There are a number of tools I’d like to build, but I’m limited by only knowing statistical packages. Many in my cohort have expressed similar frustration: some of the work we want to do would have to be outsourced to a computer scientist. But I don’t see this divide as necessary: there are languages that would allow us to both build tools and conduct statistical analysis. If we have to learn a statistical package anyway, why not spend that time learning a tool that gives us even greater flexibility?

Luckily, there are so many classes and resources online that it’s easy to begin teaching yourself to code. The greatest motivation for learning to code is having a goal or project you want to build yourself. I have the motivation to teach myself independently, and the tools to do so exist, so I’m optimistic! And as the next generation of epidemiologists sees the value in these tools and finds ways to learn them, I think coding will work itself naturally into the curriculum of epidemiology.

I noticed you have a background in the humanities. What path did you take to get where you are now, and what role does that training play in your current work?

As an undergraduate, I dual majored in biology and philosophy; although I spent the summer working in laboratories, during the academic year I spent the vast majority of my free time working as the arts editor for our college newspaper.

These experiences turned out to be invaluable for studying epidemiology. Philosophy is a study of analytical thinking. What is the minimum standard of evidence that we need to make this assertion? Can we really say that A causes B? How do I use inductive versus deductive reasoning? What piece of information would prove or disprove my argument? Philosophy trained me to think like an epidemiologist, even if I didn’t yet have an epidemiologist’s vocabulary.

Journalism was just as helpful. When you’re conducting an interview, there’s no guarantee that your questions will elicit the information you want, or even that you want the right information! There were so many times I started writing a story about one subject that morphed into something completely different when I started interviewing. We see this challenge in epidemiology all the time: what if there are factors at work that we don’t even know to measure? How do we design surveys and gather information in such a way that gives us the data we’re actually seeking? Epidemiology and journalism are both storytelling: using data from real-world observations to craft a narrative that explains a cause.

Transitioning to epidemiology wasn’t that hard, as I’d always been involved in science. But I’ve always been glad some of my background was in the humanities as well.

I love what you've said in the past about epidemiology being a set of tools and not a set of facts. What can epidemiologists or students do to prepare for that trajectory?

I came to graduate school having memorized a lot of information about diseases and historical outbreaks – that is, having learned a set of facts. Upon beginning my degree, I rapidly realized that epidemiology is about equipping us with a set of tools that we can use to go about acquiring those facts. I wasn’t the only one under this illusion: after the first semester, many of my peers switched from the epidemiology track to health behavior/health education.

I love epidemiology, but that’s because I now have the tools to answer my own questions about disease, rather than being limited by other people’s discoveries. However, if I’d had a better understanding of epidemiology from the beginning, I would have prepared differently. I would have taken a lot more coursework in statistics and computer science; I would have focused on developing skills in data management and analytical thinking (so I’d do the philosophy degree again!).

With epidemiology, it doesn’t matter what you’re interested in: you can use tools to study chronic disease, infectious disease, environmental contaminants, the impact of preventative health measures, anything. If you’re trying to prepare for a track in epidemiology, read plenty of books on public health – the type of content that will remind you what questions you may be interested in – but focus your classwork on developing your analytical thinking and quantitative skills.

What do you see as the future of epidemiology? What are your hopes for public health in the next 10 years? How does coding fit into that future?

I see the field becoming even more interdisciplinary. With the massive amounts of data becoming available, it’s going to be even easier to look at an outbreak in the context of so many things: economic factors, environmental factors, anthropological factors. We can think about disease in the context of greater networks. That probably wouldn’t be possible if we as epidemiologists weren’t getting better and better tools, but I think that’s where coding comes in: we can build the types of things we need to think about and interact with data in different ways.

This openness of data also means more people will be able to contribute to epidemiology: it’s not just going to be people with advanced degrees debating science in the slow journal publication cycle. Now, as tools and information are widely shared, anyone with initiative will be able to contribute to epidemiological questions.

In the next ten years, I hope epidemiology becomes more predictive. Right now, despite falling under the field of public health, things often still feel very responsive: we responded to the recent Ebola outbreak, we’re responding to Zika, we’re responding to the lead in the water in Flint, MI. But maybe we can start building tools to get past that problem. What if someone had decided to build a platform that integrated civil engineering data with health department data and overlaid it all on an interactive map? Maybe we could have anticipated a problem before it began. I hope that’s what coding can start helping our field to do.

I've come across two articles recently that remind me of why public health is not just a profession, but a vocation. Both articles feature people with unusual dedication and commitment to their personal mission. And although neither story is related to computer science, they are great reminders of why public health professionals show up day after day, and what we can achieve in doing so.

Vice recently wrote about the impending eradication of guinea worm, a painful disease that used to infect millions of people. Former President Jimmy Carter and the Carter Center have devoted the last 30 years to the effort, and the work has paid off. In 2015 there were only 22 new cases of guinea worm. President Carter, now 91, considers this his "most satisfying achievement."

Virginia Tech professor Marc Edwards took commitment to another level by uncovering the Flint water scandal (together with Dr. Mona Hanna-Attisha) despite substantial opposition. Dr. Edwards even spent his own money to do so. He has harsh words for the academic and government cultures that do not live up to their mandates to serve the public.

Zika has stolen all my free time, which means I haven't had time to blog - but such is the life of an outbreak scientist. Here's another repost from my personal blog, originally published in 2013. Data as a civic right is a passion of mine, and I'll happily repeat this message on loop until things change.

What if I told you a brand new public library is coming to your town. It's going to be really well stocked with great books - but you can't open them, you can only look at the covers. That is the current state of our public data right now. Here's why you should care.

I recently argued that we all need to level up the ways we share data, particularly public agencies. PDFs seem to be the default for how public data are made available, which just doesn't cut it. The source of my frustration is not that workarounds don't exist. They do, or else nothing would ever get done. It's that nobody should have to be a 'halfway decent programmer' in order to liberate data from the dreaded PDF. What good is that for the vast majority of people who can't program? This is what I call the 'let them eat cake' perspective, and it's dreadfully out of style.

The broader point is that the data to which I am referring - those administered by taxpayer-funded institutions and mandated by law to be released - are not a privilege. Data about the health, wealth, and safety of our communities are a civic right. They exist precisely because taxpayers paid for them to exist, for the exact purpose of using them to learn how to better our lives. That makes public data a public good, not a state secret.

Did you know that 13 million people in the United States lack access to a safe municipal water source? Me neither, until I downloaded and analyzed the data. (For reference, there are 1.2 million people living with HIV in the US.) Knowing that kind of information empowers people to elect legislators, produce and reproduce science, plan communities, and set goals. Releasing data for public consumption is better for government, better for science, and better for democracy.

That's why we need to demand that the data that describe our lives be made accessible. Not "available" - accessible. Widely disseminated, and easy to find, download, and analyze. PDFs are necessary but not sufficient for this task. I see no reason why data can't be released in multiple formats; it's not like it all needs to be re-entered by hand each time.

Luckily, things are changing. The Obama Administration, the National Science Foundation, and the European Union are all working towards opening their data. We're not there yet, though. We as citizens need to continue to demand that our data be made available to us. Similarly, we as scientists and public health professionals need to do our part to release timely, usable, and well-documented data.

I'm reproducing and updating here the most popular post I've ever had on my personal blog. Although I originally wrote the piece in 2014, nearly three years later we find ourselves in much the same situation. We can do better.

There's a serious problem with the current state of shared data - it is almost completely unusable! Here are some ideas for sharing more effectively.

I often have a question I'd like to answer for which I know data are available. Most recently I wanted to look up the incidence (number of new cases) of various infectious diseases over the last decade. This should be easy - CDC publishes the Morbidity and Mortality Weekly Report with just that. Well, the data are indeed available - but only in PDF. [Update: This is no longer true. CDC now has an entire data portal (hallelujah). Project Tycho has taken on the task of digitizing all of the historical MMWRs. Despite this incredible progress, most public health data is still published in PDF, so the problem endures.]

Why even bother with computers? They might as well mail around a printout. If I wanted to actually analyze it, I would first need to enter a decade's worth of data by hand. Ain't nobody got time for that.

I don't mean to pick on CDC. County Health Rankings is an awesome website that aggregates and releases for download public health data from a variety of sources. I'm grateful for that, but the Excel files they release each have multiple sheets, nested headers, merged cells, and extra columns with confidence intervals. It's pretty much impossible to analyze that data in a program other than Excel. To do so, I first have to manually select and reformat the data I want, rename the variables, and then copy/paste it into a new file - which rather defeats the purpose.
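To make that cleanup concrete, here's a minimal sketch in Python (standard library only) of the reshaping that a two-row, nested-header layout with interleaved confidence-interval columns forces on every analyst. The miniature table is entirely made up - the county names, measures, and values are hypothetical stand-ins for a real download:

```python
import csv
import io

# A made-up stand-in for a typical "pretty" spreadsheet export: two header
# rows (measure name over sub-columns), with a confidence-interval column
# interleaved next to each value column.
messy = """County,Premature deaths,Premature deaths,Adult smoking,Adult smoking
,Value,95% CI,Value,95% CI
Adams,370,340-400,0.21,0.18-0.24
Brown,410,380-440,0.19,0.16-0.22
"""

rows = list(csv.reader(io.StringIO(messy)))

# The two header rows must be merged by hand before the data are usable...
header = [f"{top}_{sub}".strip("_") for top, sub in zip(rows[0], rows[1])]

# ...and the confidence-interval columns dropped to get at the plain values.
keep = [i for i, name in enumerate(header) if not name.endswith("95% CI")]
tidy_header = [header[i] for i in keep]
tidy_rows = [[row[i] for i in keep] for row in rows[2:]]

print(tidy_header)
print(tidy_rows)
```

Every consumer of the file has to rediscover and rewrite this same cleanup; a flat, single-header CSV would make the whole step unnecessary.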

There are about eight million other examples that I had to restrain myself from enumerating. The point is that sub-optimal sharing practices make it difficult for researchers (of both the professional and citizen variety) to actually use shared data. The research either a) won't get done because it's too much of a hassle, b) will have errors from manual data entry, or c) will take way longer than it should. Possibly all of the above. With that in mind, I came up with some tips to level up your data sharing.

Learn how to step up your sharing game:

Include as much detail and resolution as possible. County-level data is better than state-level data which is better than national-level data. Bonus points if you break it down further by age, sex, etc. I understand that this can't always be done for privacy reasons, but it is immensely useful when it is possible.

Use a flexible file format. My preference is .csv, because it can be read by almost any program. I'll tolerate .xls and .xlsx. But please, please, please do not use PDF.

If you do use a spreadsheet format, do not use multiple sheets, nested headers, merged cells, strategic cell borders, etc. Make it as plain as possible. Don't worry that you'll end up with too many files if you don't use sheets. Release them in a zipped folder instead.

Use short variable names with no whitespace. Underscores are usually a safe bet, so instead of "Number of new tuberculosis cases" use "incident_tb". If you have a corresponding column, e.g. a confidence interval, make the variable name relevant. Use "UCI_incident_tb" instead of relying on the column's proximity to "incident_tb" to indicate a pairing. Include a README that explains the variable names if you're worried they aren't descriptive enough.

Actually, include a README no matter what. It can include variable names, units of measurement, notes on data collection/reporting/suppression, or anything else that is relevant.

Tell me whom to cite! I'm so pleased to be able to use your data, and I'd love to give you the credit you deserve. Put your citation on your website, in your README, and everywhere else I might look for it so that I can use it appropriately. Or post it to figshare, where it will automatically be assigned a DOI. It's an easy way to make your data citeable, shareable, and version controlled.

Bookmark these guidelines. Next time you reach for the 'export to PDF' button, or begin to use the change-cell-border feature on Excel, pull this out and remind yourself, 'this is not machine-readable. Nobody will use my data if I release it like this.' Then rejoice that you are awesome for sharing your data, and for doing so in a way that is actually useful. And for that, I thank you.
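As a contrast, here's a sketch of what following these tips buys the analyst. A flat CSV with short, underscore-separated variable names (the columns and values below are invented for illustration, following the naming convention above) loads in essentially one line of Python, with no manual reformatting:

```python
import csv
import io

# A hypothetical flat CSV, as a README might document it:
# county, incident_tb, UCI_incident_tb (upper confidence interval).
clean = """county,incident_tb,UCI_incident_tb
Adams,12,15
Brown,8,11
"""

# One call, and every row is immediately usable for analysis.
records = list(csv.DictReader(io.StringIO(clean)))
total_tb = sum(int(r["incident_tb"]) for r in records)

print(records[0])
print("total incident TB cases:", total_tb)
```

The same one-liner works in R, SAS, or any other tool that reads CSV, which is exactly the point of choosing the plainest possible format.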

I'm excited to kick off a new series of interviews with epidemiologists in public health practice, academia, and industry. The inaugural interview is with Eric Bakota, an epi with the Houston Health Department. I met Eric at an event called EpiHack Analytics, which pretty much sums up our shared interest in bringing data science to epidemiology.

Tell us a little about your background and job.

I am currently an epidemiologist for the Houston Health Department. I work within the informatics division, where I spend a lot of my time translating our epidemiological data into information by way of R-generated graphics.

How did you learn R, and what role does it play in your work now?

I learned R while earning my MS in evolutionary biology. My advisor had no clue how to program in R - he used JMP and SPSS - but he knew that R was quickly becoming the premier statistical programming language. He also appreciated that it was free!

Stratified density distributions of campylobacteriosis cases revealed fewer cases in children than expected.

I am currently focused on three projects: an epi-surveillance tool in R Shiny to allow for data exploration of communicable disease cases; a QI/QA report using R Markdown to inform managers of the surveillance team's efforts; and the 3rd edition of Houston's Epi-in-Review, a compendium of analyses for each reportable condition. The graphs for the book were made using the "ggplot2" package. I love that these projects draw from the same database yet serve very different functions by informing different audiences (epidemiologists, managers, and the public at large, respectively).

What challenges and opportunities have you faced working as a tech person in a more traditional public health department?

There have been far more opportunities than challenges. I think that the future of epidemiology will skew heavily towards coding (R, SQL, Python, JavaScript, etc). Having the ability to create graphs that go beyond Excel-styled charts has impressed my bosses. The biggest challenge is balancing the new tech tools with very real and legitimate concerns about data security. I am hopeful that tech companies continue to improve data security and offer products & services that are, at a minimum, HIPAA compliant.

Do you have any advice or words of encouragement for people looking to get started with coding?

I think the best way to learn to code is to find an interesting problem that you're motivated to solve - something you're willing to work on from home.

What do you see as the future of epidemiology?

Coding for analysis and coding for automation are, I believe, the future of epidemiology. On some fronts it is already happening. The RCKMS project led by CSTE & CDC is a great example: the project's end goal is to have reportable condition case reports generated automatically from electronic medical records and electronic lab reports. This will allow surveillance investigation epidemiologists to focus on analysis of data instead of acquisition of data.

Future epidemiologists will need to be less focused on understanding the natural history of disease, its transmission, and so on, and more focused on understanding the science of data, which includes regression techniques, machine learning, and the grammar of graphics.