Data Are Like Ikea Furniture: Best Shipped Flat

April 10, 2020Loren Perelman

This post was originally published on the Riffyn Discover Blog on June 28, 2018

The way we organize our data is critical to our ability to perform various analyses after we collect it. We are all familiar with products like Excel or Google Sheets, which are the most popular modes for capturing and sharing data (certainly in undergraduate or graduate studies, where formal systems almost never exist). They are powerful, and they enable a lot of flexibility in how we present our data. One potential challenge is that we tend to organize our data in a style that makes it “pretty” to the eye – this is generally how data is presented in tabular form for scientific publishing. In the publishing or presentation contexts, it makes sense to present the data this way, as we are trying to summarize complicated data in a human-readable format. In the world of analysis, however, this approach is not ideal, as it does not “feed” analytical software the right format of data.

Solid estimates suggest that data science is 80% clean up and 20% actual analysis. Spending hours cleaning “pretty” data and making it usable for analysis is...well, less than pleasant. Except for a few outliers out there, I suspect most of us want to spend our time analyzing data and making keen insights that enable even better experimentation and process. The good news is that this is possible with a little bit of planning and organization!

Data Collection Begins With A Plan

Let’s first think about what a generic data collection plan might look like, in a graphical format. Figure 1, below, illustrates what we’re capturing in a process. There are inputs/factors*, which are both the material inputs (e.g., a reagent, a chemical, a filtration apparatus, a fermentation vessel, etc.), as well as process settings (e.g., temperature, pressure, duration, etc.). The outputs/responses of the process are both material outputs (perhaps a new chemical, a harvest broth from fermentation, a purified powder, etc.) and the associated data, such as the chromatographic information, the quality data associated with the products, and so on.

Figure 1. Example of a data plan for a given process. We push inputs/factors into a process, which results in outputs and responses - we want to capture all relevant data associated with this process!

Take, for example, the process of measuring the concentration of a given molecule by HPLC. We might have inputs/factors like sample type (and the sample itself), diluent, mobile phase, the HPLC system we’re using, the operator who prepared the sample, and so on, as well as column temperature, flow rate, injection volume, etc. And our outputs/responses will be the physical eluent, a chromatogram or chromatograms (signal vs. time), peak area(s) for our molecule(s) of interest, concentration for those molecules, etc.

The Type of Table You're Probably Used To

In Table 1, I’ve put together a simple example of a common output from the sort of study detailed above in a format that might be acceptable for publishing a report. Let’s talk about what’s wrong with the table from a data analysis perspective:

There are merged cells, which means that we can’t associate each measurement to the input or factor that led to it - most programs will only recognize this as the first column. So, the operator “Alice” will only be associated with the mean value of the measurement for fermentation broth derived from HPLC 1.

There are both row-based and column-based titles. Analytical software requires each bit of data to be in its own column so that data is easily parsed.

The data has been abstracted into calculated values - the fact that there are means and standard deviations indicates that there were multiple replicates. Each replicate should be in its own row so that we are able to detect outliers when they occur (or at least generate suspicion that we may be observing one).

Table 1. Example of a typical data table from an analytical measurement system analysis type of study. While this table looks good, it is not very usable for subsequent analysis, nor does it tell us very much about the nature of the data behind each mean, standard deviation (S.D.), and %RSD (the S.D. divided by the mean times 100%, or a relative measure of the standard deviation in the context of the mean).

How Should I Structure My Table?

Data tables should contain each variable in its own column and a row for each individual run/measurement/etc. This means there will be quite a bit of redundancy in certain columns, to be certain, but each row will be completely unique (at the very least, there will always be a unique run/replicate id number that prevents even identical results from resulting in an identical row). Table 2 provides an example of the data provided in Table 1 being rebuilt in an analytically-friendly format.

Table 2. Example of flat analytical data table. You will note that each instance of a measurement occurs in its own row, there are no merged cells, and each variable has a column associated with it. Note that the table has been truncated for readability.

By arranging your data in this format, you still enable production of a summarized table, such as the one detailed in Table 1; however, you also enable other types of analysis - you can now group and summarize by any variable utilized in a given study. We know more about each measurement and can scrutinize it independent of a grouping. We can perform a variance analysis to determine which variable(s) has the most impact on the noise of our assay or process. Planning ahead and pre-structuring your data tables for analysis will provide for all use cases while keeping things organized and efficient. This means using analytics and visualization software like JMP, R, Spotfire, Tableau, and Python without any cumbersome prework.

Riffyn SDE helps you automatically shape your data into this format ready for analysis. It provides cutting-edge approaches to data organization using patented technology. If you’d like to learn more, send a note to hello@riffyn.com .

Notes

*It should be noted that for mathematical modeling, the x variables are all considered factors, and the y variables are all responses. We note that in “real” life, there are physical materials involved, hence inputs and outputs (which ultimately become factors and responses in the modeling process).

Loren Perelman

Loren Perelman is the Vice President of Science at Riffyn where he leads a team of 20 scientists who deploy Riffyn solutions in customer sites. Prior to this he was Director of Manufacturing, Quality, and Analytics for Bolt Threads, overseeing a team in the development and routine utilization of analytical methods supporting protein-based materials. Loren has also worked as a Senior Scientist at Amyris and a Group Leader, Analytical Services – Fibrosis Team, at Nitto Denko Technical Corporation. He received his Ph.D. from UCSD and his B.S. from Harvey Mudd College.