I work on numerous data projects to undertake statistical analyses and the amount of time taken to reformat the data into a suitable and usable format is so painful that I have decided to write this page about how to do it correctly. Excel is the most common method to collect data and here are some tips:

How to name a file

Give your excel file a short name and put a date in reverse order (e.g. draindata 141003.xlsx). In this way you file will be ordered naturally in windows chronologically (most recent file last).

How to set up your spreadsheet

Conventionally patients details are entered in rows (i.e. each patient is a new row). Enter all data into one sheet, do not separate pre-operative data into sheet 1 and post-operative data into sheet 2, instead put all data on one sheet and create a variable separating the timing (e.g. “post”=”0” or “1”).

How to name your columns

Give each column a short, unique name in lower-case and numbers without any spaces (e.g. prehb, posthb, cr1,cr2). This is because the statistical software that I use (R 3.0) is case and space sensitive (i.e. “pre-op haemoglobin” is not the same as “Pre op Haemoglobin”), and nothing is more tedious than typing out a long name.

Do not generate group names for columns

Sometimes when researchers collect data, they have a heading at the top for pre-operative values of Hb,WCC,Plt and then a group heading for post-operative Hb,WCC,Plt, but when it comes to analyses, we can only have one header row, and then we have duplicates for Hb,Hb,WCC,WCC,Plt,Plt. You may notice the the rule of not using upper and lower case is also broken in this case.

Each column must have a unique name

Sometimes researchers use different colours in an excel sheet to signify different values (e.g. background in red for pre-treatment values, background in green for post-treatment values), and otherwise variables are given the same names. Related to the point above each variable (column) should have a unique name (e.g. hb1,hb2).

How to record categorical outcomes

Firstly, use numbers. For example “male” 0 or 1 (conventionally 1 would mean “male”). Notice I did not choose to label the column “gender” in which case another user of the dataset will not know which gender “0” or “1” refers to. When there are multiple categories, label them as numbers “1”,”2″,”3″ (e.g. “colour” as the variable name 1=red, 2=blue etc…).

Ensure dates are formatted correctly

Ensure that you format a date field as a “date” in excel, this will help identify common errors where date is entered incorrectly (e.g. 9/11/13 and 9.11/13).

Be careful entering data

When seen by a computer, “No”, “no” and “no.” are not the same (it will be treated as a factor with three levels / outcomes).

Do not change the column names unless absolutely necessary

When I undertake an analysis, I keep a script file with the names of the variables in it. If you preserve the names, the same analysis can be done in seconds. If you change the names, I have to re-write every part of the script that references the previous variable name and manually change it to your new name.

"Obeying these simple rules will save your statistician hours and hours of data formatting and results a happier interaction!"