Imperative vs Procedural analysis packages.

Edited: Dec 30, 2009, 8:54am

oops should be "Imperative vs Procedural"...

this is something of a rant - worrying about the lack of a data trail and problems w/ data mgt, given the trend toward "imperative/command driven" analysis packages (Stata/R) in the field i work in, epidemiology. Curious about others' thoughts. Taken from a note written to a young, very bright recent PhD in epi whom i've worked with over the last few years at the Nat. Inst. of Envi. Health Sciences (NIH). My background is PhD, Geography; MLS; and MSPH. Been working for 25 yrs as an epidemiology data analyst w/ a small group of reproductive epidemiologists...

Some time or another, figuring you've been exposed to Stata and are very handy w/ SAS, i'd like to chat w/ you - and others of the younger epidemiologists - about strong/weak points of the programs. I'm NOT worried about the stats analysis bits - probably R is stronger than either and all three are more than sufficient.

What worries me are: 1. the command line mode of interactive analysis that Stata encourages (i know it doesn't require this approach - but the whole .do command/programming mode that allows Stata code to be saved and reused appears very much to be an afterthought; certainly this impression is enhanced by reading the Stata manual!); 2. the relative difficulty (evidently) of writing Stata code that is "structured" - that is, readable by someone other than the coder (or, indeed, by the original coder). The examples of Stata coding that people have had to try to work their way through over @ Westat typically just wend, wind and wrap all over the page/screen. Basic aids to understanding, e.g. indentation, are generally ignored. The "recode" and variable definition statements in Stata are VERY hard to read, largely, i suspect, because from its beginning Stata "assumed" a good, clean dataset that its powerful statistical commands could then manipulate.

Tracking changes to data and datasets, along w/ the problems attendant on the various sorts of merges, updating, and data and dataset manipulation, is, as far as i'm concerned, REALLY why we use SAS at the data processing end. (SPSS - what the Norwegians use along w/ R - HAS improved its data handling/mgt capabilities a lot, too.)
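To make the "data trail" idea a bit more concrete, here is a minimal sketch of a merge that records where every row came from, roughly in the spirit of the IN= flags on a SAS MERGE. It's written in Python purely for illustration; the function names, the `_source` column and the toy records are all made up for this example, and it assumes each id appears at most once per dataset.

```python
def merge_with_audit(left, right, key):
    """Full outer join of two lists of dicts on `key`.

    Each output row gets a hypothetical `_source` field saying whether it
    matched in both inputs or came from only one of them.
    Assumes `key` is unique within each input.
    """
    right_by_key = {row[key]: row for row in right}
    left_keys = set()
    merged = []
    for row in left:
        k = row[key]
        left_keys.add(k)
        match = right_by_key.get(k)
        if match is not None:
            merged.append({**row, **match, "_source": "both"})
        else:
            merged.append({**row, "_source": "left_only"})
    # Rows that exist only in the right-hand dataset.
    for k, row in right_by_key.items():
        if k not in left_keys:
            merged.append({**row, "_source": "right_only"})
    return merged


def audit_counts(merged):
    """Tally how many rows matched and how many did not."""
    counts = {"both": 0, "left_only": 0, "right_only": 0}
    for row in merged:
        counts[row["_source"]] += 1
    return counts


# Toy data, invented for the example.
mothers = [{"id": 1, "age": 30}, {"id": 2, "age": 25}, {"id": 3, "age": 41}]
outcomes = [{"id": 2, "cleft": 0}, {"id": 3, "cleft": 1}, {"id": 4, "cleft": 0}]

merged = merge_with_audit(mothers, outcomes, "id")
counts = audit_counts(merged)
print(counts)
```

The point is only that the unmatched rows get counted and kept in view, so someone reading the log later can see that a merge dropped or orphaned records.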

I had a long discussion w/ a friend who's pretty high up in a major survey/research outfit's stats/epidemiology section and responsible for the data integrity/quality of the work coming from a large number of analysts. I know she's very perturbed because the financial officers @ her shop want to dump SAS for Stata and, as she's used both, a lot, she's worried about being able to ensure the quality of the data that goes in for analysis. My thought was that, in a pinch, an organization that was required to do both data mgt and analysis could just license Base SAS and (preferably) SAS/STAT and then use the powerful Stata (or even cheaper, i.e. free, R) statistical procedures -- GIVEN a clean analysis dataset that needs minimal manipulation.

The whole "command line" approach to data mgt is unnerving - it's conceptually the same as trying to do data mgt in Excel. People try to do it all the time and when it bites them in the butt, the real danger is that, unless the analyst knows the data intimately, they're left w/out a clue that a problem even exists. I'm also aware that programmer/analysts relying on SAS can make similar sorts of mistakes - but i think it's easier for someone else to FIND those mistakes, since SAS is all about leaving data trails. Working on Norway Clefts w/ Ruby (a Stata user), she hit a point at which her results were very different from those of one of the biostats postdocs who was using SAS. I replicated in SAS the statistical analysis that Ruby was doing, and she and I got identical results - BUT it was also pretty easy for me to figure out that the biostats guy was not handling SAS missing values correctly - w/out needing to see his code. i simply made a couple of assumptions about what he might have been doing and, indeed, got the results of HIS logistic models. I'm sure he understood statistically what he was wanting to do far more deeply than I ever could - but that doesn't make any difference if the data is being handled incorrectly before the analysis.
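The note above doesn't say exactly what the missing-value slip was, so the following is only an illustration of one well-known trap, not a reconstruction of that analysis: in SAS comparisons a numeric missing value sorts below every number, so a test like "age < 35" quietly sweeps the missing ages into the "young" group. A small Python sketch of the effect, with `None` standing in for SAS's "." and all names and numbers invented:

```python
import math

MISSING = None  # stand-in for a SAS numeric missing value (.)


def sas_like_lt(value, cutoff):
    """Compare the way SAS orders numerics: missing sorts below every number."""
    ordered = -math.inf if value is MISSING else value
    return ordered < cutoff


def naive_flag(age, cutoff=35):
    # Missing ages silently land in the "below cutoff" group.
    return 1 if sas_like_lt(age, cutoff) else 0


def careful_flag(age, cutoff=35):
    # Missing in, missing out: the gap stays visible downstream.
    if age is MISSING:
        return MISSING
    return 1 if age < cutoff else 0


# Hypothetical ages, two of them missing.
ages = [28, 41, MISSING, 33, MISSING]
naive = [naive_flag(a) for a in ages]
careful = [careful_flag(a) for a in ages]
print(naive)
print(careful)
```

With the naive version the model sees five complete records; with the careful one it sees three, which is the sort of difference that moves logistic results without any error message.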

And Stata's failure to include a "statement delimiter" freaks me out (e.g. the semi-colon in SAS; the period "." ending SPSS statements; even relatively "terse" languages like C and C++, which were designed to be "minimalist" in structure, use ";" statement terminators). Stata seems to be conceptually based on something like APL - the original "do everything in one line" programming language, which made it incredibly powerful and incredibly hard to decipher: (~R∊R∘.×R)/R←1↓⍳R --- err, as i'll believe what the wikipedia author says, this expression finds all prime numbers from 1 to R. Whoa! But, just like Stata, APL is an "imperative", command driven language and lacks statement terminators.
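For anyone who wants to see what that APL one-liner is doing, here is a rough step-by-step rendering in Python. This is my reading of the expression, with the function name `primes_up_to` made up for the example: take 2..n, build the multiplication table of that list against itself, and keep whatever never shows up in the table.

```python
def primes_up_to(n):
    """Spell out (~R∊R∘.×R)/R←1↓⍳R one piece at a time."""
    # R←1↓⍳R : the integers 1..n with the leading 1 dropped, i.e. 2..n
    r = list(range(2, n + 1))
    # R∘.×R : every product a*b with a and b drawn from r
    products = {a * b for a in r for b in r}
    # (~R∊...)/R : keep the members of r that are NOT among those products
    return [x for x in r if x not in products]


print(primes_up_to(20))
```

Same algorithm, three commented lines instead of one uncommented one, which is more or less the readability trade-off being complained about.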

Anyway.. sometime I'd like to pick yr brains on this, as you're part of the proverbial younger generation of epi analysts.

My faults lie on the other end of the spectrum... i'm afraid to get rid of any code, and as i tend to go about dataset and data manipulation very incrementally, data step by data step as it were, i have far too many pieces of SAS code that i'm loath to trash - esp. those in which I've made mistakes. Which is why, among other things, as i wander through the fields of EPS and highlight/summarize the key datasets and programs, i'm not getting rid of any code or datasets (yet, at least) but I am trying to set up a means for someone else to discern what the important datasets really are and how they got there (probably about 10% of the total, at most!).

-- bob mcconnaughey

hmm.. reading about APL, i am totally UNsurprised to see that it was a major influence on spreadsheet development. thoughts?