Big data (useR! 2011)

J. Demmler – Challenges of working with a large database of routinely
collected health data

The SAIL data bank holds over 1.9 billion (anonymous) entries. To use the data for research, they need to ensure that proper data security is observed. For example, secure data transport. All analysis is done with a secure environment. Files are moved into the environment via an FTP client

Why R? No advanced SQL options available, so using DB2 allows loops. Also R is great for data pre-cleaning and is suitable for the heavy analysis. To connect to the SAIL database, they need to use the RODBC package. SQL queries are run from within R, however SQL scripts are kept in separate files since they are “reviewed”.

Lots of errors in data, e.g. units.

John Bryant – Demographic: classes and methods for data about populations

Existing data structures for population type data:

array: messy code;

data frames: not that natural for this type of code;

demography package: not really extensible.

Target audience for this new package: applied statisticians, social scientists. Not programmers. Core to this package is the Demographic class: S4 object, specialized array with associated meta data.