Resources

Links to and information about data science references, training and news

I keep a ‘go-to’ list of high-quality, up-to-date sources of information that I draw on continually: as a reference, for training, for detecting patterns and trends, and for comparison and evaluation.

I have mapped the list to the data science process and to the skills and knowledge that support it. I hope it will save you time and help you focus on the key aspects, using quality, vetted information.

The following graph provides an outline of the data science process and the skills and knowledge required to practice it.

The data science process and tools, skills & knowledge that support it

Summary of the data-science process

Extract, Transform & Load

The first, and probably the most labour-intensive, part of the data science process is preparing and structuring datasets to facilitate analysis: specifically, importing, tidying and cleaning data. Hadley Wickham wrote about this in the Journal of Statistical Software, vol. 59, 2014.

Tidy Data

The paper sets out to deal with ‘messy data’, including the following issues:

Column headers are values, not variable names.

Multiple variables are stored in one column.

Variables are stored in both rows and columns.

Multiple types of observational units are stored in the same table.

A single observational unit is stored in multiple tables.
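The first issue in the list above (column headers that are values, not variable names) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's own code: the table contents, the `melt` helper and the column names `income` and `count` are all hypothetical and chosen only for the example.

```python
# Hypothetical "messy" table: the income-bracket column headers are values,
# not variable names. The numbers are illustrative, not real data.
messy = [
    {"religion": "A", "<$10k": 27, "$10-20k": 34},
    {"religion": "B", "<$10k": 12, "$10-20k": 27},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        # Identifier columns are copied unchanged onto every output row.
        ids = {key: row[key] for key in id_vars}
        for column, value in row.items():
            if column not in id_vars:
                tidy_rows.append({**ids, var_name: column, value_name: value})
    return tidy_rows


# Each output row is now one observation: (religion, income bracket, count).
tidy = melt(messy, id_vars=["religion"], var_name="income", value_name="count")
for record in tidy:
    print(record)
```

After reshaping, each variable has its own column and each row holds a single observation, which is the layout most analysis tools expect.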

Null Values / Missing Data

Most models cannot handle data with missing values. The data science process therefore includes steps to detect and fill in missing data.

This treatment can happen anywhere between the importing and transforming stages. A simple approach fills the gaps with an aggregate, such as a group mean, during transformation; more complex treatments use a model to estimate the missing values.
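The simpler, aggregation-based approach might look like the following sketch, which first counts the missing values and then fills each one with the mean of its group. The records, the column names `group` and `height_cm`, and both helper functions are assumptions made for illustration; a real pipeline would pick the aggregate to suit the data.

```python
from statistics import mean

# Hypothetical records; None marks a missing measurement.
records = [
    {"group": "a", "height_cm": 152.0},
    {"group": "a", "height_cm": None},
    {"group": "a", "height_cm": 160.0},
    {"group": "b", "height_cm": 170.0},
    {"group": "b", "height_cm": None},
]


def count_missing(rows, column):
    """Detect: how many rows have no value for the given column?"""
    return sum(1 for row in rows if row[column] is None)


def impute_group_mean(rows, column, group_by):
    """Populate: replace each missing value with the mean of its group."""
    observed = {}
    for row in rows:
        if row[column] is not None:
            observed.setdefault(row[group_by], []).append(row[column])
    group_means = {key: mean(values) for key, values in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        if new_row[column] is None:
            # Stays None if the whole group has no observed values.
            new_row[column] = group_means.get(new_row[group_by])
        filled.append(new_row)
    return filled


print("missing before:", count_missing(records, "height_cm"))
filled = impute_group_mean(records, "height_cm", "group")
print("missing after:", count_missing(filled, "height_cm"))
```

Filling with a group mean keeps every row but understates the variability of the column, which is one reason a model-based estimate may be preferred for more complex treatments.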

PCA

A variable is a quantity, quality, or property that you can measure. Examples: height, weight, sex.

A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement. Examples: 152 cm, 80 kg, female.

An observation, or data point, is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. Example: each person is one observation.