This is an excellent question and one I often ask myself when reading new data-science articles. I can only assume that these articles are discussing organisations that know exactly what data they do and don't need to collect, and that have already accumulated large repositories of highly relevant, problem-specific data.

My experience in scientific data-mining is very different, however. I've often received novel master-questions against databases that were never originally designed for such general queries. This has led me to reconstruct data and contextually recombine parts of disparate databases to fit the context of the question, then present summaries of the result back to the querying scientists.
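To make that concrete, here is a minimal sketch of the kind of recombination I mean, using two hypothetical SQLite databases. All table, column, and site names are illustrative assumptions, not from any real system: one database holds assay results, another holds sample provenance, and neither schema anticipated the cross-cutting question "what is the mean activity per collection site?".

```python
import sqlite3

# Hypothetical legacy database 1: assay results, designed for one purpose.
assays = sqlite3.connect(":memory:")
assays.execute("CREATE TABLE assay (sample_id TEXT, compound TEXT, activity REAL)")
assays.executemany(
    "INSERT INTO assay VALUES (?, ?, ?)",
    [("s1", "cmpA", 0.8), ("s2", "cmpA", 0.3), ("s3", "cmpB", 0.9)],
)

# Hypothetical legacy database 2: sample provenance, designed separately.
# ATTACH lets one connection query across both databases.
assays.execute("ATTACH ':memory:' AS provenance")
assays.execute("CREATE TABLE provenance.origin (sample_id TEXT, site TEXT)")
assays.executemany(
    "INSERT INTO provenance.origin VALUES (?, ?)",
    [("s1", "lab-north"), ("s2", "lab-north"), ("s3", "lab-south")],
)

# Recombine the two sources to answer a question neither schema anticipated:
# mean activity per collection site.
rows = assays.execute("""
    SELECT o.site, AVG(a.activity) AS mean_activity, COUNT(*) AS n
    FROM assay AS a
    JOIN provenance.origin AS o USING (sample_id)
    GROUP BY o.site
    ORDER BY o.site
""").fetchall()

for site, mean_activity, n in rows:
    print(site, round(mean_activity, 2), n)
```

In practice the hard part is not the join itself but establishing that `sample_id` means the same thing in both systems; that semantic alignment is where most of the reconstruction effort goes.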

Quite often they realise either that they can't ask the question the way they have posed it (the data cannot support it), or that they are asking entirely the wrong question (having seen new aspects of their data), and they come back with a completely different line of questions.

This is a highly iterative process. Much effort is required to develop a collective understanding of the deficiencies and effects of legacy data-generation processes in relation to the current problem context, and of how best to pursue lines of analysis (and, ultimately, better future data generation to support them).