Listening to Data

Archaic Data Science: lesson of the Ulfberht

Data science is a very popular term used to describe relatively recent developments involving data. The popular notion of data science centers on access to very large amounts of data from multiple sources, usually in the context of large unstructured data stores sometimes generically called NOSQL.

This modern notion of data science distinguishes itself from older (and very mature) disciplines of statistics and relational databases. People with strong backgrounds in statistics or relational databases have a claim to being data scientists, but they rarely if ever used that term before. Thus, when they attempt to claim the role of data science, their modern counterparts are inclined to call them fake data scientists. Modern data science is about raw unstructured data.

Although modern data science uses statistical tools and structured algorithms, its approach differs from older approaches in that the ideal is to always apply statistical models or structured extractions directly to the raw data. In contrast, practitioners of the older “fake” data science perform their craft on statistical or relational models that are derived from raw data: once the model captures the information in the data, there is no need for access to raw data except for archival purposes.

The term “fake” data science has been used to say that a person with prior skills and experience in statistical or relational modeling doesn’t have the right skills to perform modern data science. I suggest “archaic” may be a better term than “fake”. My point is that the older disciplines necessarily involved a deep appreciation of data in order to properly understand the relevance and credibility of the data for a particular project. This appreciation is not a trivial skill.

What I find alarming is that the more modern approach dismisses the wisdom gained from generations of diligent investigation of data. Today data science is applying tools to bigger data sets. These tools are the same as used before: statistical models or relational query techniques. However, today they are just tools to apply to the larger raw data sets. The challenge of the modern data scientist is to optimize these tools to tackle the larger data sets.

As suggested by the term NOSQL, if an older tool cannot scale then we must dispense with the tool entirely and find a different one. Relational databases using high degrees of normalization do not scale to the sizes needed. Thus, relational databases need to be abandoned in favor of non-relational approaches. The focus is on scaling up to larger quantities of data.

Modern tools for handling extremely large data sets have an impressive range of capabilities. These tools retrieve query results and use them to feed statistical algorithms. From the perspective of an end user such as an analyst, modern data science reproduces older capabilities but with much larger data sets.

It appears that new data science approaches completely replace old approaches for working with data. The old approaches are archaic: old and irrelevant.

I am writing this post because I believe the older approaches are still relevant and essential. A different way to distinguish the two interpretations of data science is by analogy. Modern data science is a specialty of computer science: it grew out of computer science, a science that focuses on implementation. Legacy (archaic) data science is a form of historical science like history, archaeology, or audits. The two disciplines complement each other, but they remain distinct.

As a form of historical science, archaic data science treats data like evidence. Like evidence, data is subject to scrutiny for relevance, accuracy, reliability, and meaning. Any results from the analysis are prone to challenges by counter claims. There is a need to be very selective about what data is available for analysis.

In these archaic data sciences, there was a common approach to both statistical and relational handling of data. This approach involves multiple steps of preparing data with intermediate forms of data between the raw data and the final publishable analysis. Typically there are multiple steps of refining data similar to refining raw materials from the earth. Each step has a purpose to add value in small incremental steps. For example, there is an analogy to material mining where the first step is to separate the ore, and a second step is to remove the impurities in the ore. Data practice follows the same successive refinement for essentially the same reasons.

What we need from data is that it have the right properties for the task we are asking of it. When we use data to make important decisions, the data for that decision has to be robust enough to stand up to any challenge.

Recently, I watched a video (by NOVA) about a modern recreation of the likely process of producing the superior Viking swords bearing the name Ulfberht. Within this video, there is a detailed description of each of the steps that were performed to go from raw ore to a fine sword. As mentioned in the video, the final sword with its unique superior characteristics depended on all of the previous steps being performed exactly right: if any step failed, then there was no option but to go back to the beginning. This is an interesting analogy because the sword was a decisive technology for the Vikings. There was a huge difference between the preparation of the Ulfberht and the quicker path from ore used by its rivals.

Another interesting observation from this video is that the technology for preparing the steel for this sword was lost for nearly 1000 years. At one time, there were craftsmen who knew how to build such a superior product, but eventually they were ignored and their knowledge was lost. The video didn’t say, but I suspect one reason for the decline was that it was easy to overwhelm the expensive superior blades with a much larger volume of cheap blades. Eventually the dominance of the Vikings declined and their superior technology became archaic and ignored.

The video reminds me of my own experience of preparing data through nearly a dozen steps each involving intermediate results that used the predecessor step for its source of raw data. Each step had its own specific purpose either to remove flaws or to enrich the information content with more data or processing.

Eventually, the data available to analysts was far removed from the raw data that we started with. Unlike the sword manufacture scenario, we saved the intermediate results so that we could trace back from the final prepared data to the ultimate raw data for confirmation purposes. More importantly, the careful preparation of the final data form facilitated rapid development of new reports or new queries with a high degree of confidence that the results could support important decision making. By working with this highly refined data, the analyst could create new queries and still have the confidence that comes from using data that has already earned its trust from earlier analyses.
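The idea of keeping every intermediate result so that a final figure can be traced back to raw records can be sketched in a few lines. This is only an illustration, not the actual system I worked on: the record fields, the `SENSOR_SITE` lookup, and all function names are hypothetical, and the real process had nearly a dozen steps rather than three.

```python
# Hypothetical raw records; one of them is flawed (missing value).
RAW = [
    {"id": 1, "sensor": "a", "value": 10.0},
    {"id": 2, "sensor": "a", "value": None},
    {"id": 3, "sensor": "b", "value": 7.5},
    {"id": 4, "sensor": "a", "value": 2.5},
]

# Hypothetical reference data used to enrich the records.
SENSOR_SITE = {"a": "north", "b": "south"}


def remove_flaws(rows):
    """Step 1: drop records that cannot be used (here: missing value)."""
    return [dict(r) for r in rows if r["value"] is not None]


def enrich(rows):
    """Step 2: add information that the raw record does not carry."""
    return [{**r, "site": SENSOR_SITE[r["sensor"]]} for r in rows]


def summarize(rows):
    """Step 3: total per site, remembering which raw ids contributed."""
    totals = {}
    for r in rows:
        entry = totals.setdefault(
            r["site"], {"site": r["site"], "total": 0.0, "raw_ids": []}
        )
        entry["total"] += r["value"]
        entry["raw_ids"].append(r["id"])
    return list(totals.values())


def run_pipeline(raw):
    """Run each step on its predecessor's output and keep every stage."""
    stages = {"raw": raw}
    stages["cleaned"] = remove_flaws(stages["raw"])
    stages["enriched"] = enrich(stages["cleaned"])
    stages["summary"] = summarize(stages["enriched"])
    return stages


def trace(stages, site):
    """Walk back from a summary row to the raw records behind it."""
    ids = next(s["raw_ids"] for s in stages["summary"] if s["site"] == site)
    return [r for r in stages["raw"] if r["id"] in ids]
```

Calling `run_pipeline(RAW)` returns all four stages, so an analyst can query `stages["summary"]` while `trace(stages, "north")` recovers the raw records that support that number.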

I was directly confronted with the complaint that this approach was archaic. In particular, a demonstration project showed that the new data science techniques could reproduce, in a single quick step, several of the reports available from the old process. This was a powerful demonstration because of the potential cost savings it promised. Even the old process had to invest in retaining access to the raw data. The advantage was that the new process dispensed with all the additional investment required for the multi-step processes with all of their persistent intermediate results.

The demonstration proved that a multiple-step process was not essential for preparing the precise and elaborate reports. There is still a productivity advantage in readily accessible SQL skills, but it was claimed that the people who wrote SQL could easily be retrained or replaced by people who could write the queries in the new tools. The old approach was old. It was archaic.

I agreed with the benefits as presented. In fact, I was excited by the prospect of eliminating the intermediate steps because they were a big burden in terms of hardware, labor to maintain the hardware (such as patching the OS), and labor to monitor and manage each of the intermediate steps behind the scenes. However, I attempted to defend the old process.

I believe in the practice of rhetoric that suggests every proposal deserves a strongly argued objection. Even if the objection is unlikely to succeed, as was the case with my project, the argument can bring up considerations that otherwise may be overlooked.

In this case, I argued the case for the value added in the intermediate steps.

First, I pointed out that they are recreating existing reports. The initial creation of these reports benefited from easy access to prepared data that already had earned confidence from previous uses. The demonstration did not demonstrate creating innovative reports that provided new value. In contrast, the older approach demonstrated sustained productivity for innovation over a long period of time. Having data already refined and ready for use provides a great advantage in terms of conceiving the new report and of easily implementing the reports. The same innovation is possible with the more direct single step approach, but it would be instructive to compare the two approaches for some innovative report instead of reproducing an existing report.

I mentioned this first because it is an easily understood proposition. I proceeded to demonstrate creating innovative reports using the old multi-step approach but with a technology refresh to use the latest multi-dimensional database technologies. In this case, there was no parallel attempt to show a comparable effort with the new single-step approach because no one was available who understood it well enough. However, they were convinced that someone equally skilled in the new single-step technology could be equally productive. In fact, they suggested that I could do it myself, which misses the point of this exercise. I was presenting the case that more people can more quickly produce innovative reporting with the older approach than with the newer approach.

I had a second argument that I think is much stronger. This argument is that the multiple steps provide protection against the countless ways that data can disappoint us. Preparing data on a daily basis throughout the calendar exposes the processes to many kinds of problems that can occur. Often these problems are completely unanticipated. Once we discovered a new problem, we needed new ways to work around the problematic data until the corrective action could be completed.

For example, we had been working with a certain type of data for several years when suddenly some of the records duplicated the same observation. The records looked unique, but the information was clearly a duplicate of another record. This issue was rare, so most of the records did not duplicate data from another record. Also, only in certain conditions would the problem be large enough to affect the final results. Having a multi-step processing approach provided a great opportunity to create a fix by introducing a cross-check with other refined data. I agree that a similar solution could occur in a single-step approach because both approaches involve working with the same raw data. My argument was that the problem was easier to identify, to isolate, and to fix with a multiple-stage approach compared to confronting the problem directly with the raw data.
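A cross-check of this kind might look roughly like the sketch below. It is not the fix we actually deployed; the field names (`record_id`, `device`, `observed_at`, `reading`) and the `expected_counts` reference are assumptions made for illustration. The point is that an independently refined data set says how many observations each source should have, so only sources that exceed that count are searched for records that repeat the same observation under a different record id.

```python
from collections import defaultdict


def find_duplicate_observations(records, expected_counts):
    """Return record ids that repeat an earlier observation.

    records: dicts with hypothetical keys record_id, device,
             observed_at, reading.
    expected_counts: per-device observation counts taken from another,
             independently refined data set (assumed to exist).
    """
    by_device = defaultdict(list)
    for r in records:
        by_device[r["device"]].append(r)

    suspects = []
    for device, rows in by_device.items():
        # Cross-check: skip devices whose record count matches the
        # independent reference (or for which no reference exists).
        if len(rows) <= expected_counts.get(device, len(rows)):
            continue
        seen = set()
        for r in rows:
            key = (r["observed_at"], r["reading"])
            if key in seen:
                # Same observation, different record id.
                suspects.append(r["record_id"])
            else:
                seen.add(key)
    return suspects
```

Because the check runs against refined intermediate data rather than the full raw store, legitimately repeated readings from a source whose count agrees with the reference are left alone.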

Over the decade of daily reporting that I worked on that project, I encountered similar problems a couple of times each month. Each time, I was able to find effective solutions that exploited the advantages of having prepared intermediate results to simplify the exploration of the data and to implement a solution. However, most of the time I did this work internally with the team devoted to the project. Part of the reason for the good reviews of our performance was that we made this kind of work look easy. No one outside of the project was alert to the actual mechanics of working through a problem like this. They were only aware that the problem occurred and that we implemented an effective fix within an acceptable amount of time. My argument that the multiple-step approach made this productivity possible was lost because there was no one around to listen to these details. It is hard to argue this case until the audience is aware of the possibility that data will disappoint us.

I very much question whether the single-step approach can be effective in detecting and isolating problems in the data itself. In the single-step approach, every query runs directly on the raw data, which is huge. The queries themselves gain their speed by being selective: they extract a tiny subset of data at a time.

It is much more expensive to pull a query for every piece of data that occurred over a period. This is the type of query that is needed to investigate and isolate the unexpected problems such as the one above. We know something is wrong, but we don’t yet know the cause.

In fact, a major part of the objectionable investment of the multi-step processing is the resources devoted to exactly this task of pulling every single record in order to refine it into a summary result. Without any selective filter, any query of all available data will take a lot of resources. The benefit of the multiple-step process is that it ends up with a more easily queried data set that summarizes the totality of all of the raw data. We can explore this summary data more readily to find problems and devise solutions. The solutions often are easy to implement because they can work with this summary data instead of requiring specialized performance optimization to work directly on the raw data.
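As a rough illustration of why the summary is easier to explore, the sketch below uses Python's built-in `sqlite3` module. The table and column names (`raw_events`, `daily_summary`, `day`, `source`, `amount`) and the "twice the average" threshold are assumptions for the example, not details of the original project. The expensive full scan happens once, when the summary is built; the hunt for an unexpected problem then runs against the much smaller summary table.

```python
import sqlite3


def build_summary(conn):
    """Scan every raw record once and keep a per-day, per-source summary."""
    conn.execute(
        """
        CREATE TABLE daily_summary AS
        SELECT day, source, COUNT(*) AS n_records, SUM(amount) AS total
        FROM raw_events
        GROUP BY day, source
        """
    )


def days_with_unusual_volume(conn, factor=2.0):
    """Find days whose record count exceeds `factor` times the source's
    average daily count, using only the summary table."""
    return conn.execute(
        """
        SELECT s.day, s.source, s.n_records
        FROM daily_summary AS s
        JOIN (
            SELECT source, AVG(n_records) AS avg_n
            FROM daily_summary
            GROUP BY source
        ) AS a ON a.source = s.source
        WHERE s.n_records > ? * a.avg_n
        ORDER BY s.day
        """,
        (factor,),
    ).fetchall()
```

An exploratory question such as "which days look wrong?" never touches `raw_events` here; the raw data is consulted only afterwards, for the specific days the summary points to.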

This argument is hard to make because it resides solely in the experience of the older data science of recognizing the need to refine data into a form that is easier to use. This skill is quickly becoming archaic. Eventually the old approach will disappear like the Ulfberht and we will not remember what remarkable properties that old approach provided.