Data Prefetching and Caching

OK, so in our last two discussions, we looked at the memory bottleneck and how, even in high-performance environments, there will still be situations in which streaming data to where it needs to be introduces latencies that throttle processing speed. I also noted that we need some additional strategies to address that potential bottleneck, so here goes:

At the hardware level, an engineered method for reducing data access latency is the cache: a smaller memory module that provides much faster access to data. When data is prefetched into the cache, streaming accesses go much more quickly, because the data has already been delivered close to where it is needed. Data in the cache is basically a mirror of what is held in slower storage, such as main memory or disk, and the values in the cache need to be updated when the corresponding values in that slower storage are updated.
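To make the prefetching idea concrete, here is a minimal software sketch of it. This is only an illustration under stated assumptions, not a model of any real hardware: the class name PrefetchingCache, the prefetch_window parameter, and the use of a plain dictionary with integer keys as the "slow" store are all hypothetical choices made for the example.

```python
class PrefetchingCache:
    """Small fast store in front of a slow backing store (illustrative only)."""

    def __init__(self, backing_store, prefetch_window=4):
        self.backing_store = backing_store      # stands in for slow storage (assumption: a dict with int keys)
        self.prefetch_window = prefetch_window  # how many consecutive keys to pull in per miss
        self.cache = {}                         # the fast mirror
        self.misses = 0                         # number of trips to the slow store

    def read(self, key):
        if key not in self.cache:
            # Cache miss: make one trip to the slow store and prefetch
            # the next few keys, betting that a streaming reader wants them.
            self.misses += 1
            for k in range(key, key + self.prefetch_window):
                if k in self.backing_store:
                    self.cache[k] = self.backing_store[k]
        return self.cache[key]

    def write(self, key, value):
        # Update the underlying value and keep any cached mirror in step,
        # so the cache never serves a stale copy.
        self.backing_store[key] = value
        if key in self.cache:
            self.cache[key] = value


store = {i: i * 10 for i in range(8)}
cache = PrefetchingCache(store, prefetch_window=4)
streamed = [cache.read(i) for i in range(8)]
print(streamed, "misses:", cache.misses)
```

Reading the eight values in order costs only two trips to the slow store in this sketch, because each miss pulls in the next few values along with the one requested. The write method shows the other half of the bargain: the mirror has to be updated whenever the underlying value changes.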

At the software level, we can get the same effect by doing the same thing: making mirror images of data values close to where they need to be. When the underlying data source changes, the copy needs to be updated as well. This caching concept is relatively high level, but there are different infrastructure implementations that rely on this approach. One is called data replication. The approach has been around for a long time, and was used in earlier times to make copies of files in remote locations. For example, if transaction processing took place in New York and that data needed to be accessible in Hong Kong, a replica was created in Hong Kong. Instead of repeating every transaction in both places, the changes in the original were tracked and propagated to the replica via a process called change data capture.
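The New York and Hong Kong example can be sketched in a few lines. This is a toy illustration of the change data capture idea, not how any particular replication product works: the Primary and Replica classes, the in-memory change log, and the account keys are all assumptions invented for the example.

```python
class Primary:
    """The original data set, e.g. the New York system (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.change_log = []  # ordered entries: (sequence, operation, key, value)

    def put(self, key, value):
        self.data[key] = value
        self.change_log.append((len(self.change_log) + 1, "put", key, value))

    def delete(self, key):
        self.data.pop(key, None)
        self.change_log.append((len(self.change_log) + 1, "delete", key, None))


class Replica:
    """The remote copy, e.g. the Hong Kong system (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.last_applied = 0  # sequence number of the last change applied

    def sync(self, primary):
        # Apply only the changes captured since the last sync,
        # rather than re-running every transaction or recopying everything.
        pending = primary.change_log[self.last_applied:]
        for sequence, operation, key, value in pending:
            if operation == "put":
                self.data[key] = value
            else:
                self.data.pop(key, None)
            self.last_applied = sequence
        return len(pending)


new_york = Primary()
hong_kong = Replica()
new_york.put("acct-1", 100)
new_york.put("acct-2", 250)
hong_kong.sync(new_york)
new_york.put("acct-1", 175)
new_york.delete("acct-2")
applied = hong_kong.sync(new_york)
print(applied, hong_kong.data)
```

The point of the sketch is that the replica only ever sees the tracked changes. The second sync moves two log entries, no matter how large the original data set is.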

The same approach is now incorporated into what is called data virtualization, which layers a metadata service on top of a software cache to provide seamless access to distributed data sets stored in heterogeneous environments, file systems, or databases, with reduced latency when the values are placed in the software cache. While there may be different approaches to maintaining this software cache, data replication stands out for two reasons. First, the data replication algorithms have been tested and in production for many years. Second, there has been a lot of investment in making sure the replication algorithms provide high performance. In my next post we’ll return to the discussion of data latency and big data and the value of data replication. I will discuss this topic more on May 23 for the Information-Management.com EspressoShot webinar, Treating Big Data Performance Woes with the Data Replication Cure.