News and information about supporting research, scholarly and artistic work

Big Data

I’ve recently returned from a conference on Big Data and data mining. These topics are becoming more and more important in the world of research computing, so I would like to present a series of short articles covering the issues surrounding modern data management.

It seems everyone is talking about Big Data these days. The increasing availability of storage and computing power has driven datasets to grow quickly, well beyond the ability of traditional tools to handle. How can these datasets be analyzed quickly and efficiently?

Traditionally, most datasets have been small enough to analyze on a single computer, often a desktop or small server. As data has grown, though, this simplicity has begun to break down. System memory becomes insufficient, disk space runs short, and algorithms need to be modified to handle the sheer volume of data. Older tools, such as desktop visualization and analysis programs and even relational databases, start to fall short.

As these issues have become more apparent, tools have been developed to manage the influx of data. It is now possible, with sufficient computing power, to quickly deal with huge quantities of data. Many of these tools have come from large Internet companies like Google and Amazon that were among the first to encounter the issues around Big Data in the course of their normal operation.

One example that will be covered in a future post is the Map/Reduce strategy. This technique handles large amounts of data by breaking a problem into small pieces that are processed in parallel on a large number of machines, then combining the partial results. This approach was pioneered at Google, and has been implemented in the open source Hadoop package. If you’ve ever wondered how Google’s search can find the thing you’re looking for in all the data on the Internet, part of the answer is that Google used Map/Reduce to build its search index ahead of time.
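
To make the idea concrete, here is a minimal sketch of the Map/Reduce pattern in Python, counting words across several chunks of text. It is only an illustration on a single computer: the function names (map_phase, shuffle, reduce_phase, word_count) are my own and are not part of Hadoop or any Google API, and a thread pool stands in for the many machines a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key so each word's counts sit together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(item):
    """Reduce: combine all the values collected for a single word."""
    word, counts = item
    return word, sum(counts)


def word_count(chunks, workers=2):
    # Hypothetical stand-in for a cluster: each worker thread plays the
    # role of one machine processing its own piece of the problem.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
        reduced = list(pool.map(reduce_phase, shuffle(mapped).items()))
    return dict(reduced)


if __name__ == "__main__":
    chunks = ["big data is big", "data about data"]
    print(word_count(chunks))
```

The important point is that each call to map_phase looks at only one chunk and each call to reduce_phase looks at only one word, so neither step needs the whole dataset in one place. That independence is what allows the work to be spread across many machines.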

There are many approaches to handling this data, depending on how it will be used. I’ll be covering some of these approaches and the related tools in later posts.

If you have any questions about data management, no matter how big your data is, contact me at research_computing@usask.ca and I’d be happy to help you.