Keeping up with big data query technologies?

February 08, 2013

Keeping up with big data query technologies is a real challenge. Luckily for AgilOne customers, we do the work for them. But if you are sifting through all the different offerings, I have some advice from our evaluations. At AgilOne we use a number of databases and data storage technologies. We provide our customers with features that require a variety of techniques, ranging from state-of-the-art machine learning to speed-of-thought query services. Clearly, no single technology will satisfy all our needs. A general high-level categorization labels the various offerings as supporting batch, interactive, or streaming analytics. In this blog post we focus on technologies for interactive big data queries.

Picking the right technology and approach to these problems is of strategic importance to us, so we constantly look at how to evolve our infrastructure. This is convenient for our customers, as they do not need to keep up with all the developments in this fluid space. Recently we looked at new ways of implementing our query services, which allow customers to conduct domain-oriented business intelligence queries at speed-of-thought.

How you compare and contrast technologies depends on the problem at hand. In our case we looked at it from the perspective of speed-of-thought queries in SQL or a SQL-like language, as well as the possibility of providing a pivoting UI on top of the query engine. We did not intend to satisfy other needs, such as implementing machine learning or data cleansing, with this solution.

The number of available technology offerings is growing day by day. Some solutions focus on providing very scalable analytics engines for very large datasets. Some focus on providing very fast query engines using columnar storage techniques. Other technologies try to provide both a NoSQL and an RDBMS platform and so solve the "thin pipe" problem. With the growing popularity of Hadoop, shuffling data between NoSQL stores (HDFS) and relational stores has become a bottleneck. The "thin pipe" refers to technologies like Sqoop that provide a data transfer mechanism, but with relatively low transfer speeds. Other solutions provide massively scalable mechanisms for creating aggregated data, which allows for quick queries but requires background aggregation processes. In addition, traditional relational database vendors are combining columnar store or NoSQL implementations with traditional implementations of the relational model, so that enterprises can take advantage of both using one infrastructure component.
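To make the trade-off of the aggregation approach concrete, here is a minimal sketch in Python. It is not taken from any of the products we evaluated; the data, the segment names, and the functions `build_aggregates` and `revenue_for` are all hypothetical. It only shows the general idea: a background step scans the raw rows once, and interactive queries then read the small precomputed result instead of the raw data.

```python
from collections import defaultdict

# Hypothetical raw rows: (customer_segment, order_amount).
RAW_ORDERS = [
    ("loyal", 120.0),
    ("new", 35.5),
    ("loyal", 80.0),
    ("lapsed", 15.0),
    ("new", 64.5),
]


def build_aggregates(rows):
    """Background step: scan every raw row once and keep per-segment totals."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for segment, amount in rows:
        totals[segment]["orders"] += 1
        totals[segment]["revenue"] += amount
    return dict(totals)


def revenue_for(aggregates, segment):
    """Interactive step: a dictionary lookup, with no scan of the raw rows."""
    entry = aggregates.get(segment)
    return entry["revenue"] if entry else 0.0


# In a real system this would be rerun on a schedule as new data arrives.
AGGREGATES = build_aggregates(RAW_ORDERS)
```

The query is fast because the expensive work has already been done, but the answer is only as fresh as the last background run, and only the dimensions that were aggregated in advance can be queried this way.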

In our case we had an additional consideration: being a cloud service, we need to consider how well the solutions work for a multi-tenant deployment storing lots of data. This is in fact a very significant consideration, as it has both technical and economic implications.

As a first step to help guide what to look for in these technologies, I propose you consider the following dimensions:

1) How much data will you store? Is 100TB enough? Will you need several hundred TB, or do you need something that scales beyond that? For many enterprises 100TB is enough.

2) If you need to scale beyond 100TB, are you prepared to invest money and resources in developing sharding over multiple data stores? The alternative is to require that the solution automatically takes care of scaling out for you.

3) What kind of query language do you need? Is a subset of SQL for simpler select queries enough, or do you need full SQL? Do you use tools that generate MDX queries?

4) Do you want a commercial product, or are you comfortable using an open source solution? If you are comfortable with an open source solution, do you think you need commercial support?

5) Is this strategic for you? If so, are you prepared to be an early adopter of a new technology that may give you an edge?
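To give a feel for what question 2 implies, here is a minimal sketch in Python of the routing layer you would own if you shard across multiple data stores yourself. Everything in it is an assumption made for illustration: the tenant IDs, the number of shards, and the functions `shard_for` and `route` are hypothetical, and a production design would also need rebalancing, cross-shard queries, and failure handling.

```python
import hashlib


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard index using a stable hash of its ID."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def route(rows, shard_count):
    """Split (tenant_id, payload) rows into one list per shard."""
    shards = [[] for _ in range(shard_count)]
    for tenant_id, payload in rows:
        shards[shard_for(tenant_id, shard_count)].append((tenant_id, payload))
    return shards


# Hypothetical rows from three tenants, spread over four data stores.
ROWS = [
    ("tenant-a", "order-1"),
    ("tenant-b", "order-2"),
    ("tenant-a", "order-3"),
    ("tenant-c", "order-4"),
]
SHARDS = route(ROWS, 4)
```

Because the hash is stable, all rows for one tenant land in the same store, so a single-tenant query touches only one shard. The catch is that changing the number of shards remaps most tenants, which is part of the investment the question asks about.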

The diagram below informally illustrates some of the many aspects you may consider. Scale-up implies that the solution's architecture is not centered on a scale-out model. Super scale-out indicates that the solution's main design center is scaling to very large data sets. The size of the circle indicates the price point, with more expensive offerings shown as larger circles. The color indicates maturity: green for very mature, yellow for early in the life cycle, and red for very early.

Things evolve quickly in the space of analytics databases right now. Existing solutions change, new ones become available, and the strategies of emerging technology companies in this space are still taking shape. You need to know what problem you are solving, decide whether this technology is strategic for your company, and make your decision accordingly.