Terabytes Scalability

My team is currently evaluating Qlikview Server 64bit.

One concern we have is QlikView's scalability in the terabyte range.

Perhaps QlikView is infinitely/linearly scalable, but our definition of scalability is bound by hardware cost considerations. If 1 TB of transactional data over a handful of tables roughly requires 100 GB of RAM (or more), that information is not very useful to us.

We would rather learn that, for example, a 32 GB server can handle 1 TB of data with degradation characterized by response time, number of concurrent sessions, or other behaviors. That would help us budget hardware against our data growth and the trade-offs we can accept.

Can someone shed some light on this area? There seems to be little information in the manuals that speaks to resource management in QlikView.

Terabytes Scalability

Well, I'm not sure how helpful my own experience will be, as we have orders of magnitude less data than you, but I'll give it a shot.

One constraint is that you REALLY want ALL of your data to be in RAM at the same time. If QlikView starts swapping to disk, your performance will take such a huge hit as to make it unusable. So then what it probably comes down to is how well you can compress from your source data down into the data models and fields specific to each application, and from there how good QlikView is at compressing what is left. And that's going to be EXTREMELY dependent on your data and your data model.

Let me give what I might consider worst case and best case scenarios based on my own applications.

Worst Case: In perhaps my biggest and most complicated application, I'm reading from a set of custom tables designed to use as little memory as possible and process as quickly as possible in the DBMS itself. Uncompressed, we're probably reading in 2.5 GB of raw data. When loaded in, heavily massaged, compressed by QlikView, and finally presented to the user, it takes about 1 GB of RAM. However, I believe the load process itself peaks at over 2 GB of RAM. Call it 2.5 GB for safety, and we're at that definition of a 32 GB server only handling 32 GB of source data. Also, you might have 100 users referencing the same file at once, all with different selections, so that 1 GB of RAM suddenly becomes 100 GB of RAM, and your server needs significantly MORE memory than your source data required to provide reasonable performance. We might say then that your 32 GB server can only handle maybe 800 MB of source data.

Best Case: In another application, we're reading in data from multi-purpose tables shared by multiple systems, where no effort has been made to optimize the space required. We probably only read half of the rows and a tenth of the columns, and then compress even further when building our own internal data structure in QlikView. For that application, you could say that we're reading in about 32 GB of uncompressed data, but the final application takes only 60 MB of RAM. I happen to know it takes significantly more than that during the load, but there would probably be optimizations you could do if RAM were critical. In our case, it wasn't an issue at all, and we burned as much RAM during the load as we wanted while optimizing for load SPEED. So let's assume you could code the load much more efficiently, perhaps using only those same 60 MB of RAM. And let's say you only have one user at a time, or just one user period, such as if you are just throwing the charts up on monitors and not allowing any interaction. In that case, your 32 GB server would in theory handle over 16 TB of data.
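The two scenarios above reduce to simple arithmetic, and it may help to see the spread computed explicitly. A back-of-envelope sketch (the function names and the per-user fraction are my own; a fraction of 1.0 means each session pays for a full copy of the application, which is deliberately pessimistic):

```python
def server_ram_gb(app_gb, users, per_user_fraction=1.0):
    # One full copy of the application, plus a share for each extra session.
    return app_gb * (1 + per_user_fraction * (users - 1))

def max_source_gb(server_gb, app_per_source_ratio, users, per_user_fraction=1.0):
    # Invert the sizing: how much raw source data fits on this server?
    sessions = 1 + per_user_fraction * (users - 1)
    return server_gb / (sessions * app_per_source_ratio)

# Worst case: 2.5 GB of source compresses to a 1 GB app (ratio 0.4).
print(server_ram_gb(1.0, 100))          # 100 GB for 100 sessions
print(max_source_gb(32, 0.4, 100))      # ~0.8 GB, the "800 MB" figure

# Best case: 32 GB of source compresses to a 60 MB app, single user.
print(max_source_gb(32, 0.06 / 32, 1))  # ~17,000 GB, i.e. over 16 TB
```

In practice QlikView Server shares the base document between sessions, so the real per-user cost should be only a fraction of the full application; the point here is just how wide the range is.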

With a spread that covers many orders of magnitude, I frankly have no way to answer your question. It will depend entirely on your data, data structures, how you set things up in QlikView, and so on. My rule of thumb might be, "have enough server RAM to handle your source data", which is to say have a server with terabytes of RAM. That should handle most scenarios, but will of course be vast overkill for others.

You may be able to get a much better answer from someone who's actually had to deal with terabytes of data, of course, but I hate seeing questions left completely unanswered.

"A second (optional) layer includes merging the raw QVDs from the 1st layer to create intermediate QVD structures ready to be used by end user QVW applications."

I believe I understand it, as we discussed (and ultimately rejected) doing something similar in our early days of adopting QlikView. We discussed having a first layer of QVDs that was almost just a database dump, and then a second layer of reporting-format QVDs that used the first layer as input.

The intent is to avoid loading the same data from your DBMS more than once, as that could be considered the least efficient part of this process. Also, since your business systems may well be running from those same databases, you want to tax the business systems as little as possible. So the "dump once, THEN figure out what to do with it" approach makes some sense.

But on deeper examination, it simply didn't make sense in OUR environment. Our databases were already indexed in such a way that all of the "second layer" QVDs could be built directly from the database with very little wasted time. Technically, a straight one-time dump would be even MORE efficient, but the difference for us simply wasn't worth the extra complication.

This is one of those things that each shop would need to look at and decide for themselves. But it's also something that isn't a HUGE deal if you change your mind later. For us, we'd just write some additional straight database dumps, and convert all our current QVD-creator applications to use those first-layer QVDs. We have enough of them after years of development that it would be a fair amount of work, but I believe it would be very straightforward, not something frighteningly complex. It would probably be even easier to go in the opposite direction, slowly converting your second-layer QVD-creators from reading the first layer to reading the databases directly. Eventually you drop the first layer.

Terabytes Scalability

Well as always "it depends". I'll just add a bit to John's comments and suggest a tool.

I want to echo John's comment about paging. I don't have any formal studies, but my experience has been that QV apps -- both UI and LOAD -- do not tolerate paging well. You want to have enough RAM to avoid paging.

There are techniques that you can apply to minimize the RAM footprint of an application. The "Qlikview Optimizer", available on some of the QVS installs as well as in the "Share Qlikviews" section of this site, does a pretty accurate job of estimating RAM requirements of a document, especially as it relates to concurrent users.
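That per-user estimate can also be turned around: given a fixed server and a known application footprint, roughly how many concurrent sessions fit before paging begins? A sketch, assuming each extra session costs a fixed fraction of the app's RAM (the 10% default is a commonly quoted ballpark, not an official formula, and something you should measure rather than assume):

```python
import math

def max_concurrent_users(server_gb, app_gb, per_user_fraction=0.10):
    # First session pays the full application footprint; each
    # additional session adds per_user_fraction of it.
    if server_gb < app_gb:
        return 0
    extra = (server_gb - app_gb) / (app_gb * per_user_fraction)
    return 1 + math.floor(extra + 1e-9)

print(max_concurrent_users(32, 1))  # a 1 GB app on a 32 GB box: 311 sessions
```

Session behavior obviously varies by application and user mix; the fraction is exactly the kind of number a tool like the Optimizer can help you pin down.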

I'll also suggest a couple of blog posts (yes, mine) that deal with RAM tuning:

Terabytes Scalability

John/Rob,

Thanks.

Yes, I understand the data model and optimization will vary the outcome a lot.

I actually have some test data ready to try it out, except red-tape is barring me from concluding that hands-on study (can't find a suitable server to import the Oracle data dump from our transactional DB at the moment).

Good news is the local rep finally produced additional tech notes on scalability, performance tuning and HW selection. Will look into that and may report back if I find something interesting.
