November 11, 2013

FastStart is a tiny bundle consisting of an empty QlikView application with a couple of useful subroutines, an empty configuration .xlsx file, and portable QViewer for inspecting resident tables. I made it for myself for cases when I need to quickly create a new app from scratch, but you're welcome to download and use it. It is also well suited for rapid debugging and for dealing with large datasets.

Added Partial Load tab -- calls LoadVars so that variables can be updated without a full reload (simply press Ctrl + Shift + R). It can also hold additional script commands for execution during partial reload, which is convenient for modifying the data model;

Default tab Main renamed to Variables -- intended for setting up variables and runtime parameters;

Added Main tab -- a blank tab where data manipulation starts.

Two additional folders:

Config -- contains config.xlsx

Tools -- contains QViewer.exe

I intentionally tried to keep it minimalist and not overload it with pre-built stuff like a UI layout framework, calendar scripts, predefined color variables, etc. However, an external configuration file for storing variables and a subroutine for inspecting resident tables are two features I use in almost every application.
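For illustration, here is a minimal sketch of what such a LoadVars subroutine could look like, assuming config.xlsx holds Name and Value columns (the actual implementation in the bundle may differ):

```
SUB LoadVars
    // Read name/value pairs from the external config file
    Config:
    LOAD Name, Value
    FROM [Config\config.xlsx] (ooxml, embedded labels);

    // Turn each row into a QlikView variable
    FOR i = 0 TO NoOfRows('Config') - 1
        LET vName  = Peek('Name',  $(i), 'Config');
        LET vValue = Peek('Value', $(i), 'Config');
        LET $(vName) = '$(vValue)';
    NEXT
    DROP TABLE Config;
END SUB
```

Because the subroutine is called from the Partial Load tab, pressing Ctrl + Shift + R re-reads the spreadsheet and refreshes all variables without touching the data model.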

August 19, 2013

Recently, I had to develop an application with a significant amount of data -- starting with about 100 million records, with a future increase up to 400 million. It was an interesting experience, and below are tips and findings that might be of use to those who have to deal with large amounts of data in QlikView.

Tip #0 -- go to Document Properties and turn off compression. It will make saving and loading the application much faster. Yes, it takes more disk space, but you're not going to send such a large application by mail anyway.

Mitigate long reload times
Once you've got tens of millions of records, you probably don't want to perform a full reload after every small change in the script. Some people use the built-in debugger for limited runs, but you can get better control over limited reloads using a variable and the FIRST prefix for LOAD statements that operate on large tables. Example:

LET vFirst = 1000;//00000000;

FIRST $(vFirst)
LOAD ...

In this case you can do quick short runs with a limited set of data, either to check the correctness of script syntax after small changes or to verify transformation logic. For a full load, remove the semicolon and the double slash, and make sure that you have enough zeroes.

You will probably also want to reduce even full reload times as much as possible. In some cases, if the data model allows logical partitioning of data, it's better to switch to partial reload as the main reload mode for updating the application with new information. In this case the data model isn't dropped entirely, so you can append new chunks of data to it instead of building everything from scratch. Full reload can be left for emergency cases, when the entire data model has to be rebuilt.

This technique is especially useful when you work with monthly (weekly, daily) snapshots and need to join them (you might want to do this for denormalization, a way of improving performance discussed below). In this case, instead of joining two full tables, you can join monthly snapshots first and then concatenate the results month by month, using partial reloads (see picture below).

To make it more manageable, it's better to store monthly snapshots in individual QVD files -- one QVD per month.
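Assuming one QVD per month, the script can be sketched like this (file, table and variable names here are hypothetical; the ONLY keyword restricts a statement to partial reloads):

```
// Full reload: rebuild the fact table from all monthly QVDs
Facts:
LOAD * FROM [Data\Facts_*.qvd] (qvd);

// Partial reload (Ctrl + Shift + R): the statement below runs only
// during partial reloads and appends just the newest month,
// leaving the rest of the data model intact.
Facts:
ADD ONLY LOAD * FROM [Data\Facts_$(vNewMonth).qvd] (qvd);
```

This way the expensive full rebuild happens only when the whole model has to be reconstructed, while routine updates touch a single monthly QVD.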

May 12, 2013

I'm working on a concept of a new kind of ETL tool and I'm looking for one or two teammates (maybe future co-founders) to join our efforts and make it happen.

Background
The concept is a new data transformation tool intended for use by non-technical people. We observe the rise of Business Discovery tools like QlikView, Tableau, etc. that further simplify data analysis and visualization for people without a technical background, and I think there is an area where a similar change is still waiting to happen -- business data manipulation and transformation. This task is traditionally "outsourced" to technical engineers who build complex ETL jobs by means of SQL scripting or ETL tools. However, in my opinion, this way of doing things no longer matches reality:

First, there is a growing need for non-technical (a.k.a. "business") users to manipulate data themselves. Nowadays they are increasingly inclined to browse data and perform ad hoc analysis rather than deal with static reports. However, not being able to prepare or modify a data set for analysis brings them a lot of inconvenience and limitations. As a result, we have a situation where those who prepare data don't analyze it, and those who analyze it can't prepare it. I believe this is a major factor that inhibits the effectiveness of Business Intelligence in general.

Second, the uber-popular Excel, which partially solves the problem of self-service data manipulation, is becoming more and more an obstacle rather than a solution, for two reasons:

a) Typical data volumes have gone beyond Excel's capabilities and continue growing exponentially. Nowadays an average analyzed data set is about 10-30 million records and 200-500GB in size. That's too much for desktop Excel.

b) Spreadsheets as a technology don't play well with relational data sources (e.g. adding a new value to a dimension can break all references in a spreadsheet; writing back from Excel to a database is a hassle). At the same time, relational databases are the main source of data for BI tools and will remain so for decades.

Concept
In my opinion, any attempts to make traditional ETL/ELT more user friendly or stuff Excel with even more features will be ineffective. I believe data transformation needs to be re-thought and re-designed from scratch in order to make it appropriate for use by non-technical people.

Here is how I envision it:

It's visual. Who said you can't join two tables by dragging and dropping one onto another? Left/right join? Not a problem. This can be done without knowing SQL at all. Another example -- grab a field header in a table and shake it with the mouse to obtain a list of the unique values of that field (which is actually a new table). Table concatenation, filtering, grouping -- all of this can be done visually.

It's instant. What you see is what you get, and you get it instantly. The result of manipulations and transformations is displayed immediately.

It's table-based. A cell is just a particular case of a table, and so is a vector. Why not treat them as tables? We don't really need a spreadsheet full of empty cells. It should be a free-form layout consisting of connected tables (connection = transformation). Click on empty space to create a new cell and start editing it.

It's functional/declarative. New columns and tables are the result of functions applied to tables, fields or specific cell ranges. A transformation is actually a sequence of functions where the resulting table of one function serves as the input table of another. Business users understand the concept of Excel-like functions, but don't understand SQL or imperative programming (and never will).

It's semantic. Name fields meaningfully. Fields with the same name are expected to contain the same kind of information (e.g. Product Description). Refer to fields by name in functions and expressions (e.g. =Profit/Revenue). Make your own re-usable functions.

It's smart. Join by fields with the same names. Omit the table name when referencing a field with a unique name, etc.

Since business users very rarely operate with more than 50 million records, the tool is not expected to have ultra-high performance or perform sophisticated transformations. Instead, the main accent is on polished usability and simplicity of the UI, which would allow business users to perform data manipulations themselves, without knowing SQL or asking data engineers for help. I believe this is achievable.

From a technical standpoint, this is an in-memory data processing engine built either as a WinForms .NET/C# application or as a web application on any other stack (I have experience with .NET/C#, RoR, Javascript and am now playing with Go). At some point JIT compilation of expressions will be required, so the stack should allow doing this.

Project

At this point the project is pure R&D, whose goal is to find the optimal product and UI concept and build a simple working prototype. There is no ready specification -- we will have to take pen and paper and think out, discuss and try various approaches until we find something we are strongly convinced about. There is a non-zero chance that this will never happen.

The intended result is a working prototype that allows filtering, joins, concatenations and basic arithmetic operations on fields. Then there are two options for how it can evolve further -- either build a commercial product and bootstrap it as a side project, or look for funding, turn it into a company and then start building a commercial product.

Who I'm looking for

I'm looking for software development veterans who love Business Intelligence, understand its purpose and challenges, and are capable of resolving complex technical tasks which might involve extensive parallel calculations, concurrent execution and JIT compiling. Since my programming style is rather pedantic, DRY and, to some extent, perfectionist, I'm looking for people with the same attitude and a similar coding style. It should be mentioned that my main programming experience was 15 years ago, and now I'm restoring my coding skills with QViewer and some web projects. I expect you to be a more experienced programmer than me; however, that doesn't mean I expect someone to do all the coding -- we will do the programming together, with equal effort.

If this project looks interesting to you and you think you have appropriate software development experience, give me a shout and tell me a few words about yourself. My email is in the upper right block of this blog. If you know someone who might be willing to take part in this project, please don't hesitate to share the link to this blog post with them.

UPDATE (Feb 2015)
The concept described above has been made real as EasyMorph. See http://easymorph.com for more details.

April 8, 2013

Direct Discovery is one of the recent and long-awaited features of QlikView, available starting from version 11.2. Direct Discovery allows mixing data from external data sources and from QlikView's in-memory data engine using the same QlikView scripting language and expressions. The aim of Direct Discovery is to go beyond RAM limitations and enable real-time querying against large volumes of data outside of QlikView, while presenting the same associative experience as for regular in-memory data. Here are two good documents that describe it in more detail: QlikView Direct Discovery Datasheet and QlikView Direct Discovery FAQ.

From a development standpoint, Direct Discovery can be employed in two ways: using the DIRECT SELECT statement in the loading script, and using the SQL( ) function in expressions.
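As a sketch of the script side only -- the table and field names below are made up, and the exact syntax changed between 11.2 service releases (later ones use DIRECT QUERY instead of DIRECT SELECT):

```
// Fields listed in DIRECT SELECT are not loaded into memory;
// QlikView queries the source database for them on demand,
// while their distinct values still participate in selections.
DIRECT SELECT OrderID, CustomerID, Amount
FROM DW.dbo.OrderFacts;
```

Chart expressions over such fields are translated into aggregated SQL queries against the source database at selection time.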

March 17, 2013

Here is a brief overview of features recently added to EasyQlik QViewer -- my favorite QVD viewer :) If you're subscribed to the QViewer newsletter then you've already heard about some of them, except the most recent.

Better Lineage

Since QlikView 11, all QVD files generated by version 11 apps contain SQL statements from the loading script. This is quite handy when you need to recall which fields from a database were used as the source for a QVD. A QVD doesn't contain full snippets of the script, so it's not possible to see the entire transformation logic; however, it is handy when you only need to remind yourself what it was. QVDs also contain connection strings, so you can double-check whether the source database and schema are correct. Previously, the Metadata window in QViewer showed only one line per Discriminator/Statement, which wasn't convenient for viewing SQL statements. Therefore, starting from version 1.4.4 there is a new additional multi-line window for viewing Discriminator/Statement records from QVD headers. See the screenshot below (clickable).

January 19, 2013

There are a number of ways to establish bi-directional integration between QlikView and external systems. Some of them employ web services. In this case a web service is a helper application which sits between QlikView and an external application (or is part of that external application) and communicates using web protocols. Web services can be written in any popular programming language (PHP, Java, C#, Python, Ruby, Go, etc.) and hosted using popular web servers like Apache, IIS or nginx, or any other of your choice.

About Dmitry Gudkov

I used to be a BI consultant for 12+ years. In 2013 I switched to developing BI/ETL software and created EasyMorph — a new kind of data transformation and analysis tool. I'm also the author of QViewer — a popular standalone QVD file viewer.