Synthetic keys have a bad reputation. The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure. If you load tables like this:

TableA: FieldA, FieldB, FieldC

TableB: FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA: SyntheticKey, FieldC

TableB: SyntheticKey, FieldD

SyntheticKeyTable: SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.
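That factoring is easy to sketch outside QlikView. The following Python is a hypothetical illustration of the data structure described above (plain Python, not QlikView internals; the field names come from the example):

```python
# Hypothetical sketch (plain Python, not QlikView internals) of the
# factoring described above: two tables sharing FieldA and FieldB get a
# surrogate key, and the distinct combinations move to a key table.

def build_key_table(table_a, table_b, shared=("FieldA", "FieldB")):
    key_ids = {}    # (FieldA, FieldB) combination -> synthetic key id
    key_table = []  # rows of (SyntheticKey, FieldA, FieldB)

    def keyed(rows):
        out = []
        for row in rows:
            combo = tuple(row[f] for f in shared)
            if combo not in key_ids:
                key_ids[combo] = len(key_ids)
                key_table.append((key_ids[combo],) + combo)
            rest = {f: v for f, v in row.items() if f not in shared}
            out.append({"SyntheticKey": key_ids[combo], **rest})
        return out

    return keyed(table_a), keyed(table_b), key_table

table_a = [{"FieldA": 1, "FieldB": "x", "FieldC": 10},
           {"FieldA": 2, "FieldB": "y", "FieldC": 20}]
table_b = [{"FieldA": 1, "FieldB": "x", "FieldD": 99}]

a2, b2, key_table = build_key_table(table_a, table_b)
print(a2)         # [{'SyntheticKey': 0, 'FieldC': 10}, {'SyntheticKey': 1, 'FieldC': 20}]
print(b2)         # [{'SyntheticKey': 0, 'FieldD': 99}]
print(key_table)  # [(0, 1, 'x'), (1, 2, 'y')]
```

Nothing exotic: the shared fields are pulled out once, and both data tables point at the combinations through a single surrogate key.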

Well, that doesn’t sound like a bad thing. If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually. As far as I can tell, QlikView is simply saving me the trouble.

Two tables can be connected by one field. That shows up as a connection. Two tables can be connected by two or more fields. That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field. If so, that’s a data model problem that should be fixed. And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields. If so, that’s a data model problem that should be fixed. And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem. Either can cause problems when they result from a bad data model, when there are too many or the wrong connections. But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables. If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

· Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."

Yikes! Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”. No WONDER everyone tries to remove them!

But I suspect that the reference manual is just poorly written. I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys. Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys. But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem. I’ve never seen a good data model that resulted in a lot of synthetic keys. Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include: Forming your own non-composite keys”. While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA. And THAT is what I’m particularly interested in discussing.

My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage. I'm suggesting that if you have a synthetic key, and you do a direct one-to-one replacement with your own composite key table, you will not improve performance or memory usage. In fact, I believe performance and memory usage will typically get marginally worse.
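The one-to-one nature of that replacement is the heart of the argument, and can be sketched in Python (a hypothetical illustration; in script the usual convention would be something like FieldA & '|' & FieldB):

```python
# Hypothetical sketch: a manual concatenated key is a straight function of
# the shared fields, so it carries exactly the same information as the
# combination it replaces -- which is why a direct swap shouldn't change
# performance much in either direction.

def concat_key(row, fields, sep="|"):
    # The script equivalent would be something like FieldA & '|' & FieldB.
    return sep.join(str(row[f]) for f in fields)

rows = [{"FieldA": 1, "FieldB": "x"}, {"FieldA": 2, "FieldB": "y"}]
keys = [concat_key(r, ("FieldA", "FieldB")) for r in rows]
print(keys)  # ['1|x', '2|y']

# One distinct key per distinct field combination (barring separator
# collisions), i.e. the same distinct values a synthetic key would hold.
assert len(set(keys)) == len({(r["FieldA"], r["FieldB"]) for r in rows})
```

If the manual key and the synthetic key hold the same distinct combinations, the only differences left are the cost of building the string key in script and the extra bytes it occupies.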

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved. I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key. I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

                              load time   file KB   RAM KB   CalcTime
synthetic key removed         14:01       49,507    77,248   46,000 ms
synthetic key left in place    3:27       49,401    77,160   44,797 ms

What I would love to see are COUNTEREXAMPLES to what I’m suggesting. I’m happy to learn something here. I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking. The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves. Maybe I just have a bad memory, and failed to find good examples when I searched. But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage? Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

I just ran a test on a couple of files, and ended up with nearly-identical results for both the synthetic key and the manually-created key. The files I tested on included both one with a synthetic key (which I removed for testing) and one with manually-created keys.

Personally, I have no problem whatsoever with synthetic keys - in my own work, I don't remove them unless I feel there's a good reason to. I think a lot of their bad press stems from the fact that synthetic keys are often created (often excessively) when you make a mistake in the data model. This makes people shy away from allowing them to exist. This can get particularly bad in some situations - for example, when you run 'Load *, 1 as myField resident myTable;', you'll get a whole lot of ugly synthetic keys as QV tries to hash out every possible combination of fields.

As a rule of thumb, I teach other developers to create manual keys when a synthetic key would have four or more columns in it. I've never tested if this actually improves performance, but anything over three columns (four synthetic keys, iirc - AB, AC, BC, ABC) tends to make the data model look messy, so I try to avoid those.
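The combinatorics behind that rule of thumb can be sketched (hypothetical Python; the worst case assumes key fields are scattered so that every shared subset of two or more fields ends up as its own key):

```python
# Sketch of the worst case alluded to above: if key fields are scattered
# so that every shared subset of two or more fields becomes its own key,
# k overlapping fields can yield 2^k - k - 1 composite keys.
from itertools import combinations

def possible_synthetic_keys(fields):
    subsets = []
    for r in range(2, len(fields) + 1):
        subsets.extend(combinations(fields, r))
    return subsets

print(possible_synthetic_keys("ABC"))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')] -- the AB, AC, BC, ABC case

print(len(possible_synthetic_keys("ABCDEFGHIJ")))  # 1013 for ten fields
```

The exponential growth is why a modeling mistake that spreads key fields across many tables can blow up during the load, while a single clean composite link stays cheap.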

I would certainly be interested in seeing results - primarily for application performance, which I think is far more interesting than load times and file size - that would give me a better idea of when (if ever) to eliminate synthetic keys.

You know, I don't think I've ever had a synthetic key on more than two fields, and I haven't tested it either. Just fiddling with it, three matching fields did what I expected - created a single synthetic key, not a synthetic key for each possible combination of the three fields. Same for ten matching fields. I'm going to set up a test with a large number of matching fields and see if it performs like I would expect (about the same but marginally better with the synthetic key).

I agree that script speed tends to be much less important than chart speed, though I try to optimize my script speed too. Many applications are more useful the fresher the data is, and if all your scripts run twice as fast, you may be able to load them twice as often. But no one script is that critical, and I suspect that most synthetic keys can be replaced for much less script overhead than the one case I tested.

I agree that much of the bad press originates from data model mistakes. You make a mistake, it generates one or more synthetic keys, takes forever to load, memory usage shoots through the roof, your charts stop working, and you have a complete mess. You fix the mistake, the synthetic key goes away, and so do all the problems. It's easy to conclude (incorrectly, I believe) that the synthetic key was the cause. But the synthetic key in this case was probably just another symptom of the underlying data model mistake. And sometimes, the mere process of removing the synthetic key can fix your data model mistake. If your experience is then that removing synthetic keys fixes your performance problems, it's easy to believe the synthetic keys are to blame. But I think this is usually or always mere coincidence - while removing the synthetic key, you ALSO fixed your data model.

John Witherspoon wrote: But the synthetic key in this case was probably just another symptom of the underlying data model mistake. And sometimes, the mere process of removing the synthetic key can fix your data model mistake.

Agree. I understand that when a syn key appears, something is going wrong with the model, as I didn't make the syn key (nor anything related) on purpose. I take them more as a warning than a problem.

Thanks for sharing your experience on QlikView. But I personally believe that people may not understand this just by reading it; they will only understand it with experience, when they work with synthetic keys themselves. What I can see is that you have amazing knowledge of QlikView, and that comes not just from reading books but from your experience.

I faced serious problems because of synthetic keys. So I always prefer to avoid synthetic keys in my data model.

I also feel that QlikView should be in a position to give a score for the efficiency of the load script we are using; only then will we be in a position to understand whether these synthetic keys are really a problem or not. If QlikView can't say anything about this, then there is no other way to find out other than your experience.

Rikab Kothari wrote: ...I personally believe that people may not understand this just by reading it; they will only understand it with experience, when they work with synthetic keys.... I faced serious problems because of synthetic keys. So I always prefer to avoid synthetic keys in my data model.... What do you say?

What I suspect is that you have not actually faced serious problems because of synthetic keys. Instead, you have probably faced serious problems because of poor data modeling. Synthetic keys often show up when you model your data poorly. But the synthetic keys do not CAUSE your poor data model, any more than coughing makes you sick. And removing the synthetic keys may not fix your data model, any more than taking cough syrup will make you healthy. I believe you are confusing correlation with causation, or confusing cause and effect.

To fix your data model, you have to understand what is wrong with your data model. Synthetic keys, when you didn't expect them, are certainly a clue that may help you to understand what is wrong. But they are not the actual problem.

When less experienced people encounter synthetic keys, they should NOT ignore them. I wasn't trying to suggest that. I would suggest that they should try to UNDERSTAND them. Why does your data model end up with a synthetic key? Why DO you have the same fields on more than one table? Does it make sense to do that? Does the synthetic key make sense? Does your data model make sense? If so, I'm not aware of any reason to remove the synthetic key. If not, then fix the data model itself. Don't think of the synthetic key as the problem to be solved, or you may simply remove the synthetic key in a way that doesn't fix your data model, such as doing an exact replacement with a concatenated key. That probably wouldn't help.

Rikab Kothari wrote: I also feel that QlikView should be in a position to give a score for the efficiency of the load script we are using; only then will we be in a position to understand whether these synthetic keys are really a problem or not. If QlikView can't say anything about this, then there is no other way to find out other than your experience.

QlikView DOES give you a score - the run time. If you change the script, and it runs twice as fast, your script is now twice as efficient. I think that's about the only score QlikView COULD give you. It sounds like you want QlikView to read your script, somehow consider how else you might have written it more efficiently, and then score your code based on a comparison to what it thinks would be the most efficient way to do it. If that program existed, nobody would need you as a programmer. QlikView would write itself. But perhaps I'm misunderstanding your request.

And understanding whether synthetic keys are a problem or not isn't a matter of understanding your script anyway. It's a matter of understanding your data model. MORE than that, it's a matter of understanding what your data model SHOULD BE. Just as with evaluating script, how is a computer program supposed to understand what your data model should be? All it can possibly understand is what it is. And if what it is has a synthetic key in it, well, then that's just how it should be from the computer's standpoint. Only YOU can decide if your data model is correct or not.

So I'll agree with your conclusion, "then there is no other way to find out [whether these synthetic keys are really a problem or not] other than your experience." Correct. Only experience with data modeling will tell you if your data model is correct. No computer can tell you. Only experience can tell you.

As John pointed out, the creation of a synthetic key with a synthetic key table is very much the traditional response to the problem of multiple key fields between tables.

But QlikView is not a traditional database (and Mr Codd would certainly have objected to hearing it referred to in the same sentence). The way we should think and design with QlikView is very different, and it takes a mind shift to, for example, not get hung up about duplication of values.

The truth is we have to take a leap of faith with QlikTech and specifically the documentation. We can try to perform 'tests', but they often produce inconclusive results and may be invalid under different circumstances.

My pet gripe about the product is QlikTech's absolute reluctance to discuss anything related to performance and optimisation, and here is another example where they are the only ones who can resolve the question.

So, I would say if you are happy with the time it takes for the script to run and the performance of the document in use, then let simple synthetic keys be, move on, and create something new that adds real value to your business.

Have you found QlikTech to be reluctant to discuss performance and optimization?

Other than simple bug reports, I've only really dealt with QlikTech on two technical issues, and in both cases they were very forthcoming.

The first was the issue of mathematical imprecision, where they confirmed for me that they were using a binary floating point to store numbers, which is why some of the numeric functions have "unpredictable" and incorrect results. (Edit: I should probably add "occasionally" to that. Most of the time, numbers behave as you would expect in QlikView. I do accounting reports in QlikView, and have not implemented known workarounds. To me, it isn't a big enough problem to worry about. See http://community.qlik.com/wikis/qlikview-wiki/explanation-of-mathematical-errors-in-qlikview.aspx for information and workarounds.)

The second was the issue of set analysis performance vs. other approaches, which resulted in an open (if very high level) discussion of how set analysis actually behaves internally, and what we should be able to conclude about performance from that behavior. That also pointed out that set analysis in an early release of version 9 was not performing as expected, leading them to do some additional fixes and optimization.

This case seems similar in a way. I THINK I understand how synthetic keys behave and how they are stored. Well, I think I'm pretty solid on one level, and a little sketchy on another. On one level, where we can actually talk about tables and the like, I think a synthetic key is just another table, like the one I showed near the top of the original post. This is strongly suggested by the "Internal Table View" in the table viewer. For example, for a 10-field synthetic key, I see this:

I believe the only significant differences between this and a real table is that it is built automatically, and that you cannot refer to the table by name or to the synthetic key by name. The table name and synthetic key name are purely internal values, used only by the product itself.

The level where I'm a little or even a lot sketchy is that I'm just generally sketchy on QlikView's internal storage format for tables. I don't think it's storing them in anything approaching the relational model that it displays when you look at the "Internal Table View". It appears to directly store lists of (or at least lists of pointers to) the distinct values of each field, the evidence for which is how much faster LOAD fieldvalue('Field',iterno()) can generate a list of values than LOAD DISTINCT Field on large tables.

"QlikView data is compressed via what QlikTech calls a "symbol table," but I generally call "dictionary" or "token" compression.

QlikView typically gets at its data via scans. There is very little in the way of precomputed aggregates, indexes, and the like. Of course, if the selection happens to be in line with the order in which the records are sorted, you can get great selectivity in a scan.

One advantage of doing token compression is that all the fields in a column wind up being the same length. Thus, QlikView holds its data in nice arrays, so the addresses of individual rows can often be easily calculated.

To get its UI flexibility, QlikView implicitly assumes a star/snowflake schema. That is, there should be no more and no less than one possible join path between any pair of tables. In some cases, this means one will want to rename fields as part of QlikView load scripts."

While I'm not sure how much that tells us about synthetic keys, I see nothing there that would lead me to believe that synthetic keys and the synthetic key table are stored any differently than a regular key and a regular table. There are obviously minor differences, such as perhaps the synthetic key being flagged as such. But it certainly seems like it is mostly the same, both from thinking theoretically about how they might store the data, and from checking the actual performance differences (very minor once built) between a concatenated key table and a synthetic key table. I'm guessing that they aren't just nearly identical in the "Internal Table View", but that they are also nearly identical as internally stored and compressed by QlikView.
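The "symbol table" compression described in the quoted passage can be sketched generically (hypothetical Python illustrating dictionary/token compression in general, not QlikView's actual storage format):

```python
# Hypothetical sketch of dictionary/token ("symbol table") compression in
# general -- not QlikView's actual storage format: distinct values are
# stored once per field, and columns become fixed-width integer tokens.
from array import array

class Field:
    def __init__(self):
        self.symbols = []   # distinct values, stored once
        self._index = {}    # value -> token

    def token(self, value):
        if value not in self._index:
            self._index[value] = len(self.symbols)
            self.symbols.append(value)
        return self._index[value]

country = Field()
column = array("I", (country.token(v) for v in
                     ["US", "DE", "US", "US", "DE", "FR"]))

print(list(column))     # [0, 1, 0, 0, 1, 2]
print(country.symbols)  # ['US', 'DE', 'FR']

# Listing the distinct values is a walk over the small symbol table, not a
# scan of the (possibly huge) column -- consistent with the observation
# that fieldvalue('Field', iterno()) beats LOAD DISTINCT on large tables.
```

Under a scheme like this, a key field is just another tokenized column, whether the distinct combinations behind it were built automatically or concatenated by hand in the script.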

That IS something I'd love to have confirmed by a techie from QlikTech. If so, that would be fairly solid confirmation that MERELY replacing a synthetic key with a concatenated key is useless, because it would be stored almost exactly the same internally. You would have accomplished nothing, essentially.

Is there any techie from QlikTech reading this and able to confirm how synthetic keys and synthetic key tables are stored internally?

Just a quick note regarding my earlier post - I double checked, and the complexity in synthetic keys was related to the fact that key columns appeared in multiple tables, each time within a different key context. I'm sure if I had gone through and only eliminated the columns that resulted in data model issues / circular references, I'd have been left with a reasonable number of synthetic keys - but of course, when you're eliminating them anyway, it's often easier to just eliminate the whole bunch instead of trying to figure out which ones you can keep.

Looking forward to seeing what QV's technical folks have to say about this.. :)

Thanks John - as a beginner it's important to hear these kinds of discussions. During my training I was definitely taught that synthetic keys should never exist and should indeed be replaced by user created keys. That never made huge sense to me.

There is one area where synthetic keys do make life more difficult - when using the table viewer, you can only see the data that makes up the synthetic key in the synthetic key table itself; other tables that include SynX data do not display the synthetic key value, just the other fields in the table.

This can make it a lot more difficult to determine what is going on in the associations between the data when a report does something unexpected.

Colin_Albert wrote: There is one area where synthetic keys do make life more difficult - when using the table viewer, you can only see the data that makes up the synthetic key in the synthetic key table itself; other tables that include SynX data do not display the synthetic key value, just the other fields in the table. This can make it a lot more difficult to determine what is going on in the associations between the data when a report does something unexpected.

The view IS different, but it's the synthetic key version that shows all the fields in each table, not the concatenated key version. Or have I misunderstood what you're saying?

With synthetic key:

With concatenated key:

Both views make equal sense to me, though I do prefer the less cluttered second view. It sounds like you prefer seeing all the fields. I think this is going to be a personal preference thing, not a case where one way is better than the other. After all, if QlikTech felt that the second view was more informative, they could easily display synthetic keys this way. They don't, so it appears that QlikTech feels it is more informative to display all of the fields in all three "tables".

With all due respect, this idea sounds like an interesting hypothesis, but since it goes against everything we've known and been taught before, I would suggest being careful about propagating and recommending it until we can hear an opinion from R&D. There is a lot of interesting stuff happening behind the scenes in QlikView, and all we can do is speculate and theorize.

Working with large data volumes, I've seen synthetic keys crashing servers, getting QlikView "hung", raising RAM requirements beyond anything imaginable - the whole nine yards of fun. With all this experience, I would never recommend to anyone to "love the synthetic keys".

On a similar note, I've seen an "analysis" performed in a lab with thousands of 1 and 0 values, allegedly proving that IF formulas are performing as well or better than calculations with flags or Set Analysis conditions. Later on, I've met several beginners confused by that analysis. The fact remains the same - IF formulas are killing performance on large data sets, and synthetic keys can't be good on large data sets.

It's good to challenge basics and "rock the boat" once in a while, but we need to be careful when promoting a "maverick" hypothesis into a status of an expert recommendation.

This is, of course, just my personal opinion. It would be extremely interesting to hear from R&D on this issue.

Oleg Troyansky wrote: we need to be careful when promoting a "maverick" hypothesis into a status of an expert recommendation.

It's true. And I almost certainly came off too strong in my initial post.

For now at least, I'm mostly asking a question and making a hypothesis. If I'm making any "expert recommendation" at present, it would be to CONSIDER leaving synthetic keys in as ONE of your options that you test. I feel like I've seen sufficient evidence to at least consider that option, even if I've not seen sufficient evidence to support a much broader hypothesis like "synthetic keys are never the cause of a problem, even if they are coincident with a problem".

Thinking further, and I suppose this just echoes much of what I've said above, I have two main reasons for believing what I believe.

For a synthetic key to cause problems where a manually-concatenated key does not, they would have to be stored significantly differently internally. And I can think of no reason for such a difference. If they ARE stored differently, and this difference is causing performance and memory problems, WHY continue storing them differently? If we as users can trivially fix a "synthetic key problem" by replacing it with the equivalent manually-concatenated key, why wouldn't QlikView do the same thing internally?

Admittedly overly-simplistic testing of overly-simplistic data models shows extremely similar performance and memory usage between synthetic keys and manually-concatenated keys. I have not yet seen a test showing problems with a synthetic key that did not have the same problems with the equivalent manually-concatenated key. I have not seen any real world problems in my own experience where synthetic keys caused any problems, only where they were coincident with data model problems that were the true cause of the problems.

I could certainly be wrong. Just because I can't think of a reason for a difference doesn't mean there isn't a perfectly good reason. The testing I've done has been very simplistic. There is strong anecdotal evidence suggesting that synthetic keys ARE a problem. Even the reference manual urges caution. But if I am wrong, it would be fascinating to me to know WHY I'm wrong.

You say "I've seen synthetic keys crashing servers, getting QlikView "hung", raising RAM requirements beyond anything imaginable". Can you provide a simple example of this? Or perhaps the synthetic key problem only arises when the examples are NOT simple?

Definitely waiting to hear from R&D. I'm hoping we get a much more thorough answer than "yes" or "no", or I'm afraid it won't be particularly informative.

I don't pretend to know all the answers, and I don't have time to perform a full-blown performance test. I'm mostly using my prior experience seeing QlikView using substantially more resources when it's forced to build a synthetic key, compared to us building composite keys as part of the script.

The speculation that I can offer about the possible cause for the difference is how I intuitively feel about it:

When we build the composite key, we give QlikView very specific instructions about HOW to build it - what tables, fields and values to use, etc... In contrast, when we leave multiple keys (sometimes in more than one pair of tables) for QlikView to "figure it out", QlikView has to determine - what synthetic keys need to be built? What tables are affected? What are all the possible permutations of the involved fields and their values? QlikView has to build the "game plan" before actually building the key, and I suspect, that algorithm is not a simple one. I can totally believe that, in attempt to cover all possible situations, the algorithm has to go through a lot of unnecessary iterations before settling on a certain solution. On the other hand, none of that is needed when we build the keys on our own - we tell QlikView what to load, and QlikView simply loads a few fields from a few tables. I can easily understand why it would be faster...
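That "game plan" step can be sketched in Python (speculative illustration only - the table and field names are made up, and this is certainly not QlikView's real algorithm):

```python
# Speculative sketch (not QlikView's real algorithm) of the "game plan"
# step described above: before building anything, the loader must work out
# which field sets are shared by which pairs of tables. Table and field
# names here are made up for illustration.
from itertools import combinations

def shared_field_sets(tables):
    """tables: name -> set of field names; returns the composite links."""
    links = {}
    for (na, fa), (nb, fb) in combinations(tables.items(), 2):
        common = fa & fb
        if len(common) >= 2:  # two or more shared fields -> synthetic key
            links[(na, nb)] = frozenset(common)
    return links

tables = {
    "Orders":   {"OrderID", "CustomerID", "Year", "Month"},
    "Budget":   {"CustomerID", "Year", "Month", "Amount"},
    "Forecast": {"Year", "Month", "Forecast"},
}
links = shared_field_sets(tables)
for pair, fields in sorted(links.items()):
    print(pair, sorted(fields))
# Three overlapping composite links that the planner must reconcile --
# the reconciliation, not any single key, is where the cost can grow.
```

Even this toy version shows the point: with three tables there are already three overlapping candidate links, and the real planner has to decide how they interact before it can build anything.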

In a completely metaphoric way, the difference is similar to the difference between performing "parallel parking" and writing a program that performs "parallel parking". Everyone can parallel-park, but very few people can embark on covering all possible situations in a computer program that could do it for you.

Obviously, all those speculations could be full of ... nothing, but the empirical evidence of many years of delivering QlikView applications tells me that we are much better off without synthetic keys than with them...

Oleg Troyansky wrote: The speculation that I can offer about the possible cause for the difference is how I intuitively feel about it: When we build the composite key, we give QlikView very specific instructions about HOW to build it - what tables, fields and values to use, etc... In contrast, when we leave multiple keys (sometimes in more than one pair of tables) for QlikView to "figure it out", QlikView has to determine - what synthetic keys need to be built? What tables are affected? What are all the possible permutations of the involved fields and their values? QlikView has to build the "game plan" before actually building the key, and I suspect, that algorithm is not a simple one. I can totally believe that, in attempt to cover all possible situations, the algorithm has to go through a lot of unnecessary iterations before settling on a certain solution. On the other hand, none of that is needed when we build the keys on our own - we tell QlikView what to load, and QlikView simply loads a few fields from a few tables. I can easily understand why it would be faster...

OK, what you say makes a lot of sense, and I haven't been thinking of it that way. It's also very consistent with my interpretation of the warnings in the reference manual:

"When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them."

And my interpretation:

"I don't think these warnings are about having synthetic keys; I think they're about having a LOT of synthetic keys."

I haven't encountered this situation in practice, except when I've had data model mistakes. But there are certainly good data models where you want multiple tables connected by multiple fields. If a lot of these fields are the same, perhaps it is ambiguous how the tables should be connected. QlikView does its best to try to figure out what you wanted, but it could get it wrong, and the synthetic keys that result could be all wrong and cause a serious problem. That seems VERY plausible from a theoretical standpoint, and again, is very consistent with the reference manual. In such a situation, it makes perfect sense to build your own concatenated keys, just like the reference manual recommended, so now the recommendation is making much more sense to me. Sometimes it is easier to just tell a computer what to do than to hope it figures it out.

So I'll now admit that the general form of my hypothesis, "synthetic keys aren't a problem if you have a good data model" isn't really true. Even if you only have two tables in your data model that share two fields, in a sense, your data model isn't good, because you haven't indicated how those two tables are supposed to connect. Now in a simple case like this, it's easy for QlikView to figure out, and I'll still hypothesize that the resulting synthetic key won't cause any problems. But in a fundamental sense, QlikView is completing your data model for you, so one of the basic assumptions in my hypothesis, "you have a good data model", is simply incorrect. In a fairly fundamental sense, if you haven't specified how the tables are to be connected, you do NOT have a good data model yet.

So I now see two risks associated with synthetic keys that weren't clear to me before. First, if the data model is ambiguous enough, there could be performance and memory problems as QlikView tries to guess what you probably wanted. Second, it might guess incorrectly, and the resulting data model could be simply wrong, and/or cause performance and memory problems.

I still suspect that in most practical cases, we're doing something simple like connecting two tables by two fields. In such a case, there is no ambiguity. QlikView should be able to get it correct without wasting time or memory, and the resulting data structure should be at least as efficient as if you'd modeled the connection manually with a concatenated key. I can't prove those "should"s, though. They're just educated guesses combined with overly-simplistic testing.

But regardless, any strong form of my hypothesis is likely simply incorrect. Synthetic keys probably CAN cause a problem when the data model is ambiguous. And to remove such ambiguity, you'll need to model the connections between tables yourself, manually. And that right there is the traditional advice. It at least makes more sense to me now, even if I still suspect you'd be better off leaving things alone in most cases.

Assuming we don't hear back from R&D, and when I can get some time, I should try to figure out how to make an ambiguous data model to try to create an actual synthetic key problem. People do it by accident all the time, including me, so in a sense it should be easy. The hard part, I think, is to do it with what is otherwise a correct data model, where only the connections are ambiguous.

John Witherspoon wrote: I'm going to set up a test with a large number of matching fields and see if it performs like I would expect (about the same but marginally better with the synthetic key).

OK, I made a trivial 10-field synthetic key. I created two million rows on two tables linked by this synthetic key. I put an X value in one table and a Y value in the other. I then did a scatter chart of sum(X) and sum(Y) by one of the fields in one of the tables. Then I created a concatenated key instead of the synthetic key for comparison.
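For reference, the synthetic-key version of that test can be reconstructed roughly like this; it mirrors the single-table script posted later in the thread, split into two tables, and the details may differ from the actual test file:

```qlikview
// Rough reconstruction, not the exact test file. Two tables share the
// ten key fields A..J, so QlikView links them with one 10-field
// synthetic key. X lives in one table, Y in the other.
TableX:
LOAD ceil(rand()*10) as A, ceil(rand()*10) as B, ceil(rand()*10) as C,
     ceil(rand()*10) as D, ceil(rand()*10) as E, ceil(rand()*10) as F,
     ceil(rand()*10) as G, ceil(rand()*10) as H, ceil(rand()*10) as I,
     ceil(rand()*10) as J,
     ceil(rand()*50000) as XID,   // assumed chart dimension
     ceil(rand()*1000) as X
AUTOGENERATE 2000000;

TableY:
LOAD ceil(rand()*10) as A, ceil(rand()*10) as B, ceil(rand()*10) as C,
     ceil(rand()*10) as D, ceil(rand()*10) as E, ceil(rand()*10) as F,
     ceil(rand()*10) as G, ceil(rand()*10) as H, ceil(rand()*10) as I,
     ceil(rand()*10) as J,
     ceil(rand()*1000) as Y
AUTOGENERATE 2000000;
```

The concatenated-key variant would replace the ten shared fields in both tables with a single combined key, plus a key table holding the individual fields.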

I'm not sure we can draw any conclusion about memory use and calculation time. I'd need a lot more data points given the closeness and the variability. But the extremely preliminary conclusion would be that the synthetic key version is better for memory use and calculation time, but insignificantly so. The main difference is load time, and while load time isn't nearly so important, why waste it for no improvement?

I'll go ahead and post my test file, though with only 2000 instead of 2000000 rows, in case anyone wants to repeat the testing or tell me how to concatenate my key more efficiently.

Edit: I should mention that this is a very poor data model, and is only being used to test synthetic key performance. If this were a real application with real fields, you should join everything into a single table. The synthetic key would go away, and the application would likely perform somewhat better. However, the synthetic key wasn't the problem (as the concatenated key performance demonstrates); the data model was the problem.

Edit2: I've gone ahead and fixed the data model for comparison purposes, simply loading everything into a single table. It is loading much, much faster, taking much less file size, less RAM, and calculating over twice as fast. What I'm hoping to show here is the difference between simply removing the synthetic key and actually fixing the data model. Either way, the synthetic key goes away. But if all you do is remove it, things get worse. Things only get better when you fix the underlying data model problem. In this example, anyway. And again, this isn't to say that synthetic keys MEAN you have a data model problem. Good data models can contain synthetic keys. This one, however, should not.

Edit3: I was asked to post the script for the "Data Model Fixed" version. It's pretty trivial:

LOAD ceil(rand()*10) as A,
     ceil(rand()*10) as B,
     ceil(rand()*10) as C,
     ceil(rand()*10) as D,
     ceil(rand()*10) as E,
     ceil(rand()*10) as F,
     ceil(rand()*10) as G,
     ceil(rand()*10) as H,
     ceil(rand()*10) as I,
     ceil(rand()*10) as J,
     ceil(rand()*50000) as XID,
     ceil(rand()*1000) as X,
     ceil(rand()*1000) as Y
AUTOGENERATE 2000000;

...you'll get a whole lot of ugly synthetic keys as QV tries to hash out every possible combination of fields...

Earlier in the thread, I couldn't think how to set up the situation properly, even though I'd certainly seen it happen before. The first attached file demonstrates it pretty clearly. We have five tables that share some key fields, but they each share different key fields. The result in this case is 40 separate synthetic keys.

Now, this is a bad data model. In this case, the good data model would just have a single table. There's nothing here to change my advice to not replace synthetic keys with composite keys in a good data model. But it does demonstrate the synthetic key explosion we sometimes encounter.

I believe most of the synthetic keys here are useless. I believe only five of them end up mattering, the five that link our five tables back to the key table that holds the actual values of the key fields. We can replace just those five keys with composite keys as a way of improving the data model. That's shown in the second attached file. But the data model still isn't right, so again, this isn't a case where we have a good data model and we're replacing synthetic keys 1:1 with composite keys.

Since Anthony seems to have confirmed that synthetic keys behave like composite keys internally, I think the more open question now has to do with script performance.

So far, the examples I've created or seen have loaded synthetic keys faster than creating the equivalent composite keys. However, it's possible that with some specific data model, it would take QlikView longer to figure out how to construct the synthetic keys than it would take us to just construct composite keys manually. After all, as Oleg said, "When we build the composite key, we give QlikView very specific instructions about HOW to build it..." In a sense, that should make QlikView's job easier, not harder.

The attached two files were my attempt to load several composite keys as efficiently as possible, thinking that perhaps the synthetic key equivalent would be slower. But the synthetic key load was still significantly faster - for a million rows, 66 seconds vs. 206 seconds.

Maybe someone knows a more efficient way to load composite keys? Or some situation that will make it much harder for QlikView to figure out and build the right synthetic keys?

I need to look into the results more closely, but we set up a simulation of a semi-realistic case:

- invoice header and invoice line items associated by a 4-part key
- data is stored in QVDs
- QVDs are loaded incrementally, with only a small fraction loaded per day

With those assumptions in place, I tested two alternatives. First, just load in the data and let the synthetic key form. Second, add a hash128() key to both tables during the incremental load (negligible overhead), and skip the 4 key fields when pulling the invoice line items from the QVD. (Yes, technically using a hash128() on separate incrementally-loaded files can result in data problems, but the chance is as close to 0 as makes no difference, so in practice, it should be fine.) The results appear to be a counterexample to what I've been stating in this thread.
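A sketch of that second alternative, with invented field names (the 4-part key here is assumed to be CompanyCode, InvoiceNo, InvoiceVersion, SourceSystem):

```qlikview
// During the incremental load, stamp both tables with a hash of the
// 4-part key, then omit the key fields from the line-item table so no
// synthetic key forms. hash128() collisions are theoretically possible
// across separately loaded files, but vanishingly unlikely in practice.
InvoiceHeader:
LOAD CompanyCode, InvoiceNo, InvoiceVersion, SourceSystem,
     Hash128(CompanyCode, InvoiceNo, InvoiceVersion, SourceSystem) as %InvoiceKey,
     InvoiceDate, CustomerID
FROM InvoiceHeader.qvd (qvd);

InvoiceLines:
LOAD Hash128(CompanyCode, InvoiceNo, InvoiceVersion, SourceSystem) as %InvoiceKey,
     LineNo, Product, Quantity, Amount
FROM InvoiceLines.qvd (qvd);   // the 4 key fields themselves are skipped here
```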

Load time is definitely faster and chart time is likely faster with the composite key. The RAM and file size differences are overstated since in the test case, my row size is small enough that adding a 16-byte incompressible key makes a large difference. In a typical application, row sizes would be much larger, and the composite key a much smaller percentage of the data.

Now, this composite key is NOT a one for one replacement for the synthetic key, which is what I've mostly been talking about in this thread. It is not structurally identical. But it is FUNCTIONALLY identical, and a practical solution for real world cases. I think that's good enough to call it a counterexample to my general assertion.

And that makes things more interesting. Sometimes a synthetic key is better. Sometimes a composite key is better. It depends on the situation and on what you're trying to optimize (RAM vs. load time, for instance).

My perspective on the whole thing stems from the fact that QlikView rationalizes synthetic keys this way in the first place. The simplest question here is: if it's so terrible, why does QV do it at all?

The fact that QV rationalizes correctly modeled data in a way that's consistent with good design says to me that this is a totally legit method of creating these data models.

I think the trepidation with synthetic keys stems from two places:

1. Bad data modeling design
2. General lack of understanding of what a synthetic key is

As you say, bad data design results in 'bad' synthetic keys, results in bad performance, and this usually stems from lack of experience in data design. From there, people assume that synthetic keys themselves are bad, when they're not.

If you have a data model that uses synthetic keys, and that's what you intended, then by all means, use a synthetic key. I can't imagine why you would go through the trouble of rationalizing the key manually, and creating your own composite key, when QV is prepared to do the heavy lifting for you.

Had I encountered a situation where I expected to see a synthetic key in my data model, and I did see such a synthetic key, I'd leave it right where it was. I do, however, get slightly more concerned when I've had a total blonde moment, and created multiple compound synthetic keys, and my model goes all nutty.

As you pointed out from the QV documentation, it MAY do this, and it MAY do that. It's like that warning label on your coffee that has to tell you that it's hot… because some genius put it between their legs once, and got 3rd degree crotch burns. Yup, that coffee MAY give you 3rd degree crotch burns. It may not, however, if you use it properly.

I suspect that a little bad press has done synthetic keys in for the general user community.

Thank you all for a very interesting thread. I thought I would quickly weigh in with my perspective... as much as that can be considered the "official" QlikView perspective. In short, John has it right. Synthetic keys are neither bad nor good, they just are QlikView's attempt to resolve a data model linkage. If that linkage makes sense, from a data model perspective, then a synthetic key is neither better nor worse than resolving the linkage manually in the load script.

So, why the bad reputation? Largely this is because synthetic keys are automatic. So, the most typical case is a new user who creates links between tables that were not expected, and this results in large, and meaningless, synthetic keys. This case is not likely to perform well, and worse yet, is likely to have unexpected/meaningless results in the UI. Also, synthetic keys may create unexpected linkages with unexpected meanings. When you explicitly link tables together, you have to think through the meaning of linking those tables/fields. When QlikView does it automatically, it may have a meaning you didn't intend or expect.

My practical suggestion would be to use synthetic keys as a warning. If you see these pop up, especially if you didn't expect them, you should investigate what tables you've loaded, what fields are in those tables, and what data model you *intended*. If QlikView automagically did the right thing, then you're fine. If something unintended happened, then you may need to rename some columns to resolve an unexpected link.
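That column-rename fix might look like this sketch (field names invented):

```qlikview
// Suppose Year and Month were accidentally loaded in both tables and
// created an unintended synthetic key alongside the intended Date link.
Calendar:
LOAD Date, Year, Month
FROM Calendar.qvd (qvd);

Sales:
LOAD Date,               // the intended link, left alone
     Year as SalesYear,  // renamed so the unintended link goes away
     Month as SalesMonth,
     Amount
FROM Sales.qvd (qvd);
```

Alternatively, simply don't load the redundant fields from the second table at all.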

If you are a beginner, the recommendation holds true - avoid synthetic keys. As a rule, they're a result of poor design. If you're an expert, it's a different story. You know the risks and understand what you're doing. (Compare it to driving a car: you may by chance manage to run on two left wheels, but better to leave that to the experts :-))

Anthony, thank you very much for your explanation, it's wonderful to shed some light on this issue. One clarifying question for you:

Let's suppose that we have a situation with an "expected" synthetic key (or a number of synthetic keys) in a large data set. Would you expect any substantial difference in performance between building the composite keys manually and building them automatically? Can QlikView run out of resources trying to build all the synthetic keys automatically, while the same load could have been performed successfully when the keys are explicitly specified?

The reason for my question: in my practice, I've seen many instances when we had to "kill" load processes taking too much time and resources because of an overlooked synthetic key. Admittedly, most of the time those were not intentional - in most cases, it happened when we forgot to drop a table or rename a field. Intuitively, I always thought that it's more "cost-efficient" to build the composite keys explicitly than to let QlikView figure it out automatically.

Oleg Troyansky wrote: Can QlikView run out of resources trying to build all the synthetic keys automatically, while it could have been successfully performed when the keys are explicitly specified? ... What would be your recommendation here?

Oleg-- In your question you say, "I've had to kill loads... admittedly when the keys were not intentional". This is exactly my point... if you want a join between two+ tables on two+ fields, and that is "expected", then QV should handle it "as well as" if you resolved the key manually. If you end up with an unexpected join... then all bets are off.

Why? Because when you're joining two+ tables on two+ fields that weren't expected to join, then there are likely to be MANY unique combinations of the two+ fields. In other words, it is quite likely that you will build a massive synthetic key table... in fact far more massive than ever makes sense. That is why QV starts consuming memory, etc... it's trying to figure out how to union all these unique values between these fields.

Anthony Deighton wrote:if you want a join between two+ tables on two+ fields, and that is "expected", then QV should handle it "as well as" if you resolved the key manually.

So to confirm, you're saying that as long as the synthetic keys that QlikView builds are correct, then QlikView will handle those keys as well as if we resolved the same keys manually in the load?

And by "handle it 'as well as'" do you mean in the final data model, so this is mostly talking about memory use and chart performance?

What about script performance and memory usage? I would guess that as long as the synthetic keys are intended, straightforward and correct, then it should be no slower, and may sometimes be significantly faster to allow QlikView to figure them out automatically. For example, in my test of a 10-field key, the load took 3:36 when I built the concatenated key manually, and only 1:21 when QlikView built it for me. There was a similar difference in another example I tested in another thread, but I'm too lazy to track that one down. (Edit: 14:01 vs. 3:27)

That said, what about the possible exception that Oleg and I discussed earlier? When you have a lot of legitimate multi-field keys that you're letting QlikView build for you, can it become difficult (but not impossible) for QlikView to figure out the right key structure? So in some cases, might the script execute faster instead of slower if you built the same keys manually, simply because of the difficulty of building them automatically? Or might we expect the creation of synthetic keys to ALWAYS be as fast or faster, no matter what the key structure, than building the exact same key structure manually?

Since the move to the new forum ruined the formatting of my original post and made it unreadable, and since it won't let me edit it, and since I keep needing to refer to it, I've fixed the formatting below. I need to summarize the information from this thread in a document at some point, but until then, this will have to do.

--------------------------------------------------------------------

Synthetic keys have a bad reputation. The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure. If you load tables like this:

TableA: FieldA, FieldB, FieldC

TableB: FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA: SyntheticKey, FieldC

TableB: SyntheticKey, FieldD

SyntheticKeyTable: SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.

Well, that doesn’t sound like a bad thing. If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually. As far as I can tell, QlikView is simply saving me the trouble.

Two tables can be connected by one field. That shows up as a connection. Two tables can be connected by two or more fields. That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field. If so, that’s a data model problem that should be fixed. And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields. If so, that’s a data model problem that should be fixed. And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem. Either can cause problems when they result from a bad data model, when there are too many or the wrong connections. But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables. If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table.

Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function."

Yikes! Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”. No WONDER everyone tries to remove them! But I suspect that the reference manual is just poorly written. I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys. Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys. But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem. I’ve never seen a good data model that resulted in a lot of synthetic keys. Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include: Forming your own non-composite keys”. While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA. And THAT is what I’m particularly interested in discussing.

My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage. I’m suggesting that if you have a synthetic key, and you do a direct one to one replacement with your own composite key table, you will not improve performance or memory usage. In fact, I believe performance and memory usage will typically get marginally worse.

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved. I explained that the problem was actually a data modeling problem: the new version had corrected the data model itself in addition to removing the synthetic key. I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

What I would love to see are COUNTEREXAMPLES to what I’m suggesting. I’m happy to learn something here. I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking. The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves. Maybe I just have a bad memory, and failed to find good examples when I searched. But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage? Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

You sound like you have explored the breadth of synthetic keys! I think you would be the best person to approach regarding the problem I am facing with synthetic keys.

I am trying to binary load a QVW document into my dashboard, and the source document I am loading contains 26 synthetic keys; the dashboard doesn't pull any information other than the binary load. Still, my dashboard doesn't give the same results as the source document.

I noticed that the synthetic key table, which has 38 million records in the source document, shows 284 million records after being binary loaded into the dashboard. I am not sure if the problem is a combination of the binary load and the overuse of synthetic keys, but could synthetic key tables exhibit different behavior after a binary load and cause inconsistent results?

If you're loading in a data model with 26 synthetic keys, I can all but guarantee that the data is modeled incorrectly, that the data model itself has bugs. Of course I can't say that conclusively from afar, without knowing your data and how it's been modeled, but that's what I would bet on.

As I understand binary loads, you should have EXACTLY the same data and data structures in memory after you do one. If you're seeing a different number of records, I have to suspect you're either counting wrong, or are doing more than just a binary load in the script. But I've done little more than a few routine maintenance tasks in QlikView over the past couple of years, so perhaps something has changed. Bugs in QlikView itself are also a possibility.

I've used synthetic keys in several scenarios, and from my experience across multiple documents, I have never seen a performance impact outside of reload time. I have several situations where we have a legitimate need to have synthetic keys in place, because the index data may or may not exist in several tables and each has slightly varying context. Short of loading a temporarily concatenated set of all data points and reducing to a distinct set, my only option is to load and let the synthetic keys take care of that for me. We also have some calculation automation that utilizes the joined field data from those tables at varying points in the script, so it is helpful to not replace them with a key.

The reality is that most relational models have multiple loops and have those for a reason. It is difficult to reduce them to a completely non looping set without utilizing synthetic keys, especially in situations where the data is coming from 4 or 5 sources that were not designed to work in a coherent data model.

In all honesty, my biggest frustration with them has been their use as a scapegoat in the support environment. We have been experiencing a bug for quite some time that causes large reports to disconnect an ajax session. The report runs fully but is not handed back to the user. When we contacted support, the response was that synthetic keys were causing the issue. I went through a two-month process of removing the synthetic keys (not that it took that long, but we've been incredibly busy and changes require thorough testing), only to find that after removal, we had the same exact issue. In addition to that, we have had server performance issues (not reload related) blamed on the synthetic keys as well that were not resolved by removing the keys.

In total, we have spent a great deal of time removing them only to find out it had no effect on our problems. I honestly don't know why it would create a problem, since we are simply doing what the synthetic key is doing to resolve the issue anyway.

Excellent thread on synthetic/composite/autonumber keys, thank you all, I am learning a lot here. I read the whole thread from the start, but I haven't yet seen a word from QlikTech R&D. It would be nice to see what their input is here.

John, if I understand correctly, you mention several times that, normally, performance issues are caused by poor data models: "Only YOU can decide if your data model is correct or not", "Only experience with data modeling will tell you if your data model is correct. No computer can tell you. Only experience can tell you." From a beginner's position (this is my case), do you have any best practices / recommendations (or reference books) to gain that knowledge of data modeling, to be able to see whether your data model is correct/optimal before starting to load everything into QlikView (to avoid problems later on)?

From a beginner's position (this is my case), do you have any best practices / recommendations (or reference books) to gain that knowledge of data modeling, to be able to see whether your data model is correct/optimal before starting to load everything into QlikView (to avoid problems later on)?

While not strictly necessary, the data model that most seem to gravitate towards for QlikView is either star schema or snowflake schema. You would have one central fact table surrounded by dimension tables. With snowflake schema, these dimension tables may themselves have dimension tables (in star schema, these tables would be denormalized onto the main dimension tables).

I don't think these data models prohibit synthetic (multi-field) keys, but I wouldn't think that synthetic keys would be very typical with these structures either. More typically, the central fact table would have an ID, and this ID would map you to your dimension table. But if multi-field keys make sense, they make sense. You CAN convert them to a single ID, and store the other fields on the dimension table, but this often involves additional script complexity and load time for little benefit (though it may be good experience for a new developer; that's a bit of a separate argument).

Both star and snowflake schema tend to involve some denormalization, more so for star schema. In my background, denormalization is a dirty word, but it's not a problem in QlikView. We're not talking about operational data here. We're talking about reporting data, data warehouse data. There are no updates back to the operational databases. Everything generally flows in one direction. QlikView's internal compression algorithms keep you from paying a memory or performance penalty, and the denormalized data often performs better.

In the interest of full disclosure, I personally make no effort to transform my data into a star or snowflake schema. You just need to understand what tables make sense, how they relate to each other, when to denormalize and when to keep a separate table and so on. A data model designed that way will often resemble a star or snowflake, but sometimes it won't. So sometimes mine do, and sometimes mine don't. But it's one of those cases where if you lack the data modeling experience to guide you, then very explicitly aiming for star schema may make sense and be good practice.

My data models also grow organically rather than by top-down plan, though I may refactor if I become unhappy with them. "OK, I'm reporting on order items, so that's my base data. Looks like I need a little bit of product data. Let me join that. Oh, I need a bunch of product data. Let me make that its own table. I need some descriptions for these codes, so I'll join those in. Looks like I need a calendar. Hmmm, they want to know what ordered material is now available in our warehouse for pick up, so I'm adding a warehoused items table." I threw that example together randomly, but what I have now resembles a star schema. The order items are the central fact table. The calendar is a dimension connected by date. The product data is a dimension connected by product ID. The descriptions have been joined to the main data so they don't turn it into a snowflake. Warehoused material isn't a dimension, so maybe this isn't technically a star schema, but it IS connected to our main fact table by just order item ID, so it's at least structurally the same.

One thing I did when I first started QlikView was to normalize the data even more than it was normalized in the business systems. I was so excited by how well QlikView could understand all the connections that I assumed it was the best thing to do, and that finally I could correct all those horrible denormalizations that still existed in our main systems! This heavy normalization didn't really cause any problems, but I don't think it was optimal. While QlikView handles highly-normalized data models just fine, these days I tend to fairly significantly denormalize the data. The consensus is that this improves performance, though I've paid little attention. Somehow, the denormalizations just "make sense" now. I know that's hardly a strong justification, but honestly, a lot of what I do is by feel, not by explicit rationale. Some things just feel right. Others don't. I hope that it's 30 years of programming experience talking to me subliminally rather than random neuron firings, but who knows. And certainly "do what feels right" isn't a very useful instruction to someone trying to figure out how to model their data.

As a very rough rule of thumb on when to denormalize and when not to, if you only have one or two or a few fields on a dimension table, go ahead and join it onto your main table. If you have a bunch of fields, keep it as its own object. Also, regardless of number of fields, I almost always keep my calendar table separate.
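That rule of thumb might be sketched like this (all names invented):

```qlikview
// Fact table at the center.
Facts:
LOAD OrderItemID, ProductID, CustomerID, OrderDate, Quantity, Amount
FROM OrderItems.qvd (qvd);

// Small dimension (a couple of fields): join it onto the fact table.
LEFT JOIN (Facts)
LOAD ProductID, ProductName, ProductGroup
FROM Products.qvd (qvd);

// Wide dimension (a bunch of fields): keep it as its own table,
// linked by CustomerID.
Customers:
LOAD CustomerID, CustomerName, Segment, Region, Country, Industry
FROM Customers.qvd (qvd);

// The calendar stays separate regardless, linked by OrderDate.
Calendar:
LOAD OrderDate, Year(OrderDate) as Year, Month(OrderDate) as Month
FROM Calendar.qvd (qvd);
```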

Thanks a lot for all the details, John. Having checked the Wikipedia articles on data model schemas, and another interesting whitepaper I found from a data modeling tool vendor (I attach it here for other beginners like me), which explains normalization of data models, I believe that in my case I don't even have a star or snowflake schema. Probably my model is much simpler, with the only complexity (if it is really a complex thing) being that it has several common fields across all 3 tables (see tables in the attached Excel sheet).

These 3 tables are loaded from 3 QVD files (I concatenate periodic extractions from 3 SAP transactions into Excel sheets, to be able to do time-series analysis), so I believe I will have to try different scenarios with them and see which is the best option in terms of script load speed and validation of the expected data associations. I am thinking of:

Option 1: Let QlikView generate the needed synthetic keys and see how it works.

Option 2: Try to reduce the number of common fields. As I don't have unique simple primary keys in any of the 3 tables, I think I need to generate (composite) primary keys, and convert them to integer primary keys using the Autonumber function, as leaving the generated primary keys as strings could cause the out-of-virtual-memory error when loading the script. And I need to do this when loading the data from the QVD files (and not when generating the QVD files).

Regarding the Autonumber function, I have seen another thread (see attached) where you were debating with other guys about the Autonumber() and Hash*() functions. Do you know what the difference is between the various Autonumber "flavours" (Autonumber(), Autonumberhash128(), Autonumberhash256()...), and which one would be adequate for my tests?

Well, for your data, I expect that just letting synthetic keys form will NOT work very well. There are a number of fields in common, and they're inconsistently in common, so you'll probably get at least a handful of interacting synthetic keys. I wouldn't expect replacing them with composite keys to work any better. It wouldn't surprise me if this is a case where concatenation is the best answer - just concatenate (not join) all of your tables into one table, and let the fields be null where they don't exist for that particular data. It's really hard to know as an outsider to this data whether or not that would do what you need, though.
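To make the concatenation idea concrete, here is a small Python sketch (tables and field names invented for illustration): the three tables are simply stacked into one, and each row carries None — the analogue of null — for fields its source table doesn't have.

```python
# Three hypothetical tables with partially overlapping fields.
table_a = [{'Material': 'M1', 'OrderQty': 10}]
table_b = [{'Material': 'M1', 'Plant': 'P100'}]
table_c = [{'Material': 'M2', 'StockQty': 5}]

# Stack (concatenate) them and pad missing fields with None.
stacked = table_a + table_b + table_c
all_fields = sorted(set().union(*stacked))  # union of every table's field names
combined = [{f: row.get(f) for f in all_fields} for row in stacked]

print(all_fields)   # ['Material', 'OrderQty', 'Plant', 'StockQty']
print(combined[1])  # {'Material': 'M1', 'OrderQty': None, 'Plant': 'P100', 'StockQty': None}
```

Because the tables become one, there are no keys at all between them — which is exactly why neither a synthetic nor a composite key is needed in this scenario.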

If you're creating composite keys, you might do this while generating the QVD files, or while loading from the QVD files. Both have advantages and disadvantages, and some of these advantages and disadvantages are tied to how you create the composite key, and how you're generating the QVDs (in one application or multiple applications).

If all of your autonumbering is occurring in the same load in the same application, autonumber() is the function to use. It produces integer key values, taking the least space and having presumably the highest performance in charts.
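In the load script this looks something like `Autonumber(FieldA & '|' & FieldB)` for a two-field composite key. As a rough sketch of what that does conceptually (my simplification, not Qlik's actual implementation), each distinct combination of field values is assigned the next sequential integer:

```python
def make_autonumber():
    """Return a function mapping each distinct composite key to a sequential integer."""
    seen = {}
    def autonumber(*fields):
        # A separator prevents collisions like ('ab', 'c') vs. ('a', 'bc').
        key = '|'.join(str(f) for f in fields)
        if key not in seen:
            seen[key] = len(seen) + 1
        return seen[key]
    return autonumber

autonumber = make_autonumber()
rows = [('P100', '2023-01-05'), ('P200', '2023-01-05'), ('P100', '2023-01-05')]
keys = [autonumber(material, date) for material, date in rows]
print(keys)  # [1, 2, 1] - the repeated combination reuses its integer
```

This also shows why the numbering only holds within one load: the `seen` dictionary starts empty each time, so a second application running the same code would hand out integers in whatever order *its* data happened to arrive.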

If your autonumbering is occurring on the same keys in different loads or different applications, autonumber() should not be used. It will generate different numbers for the same keys, and the same numbers for different keys. AutonumberhashXXX() technically has this problem as well: the probability of mismatched keys is very small but non-zero. Use at your own risk. (Edit: Hash 128 vs. 256 refers to the number of bits in the generated hash value, i.e., in the resulting key field. So you're generating a 16-byte or 32-byte key. The longer the key, the lower the chance of key problems. But in some cases, the resulting key may be longer than the data you were hashing, in which case you're just wasting time.)
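That "very small but non-zero" collision risk can be estimated with the usual birthday-problem approximation, assuming the hash behaves like an ideal random function:

```python
import math

def collision_probability(n_keys, bits):
    """Birthday-bound estimate of the chance that at least two of n_keys
    distinct keys receive the same value from a `bits`-bit hash."""
    # p ~= 1 - exp(-n^2 / 2^(bits+1)); expm1 keeps precision for tiny results.
    return -math.expm1(-n_keys**2 / 2**(bits + 1))

# Even a billion distinct keys leave a 128-bit hash effectively collision-free.
p128 = collision_probability(10**9, 128)
print(p128)  # roughly 1.5e-21: tiny, but as noted above, non-zero
```

By the same estimate a 256-bit hash is smaller still, so the choice between the flavours is mostly a trade-off between key length (memory) and how much residual risk you can tolerate.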

The safest but slowest approach is to use no function at all - simply concatenate your field values with a separator of some sort. These large keys would be expected to slow down chart performance a little, and of course take more space.

And then of course there are synthetic keys. With synthetic keys, all autonumbering is internal to the product and occurs in the same load. It is therefore safe, and should be at least as fast as an explicit autonumber(). But didn't we just see an example where synthetic keys loaded much more slowly than autonumbered keys? Yes. The reason is that "all autonumbering... occurs in the same load". In the example, we used autonumberhash128() in an incremental load, saving us almost all of the time required to assign the key values. The synthetic key approach, on the other hand, had to generate keys for the entire data set every time, and I believe this is what slowed it so much in comparison.

Field A has two values. Field B has two values. You claim the synthetic key table will have four rows. It doesn't. It has two rows.

I believe you're confusing synthetic keys with Cartesian joins. That isn't how synthetic keys work. A synthetic key is not giving you all POSSIBLE combinations of field values. It is giving you all EXISTING combinations of field values. You can only have 1,728,000 records in the synthetic key if you have 1,728,000+ records in your data set. You can only have 36,000,000,000,000,000 records in your synthetic key if you have 36,000,000,000,000,000+ records in your data set.

Edit: Attached is your worst case example - a 5 field synthetic key with 800 possible values for each field. The load works in a flash, the file is 156 KB, and there is no problem rendering a tablebox that depends on taking advantage of the synthetic key. I believe it is safe to conclude that the synthetic key table does not, in fact, have 36,000,000,000,000,000 records.
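This is easy to demonstrate outside QlikView as well. A quick Python sketch (random data, sizes chosen arbitrarily to mirror the example) shows that the number of existing combinations is bounded by the row count, no matter how astronomical the number of possible combinations is:

```python
from random import Random

rng = Random(0)
n_rows = 100_000

# 100,000 rows, each a combination of 5 key fields with 800 possible values each.
data = [tuple(rng.randrange(800) for _ in range(5)) for _ in range(n_rows)]

possible = 800 ** 5        # combinations "on paper": 327,680,000,000,000
existing = len(set(data))  # distinct combinations actually present in the data
print(possible, existing)  # existing can never exceed n_rows
```

A synthetic key table needs one row per member of `existing`, not per member of `possible` — which is why the 156 KB file above is unsurprising.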

John, I can appreciate the thought and effort you put into understanding synthetic keys.

My take on synthetic keys lies along these lines:

Can it work? Sure. Just the same, we don't have to create preceding loads. We don't have to explicitly list all fields in select statements or preceding loads. We don't have to explicitly name the table we are creating, joining or concatenating to. Does that mean we shouldn't? Definitely not.

Being explicit in what we do is a much cleaner and more consistent approach to development. It makes it easier on ourselves when we have to revisit the app down the road. It especially makes it easier on anyone else who has to support the app.

Not having a standardized way of handling keys opens the door for many problems and rework when changes are required.

My take on it is exactly the same, but I couldn't possibly word it as clearly as you did - it's all about "being explicit in what we do".

John - I agree with your distinction between synthetic keys and Cartesian joins. However, I think the sentiment that Klaus was trying to express is that generating a single key (a synthetic key in this case) covering all combinations of the component keys will create the biggest field in your database. For example:

1,000 customers x 1,000 products x 1,000 dates - that's a composite key with up to 1 billion distinct values (reached, of course, only if every customer buys every product every day...). In any case, the synthetic key will generate a list of distinct values that's substantially longer than any of the initial 3 keys. And that, as you know, is what determines memory consumption in QlikView - not so much the number of rows in the table.
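This point is easy to see in miniature. In the sketch below (randomly generated, arbitrary sizes), each component key has at most 1,000 distinct values, while the composite key field ends up with close to one distinct value per row - and it is that distinct-value count, not the row count, that drives a field's memory footprint:

```python
from random import Random

rng = Random(1)
# 200,000 hypothetical order rows: (customer, product, date), 1,000 values each.
rows = [(rng.randrange(1000), rng.randrange(1000), rng.randrange(1000))
        for _ in range(200_000)]

distinct_customers = len({c for c, p, d in rows})
distinct_products  = len({p for c, p, d in rows})
distinct_keys      = len(set(rows))  # distinct composite key values

print(distinct_customers, distinct_products, distinct_keys)
# the composite key has ~200,000 distinct values vs. ~1,000 per component field
```

So both statements hold at once: the key table never exceeds the row count (John's point), and the key field still dwarfs every component field in cardinality (Oleg's point).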

So, I get your argument that a synthetic key can perform slightly better than an identical composite key generated manually. In this case, I'll repeat after Mike - I'd rather be explicit in what I do. However, I'd like to continue the exercise and take it to the next level. Let's design the optimal data model that perhaps won't need the composite key at all, and that's the ultimate solution to the problem of synthetic keys.

One technical correction - the variable Time that you calculate in your script does not include the time needed for building the synthetic keys. The linking is done after the script is finished. The only safe way to measure it is to generate a log file and see the time needed for generating the synthetic key.

In the case of your script, applied to 2 million rows, the load itself took 32 seconds, and generating the synthetic key took an additional 13 seconds - more than a third of the script load time!

I agree that the synthetic or composite key can take a large portion of the memory if it has a large number of distinct values. That's a disadvantage to both of those approaches.

As far as being explicit, I've changed my mind on that issue. I used to believe that if two tables were associated by two fields, then obviously I intended a synthetic key. I saw this as no different than associating two tables by a single field. But given that 90% of synthetic keys are probably created in error, what is my assumption going to be when looking at someone else's program? Unless the programmer's comments explain that he or she had intended it to join that way, that's one of the first things I'll think could have been a mistake. So yes, composite keys are self-documenting, and this is an advantage over synthetic keys.

I further agree with you that probably the BEST solution in many cases is to have neither a synthetic key nor a composite key. This thread has been primarily focused on the comparison of the two, based on what I think is a common suggestion that you replace synthetic keys with composite keys. But a better solution still can be to modify the data model to perform without either. And that is certainly relevant in a general discussion about whether or not synthetic keys should be removed. Even if they are arguably a better choice than composite keys in some particular application, they may not be the best choice. And if another choice is best, it should generally be used.

Yes, you're right that the way I was calculating load times couldn't capture the time spent building the synthetic keys. I hadn't realized that at the time, though I should have. Synthetic keys are built after the script finishes executing, so no duration measured within the script can include them. That misrepresented them in the comparisons. I know I've done it correctly more recently, as I remember reading log files for the synthetic key creation times, but I can't remember if that was in more recent threads or just something I was doing myself when testing my own applications. In any case, you're correct - my load times were wrong in the comparisons.

And on top of this, I'll confess that I went back and checked all the places I've used synthetic keys. "All" turned out to only be a few applications with one each, and in ONE of those applications, the synthetic key was a design flaw. My data should not have been connected by the two fields involved. I redesigned the data model to fix the error, and of course the synthetic key went away. Perhaps if I'd had to build the connection manually, the additional thought required would have been enough to make me realize my error, and I'd have avoided the mistake. If with seven years of QlikView experience (and much more data modeling experience) I'm still screwing up synthetic keys, should I really be recommending them to less-experienced developers?

You and others have done much to convince me, Oleg, since this was originally written. There are definite advantages to being explicit. There are often definite advantages to simply restructuring the data model, such as by joining or concatenating tables. Even composite keys can be significantly faster in common real world scenarios, such as when combined with incremental QVD loads. I've written no actual applications where the synthetic key gave me any definite advantage other than slightly faster initial coding time, which is very low on the scale of what is truly important.

I wouldn't say that I've come completely around, but my current opinion is much different than it once was. So thank you (and others) for continuing to debate me on the subject. I appreciate your patience and experience.

...we don't have to create preceding loads. We don't have to explicitly list all fields in select statements or preceding loads. We don't have to explicitly name the table we are creating, joining or concatenating to. Does that mean we shouldn't? Definitely not.

Being explicit in what we do is a much cleaner and more consistent approach to development. It makes it easier on ourselves when we have to revisit the app down the road. It especially makes it easier on anyone else who has to support the app.

Well, there's often reason to do this sort of thing beyond merely making it easier on ourselves and others when maintaining the application. Say I LOAD * from a QVD (or database). My application is now exposed to outside corruption. All anyone has to do to break my data model is add a field to that QVD that happens to have the same name as a field in another QVD that I'm using. No thanks. I'd rather my applications be safe. And on top of that, LOAD * or SELECT * wastes time and memory on fields I don't need.

But yes, these are good examples of why it isn't always a good idea to just let QlikView do something for us. It is sometimes important to do it ourselves, manually, explicitly, even when that means more code.

Hi John, I need your help badly. Could you please help with the below scenario:

I have to concatenate the QVDs from different folders. But the problem is that when the QVDs from the 2nd folder come in, they replace the QVDs generated from the 1st folder. My aim is to concatenate the QVDs from both folders. Please suggest. Do I need to mention CONCATENATE somewhere?

Further, the data model contains a composite key (the FromDate and ToDate fields) which will manifest itself as a QlikView synthetic key. But have no fear. This synthetic key should be there; not only is it correct, but it is also optimal given the data model. You do not need to remove it.

Further, the data model contains a composite key (the FromDate and ToDate fields) which will manifest itself as a QlikView synthetic key. But have no fear. This synthetic key should be there; not only is it correct, but it is also optimal given the data model. You do not need to remove it.

Agreed, but there may still be legitimate reasons for removing the synthetic key, even in the given example. Let's say that Events is a huge table that we incrementally load every hour into a QVD, and that our application loads from there. Testing might show that every hour, it takes half a minute to do the intervalmatch, and another half a minute for QlikView to build the synthetic key.

As an alternative, add an IntervalDates field to the Intervals QVD, defined like FromDate*100000 + ToDate or something along those lines. When incrementally loading events, intervalmatch to ONLY the new rows (this assumes our historical intervals don't change), and add the IntervalDates field to the Events QVD. Now our intervalmatch happens in a flash since it's on so few records, and no time is taken building a synthetic key since we now match directly on a single numeric field. With what we said about performance, that should save us a minute per hour, which adds up.
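As a sketch of how that packed key works (assuming the usual QlikView convention of storing dates as serial day numbers, with 1899-12-30 as day 0): multiplying the 5-digit FromDate serial by 100,000 leaves room for the ToDate serial in the low digits, so both dates round-trip through one integer.

```python
from datetime import date

EPOCH = date(1899, 12, 30)  # origin of QlikView/Excel-style date serials

def serial(d):
    """Day serial number for a date, matching QlikView's numeric dates."""
    return (d - EPOCH).days

def interval_key(from_date, to_date):
    """Pack two date serials into one integer key: FromDate*100000 + ToDate."""
    return serial(from_date) * 100_000 + serial(to_date)

key = interval_key(date(2012, 1, 1), date(2012, 3, 31))
print(key)  # 4090940999

# The packing is reversible, so nothing is lost by matching on a single field.
assert divmod(key, 100_000) == (serial(date(2012, 1, 1)), serial(date(2012, 3, 31)))
```

The factor only needs to exceed the largest possible serial; 100,000 is safe for 5-digit serials, which covers dates for roughly the next century and a half.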

Of course testing might also show that the intervalmatch and the synthetic key are not a performance bottleneck at all, and that we're complicating the process for no particular reason, including putting some interval information on an Event table where it doesn't really belong in a data sense. That's one reason why testing is important. Know what you're doing, and know why you're doing it.

For one document I used to combine two keys into a combined key, and created a link table. After reading this topic, I realized I could remove this operation from my script and just let QlikView do all the dirty work by creating a synthetic key.

The old link table and the new synthetic table contain exactly the same number of rows (+-300k) and the resulting data shown on screen is identical. Document size is only slightly reduced (<1%), but the most notable difference is the load speed, which is half of what it used to be. I can't measure performance but it doesn't "feel" any different.

Seems my data model is correct so the synthetic key does not cause any difficulties for me.

Very interesting. I am also a bit confused by the whole literature on the subject. Henric speaks about a consensus, and everybody repeats what the consensus says. I tried to understand that specific point, but I fear that I failed. In 10 or 20 years, I will be an expert who understands that curious point.

We learn that we should avoid these keys:

1) Because of their poor performance. But you stated, and Henric also stated, that the performance is not so poor.

And the literature encourages us to replace them with composite keys, autonumber, and so forth: I do not really see the point of all that stuff!

2) Because of a poor data model. But most of the time, a data model will be considered poor if one can see a single synthetic key. And the model is changed to avoid these keys, even if the new model is far more difficult to understand and handle than the previous one.