Pages

Wednesday, December 9, 2009

A colleague of mine sent me this link, and wanted me to have a look at point 5:

Choose sys_guid() instead of sequences for populating meaningless PK-columns.Traditional sequences are used for populating an ID-column.Sequences have several disadvantages:• Sequences have to be created, maintained and deployed.• Sequences need PL/SQL-code to be fetched (the ‘select into’ construction is obsolete in 11g, but defaulting a column with sequence.nextval still does not work). • When you introduce triggers to fetch the .nextval value, you introduce significant undesired performance penalties• When moving your data from data from one environment to another (e.g. copy PRODUCTION data to ACCEPTANCE), you have to synchronize the .LASTVAL of all sequences.• Fetching sequences can be a performance-critical in big environment (bulk fetch in Data warehouse for instance) which has to be fixed by increasing the CACHE propertyUse guid's to overcome the disadvantages of sequences:create table employees( guid raw(32) default sys_guid() not null primary key);

I knew about the SYS_GUID() function, but didn't realize you can use it as a default value for a primary key column. That sure is convenient. Where other RDBMS all have some sort of autonumber/identity column to automatically fill an ID column, Oracle uses sequences, which is a little more cumbersome to implement. However, I found the story not really balanced. Before addressing each point and mentioning some of the disadvantages of SYS_GUID, let's see how it works. Using sequences, a simple setup can be like this:

rwijk@ORA11GR1> create sequence seq cache 1000 2 /

Sequence created.

Here I create the sequence with "cache 1000", because I want a session to fetch 1000 numbers at once, instead of one each time. This reduces the overhead for Oracle to maintain the SYS.SEQ$ table. Gaps in the numbers might be large this way, but this shouldn't bother you. If it does bother you, you likely shouldn't be using a sequence, as they are never gap-free. A table is created without any reference to the sequence:

To fill the id column with the sequence value, I prefer to just use the nextval function directly in my api. Another option is to use database triggers, but I'm on a mission to avoid those whenever possible. The next procedure represents such an api:

One object less (the sequence), and you don't need to bother about the ID column in the insert statement. The mentioned disadvantages of sequences per point:

Sequences have to be created, maintained and deployedWhile obviously true, I don't think it is a major point. A sequence never comes alone, so we are already creating, maintaining and deploying database objects. But it adds up a little indeed.

Sequences need PL/SQL-code to be fetched (the ‘select into’ construction is obsolete in 11g, but defaulting a column with sequence.nextval still does not work)You don't necessarily need PL/SQL to use a sequence, as shown above. So this is untrue. Although it is true that you cannot default a column with sequence.nextval.

When moving your data from data from one environment to another (e.g. copy PRODUCTION data to ACCEPTANCE), you have to synchronize the .LASTVAL of all sequencesThis is true. Very minor disadvantage though, in my opinion. If we refresh the acceptance database, we don't only refresh the data, but also the objects itself.

Fetching sequences can be a performance-critical in big environment (bulk fetch in Data warehouse for instance) which has to be fixed by increasing the CACHE property.The solution to the possible problem is already presented here.

No advantages of sequences over SYS_GUID are mentioned, so here are some of mine:

A RAW(16) column takes up more space than a NUMBER column.The SYS_GUID is a RAW(16) taking up 16 bytes, whereas as an ID populated by a sequences typically is a NUMBER(6), NUMBER(9) or NUMBER(12). A reasonable average is NUMBER(9) which takes up only 4 bytes. When you have foreign key columns in your table, this effect is multiplied.

Ad-hoc querying has become more cumbersome.Which one would you rather type. This:

You can no longer see the order of inserts by the ID column.When using the cache property and multiple sessions, this is neither necessarily true for sequence populated ID columns, but there is a strong correlation between the number and the time it was populated. This "information" is not there with SYS_GUID() populated columns.

Performance seems to be slightly less when using SYS_GUID()The results of my test varied a lot. But generally it favoured the sequence approach. Here is the output of one of the more representative tests:

15 comments:

Indexes ! Sequence generated values will always increment and be added at the high end of the index.SYS_GUID values will be all over the place. On the plus side, that may reduce contention for hot blocks , but on the minus side your index inserts will be all over the place. All the arguments for/against reverse key indexes would apply.

I've had a few people trying to convince me to use SYS_GUIDs, I'm just not buying it yet. I'm not convinced as to any real or perceived advantage.

"You can no longer see the order of inserts by the ID column."You probably know this, but I thought it worth mentioning, on a RAC system you can't count on the sequence numbers to be in order. That quickly stopped me from using the ID for any type of ordering.

Like Jeffrey mentionend, if guessing ID's breaks your system, you should have a look for another job. Basically i had to change for several years from Oracle to SQL Server, where this "crazy idea" of GUIDs is omnipresent. in short:Tons of waste data to represent easy, small and basically useless information. as a PK is just an arbitrary value, you do not have to order it or whatever and indeed, readable ID-Values make life of a developer really much easier and less error prawn (similar looking is not similar value !).In real large and high concurrency systems GUID's are the real performance killers, as KEYS are usually Indexed, just the UK-Check on inserts raise for example physically/logically reads dramatically, especially if tables are not partitioned (talkin about billion rows tables here..). Imagine a table with 5 billion rows .. just calc size of the index ..and if you have more than one table..as usual a model..you'll need tons of RAM just for indexes ...and now have 200.000 users doing concurrent inserts (by luck no update/delete).. i had this situation. only possibility to safe system before death was to refactor guids to int .. and all was fine. .. so like always, for small systems .. it probably doesnt matter, but if you plan to go far/big, NEVER use it !