Sunday, February 19, 2006

Watch out for sequential Oracle GUIDs!

I ran into a very interesting situation regarding Oracle GUIDs or Globally Unique Indentifiers, the other day. I will tell you about my experience and then after that, you can read all about Oracle GUIDs in an article I originally had published in Oracle Professional. Enjoy!

A GUID is a long sequence of characters that are supposed to be globally unique -- that is, the likelihood of a particular sequence of characters appearing (returned by the GUID generating function) more than once on any computer running on our globe is miniscule. Thus, you can use GUIDs when you need unique values that span, say, database instances or multiple networks.

My latest obsession is a product named Qute - the Quick Unit Test Engine. You can find out more about it at www.unit-test.com. We use GUIDs for primary keys; this makes our work substantially easier since the product and underlying data are installed on a user's computer, and we do not have to worry about possible conflicts with sequence-generated primary key values.

The Delphi frontend in Qute also takes those GUIDs and hashes their values to integers, which are then used to index arrays of data and speed up access to browser data. All fine so far.

Unfortunately, a number of users reported an error involving a "Hash Mismatch" on starting up of Qute. After lots of patient cooperation from those users, we were amazed to find that the Oracle SYS_GUID function (the GUID generator offered by Oracle) was returning sequentially-incremented values, rather than pseudo-random combinations of characters.

We then had a bug in our Delphi code such that the hashing algorithm was set to "quick and not very secure" -- meaning that strings which were not very different were hashing to the same number, which then triggered our error. We adjusted that algorithm, making it more robust, and the problem seems to have disappeared. But I thought I should let the world know about our experience.

To see if you have this issue in your installation of Oracle, run the following script:

Going Global with GUIDs

What is a GUID?

You've probably heard about and even seen some funky-looking string that is referred to as a "GUID". This article explains what a GUID is, how they can be very useful, and how you can work with them inside Oracle.

Let's start with a formal definition of a GUID, taken from Wikipedia (http://en.wikipedia.org/wiki/GUID):

"A Globally Unique Identifier or GUID is a pseudo-random number used in software applications. Each generated GUID is "mathematically guaranteed" to be unique. This is based on the simple principal that the total number of unique keys (264 or ) is so large that the possibility of the same number being generated twice is virtually zero. The GUID is an implementation by Microsoft of a standard called Universally Unique Identifier or UUID, specified by the Open Software Foundation (OSF)."

Ah! So a GUID is actually the Microsoft version of the UUID, about which Wikipedia tells us:

"A Universally Unique Identifier is an identifier standard used in software construction, standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that that identifier will never be unintentionally used by anyone for anything else. Information labelled with UUIDs can therefore be later combined into a single database without need to resolve name conflicts. The most widespread use of this standard is in Microsoft'sGlobally Unique Identifiers (GUIDs) which implement this standard.

"A UUID is essentially a 16-byte number and in its canonical form a UUID may look like this:

550E8400-E29B-11D4-A716-446655440000"

So, in point of fact and strictly speaking, a GUID is a special (Microsoft-specific) case of a UUID. Perhaps because "GUID" is easier and more fun to pronounce than "UUID," it is the most common of the two terms, even when used in a non-Microsoft context (as we find in Oracle and as I will be doing in this article). I will continue, therefore, to refer to this number as a GUID.

One comment regarding the "mathematical guarantee" mentioned above: as far as I understand, no one has come up with an algorithm that really does quarantee a unique GUID. It is possible, though very, very improbable, that you could generate a GUID that conflicts with an existing value.

GUIDs are GOOD for...

The core value of a GUID is expressed in a single line of the UUID definition above, namely: "The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination." This statement should get Oracle database programmers thinking about two things:

Distributed databases: they are certainly a kind of "distributed system." So GUIDs should help us identify unique objects not just in a single instance but across different database instances.

Primary keys and unique indexes: we use these constraints to "uniquely identify" rows of information within a given relational table. Perhaps a GUID could be used for these situations?

If I am writing code for an application that resides in a single database instance, then GUIDs are probably not very compelling. Suppose I face a different scenario: my application runs on many different instances. Each instance has its own users inserting their own data. On a regular basis, however, data from one instance must be merged into another instance.

In this scenario, reliance on sequence-generated primary keys gives me serious heartburn. Since users are inserting rows independently in multiple instances, sequence values are being used up at different rates. When I move data from one instance to another, I have to assume and program for primary key conflicts of various kinds.

This is exactly the situation I encountered as I built Qnxo (www.qnxo.com). This product provides access to a large and ever growing/changing set of PL/SQL scripts (reusable code and generation templates). Any changes to my central PL/SQL repository must be distributed to users. In my first phase of design and construction of Qnxo, I relied on sequence-generated primary keys. As soon as I realized the error of my ways, I converted as much as possible over to GUIDs.

Let's take a closer look at some of the lessons I learned about applying GUIDs in the PL/SQL environment.

Oracle and GUIDs

To help us work with GUIDs, Oracle provides a SQL function named SYS_GUID, documented as follows:

"SYS_GUID generates and returns a globally unique identifier (RAW value) made up of 16 bytes. On most platforms the generated identifier consists of a host identifier, a process or thread identifier of the process or thread invoking the function, and a non-repeating value (sequence of bytes) for that process or thread."

The difference between a RAW and a VARCHAR2 is that RAW string is not explicitly converted to different character sets when moved between systems by Oracle. This point is critical for GUIDs, since we wouldn't want our GUIDs to be changed as they move from one system to another. Conversely, it is also a non-issue for GUIDS, because the characters used in GUIDs are drawn from a restricted subset of ASCII characters, which are static across all character sets.

This function is also exposed in PL/SQL through the STANDARD package as follows:

I can therefore immediately obtain a GUID from within my PL/SQL program by writing code like this:

DECLARE

l_guid_rawRAW (16);

l_guid_vc2VARCHAR2 (32);

BEGIN

DBMS_OUTPUT.put_line (SYS_GUID);

l_guid_raw := SYS_GUID;

l_guid_vc2 := SYS_GUID;

DBMS_OUTPUT.put_line (l_guid_raw);

DBMS_OUTPUT.put_line (LENGTH (l_guid_raw));

DBMS_OUTPUT.put_line (l_guid_vc2);

DBMS_OUTPUT.put_line (LENGTH (l_guid_vc2));

END;

/

And here is the output from running this code:

E7D3F9F0A0B24390AA504E79EC132677

6FB0B56BA65C4B1580827575F0DDB602

32

693295E9405D4C7C8B3453E481EF4DA5

32

Formatting for GUIDs

As Wikipedia specified earlier in this article, ""a UUID is essentially a 16-byte number and in its canonical form a UUID may look like this: 550E8400-E29B-11D4-A716-446655440000."

In addition, in many situations, the common presentation format for a GUID is:

{550E8400-E29B-11D4-A716-446655440000}

So as you decide to incorporate GUIDs in your code and underlying database, you should decide how you want to store the GUIDs. You could stick with the unvarnished RAW values, or you could store a formatted version, if that is how the GUIDs are going to be used in your application.

We decided that in the Qnxo, we decided that the GUIDs would be stored with the full formatting, including dashes and the open and close brackets.

Using GUIDs in Oracle tables

Let's look at how we can use GUIDs inside a relational table.

Below I create a table using a GUID as the primary key for a table.

CREATE TABLE guid_table (

pky RAW(16) PRIMARY KEY

, NAME VARCHAR2(100));

When I insert a row into this table, I can use the SYS_GUID function to generate my unique GUID. I try to do it this way:

INSERT INTO guid_table

VALUES (SYS_GUID, 'Steven');

but I get the following error:

ORA-00984: column not allowed here

As with .nextval, I am unable to call this function directly in SQL. So if I want to insert rows I need to write and run code like this:

DECLARE

l_pkyguid_table.pky%TYPE;

BEGIN

l_pky := SYS_GUID;

INSERT INTO guid_table

VALUES (l_pky, 'Steven');

l_pky := SYS_GUID;

INSERT INTO guid_table

VALUES (l_pky, 'Sandra');

END;

/

Now I can select the data from this table:

SQL> SELECT * FROM guid_table;

PKYNAME

-------------------------------- ---------------

DF832F19657645A2866E7AD15E9F59AF Steven

5FBAB1A92157434EBDBDB5880556D151 Sandra

That's certainly easy enough! Of course, since we want to make certain that a valid GUID is placed is in this column, it would be best to encode that logic into a BEFORE INSERT trigger on this table, as in:

CREATE OR REPLACE TRIGGER guid_table_bi

BEFORE INSERT

ON guid_table

FOR EACH ROW

BEGIN

IF :NEW.pky IS NULL

THEN

:NEW.pky := SYS_GUID;

END IF;

END;

/

And I can now insert into this table as simply as this:

INSERT INTO guid_table

(NAME

)

VALUES ('Joe'

);

In the case of Qnxo, since we decided to store the formatted version of the GUID, we created tables with primary keys defined as follows:

CREATE TABLE sg_favorite (

universal_id VARCHAR2(38) PRIMARY KEY

...

);

Encapsulating Oracle's GUID function

If you decide that you want to work with Oracle GUIDs in the RAW format and never need to deal with the formatted version, then you are pretty much ready to go. If, on the other hand, you want to able to work with GUIDs in a "standard" format, you may want to take advantage of the guid_pkg I wrote to assist in Qnxo development.

The elements in the package are:

guid_t

A SUBTYPE that encapsulates the RAW(16) declaration of a GUID and gives it a name. With this subtype in place, you can declare GUIDs in your code as follows:

DECLAREmy_guid guid_pkg.guid_t;

formatted_guid_t

Another SUBTYPE, this time designed to allow you to easily and readably declare variables that hold a formatted GUID, used as follows:

DECLAREmy_guid_with_brackets guid_pkg.formatted_guid_t;

c_mask

A mask that specifies the valid format for a formatted GUID. It also serves as a wildcarded string that I can use within a LIKE statement to determine if a string matches the standard GUID format.

is_formatted_guid

A Boolean function that returns TRUE if the string passed to it has a format consistent with the standard GUID format.

formatted_guid (2)

Two overloadings of functions that return a formatted GUID. In the first overloading, you pass a string, which could be a RAW GUID value or a VARCHAR2 string. It will return a string containing the GUID in the standard format (as specified by c_mask).

The implementation of this package is shown in Listing 2. It relies on the SYS_GUID function to generate a new GUID value, but otherwise is devoted to applying the format mask and converting un-formatted GUIDs to strings that contain the correctly-placed hyphens.

Listing 2. The body of the guid_pkg package

CREATE OR REPLACE PACKAGE BODY guid_pkg

IS

FUNCTION is_formatted_guid (string_in IN VARCHAR2)

RETURN BOOLEAN

IS

BEGIN

RETURN string_in LIKE c_mask;

END is_formatted_guid;

FUNCTION formatted_guid (guid_in IN VARCHAR2)

RETURN formatted_guid_t

IS

BEGIN

-- If not already in the 8-4-4-4-rest format, then make it so.

IF is_formatted_guid (guid_in)

THEN

RETURN guid_in;

-- Is it only missing those squiggly brackets?

ELSIF is_formatted_guid ('{' || guid_in || '}')

THEN

RETURN formatted_guid ('{' || guid_in || '}');

ELSE

RETURN'{'

|| SUBSTR (guid_in, 1, 8)

|| '-'

|| SUBSTR (guid_in, 9, 4)

|| '-'

|| SUBSTR (guid_in, 13, 4)

|| '-'

|| SUBSTR (guid_in, 17, 4)

|| '-'

|| SUBSTR (guid_in, 21)

|| '}';

END IF;

END formatted_guid;

FUNCTION formatted_guid

RETURN formatted_guid_t

IS

BEGIN

RETURN formatted_guid (SYS_GUID);

END formatted_guid;

END guid_pkg;

/

GUIDs: a nice addition to the toolbox

I'd heard of GUIDs, but never had reason to work with them before developing Qnxo. I started with standard, sequence-generated primary keys in my script table, but found that this approach was sorely lacking when it came time to distribute that content (reusable scripts and templates) to my user base.

By relying on GUIDs, I was able to greatly simplify the logic needed to upgrade distributed content from a central source. Not only did I avoid the problem of conflicting sequence-generated primary keys, but I was also able to keep intact foreign key references, since they were based on unchanging GUIDs.

15 comments:

Hi,in our .NET driven project we made the same experiences with GUID's; On Windows based databases the output was very randomly. On the production database (AIX) only a fraction of the GUID changed (the time fraction?);There are tow poor things how oracle handles GUID's:* No supported build in type (they are designed as RAW(16)) - would like to have a built in like Timestamp* Hence no support for Query rewrite : ( see my article : http://orcasoracle.bloghi.com/2005/12/09/guids-or-query-rewrite-why-my-materialized-view-is-not-used.html )But on (.NET) developer site they do not want to be without their GUID's;RegardsKarl

There is one major flaw in your formatting algorithm for Guids. It only affects you if your data needs to be cross-platform with SQL Server. To see what I mean, take any guid value and do this on SQL Server:

select convert(binary(16),'your guid here')

Notice how the byte ordering of the binary(16) is quite different from the apparent ordering of the bytes in the GUID string value.

Now, I right a .Net application which fetches the first row of sometable from both platforms.

The System.Guid returned from SQL Server will not be equal to the System.Guid returned from Oracle.

Why? Because on SQL Server, the base datatype of a Guid is actually a binary(16) (equivilent to Raw(16) on Oracle). But when the Guid value is marshalled into the binary, the bytes are re-ordered like this:

432165769ABCDEF

Yuck.

So, if you're going to be a multi-platform product, either:

a) use the Binary(16) datatype on SQL Server to hold your guids orb) write an icky byte-swapping routine on Oracle and make your application slow as molasses.

"See the difference? The algorithm that Oracle is using on these operating systems is clearly faulty; yes, the values are unique but they are far from even pseudo-random."

I wouldn't normally bother commenting on a two-year-old posting, but this is getting referenced and it is wrong. If you look at the spec for UUIDs -- RFC4122 -- you will see that there are more than one legitimate version of the UUID. Oracle, on Unix, is seems to be implementing Version 1 UUIDs. On Windows, it is probably calling Win32's UuidCreate(), which generates a Version 4 UUID (earlier versions of Windows also generated Version 1 UUIDs, but some people didn't like having their MAC address included in their UUIDs, so Microsoft switched to the slightly less reliable but more private Version 4 UUID).

UUIDs are required to be unique -- wildly different is not part of the specification.

A little point of view from a project I used to work on; a person should not only consider the issue of distributed databases where you may have multiple instances of the same application/tables merging into other tables, but also the issue of distributed data itself. Businesses are more and more using data mining and other business intelligence techniques to combine data from widely heterogeneous systems and simple sequential numbering schemes just won't work well in these combinations.

Several examples:

a) More and more, enterprise databases are cross referencing data from multiple applications and problem domains. The sales database references the inventory database, both of which may reference the customer database and/or the employee database. All of these databases may be part of different applications and indeed may use different DBMSs because the apps may have been provided by different vendors. Indeed, one app and/or database may not even expose the database - CRUD operations may take place via a SOAP or other web service interface.

b) MetaData - which is domain I used to work in - where the data types are defined so that you have a central repository to manage what the different data in different databases and apps map to in other databases and apps. Or you need a repository to store semantic information.

In my experience, we did not use the GUID functionality in various DBMSs for several reasons:

1) Not all DBMSs have this functionality and we needed to support at least three different DBMSs.

2) For various reasons we needed to generate the GUID before we interfaced with the database - we didn't want the GUID automatically generated and then have to retrieve it so we could use it elsewhere. This was especially true for bulk import operations. Then there is the issue of how the different DBMSs that support GUIDs implement them and how to use the results of one in another.

3) We had to interface with other apps which may have their own generated GUIDs.

That said, being a total beginner with regards to database design and performance, I had given no thought to the impact a GUID/UUID would have on the indexes (page locality). Doing a search with Google does turn up quite a number of performance problems when random GUIDs are used in databases as primary keys, so maybe this is why Oracle used sequential GUIDs.

We are producing enterprise framework and three months ago rewrite all database routines to work with GUIDs instead of auto-generated integers and other technologies. Before that we wasted two weeks, gathering all information available about guids as primary keys in databases.It's quite a question, what is better, but GUID has several advantages, some can be found in previous posts and some others are (useful for several users working with db simultaneously):

1) Now primary keys can be generated on client-side without connection to database. a. First it gives possibility to generate data offline (without connection to DB to get next autoincrement ID). b. Several rows can be created with referenced columns:

As you can generate both department and worker IDs in program, you can create for example, 2 departments and 7 workers with all fields fulfilled and just call 9 INSERT statements (or one packet query). In traditional (autoincremented) case first you need insert departments, get their ID's, fill Department_ID field for workers and insert them. In our case we add up to 100 rows from 8 tables connected by 18 referenced fields in one transaction without any crazy-mind logic.

2) Row ID is unique through database, not table. So one Guid is enough to define row (for example, we store ID's of recently deleted rows).

About problem of sequential and random Guids. Never, please never use random Guids as primary key for pretty large database. SELECT/UPDATE/DELETE will be OK, but INSERT will make DBMS to rebuild index greatly, so several inserts will slow down DBMS. Really, generating sequential Guids it's not a problem and solves this problem with INSERT.But take care (as said in Chris F. comment), about bytes-order. If interested, I can share class (C#) that generates sequential ID's for different DBMS (MS SQL, MySQL, Vista DB, easily Oracle can be added) that generate pretty-random but strictly sequential IDs even from parallel computers.