If this is your first visit, be sure to
check out the FAQ by clicking the
link above. You may have to register
before you can post: click the register link above to proceed. To start viewing messages,
select the forum that you want to visit from the selection below.

String v/s Integer as primary key

Design/Modeling Question: How bad/different can 2-3 tables join be with a string field (individual) as primary key versus having additional integer field (individual) as primary key. This table can have upto 100000 records.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

I'm sorry but your paper assumes you have a donkey modeling your data. Who's going to use a varchar2(50) "department description" column as a PK?

Well, it was a department 'name', and not a description, and I've seen it more times than I can count.

In general, when I've had that question, it is because someone already has a string in mind that will uniquely identify each row, and that string column generally falls into one of two categories:
- The existing name of something (always larger than a numeric ID)
- A 'meaningful' code for something, such as a shipping code that, when broken into component parts, specifies the ship-from location, the ship-to location, and the date, or something equally ridiculous. (Again, always larger than a numeric ID)

I have *yet* to have someone say: "Hey, can I use a 2-byte string instead of a 2-byte number to be the PK on this table".

Because at that point, what is the purpose of using a string anyway? If you truly have a choice, why wouldn't you use the sequence-generated numeric PK? Why would you try to use some other non-automatically-populated string value?

The next question is: Where does that string value come from anyway? The user?

If that is the case, then we have another reason not to use that string field as a PK: It is user-entered and users can make mistakes.

I have attached another snippet from my book that describes the issue of meaningful data and other issues with the definition of Primary Keys.

And finally, NUMBER(5) will take less storage space than VARCHAR2(5).

First, remember that different character sets store string data differently. Sometimes it's 1-byte per character and sometimes it is 2. But even at 1 byte per character, you're losing out. 1 byte of numeric data can store a value of up to 255. To store the same thing in a string variable would take 3 bytes. The NUMBER datatype can store a number with up to 38 precision in only 21-22 bytes. So unless I'm missing something, a numeric column will almost *always* be tighter than an equivalent string column.

The only way I can see a string winning is if the string is used more like a number with greater than base 10, like base 16 (or up to base 255 if we could use every possible character). So, say we used 'ZZZ0' to represent 35*36^4+35*36^3+35*36^2+0, which is 60464880. For a numeric column, we would need at least 30 bits to store the same number, as 2^30-1 = 1073741823. And since the closest number of bytes is 4 (32 bits), it would still take the same amount of space (4 bytes).

As always, I'm more than willing to be proven wrong, but it would seem to me that there is almost no good reason for using a string PK. As a matter of fact, the only time I do is when there is a 'universally'- accepted set of codes for something, such as a States list or something similar.

1) There's not much situations when a small string can identify something. I agree whith that.

2) User entering PK's for tables is not cool. It's a bad user interface. And can lead to numerous troubles. I agree.

3) There is diferences between storing numbers and varchars. Number are always small. I agree. I never developed a system in witch was necessary to store hexadecimal numbers.

But I don't agree that is a RULE saying "Numbers are better".

You listed one example of that: State lists.

There are numerous business specific data that are not in the silly cases you listed, are small and meaningfull. They are, usually, lookup tables, so user doesn't usually insert data. They can have varchar2 columns and no problems with space.

As I can't find some documentation about it, I made my example:

CREATE TABLE test_space( col_1 NUMBER(10));

BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO test_space(col_1) VALUES (i);
END LOOP;
COMMIT;
END;
/

/*
Takes this space:
DADOS2 10088
*/

DROP TABLE test_space;

CREATE TABLE test_space( col_1 VARCHAR2(10));

BEGIN
FOR i IN 1..1000000 LOOP
INSERT INTO test_space(col_1) VALUES (To_char(i));
END LOOP;
COMMIT;
END;
/

/*
Takes this space:
DADOS2 13208
*/

So, to a million record table (ten times bigger than 100000 mentioned in daljitsb's question), we have about 3M size difference. It's something arround 30% bigger. A lot, sure, but if I use a numeric pk it needs to be referenced in, at least, 3,33 other tables to justify the extra space on index. As there is no 3,33 tables, we need to join this table to 4 other tables all of then with a million records to justify a numeric pk just for space saving.

So, there are 'cases' and 'cases'. They need to be evaluated and specifyed to know witch is better.

And, correct me if I'm wrong, assuming that space used is equal for memory saving, there is no difference to a oracle server to join two tables using varchar or number columns...

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.