Coding Style

If you’ve known me for a long time, and let’s face it, if you’re visiting this blog, you probably do, you know that I’m particularly proud of the work I was able to do at Network Solutions as the Director of Database Engineering and Development.

So in yesterday’s post I mentioned that I wasn’t happy with my solution. Among several reasons were the limit on the first 100 numbers, and the lack of using what I think to be a more elegant solution with recursive sub-queries.

Also, I received a comment about problems with the MarkDown plugin — which was also affecting my code posting, so I disabled MarkDown and can now handle code posting much, much better.

I couldn’t let it rest, so I dug into the recursive sub-queries with a vengeance and wrote up something more elegant.

During the course of recent twitter / RSS reading, I came across an interesting post related to the old FizzBuzz problem. The post intrigued me for 2 reasons: it talks about solving a basic problem in Scala, and it emphasizes the use of functional programming. And just for grins and chuckles it calls out OOP based on an old paper from 1984 by John Hughes on Why Functional Programming Matters.

I’m a big fan of functional programming, even if my language of choice, SQL, is a poor example of it. Although, I tend to think functional when I write SQL, I know that the language is limited in that respect.

I’ve always been fascinated by development databases — more so sometimes than huge, heavily utilized production ones. Mainly because I’ve seen how the beginnings of a performance problem, or the start of an elegant solution takes shape within a development database. It’s one of the reasons why I love high levels of visibility through full DDL-auditing within development. I love to SEE what database developers are thinking, and how they are implementing their ideas using specific shapes of data structures.

One of the concepts I’d love to see is a “river of news” panel within development tools to see what is going on within a development database. Some of the good distributed source code control systems do this now.

Pity us poor US DBAs — safely secure using our ancient, many-times-upgraded Oracle 6 databases with their US7ASCII character sets.

We knew that ASCII only covered 0-127, but who could blame us when we started putting WE8MSWIN1252 "international" characters into those fields — the database let us, and it felt kind of sexy putting in cool European characters with umlauts and accents on them.

Besides, all of our business was with other American companies, and if someone had some "funny" characters in their name, then they just had to change them!

Of course, all of this is said with tongue firmly planted in cheek. Nowadays you’d better be able to handle Unicode in your database if you want to have a prayer of not being labeled as something older than teleprinters and typewriters.

I first encountered this situation when working with a US7ASCII database where we started using XMLTYPE columns — little did I know that XMLTYPE columns actually validated the character set of the XML document coming in — one of our fields was the country name.

Everything was fine until February 13th, 2004 — the day ISO added an entry for the Aland Islands… (which has an A with a diacritical ring above it).

We started seeing errors inserting our XML documents — all due to strict validation of the character set. Did we change character sets? No — we stopped using the XMLTYPE columns

Fast forward a few years and now I’m lucky enough to work with proper databases created with the AL32UTF8 character set — so now I can store my friend Mogens Noorgard name correctly (or I would if I could spell it…)

However, little did I realize that I needed to declare my columns differently…

You see, back in the day, VARCHAR2(10) meant that I wanted to store up to 10 characters in the column gosh darn it — I didn’t worry about bytes vs. characters — same thing right?

Er, no.

So in a brand new database with an AL32UTF8 character set, why was I getting column length errors trying to insert the string H,U,”Y with an umlaut” into a VARCHAR2(3) field?

Heck, isn’t “Y with an umlaut” just another character? It’s just WESMSWIN1252 character 255, right?

Don’t tell me it’s a character set issue — I’ve been trying to avoid opening up that NLS manual for years…

Ok, ok — open the manual and start reading about Unicode — specifically UTF-8. Uh-oh, I read the words "variable-length encoding" and the light starts to dawn…

Turns out that “Y with an umlaut” is 1 byte in WESMSWIN1252 (specifically 0xFF), but it’s 2 bytes in UTF-8 (0xC3BF).

But didn’t I declare the column to be 3 characters in length? So why does it care about the underlying encoding?

Enter NLSLENGTHSEMANTICS and the fact that the default is set to BYTE.

NLSLENGTHSEMANTICS

By default, the character data types CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data.

This works well if the database character set uses a single-byte character encoding scheme because the number of characters is the same as the number of bytes. If the database character set uses a multibyte character encoding scheme, then the number of bytes no longer equals the number of characters because a character can consist of one or more bytes. Thus, column widths must be chosen with care to allow for the maximum possible number of bytes for a given number of characters. You can overcome this problem by switching to character semantics when defining the column size.

NLSLENGTHSEMANTICS enables you to create CHAR, VARCHAR2, and LONG columns using either byte or character length semantics. NCHAR, NVARCHAR2, CLOB, and NCLOB columns are always character-based. Existing columns are not affected.

You may be required to use byte semantics in order to maintain compatibility with existing applications.

NLSLENGTHSEMANTICS does not apply to tables in SYS and SYSTEM. The data dictionary always uses byte semantics.

Note that if the NLSLENGTHSEMANTICS environment variable is not set on the client, then the client session defaults to the value for NLSLENGTHSEMANTICS on the database server. This enables all client sessions on the network to have the same NLSLENGTH_SEMANTICS behavior. Setting the environment variable on an individual client enables the server initialization parameter to be overridden for that client.

Sigh…

Can anyone tell me why the default would be BYTE? Why would I want to declare character fields with BYTE lengths? Thank goodness it’s not in bits…

Anyway, we adjusted our standard to make sure that DDL always specifies BYTE or CHAR in the declaration now:

In recent engagement, I come across a "smelly" construct (database smells) that looks like this:

Select max(errorid)+1 into newerrorid from errorlog;

"Why aren’t they using a sequence?", I wondered.

The reason, of course, is that the PL/SQL developers need to request the creation of each and every object from the production support DBAs, and since such requests require review by the central data architects for correctness before being approved for creation in development, the process can take 4-5 days. As a result, they took this "shortcut". (Reason #392 of why I don’t think production support DBAs should have any place in the development process, but that’s another story).

The good news is that they recognized this was bad after I pointed it out, and they went ahead and requested the sequence.

One week later, we get the sequence, correct the code and promote it to the integration environment.

Where we promptly get uniqueness violations when attempting to insert rows into the table because the sequence number was less than the max(errorid) already in the table.

"No problem!", I said – I didn’t want to re-create the sequence with a larger "start with" (due to the turnaround time), so I take a lazy shortcut:

Declare I number; J number; begin select erroridseq.nextval into I from dual; select max(errorid) into J from errorlog; while I <= J loop select errorid_seq.nextval into I from dual; end loop; end; /

Yes – I know this is all kinds of horrible, but I was in a hurry and didn’t think.

And the worst part is that it didn’t even work.

They still got uniqueness violations and came to me later saying that there were problems with the sequence – that when they selected maxval from the sequence in TOAD they got one value (1000), and when they selected maxval from the sequence via SQL Developer, they got another value (300).

What did I forget / do wrong? What should I have done?

I eventually figured it out and "fixed" it.

There’s a coda to this – after I smacked the palm of my hand to my forehead and then explained the problem to the PL/SQL developers I thought they understood it. But later in the day they came to me and said they were having the same problem with a different sequence (getting different – and much smaller – values when selecting maxval from different tools)…

Do you hate arbitrary requirements? You know the ones — like: the customer account number must be a 10-digit number without any leading zeros and no more than 3 repeated digits? Don’t you always try to argue the user back into letting you use a simple sequence generator — maybe giving in on the leading zero requirement, but arguing against trying to make sure there aren’t 3 of the same digits in a row?

Maybe if you thought there was a good reason, or authority or research on why that requirement was a good idea, then you’d see it as an interesting challenge rather than a burden? Maybe if you read a post by Seth Godin about it?

Wonder if the user thinks that some of the database limitations are arbitrary?

I’m somewhat surprised to see a lack of Oracle blogging reaction to the recent post on The Daily WTF which goes into great detail on a case of SQL injection. Maybe we’ve either become tired of it or we assume that “my systems don’t do that!”.

So, how do you audit or track if your system is being hit by injection? How would you detect it? Assume you’re “just a DBA” — and no one tells you about applications being deployed that talk to the database. Is there a way you could tell just by looking from within the database? What kind of assumptions would you make?

There has been an interesting and somewhat heated discussion going on about a recent blog post by Dominic Brooks and referenced by Doug Burns about the relative value of data vs. applications. Actually, most of the heat seems to be directed at a comment made by Tim Gorman on several mailing lists in which he states that:

Data, not programs, is the only thing that matters — applications are transient and have no value except to acquire, manipulate, and display data. Data is the only thing with value.

I’m not actually going to add any fuel to that fire, only offer up some observations. I think I agree with many who are stating that data that lies about, unexploited by any application, is a pretty useless waste of storage. That the true value of data comes from an ability to use it through an application which allows one to analyze, manipulate and visualize information synthesized from the data soup. One reason I’m excited about the new company I’m with is its focus on helping people increasetheir ability to exploit their data.