Pages

3 Apr 2008

SQL is in Fact a Programming Language

While I was updating my blog recently, I read the top part of it which said "My adventures in database developing". I realized that I have really neglected database developing with all my research into other programming languages.So I was trying to play around with SQL statements to brush up on my skills.

Luckily (or unluckily), someone at my work, a java developer, needed to do a data export and thought that I can do it in 1 SQL line. He said.... and I quote "You only need 1 SQL line. Whats the problem?".Whenever I hear the "whats the problem" line, there usually is one.So, with him having great confidence in my SQL-ing ability or a complete misunderstanding of database developing, I was forced to do the data export.

I realized early on that it would take a long time. Just to start off, I needed like 20+ columns, but each column was a row somewhere in 4 different tables.

To make a long story short, I did complete the task and to simply pat myself on the back, I will paste it here. Just to show that, yes, you can program in SQL. (but next time, please give me rights on the server to do it with several views).

The SQL statement is about 2-3 pages long, by the way andyou don't really need to read all of the statement

left join (CAMPAIGN_ATTRIBUTE message2 INNER JOIN SUBSCRIPTION message on message2.CAMATTR_SUB_ID=message.SUB_ID and message2.CAMATTR_ATTR_ID=50002) on message.SUB_CON_ID=CONTACT.CON_ID and SUBSCRIPTION.SUB_ID in (select distinct(CAMATTR_SUB_ID) from CAMPAIGN_ATTRIBUTE group by CAMATTR_SUB_ID)

8 comments:

LOL! Yes, SQL is a programming language. And like most programming languages, there is a graceful way to do design and architecture and then there is the way that I call "blecherous" (it makes you say "blechh!").

As soon as I saw the fourth field in your select-list I knew what was up:

lastname.CONATTR_VALUE

I feel sure this is an instance of a common SQL design blunder called Entity-Attribute-Value. It's not your fault they designed the database this way, but now you can see how difficult it makes your job using that design.

People who design their database with Entity-Attribute-Value practically don't need a SQL database at all -- they need an XML file.

I'm currently writing a book called "SQL Antipatterns" and EAV is high on the list.

I'm also presenting "SQL Antipatterns" as a tutorial at the MySQL Conference on April 14.http://en.oreilly.com/mysql2008/public/schedule/detail/1639

More details here:http://karwin.blogspot.com/2008/03/speaking-at-mysql-conference.html

This design should be shunned because of how cumbersome and opaque it is to code against it. All the ID's in the join clause to pick the right attributes are a tell-tale.

But still...what if you have an application like MS Project: it lets you add custom properties to tasks, resources, etc. And, I've done a few projects where we really needed that to calculate custom performance indicators for project progress. Under the hood, MS Project has a entity value module in the database schema to implement this, and indeed - you get these typical queries, which sucks.

But what about the alternatives? Surely, MS Project could just add a new column for a custom property, but the problem is that there are different types of projects and we'd end up with many columns that are basically useless to most types of projects. So that does not seem a good solution. Alternatively, a separate table could be created to hold the custom properties for a single project or group of projects of a certain type.

The disadvantage of modifying the schema to accomodate custom properties would mean that all users that save the project in the db would need to have the privileges to ALTER or CREATE tables. After that, access to those new objects must also be granted to all other users that might need to read those projects.

Not at all impossible, but still, application users with the rights to create database objects....

Another option would be to indeed store all those custom things in some kind of semi-structured format such as XML. But would that really make it better? I seriously doubt it - esp. once we need to combine date stored in custom properties with other data in the db, I think I'd stil rather settle for an entity-attribute approach - rather that than springling XQuery / XPath constructs in my queries to get it done - I mean, it seems a cure that's worse than the disease.

So - what are your thoughts? How to deal with partially extensible datamodels/ structures in relational databases? Should we just give up, or put up with a number of these annoying edges? I really don't know...

My short answer is: if you can't describe your data, then SQL isn't the right tool to query it.

Sometimes you need semi-structured data, or you need the capability of adding attributes without changing the application code or the database metadata.

In those cases, you are asking for a greater amount of flexibility than SQL was designed to provide. Every programming task makes certain assumptions to achieve simplicity, and SQL makes the assumption that a given entity has a stable set of attributes.

EAV breaks this assumption -- it's designed to accommodate dynamic attributes. But it should be no surprise that the resulting SQL isn't very graceful.

There are emerging standards that treat data and metadata as virtually the same thing, and allow both to be dynamic and queriable. RDF/XML is one example, and there's even a standard query language for it -- SPARQL.

I'd say we're 10+ years years away from having de facto SPARQL interfaces as common as SQL interfaces, and RDF data stores that are as efficient as RDBMS products are today.

In the meantime, EAV can be used for semi-structured data. Though we must recognize it comes at a high cost, and perhaps in the majority of cases where it is used, it'd be simpler to add attributes with ALTER TABLE ADD COLUMN.

Other options include a hybrid between conventional and EAV tables. Store common attributes in the conventional way as columns, but allow extended attributes to be stored in an EAV table referencing the conventional table. At least this allows some queries to avoid the EAV table, if the query only needs to reference the base set of attributes.

Another option is to use "subtables" that extend a parent table, just like an object-oriented derived class adds properties and methods to those inherited from its parent class. It takes a join to reconstitute the full row, but it's better than EAV which requires a join per attribute.

All the solutions naturally add complexity, and therefore you should evaluate them on a case-by-case basis.

@bill karwinI saw that you will be speaking in MySQL conference and I have to say that your topic seemed the most interesting to me :)So please link the audio/video/slide files somewhere so I can see afterwards.

Regarding the design of this database, I just call it "highly normalized". The people who made the database had to factor in many different circumstances, locations in which it could be deployed and languages that it could be used with. So they opted to do it this way.

My main problem is that if you want to produce reports of some kind, its very very slow.

I worked somwhere else where they had very much the same problem and it was my task to make it go faster.

So I suggest just having an extra table (or atleast view) that duplicates the data from many smaller tables and integrates them.

On topic, to do this in java would probably taken 10x more lines to develop.

"My short answer is: if you can't describe your data, then SQL isn't the right tool to query it."

Well, in this case, the data can be perfectly described, just not at the same time as the database schema was designed.

Whether it is "highly normalized" or "not normalized at all" is a matter of perspective.

From the point of view of the developers of the original application, that could not see whatever custom attributes the users would be adding, this EAV pattern is a normalized design to solve the problem of custom attributes. The could also have created a bunch of "spare" or "extra" columns in the existing tables which would depending on what they would be used for hardly be better normalized.

On another level, it is more arbitrary than it may seem whether something should be considered to be unnormalized or 1st normal form.

For example, take phone numbers. In some cases, we don't mind the phone number having a structure of it's own and we choose to regard it as an atomic value. In other cases, we want to separately store and search parts like country and area codes. Similar cases are URLs and email addresses.

For multivalued attributes, we can witness the same. Sometimes it might be necessary to store someones initials. Although we know that each letter stands for a proper name, we will usually choose to treat it as an atomic value.

In the end, the requirements of the application will govern what will be considered an atomic value. There is no objective generic criterion that can tell us that.