Thursday, October 06, 2005

The Query Language

Now we stop talking about security temporarily – we will talk about the security aspects of each point in the application in detail- and move to another point which is the Query language we are about to use.

When I first thought about my database application, I didn't imagine that I might think of using any language other than SQL. SQL is an ANSI \ ISO standard language, used by most databases, and every body knows it.But I discovered that a lot of talking is there on the web about the SQL language and how it doesn't completely conform to the relational data model which was derived from the relational algebra and so on.

In 1970, E. F. Codd from IBM introduced his paper A Relational Model of Data for Large Shared Data Banks. In this paper, he put a data representation system derived from the relational algebra which is called the relational data model. The relational data model was not the first data model ever. But mainly, the relational data model was put to solve problems in the previous data models such as the hierarchical data model.

The fundamental assumption of the relational model is that all data are represented as mathematical relations, i.e., a subset of the Cartesian product of n sets. In the mathematical model, reasoning about such data is done in two-valued predicate logic (that is, without NULLs), meaning there are two possible evaluations for each proposition: either true or false. Data are operated upon by means of a relational calculus and algebra.

The relational data model permits the designer to create a consistent logical model of information, to be refined through database normalization. The access plans and other implementation and operation details are handled by the DBMS engine, and should not be reflected in the logical model. This contrasts with common practice for SQL DBMSs in which performance tuning often requires changes to the logical model.

The basic relational building block is the domain, or data type. A tuple is an ordered multiset of attributes, which are ordered pairs of domain and value. A relvar (relation variable) is a set of ordered pairs of domain and name, which serves as the header for a relation. A relation is a set of tuples. Although these relational concepts are mathematically defined, they map loosely to traditional database concepts. A table is an accepted visual representation of a relation; a tuple is similar to the concept of row.

The basic principle of the relational model is the Information Principle: all information is represented by data values in relations. Thus, the relvars are not related to each other at design time: rather, designers use the same domain in several relvars, and if one attribute is dependent on another, this dependency is enforced through referential integrity.

The problem with SQL is that it is not a good representation of the relational model. SQL contains a lot of what is seen by some people as problems. Some of these are:

SQL doesn't allow the developer to define his own data types. He cannot create objects with relations between these objects, although the relational data model contains the creation of objects and building relations between them.

SQL allows for writing redundant queries. Codd warned from this when SQL was first made, and said that this will give different performance based on how the query is written. It appears now that he was true.In an amazing experiment, two queries that return the same result were executed on a non optimized DBMS. The first query executed in 2 seconds and the other was executed in 2500 seconds!Today, most popular DBMS uses algorithm to remove the redundancy from SQL queries before executing them. For example, SQL server uses what is called "Query optimizer" to remove such redundancy.But this is a pitfall that the standards introduce to the industry, and the developers of DBMS have to avoid it. The performance of your DBMS depends on how much optimization and redundancy removal you can do to the queries- this is not the only point on which the performance depends, but it is one of the points.

One of the symptoms of the inefficiency of SQL is the changes to the data models that are taking place at a high rate. Some data bases come with what they call "Object relational data model" others use XML- such as SQL server 2005- although the XML is the hierarchical data model, which was for its problems discarded when the relational data model was put!

But these are not the only opinions in the subject. Some people see that there is no need for a complete implementation for the relational model, and SQL is good for the industry. Even some mathematicians see that the relational data model is wrong and not well derived from relational algebra.

Most of the papers talking about the flaws in SQL are put by four people: Fabian Pascal, E.F. Codd, Hugh Darwen and C.J. Date. The problem is that most of these papers are not free, so I could not read most of them.

So I couldn't exactly figure out specifically what are the problems in SQL. What is the software industry loosing with using SQL as it is now? I cannot get the whole idea without reading those papers, which I cannot do because they are not free.

I have two options; the first is to use SQL as my query language. The second is to put my own query language.I cannot write a new query language with out a detailed understanding of the problems of SQL, and unfortunately I don't have this deep understanding.The second is to know how to put a language. How to choose commands and syntax rules, which is something that I have a very small knowledge about.

Putting a new query language can easily be a new subproject inside my database project, which I am working on alone.So the decision I made is to use SQL as it is.

The next time – God willing – we would be talking about using multithreading in my database application.