Most of today’s programs handle data in one way or another and often this data is stored in some kind of relational database. To read and modify this data, a program must bridge the gap between its representation and the one used by the dialect of SQL that the database provides. This bridge typically comprises a database API that sends queries as text strings, often accompanied by some kind of table-to-object mapper that has to coerce data and relationships in both directions, usually with elaborate layers of abstraction in an effort to keep the two sides loosely coupled.

“Yet by obscuring the true data source these solutions end up throwing away the most compelling feature of relational databases: the ability for the data to be queried.”

“It was Codd’s very great insight that a database could be thought of as a set of relations, that a relation in turn could be thought of as a set of propositions (assumed by convention to be true), and hence that all of the apparatus of formal logic could be directly applied to the problem of database access and related problems.”

This approach not only adds complexity and increases the need for data transformations but, most importantly, it destroys the significant advantages provided by the relational model of data. The relational model is built upon predicate logic which brings the power of formal reasoning to data: it is the only sound foundation available.

A number of approaches and frameworks have been proposed to span the gap between the two systems; most never question why there are two systems in the first place.

Microsoft’s forthcoming LINQ to SQL (formerly DLinq) is a major attempt to bring SQL closer to the program than ever before, but it will still keep the database sub-language and all that it entails.

“It is no wonder that applications expected to bridge this gap are difficult to build and maintain. It would certainly simplify the equation to get rid of one side or the other. Yet relational databases provide critical infrastructure for long-term storage and query processing, and modern programming languages are indispensable for agile development and rich computation.”

The solution to the problem is not to get rid of one side or the other, nor to have one side overlap the other, but to merge the two sides into one: supersede SQL (the COBOL of database languages) with a true relational programming language, one that is computationally complete, and then the gap disappears. Our solution uses one of the most effective, expressive and readable languages available, Python, and extends it with relations and a sound relational algebra.

Since its inception in 1969 by E. F. Codd, the relational model has been the foundation for nearly all databases. It replaced earlier network and hierarchical ad-hoc approaches to data storage by being as simple as it needed to be, but no simpler. It was so powerful it allowed users to ask for what they wanted to find, rather than specify how they might find it.

Over the decades, SQL has become the de-facto language for relational databases, but SQL misses many of the benefits of relational technology. In recent years, partly due to SQL’s weaknesses and partly due to minimalistic and stagnant implementations, the database has become merely a storage engine fronted by layers of drivers, mappers, hierarchical markups and frameworks which make flexible querying both complex and distant from the application code.

Having implemented a comprehensive, standards-compliant SQL server, ThinkSQL, we did some further research into the history of SQL’s dominance in the marketplace and its quirky syntax. We found a far superior alternative in the form of D[1], a generic name for any relational language that conforms to The Third Manifesto. We’ve implemented such a language, Dee, as an extension to Python.

The current version of Dee is an initial release to gain feedback regarding the approach. We chose Python because its interpreted style, dynamic typing and built-in sets and dictionaries make it ideal for interacting with data; plus any language that allows you to do the following sorts of things has got to be good:
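As a plain-Python illustration of that expressiveness (nothing Dee-specific): sets, dictionaries and comprehensions are built into the language, so set-oriented manipulation reads naturally.

```python
# Plain-Python illustration: sets, dictionaries and comprehensions
# are built into the language.
marks = {'Anne': 85, 'Boris': 49, 'Cindy': 72}

# One expression to find everyone with a passing mark, as a set:
passed = {name for name, mark in marks.items() if mark >= 50}

assert sorted(passed) == ['Anne', 'Cindy']
```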

See Why Use Python? for more information on the advantages of the language. A guide to the Python language can be found in An Introduction to Python. We do assume you are familiar with Python in what follows.

To start using Dee from within the Python interpreter or from a Python program, first import the module. (For demonstration purposes we import everything, but it’s recommended that you import only the features you need.)

A Relation comprises a heading and a body. The heading is a set of attribute name/type pairs. The body is a set of tuples. Each tuple in the body comprises a value for every attribute in the heading. To specify a relation literal, pass the heading as a list of attribute names followed by the body as a list of tuple literals, e.g.:
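Based on that description, a Dee-style relation literal would look something like the commented lines below (a hypothetical spelling; the exact constructor details may differ), and the underlying set semantics can be modelled in plain Python:

```python
# Hypothetical Dee-style literal, with the shape just described
# (heading list, then body as a list of tuple literals):
# IS_CALLED = Relation(['StudentId', 'Name'],
#                      [('S1', 'Anne'), ('S2', 'Boris'), ('S3', 'Cindy')])

# Plain-Python model of the same value: the heading is a set of
# attribute names, the body a set of tuples, and each tuple supplies
# a value for every attribute in the heading.
heading = frozenset(['StudentId', 'Name'])
body = {
    frozenset([('StudentId', 'S1'), ('Name', 'Anne')]),
    frozenset([('StudentId', 'S2'), ('Name', 'Boris')]),
    frozenset([('StudentId', 'S3'), ('Name', 'Cindy')]),
}
assert all(frozenset(a for a, v in t) == heading for t in body)
```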

Note that there is no duplication in the heading attribute names (they are a set), nor in the tuples of the body (also a set).

Also, we will try to use the term relation variable when we mean a variable that refers to a Relation, and just relation (or relation value) to mean the value of the relation. This is an important distinction. The value of a relation never changes, just like the value 5 never changes.

To assign a relation value to a relation variable, use the standard Python syntax, e.g.

(Note that Python allows an additional comma after the last item in a list, which can simplify copy/paste operations. Also a Python tuple with a single value must have a comma after the value to distinguish it from a value in parentheses, e.g. (7,) rather than (7))
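The distinction is easy to check at the interpreter:

```python
# The trailing comma is what makes a one-element tuple a tuple:
assert type((7,)) is tuple       # a tuple containing the value 7
assert type((7)) is int          # just the value 7 in parentheses
assert (7,) != (7)
```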

Given a relation such as the one denoted by IS_CALLED above, we should take its meaning to be the predicate “student StudentId is called Name”:

The heading supplies the parameters for the predicate, e.g. StudentId and Name are the parameters for the IS_CALLED predicate.

The tuple Tuple(StudentId='S3',Name='Cindy') is an instantiation of that predicate. It is a proposition where the argument values ‘S3’ and ‘Cindy’ are substituted for the parameters. This states that student S3 is called Cindy.

Instead of defining the value of a relation variable once when it is assigned, we can refer to a function to provide the relation. The function can then return different values at different times. One important kind of relation variable that refers to a function for its data is a virtual (or derived) relation variable. A virtual relation variable refers to a function that returns a relational expression. All other relational variables are base relation variables. To specify a virtual relation variable we first need to define a function to provide the data by returning a relational expression. For example (ignore the relational expression syntax for now, we’ll cover the details of that later):
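The idea can be sketched in plain Python (this is a model of the behaviour, not Dee’s syntax): a function recomputes the derived value from the base data on every call, so the virtual relation automatically tracks its base.

```python
# Plain-Python sketch of a virtual (derived) relation variable: a
# function recomputes the value from base data on every call.
is_called = {('S1', 'Anne'), ('S2', 'Boris')}

def boris_students():
    # Derived value: recomputed from the base relation each time.
    return {(sid, name) for sid, name in is_called if name == 'Boris'}

assert boris_students() == {('S2', 'Boris')}
is_called.add(('S9', 'Boris'))          # update the base relation...
assert boris_students() == {('S2', 'Boris'), ('S9', 'Boris')}  # ...and the derived value follows
```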

An attribute value can itself be a relation. Such attributes are known as relation-valued attributes or RVAs. There are a number of relational operators (actually macros) that use such nested relations. For example, GROUP takes a relation, a set of attribute names and a new attribute name, and returns a relation in which those attributes are collected into a nested relation, one per unique value of the non-grouped attributes:
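A plain-Python sketch of what GROUP produces (a model of the behaviour, not Dee’s API): here enrolment pairs are grouped by student, giving one nested set of courses per student.

```python
# Plain-Python sketch of GROUP: nest an attribute into a
# relation-valued attribute, one nested set per unique value of the
# non-grouped attribute(s).
enrolled = {('S1', 'C1'), ('S1', 'C2'), ('S2', 'C1')}

by_student = {}
for student, course in enrolled:
    by_student.setdefault(student, set()).add(course)
# Freeze the nested sets so each becomes a set-valued attribute.
grouped = {(s, frozenset(cs)) for s, cs in by_student.items()}

assert ('S1', frozenset({'C1', 'C2'})) in grouped
```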

A Relation (function-based or not) can also take an extra parameter in its constructor to specify a set of constraints. This takes the form of a Python dictionary where each key gives the constraint name and each value is a pair of constraint-function, parameters. For example, to specify that the “StudentId” attribute is a candidate key for the above relation we could say:
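The shape of such a constraint can be modelled in plain Python (the dictionary form below mirrors the description above; the names are illustrative, not Dee’s exact API): a candidate key holds when projecting the body onto the key attributes loses no tuples.

```python
# Stand-in for the Key constraint check: a candidate key holds when
# the key attributes are unique across the body.
def key_holds(body, key_attrs):
    projected = {tuple(t[a] for a in key_attrs) for t in body}
    return len(projected) == len(body)

body = [{'StudentId': 'S1', 'Name': 'Anne'},
        {'StudentId': 'S2', 'Name': 'Boris'}]

# Analogous to passing {'pk': (Key, ['StudentId'])} in the constructor:
assert key_holds(body, ['StudentId'])
# Adding a second Boris would break a key on Name alone:
assert not key_holds(body + [{'StudentId': 'S3', 'Name': 'Boris'}], ['Name'])
```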

Here, Key is a pre-defined constraint type (actually a function wrapper that creates a function) that takes a list of attributes to enforce the constraint. A constraint function can return True or False and is called whenever the relation is assigned a new value. If no candidate key is specified for a relation, one is assumed comprising all the attributes in the relation (this is displayed in representations as {'PK':(Key,None)}). As another example:

In a number of places we need to pass expressions, e.g. restrictions (where clauses). Python has a built-in way of defining such expressions with anonymous functions using the lambda keyword. So an example restriction for the above IS_CALLED relation could be:
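In Dee the restriction would read something like `IS_CALLED.where(lambda t: t.Name == 'Boris')`; the effect can be modelled in plain Python:

```python
# Plain-Python model of the restriction: the lambda introduces a
# range variable t standing for each tuple in the relation.
from collections import namedtuple

Tup = namedtuple('Tup', ['StudentId', 'Name'])
is_called = {Tup('S1', 'Anne'), Tup('S2', 'Boris'), Tup('S3', 'Cindy')}

restriction = lambda t: t.Name == 'Boris'
result = {t for t in is_called if restriction(t)}

assert result == {Tup('S2', 'Boris')}
```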

In this example, the lambda expression is passed to the relation’s where function and the expression introduces a range variable, t, which will stand for each Tuple in the relation. The expression itself, the part after the colon, tests whether the Name attribute of each tuple is equal to ‘Boris’: if it is then the tuple is included in the result. Any Python expression can be passed this way. So here, complex boolean expressions including boolean operators and function calls can be built, e.g.
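A plain-Python illustration of such a richer restriction, combining boolean operators and a function call:

```python
# Any Python expression can serve as the restriction, including
# boolean operators and function calls.
from collections import namedtuple

Tup = namedtuple('Tup', ['StudentId', 'Name'])
is_called = {Tup('S1', 'Anne'), Tup('S2', 'Boris'), Tup('S3', 'Cindy')}

pred = lambda t: t.Name.startswith('B') or (t.StudentId in ('S1', 'S3') and len(t.Name) > 4)
result = {t for t in is_called if pred(t)}

assert result == {Tup('S2', 'Boris'), Tup('S3', 'Cindy')}
```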

Lambda expressions can also be used as general constraints. On relations, another pre-defined constraint is Constraint. This takes a function that must evaluate to True for the constraint to hold, e.g.:
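A plain-Python stand-in for the ‘MarkRange’ Constraint described below, with ALL modelled by Python’s built-in all(); in Dee the constraint would be passed in the constructor’s constraint dictionary, along the lines of `{'MarkRange': (Constraint, lambda r: ALL(...))}` (the exact signature may differ):

```python
# Stand-in for a relation-level Constraint: the range variable r is
# the whole relation, and ALL is modelled with Python's all().
exam_mark = [{'StudentId': 'S1', 'Mark': 85},
             {'StudentId': 'S2', 'Mark': 49}]

mark_range = lambda r: all(0 <= t['Mark'] <= 100 for t in r)

assert mark_range(exam_mark)
assert not mark_range(exam_mark + [{'StudentId': 'S3', 'Mark': 105}])
```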

Here, the ‘MarkRange’ Constraint uses the ALL relational operator (discussed below) to ensure that all Marks in this relation are between 0 and 100. Note the Constraint works at the relation level and its range variable is r in the example. Useful operators at this level are ALL, ANY, IS_EMPTY, and the relational comparison operators discussed below, because they all take relations and return a boolean result.

The toTupleList method returns a list of tuples from the relation. Since relations are sets they have no order, so to iterate through all the tuples in a relation you must use this method to first extract a list of tuples from it.

This is also the way to access the tuples in a pre-defined order. The toTupleList method can take an extra parameter to define a sort order. The sort parameter is a pair (ascending,attribute-list) where ascending is a boolean flag to indicate whether to sort in ascending order or not, and the attribute-list specifies the attributes to sort on.
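A plain-Python sketch of that (ascending, attribute-list) sort pair (a model of the behaviour, not Dee’s implementation):

```python
# Model of toTupleList with a sort parameter: a pair of
# (ascending flag, list of attributes to sort on).
def to_tuple_list(body, sort=None):
    tuples = list(body)
    if sort is not None:
        ascending, attrs = sort
        tuples.sort(key=lambda t: tuple(t[a] for a in attrs),
                    reverse=not ascending)
    return tuples

body = [{'StudentId': 'S2', 'Name': 'Boris'},
        {'StudentId': 'S1', 'Name': 'Anne'}]

names = [t['Name'] for t in to_tuple_list(body, sort=(True, ['Name']))]
assert names == ['Anne', 'Boris']
```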

We use a small core of relational operators to deliver a large number of operations. For example, we use & (relational AND) to provide natural join, intersection and Cartesian product, and we use it as the basis for implementing restriction and extension. A number of other operators are defined as macros on top of the core ones, e.g. GROUP, and this number can easily be increased. The ideas behind this approach can be found in Appendix A of the Third Manifesto.

One of the powerful uses of & is the natural join. This joins relations together on their commonly named attributes. To make the most of this, without having to rename attributes before each join, use the same name for the same attributes across relations, e.g. if a key on one relation is named “product_code” then use that same name in all other relations in case they need to be joined. Naming it “code” on the product relation and “product_code” on other relations would require the rename operator to be used before doing a natural join (not to mention making the two attributes appear to be different things).
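What natural join does can be sketched in plain Python (a model of the behaviour Dee’s & provides, assuming all tuples in a relation share one heading): tuples are matched on their commonly named attributes and merged.

```python
# Plain-Python sketch of natural join on commonly named attributes.
def natural_join(r, s):
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**tr, **ts} for tr in r for ts in s
            if all(tr[a] == ts[a] for a in common)]

product = [{'product_code': 'P1', 'descr': 'widget'}]
orders = [{'order_id': 1, 'product_code': 'P1'},
          {'order_id': 2, 'product_code': 'P2'}]

joined = natural_join(product, orders)
assert joined == [{'product_code': 'P1', 'descr': 'widget', 'order_id': 1}]
```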

The relational operators are defined as Python functions taking, and usually returning, relations. Many of the common ones are also defined as methods and operators on the Relation class.

Projection is so called because a relation can be thought of as representing a point in n-dimensional space (where n is the number of attributes), and selecting just some of the attributes is akin to projecting that point onto the chosen axes.
Note once again that since a relation body is a set of tuples, there are no duplicate tuples.
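A plain-Python sketch of projection (a model, not Dee’s API): keep only the chosen attributes and let set semantics remove any duplicate tuples that result.

```python
# Project the body onto a subset of attributes; duplicates collapse
# because the result is a set.
def project(body, attrs):
    return {frozenset((a, t[a]) for a in attrs) for t in body}

is_called = [{'StudentId': 'S1', 'Name': 'Anne'},
             {'StudentId': 'S2', 'Name': 'Boris'},
             {'StudentId': 'S9', 'Name': 'Boris'}]

assert len(project(is_called, ['StudentId'])) == 3
assert len(project(is_called, ['Name'])) == 2   # duplicate 'Boris' collapses
```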

Renaming is crucial to our implementation since attributes with the same name are considered to represent the same thing. The mapping of old to new attribute name(s) is given as a Python dictionary (indeed a Tuple would also do).
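A plain-Python sketch of rename with such an old-name to new-name mapping:

```python
# Apply an old-name -> new-name mapping to every tuple's attributes.
def rename(body, mapping):
    return [{mapping.get(a, a): v for a, v in t.items()} for t in body]

product = [{'code': 'P1', 'descr': 'widget'}]
renamed = rename(product, {'code': 'product_code'})

assert renamed == [{'product_code': 'P1', 'descr': 'widget'}]
```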

If you think about it, these are all the same thing - it just depends on whether the relations have some, none, or all of their attributes in common. We implement them all using the AND relational operator, via Python’s &.
Note that since a relation heading is a set of attributes, there are no duplicate attributes.

These are the same thing - it just depends on whether the relations have any of their attributes in common. We implement them all using the OR relational operator, via Python’s |.
For pragmatic reasons, we only implement the Union operator, i.e. where the relations have all of their attributes in common. The more general Or would need to handle an infinite (impossible) number of alternatives.

This is used to add new attributes to relations. First, the list of the names of the extra attributes is passed, followed by a lambda expression returning a dictionary containing the attribute values for each tuple. The values can refer to the range variable introduced by the lambda to access tuple values. It is implemented using the EXTEND relational operator which in turn is implemented using the AND relational operator (can you see how?).
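A plain-Python sketch of EXTEND (a model of the behaviour, not Dee’s API): each tuple gains new attributes whose values are computed from the existing ones via a lambda returning a dictionary.

```python
# Each tuple is merged with the dictionary the lambda computes for it.
def extend(body, new_attrs, f):
    out = []
    for t in body:
        new = f(t)
        assert set(new) == set(new_attrs)  # the lambda must supply exactly the new attributes
        out.append({**t, **new})
    return out

exam_mark = [{'StudentId': 'S1', 'Mark': 85}]
result = extend(exam_mark, ['Grade'],
                lambda t: {'Grade': 'A' if t['Mark'] >= 70 else 'B'})

assert result == [{'StudentId': 'S1', 'Mark': 85, 'Grade': 'A'}]
```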

These operators take relations and return scalar values according to some lambda expression (except in the case of COUNT which simply counts the number of tuples). If the relation has a single attribute then the expression defaults to it.

This takes two relations (the heading of the second must be a projection of the first’s), together with a tuple of one or more scalar summary expressions. If we take the EXAM_MARK relation variable, which looks like:
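With some illustrative EXAM_MARK data (the values below are made up for the example), summarising per course can be sketched in plain Python: group by a projection of the first relation and compute scalar summaries for each group.

```python
# Illustrative data, then a per-group summary: count and maximum mark
# per course.
exam_mark = [{'StudentId': 'S1', 'CourseId': 'C1', 'Mark': 85},
             {'StudentId': 'S2', 'CourseId': 'C1', 'Mark': 49},
             {'StudentId': 'S1', 'CourseId': 'C2', 'Mark': 72}]

courses = {t['CourseId'] for t in exam_mark}   # the grouping relation
summary = {c: {'Count': sum(1 for t in exam_mark if t['CourseId'] == c),
               'MaxMark': max(t['Mark'] for t in exam_mark if t['CourseId'] == c)}
           for c in courses}

assert summary['C1'] == {'Count': 2, 'MaxMark': 85}
assert summary['C2'] == {'Count': 1, 'MaxMark': 72}
```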

This joins two relations and then projects back onto the first one. It is very useful for finding just those tuples in one relation that have a counterpart in another, hence the alias MATCHING. For example, “find all students who are enrolled on at least one course”:
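A plain-Python sketch of that semijoin (a model of MATCHING, not Dee’s API):

```python
# Keep just those tuples of the first relation that have a
# counterpart (a matching StudentId) in the second.
students = [{'StudentId': 'S1', 'Name': 'Anne'},
            {'StudentId': 'S2', 'Name': 'Boris'}]
enrolled = [{'StudentId': 'S1', 'CourseId': 'C1'}]

enrolled_ids = {t['StudentId'] for t in enrolled}
matching = [t for t in students if t['StudentId'] in enrolled_ids]

assert matching == [{'StudentId': 'S1', 'Name': 'Anne'}]
```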

This operator is especially attractive if we think of a relation as a function: ‘calling’ the function with another relation as a parameter can be achieved by performing a natural join. For example, suppose we have a function-based relation, plus, with a heading of [‘x’, ‘y’, ‘z’]. Suppose the tuples of this relation are all the possible ones where x + y = z. One implementation of such a function-based relation could be:

>>> def plusfn(trx):
...     """Plus (could just as well be called minus)"""
...     if set(['x','y']).issubset(trx.attributes()) and not set(['z']).issubset(trx.attributes()):
...         return [Tuple(x=trx.x, y=trx.y, z=trx.x + trx.y)]
...     if set(['x','z']).issubset(trx.attributes()) and not set(['y']).issubset(trx.attributes()):
...         return [Tuple(x=trx.x, y=trx.z - trx.x, z=trx.z)]
...     if set(['y','z']).issubset(trx.attributes()) and not set(['x']).issubset(trx.attributes()):
...         return [Tuple(x=trx.z - trx.y, y=trx.y, z=trx.z)]
...
...     if set(['x','y','z']).issubset(trx.attributes()):
...         if trx.x + trx.y == trx.z:
...             return [Tuple()]  # DEE, i.e. True  #todo: should really return trx(x,y,z) - no matter?
...         else:
...             return []  # DUM, i.e. False
...
...     raise RelationException("Plus: Infinite rows")  # no x, y or z
...     # Note: we could go further and return tuples given just one attribute
...     # or indeed we could start yielding infinite combinations if no attributes are passed
>>> plus = Relation(["x","y","z"], plusfn)

And it could be used to test whether an expression is valid or not: a returned tuple means True (DEE), no tuple means False (DUM).

And if the function-based relation had more than one result for a given tuple, e.g. square root would return -2 and +2 for 4, then it could simply return multiple results.

This whole approach could be made syntactically simpler using composition. Using a function-based relation as a function (or indeed a mathematical operator) could be done with COMPOSE(Relation(["x","y"], [(3,4)]), plus).

This is now looking more like a call to a plus function with the COMPOSE making sure that only the new information, the result, is returned. Of course it’s really still performing a natural join on two relations. We could go further and extract the result as a number.

This operator is used to traverse tree-like structures, such as a bill-of-materials. It takes as input a binary relation comprising parent and child pairs and returns a relation that has all hierarchies fully expanded. For example:

To modify a relation variable, like any other variable, you assign a value to it. Since the values of relations are often large, we introduce a shorthand, insert, for adding tuples to an existing relation variable. This shorthand really just performs a UNION (|) between the original value and the given relation (so the two relations must have the same heading).

We also introduce a shorthand, delete, for removing tuples from an existing relation variable. This shorthand really just performs a MINUS (-) between the original value and the given relation (so the two relations must have the same heading).

Another shorthand, update, is for updating selected tuples in an existing relation variable. This shorthand really just performs a delete followed by an insert, with some special copying between the deleted tuples and the inserted tuples. For example, if we want to update all exam marks below 70 to 0:

The first boolean expression is used to determine which tuples to update - this is the same format used in Restriction (the where method). Then comes the list of attributes to be updated, followed by another lambda expression that returns a dictionary containing the new values for each attribute - this is the same format used in Extension (the extend method). The values here can refer to the pre-update tuple values by prefixing their names with ‘_OLD_’, for example to remove 10 from each Mark over 9:
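The delete-then-insert behaviour, with new values referring to the old ones, can be sketched in plain Python (a model of the shorthand, not Dee’s API): here we remove 10 from each Mark over 9.

```python
# Update modelled as a delete of the selected tuples followed by an
# insert of modified copies; the new value refers to the old one.
exam_mark = {('S1', 85), ('S2', 8), ('S3', 10)}

selected = {t for t in exam_mark if t[1] > 9}                   # the where-style restriction
updated = {(sid, old_mark - 10) for sid, old_mark in selected}  # new value from the _OLD_ value

exam_mark = (exam_mark - selected) | updated                    # delete then insert

assert exam_mark == {('S1', 75), ('S2', 8), ('S3', 0)}
```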

One of the features of Dee is that adding new relational operators is easy and encouraged. Simply define a standard Python function taking and returning relations. For example, the transitive closure operator TCLOSE, used to explode a tree structure and discussed earlier, is implemented like this:
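The shape of such a recursive definition can be sketched in plain Python on sets of pairs (the real implementation uses Dee’s relational operators and macros rather than this set arithmetic):

```python
# Recursive transitive closure over a binary relation of
# parent/child pairs: union in the newly derivable pairs and recurse
# until a fixed point is reached.
def tclose(pairs):
    assert all(len(p) == 2 for p in pairs)   # binary relation only
    extra = {(a, c) for (a, b) in pairs for (b2, c) in pairs if b == b2}
    result = pairs | extra
    return result if result == pairs else tclose(result)

bill_of_materials = {('car', 'engine'), ('engine', 'piston')}
assert tclose(bill_of_materials) == {('car', 'engine'),
                                     ('engine', 'piston'),
                                     ('car', 'piston')}
```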

Notice that this takes a single relation, asserts that it has two attributes and then recursively defines itself to return a relation. The simplicity of the definition owes much to the power of the basic relational operators, the macros defined on them, and the Python language itself.

We currently define a base Database class (in DeeDatabase.py) which provides a namespace for grouping relation variables (base and virtual) together. It also acts as a unit of persistence, both for defining, storing and loading relation variables and for transactions. Each new Database should descend from this base class and add its initial base relation variable definitions and data, along with any virtual relation variable definitions and supporting functions.

Each database has its own catalog: a set of relations that describe the database (including itself):

relations

attributes

constraints

constraint_attributes

When a database is imported, the relation variables it holds are loaded from storage. When the program that imported the database finishes, the database (i.e. all its relation variables) is stored to the storage area (e.g. disk). The database is also stored upon a transaction commit - see below.

The database construction method (__init__ in Python) is called when a database is first created or loaded from storage. It is used to define initial values for relation variables, e.g.:

Any modifications to the existing relation variables, or new ones added to the namespace, or old ones deleted from the namespace, will be persisted when the program finishes (or after a transaction commit).

We currently define a base Cluster class (in DeeCluster.py) which provides a namespace for grouping databases together. Each new Cluster should descend from this base class and import and set its database definitions.

Each cluster has its own catalog: a set of relations that describe the cluster:

databases

The cluster construction method (__init__ in Python) is called when a cluster is first created or loaded from storage. It is used to define initial values for database variables, e.g.:

As well as using the Python interpreter to interact with relations and databases, a stand-alone web-server that gives access to a cluster of sample databases from a browser is available. To start the server and a client run:

python DeeWebDemo.py

The default browser should launch and a text box will be presented to allow Dee expressions to be entered. The expression can be evaluated by pressing the ‘Evaluate’ button. The default database can be changed by selecting from the drop-down list and pressing the ‘Change database’ button.

Constraint checking should be optimised for insert/update operations, where often just the new tuples need to be checked. More work could also be done to simplify and improve the declaration of some constraints, e.g. ForeignKey, and the representation and storage of ad-hoc lambda functions.

Currently if a constraint fails during the insertion phase of an update, the deletions are not rolled back (unless this is manually controlled via a database transaction).

We would also like to use a multiple assignment mechanism, similar to Python’s, to achieve parallel assignment and so avoid any need for deferred constraints.

We need to implement routines to ensure relation constraints, such as candidate keys, are inferred in derived relations. We do some of this already, e.g. for the rename operator, but much more is required.

Adding and removing constraints from existing relations is not yet supported. Also database-level constraints cannot be created yet. Also actions such as deleting a relation variable that is referenced should be handled better.

With memory sizes measured in Gigabytes, the immediate intention is to keep databases in memory with a protected power supply to provide short-term durability. For longer term durability, the database can be saved to disk. This is currently done by storing the database relation variables in a single Python pickle file. If the debug_dump constant is True in DeeDatabase then a script is also written containing the latest relation definitions and data. This storage mechanism could be made more efficient (by storing incremental changes rather than the whole database) and more resilient (e.g. storing online to Amazon’s S3).

The current implementation is intended to be shared by a single application, e.g. a web-server, feeding multiple clients. Work is needed to ensure that multiple threads can share the data and serialise transactions.

By working with in-memory data and using fully hashed relations we can perform very fast joins. We have some optimisation code for some relational operators, but this could be improved e.g. by introducing a ternary join operator AND(r,s,t).

As another example, the where method effectively scans the relation, filtering by the given expression. For equality checks on large relations it would be much faster to use & with the appropriate relation: then the hash join will be used instead of a scan. This optimisation can be made manually, e.g. replace r.where(lambda t: t.Name == 'Findit') with r & GENERATE({'Name': 'Findit'}), but it ought to be done by the system.

Perhaps we should introduce more Relation methods (such as intersect, union, join, times) to map onto the existing operators. These would simply be aliases but may make things appear more familiar. On the other hand, once it’s realised that these are all just AND and OR, it might be best to keep the number of different names for the same thing to a minimum.