Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

In my
last post on object-relational algebra, I introduced the idea that functional notation allowed relational algebra to play well with other mathematical disciplines, and therefore solved at least in theory, the limits of relational algebra's expressiveness. In this post, I'll discuss the initial properties of functions as I see them in this approach, and the next, and final, posting in this series will cover the special series join function. The series join function itself provides a way to address transitive connections between tuples in a relvar.

Definition and Types of Functions

The general approach of this object-relational algebra involves applying algebra to manipulate operations on functions of sets of tuples. A set of tuples is a relvar, or relational variable. The set of tuples in the relation is the domain of the function, and as with all functions, it represents exactly one result per domain element, so for every tuple, a single value will result. That value of course can be a set (including a null set), a tuple, or even a relvar. However, it is also necessary to note that a function must return the same basic type of result for every tuple. We can't return a simple value for one, a relation for another, and a set of values for a third.

Each element in a function can be seen as being
relational in the sense of within the expressive boundaries of relational algebra, or
non-relational in the sense of being outside those boundaries. Functions themselves can then be divided into three categories:

Relational functions contain only relational elements and can be expressed solely in relational algebra.

Non-relational functions contain only non-relational elements and cannot be further simplified in relational algebra in any circumstance, and

Semi-relational functions contain both relational and non-relational elements, and these can be partly further simplified in relational operations.

All types of functions have some properties in common. If a function is a function of a relation, then it is necessarily true that it is also a function of every candidate key in that relation, and that for any well-formed data that could be added, it will continue to return a single value of consistent type.

A second level of categorization can be had regarding whether the domain of the relation is fully mathematically self-contained or not.

For example, suppose we have a relvar of employee data, E, and E contains an attribute dob, representing the employee's date of birth. Therefore, for every tuple in E, there is a dob element. We can then have two functions:

age(E) represents the current age of the employee at the time it is evaluated.

age2000(E) represents the age of the employee on New Year's Eve, 2000.

age2000(E) has a fully self-contained domain. The values depend on the values stored in the relvar E, and nowhere else. age(E) however does not have a fully self-contained domain. For any given relational operation, age(E) will behave like a function and we can use it as such, but the results will change over time. Oracle calls these determinate and indeterminate functions respectively. PostgreSQL divides these up into additional categories for planning purposes --- in addition to immutable functions whose output never changes for a given input, you have stable and volatile functions, the latter are not really a function per se of the input.

Properties of Relational Functions

I didn't really start getting really excited about this until I started working with relational functions. Once I started though, there was no going back. Relational functions can be expressed in relational algebra and therefore can roughly map to subqueries in SQL. Of course this is not exact, but there may be cases where if is helpful to look at from this perspective. This is particularly important when looking at optimizing simple functions written in SQL when called in where clauses.

Relational functions can be expanded algebraically in other relational operations. For example:

Suppose we have a relvar L which represents landmarks, and has an attribute country_id which is a foreign key to country, and suppose further we have a relvar C which represents countries and has an attribute called name which represents the country's name. We can then define the following functions:

let country(L) = σ
id=L.country_id(C)

let frenchl(L) = σ
(country(L)).name='France'(L)

country(L) can be in-lined into frenchl(L), and this can be transformed into a subselect, and eventually (in a set- rather than bag- based approach at least) a left join.

The set of single-value returning relational functions is a proper superset to the number of functional dependencies in the database reachable through appropriate join dependencies. Without grouping and aggregation, these sets are identical. However, with grouping and aggregation, relational functions express a larger array of possible values.

Properties of Non-Relational Functions

Non-relational functions cannot be expressed in relational algebra and therefore must be solved outside the stream of relational operations themselves. Non-relational functions must be treated as to their value types. A set-returning function can be treated as a set, a single-value returning function can be treated like a relvar's attibute, and a relvar-returning function can be treated like a relvar itself.

This field moves relational algebra from the area of solving relational queries into the area of symbolic simplification.