Sunday, June 25, 2017

Normalization is a process of identifying the optimal grouping (relations at the end) for attributes that satisfies data requirements in an organization. It is a database design technique we use after completing ER modeling. This process identifies relationships between attributes (called functional dependencies) and applies series of tests described as normal forms.

There are many articles written on this and you can find examples for almost all normal forms. However many articles explains the theory with the same scenario, hence thought to make a post with different set of examples that I use for my lectures.

As explained above, the purpose of normalization is to identify the best grouping for attributes that ultimately forms relations. Some of the characteristics of relations formed are;

Support of data requirements with minimal number of attributes

Relation holds attributes with a close logical relationship (functional dependency)

Functional Dependency

Let's try to understand functional dependency first. This speaks about the relationship between attributes in a relation. For an example, if EmployeeCode and FirstName are attributes of Employee relation, we can say that FirstName is functionally dependent on EmployeeCode. This means, each EmployeeCode is associated with exactly one value of FirstName. We denote this like;

EmployeeCode -> FirstName

Remember, in above relationship, we call the EmployeeCode as determinant. Basically the left-hand side of the arrow is considered as the determinant. The relationship between left to right is always one to one (1:1).

If the right-hand attribute is fully dependent on left-hand side, we call this dependency as full functional dependency. If the left-hand side is a composite one (two or more attributes) and right-hand side can be determined by part of left-hand side, then the dependency is considered as partial dependency (you will see an example of it later).

Why need to identify functional dependencies? One of the reasons for that is, identifying the best candidate for the primary key. Once functional dependencies are identified, we can analyze all and select the most suitable determinantas the primary key.

First Normal Form (1NF)

The definition of this goes as A relation in which the intersection of each row and column contains one and only one value. Let's try to understand this with an example. The following table shows an identified relation for Student Registration for courses. As you see, a tuple represents a registration that is done for a date.

In order to make sure that the relation is normalized for 1NF, we need to make sure that;

You can see that Course attribute has multiple value that violates the 1NF. There are multiple ways for addressing this but if I need to handle it without decomposing the relation, I can organize my tuples as below.

Since the relation has no multiple values in intersections and no repeatable groups, it is now a 1NF relation.

Second Normal Form (2NF)

The definition of second normal form is A relation that is in First Normal form and every non-primary-key attribute is fully dependent on the primary key. What is says is, there should not be partial dependency between primary key and non-primary key.

Let's try to set the primary key for above table. For that, let's list out some functional dependencies;

StudentCode, Course -> Name, Town, Province, Course, DateRegistered

StudentCode -> Name

Town -> Province

Considering above identified functional dependencies, I can easily pick the first one, that is StudentCode, Course as my primary key because the combination of them can be used for identifying the tuple easily.

Okay, now the primary key is StudentCode+Course. However, we know that StudentCode -> Name relationship is still exist. This means that Name can be determined by part of the primary key, that is partial dependency. This is the violation of second normal form.

We can decompose the relation now into two for making sure that relations do not violating the 2NF.

Note that you will not see violation of 2NF if the primary key is based on just one attribute.

Third Normal Form (3NF)

This normal form speaks about transitive dependency. The definition goes as A relation that is in First and Second Normal form and in which no non-primary-key attribute is transitively dependent on the primary key.

This says that we should remove transitive dependency if they are exist. What is transitive dependency? It is a condition such as in Student relation, StudentCode determines the Town (StudentCode -> Town - There is only one two associated with a given StudentCode) and Town determines the Province (Town -> Province), therefore StudentCode determines Province (Note that, as per this relation StudentCode detemines Province but the issue is it can be determined by Town too). This is transitive dependency. In other words, if you see that Attribute A determines B (A -> B) and B determines C (B -> C), then A determines C (A -> C).

For removing transitive dependency, we need to decompose the relation.

Boyce-Codd Normal Form (BCNF / 3.5NF)

This is an extension of 3NF and it is sometime treated as 3.5NF. This makes the 3NF more stronger by making sure that every non-primary-key determinant is a candidate key with identified functional dependencies. The definition goes as A relation is in BCNF, if and only if, every determinant is a candidate key.

What does it exactly means? You have already seen that we can identify many functional dependencies in a relation and we pick one for defining the primary key. The determinants of other identified functional dependencies can be candidate keys for the primary key or they might not be qualified for the primary key. If you see that all determinants are qualified, means you can mark them as the primary key if need, then your relation (table) is in BCNF.

Take this example.

Assume that business rules related to this relation are as follows;

Course has one or more subjects.

Course is managed by one or more lecturers.

Subject is taught by one or more lecturers.

Lecturer teaches only one subject.

If you consider the primary key of this table is Course + Subject, then no violation of 1NF, 2NF and 3NF. Let's list out all possible functional dependencies.

Course, Subject -> Lecturer

Course, Lecturer -> Subject

Lecturer -> Subject

Now, based on the identified functional dependencies, see whether you can make determinants as candidate keys. If you take the first one, we can clearly say that Course + Subject is a candidate key. Second one that is Course + Lecturer is also a candidate key as we can identify tuples uniquely using it. However the determinant of the third one cannot be used as a candidate key because it has duplicates. You cannot make Lecturer as a primary key. Now you have a determinant that cannot be set as a primary key, hence it violates BCNF.

In order to make the table BCNF table, need to decompose as below.

Forth Normal Form (4NF)

This normal form handles multi-valued dependencies caused by 1NF. When we see repeated groups or multiple values in an intersection, we add additional tuples removing multiple values. That is what we do with 1NF. When there are two multi-value attributes in a relation, then each value in one of the attributes has to be repeated with every value of the other attribute. This situation is referred as a multi-valued dependency. See below relation;

If we apply 1NF to this relation;

The definition of the multi-valued dependency goes as Represent a dependency between attributes in a relation, such that for each value of A there is a set of values for B and set of values for C. However the set of values for B and C are independent of each other. This dependency denotes as A ->> B.

The forth normal form is describes as A relation that is in Boyce-Codd normal form and does not contain nontrivial multi-valued dependencies. This talks about one type of multi-valued dependency that is nontrivial. Trivial relationship means; if B is subset of A or A U B = R. Else it is Nontrivial. As you see, CustomerContact contains nontrivial dependencies, hence need to decompose the table as below.

Fifth Normal Form (5NF)

In order to normalize relations, we decompose the relations into multiple relations. Although multiple divided relations optimize transactions and avoid anomalies, it adds a cost for data retrieval as relations have to be rejoined. The biggest risk with rejoining is, producing inaccurate outputs in certain conditions.

When we decompose a relation into two relations, the resulting relations have the property called lossless-join that makes sure rejoining two relations produce the original relation. The definition of lossless-join is, A property of decomposition, which ensures that no spurious tuples are generated when relations are reunited through a natural join operation.

Now let's try to understand Fifth Normal Form. When decomposing a relation into multiple relations for minimizing redundency, it might introduce join dependency, that might create spurious tuples when they are reunited. The definition of Join Dependency goes as for a relation R with subsets of the attributes of R denoted as A, B, ..., Z a relation R satisfies a join dependency if and only if every legal value of R is equal to the join its projections on A, B, ..., Z. Considering this, definition of Fifth normal form goes as A relation that has no join dependency.

Since this is very rare to see in database design, let's try to understand with an example. See the following table that contains how Lecturers teaches Subjects related to Courses.

Assume that the tuples are formed based on the following scenario;

Whenever Lecturer L1 teaches Subject S1,

and Course C1 has Subject S1,

and Lecturer L1 teaches at least one subject in Course C1,

then Lecturer L1 will be teaching Subject S1 in Course C1.

Note that we have this scenario for explaining the 5NF, otherwise you will not see it properly.

Now if I try to decompose this relation into two relations for minimizing redundant data, I will be having these two tables (Sequences are added for understanding joins only);

Now, if I need to rejoin these with Natural Join (Read about Join at: ), this will be the result.

See the highlighted one. It is the result of Join Dependency. It is a spurious tuple which is not valid. In order to avoid it, in order to make the relations as 5NF relation, let's introduce another relation like below;

Now, if we rejoin all three, we will not see any issue.

Therefore, in order to minimize redundant data and make sure no join dependency, make sure relations are formed as 5NF relations.

No comments:

Search This Blog

About Me

Dinesh Priyankara (MSc IT) is an MVP – Data Platform (Microsoft Most Valuable Professional) in Sri Lanka with 16 years’ experience in various aspects of database technologies including business intelligence. He has done many local and foreign business intelligence implementations and has worked as a subject matter expert on various database and business intelligence related projects. He is the Founder/Principal Architect of dinesQL (Pvt) Ltd and he consults, teaches and runs training programs on data related solutions and subjects.