Database test driven development for everyone


By: Alex Podlesny

Inspired by an interview question. Or what to do when you get iced from all 360 degrees.

Part One - Having Fun

How easy is it to move a condition from a WHERE clause to a JOIN clause? The answer is: very easy. Some of us might say that it is aesthetically better; some would say that semantically a join is for joining, while the WHERE clause was invented to introduce predicates. Others even argue that there exists a relational algebra rule that allows interchangeability of the predicates.

All of those points are right in their own way. Let's try to move predicates around and see what surprising features we can uncover.

As a starting point we will have two tables, Product and OrderDetails, and one important stored procedure, GetOrderDetails, that returns order details enriched with relevant product information.

The first four disappointing samples use the inner join.
By default, JOIN stands for an INNER JOIN. The "inner" qualifier usually gets omitted by developers; this is normal practice, but a better practice is to spell out the complete join description so that your fellow developer, who is working at 2:00 AM, will not have to remember all the cool intricacies of T-SQL. Here they are, all four stored procedures:
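As a rough sketch of the idea - assuming minimal Product and OrderDetails layouts that may differ from the article's real tables, and using SQLite (via Python) in place of SQL Server - the two inner-join predicate placements can be compared directly:

```python
import sqlite3

# Assumed minimal layouts; the article's real columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product      (ProductID INTEGER, Name TEXT);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
INSERT INTO Product      VALUES (1, 'Apple'), (2, 'Pear');
INSERT INTO OrderDetails VALUES (4, 1, 10), (4, 2, 5), (7, 1, 3);
""")

# Variant A: the predicate lives in the WHERE clause.
variant_a = """
    SELECT od.OrderID, p.Name, od.Quantity
    FROM OrderDetails AS od
    INNER JOIN Product AS p ON p.ProductID = od.ProductID
    WHERE od.OrderID = 4
    ORDER BY od.OrderID, p.Name
"""

# Variant B: the same predicate moved into the JOIN clause.
variant_b = """
    SELECT od.OrderID, p.Name, od.Quantity
    FROM OrderDetails AS od
    INNER JOIN Product AS p
        ON p.ProductID = od.ProductID AND od.OrderID = 4
    ORDER BY od.OrderID, p.Name
"""

rows_a = conn.execute(variant_a).fetchall()
rows_b = conn.execute(variant_b).fetchall()
assert rows_a == rows_b  # for an INNER JOIN both placements are equivalent
print(rows_a)            # [(4, 'Apple', 10), (4, 'Pear', 5)]
```

For an inner join the two placements really are interchangeable, which is exactly what makes the outer-join case below so treacherous.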

All four procedures return the same expected result; it is even getting boring:

Let's sharpen our refactoring skills and use the left outer join.

The "outer" means (as we all know) that all rows from the left table not meeting the join condition are included in the result, set is usually. Use of the "outer" qualifier is optional so that LEFT OUTER JOIN = LEFT JOIN, not going to mention about best practices this time...

It is not a surprise any more; it is an introduced-unexpected-feature: now we are getting all the details for all orders, while the product name is populated only for Order #4.

This is a very expensive feature in terms of performance impact and in terms of data exposure; the new code will pull all records from the OrderDetails table. Here is the result:
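The difference can be sketched directly - assuming the same minimal layouts as before and SQLite in place of SQL Server (the outer-join semantics are identical):

```python
import sqlite3

# Assumed minimal layouts; the article's real columns may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product      (ProductID INTEGER, Name TEXT);
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER, Quantity INTEGER);
INSERT INTO Product      VALUES (1, 'Apple'), (2, 'Pear');
INSERT INTO OrderDetails VALUES (4, 1, 10), (4, 2, 5), (7, 1, 3);
""")

# Predicate in WHERE: only Order #4 rows survive the filter.
in_where = conn.execute("""
    SELECT od.OrderID, p.Name
    FROM OrderDetails AS od
    LEFT OUTER JOIN Product AS p ON p.ProductID = od.ProductID
    WHERE od.OrderID = 4
    ORDER BY od.OrderID, p.Name
""").fetchall()

# Predicate moved into the JOIN: every OrderDetails row survives,
# and Name is NULL for everything except Order #4.
in_join = conn.execute("""
    SELECT od.OrderID, p.Name
    FROM OrderDetails AS od
    LEFT OUTER JOIN Product AS p
        ON p.ProductID = od.ProductID AND od.OrderID = 4
    ORDER BY od.OrderID, p.Name
""").fetchall()

print(in_where)  # [(4, 'Apple'), (4, 'Pear')]
print(in_join)   # [(4, 'Apple'), (4, 'Pear'), (7, None)]
```

The second query quietly returns every row of OrderDetails, which is the "introduced-unexpected-feature" the article describes.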

Part Two - Deal with it

As you can see, moving a given predicate from a WHERE clause to a JOIN clause can dramatically alter the expected result. Visually, such a change is very hard to catch, especially when reviewing complex queries. Imagine if the code also includes cumbersome comments that might lead a reviewer to believe that the behavior was actually intended. An attempt to run such a query against unreliable data in a staging environment may not expose the error, or may expose it inconsistently, reducing testing efficiency and allowing such ambiguous code to be released to production.

Here is how we can protect the user, help the testing team, remove a burden from the reviewer, and improve confidence in the code.
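One way to build such a guard is a row-count comparison between an expected recordset and the procedure's actual output. A minimal sketch, with SQLite in place of SQL Server, a plain query standing in for GetOrderDetails, and hypothetical expected rows:

```python
import sqlite3

def assert_same_rowcount(actual_rows, expected_rows, proc_name):
    # The guard: the two recordsets must be the same size before we
    # even start comparing values row by row.
    if len(actual_rows) != len(expected_rows):
        raise AssertionError(
            "#ActualResult and #ExpectedResult recordsets should have "
            f"same number of rows. Something is wrong with {proc_name}."
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OrderDetails (OrderID INTEGER, ProductID INTEGER);
INSERT INTO OrderDetails VALUES (4, 1), (4, 2), (7, 1);
""")

expected = [(4, 1), (4, 2)]  # hypothetical: what GetOrderDetails should return
actual = conn.execute(
    "SELECT OrderID, ProductID FROM OrderDetails"  # buggy: predicate was lost
).fetchall()

try:
    assert_same_rowcount(actual, expected, "GetOrderDetails")
except AssertionError as e:
    print(e)
```

The same check, written in T-SQL against temporary tables, is what the error message below is reporting.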

When running our code against the last version of the procedure, the one that uses a LEFT OUTER JOIN with the predicate in the join clause, we get the following error: "#ActualResult and #ExpectedResult recordsets should have same number of rows. Something is wrong with GetOrderDetails procedure."

You would know why the Expected and Actual results are different after examining the table output listed below.

Now let's use one of the earlier created versions of the GetOrderDetails procedure.

If you landed on this page from a search engine, we would recommend starting with the first chapter

Let's define what it is that we are trying to integrate continuously. At first glance, it might seem obvious that we are talking about just one thing - the database - and that it is our product of integration.

But are we really talking about one thing?

When we say "database", here is what people might think:

A DBA would think of physical servers with database software instances (like Oracle or SQL Server) installed, a database created with files located on physical and logical drives, and populated with data;

A developer who works with a database-centered application might think of a few tables that exist somewhere in a database this application has access to;

A user would think simply of data: data loaded from raw files, collected from sensors, manually entered, bought, or acquired in any other way. Users need data.

The magical "database" term removes obstacles and allows all stakeholders in their different roles to work together; at the same time, it carries a very high level of ambiguity.

Let's try to remove the ambiguity, set expectations, and focus on an actual CI implementation for a particular database product.

Please welcome the “DNA and The Creature” example:

DNA - on one side, there are developers and designers who create databases, write database code, and create packages and procedures.

The Terminology Wall - stands between the two sides.

The Creature - on the other side, there are users who use the data in a database and DBAs who support that database.

Let's say, for example, we have designed a database that describes a creature: the "The Creature" database. Two tables are all we need:

CREATE DATABASE TheCreature
GO

CREATE TABLE CreatureAttributes(
    AttributeID int,
    Name nvarchar(50),
    Value nvarchar(50)
)
GO

CREATE TABLE SignificantEvents(
    EventID int,
    EventDate datetime,
    EventDescription nvarchar(200)
)
GO

Done - we got a product. So… Coding is done; testing is done; we got approval from the client; we saved our code in the source control system as one small file called A_Creature_DNA.sql.

Then, this product (the SQL file) was distributed to many divisions of the company. Many departments installed this product internally on their production servers and started using their own “The Creature” databases.

Let's say one office accumulated the following data:

CreatureAttributes table:

AttributeID | Name           | Value
----------- | -------------- | ----------
1           | Class          | Mammal
2           | Family         | Rabbit
3           | Name           | Bunny
4           | Diet           | Herbivore
5           | Lifespan Limit | 12 years
6           | Birthday       | 2012-01-01

SignificantEvents table:

EventID | EventDate          | EventDescription
------- | ------------------ | ----------------------------------------------
1       | 2014-01-01 4:00 PM | New creature is born of class Mammal
2       | 2014-01-01 4:01 PM | New creature chooses to be a Rabbit
3       | 2014-01-03 1:00 AM | Creature tried to eat grass - and it likes it
4       | 2014-01-03 2:00 PM | We named it Bunny
5       | 2015-02-03 7:00 PM | Bunny ran away from Fox. Very scary...
6       | 2016-04-04 8:00 AM | Bunny met Dolly, they are friends now
7       | 2019-05-08 9:00 PM | Bunny got to a hospital
8       | 2019-05-09 5:00 AM | Bunny feels better and going home
9       | 2020-08-23 9:00 AM | Bunny goes to school

There might also be other departments who grew their own rabbits, or lions, or horses, or other unknown chimeras.

From the standpoint of developer Joe, who invented “The Creature” database, the final database product is the SQL code for those two tables.

From the standpoint of user Jill, who uses this product daily, “The Creature” database is the actual database installed on the server and populated with real data.

Both Joe and Jill have different needs from the database. Both use the term “database” in their own way. In the end, they are working with two different database products - the database code and the database that this code has created.


I want to begin a discussion about Database Continuous Integration (CI) and slowly unwind this concept within the database framework.

You are welcome to start from any chapter, but the same words might mean different things to people sitting in the same meeting room and working together as the same team. Just ask an Oracle DBA and a SQL Server DBA the simple question "what is the database?", and you will understand.

In this chapter: defining a database lifecycle, the flow of code, and environmental differences.

Power of SDLC

In small startups and big corporations, the software development process follows an internal lifecycle adopted by the company. It can be a one- or two-step process; it can be manual; it might be evolving. The benefits of an SDLC are hard to overestimate, since it provides standard guidelines, minimizes risk, improves quality, provides consistency, and plays a significant role in safeguarding the actual business and its products.

Flow of code

Everything starts in the development camp. Here is where changes are made, queries are written, tables created, and databases designed.

All changes, no matter how big or small, eventually flow to testing, staging, or other non-production environments. There, those changes are reviewed by someone other than the original inventor, tested, checked for compliance, verified, documented, accepted, and scheduled for a production release.

Scheduled releases are double-checked, triple-checked, and eventually deployed to a production environment.

In production, the newly introduced changes are confirmed, checked, rolled back when necessary, documented and left to be forgotten.

As database projects mature, development priorities are slowly overtaken by invisible support and operations priorities; changes to the database design might still happen, but on a very limited scale. Many new changes tend to follow the SDLC process, but that does not happen all the time.

The life of a production database

Life turns into a real adventure in a production environment. Here, new code is released and updates are installed. Hardware is changed. Performance is tuned. Backups are configured. Recovery is exercised to verify the backup process. Availability is ensured. New data is loaded. Hardware, software, and network issues are resolved; configurations are changed; security and permissions are updated.

After the database goes live, the flow of requests never stops.

For many organizations, changes to production databases might follow a different SDLC (formal or informal) and never trickle back to the development environment. After a while, the number of differences between the development and production environments grows.

Development environments

Oh yes, those development environments…

The Development environment is where changes happen. Regardless of its importance, this environment gets very little love. It is not the favorite environment of developers, nor the preferred environment of testers, nor an ideal environment for administrators. So why this absence of love?

Here is why:

This environment is refreshed with production data only once in a while;

It is most likely underpowered and useless for query tuning;

Developers can delete each other’s code and data at any time, well, mostly unintentionally;

It is unreliable - too many changes in data and logic, too many contentions;

It has bugs and unfinished features all the time;

Testing activities leave too much rubbish behind.

Competition for database resources is one of the primary causes holding development teams back. Competing needs affect quality, stretch timelines, pose multiple limitations, and consume a mountain of the team's time to sort out all the unintentional impact.

The Testing environment gets a little more love. It is updated more often; it is where developers like to go to make sure that testers do not find their bugs. Continuous impact from testing efforts is anticipated, but it is mitigated by frequent refreshes.

The Pre-Production environment - or User Acceptance Environment (because that is where users can try all the new changes) - is also an environment where developers and testers want to be. It is a favorite of DBAs because it is as close to production as it can get. From a hardware standpoint, it is a sandbox for performance tuning and a final stop before delivering new changes to a live production database.


Here is how the CI practice can be implemented as the backbone of the production process:

Maintain source code repository:

Follow the same principles outlined in Chapter 4, plus:

Create a separate Version Control System (VCS) folder (or branch) to store all objects scripted from the production system; this folder becomes the "Base Source Code" for your database.

Create a separate VCS folder (or branch) for each release that is currently in the queue to go to production; all development changes represent the "Delta Source Code". The Delta is where all developers contribute their code until the feature is ready to go to the next step in the adopted SDLC process.

Commit changes often, never longer than a day:

Follow the same principles outlined in Chapter 4, plus:

Set up an automated process that scripts all production database objects and core metadata in the production database, and then checks all those scripts into the VCS repository on a daily basis.

Yes, every night an automated process should collect and check in production code into the VCS. Over time, it becomes a history of all production changes that can be used to monitor the state of the production system.
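A sketch of such a nightly job, with SQLite's sqlite_master standing in for the production catalog; a real pipeline would follow the file writes with a commit to the VCS:

```python
import sqlite3
import pathlib
import tempfile

def script_out_objects(conn, out_dir):
    """Write one .sql file per database object, ready to be checked
    into the "Base Source Code" folder of the VCS."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cur = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE sql IS NOT NULL"
    )
    written = []
    for name, ddl in cur:
        path = out / f"{name}.sql"
        path.write_text(ddl + ";\n")
        written.append(path.name)
    return written

# A stand-in "production" database with a couple of objects.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CreatureAttributes (AttributeID INTEGER, Name TEXT, Value TEXT);
CREATE TABLE SignificantEvents  (EventID INTEGER, EventDate TEXT, EventDescription TEXT);
""")

out_dir = tempfile.mkdtemp()
files = script_out_objects(conn, out_dir)
print(sorted(files))  # ['CreatureAttributes.sql', 'SignificantEvents.sql']
```

In SQL Server, the same idea would be driven by the system catalog views or a scripting tool rather than sqlite_master.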

Every commit to the master branch should be built:

Follow the same steps outlined in Chapter 4, plus:

All changes from the production environment should be incorporated into the build to help control assumptions.

In addition to the brand-new-fresh-and-clean database artifact that is needed for development, a production-oriented CI process should produce a Delta Release artifact, which is later used to update the production system automatically.

Build Automation:

Build automation should become the backbone of the production delivery process, with a multi-layer, multi-environment delivery system.

Make sure that artifacts are available as installation packages

Follow the same principles outlined in Chapter 4

The installation package should be easily consumable by existing and home-grown solutions. Keep your options open: tools are changing very rapidly, so make sure you have a way to adopt a new approach.

Build process should be fast:

Follow the same principles outlined in Chapter 4, plus:

A production CI process can quickly become multi-dependent and layered; look for ways to improve delivery time for each part of the process and for the process as a whole;

Manage the risk of "never building the final product"; minimize the impact from future releases that are currently in active development. Isolate, separate, branch, minimize, and simplify each release. Try to achieve a "one deployment a day" goal; when you are there, improve to many per day; if you are not there, re-think, re-engineer, change…

Automated Unit Testing:

Follow the same principles outlined in Chapter 4, plus:

Unit tests should become a paramount requirement. Any change in code, no matter how small, should have a unit test for that change.

The primary deployable artifact in production CI is the Delta Source Code, which contains scripts to load well-tested code and well-known data.

Tune the process to the point of continuous deployment to production

First, define what kind of "Auto-Deployable Release" the team can accept.

Create guidelines to identify a Manual Release - a release that should not be deployed automatically. Make the guidelines available to every team member.

Try to get code reviews, documentation, sign-offs, approvals, and other paperwork done while the new product is moving through the development pipeline. By the time it reaches the UAT environment, you will know whether the release is auto- or manually-deployable.

Do not push the release to production if anything fails in a build, even if it is one-insignificant-tiny unit test.

Use the same tools to deploy to production and to pre-production environments. Deployment tools should use the same technology, the same workflow, and the same process to deliver changes. You want to catch failures before they take down the production environment;

Use the same tools for automated and manual releases, with the main difference being a step that requires human interaction: push a button to deploy a Manual Release to production.

Be realistic - not everything can (or should) be automated. Just tune the process to the point of continuous deployment to production of all the changes that can be automated.


In this chapter: from development to production and from production to development workflow.

If you landed on this page from search engine, we would recommend starting with the first chapter

Back in Chapter One, where we discussed SDLC, we introduced the realities of the production world.

Many businesses are built around one database or, perhaps, multiple databases; they grow together, expand, and evolve through well-known development cycles:

Develop new feature

Move it to production

Use a new feature until it becomes old

Come up with new ideas

Back to Step 1

The big difference between CI for product development and the CI needed to support a corporate production environment is the product of integration: "The Creature DNA" vs. "The Creature" (see Chapter Two for more information). While the "Creature DNA" can sit on a shelf and be assembled and reassembled at any time, the actual "The Creature" database cannot be turned off, reassembled, or changed, especially in high-availability projects. Ironically, over its lifetime - very close to a real-life scenario - the Creature can get new properties that the original DNA did not even have. Maintenance, performance tuning, a constant flow of new data and hotfixes, hardware-software-network issue resolution, upgrades, and configuration changes can quickly introduce mountains of differences in how the two behave.

If technologists want to deliver new functionality to a running, potentially high-availability database, they need to look into the idea of incremental upgrades: upgrades that slowly transform the existing system into an adaptive and always-evolving production environment, with new functionality and without impact.

Once the idea of continuous updates and continuous delivery of changes is accepted, the next step is to set up a backflow of changes, unknown to developers, from the production environment back to development.

In such a transformed place:

the product of CI for a production environment can be a simple set of scripts that introduce new functionality - an upgrade;

the goal of CI stays the same - to make sure that by the time those scripts are ready to be deployed to production, there are:

no known integration issues;

as minimal a risk of impact as possible.

The production lifecycle should include steps to continually push all production changes back to the development environments and close the potential code difference gap between the development and production environments:

Dev --> Prod:

1. develop a new feature
2. move it to production
3. use the new feature until it becomes old
4. come up with new ideas
5. back to Step 1

Prod --> Dev:

1. recognize a change in production
2. script that change
3. deliver this change back to the source code repository in the development environment
4. back to Step 1

With the dev2prod and prod2dev vision, let's move to the next chapter and examine production CI.


At almost every step of a CI process, some sort of database artifact is generated; artifacts can be different or similar, and all fulfill their individual purpose. Here are a few core samples:

SQL Scripts - a group of database SQL scripts. They can be separated into the following categories:

1. The complete list of scripted-out database objects and metadata. If run on an empty database server, it creates a brand new database with all of its objects and an initial set of data;
2. A package that adds new functionality or changes existing functionality in an existing database;
3. A package that inserts new data or manipulates existing data within an existing database.

An installation, in this case, can be a simple execution of the scripts on the target database server. While script execution might not be a viable approach for databases with many objects and much metadata, it can be quite efficient for delivering hotfixes and incremental updates, especially to production environments.

Backups or DB Files - different flavors of backup processes are available for various database engines. Some of them work better for a particular use case than others. They usually divide into the following categories:

Cold backup - done when the database server is down, so the actual database files can be copied to another location (the preferred approach). In Oracle, it might require a server shutdown, while in SQL Server it can be achieved by detaching the database in order to gain access to the actual database files;

Hot backup - performed while the database is in use.

An installation, in the case of backups, can be completed by a set of scripts that perform an automated restore on the target database server. While such an installation would be an excellent solution for application development and automation of unit tests, it has limited use for deploying changes to running production and test environments.
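A sketch of the backup-as-artifact and restore-as-installation flow, using SQLite's online backup API purely for illustration; real engines would use their native BACKUP/RESTORE (or RMAN) commands:

```python
import os
import sqlite3
import tempfile

artifact_path = os.path.join(tempfile.mkdtemp(), "creature_artifact.db")

# Build the brand-new-fresh-and-clean database once...
source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE CreatureAttributes (AttributeID INTEGER, Name TEXT, Value TEXT);
INSERT INTO CreatureAttributes VALUES (1, 'Class', 'Mammal');
""")

# ...and snapshot it into a file: that file is the build artifact.
artifact = sqlite3.connect(artifact_path)
source.backup(artifact)  # hot backup: the source stays usable throughout
artifact.close()

# "Installation" on a target server is just an automated restore.
target = sqlite3.connect(":memory:")
restore_src = sqlite3.connect(artifact_path)
restore_src.backup(target)
restore_src.close()

rows = target.execute("SELECT * FROM CreatureAttributes").fetchall()
print(rows)  # [(1, 'Class', 'Mammal')]
```

The artifact file can be restored as many times as needed, which is what makes it so useful for unit-test automation.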

Virtual Servers - a set of fresh virtual clones can be used as a starting point for a CI process to install fresh and clean databases.

An installation, in this case, consists of on-demand virtual machine provisioning by a consuming system.

The solution and the type of artifact do not need to be rigid; a mix of strategies can be chosen to satisfy the development and integration tasks at hand.

In this section: adopting Continuous Integration as the core of the development process.


From an application standpoint, a database is essentially a resource. As long as the application has access to a given database, it will work. Looking back at the second chapter of our discussion, this resource is essentially some “Creature” created based on the "Creature DNA".

So, if we can create a brand-new-fresh-and-clean database, back up this database, and keep the files as artifacts, then these files can be used by other processes to restore and reuse that brand-new-fresh-and-clean database in the target environment.

If this is achievable, then a data-centric application has the luxury of using a clean, non-polluted database with the latest tested code.

Here is how to implement the CI practice as the backbone of the development process:

Maintain source code repository:

As the first step, installing an in-house (or using a hosted) SVN, Git, Mercurial, TFS, or any other such system will do.

At the next step, all database code should be scripted. Make scripts that create tables, views, triggers, stored procedures, functions, synonyms, schemas, roles, permissions, data types, constraints, and every other thing that an application might need. Together with the scripts that create database objects, you need scripts to populate application-specific metadata and a few seed records.
Each of those scripts should be checked into the code repository.
The goal here is to make a clean database populated with a default set of metadata;

Commit changes often, never longer than a day:

All developers should commit code changes to the source code repository as soon as possible, providing comments and explanations of why changes were made, why new objects were added, or why they were removed.
Changes to default metadata should be checked in and documented as well.

There is no big task that cannot be split into small committable sub-parts; developers should not hold features and code hostage;

Every commit to the master branch should be built:

Out of the many branches that developers use in the source code repository, the master branch should always be built when new changes are committed;

But what does it mean to build? How can it be built?
These are good questions, and we will cover all the terms in more detail in a future post. For now, let's assume that we prepare a package that has all the scripts necessary to create a brand-new-fresh-and-clean database.
To get that done, create a set of build scripts that automatically download all committed changes from the source code repository, run those database scripts to produce the brand-new-fresh-and-clean database, and finally back up this database;

Build Automation:

Many build automation tools exist on the market today - TeamCity, Jenkins, Hudson, and others - and some of them are free. These tools make automating a build process very easy and enjoyable.

Make sure that built artifacts are available as installation packages

Save time for your fellow team members: instead of writing miles of deployment documentation, create installation packages that are ready for automated deployment and available to team members.

What sort of installation package to create? Are there many different kinds?

These are superb questions, and we will cover them in more detail in future posts. For now, let's assume that we prepare an archive with the following files:
1) a package that has all the scripts necessary to create a brand-new-fresh-and-clean database;
2) a script that calls this package to create the brand-new-fresh-and-clean database on the target server.

The installation package should be easily consumable by existing commercial, open source, or simple home-grown tools.

Build process should be fast:

The build server should continuously monitor the source code repository for new changes and start the build process as soon as possible.
A few minutes to build a product is excellent; half an hour is questionable; an hour is unacceptable.

The system might be configured to check for changes, compile the solution, run all unit tests, and prepare and publish artifacts - all in one giant process.

Alternatively, it may be configured with many small independent sub-processes that create their own artifacts.

Automated Unit Testing:

After the build is done, the product should be checked for conformity to expectations and requirements by running a set of automated unit tests.
Unit tests should be created by developers, since they are the ones who build a feature and the first to verify the new functionality. However, that might not always be the case; integrators and testers should automate their testing as well.

Adopt Test Driven Development practices.

Unit tests should be stored in the source code repository and be available to the build servers alongside the database scripts used to create the database artifacts.

Available Build Results:

Everyone on the team should know when a build fails. The earlier the team knows, the faster they can fix it.
Preventing downstream environments from getting corrupted code, and uncovering integration issues, is at the core of the continuous integration practice.

Any CI and deployment server on the market will make build results available to users. They can be configured to share the status in many different ways: from an email to a text message, from a web page to a flashing red light in the hallway.

Everyone should know when the build failed, what code brought it down, when it was fixed, and how;

Some might argue that the following two steps are not relevant when the product is just a brand-new-fresh-and-clean database intended to be used by developers.

For now here they are, the extra steps:

Keep the set of pre-production environments. Configure automated deployments to them


Continuous Integration - What is it? How can we get there? Can we buy it off the shelf?

Continuous integration (CI) is the practice of making sure that changes are integrated continuously into your product so that many issues, integration problems, and impact can be discovered and avoided earlier in the SDLC process.

Here are the main principles of the continuous integration process:

Maintain source code repository;

Commit changes often, never longer than a day;

Every commit to the master branch should be built;

Employ Build Automation;

Make sure that artifacts are available as installation packages;

Automated Unit Testing;

Build process should be fast;

Keep the set of pre-production environments. Configure automated deployments to them;

Make Build Results Available;

Tune the process to the point of continuous deployment to production.

If you choose a single tool, you may find over time that some of the steps are hard to achieve, or that they become an impediment to your business, slowing it down and impacting cost. If that is where you are, look for other tools, use multiple tools, or try different solutions. Be open.

To get continuous integration right, you need to persist and be a leader! You will have a long path of team building and nurturing a CI culture. You will spend much time selling the idea to your peers, superiors, and the team. Know that not every team member will adopt the plan right away. Take it one step at a time.

Getting CI working for your company can be a fun quest; it should not be a bloody conquest. Don't get upset; give everybody their time, persist, and become an example of excellence.

CI sounds cool, but it might not be good for your project - to use it or not is a decision that you need to make.


For instance, suppose you want to enforce a naming convention where objects should not start with the "spData" prefix. Here is a simple unit test that searches through object names and definitions in the target database and tells you if there is any naming convention issue:
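A sketch of such a test, with SQLite's sqlite_master catalog standing in for the target database; a T-SQL version would likely query sys.objects and sys.sql_modules instead:

```python
import sqlite3

def naming_convention_violations(conn, banned_prefix="spData"):
    """Return names of objects that violate the convention, checking
    both object names and object definitions (the sql column)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE name LIKE ? OR sql LIKE ?",
        (banned_prefix + "%", "%" + banned_prefix + "%"),
    ).fetchall()
    return [name for (name,) in rows]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (OrderID INTEGER);
CREATE VIEW spDataOrders AS SELECT * FROM Orders;  -- violates the rule
""")

violations = naming_convention_violations(conn)
print(violations)  # ['spDataOrders']
```

In a CI pipeline, the test would simply fail the build whenever the returned list is non-empty.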