This really is a problem, and it gets more serious with larger tables.

I’m going to explain what the problem is from an analyst’s point of view.

What LIKE does for us

Like in the SQL WHERE clause

LIKE identifies rows which match the criteria. It can be used with character columns and combined with other conditions.

The first time I encountered this statement I was amazed that it worked at all! This simply demonstrates how SQL protects us from what is really happening underneath the surface.

Like with a trailing wildcard using an index

Appearances can be deceptive. LIKE only appears to match the search argument against the contents of the table column. It has to use an index in order to perform properly. The entries in the index are sorted. SQL uses the sort sequence with “trailing” wildcards, to get to the first match quickly, and then ignore any entries past the last match.

The problem with leading wildcards

Valid LIKE criteria – But two may cause problems

SQL cannot use an index to matching entries when there is a leading wildcard. As a consequence it will probably scan the whole table. This will work but it may take longer than you want.

The problem gets worse for tables with more rows because with a leading wildcard LIKE has to test every row.

There will be times when you need to use a leading wildcard it in a query. Go ahead and use it, just expect poor performance. If this query which will be used repeatedly, or in a program, then discuss what you are trying to do with the people who look after your database.

Summary

Phil Factor describes leading wildcards with LIKE as an SQL Smell. This may be unavoidable in some ad-hoc queries, but you should treat it with caution if you intend to reuse the query or use it with large tables.

Where next?

The next, and final, blog post is on “Floating Point Numbers”. Do you use floating point numbers in your database? Phil Factor thinks this is an SQL Smell too. Find out why in the next article.

Phil Factor describes this as an SQL Smell, and it is. Finding it should immediately make you suspicious. The problem is not the statement, it’s where you find it!

Interactive and Batch SQL

SQL can be used: Interactively, in Batch files and in Programs

You can use SQL interactively, in “batch”, and inside programs. One of the good things about SQL is that it looks pretty much the same wherever you find it.

“SELECT *” is intended to be used interactively. That’s how I use it, and I expect Phil Factor does the same. Typing the statement in the figure at a command line, or inside a development environment like SSMS is completely appropriate.

Some people create queries interactively using “SELECT *” as a starting point. That’s legitimate too. It’s a matter of personal style.

Don’t use this form of SELECT in a program or when you expect to reuse it. If you save the file, you shouldn’t be using “SELECT *”.

Why Using “SELECT *” is a problem

SELECT * will continue to work even if columns are removed and added!

Sometimes we want things to break! We want something to fail before something worse happens.

You can change the design of tables in a database. One way is using the ALTER statement. Columns can be added and removed.

“SELECT *” will continue to return a result even when the tables it is using have changed significantly. This is a problem because we don’t know if it is still doing what we originally intended!

Legitimate uses of SELECT *

Legitimate, safe uses of SELECT *

There are a few ways you can use an asterisk in a SELECT statement without taking a risk. That is when you are checking if something exists, or counting the number of rows. In both cases the columns of the tables are irrelevant.

Summary

Phil Factor identifies “SELECT *” as an SQL Smell. It can be used interactively, but almost anywhere else it has the potential to cause problems.

Where next?

Do you use “LIKE” in searches? There times when Phil Factor thinks this is an SQL Smell too. Find out why in the next article.

Phil Factor identifies several SQL Smells associated with the use of DateTime, Date and Time data-types. Using the wrong types will waste space, harm performance and create “odd” behaviour.

If you are clear about what you are recording, you will avoid these issues. I prefer to say what I want, and let an expert choose the best date-types. In other words, I prefer to separate the analyst and designer roles. That way I avoid suggesting the wrong types.

DateTime, Date or Time – Which do you need?

DateTime columns which contain only Date or Time contribute two of SQL Smells. This wastes space, both in storage and in memory (which will degrade performance).

There is something worse here. Using DateTime in this way suggests we are not clear about what we want. This lack of clarity encourages the designer to “hedge their bet” by using DateTime.

How precise do you need to be?

Business Systems need to record dates and times, but they don’t need great precision. For many business transactions, the nearest second or even minute is adequate. In some cases recording extra precision can be misleading.

Storing Durations

For a “Period” you need to know 3 things: Start, End and Duration. You only need to store two out of the 3. The third value can always be calculated if you have the other two.

Finding if an Event is inside a Period is easier using Start and End

Storing the “start” and “end” is more flexible. It is straightforward to work out if an event took place within a given period.

Finding if Periods Overlap is easy using Start and End times

Telling if periods overlap is easy too.

If you decide to store a duration you must specify the units you intend and the precision you need. (As Phil puts it “milliseconds? Quarters? Weeks?”). Do not be tempted to store a duration in a “Time” data-type! “Time” is intended for “Time-of-day” not duration.

Dates and Times: Choosing the right data-type – Some simple questions

Choosing appropriate data-types for dates and times is not difficult, if you go about it the right way.

Divide the job into two steps: “The Requirement” and the “The Most suitable technical data-type”. Do the two steps separately, taking into account any local standards and conventions.

The Requirement

Is this an “Event” (a “Point in Time”) a “Period” or a “Duration” (both have beginning and an end)?

Event: What is the best way to represent this data? Should it be a “Date” or a “Time”? Does it really need “Date-time”?

Duration: What are the units and scale of this duration?

How precise do you need the value to be? You may surprise yourself.

How long do you expect this system (and the data in it) to last?

The Most Suitable Technical Data-type

Technical Properties of Date and Time data-types in Microsoft SQL Server

Use the table (based on Microsoft’s Documentation) to choose the best data-type for your needs. This table is for SQL Server. Other database managers will have similar tables.

Microsoft recommends using DATETIME2 in preference to DATETIME for new work (it takes up less space and is more accurate). Providing the maximum date is acceptable, I would consider SMALLDATETIME for business transactions (but you do risk creating a “Year 2000 problem” if the data turns out to have a long life).

If your system will span several time-zones, then you should definitely consider the benefits of using the DATETIMEOFFSET data-type.

Summary

The DateTime, Date and Time data-types can all cause SQL Smells when they are used inappropriately. Problems can be avoided by following some simple guidelines.

Where next?

Phil Factor doesn’t like “SELECT *”. Find out why in the next article.

Here’s another of Phil Factor’s SQL Smells. “Using the same column name in different tables but with different data-types”.

At first glance this seems harmless enough, but it can cause all sorts of problems. All of them are avoidable. If you are an analyst, make sure you are not encouraging this to happen. If you creating the Physical Model or DDL for a database, “Just don’t do it!”

Two “rights” can make a “wrong”!

The problem here is not “using the same column name in different tables”. That is perfectly ok. Similarly, using “different data-types for different columns” cannot be wrong. That’s exactly what you should expect to do.

The problem is doing both at the same time. The issues are: practical and technical.

The Practical Problem of the same column name with different data-types

Any human user is likely to think that the same name refers to the same type of thing. They won’t check that the definitions of both “names”. No amount of “procedures” or “standards” will make them do anything different.

Sooner or later this will cause one of the technical problems.

The Technical Problems from the same column name with different data-types

Technical problems will occur when a value moves from one column to the other, or when comparing the two columns. Data may be truncated and those data transformations cost effort.

These problems may not manifest themselves immediately. The consequences will be data-dependent bugs and poor performance.

The Solution to this Issue

This smell and the associated problems can be avoided by following some simple rules:

If two columns refer to the same thing (like a foreign key and a primary key), make sure they are the same data type.

If two columns refer to different things, then give them distinct names. (Do not resort to prefixing every column name with the table name. That’s horrible)

Having columns with have different names and the same data-type is perfectly OK.

Summary

“Using the same column name in different tables with different data-types” in an SQL database is simply “an accident waiting to happen.” It is easily avoided. Don’t do it and don’t do anything to encourage it.

NULLs are a perennial problem. Nobody likes them. They confuse developers and users and many analysts do not really understand them.

The concept of NULL allows us to say that there are things that we do not know.

In his article on SQL Smells, Phil Factor associates several smells with NULLs. In this post I’ll explain how to avoid using NULLs and how to use them properly when they are necessary.

Why do we need NULLs at all? What is the benefit and what is the cost?

NULL has a simple meaning with wide-ranging and surprising consequences. NULL means the value is unknown. And this in turn means that the result of any calculation or concatenation which uses this value must also be unknown!

NULL is a feature of SQL. The benefit of allowing NULL or “three valued logic” (TRUE/FALSE/UNKNOWN) is that it allows a database to record that there are things we do not know. The cost of having them is that any calculation or concatenation which uses this value must also be unknown! This confuses many people.

Reasons for needing NULL

There are many reasons why we might not have data to put into a column. Thinking about why we are considering defining a column as NULLable will encourage us to consider alternatives.

The structural NULL – Permanent sub-types

Customer with Sub-types which may cause NULLs

Sometimes we want to combine two entities into a single “super-type” table. There are attributes of Person will never be used for Business (and vice-versa). These missing values will need NULLs.

The structural NULL – Lifecycle subtypes

Order with Lifecycle sub-types which may cause NULLs

Something similar can happen if we combine all the steps of a entities lifecycle into a single table. The attributes of the later stages will always be empty (NULL) until that stage is reached. These later steps often contain dates or times.

In both cases involving sub-types it may be possible to splitting the sub-types into separate tables. Consider whether it is worth the effort and make sure you avoid the pitfalls of sub-types in SQL.

Data that will never be there

Attributes and Values for an “Address”

Entities like “Address” are frequently modelled with attributes like “AddressLine_”. In many cases there will never be values for the later lines. They will not be mandatory in the user interface, but do they need to be NULL? Consider whether allowing them to default to “spaces” or an empty string, would be better and whether it would have any bad effects.

Things which should never allow NULL

There are some things which should hardly ever allow NULL. This includes all keys and identifiers. Avoid allowing short titles or descriptions to be NULL (For long descriptions allowing NULLs is understandable).

Unavoidable NULLs

Attributes and Values for a “Person” Entity – Sometimes NULL is hard to avoid

There are some attributes where allowing NULL is hard to avoid. Life insurance and pensions companies may need a “date of death” for their customers! Having a column with allows NULL is often the easiest way of handling this.

You should resist the temptation to use “magic dates” or inappropriate data-types in order to avoid allowing NULL. The consequences are far worse than the problem.

Summary

NULLs are a problem, nobody likes them but they are necessary. Many problems with NULLs can be avoided by two rules:

Remember that NULL means “unknown value” and this has consequences.

Ask “_why_ don’t we have this data?”

In many cases NULLs can be avoided by data modelling – that means the analyst has to do work in the Conceptual or Logical Model.

Where next?

The next article is about another smell: having the same name for different things!