
Here’s another of Phil Factor’s SQL Smells: “Using the same column name in different tables but with different data-types”.

At first glance this seems harmless enough, but it can cause all sorts of problems. All of them are avoidable. If you are an analyst, make sure you are not encouraging this to happen. If you are creating the Physical Model or DDL for a database, “Just don’t do it!”

Two “rights” can make a “wrong”!

The problem here is not “using the same column name in different tables”. That is perfectly ok. Similarly, using “different data-types for different columns” cannot be wrong. That’s exactly what you should expect to do.

The problem is doing both at the same time. The issues are both practical and technical.

The Practical Problem of the same column name with different data-types

Any human user is likely to think that the same name refers to the same type of thing. They won’t check the definitions of both “names”. No amount of “procedures” or “standards” will make them do anything different.

Sooner or later this will cause one of the technical problems.

The Technical Problems from the same column name with different data-types

Technical problems will occur when a value moves from one column to the other, or when the two columns are compared. Data may be truncated or implicitly converted, and those conversions cost processing effort.

These problems may not manifest themselves immediately. The consequences will be data-dependent bugs and poor performance.
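To make the risk concrete, here is a sketch in T-SQL. The tables and names (Customer, CustomerOrder, customer_ref) are all invented for illustration:

```sql
-- Hypothetical example: "customer_ref" means the same thing in both
-- tables, but has been given two different data-types.
CREATE TABLE Customer (
    customer_ref  VARCHAR(20)  NOT NULL PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE CustomerOrder (
    order_id     INT NOT NULL PRIMARY KEY,
    customer_ref INT NOT NULL   -- should have been VARCHAR(20)!
);

-- INT has higher data-type precedence than VARCHAR, so SQL Server
-- implicitly converts Customer.customer_ref to INT for every row.
-- That defeats any index on it, and the query fails at run time the
-- moment a value such as 'CUST-001' appears in the Customer table.
SELECT o.order_id, c.customer_name
FROM CustomerOrder AS o
JOIN Customer AS c ON c.customer_ref = o.customer_ref;
```

The nasty part is that the join may appear to work for months, as long as every reference happens to look numeric: a classic data-dependent bug.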

The Solution to this Issue

This smell and the associated problems can be avoided by following some simple rules:

If two columns refer to the same thing (like a foreign key and a primary key), make sure they are the same data type.

If two columns refer to different things, then give them distinct names. (Do not resort to prefixing every column name with the table name. That’s horrible)

Having columns with different names and the same data-type is perfectly OK.
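If you want to hunt for existing offenders in a SQL Server database, something along these lines against the catalog views will do it. The views are standard; the query itself is my own sketch:

```sql
-- Find column names which appear in more than one table with more
-- than one definition (type, length, precision or scale).
SELECT col.name  AS column_name,
       tab.name  AS table_name,
       typ.name  AS type_name,
       col.max_length,
       col.precision,
       col.scale
FROM sys.columns AS col
JOIN sys.tables  AS tab ON tab.object_id = col.object_id
JOIN sys.types   AS typ ON typ.user_type_id = col.user_type_id
WHERE col.name IN
      (SELECT c2.name
       FROM sys.columns AS c2
       JOIN sys.tables  AS t2 ON t2.object_id = c2.object_id
       GROUP BY c2.name
       HAVING COUNT(DISTINCT CONCAT(c2.system_type_id, ':', c2.max_length,
                                    ':', c2.precision, ':', c2.scale)) > 1)
ORDER BY col.name, tab.name;
```

Expect some false positives (a generic name like “description” may legitimately vary), but every row is worth a look.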

Summary

“Using the same column name in different tables with different data-types” in an SQL database is simply “an accident waiting to happen.” It is easily avoided. Don’t do it and don’t do anything to encourage it.

My previous post got me thinking. One thing leads to another as they say. The whole of my process really requires that you have a Data Model (aka Entity Relationship Diagram/Model and several other names). But what do you do if you don’t have one, and can’t easily create one?

What’s the problem?

Suppose you have a database which has been defined without foreign key constraints. The data modelling tools use these constraints to identify the relationships they need to draw. The result is that any modelling tool is likely to produce a data model which looks like the figure below. This is not very useful!

Faced with this, some people will despair or run away! This is not necessary. Unless the database has been constructed in a deliberately obscure and perverse way (much rarer than some developers would have you believe) then it is usually possible to make sense of what is there. Remember, you have the database itself, and that is very well documented! Steve McConnell would point out that the code is the one thing that you always have (and in the case of a database, it always matches what is actually there!).

To do what I propose you will need to use the “system tables” which document the design of the database. You will need to know (or learn) how to write some queries, or find an assistant who understands SQL and Relational Databases. I’ve used MS SQL Server in my examples, but the actual names vary between different database managers. For example: I seem to remember that in IBM DB/2 that’s Sysibm.systables. You will have to use the names appropriate for you.

The method

“Method” makes this sound more scientific than it is, but it still works!

Preparation: Collect any documentation

“Brainstorm:” Try to guess the tables/entities you will find in the database.

List the actual Tables

Group the Tables: based on name

For each “chunk”, do the following:

Identify “keys” for tables: Look for Unique indexes.

Identify candidate relationships: Based on attribute names, and non-unique indexes.

Draw your relationships.

“Push out” any tables that don’t fit.

Move on to the next group.

When you’ve done all the groups, look for relationships from a table in one group to a table in another.

Now try and bring in tables that were “pushed out”, or are in the “Miscellaneous” bucket.

Repeat until you have accounted for all the tables.

At this point you are probably ready to apply the techniques I described in my previous post (if you haven’t been using them already). You might also consider entering what you have produced into your favourite database modelling tool.

The method stages in more detail.

Preparation:

Collect whatever documentation you have for the system as a whole: Use Cases, Menu structures, anything! The important thing is not detail, but to get an overview of what the system is supposed to do.

“Brainstorm:”

Based on the material above, try to guess the tables/entities you will find in the database. Concentrate on the “Things” and “Transactions” categories described in my previous post.

Don’t spend ages doing this. Just long enough so you have an expectation of what you are looking for.

Remember that people may use different names for the same thing e.g. ORDER may be PURCHASE_ORDER, SALES_ORDER or SALE.

List the tables:

select name, object_id, type, type_desc, create_date, modify_date
from sys.tables;

(try sys.views as well)
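If you would rather see tables and views in one list, a variation (again a sketch against the standard catalog views) is to use sys.objects and filter on type:

```sql
-- 'U' = user table, 'V' = view
select name, object_id, type, type_desc, create_date, modify_date
from sys.objects
where type in ('U', 'V')
order by type, name;
```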

Group the tables

Group the tables based on name: ORDER, ORDER_ITEM, SALES_ORDER and PURCHASE_ORDER would all go together.

Break the whole database into a number of “chunks”. Aim for each chunk to have say 10 members, but do what seems natural, rather than forcing a particular number. Expect to have a number of tables left over at the end. Put them in a “Miscellaneous” bucket.
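You can get a rough first cut at the grouping from the database itself. The query below is my own sketch, and it assumes an underscore-separated naming convention (adjust for whatever convention your database uses):

```sql
-- Group tables by the first word of their name, as a starting point
-- for "chunking". Tables with a unique prefix will surface as
-- candidates for the "Miscellaneous" bucket.
select
    case when charindex('_', name) > 0
         then left(name, charindex('_', name) - 1)
         else name
    end as name_group,
    count(*) as table_count
from sys.tables
group by
    case when charindex('_', name) > 0
         then left(name, charindex('_', name) - 1)
         else name
    end
order by table_count desc;
```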

Identify the candidate keys, and foreign keys from the indexes

select tab.name, idx.index_id, idx.name, idx.is_unique
from sys.indexes as idx
join sys.tables as tab on tab.object_id = idx.object_id
where tab.name like '%site%';

select tab.name, col.column_id, col.name
from sys.columns as col
join sys.tables as tab on col.object_id = tab.object_id
where tab.name like '%PhoneBook%'
order by 1, 2;
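To judge whether a unique index is a plausible key, you also need to know which columns it covers. Joining in sys.index_columns (another standard catalog view; the query is my sketch) gives you that:

```sql
-- List the columns of each index, in key order.
select tab.name  as table_name,
       idx.name  as index_name,
       idx.is_unique,
       ixc.key_ordinal,
       col.name  as column_name
from sys.indexes as idx
join sys.tables  as tab on tab.object_id = idx.object_id
join sys.index_columns as ixc
     on ixc.object_id = idx.object_id and ixc.index_id = idx.index_id
join sys.columns as col
     on col.object_id = ixc.object_id and col.column_id = ixc.column_id
order by tab.name, idx.name, ixc.key_ordinal;
```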

From attribute names and indexes, identify relationships

Sometimes the index names are a give-away (FK_…)

Sometimes you have to look for similar column names (ORDER_LINE.ORDER_ID → ORDER.ID)
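The column-name matching can itself be done with a query. This sketch (my own, against the standard catalog views) pairs up columns in different tables which share a name:

```sql
-- Candidate foreign-key/primary-key matches: the same column name
-- appearing in two different tables.
select t1.name as table_a,
       t2.name as table_b,
       c1.name as shared_column_name
from sys.columns as c1
join sys.tables  as t1 on t1.object_id = c1.object_id
join sys.columns as c2 on c2.name = c1.name
join sys.tables  as t2 on t2.object_id = c2.object_id
where t1.object_id < t2.object_id   -- each pair listed once
order by c1.name, t1.name;
```

The output will be noisy: it is a starting list of candidates to investigate, not an answer.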

I gave an answer in the forum, but here is an expansion, and some ponderings.

First of all, let’s set out the terms of reference. The question asks about a “large unfamiliar” database. I think we can assume that “unfamiliar” is “one that we haven’t encountered before”, but what is “LARGE”? To me “large” could be:

Lots of tables

Many terror-bytes 😉

Lots of transactions

Lots of users

There may be other interpretations

I’m going to go with “Lots of tables”, with the definition of “lots of” being:

“more than I can conveniently hold in my head at one time”

I’ve also assumed that we are working with a “transactional database” rather than a “data warehouse”.

Preparation

Gilian, the questioner, was given some good suggestions, which I summarised as “Collecting Information” or perhaps “Preparation”:

Understand objectives of “The Business”

Understand the objectives of “This Project” (Digging into the Database)

Collect relevant Organisation charts and find out who is responsible for doing what

Collect relevant Process Models for the business processes which use the database

Of these, the one which is specific to working with a database is the ERD (Entity Relationship Diagram). Having a diagram is an enormous help in visualising how the bits of the database interact.

Chunking

For me, the next step is to divide the model into “chunks” containing groups of entities (or tables). This allows you to:

Focus – on one chunk

Prioritise – one chunk is more important or more interesting than another, or will be done before it

Estimate – chunks are different sizes

Delegate – you do that chunk, I’ll do this one

And generally “manage” the work: do whatever the project objectives require.

I would use several techniques to divide the database or model up into chunks. These techniques work equally well with logical and physical data models. It can be quite a lot of work if you have a large model. None of the techniques are particularly complicated, but they are a little tricky to explain in words.

Here is a list of techniques:

Layering

Group around Focal Entities

Process Impact Groups

Realigning

Organise the Data Model

I cannot over-emphasise how important it is to have a well-laid-out diagram. Some tools do it well, some do it less well. My preference is to have “independent things” at the top.

I’ve invented a business.

We take ORDERs from CUSTOMERs.

Each ORDER consists of one or more ORDER_LINES and each line is for a PRODUCT.

We Deliver what the customer wants as DELIVERY CONSIGNMENTS.

Each CONSIGNMENT contains one or more Batches of product (I haven’t got a snappy name for that).

We know where to take the consignment by magic, because we don’t have an Address for the Customer!

We reconcile quantities delivered against quantities ordered, because we sometimes have to split an order across several deliveries.

That’s it!

Layering

“Layering” involves classifying the entities or groups of entities as being about:

Classifications

Things

Transactions

Reconciliations

Things

Let’s start with “Things”. Things can be concrete or abstract. We usually record a “Thing” because it is useful in doing our business. Examples of Things are:

People

Organisations

Products

Places

Organisation Units (within our organisation, or somebody else’s)

Classifications

Every business has endless ways of classifying “Things” or organising them into hierarchies. I just think of them as fancy attributes of the “Things” unless I’m studying them in their own right.

Note: “Transactions” can have classifications too (in fact almost anything can and does), I’ve just omitted them from the diagram!

Note: The same structure of “Classification” can apply to more than one thing. This makes sense if, for example, the classification is a hierarchy of “geographic area”. Put it in an arbitrary place, note that it belongs in other places as well, and move on!

Transactions

Transactions are what the business is really interested in. They are often the focus of Business Processes.

Order

Delivery

Booking

Where there are parts of Transactions (eg Order_Line) keep the child with the parent.

Reconciliations

“Reconciliations” (between Transactions) occur when something is “checked against something else”. In this case we are recording that “6 widgets have been ordered” and that “3 (or 6) have been delivered”.

If you use these “layers”, arranged as in the diagram, you will very likely find that the “one-to-manys” point from the top (one) to the bottom (many) of the page.

Groups around Focal Entities

To do this, pick an entity which is a “Thing” or a “Transaction” then bring together the entities which describe it, or give more detail about it. Draw a line round it, give it a name, even if only in your head!

“Customer and associated classifications” and

“Order and Order_line” are candidate groups.

Process Impact Groups

To create a “Process Impact Group”:

Select a business process

Draw lines around the entities which it: creates, updates and refers to as part of doing its work.

You should get a sort of contour map on the data model.

In my example the processes are:

Place Order

Assemble Delivery Consignment

Confirm Delivery (has taken place)

It is normal for there to be similarities between “Process Impact Groups” and “Focal Entity Groups”. In fact, it would be unusual if there were not similarities!

Realigning

Try moving parts under headers (so, Order_line under Order) and reconciliations under the transaction which causes them. In the diagram, I’ve moved “Delivered Order Line” under “Delivery”, because it’s created by “Delivery related processes” rather than when the Order is created.

Finally, “Chunking”

Based on the insights you have gained from the above, draw a boundary around your “chunks”.

The various techniques are mutually supportive, not mutually exclusive. The chunks are of arbitrary size. If it is useful, you can:

combine neighbouring chunks together or

you can use the techniques (especially “Focal Entities” and “Process Impact Groups”) to break them down until you reach a single table/entity.

Tools

My preferred tools for doing this are: a quiet conference room, late at night; the largest whiteboard I can find; lots of sticky Post-its or file cards (several colours); a pack of whiteboard pens; black coffee on tap and the prospect of beer when I’ve finished. For a large database (several hundred tables) it can take several days!

There are times when I want to do something, and it gives rise to questions. Often I have to put the questions to one side in order to get on with the thing which is my immediate priority. When that happens, I put the questions to one side with the intention of coming back to them in the future. This is one of the occasions when I have had the opportunity to go back and answer some of the questions.

The questions in this case were:

How does Server-side security work on MS SQL Server? And

How can I control the access different users have to my database?

I’m not (have never been, and do not really intend to become) a DBA (Database Administrator). I understand a bit about “privileges”, but in the past I’ve always had someone else “doing it for me”, and in any case, I’ve worked on other databases such as DB/2 or Oracle.

After a little bit of research and a bit of experimentation (fiddling around), I found the answer to my questions. Although it was hardly earth-shattering, the understanding is satisfying.

The key is to understand that in MS SQL Server the “Server Instance” and the “Database” are separate entities. It is possible to have several databases inside the same Server instance. In fact I do it all the time when I’m experimenting. This means that there are two separate things to define:

A LOGIN, which gives access to the Server Instance, and with Server-side authentication, provides the security, and

A USER, which belongs to the Database and is granted the Database privileges (including CONNECT).

Figure 1 Server-side security for MS SQL Server

Summary of the stages:

Ensure that SQL Server is set up to allow Server Authentication

Create the LOGIN (in the SQL Server Instance)

Create a USER corresponding with the LOGIN (in the Database)

Grant the USER CONNECT privilege (happens by default, but can be revoked)

Grant the USER the appropriate privileges on the Database
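The stages above translate into a few lines of T-SQL. The names here (analyst_login, analyst_user) are hypothetical, and the role membership at the end is just one example of “appropriate privileges”:

```sql
-- Run in the Server Instance: create the LOGIN (Server authentication).
CREATE LOGIN analyst_login WITH PASSWORD = 'Str0ng-Passw0rd!';

-- Run in the target Database: create the USER for that LOGIN.
CREATE USER analyst_user FOR LOGIN analyst_login;

-- CONNECT is granted by default, but can be granted (or revoked) explicitly.
GRANT CONNECT TO analyst_user;

-- Then whatever privileges are appropriate, e.g. read-only access:
ALTER ROLE db_datareader ADD MEMBER analyst_user;
```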

If you’re interested in taking this a bit further, I’ve summed all this up in a video on YouTube.

I wouldn’t have described myself as an “absolute beginner”, but I found plenty to enjoy in the course and came away having learned quite a bit about what is going on inside Oracle, and I assume most other database managers.

Circumstances influence what we do in life and so far I have had much more exposure to DB/2 and MS SQL Server than to Oracle. That hasn’t been a decision on my part, simply the choices that had been made for the projects I was involved in.

In a similar way, I’ve spent much more time “dealing with users” as a Business Analyst, than I have working out how to manage the space requirements and performance of a database. It does me good to learn just a little about the things a DBA has to consider. I don’t have to let those considerations govern what I consider the requirements to be, but at least I can understand where other people are coming from.

Taking the course led me to what you might consider “meta” thinking: thinking not about the content of the course, but about the way it was presented and the platform, Udemy, on which it was presented.

I find Udemy interesting. It seems to work well. It certainly worked for me.

Udemy seem to be aiming to be a “neutral marketplace”. The courses belong to the course instructors. Of course Udemy have standards for courses, but beyond the usual “fit to print” conditions, they are mostly technical standards (quality of video and sound) rather than subject matter related. In a similar spirit, Udemy promote the platform, but the promotion I have seen seems to be fairly neutral with regard to individual courses. On the other hand, instructors or course owners are completely free to advertise their wares elsewhere and direct potential customers into Udemy. It’s a simple model which I think I will investigate further.