Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Can everyone hear me? My name is Brian Fenton and I’ll be talking about Table Design. I’ll leave some time for questions at the end, but if you have a question during the talk or think I’m going too quickly, feel free to raise your hand, start shouting obscenities, throw a shoe at me, whatever. This is an end table that can be converted into a club and shield, to illustrate that a well-designed table can be excellent protection.

I started out with SQL Server 2000, then later moved on to Oracle and finally MySQL. I’ve never used Postgres personally,but I hear nothing but good things about it. I’m going to try to keep this talk DB agnostic as much as possible

During this talk I’m going to cover some basic tenants of table design, requirements gathering, picking good naming conventions, and how to normalize your data to avoid data integrity issues. Later on I’ve got some example data that we’ll design some tables to store, and I’ll finish up with some design anti-patterns and some potential solutions. Many of the examples here came from my past experiences and are therefore going to be US-centric, but I’ll try to call out those issues when they come up.Keep in mind that this is intended as a beginner level talk, so if you’ve been doing this for awhile you might find this a little basic, but I’m happy to answer questions. My hope is that I’ll at least introduce some areas for further study that you may not have considered. Also, the answer to pretty much any database design question is “It Depends”, so be aware that these are more guidelines than hard and fast rules. I still see them broken all the time, sometimes with good reason, but you should still know why the rule is there before you deviate from it.

Say you’ve been asked to design a system. Odds are extremely low that this will be built in a vacuum; most likely you’re going to be replicating an existing process. For starters, you’re going to want to find out what data the system is supposed to capture. Now this can be a pretty lengthy process and you’ll want to make sure you get it right to the best of your ability before it goes into production. If this is a system built on paper then a lot of your job is done for you, because you have paper forms that you can use as a guideline. However, this isn’t going to be enough…

Make sure you actually shadow users doing this process to find out if there’s any extra data that they enter from another place that isn’t part of the form, or if there’s fields they ignore that haven’t been taken off of the form since it was last printed, that sort of thing. Also ask what pieces of data are required, what’s optional, and what formats the data should support. Present your results back to the users to make sure you didn’t miss anything, and try get as much understanding as you can about what each piece of data is used for in the overall process. You may uncover new pieces of data or new relationships that way.

Every table needs one. For your database to be useful, you need a way to uniquely identify all the rows in your tables. These values have to be unique per table, and they shouldn’t change over time. Auto-incrementing integers are your friend here. If you use oracle, you’re stuck with sequences and triggers, and I’m sorry. One thing that they actually don’t have to be is consecutive. It’s perfectly fine to have gaps in your IDs. I’ve seen a lot of beginning database people ask what the best way is to compact their IDs and recover the missing values after they delete records, and that’s just not necessary.

Any value or combination of values in your table that can be used to uniquely identify a record is called a natural, or a candidate key, and if you want to use them that way, they can be made into your table’s primary key, Anything that’s sort of arbitrarily assigned to your table, like an auto-incrementing ID, is called a surrogate key. However, if you want to use something other than an integer ID, think really long and hard about it. Lots of people think that they have fields that would work great as a primary key for a table, but they’re usually wrong, and they often don’t find out until after their system has been in production for awhile and it’s difficult to change. The most common choice here is social security number, for those in the US. What’s less commonly known is that not everyone has one, they’re not guaranteed to be unique, and they can change over time. All three of these things make it a bad choice for a primary key. The best example I can think of for using a non-integer key is US Postal abbreviations for State, but data sets like that are few and far between. Stick with auto-increment. Finally, if you do go with text strings as keys, try to keep them small. The larger the size of your keys in bytes, the slower your system will become when processing them, but there are many instances involving things like replication where using a GUID for a key has advantages.

I love me some foreign keys. Any time the primary key of one table is included in a different table, that’s called a foreign key. That’s how we define all the wonderful relationships that make your database into a relational database. If you have two tables, A and B, if you include the ID from table A as a column in table B, that’s a foreign key. They also allow you to define foreign key constraints, so that the value of the foreign key in table B has to exist in table A, or it can be null if you allow that. This prevents the database from accepting any random garbage people might throw at it, for at least that one column. That’s why it’s often a good idea to create a whole table just to store valid values of a certain type, like product codes or states or whatever. Even if you think the data is never going to change, that’s totally fine. A lookup table like that that’s only used for validation will probably only cost you a few kilobytes of space, but for that you get a strong guarantee of validity.

Relationships are defined by the presence of foreign keys, and ideally, foreign key constraints. If a table has a foreign key from another table, then that first table has a relationship to the second. There are three main types of relationships that SQL supports. The first one is a one to one relationship, where one record in the first table relates to zero or one record in the second table. An example of this would be something like a an address to a state. Each address can only be a part of a single state. In many ORMs you have the concept of “has one” or “belongs to”, that’s a one to one relationship, at least on the part of the record that belongs to another model. The next type of relationship is a one to many relationship. An easy example here is something like bike parts. A bicycle is composed of many parts, but each part belongs to only one bike. In ORM terms this would be a Has Many relationship. The final type of relationship is a Many to Many, where there’s more than one potential combinations on either side. These actually require an intermediate table to represent the combinations. Here an example might be cocktail recipes. A recipe could contain any number of ingredients, and a specific ingredient may be part of any number of cocktails. Keep in mind that relationships are not guaranteed to be bidirectional, so in the bike part example, from the perspective of an individual bike part, it’s a one to one relationship. Relationships don’t have to be reciprocal either. In the address example, the states table has no reason to have a relationship to the address.

Constraints are used to restrict what data can get into your system. They’re most commonly used to ensure that column values are unique or in the case of a foreign key constraint, ensure that a value in the foreign key column exists in the foreign table. They can also be used to verify, say, that a field can only be a “Y” or an “N” character, or that a string field matches a regular expression. Some constraints, like unique and foreign key, require the column to be indexed in order to be used.

Define columns as strictly as you can get away with. The default of varchar(255) is almost never the best choice for a data type. The database is your last line of defense for your application, and if it rejects bad data because it’s the wrong type, doesn’t match a format mask, is too large, etc… then you have one fewer piece of bad data in your system that your application code will have to handle. I’m not saying that you shouldn’t also have validation on the application side, you absolutely should, but on the off chance that something sneaks through, there’s no reason the database has to be permissive about handling it. In an API you generally want to be permissive in what you accept and strict in what you return. Databases are the opposite if that. You want to be extremely strict in what you accept, and whatever sort of things users can write a query for, they can get. Also, don’t just limit yourselves to characters and ints, you also have set/enum types to restrict your input to a known set of values. If that set of values is large enough, add a table storing all those possible values and use foreign key constraints to enforce validation. The more strict you can be about your inputs, the less time you’ll have to spend cleaning data later.

Cleaning data sucks. You can spend hours or days of your life just editing fields that an older system allowed but your new system doesn’t. This also applies to splitting apart fields that used to have multiple values in them into their own unique columns. You also can’t just blindly rely on pattern matching either, because you’ll still have to go through and verify all the changes. You might also end up with records where the old system had certain fields as optional and in the new one they’re required, so now you have to either find out what that missing information was, fill in some sort of placeholder value, or relax that requirement, just because the system you inherited was too lax.

Database normalization is your friend. It’s based mostly on the work of a man named E.F. Codd, who came up with a set of laws about what defines a properly created relational database. If your system adheres to the rules, it’s considered to be in normal form, and each rule includes those below it. So if your database is in second normal form, it also satisfies the first rule. Third normal form is usually as far as people go, but there are a number of higher levels than that. The point of normalizing your data is that it prevents things that make it possible to violate your data integrity in some way, by having duplicate rows, or two rows representing the same thing but containing different data about it. It also tries to help avoid orphaned records, where you have a parent record with multiple child records related to it, and you delete the parent but leave the child records intact. I’m not going to call out each specific rule as I cover them, but if you follow the guidelines I present you’ll end up at third normal form or higher.

You may hear people talking about how you should denormalize your database for performance reasons, but nobody can give you the right answer for your specific set of circumstances without seeing your data and your actual performance in a live environment. Again, the answer here is “it depends”. This particular part of database optimization follows many of the same rules about optimizing your code. Avoid premature optimization, and it’s much easier to denormalize normalized data than it is to take so-called “performant” design and give it better referential integrity.

This one is pretty easy. Don’t store the same piece of data multiple places in your system. The whole reason you have a RELATIONAL database is so you can have a single source of truth for each piece of data in your system, and then store references to it everywhere else. Most people think they do this, but in practice a lot of them don’t go far enough. For instance, does your system store addresses in their own table or do you have employee addresses, customer addresses, billing addresses, etc… spread all throughout your other tables?This also shows a theme to this talk… Just like in object-oriented, nearly everyone has tables that are too big. There’s almost always an option to split tables into smaller chunks. Now most people don’t want to write queries to get data out of a system like that, which is why many of them exist in the first place, but to solve that you can use views to present the version that your users want to see, while sleeping easy knowing that your database design is rock solid.

Any time you have a single field holding multiple pieces of data, you’re asking for trouble. Data that’s stored like that is harder to query, harder to update, and it has to be parsed when it goes in and out of the table. Any time you see data with commas in it that isn’t a text description, or things that are separated by pipes, anything like that, it usually means somebody is trying to violate that rule. Also watch out for columns named something like “Extra data”. One common exception to this is a US phone number. They are actually three fields (trunk, line, and extension) but since people rarely care to search by area code anymore it’s usually denormalized into one.

Calculated fields introduce a dependency in your data that you’ll have to maintain. Try not to store averages, sums, monthly totals, anything like that. In doing so you risk the source data changing and the calculated data becoming out of date. An exception here is if you’re designing a read-only datamart, or having a data warehouse store aggregations for reporting. In that case you’re really using your database like a caching system and then a lot of the standard rules go out the window. A great example of a calculated column is age. You don’t want to have to update this every year, so it’s better to store birth date instead. You can take this a bit far though. In the US at least, knowing the postal code allows you to look up the city and state, but nearly everyone designing databases will store all 3 independently because people are used to seeing them together and they don’t want to take the extra effort to look them up for each interaction. Also, a usability study was done on web forms that showed that when asked to only enter their zip code without their city and state, users actually filled out the form slower even though they had to put in less information because of the cognitive dissonance.

Dates should be stored in UTC. Internally they are stored as UTC timestamps anyway, so it’s better to leave localization up to the client. I’ve gotten into hot water before by using local time and then having a piece of content go live an hour early or late when DST kicks in. It also makes math with dates really simple. Store as much precision as you need, just don’t store them as strings or you’ll spend all your time typing out TO_DATE/FROM_DATE in your queries.

Ok, I’m going to give everyone a little bit of time to think, then I want you all to close your eyes and I’ll do a show of hands.Ok, raise your hand if you think the answer is 2.8. Raise your hands if you think the answer is 4.It’s 4. NULLs are excluded from aggregation functions.

NULL is not the same thing as empty. I actually find it’s easier to understand NULL if you think of it more like “infinity” rather than something that’s not there.Anything plus NULL equals NULL.Because of these factors NULL can have kind of a destabilizing effect on your system. Not a big one, but if you truly don’t need them or can work around it, you’ll often end up with a system that behaves more predictably over time. It could be that there’s a sensible default that works better.That being said there are good uses for them too. Middle initials are a good use of NULL. People may not have one, in which case you’d use an empty string, or it’s likely that they do have one and they just never gave it to your system. Since it’s character data, it’s not likely to be used in aggregate functions

Have one, and stick to it.You don’t want to have all of your tables follow a pattern and then have to remember that two years ago you decided to use forums_categories instead of forum_categories as a table name and never changed it. In the next few slides I’m going to briefly outline a convention I use and that I’ve seen used a number of places.

Tables should be named after plural nouns. Ideally if there’s a separate noun representing a group of something you want to use that. The goal here is to encourage thinking in sets since that’s how SQL operates. The more you can get used to set-based logic, and get away from the “for each” type of processing mode you get into with PHP, the less you’ll be tempted to write a CURSOR statement. A poorly written CURSOR statement is death to performance. That being said, most people will stick with “employees” and that’s fine too.

Linking tables used to represent a many-to-many relationship should be named with the first table, an underscore, and then the second table. The first table should be singular and the second should be plural. Simple enough.Also, linking tables are an instance where you don’t necessarily need a separate auto-incrementing integer to use as a primary key ID field. You can define the primary key as the combination of, say, customer_id and address_id, and as long as there aren’t a lot of other pieces of data in that table such that you may need to reference it regularly, then you don’t need the extra column and index.

Column names shouldn’t include their parent table name. A customers table should not have a “customer_name” field, just “name” is fine. Don’t Repeat Yourself. You should also not rely on mixed case field names, because not all database collations support mixed case names. If you’re relying on mixed case to make your column names readable you might be disappointed when they come back in all caps. Just separate them with underscores. Primary keys should be called ID and foreign keys should be singular table name underscore ID. Another thing to avoid is putting the data type in the column name. We don’t need to use Hungarian notation with our code anymore, we should use it in our databases either. If those types change over time, it’s generally much less work to change the type of a column than it is the name. If you’re using MVC architecture, a column name change can result in changes through all three layers of the code base, plus the database schema itself. In addition, avoid reserved words in your table names. You never know what types of systems you may have to interface with, so avoid the whole list of reserved words regardless of what database system they come from. You don’t want to always have to wrap all your field names in backticks or whatever because you really wanted to name your description column “desc” and now SQL thinks you want an ORDER BY. Finally, lean toward internationalized versions of field names when available. Don’t use zip code when you can use postal code, swap out city for locality, state for region, that sort of thing.

A common example used to illustrate the idea of database normalization is a listing of books and their authors. I picked a few real software books and made up prices and ISBNs, so let’s see how we’d organize them.For starters, we have a column with multiple pieces of data in it. The commas are a dead giveaway. So we could add a few more columns to this table, maybe author1, author2, author3, author4? Anybody happy with that solution? OK, so like I said near the beginning, add more tables

We’ll have an authors table, and a books table, but that still only gives us at most a one to many relationship with authors and books. We can either add the more book entries, one per author, or add ISBNs to the authors table. I can tell you right now that all of these people have written more than one book, so what next? Add another table.

So now we can represent all the possible relationships between these authors and these books. This is a pretty good table structure for this set of data. Everybody satisfied?The only thing I’d potentially consider here is possibly adding an auto-incrementing ID to the books table and using that to key off instead of ISBN. ISBNs only apply to published books, and not all publishers add them. It’s possible to get published by a small, private publisher without one, and while you can add an ISBN later, you may end up in a situation where you need to track unpublished or small run books that don’t have ISBNs. ISBN is still a safer bet than social security number, but again it’s another example where the real world may not give you as much standardized rigor as you’d like it to.

This is some fake data for a university registration system. I had to build a system like this a few years ago, and when I started, this data was all being stored in an Access database on a shared drive. Only a few people could access it at a time, and all the applications were mailed in on paper. This system wasn’t very usable or sustainable, and it was very vulnerable to data corruption and locking issues, especially if there was any sort of problem with the network or a power outage. So the first thing we need to do is choose a primary key. I know there were three different Chris Smiths at my high school while I was there, and two of them were in the same grade, so we know name is out. Add an auto-incrementing integer ID to solve that. I also added a states table just for powering dropdowns and foreign key validation. Then the term column looked like it was storing two pieces of information in the same column, but instead of splitting it up into term and year, we actually broke that out into a separate table as well and tied it into the main enrollment system to get term codes. Finally, that Date column was used to represent the orientation date that the student wanted to attend, so we broke that out into its own table as well and used it to store capacity and a number of other things about each date. That also made reading the data simpler because at a glance we couldn’t tell whether or not the dates were stored using US or European conventions regarding the order of month and day.

Can this person be register for your system? And don’t just say “no, because he’s dead”, that’s cheating. This is just using European languages too. What about Chinese? Arabic? What if people have more than three distinct sections in their names?You need to know your potential audience here and future proof as much as you can. Ultimately this is more of a product direction kind of decision, but it’s important that whoever makes that decision understands the issues. The W3C (not W3Schools) has an excellent article on this topic that I’ve linked in a later slide that I recommend everyone read.

UTF-8 is a really complicated topic that people can and do give entire presentations on. There are a many nuances and gotchas involved, but the point here is, use UTF-8 encoding for everything unless you have a really good reason not to. The extra storage space it costs is not likely to cause problems unless you have an extremely high load, in which case you can make arguments about index efficiency. Just be aware that if you’re defining column sizes in bytes, you may be able to store fewer characters in them than you think because UTF-8 can use up to four bytes per character instead of one or two like most European-based encodings. Also, the “UTF8” character sets on both mysql and oracle do not include full UTF-8 support, so if you need to store things in the high end of the character set like the snowman or other symbols you need to use a different character set like utf8mb4.

Indexing is another very deep topic that people have written entire presentations and books about, even made whole careers out of. It’s as much an art as it is a science. I’m only going to give it one slide, and point you to the presentation I link at the end of this talk by Bill Karwin. He gives a great talk on indexes and I recommend you see him speak about them if you can. So here’s my brief overview. An index is a list of values from a table, sorted in order, with pointers back to the original records that those values came from. They’re used to try to speed up queries. Defining a primary key automatically creates an index on the fields used in the key, and it also adds a uniqueness constraint. If you run a query like SELECT * FROM customers where id=7, then the query optimizer can do what’s called an INDEX SEEK and use the index on ID to just pull up that single record instead of loading all the records. The worst scenario is called a TABLE SCAN, where the database loads up all the records in the table and checks each of them one by one. On a large table that can be a real performance hit, especially if that query is done frequently. Indexes aren’t just free though. They take up their own space, and they require db resources to maintain when records are added, and they become less efficient over time as records are removed. The very coarse, blunt approach to indexes is to index all foreign keys, and any fields that you regularly use in WHERE clauses on frequent queries. Sort your indexes in the same order as you want your results. If you index over multiple columns, order the columns in the index in the same order as they’re used in your WHERE clauses. You can only really create an indexing strategy based on what queries are actually being used, and best indexing choices may change over time as your usage patterns change.

Going to go through a list of the 3 most common table design anti-patterns that I’ve seen and/or done myself at various points in my career, aside from violations of the rules before about column design. Spoiler alert, the solution to all of these is “split them into multiple tables”

I see this come up pretty frequently, usually because people didn’t think about how their system might grow or change over time. Maybe they started out with teachers who only taught one or two courses, and started having to support three, four, five, or more courses. This also breaks down completely whenever more than one teacher has to teach the same course.

So here’s our solution: breaking the setup apart into three different tables, one to represent teacher data, one for course data, and one representing the many to many relationship between the two. This is also an example of the “no ID” linking table I mentioned before. You can set the primary key of the teacher_courses table to be a combination of teacher_id and course_id. That combination should always be unique and it saves you from having to add/index another column that will likely never be used.

Ok, a bit of background. There are companies that make multi-tools, toilets, and baby food all named Gerber. They are different companies, but for the lulz let’s pretend they’re all the same one. Obviously this product catalog has a number of potential issues with it, and we really, really don’t want them to get mixed up or we might end up with some very expensive recalls. Furthermore, if you remember from the earlier slides, NULL means unknown. Most of you probably wouldn’t lose any sleep not knowing what flavor your toilet is, but not knowing the number of blades in your baby food could pose some real problems. You also don’t want to be the person that has to explain to management how someone was able to buy a gallon of baby food for $1.50. One of the biggest problems with this design is the fact that SQL is poor at conditional validation. We can enforce type code with an enum or with its own table, but we have no easy, efficient way to say that if the type is food, then flavor is required, and if not then it has to be “N/A” or something. One of Codd’s laws says that all of the fields in a table have to be about the the record identified by the primary key. Obviously that’s being broken here.

The first potential solution is the most obvious, in that we split those three very separate items into their own tables and then just duplicate the columns that they have in common. This is a perfectly workable solution and nobody is really going to yell at you for it. However, if we have a lot of duplicated data, it might be better to consider solution 2.

Here, we move the similar fields into a parent table with sub tables representing the different subtypes. Those sub tables then include parent ID as a column. Since those parent IDs are going to be unique per product, you don’t need separate IDs for each individual subtype of product.

This is the most complicated anti pattern we’ll deal with today. It has the largest number of potential solutions and none of them are without their disadvantages. There’s at least one entire book written about this problem, so the real moral of the story is that if you do have to have something like comments on your site, it’s worth looking into outsourcing that management to a third party service or using some sort of non-relational storage engine to handle it. This is not code you want to write yourself if you can avoid it.

The first solution, and the one most people go with, is called an Adjacency List. It involves storing a pointer to the parent item in each record. Only works well with a single pointer to parent record, then goes off the rails. It’s capable of handling things like adding new items and finding nearby items in the hierarchy, but the queries it takes to build the entire tree with this kind of structure just keep getting more and more complex the deeper the nesting gets… you end up with an awful lot of LEFT JOINs

This solution is called Nested Sets, and it’s based around the idea of using pairs of numbers to encode the item’s position in the hierarchy. so that the the number in the left column is greater than the left numbers used by all its children, and the number in the right column is less than the numbers used in the right column of each of its parents. Think of it kind of like counting down one side of the tree and back up the other side. The trip down is for your left column and the trip back up is for your right column. Keep in mind these numbers are not foreign keys, and are just used to store position.

For this solution you can use things like a BETWEEN query to find all parent or child comments of any comment in the system. To find parents of a comment, search for comments that have a left value between the original comments’ left and right values, and to get child tickets, use the same search for the right column instead. This gets pretty complicated to maintain though. Every time you add a new comment to the system, you have to recalculate the left and right value columns for every other comment in the table. Another difficulty is finding the specific next child or next parent of any individual ticket. I don’t recommend this one.

This solution is called Path Enumeration, and in it each record stores its entire tree position in a delimited field. This makes finding a specific item’s position in the tree simple. You can also find child elements or parent elements using a LIKE query on the path column. However, you might recognize this as violating the one data element per column rule, and it requires some extra parsing to both generate and consume the path column. Adding a new comment is easier than with nested sets because you can just append the new comment’s ID to the paths of the affected comments.

The last solution is called a Closure Table, and it involves adding a new table that pretty much stores every possible path relationship in the system. It sounds like that’s a lot of extra records, but in practice it isn’t as many as you’d think. These are exactly the type of records that are stored and indexed efficiently. This also makes querying very simple because you can JOIN on parent = 2, child = 4, whatever you need. Adding a new comment is a bit more work but it’s also extremely straightforward since you just add the new record to the comments table, a new record to the tree table, and then a new record for each ancestor.This is my preferred solution since all the operations are straightforward, it supports multiple levels of nesting, and it enforces referential integrity. But if anything, it’s just a reminder to avoid this problem if you can, and steal someone else’s solution if you can’t.

So to wrap up, gather the data you’ll need to store, pick proper primary keys, enforce validity via strict types, relationships, and constraints. Avoid duplicating data throughout your system. Don’t store calculated data. Store your dates as UTC, text as UTF-8, stick to a naming convention, use NULLs intelligently, avoid anti-patterns, Add indexes conservatively and re-evaluate based on your live queries and real data. And finally, everyone’s situation is different. There’s no magic silver bullet that will solve all your problems, but when it doubt, start with a tight, standard design, make it friendly with views, and only deviate after you have metrics proving you need to.

This is mostly for looking up the slides later online, but if anyone wants these links I’m happy to share them with you directly

Please take a few moments to rate my talk and leave feedback. This is my first full-length talk, and I intend to keep giving these things, so you’re only helping out your fellow developers and future selves by giving me ways I can improve. Thank you.

44.
Wrap upGather requirements Text as UTF-8Choose good keys Use a naming conventionDefine relationships Be smart about NULLsBe strict Avoid anti-patternsNo duplicate/calculated data More tables is usually goodNo multi-valued columns Index where you needDates as UTC “It Depends”

45.
ResourcesDatabase Design for Mere Mortals by Michael J. HernandezHow to support full Unicode in MySQL databasesPersonal names around the worldFalsehoods Programmers Believe About NamesFalsehoods programmers believe about timeHow to Design Indexes, Really