Picking datatype for STATUS fields

Quite commonly in the applications you would need to use some kind of “status” field – status of order – “new”, “confirmed”, “in production”, “shipped” status of job, message etc. People use variety of ways to handle them often without giving enough thought to the choice which can cause problems later.

Perhaps worst, though quite common thing is to define such field as VARCHAR(255) . Even though the stored value is often short the full specified length can be used for internal processing, such as when creating temporary table or sorting.
Before we go to the list of variants which can be used lets talk about two important types of these status fields. First is BOOLEAN or YES/NO type. You would frequently see it as columns VISIBLE, DELETED , CAN_ACCESS etc. In case things are as easy as YES/NO using TINYINT UNSIGNED NOT NULL is quite a good idea, though using ENUM is also fine choice as you’re not planning to change allowed values for such field anyway.

For status fields with more than two values there are commonly following choices:

VARCHAR(N) – This is the most straightforward approach though it is also most inefficient one from performance standpoint. However how much of performance are you going to lose? This depends on number of rows as well as their average length. If you have 1000 of rows you would not save much of space by using TINYINT instead of VARCHAR. Same if you have 1000 bytes per row using 10 byte VARCHAR(10) will not waste too much space. The other problem you may run into is slower JOIN performance especially for MyISAM Tables. Though it is solvable as well.

The benefit of using this solution is obviously self documenting. Values such as “New”, “Shipped”, “Confirmed” are much more descriptive than 1,2,3 and do not need to be separately maintained. However this benefit also comes comes as a drawbacks as it is very easy to store not intended statuses in the system and it is harder to spot them. For example you may have somewhere in the code use “Shiped” value which can cause hard to catch errors. Also if you’re using actual values you may have temptation to display them directly in your application interface which can cause extra problems if you need to support multiple languages. In any case if you’re picking this approach you should make sure the values you use are short. Pick 8, may be 12 characters and use single byte character set to keep the size as short as possible.

ENUM – This is what VARCHAR(N) columns get converted during the process of application optimization. It is very nice as it often allows to save on the storage and get extra performance without application modification. ENUM takes just 1 byte for small sets so it is rather efficient. The problem with ENUM is adding new values using traditional way (ALTER TABLE) causes table rebuild which is not acceptable for many environments. It is possible to hack this around though I would not like to do this as a standard production practice. It is also worth to note ENUM comes with couple of gotchas – first if you store value which is not allowed by ENUM list it will be always converted to ” – empty string, the additional ENUM columns can have. The other gotcha is sort order which is done for ENUM in the order in which values are specified in the ENUM declaration – not according to their string values. So make sure to add values specify strings for ENUM in sorted way to get same sorting as before.

TINYINT – Or you can just use TINYINT which has very small footprint, fast for queries and has no problem with adding values to it, however it has also drawback of integer values not being self documenting and can’t be used in interfaces. TINYINT is a good choice if you can maintain a discipline and have the table which has integer to string mapping
or at least have constants with descriptive names defined, so you do not use numbers in the application directly. I prefer to have the table though because correctly architecting your queries you can use such table with very low overhead while it can become very useful if you plan to extend statuses with further meta data, for example descriptive help message, information when this status is added or flag to mark this is deprecated status value and can’t be used.

In general there are reasons to use all 3 approaches in the applications, just do not do it by randomly picked advice but think what kind of application do you have and what are the most important goals for you. Hope this article will help you to do the right pick.

Related

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Another relatively commonly used option is CHAR(1), then have character status codes, e.g. ‘N’ for “New” or ‘D’ for deleted. Not as self-documenting as an ENUM, but easier for humans to read and still maintains quite a small field in the table.

Interesting notes- I’ve been using VARCHAR for my status columns as it’s the default behaviour of my particular status plugin (acts_as_state_machine for Ruby on Rails’s ActiveRecord model). I’ve been considering moving this to make use of TINYINT instead of VARCHAR- are there any metrics on the performance differences, other than the obvious size advantages?

Indeed you can use CHAR(1) – This is for example the choice for Yes/No values in MySQL Grant tables. It is a bit better documenting than tinyint assuming you can have statuses to start with different letters. In fact you can also use CHAR(2) and use couple of letters for bit identification. This is why I did not really separated it from VATCHAR variant.

It is easy if you only use single programming language and if you always access the data from API. As the time goes you may have to run some batch jobs manually or have portions of applications developed in other language which makes it easier to make a mistake.

Yeah indeed you can. I wrote at it a while back here http://www.mysqlperformanceblog.com/2008/04/23/efficient-boolean-value-storage-for-innodb-tables/
Though I do not see it quite commonly used because it is minefield as you have to handle NULL very specially. JOIN using = for example would not work. COL=NULL will not fetch the rows either so it makes automatic queries generation more complicated. Though it you have to store 100 of boolean values I’d consider this trick.

Peter: You should also note that ALTER TABLE to add new entries to the end of an ENUM/SET definition in 5.1 is an online operation and requires no table rebuild. So this major drawback to ENUM is corrected.

Greate topic, Peter. It’s something that affects many of us, and is more design-oriented than anything else. Something to get right from the start. I would add that using int types works well because it’s portable–more so than enums, anyway.

As a follow-up, How do you feel about indexes on status columns? Typically, there are very few values, and we’re trained to only
index columns with readonable cardinality. I see a lot of cases though where a typical status flag has very skewed distribution. E.g.
open/shipped/delivered might all have relatively few records compared to closed, where orders stay forever after being completed ?
At some point, it’s more useful to index that column than it is painful to maintain the index.

Indeed this is the case when advice “do not index columns with low cardinality” can be wrong. It is very possible to have column with two values with 99%/1% distribution or something similar.
In such case assuming you typically do lookups for that 1% value index can be quite helpful.

Though too bad with MySQL you will also have to keep 99% of that same value junk in the index. If MySQL would support partial indexes we could build indexes to only include the selective status values which it makes sense to do lookup on.

The workaround which works for some cases is using separate table to maintain such values (with proper indexes) so they can be looked up without requiring to have index built on all data.

“if you store value which is not allowed by ENUM list it will be always converted to â€ – empty string”

This is the case if your sql_mode does not include STRICT_TRANS_TABLES or STRICT_ALL_TABLES (which it doesn’t by default). If it does include (or is equal to) these modes, value truncation due to inserting or updating a non-defined value will lead to a runtime error:

ERROR 1265 (01000): Data truncated for column ‘%c’ at row %r

(Do you have any benchmark data on sql modes? would be interesting to know whether there is a performance penalty involved in one of these modes, or the more encompassing TRADITIONAL mode)

Indeed this applies to standard behavior. If you get to Strict mode it is going to be changed… but a lot of other stuff is going to change as well. In most cases I’ve seen enabling strict mode also needed a lot of minor fixes to the application to make it work 🙂

“It is also worth to note ENUM comes with couple of gotchas – first if you store value which is not allowed by ENUM list it will be always converted to â€ – empty string, the additional ENUM columns can have.”

That is an implementation flaw in my opinion. It should be similar to a FK violation and raise an exception.

Just to take an even different look. On a current project using SQL server, the application developer I am designing the schema with prefers to use lookup tables for all of this.

A lookup table behaves more or less like ENUM with a few key differences:
– Its ease to add/remove entries (even reordering is possible without having to alter the db schema
– Obviously you end up having to do joins. However the lookup tables should obviously be super small and easily fit into memory

In many cases you would see “const” join type for such tables which is very fast (because it is pre-read only once) however if it is join for each row it can get expensive even if data is in memory – join is still much slower than fetching data from the different column in same table.

Awesome article, makes me want to go and check some of my older db designs to see how I was doing them.

For some time I’ve been doing this: The status field is a TINYINT or INT(2). But to attack the problem of self documentation, I just create a table with fields, and DBAs and DB users don’t have to ask the dev team for the meaning of those numeric values. Heck, they can even make joins with this table to get an understandable query result.

Hi Peter, thanks for your post. Right now I am choosing a type for a status like field. So far it will have only two values, but since this field represents type of the objects, it is very likely to change in the future I am considering UNSIGNED TINY INT vs. ENUM.

At this moment other table with similar columns use VARCHAR(N) approach, so using ENUM looks more appealing to me for two reasons:
1) It self documents what status means what
2) More consistent with the rest of the code.
3) Same string values can be used in APIs without internal mapping.

One problem that I am afraid of is that rebuild the entire table may be required after adding a new allowed value. Though we use MySQL 5.1 and adding ENUM value at the end should end up within merely modifying a metadata, there is one caveat that can happen: according to manual:
“Changing the definition of an ENUM or SET column by adding new enumeration or set members to the end of the list of valid member values, as long as the storage size of the data type does not change. For example, adding a member to a SET column that has 8 members changes the required storage per value from 1 byte to 2 bytes; this will require a table copy.”
(http://dev.mysql.com/doc/refman/5.1/en/alter-table.html)

How will I know when is the moment when new range of the ENUM field hits a limit and needs to be resized? I also checked out the chapter “Choosing Optimal Data Types” in your book, but unfortunately did not find answer to my question. Do you know if there is any way to find out the current length of the ENUM column to understand how much it can grow without rebuilding?