Virtually everything you do in MySQL involves data in some way or another because, by definition, the purpose of a database management system is to manage data. Even a statement as simple as SELECT 1 involves evaluation of an expression to produce an integer data value.

Every data value in MySQL has a type. For example, 37.4 is a number and 'abc' is a string. Sometimes data types are explicit, such as when you issue a CREATE TABLE statement that specifies the type for each column you define as part of the table:
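A table definition of this kind might look like the following sketch. The column names match those used in the discussion that follows; the table name mytbl and the CHAR(20) width are assumptions made for illustration.

```sql
-- Hypothetical table; each column declares its data type explicitly.
CREATE TABLE mytbl
(
  int_col  INT,       -- integer-valued column
  str_col  CHAR(20),  -- string-valued column
  date_col DATE       -- date-valued column
);
```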

Other times data types are implicit, such as when you refer to literal values in an expression, pass values to a function, or use the value returned from a function. The following INSERT statement does all of those things:
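A statement matching the operations described next could be written as follows. This is a sketch that assumes a table named mytbl (a placeholder name) with an integer column int_col, a string column str_col, and a date column date_col.

```sql
-- Assumes mytbl(int_col INT, str_col CHAR(20), date_col DATE) already exists.
INSERT INTO mytbl (int_col, str_col, date_col)
VALUES (14, CONCAT('a','b'), 20090115);
```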

The statement performs the following operations, all of which involve data types:

It assigns the integer value 14 to the integer column int_col.

It passes the string values 'a' and 'b' to the CONCAT() string-concatenation function. CONCAT() returns the string value 'ab', which is assigned to the string column str_col.

It assigns the integer value 20090115 to the date column date_col. The assignment involves a type mismatch, but the integer value can reasonably be interpreted as a date value, so MySQL performs an automatic type conversion that converts the integer 20090115 to the date '2009-01-15'.

To use MySQL effectively, it’s essential to understand how MySQL handles data. This chapter describes the types of data values that MySQL can handle, and discusses the issues involved in working with those types:

The general categories of data values that MySQL can represent, including the NULL value.

The specific data types MySQL provides for table columns, and the properties that characterize each data type. Some of MySQL’s data types are fairly generic, such as the BLOB string type. Others behave in special ways that you should understand to avoid being surprised. These include the TIMESTAMP data type and integer types that have the AUTO_INCREMENT attribute.

How the server’s SQL mode affects treatment of bad data values, and the use of “strict” mode to reject bad values.

How to generate and work with sequences.

MySQL’s rules for expression evaluation. You can use a wide range of operators and functions in expressions to retrieve, display, and manipulate data. Expression evaluation includes rules governing type conversion that come into play when a value of one type is used in a context requiring a value of another type. It’s important to understand when type conversion happens and how it works; some conversions don’t make sense and result in meaningless values. Assigning the string '13' to an integer column results in the value 13. However, assigning the string 'abc' to that column results in the value 0 (or an error in strict SQL mode) because 'abc' doesn’t look like a number. Worse, if you perform a comparison without knowing the conversion rules, you can do considerable damage, such as updating or deleting every row in a table when you intend to affect only a few specific rows.

How to choose data types appropriately for your table columns. It’s important to know how to pick the best type for your purposes when you create a table, and when to choose one type over another when several related types might be applicable to the kind of values you want to store.
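The string-to-number conversions mentioned in the list above can be observed directly in expressions. The following sketch uses only literals, so it needs no table. MySQL also generates a warning for the conversions of 'abc'.

```sql
SELECT '13' + 0;    -- 13: the string looks like a number
SELECT 'abc' + 0;   -- 0: 'abc' does not look like a number
SELECT 'abc' = 0;   -- 1 (true): 'abc' is converted to 0 before the comparison
```

The last result suggests why a condition that compares a string column to a number can match far more rows than intended.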

Two appendixes provide additional information that supplements the discussion in this chapter about MySQL’s data types, operators, and functions. These are Appendix B, “Data Type Reference,” and Appendix C, “Operator and Function Reference.”

The examples shown throughout this chapter use the CREATE TABLE and ALTER TABLE statements extensively to create and alter tables. These statements should be reasonably familiar to you because we have used them in Chapter 1, “Getting Started with MySQL,” and Chapter 2, “Using SQL to Manage Data.” See also Appendix E, “SQL Syntax Reference.”

MySQL supports several storage engines, which differ in their properties. In some cases, a column with a given data type behaves differently for different storage engines, so the way you intend to use a column might determine or influence which storage engine to choose when you create a table. This chapter refers to storage engines on occasion, but a more detailed description of the available engines and their characteristics can be found in Chapter 2.

Data handling depends in some cases on how default values are defined and on the current SQL mode. For general background on setting the SQL mode, see Section 2.1, “The Server SQL Mode.” In the current chapter, Section 3.2.3, “Specifying Column Default Values,” covers default value handling, and Section 3.3, “How MySQL Handles Invalid Data Values,” covers strict mode and the rules for treatment of bad data.

3.1 Data Value Categories

MySQL knows about several general categories in which data values can be represented. These include numbers, string values, temporal values such as dates and times, spatial values, and the NULL value.

3.1.1 Numeric Values

Numbers are values such as 48, 193.62, or -2.378E12. MySQL understands numbers specified as integers (which have no fractional part), fixed-point or floating-point values (which may have a fractional part), and bit-field values.

3.1.1.1 Exact-Value and Approximate-Value Numbers

Exact-value numbers are used exactly as specified when possible. Exact values include integers (0, 14, -382) and numbers that have a decimal point (0.0, 38.5, -18.247).

Integers can be specified in decimal or hexadecimal format. In decimal format, an integer consists of a sequence of digits with no decimal point. Hexadecimal values are treated as strings by default, but in numeric contexts a hexadecimal constant is treated as a 64-bit integer. For example, 0x10 is 16 decimal. Section 3.1.2, “String Values,” later in this chapter, describes hexadecimal value syntax.

An exact-value number with a fractional part consists of a sequence of digits, a decimal point, and another sequence of digits. The sequence of digits before or after the decimal point may be empty, but not both.

Approximate values are represented as floating-point numbers in scientific notation with a mantissa and exponent. This is indicated by immediately following an integer or number with a fractional part by ‘e’ or ‘E’, an optional sign character (‘+’ or ‘-’), and an integer exponent. The mantissa and exponent may be signed in any combination: 1.58E5, -1.58E5, 1.58E-5, -1.58E-5.

Hexadecimal numbers cannot be used in scientific notation; the ‘e’ that begins the exponent part is also a legal hex digit and thus would be ambiguous.

Any number can be preceded by a plus or minus sign character (‘+’ or ‘-’) to indicate a positive or negative value.

Calculations with exact values are exact, with no loss of accuracy within the limits of the precision possible for such values. For example, you cannot insert 1.23456 as is into a column that allows only two digits after the decimal point. Calculations with approximate values are approximate and subject to rounding error.

MySQL evaluates an expression using exact or approximate math according to the following rules:

If any approximate value is present in the expression, it is evaluated as a floating-point (approximate) expression.

For expressions containing only exact values that are all integers, evaluation uses BIGINT (64-bit) precision.

For expressions containing only exact values but where one or more values have a fractional part, DECIMAL arithmetic is used with 65 digits of precision.

If any string must be converted to a number to evaluate an expression, it is converted to a double-precision floating-point value. Consequently, the expression is approximate by the preceding rules.
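A minimal sketch of how these rules play out, using only literal values. The comparison results follow from exact DECIMAL arithmetic in the second statement and double-precision floating-point arithmetic in the third.

```sql
SELECT 1 + 2;                   -- 3: integers only, BIGINT arithmetic
SELECT 0.1 + 0.2 = 0.3;         -- 1: exact values with fractional parts, DECIMAL arithmetic
SELECT 0.1E0 + 0.2E0 = 0.3E0;   -- 0: approximate values, subject to rounding error
SELECT '0.1' + 0.2;             -- the string is converted to double precision first,
                                -- so the expression is evaluated approximately
```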

3.1.1.2 Bit-Field Values

Bit-field values can be written as b'val' or 0bval, where val consists of one or more binary digits (0 or 1). For example, b'1001' and 0b1001 represent 9 decimal. These bit-value notations coincide with the introduction of the BIT data type in MySQL 5.0.3, but bit-field values can be used more generally in other contexts.

A BIT value in a result set displays as a binary string, which may not print well. To convert it to an integer, add zero or use CAST():
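Both conversions can be sketched with a bit-field literal; the same expressions should apply to a value retrieved from a BIT column. The results follow from b'1001' being 9 decimal, as stated earlier.

```sql
SELECT b'1001' + 0;                 -- 9: adding zero forces a numeric context
SELECT CAST(b'1001' AS UNSIGNED);   -- 9: explicit conversion to an integer
```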

3.1.2 String Values

Strings are values such as 'Madison, Wisconsin', 'patient shows improvement', or even '12345' (which looks like a number, but isn’t). Usually, you can use either single or double quotes to surround a string value, but there are two reasons to prefer single quotes:

The SQL standard specifies single quotes, so statements that use single-quoted strings are more portable to other database engines.

If the ANSI_QUOTES SQL mode is enabled, MySQL treats the double quote as an identifier-quoting character, not as a string-quoting character. This means that a double-quoted value must refer to something like a database or table name. Consider the following statement:

SELECT "last_name" FROM president;

With ANSI_QUOTES disabled, the statement selects the literal string "last_name" once for each row in the president table. With ANSI_QUOTES enabled, the statement selects the values of the last_name column from the table.

For the examples following that use the double quote as a string quoting character, assume that ANSI_QUOTES mode is not enabled.

MySQL recognizes several escape sequences within strings that indicate special characters, as shown in Table 3.1. Each sequence begins with a backslash character (‘\’) to signify a temporary escape from the usual rules for character interpretation. Note that a NUL byte is not the same as the SQL NULL value; NUL is a zero-valued byte, whereas NULL in SQL signifies the absence of a value.

Table 3.1. String Escape Sequences

Sequence    Meaning
\0          NUL (zero-valued byte)
\'          Single quote
\"          Double quote
\b          Backspace
\n          Newline (linefeed)
\r          Carriage return
\t          Tab
\\          Single backslash
\Z          Control-Z (Windows EOF character)

The escape sequences shown in the table are case sensitive, and any character not listed in the table is interpreted as itself if preceded by a backslash. For example, \t is a tab, but \T is an ordinary ‘T’ character.

Table 3.1 shows that you can escape single or double quotes using backslash sequences, but you actually have several options for including quote characters within string values:

Double the quote character if the string itself is quoted using the same character:

'I can''t'
"He said, ""I told you so."""

Quote the string with the other quote character. In this case, you do not double the quote characters within the string:

"I can't"
'He said, "I told you so."'

Escape the quote character with a backslash; this works regardless of the quote characters used to quote the string:
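For instance, the same two strings can be written with backslash escapes under either kind of quoting. This sketch assumes that neither the ANSI_QUOTES nor the NO_BACKSLASH_ESCAPES SQL mode is enabled.

```sql
SELECT 'I can\'t', "I can\'t";
SELECT 'He said, \"I told you so.\"', "He said, \"I told you so.\"";
```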

To turn off the special meaning of backslash and treat it as an ordinary character, enable the NO_BACKSLASH_ESCAPES SQL mode.
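The effect of that mode can be sketched by measuring the same literal before and after enabling it. The variable name @saved_mode is an arbitrary choice, and the first result assumes that NO_BACKSLASH_ESCAPES is not already enabled.

```sql
SET @saved_mode = @@SESSION.sql_mode;            -- remember the current mode
SELECT LENGTH('a\nb');                           -- 3: \n is a single newline character
SET SESSION sql_mode = 'NO_BACKSLASH_ESCAPES';   -- replaces the session's mode list
SELECT LENGTH('a\nb');                           -- 4: backslash and n are ordinary characters
SET SESSION sql_mode = @saved_mode;              -- restore the previous mode
```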

As an alternative to using quotes for writing string values, you can use two forms of hexadecimal notation. The first is the standard SQL notation X'val', where val consists of pairs of hexadecimal digits (‘0’ through ‘9’ and ‘a’ through ‘f’). For example, X'0a' is 10 decimal, and X'ffff' is 65535 decimal. The second is 0xval notation, as in 0x0a. In both forms, the non-decimal hex digits (‘a’ through ‘f’) can be specified in uppercase or lowercase, as can the leading ‘X’ of the first form:
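A short sketch of the lettercase equivalence, comparing values written in different ways. The 0xval form used in the second statement is the other hexadecimal notation referred to elsewhere in this chapter.

```sql
SELECT X'0a' = x'0A';   -- 1: lettercase of X and of the hex digits does not matter
SELECT X'0a' = 0x0A;    -- 1: both notations denote the same value
SELECT X'ffff' + 0;     -- 65535: numeric context
```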

In string contexts, pairs of hexadecimal digits are interpreted as 8-bit numeric byte values in the range from 0 to 255, and the result is used as a string. In numeric contexts, a hexadecimal constant is treated as a number. The following statement illustrates the interpretation of a hex constant in each type of context:
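One possible statement of this kind is sketched below. The first expression uses the constant as a string and the second uses it in a numeric context by adding zero.

```sql
SELECT 0x61626364, 0x61626364 + 0;
-- string context:  the four bytes 0x61 0x62 0x63 0x64, that is, 'abcd'
-- numeric context: 1633837924
```

Some client programs may display the string result as hexadecimal digits rather than as characters, depending on their settings.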

X'val' notation requires an even number of digits. A value such as X'a' is illegal. If a hexadecimal value written using 0x notation has an odd number of hex digits, MySQL treats it as though the value has a leading zero. For example, 0xa is treated as 0x0a.

3.1.2.1 Types of Strings and Character Set Support

String values fall into two general categories, binary and non-binary:

A binary string is a sequence of bytes. These bytes are interpreted without respect to any concept of character set. A binary string has no special comparison or sorting properties. Comparisons are done byte by byte based on numeric byte values; all bytes are significant, including trailing spaces.

A non-binary string is a sequence of characters. It is associated with a character set, which determines the allowable characters that may be used and how MySQL interprets the string contents. Character sets have one or more collating (sorting) orders. The particular collation used for a string determines the ordering of characters in the character set, which affects comparison operations. The default character set and collation are latin1 and latin1_swedish_ci.

Trailing spaces in non-binary strings are not significant in comparisons, except that for the TEXT types, index-based comparisons are padded at the end with spaces and a duplicate-key error occurs if you attempt to insert into a unique-valued TEXT index a value that is different from an existing value only in the number of trailing spaces.
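A brief sketch of the trailing-space behavior, using literals only. It assumes the latin1 character set with its default collation, latin1_swedish_ci; a collation that does not pad comparisons with spaces would give a different first result.

```sql
SELECT _latin1 'a' = _latin1 'a ';                   -- 1: trailing space is not significant
SELECT CAST('a' AS BINARY) = CAST('a ' AS BINARY);   -- 0: all bytes are significant
```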

Character units vary in their storage requirements. A single-byte character set such as latin1 uses one byte per character, but there also are multi-byte character sets in which some or all characters require more than one byte. For example, the Unicode character sets available in MySQL are multi-byte. ucs2 is a double-byte character set in which each character requires two bytes. utf8 is a variable-length multi-byte character set with characters that take from one to three bytes. (As of MySQL 6.0.4, utf8 characters can require up to four bytes.)

To find out which character sets and collations are available in your server, use these two statements:
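The statements in question are presumably the two SHOW statements below; the third is an optional variation that narrows the output with a LIKE pattern.

```sql
SHOW CHARACTER SET;
SHOW COLLATION;

-- Optional: list only the collations whose names begin with utf8
SHOW COLLATION LIKE 'utf8%';
```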

As shown by the output from SHOW COLLATION, each collation is tied to a particular character set, and a given character set might have several collations. Collation names usually consist of a character set name, a language, and an additional suffix. For example, utf8_icelandic_ci is a collation for the utf8 Unicode character set in which comparisons follow Icelandic sorting rules and characters are compared in case-insensitive fashion. Collation suffixes have the following meanings:

_ci indicates a case-insensitive collation.

_cs indicates a case-sensitive collation.

_bin indicates a binary collation. That is, comparisons are based on numeric character code values without reference to any language. For this reason, _bin collation names do not include any language name. Examples: latin1_bin and utf8_bin.

Binary and non-binary strings have different sorting properties:

Binary strings are processed byte by byte in comparisons based solely on the numeric value of each byte. One implication of this property is that binary strings appear to be case sensitive ('abc' <> 'ABC'), but that is actually a side effect of the fact that uppercase and lowercase versions of a letter have different numeric byte values. There isn’t really any notion of lettercase for binary strings. Lettercase is a function of collation, which applies only to character (non-binary) strings.

Non-binary strings are processed character by character in comparisons, and the relative value of each character is determined by the collating sequence that is used for the character set. For many collations, uppercase and lowercase versions of a given letter have the same collating value, so non-binary string comparisons typically are not case sensitive. However, that is not true for case-sensitive or binary collations.
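The sorting properties just described can be sketched with a few comparisons. The latin1 collations named here are ones commonly available in MySQL; the first comparison relies on latin1_swedish_ci being the default collation for latin1.

```sql
SELECT _latin1 'abc' = _latin1 'ABC';                             -- 1: case-insensitive default collation
SELECT _latin1 'abc' = _latin1 'ABC' COLLATE latin1_general_cs;   -- 0: case-sensitive collation
SELECT _latin1 'abc' = _latin1 'ABC' COLLATE latin1_bin;          -- 0: binary collation
SELECT CAST('abc' AS BINARY) = CAST('ABC' AS BINARY);             -- 0: binary strings, byte values differ
```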

Because collations are used for comparison and sorting, they affect many operations:

Comparison operators: <, <=, =, <>, >=, >, and LIKE.

Sorting: ORDER BY, MIN(), and MAX().

Grouping: GROUP BY and DISTINCT.

To determine the character set or collation of a string, use the CHARSET() or COLLATION() function.

Quoted string literals are interpreted according to the current server settings. The default character set and collation are latin1 and latin1_swedish_ci:
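A quick way to check this is to apply both functions to a literal. The values returned depend on the connection settings; with the defaults named above they would be latin1 and latin1_swedish_ci.

```sql
SELECT CHARSET('abcd'), COLLATION('abcd');
```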

Two notational conventions can be used to force a string literal to be interpreted with a given character set. First, a string constant can be designated for interpretation with a given character set using the following notation, where charset is the name of a supported character set:

_charset str

The _charset notation is called a “character set introducer.” The string can be written as a quoted string or as a hexadecimal value. The following examples show how to cause strings to be interpreted in the latin2 or utf8 character set:
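Examples along these lines might look as follows. Each row writes the same string as a quoted literal and in the two hexadecimal notations; the sketch assumes that the latin2 and utf8 character sets are available in your server.

```sql
SELECT _latin2 'abc', _latin2 X'616263', _latin2 0x616263;
SELECT _utf8 'def', _utf8 X'646566', _utf8 0x646566;

-- The introducer determines the character set of the result:
SELECT CHARSET(_latin2 'abc');   -- latin2
```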

For quoted strings, whitespace is optional between the introducer and the following string. For hexadecimal values, whitespace is required.

Second, the notation N'str' is equivalent to _utf8'str'. N is not case sensitive and must be followed immediately by a quoted string literal with no intervening whitespace.

Introducer notation works for quoted string literals or hexadecimal constants, but not for string expressions or column values. However, any string value can be used to produce a string in a designated character set using the CONVERT() function:

CONVERT(str USING charset);

Introducers and CONVERT() are not the same. An introducer merely modifies how the string is interpreted. It does not change the string value (except that for multi-byte character sets, padding might be added if the string does not contain enough bytes). CONVERT() takes a string argument and produces a new string in the desired character set. To see the difference between introducers and CONVERT(), consider the following two statements that refer to the ucs2 double-byte character set:
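Two statements of this kind are sketched below, followed by a query that measures the results. The variable names @s1 and @s2 are arbitrary, and the sketch assumes that the ucs2 character set is available.

```sql
SET @s1 = _ucs2 'ABCD';                  -- introducer: reinterpret the existing bytes
SET @s2 = CONVERT('ABCD' USING ucs2);    -- conversion: produce a new ucs2 string

SELECT CHAR_LENGTH(@s1), LENGTH(@s1), CHAR_LENGTH(@s2), LENGTH(@s2);
-- 2 characters in 4 bytes for @s1; 4 characters in 8 bytes for @s2
```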

Assume that the default character set is latin1 (a single-byte character set). The first statement interprets each pair of characters in the string 'ABCD' as a single double-byte ucs2 character, resulting in a two-character ucs2 string. The second statement converts each character of the string 'ABCD' to the corresponding ucs2 character, resulting in a four-character ucs2 string.

What is the “length” of a string? It depends. If you measure with CHAR_LENGTH(), you get the length in characters. If you measure with LENGTH(), you get the length in bytes. For strings that contain multi-byte characters the two values differ:
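As a small illustration, the following sketch converts a three-character string to the double-byte ucs2 character set and measures it both ways. The variable name @s is arbitrary.

```sql
SET @s = CONVERT('abc' USING ucs2);
SELECT CHAR_LENGTH(@s), LENGTH(@s);   -- 3 characters, 6 bytes
```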

Here is a somewhat subtle point. A binary string is not the same thing as a non-binary string that has a binary collation:

The binary string has no character set. It is interpreted with byte semantics and comparisons use single-byte numeric codes.

A non-binary string with a binary collation has character semantics and comparisons use numeric character values that might be based on multiple bytes per character.

Here’s one way to see the difference between binary and non-binary strings with regard to lettercase. Create a binary string and a non-binary string that has a binary collation, and then pass each string to the UPPER() function:
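One way to set this up is sketched below. The variable names are arbitrary, and CAST(... AS BINARY) is used here as one of several ways to produce a binary string.

```sql
SET @s1 = CAST('abcd' AS BINARY);              -- binary string, no character set
SET @s2 = _latin1 'abcd' COLLATE latin1_bin;   -- non-binary string, binary collation
SELECT UPPER(@s1), UPPER(@s2);                 -- abcd, ABCD
```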

Why doesn’t UPPER() convert the binary string to uppercase? This happens because the string has no character set, so there is no way to know which byte values correspond to uppercase or lowercase characters. To use a binary string with functions such as UPPER() and LOWER(), you must first convert it to a non-binary string:
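The conversion can be sketched with CONVERT(). The choice of latin1 as the target character set is an assumption; any character set appropriate for the bytes in the string could be used.

```sql
SET @s1 = CAST('abcd' AS BINARY);                -- binary string
SELECT @s1, UPPER(CONVERT(@s1 USING latin1));    -- abcd, ABCD
```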

3.1.2.2 Character Set-Related System Variables

The server maintains several system variables that are involved in various aspects of character set support. Most of these variables refer to character sets and the rest refer to collations. Each of the collation variables is linked to a corresponding character set variable.

Some of the character set variables indicate properties of the server or the current database:

character_set_system indicates the character set used for storing identifiers. This is always utf8.

character_set_server and collation_server indicate the server’s default character set and collation.

character_set_database and collation_database indicate the character set and collation of the default database. These are read-only and set automatically by the server whenever you select a default database. If there is no default database, they’re set to the server’s default character set and collation. These variables come into play when you create a table but specify no explicit character set or collation. In this case, the table defaults are taken from the database defaults.

Other character set variables influence how communication occurs between the client and the server:

character_set_client indicates the character set in which the client sends SQL statements to the server.

character_set_results indicates the character set in which the server returns results to the client. “Results” include data values and also metadata such as column names.

character_set_connection is used by the server. When the server receives a statement string from the client, it converts the string from character_set_client to character_set_connection and works with the statement in the latter character set. (There is an exception: Any literal string in the statement that is preceded by a character set introducer is interpreted using the character set indicated by the introducer.) collation_connection is used for comparisons between literal strings within statement strings.

character_set_filesystem indicates the filesystem character set. It is used for interpreting literal strings known to refer to filenames in SQL statements such as LOAD DATA. These filename strings are converted from character_set_client to character_set_filesystem before opening the file. The default is binary (no conversion).

Very likely you’ll find that most character set and collation variables are set to the same value by default. For example, the following output indicates that client/server communication takes place using the latin1 character set:
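To inspect these settings on your own server, statements such as the following can be used. The values displayed depend on the server and client configuration, so no output is shown here.

```sql
-- The backslashes make each underscore match literally in the LIKE pattern.
SHOW VARIABLES LIKE 'character\_set\_%';
SHOW VARIABLES LIKE 'collation\_%';
```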

3.1.3 Date and Time (Temporal) Values

Dates and times are values such as '2011-06-17' or '12:30:43'. MySQL also understands combined date/time values, such as '2011-06-17 12:30:43'. Take special note of the fact that MySQL represents dates in year-month-day order. This syntax often surprises newcomers to MySQL, although it is standard SQL format (also known as “ISO 8601” format). You can display date values any way you like using the DATE_FORMAT() function, but the default display format lists the year first. Input values must be specified with the year first. For values in other formats, you might be able to convert them for input by using the STR_TO_DATE() function.
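As a brief sketch of both directions, the following statements display a date in month/day/year order and convert a string in that order into a date value. The format string is just one possibility.

```sql
SELECT DATE_FORMAT('2011-06-17', '%m/%d/%Y');   -- 06/17/2011
SELECT STR_TO_DATE('06/17/2011', '%m/%d/%Y');   -- 2011-06-17
```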

3.1.4 Spatial Values

MySQL supports spatial values, although only for the MyISAM storage engine and, as of MySQL 5.0.16, for InnoDB, NDB, and ARCHIVE as well. This capability enables representation of values such as points, lines, and polygons. For example, the following statement uses the text representation of a point value with X and Y coordinates of (10, 20) to create a POINT and assigns the result to a user-defined variable:

SET @pt = POINTFROMTEXT('POINT(10 20)');

3.1.5 Boolean Values

In expressions, zero is considered false and any non-zero, non-NULL value is considered true.

The special constants TRUE and FALSE evaluate to 1 and 0, respectively. They are not case sensitive.
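These rules can be checked with a couple of simple queries. The IF() calls are used here only as a convenient way to show how a value is treated in a boolean context.

```sql
SELECT TRUE, true, FALSE, false;                   -- 1, 1, 0, 0
SELECT IF(5, 'true', 'false'), IF(0, 'true', 'false');   -- true, false
```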

3.1.6 The NULL Value

NULL is something of a “typeless” value. Generally, it’s used to mean “no value,” “unknown value,” “missing value,” “out of range,” “not applicable,” “none of the above,” and so forth. You can insert NULL values into tables, retrieve them from tables, and test whether a value is NULL. However, you cannot perform arithmetic on NULL values; if you try, the result is NULL. Also, many functions return NULL if you invoke them with a NULL or invalid argument.

The keyword NULL is written without quotes and is not case sensitive. MySQL also treats a standalone \N (case sensitive) as NULL:
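The following sketch illustrates both points. The last statement assumes a server from the MySQL 5.x era that this chapter describes; later versions may not accept \N as a synonym for NULL in SQL statements.

```sql
SELECT NULL + 1, ISNULL(NULL), null IS NULL;   -- NULL, 1, 1

-- Version-dependent: accepted only where \N is recognized as NULL.
SELECT \N, ISNULL(\N);                         -- NULL, 1
```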