This chapter is from the book

Virtually everything you do in MySQL involves data in some way or another
because the purpose of a database management system is, by definition, to manage
data. Even a simple SELECT 1 statement involves expression evaluation
to produce an integer data value.

Every data value in MySQL has a type. For example, 37.4 is a number
and 'abc' is a string. Sometimes data types are explicit,
such as when you issue a CREATE TABLE statement that specifies the
type for each column you define as part of the table:
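A minimal sketch of such a statement follows. The table name mytbl and the CHAR(20) length are assumptions chosen for illustration; the column names match those used in the discussion below.

```sql
-- Hypothetical table with one column of each basic kind
CREATE TABLE mytbl
(
  int_col  INT,       -- integer-valued column
  str_col  CHAR(20),  -- string-valued column
  date_col DATE       -- date-valued column
);
```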

Other times data types are implicit, such as when you refer to literal values
in an expression, pass values to a function, or use the value returned from
a function. The following INSERT statement does all of those things:
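A statement of this kind might look like the following sketch. It assumes a table named mytbl (a hypothetical name) that has an integer column int_col, a string column str_col, and a date column date_col.

```sql
-- Assumed table definition; has no effect if mytbl already exists
CREATE TABLE IF NOT EXISTS mytbl (int_col INT, str_col CHAR(20), date_col DATE);

-- A literal value, a function return value, and an integer assigned to a DATE column
INSERT INTO mytbl (int_col, str_col, date_col)
VALUES (14, CONCAT('a','b'), 20050115);
```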

The statement performs the following operations, all of which involve data
types:

It assigns the integer value 14 to the integer column int_col.

It passes the string values 'a' and 'b' to
the CONCAT() string-concatenation function. CONCAT() returns
the string value 'ab', which is assigned to the string
column str_col.

It assigns the integer value 20050115 to the date column date_col.
The assignment involves a type mismatch, but the integer value can reasonably
be interpreted as a date value, so MySQL performs an automatic type conversion
that converts the integer 20050115 to the date '2005-01-15'.

To use MySQL effectively, it's essential to understand how MySQL handles
data. This chapter describes the types of data values that MySQL can handle,
and discusses the issues involved in working with those types:

The general categories of data values that MySQL can represent, including
the NULL value.

The specific data types MySQL provides for table columns, and the properties
that characterize each data type. Some of MySQL's data types are fairly
generic, such as the BLOB string type. Others behave in special
ways that you should understand to avoid being surprised. These include
the TIMESTAMP data type and integer types that have the AUTO_INCREMENT attribute.

MySQL's capabilities for working with different character sets.

Note: Support for multiple character sets was introduced beginning
with MySQL 4.1, but underwent quite a bit of development during the early
4.1 releases. For best results, avoid early releases and use a recent 4.1
release instead.

How to choose data types appropriately for your table columns. It's
important to know how to pick the best type for your purposes when you
create a table, and when to choose one type over another when several related
types might be applicable to the kind of values you want to store.

MySQL's rules for expression evaluation. You can use a wide range
of operators and functions in expressions to retrieve, display, and manipulate
data. Expression evaluation includes rules governing type conversion that
come into play when a value of one type is used in a context requiring
a value of another type. It's important to understand when type conversion
happens and how it works; some conversions don't make sense and result
in meaningless values. Assigning the string '13' to
an integer column results in the value 13, but assigning the string 'abc' to
that column results in the value 0 because 'abc' doesn't
look like a number. Worse, if you perform a comparison without knowing
the conversion rules, you can do considerable damage, such as updating
or deleting every row in a table when you intend to affect only a few specific
rows. MySQL 5.0 introduces "strict" data-handling mode, which
enables you to cause bad data values to be rejected.
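The string-to-number conversions mentioned above can be observed directly in an expression. This sketch adds zero to each string to force a numeric context; the second conversion normally also generates a warning.

```sql
SELECT '13' + 0, 'abc' + 0;   -- returns 13 and 0
SHOW WARNINGS;                -- lists any warning raised by the conversion of 'abc'
```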

Two appendixes provide additional information that supplements the discussion
in this chapter about MySQL's data types, operators, and functions. These
are Appendix B, "Data Type Reference," and Appendix C, "Operator
and Function Reference."

The examples shown throughout this chapter use the CREATE TABLE and ALTER
TABLE statements extensively to create and alter tables. These statements
should be reasonably familiar to you because we have used them in Chapter
1, "Getting Started with MySQL and SQL," and Chapter 2, "MySQL
SQL Syntax and Use." See also Appendix E, "SQL Syntax Reference."

MySQL supports several table types that differ in their properties, each of
which is managed by a different storage engine. In some cases, a column
with a given data type behaves differently for different storage engines, so
the way you intend to use a column might determine or influence which storage
engine to choose when you create a table. This chapter refers to storage engines
on occasion, but a more detailed description of the available engines and their
characteristics can be found in Chapter 2.

Data handling also depends in some cases on how default values are defined
and on the current SQL mode. For general background on setting the SQL mode,
see "The Server SQL Mode," in Chapter 2. In the current chapter,
default value handling is covered in "Specifying Column Default Values." Strict
mode and the rules for treatment of bad data are covered in "How MySQL
Handles Invalid Data Values."

Categories of Data Values

MySQL knows about several general categories in which data values can be represented.
These include numbers, string values, temporal values such as dates and times,
spatial values, and the NULL value.

Numeric Values

Numbers are values such as 48 or 193.62. MySQL understands
numbers specified as integers (which have no fractional part) and floating-point
or fixed-point values (which may have a fractional part). Integers can be specified
in decimal or hexadecimal format.

An integer consists of a sequence of digits with no decimal point. An integer
can also be written as a hexadecimal constant, which in numeric contexts is
treated as a 64-bit integer. For example, 0x10 is 16 decimal. Because hexadecimal
values are treated as strings by default, their syntax is given in the next
section, "String Values."

A floating-point or fixed-point number consists of a sequence of digits, a
decimal point, and another sequence of digits. The sequence of digits before
or after the decimal point may be empty, but not both.

MySQL understands scientific notation. This is indicated by immediately following
an integer or floating-point number with 'e' or 'E',
a sign character ('+' or '-'), and an
integer exponent. 1.34E+12 and 43.27e-1 are legal numbers
in scientific notation. The number 1.34E12 is also legal even though
it is missing an optional sign character before the exponent.

Hexadecimal numbers cannot be used in scientific notation; the 'e' that
begins the exponent part is also a legal hex digit and thus would be ambiguous.

Any number can be preceded by a plus or minus sign character ('+' or '-'),
to indicate a positive or negative value.

As of MySQL 5.0.3, bit-field values can be written as b'val',
where val consists of one or more binary digits (0 or 1). For
example, b'1001' is 9 decimal. This notation coincides with
the introduction of the BIT data type, but bit-field values can be
used more generally in other contexts.
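The numeric literal forms described in this section can be tried with a few statements such as the following. This is only a sketch: the bit-field literal requires MySQL 5.0.3 or later, and adding zero is used here to force a numeric context for the hexadecimal and bit-field values.

```sql
SELECT 48, 193.62, -7, .5;            -- integer, fixed-point, signed, empty integer part
SELECT 1.34E+12, 43.27e-1, 1.34E12;   -- scientific notation, with and without exponent sign
SELECT 0x10 + 0, b'1001' + 0;         -- returns 16 and 9
```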

String Values

Strings are values such as 'Madison, Wisconsin', 'patient
shows improvement', or even '12345' (which looks
like a number, but isn't). Usually, you can use either single or double
quotes to surround a string value, but there are two reasons to stick with
single quotes:

The SQL standard specifies single quotes, so statements that use single-quoted
strings are more portable to other database engines.

If the ANSI_QUOTES SQL mode is enabled, MySQL treats the double
quote as an identifier quoting character, not as a string quoting character.
This means that a double-quoted value is taken as an identifier, such as
a database or table name, rather than as a string.

For the examples that use the double quote as a string quoting character in
the discussion that follows, assume that ANSI_QUOTES mode is not enabled.

MySQL recognizes several escape sequences within strings that indicate special
characters, as shown in Table 3.1. Each sequence begins with a backslash character
('\') to signify a temporary escape from the usual rules
for character interpretation. Note that a NUL byte is not the same as the SQL NULL value;
NUL is a zero-valued byte, whereas NULL in SQL signifies the absence
of a value.

Table 3.1 String Escape Sequences

Sequence    Meaning
\0          NUL (zero-valued byte)
\'          Single quote
\"          Double quote
\b          Backspace
\n          Newline (linefeed)
\r          Carriage return
\t          Tab
\\          Single backslash
\Z          Ctrl-Z (Windows EOF character)

The escape sequences shown in the table are case sensitive, and any character
not listed in the table is interpreted as itself if preceded by a backslash.
For example, \t is a tab, but \T is an ordinary 'T' character.

The table shows how to escape single or double quotes using backslash sequences,
but you actually have several options for including quote characters within
string values:

Double the quote character if the string itself is quoted using the same
character:

'I can''t'
"He said, ""I told you so."""

Quote the string with the other quote character. In this case, you do
not double the quote characters within the string:

"I can't"
'He said, "I told you so."'

Escape the quote character with a backslash; this works regardless of
the quote characters used to quote the string:
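Assuming that neither ANSI_QUOTES nor NO_BACKSLASH_ESCAPES is enabled, backslash-escaped quotes can be written as in this sketch. Within each statement, both columns display the same string.

```sql
SELECT 'I can\'t', "I can\'t";
SELECT 'He said, \"I told you so.\"', "He said, \"I told you so.\"";
```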

To turn off the special meaning of backslash and treat it as an ordinary character,
enable the NO_BACKSLASH_ESCAPES SQL mode, which is available as of
MySQL 5.0.2.

As an alternative to using quotes for writing string values, you can use two
forms of hexadecimal notation. The first consists of '0x' followed
by one or more hexadecimal digits ('0' through '9' and 'a' through 'f').
For example, 0x0a is 10 decimal, and 0xffff is 65535 decimal.
The non-decimal hex digits ('a' through 'f')
can be specified in uppercase or lowercase, but the leading '0x' cannot
be given as '0X'. That is, 0x0a and 0x0A are
legal hexadecimal values, but 0X0a and 0X0A are not. In string
contexts, pairs of hexadecimal digits are interpreted as 8-bit numeric byte
values in the range from 0 to 255, and the result is used as a string. In numeric
contexts, a hexadecimal constant is treated as a number. The following statement
illustrates the interpretation of a hex constant in each type of context:
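A statement along these lines shows both interpretations. The value 0x61626364 is chosen here for illustration because its bytes are the character codes for 'a' through 'd'. Some client programs may display the first column in hexadecimal form rather than as text.

```sql
SELECT 0x61626364, 0x61626364 + 0;   -- string 'abcd' and number 1633837924
```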

If a hexadecimal value written using 0x notation has an odd number
of hex digits, MySQL treats it as though the value has a leading zero. For
example, 0xa is treated as 0x0a.

String values may also be specified using the standard SQL notation X'val',
where val consists of pairs of hexadecimal digits. As with 0x notation,
such values are interpreted as strings, but may be used as numbers in a numeric
context:
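For example, the following sketch uses an illustrative value whose bytes are the character codes for 'a' through 'd':

```sql
SELECT X'61626364', X'61626364' + 0;   -- string 'abcd' and number 1633837924
```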

Properties of Binary and Non-Binary Strings

String values fall into two general categories, binary and non-binary:

A binary string is a sequence of bytes. These bytes are interpreted without
respect to any concept of character set. A binary string has no special
comparison or sorting properties. Comparisons are done byte by byte based
on numeric byte values. Trailing spaces are significant in comparisons.

A non-binary string is a sequence of characters. It is associated with
a character set, which determines the allowable characters that may be
used and how MySQL interprets the string contents. Character sets have
one or more collating (sorting) orders. The particular collation used for
a string determines the ordering of characters in the character set, which
affects comparison operations. Trailing spaces are not significant in comparisons.
The default character set and collation are latin1 and latin1_swedish_ci.

Character units vary in their storage requirements. A single-byte character
set such as latin1 uses one byte per character, but there also are
multi-byte character sets in which some or all characters require more than
one byte. For example, both of the Unicode character sets available in MySQL
are multi-byte. ucs2 is a double-byte character set in which each
character requires two bytes. utf8 is a variable-length multi-byte
character set with characters that take from one to three bytes.

To find out which character sets and collations are available in your server
as it currently is configured, use these two statements:
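The two statements are SHOW CHARACTER SET and SHOW COLLATION. The third statement in this sketch is optional and shows how a LIKE clause can narrow the output; the 'latin1%' pattern is just an example.

```sql
SHOW CHARACTER SET;
SHOW COLLATION;

-- Optional: restrict the output to collations whose names begin with latin1
SHOW COLLATION LIKE 'latin1%';
```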

As shown by the output from SHOW COLLATION, each collation is specific
to a given character set, but a given character set might have several collations.
Collation names usually consist of a character set name, a language, and an
additional suffix. For example, utf8_icelandic_ci is a collation for
the utf8 Unicode character set in which comparisons follow Icelandic
sorting rules and characters are compared in case-insensitive fashion. Collation
suffixes have the following meanings:

_ci indicates a case-insensitive collation.

_cs indicates a case-sensitive collation.

_bin indicates a binary collation. That is, comparisons are based
on character code values without reference to any language. For this reason, _bin collation
names do not include any language name. Examples: latin1_bin and utf8_bin.

The sorting properties for binary and non-binary strings differ as follows:

Binary strings are processed byte by byte in comparisons based solely
on the numeric value of each byte. One implication of this property is
that binary values appear to be case sensitive, but that actually is a
side effect of the fact that uppercase and lowercase versions of a character
have different numeric byte values. There isn't really any notion
of lettercase for binary strings. Lettercase is a function of collation,
which applies only to character (non-binary) strings.

Non-binary strings are processed character by character in comparisons,
and the relative value of each character is determined by the collating
sequence that is used for the character set. For most collations, uppercase
and lowercase versions of a given letter have the same collating value,
so non-binary string comparisons typically are not case sensitive. However,
that is not true for case-sensitive or binary collations.
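The effect of collation on lettercase in comparisons can be sketched as follows. The collation names are assumptions about what your server has available; latin1_general_cs is used here as an example of a case-sensitive latin1 collation.

```sql
SELECT _latin1 'abc' COLLATE latin1_swedish_ci = _latin1 'ABC';  -- 1 (case-insensitive)
SELECT _latin1 'abc' COLLATE latin1_general_cs = _latin1 'ABC';  -- 0 (case-sensitive)
SELECT _latin1 'abc' COLLATE latin1_bin = _latin1 'ABC';         -- 0 (binary collation)
SELECT BINARY 'abc' = BINARY 'ABC';                              -- 0 (binary strings)
```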

Because collations are used for comparison and sorting, they affect many operations:

Comparison operators: <, <=, =, <>, >=, >,
and LIKE.

Sorting: ORDER BY, MIN(), and MAX().

Grouping: GROUP BY and DISTINCT.

To determine the character set or collation of a string, you can use the CHARSET() and COLLATION() functions.

Quoted string literals are interpreted according to the current server settings.
The default character set and collation are latin1 and latin1_swedish_ci:
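A quick way to check this on your own server is to pass a literal to both functions. The result depends on the connection settings, so no particular output is assumed here.

```sql
-- Reports the character set and collation applied to a quoted literal
SELECT CHARSET('abcd'), COLLATION('abcd');
```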

Two forms of notation can be used to force a string literal to be interpreted
with a given character set:

A string constant can be designated for interpretation with a given character
set using the following notation, where charset is the
name of a supported character set:

_charset str

The _charset notation is called a "character set
introducer." The string can be written as a quoted string or as a
hexadecimal value. The following examples show how to cause strings to
be interpreted in the latin2 and utf8 character sets:

For quoted strings, whitespace is optional between the introducer and
the following string. For hexadecimal values, whitespace is required.

The notation N'str' is equivalent to _utf8'str'. N must
be followed immediately by a quoted literal string with no intervening
whitespace.
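The two notations just described can be sketched as follows, assuming that the latin2 and utf8 character sets are available in your server. Note the space between the introducer and each hexadecimal value, and the absence of a space after N.

```sql
SELECT _latin2 'abc', _latin2 0x616263, _latin2 X'616263';
SELECT _utf8 'def', _utf8 0x646566, _utf8 X'646566';
SELECT N'def';   -- N must be followed immediately by the quoted string
```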

Introducer notation works for literal quoted strings or hexadecimal constants,
but not for string expressions or column values. However, any string or string
expression can be used to produce a string in a designated character set using
the CONVERT() function:

CONVERT(str USING charset);

Introducers and CONVERT() are not the same. An introducer does not
change the string value; it merely modifies how the string is interpreted. CONVERT() takes
a string argument and produces a new string in the desired character set. To
see the difference between introducers and CONVERT(), consider the
following two statements that refer to the ucs2 double-byte character
set:
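Two such statements might be written as follows. The user variable names @s1 and @s2 are chosen here for illustration.

```sql
SET @s1 = _ucs2 'ABCD';                 -- introducer: interprets the existing bytes as ucs2
SET @s2 = CONVERT('ABCD' USING ucs2);   -- conversion: produces a new ucs2 string
```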

Assume that the default character set is latin1 (a single-byte character
set). The first statement interprets each pair of characters in the string 'ABCD' as
a single double-byte ucs2 character, resulting in a two-character ucs2 string.
The second statement converts each character of the string 'ABCD' to
the corresponding ucs2 character, resulting in a four-character ucs2 string.

What is the "length" of each string? It depends. If you measure
with CHAR_LENGTH(), you get the length in characters. If you measure
with LENGTH(), you get the length in bytes:
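The following sketch measures both strings each way. It assumes the illustrative variable names @s1 and @s2, and a client character set in which 'ABCD' occupies four bytes.

```sql
SET @s1 = _ucs2 'ABCD';
SET @s2 = CONVERT('ABCD' USING ucs2);

SELECT CHAR_LENGTH(@s1), LENGTH(@s1),   -- 2 characters, 4 bytes
       CHAR_LENGTH(@s2), LENGTH(@s2);   -- 4 characters, 8 bytes
```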

Here is a somewhat subtle point: A binary string is not the same thing as
a non-binary string that has a binary collation. The binary string has no character
set. It is interpreted with byte semantics and comparisons use single-byte
numeric codes. A non-binary string with a binary collation has character semantics
and comparisons use numeric character values that might be based on multiple
bytes per character.

Here's one way to see the difference between binary and non-binary strings
with regard to lettercase. Create a binary string and a non-binary string that
has a binary collation, and then pass each string to the UPPER() function:
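A sketch of that test follows, using illustrative variable names and assuming that the latin1 character set is available. Some client programs may display the binary value in hexadecimal form.

```sql
SET @s1 = BINARY 'abcd';                       -- binary string
SET @s2 = _latin1 'abcd' COLLATE latin1_bin;   -- non-binary string with a binary collation
SELECT UPPER(@s1), UPPER(@s2);                 -- returns abcd and ABCD
```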

Why doesn't UPPER() convert the binary string to uppercase?
A binary string has no character set, so there is no way to know which
byte values correspond to uppercase or lowercase characters. To use a binary
string with functions such as UPPER() and LOWER(), you must
first convert it to a non-binary string:
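One way to do this is with CONVERT(), as in the following sketch. The variable name @s1 and the choice of latin1 as the target character set are assumptions made for illustration.

```sql
SET @s1 = BINARY 'abcd';
SELECT @s1, UPPER(CONVERT(@s1 USING latin1));   -- returns abcd and ABCD
```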

Character SetRelated System Variables

The server maintains several system variables that are involved in various
aspects of character set support. Six of these variables refer to character
sets and three refer to collations. Each of the collation variables is linked
to a corresponding character set variable.

character_set_system indicates the character set used for storing
identifiers. This is always utf8.

character_set_server and collation_server indicate the
server's default character set and collation.

character_set_database and collation_database indicate
the character set and collation of the default database. These are read-only
and set automatically by the server whenever you select a default database.
If there is no default database, they're set to the server's
default character set and collation. These variables come into play when
you create a table but specify no explicit character set or collation.
In this case, the table defaults are taken from the database defaults.

The remaining variables influence how communication occurs between the
client and the server:

character_set_client indicates the character set used for SQL
statements that the client sends to the server.

character_set_results indicates the character set used for
results that the server returns to the client. "Results" include
data values and also metadata such as column names.

character_set_connection is used by the server. When it receives
a statement string from the client, it converts the string from character_set_client to character_set_connection and
works with the statement in the latter character set. (There is an exception:
Any literal string in the statement that is preceded by a character set
introducer is interpreted using the character set indicated by the introducer.) collation_connection is
used for comparisons between literal strings within statement strings.

Very likely you'll find that most character set and collation variables
are set to the same value by default. For example, the following output indicates
that client/server communication takes place using the latin1 character
set:
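The values in effect on your own server can be displayed with statements such as these. The results depend on the server configuration, so no particular values are assumed here. The backslashes keep the underscores from acting as pattern wildcards.

```sql
SHOW VARIABLES LIKE 'character\_set\_%';
SHOW VARIABLES LIKE 'collation\_%';
```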

Date and Time (Temporal) Values

Dates and times are values such as '2005-06-17' or '12:30:43'.
MySQL also understands combined date/time values, such as '2005-06-17
12:30:43'. Take special note of the fact that MySQL represents dates
in year-month-day order. This often surprises newcomers to MySQL, although
this is standard SQL format (also known as "ISO 8601" format). You
can display date values any way you like using the DATE_FORMAT() function,
but the default display format lists the year first, and input values must
be specified with the year first.
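As a brief sketch of reformatting for display, the following statements pass a year-first date to DATE_FORMAT(). The format strings are only examples.

```sql
SELECT DATE_FORMAT('2005-06-17', '%m/%d/%Y');    -- returns 06/17/2005
SELECT DATE_FORMAT('2005-06-17', '%M %e, %Y');   -- month name, day of month, four-digit year
```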

Spatial Values

MySQL 4.1 and up supports spatial values, although currently only for MyISAM
tables. This capability allows representation of values such as points, lines,
and polygons. For example, the following statement uses the text representation
of a point value with X and Y coordinates of (10, 20) to create a POINT and
assigns the result to a user-defined variable:

SET @pt = POINTFROMTEXT('POINT(10 20)');

The NULL Value

NULL is something of a "typeless" value. Generally, it's
used to mean "no value," "unknown value," "missing
value," "out of range," "not applicable," "none
of the above," and so forth. You can insert NULL values into
tables, retrieve them from tables, and test whether a value is NULL.
However, you cannot perform arithmetic on NULL values; if you try,
the result is NULL. Also, many functions return NULL if you
invoke them with a NULL argument.
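These properties can be checked with a few simple expressions, shown here as a sketch:

```sql
SELECT NULL + 1, NULL * 0;        -- both results are NULL
SELECT CONCAT('abc', NULL);       -- NULL, because one argument is NULL
SELECT NULL IS NULL, 0 IS NULL;   -- returns 1 and 0
SELECT NULL = NULL;               -- NULL, so use IS NULL to test for NULL
```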