An expression that contains NULL always produces
a NULL value unless otherwise indicated in the
documentation for a particular function or operator.

Note

By default, there must be no whitespace between a function name
and the parenthesis following it. This helps the MySQL parser
distinguish between function calls and references to tables or
columns that happen to have the same name as a function. However,
spaces around function arguments are permitted.

12.2 Type Conversion in Expression Evaluation

When an operator is used with operands of different types, type
conversion occurs to make the operands compatible. Some
conversions occur implicitly. For example, MySQL automatically
converts numbers to strings as necessary, and vice versa.

See later in this section for information about the character set
of implicit number-to-string conversions, and for modified rules
that apply to CREATE TABLE ... SELECT
statements.

The following rules describe how conversion occurs for comparison
operations:

If one or both arguments are NULL, the
result of the comparison is NULL, except
for the NULL-safe
<=>
equality comparison operator. For NULL <=>
NULL, the result is true. No conversion is needed.

If both arguments in a comparison operation are strings, they
are compared as strings.

If both arguments are integers, they are compared as integers.

Hexadecimal values are treated as binary strings if not
compared to a number.

If one of the arguments is a
TIMESTAMP or
DATETIME column and the other
argument is a constant, the constant is converted to a
timestamp before the comparison is performed. This is done to
be more ODBC-friendly. Note that this is not done for the
arguments to IN()! To be safe,
always use complete datetime, date, or time strings when doing
comparisons. For example, to achieve best results when using
BETWEEN with date or time values,
use CAST() to explicitly
convert the values to the desired data type.

A single-row subquery from a table or tables is not considered
a constant. For example, if a subquery returns an integer to
be compared to a DATETIME
value, the comparison is done as two integers. The integer is
not converted to a temporal value. To compare the operands as
DATETIME values, use
CAST() to explicitly convert
the subquery value to DATETIME.

If one of the arguments is a decimal value, comparison depends
on the other argument. The arguments are compared as decimal
values if the other argument is a decimal or integer value, or
as floating-point values if the other argument is a
floating-point value.

In all other cases, the arguments are compared as
floating-point (real) numbers.

Comparison of JSON values takes place at two levels. The first
level of comparison is based on the JSON types of the compared
values. If the types differ, the comparison result is determined
solely by which type has higher precedence. If the two values have
the same JSON type, a second level of comparison occurs using
type-specific rules. For comparison of JSON and non-JSON values,
the non-JSON value is converted to JSON and the values compared as
JSON values. For details, see Comparison and Ordering of JSON Values.

The following examples illustrate conversion of strings to numbers
for comparison operations:

For comparisons of a string column with a number, MySQL cannot use
an index on the column to look up the value quickly. If
str_col is an indexed string column,
the index cannot be used when performing the lookup in the
following statement:

SELECT * FROM tbl_name WHERE str_col=1;

The reason for this is that there are many different strings that
may convert to the value 1, such as
'1', ' 1', or
'1a'.

Comparisons that use floating-point numbers (or values that are
converted to floating-point numbers) are approximate because such
numbers are inexact. This might lead to results that appear
inconsistent:

Such results can occur because the values are converted to
floating-point numbers, which have only 53 bits of precision and
are subject to rounding:

mysql> SELECT '18015376320243459'+0.0;
-> 1.8015376320243e+16

Furthermore, the conversion from string to floating-point and from
integer to floating-point do not necessarily occur the same way.
The integer may be converted to floating-point by the CPU, whereas
the string is converted digit by digit in an operation that
involves floating-point multiplications.

The results shown will vary on different systems, and can be
affected by factors such as computer architecture or the compiler
version or optimization level. One way to avoid such problems is
to use CAST() so that a value is
not converted implicitly to a float-point number:

Accurate representation of values in cases where results
previously did not provide sufficient precision, such as for
values close to IEEE limits.

Conversion of numbers to string format with the best possible
precision. The precision of dtoa is always
the same or better than that of the standard C library
functions.

Because the conversions produced by this library differ in some
cases from non-dtoa results, the potential
exists for incompatibilities in applications that rely on previous
results. For example, applications that depend on a specific exact
result from previous conversions might need adjustment to
accommodate additional precision.

The dtoa library provides conversions with the
following properties. D represents a
value with a DECIMAL or string
representation, and F represents a
floating-point number in native binary (IEEE) format.

F ->
D conversion is done with the best
possible precision, returning D as
the shortest string that yields F
when read back in and rounded to the nearest value in native
binary format as specified by IEEE.

D ->
F conversion is done such that
F is the nearest native binary
number to the input decimal string
D.

These properties imply that F ->
D -> F
conversions are lossless unless F is
-inf, +inf, or
NaN. The latter values are not supported
because the SQL standard defines them as invalid values for
FLOAT or
DOUBLE.

For D ->
F -> D
conversions, a sufficient condition for losslessness is that
D uses 15 or fewer digits of precision,
is not a denormal value, -inf,
+inf, or NaN. In some cases,
the conversion is lossless even if D
has more than 15 digits of precision, but this is not always the
case.

This means that such a conversion results in a character
(nonbinary) string (a CHAR,
VARCHAR, or
LONGTEXT value), except in the case
that the connection character set is set to
binary. In that case, the conversion result is
a binary string (a BINARY,
VARBINARY, or
LONGBLOB value).

For integer expressions, the preceding remarks about expression
evaluation apply somewhat differently for
expression assignment; for example, in a
statement such as this:

CREATE TABLE t SELECT integer_expr;

In this case, the table in the column resulting from the
expression has type INT or
BIGINT depending on the length of
the integer expression. If the maximum length of the expression
does not fit in an INT,
BIGINT is used instead. The length
is taken from the max_length value of the
SELECT result set metadata (see
Section 27.7.5, “C API Data Structures”). This means that you can
force a BIGINT rather than
INT by use of a sufficiently long
expression:

Comparison operations result in a value of 1
(TRUE), 0
(FALSE), or NULL. These
operations work for both numbers and strings. Strings are
automatically converted to numbers and numbers to strings as
necessary.

The following relational comparison operators can be used to
compare not only scalar operands, but row operands:

= > < >= <= <> !=

The descriptions for those operators later in this section
detail how they work with row operands. For additional examples
of row comparisons in the context of row subqueries, see
Section 13.2.11.5, “Row Subqueries”.

To work well with ODBC programs, MySQL supports the
following extra features when using IS
NULL:

If sql_auto_is_null
variable is set to 1, then after a statement that
successfully inserts an automatically generated
AUTO_INCREMENT value, you can find
that value by issuing a statement of the following form:

SELECT * FROM tbl_name WHERE auto_col IS NULL

If the statement returns a row, the value returned is
the same as if you invoked the
LAST_INSERT_ID()
function. For details, including the return value after
a multiple-row insert, see
Section 12.14, “Information Functions”. If no
AUTO_INCREMENT value was successfully
inserted, the SELECT
statement returns no row.

If expr is greater than or equal
to min and
expr is less than or equal to
max,
BETWEEN returns
1, otherwise it returns
0. This is equivalent to the expression
(min <=
expr AND
expr <=
max) if all the
arguments are of the same type. Otherwise type conversion
takes place according to the rules described in
Section 12.2, “Type Conversion in Expression Evaluation”, but applied to all the
three arguments.

For best results when using
BETWEEN with date or time
values, use CAST() to
explicitly convert the values to the desired data type.
Examples: If you compare a
DATETIME to two
DATE values, convert the
DATE values to
DATETIME values. If you use a
string constant such as '2001-1-1' in a
comparison to a DATE, cast
the string to a DATE.

Returns 1 if
expr is equal to any of the
values in the IN list, else returns
0. If all values are constants, they are
evaluated according to the type of
expr and sorted. The search for
the item then is done using a binary search. This means
IN is very quick if the
IN value list consists entirely of
constants. Otherwise, type conversion takes place according
to the rules described in Section 12.2, “Type Conversion in Expression Evaluation”,
but applied to all the arguments.

You should never mix quoted and unquoted values in an
IN list because the comparison rules for
quoted values (such as strings) and unquoted values (such as
numbers) differ. Mixing types may therefore lead to
inconsistent results. For example, do not write an
IN expression like this:

Returns 0 if N
< N1, 1 if
N <
N2 and so on or
-1 if N is
NULL. All arguments are treated as
integers. It is required that N1
< N2 <
N3 < ...
< Nn for this function to work
correctly. This is because a binary search is used (very
fast).

12.3.3 Logical Operators

In SQL, all logical operators evaluate to
TRUE, FALSE, or
NULL (UNKNOWN). In MySQL,
these are implemented as 1 (TRUE), 0
(FALSE), and NULL. Most of
this is common to different SQL database servers, although some
servers may return any nonzero value for
TRUE.

MySQL evaluates any nonzero, non-NULL value
to TRUE. For example, the following
statements all assess to TRUE:

Logical OR. When both operands are
non-NULL, the result is
1 if any operand is nonzero, and
0 otherwise. With a
NULL operand, the result is
1 if the other operand is nonzero, and
NULL otherwise. If both operands are
NULL, the result is
NULL.

12.3.4 Assignment Operators

Assignment operator. Causes the user variable on the left
hand side of the operator to take on the value to its right.
The value on the right hand side may be a literal value,
another variable storing a value, or any legal expression
that yields a scalar value, including the result of a query
(provided that this value is a scalar value). You can
perform multiple assignments in the same
SET
statement. You can perform multiple assignments in the same
statement.

Unlike
=, the
:=
operator is never interpreted as a comparison operator. This
means you can use
:= in
any valid SQL statement (not just in
SET
statements) to assign a value to a variable.

While it is also possible both to set and to read the value
of the same variable in a single SQL statement using the
:=
operator, this is not recommended.
Section 9.4, “User-Defined Variables”, explains why you should
avoid doing this.

This operator is used to perform value assignments in two
cases, described in the next two paragraphs.

Within a
SET
statement, = is treated as an assignment
operator that causes the user variable on the left hand side
of the operator to take on the value to its right. (In other
words, when used in a
SET
statement, = is treated identically to
:=.)
The value on the right hand side may be a literal value,
another variable storing a value, or any legal expression
that yields a scalar value, including the result of a query
(provided that this value is a scalar value). You can
perform multiple assignments in the same
SET
statement.

In the SET clause of an
UPDATE statement,
= also acts as an assignment operator; in
this case, however, it causes the column named on the left
hand side of the operator to assume the value given to the
right, provided any WHERE conditions that
are part of the UPDATE are
met. You can make multiple assignments in the same
SET clause of an
UPDATE statement.

The first CASE syntax returns the
result for the first
value=compare_value
comparison that is true. The second syntax returns the result
for the first condition that is true. If no comparison or
condition is true, the result after ELSE is
returned, or NULL if there is no
ELSE part.

Note

The syntax of the CASEexpression described here differs
slightly from that of the SQL
CASEstatement described in
Section 13.6.5.1, “CASE Syntax”, for use inside stored programs. The
CASE statement cannot have an
ELSE NULL clause, and it is terminated
with END CASE instead of
END.

The return type of a CASE
expression result is the aggregated type of all result values:

If all types are numeric, the aggregated type is also
numeric:

If at least one argument is double precision, the
result is double precision.

Otherwise, if at least one argument is
DECIMAL, the result is
DECIMAL.

Otherwise, the result is an integer type (with one
exception):

If all integer types are all signed or all
unsigned, the result is the same sign and the
precision is the highest of all specified integer
types (that is,
TINYINT,
SMALLINT,
MEDIUMINT,
INT, or
BIGINT).

If there is a combination of signed and unsigned
integer types, the result is signed and the
precision may be higher. For example, if the types
are signed INT and
unsigned INT, the
result is signed
BIGINT.

The exception is unsigned
BIGINT combined
with any signed integer type. The result is
DECIMAL with
sufficient precision and scale 0.

If all types are BIT, the
result is BIT. Otherwise,
BIT arguments are treated
similar to BIGINT.

If all types are YEAR, the
result is YEAR. Otherwise,
YEAR arguments are treated similar to
INT.

If all types are character string
(CHAR or
VARCHAR), the result is
VARCHAR with maximum length
determined by the longest character length of the
operands.

The default return type of
IFNULL(expr1,expr2)
is the more “general” of the two expressions, in
the order STRING, REAL,
or INTEGER. Consider the case of a table
based on expressions or where MySQL must internally store a
value returned by IFNULL() in a
temporary table:

Returns the length of the string
str, measured in characters. A
multibyte character counts as a single character. This means
that for a string containing five 2-byte characters,
LENGTH() returns
10, whereas
CHAR_LENGTH() returns
5.

Returns the string that results from concatenating the
arguments. May have one or more arguments. If all arguments
are nonbinary strings, the result is a nonbinary string. If
the arguments include any binary strings, the result is a
binary string. A numeric argument is converted to its
equivalent nonbinary string form.

CONCAT_WS() stands for
Concatenate With Separator and is a special form of
CONCAT(). The first argument is
the separator for the rest of the arguments. The separator is
added between the strings to be concatenated. The separator
can be a string, as can the rest of the arguments. If the
separator is NULL, the result is
NULL.

ELT() returns the
Nth element of the list of strings:
str1 if
N = 1,
str2 if
N = 2, and so
on. Returns NULL if
N is less than 1
or greater than the number of arguments.
ELT() is the complement of
FIELD().

Returns a string such that for every bit set in the value
bits, you get an
on string and for every bit not set
in the value, you get an off
string. Bits in bits are examined
from right to left (from low-order to high-order bits).
Strings are added to the result from left to right, separated
by the separator string (the
default being the comma character ,). The
number of bits examined is given by
number_of_bits, which has a default
of 64 if not specified.
number_of_bits is silently clipped
to 64 if larger than 64. It is treated as an unsigned integer,
so a value of −1 is effectively the same as 64.

Returns a value in the range of 1 to
N if the string
str is in the string list
strlist consisting of
N substrings. A string list is a
string composed of substrings separated by
, characters. If the first argument is a
constant string and the second is a column of type
SET, the
FIND_IN_SET() function is
optimized to use bit arithmetic. Returns 0
if str is not in
strlist or if
strlist is the empty string.
Returns NULL if either argument is
NULL. This function does not work properly
if the first argument contains a comma (,)
character.

Formats the number X to a format
like '#,###,###.##', rounded to
D decimal places, and returns the
result as a string. If D is
0, the result has no decimal point or
fractional part.

The optional third parameter enables a locale to be specified
to be used for the result number's decimal point, thousands
separator, and grouping between separators. Permissible locale
values are the same as the legal values for the
lc_time_names system variable
(see Section 10.7, “MySQL Server Locale Support”). If no locale is
specified, the default is 'en_US'.

Takes a string encoded with the base-64 encoded rules used by
TO_BASE64() and returns the
decoded result as a binary string. The result is
NULL if the argument is
NULL or not a valid base-64 string. See the
description of TO_BASE64() for
details about the encoding and decoding rules.

For a string argument str,
HEX() returns a hexadecimal
string representation of str where
each byte of each character in str
is converted to two hexadecimal digits. (Multibyte characters
therefore become more than two digits.) The inverse of this
operation is performed by the
UNHEX() function.

For a numeric argument N,
HEX() returns a hexadecimal
string representation of the value of
N treated as a longlong
(BIGINT) number. This is
equivalent to
CONV(N,10,16).
The inverse of this operation is performed by
CONV(HEX(N),16,10).

Returns the string str, with the
substring beginning at position pos
and len characters long replaced by
the string newstr. Returns the
original string if pos is not
within the length of the string. Replaces the rest of the
string from position pos if
len is not within the length of the
rest of the string. Returns NULL if any
argument is NULL.

Returns the length of the string
str, measured in bytes. A multibyte
character counts as multiple bytes. This means that for a
string containing five 2-byte characters,
LENGTH() returns
10, whereas
CHAR_LENGTH() returns
5.

Reads the file and returns the file contents as a string. To
use this function, the file must be located on the server
host, you must specify the full path name to the file, and you
must have the FILE privilege.
The file must be readable by all and its size less than
max_allowed_packet bytes. If
the secure_file_priv system
variable is set to a nonempty directory name, the file to be
loaded must be located in that directory.

If the file does not exist or cannot be read because one of
the preceding conditions is not satisfied, the function
returns NULL.

The first syntax returns the position of the first occurrence
of substring substr in string
str. The second syntax returns the
position of the first occurrence of substring
substr in string
str, starting at position
pos. Returns 0
if substr is not in
str. Returns
NULL if any argument is
NULL.

For collations of Unicode character sets,
LOWER() and
UPPER() work according to the
Unicode Collation Algorithm (UCA) version in the collation
name, if there is one, and UCA 4.0.0 if no version is
specified. For example, utf8mb4_0900_ai_ci
and utf8_unicode_520_ci work according to
UCA 9.0.0 and 5.2.0, respectively, whereas
utf8_unicode_ci works according to UCA
4.0.0. See Section 10.1.10.1, “Unicode Character Sets”.

Returns a set value (a string containing substrings separated
by , characters) consisting of the strings
that have the corresponding bit in
bits set.
str1 corresponds to bit 0,
str2 to bit 1, and so on.
NULL values in
str1,
str2, ... are
not appended to the result.

Quotes a string to produce a result that can be used as a
properly escaped data value in an SQL statement. The string is
returned enclosed by single quotation marks and with each
instance of backslash (\), single quote
('), ASCII NUL, and
Control+Z preceded by a backslash. If the argument is
NULL, the return value is the word
“NULL” without enclosing single quotation marks.

Returns a soundex string from str.
Two strings that sound almost the same should have identical
soundex strings. A standard soundex string is four characters
long, but the SOUNDEX()
function returns an arbitrarily long string. You can use
SUBSTRING() on the result to
get a standard soundex string. All nonalphabetic characters in
str are ignored. All international
alphabetic characters outside the A-Z range are treated as
vowels.

Important

When using SOUNDEX(), you
should be aware of the following limitations:

This function, as currently implemented, is intended to
work well with strings that are in the English language
only. Strings in other languages may not produce reliable
results.

This function is not guaranteed to provide consistent
results with strings that use multibyte character sets,
including utf-8. See Bug #22638 for
more information.

This function implements the original Soundex algorithm, not
the more popular enhanced version (also described by D.
Knuth). The difference is that original version discards
vowels first and duplicates second, whereas the enhanced
version discards duplicates first and vowels second.

The forms without a len argument
return a substring from string str
starting at position pos. The forms
with a len argument return a
substring len characters long from
string str, starting at position
pos. The forms that use
FROM are standard SQL syntax. It is also
possible to use a negative value for
pos. In this case, the beginning of
the substring is pos characters
from the end of the string, rather than the beginning. A
negative value may be used for pos
in any of the forms of this function.

For all forms of SUBSTRING(),
the position of the first character in the string from which
the substring is to be extracted is reckoned as
1.

Returns the substring from string
str before
count occurrences of the delimiter
delim. If
count is positive, everything to
the left of the final delimiter (counting from the left) is
returned. If count is negative,
everything to the right of the final delimiter (counting from
the right) is returned.
SUBSTRING_INDEX() performs a
case-sensitive match when searching for
delim.

Converts the string argument to base-64 encoded form and
returns the result as a character string with the connection
character set and collation. If the argument is not a string,
it is converted to a string before conversion takes place. The
result is NULL if the argument is
NULL. Base-64 encoded strings can be
decoded using the FROM_BASE64()
function.

Different base-64 encoding schemes exist. These are the
encoding and decoding rules used by
TO_BASE64() and
FROM_BASE64():

The encoding for alphabet value 62 is
'+'.

The encoding for alphabet value 63 is
'/'.

Encoded output consists of groups of 4 printable
characters. Each 3 bytes of the input data are encoded
using 4 characters. If the last group is incomplete, it is
padded with '=' characters to a length
of 4.

A newline is added after each 76 characters of encoded
output to divide long output into multiple lines.

Returns the string str with all
remstr prefixes or suffixes
removed. If none of the specifiers BOTH,
LEADING, or TRAILING is
given, BOTH is assumed.
remstr is optional and, if not
specified, spaces are removed.

For a string argument str,
UNHEX(str)
interprets each pair of characters in the argument as a
hexadecimal number and converts it to the byte represented by
the number. The return value is a binary string.

A NULL result can occur if the argument to
UNHEX() is a
BINARY column, because values
are padded with 0x00 bytes when stored but those bytes are not
stripped on retrieval. For example, '41' is
stored into a CHAR(3) column as
'41 ' and retrieved as
'41' (with the trailing pad space
stripped), so UNHEX() for the
column value returns 'A'. By contrast
'41' is stored into a
BINARY(3) column as
'41\0' and retrieved as
'41\0' (with the trailing pad
0x00 byte not stripped).
'\0' is not a legal hexadecimal digit, so
UNHEX() for the column value
returns NULL.

Returns the string str with all
characters changed to uppercase according to the current
character set mapping. The default is
utf8mb4.

mysql> SELECT UPPER('Hej');
-> 'HEJ'

See the description of LOWER()
for information that also applies to
UPPER(). This included
information about how to perform lettercase conversion of
binary strings (BINARY,
VARBINARY,
BLOB) for which these functions
are ineffective, and information about case folding for
Unicode character sets.

This list briefly summarizes the arguments. More details are
given in the discussion following the list.

str: The input string
expression.

AS clause: Optional; cast the input
string to a given type and length.

flags: Optional; unused.

The input string, str, is a string
expression. If the input is a nonbinary (character) string
such as a CHAR,
VARCHAR, or
TEXT value, the return value
contains the collation weights for the string. If the input is
a binary (byte) string such as a
BINARY,
VARBINARY, or
BLOB value, the return value is
the same as the input (the weight for each byte in a binary
string is the byte value). If the input is
NULL,
WEIGHT_STRING() returns
NULL.

The preceding examples use
HEX() to display the
WEIGHT_STRING() result. Because
the result is a binary value,
HEX() can be especially useful
when the result contains nonprinting values, to display it in
printable form:

For non-NULL return values, the data type
of the value is VARBINARY if
its length is within the maximum length for
VARBINARY, otherwise the data
type is BLOB.

The AS clause may be given to cast the
input string to a nonbinary or binary string and to force it
to a given length:

AS CHAR(N)
casts the string to a nonbinary string and pads it on the
right with spaces to a length of
N characters.
N must be at least 1. If
N is less than the length of
the input string, the string is truncated to
N characters. No warning occurs
for truncation.

AS BINARY(N)
is similar but casts the string to a binary string,
N is measured in bytes (not
characters), and padding uses 0x00
bytes (not spaces).

As an extension to standard SQL, MySQL permits
LIKE on numeric expressions.

mysql> SELECT 10 LIKE '1%';
-> 1

Note

Because MySQL uses C escape syntax in strings (for
example, \n to represent a newline
character), you must double any \ that
you use in LIKE strings. For
example, to search for \n, specify it
as \\n. To search for
\, specify it as
\\\\; this is because the backslashes
are stripped once by the parser and again when the pattern
match is made, leaving a single backslash to be matched
against.

Exception: At the end of the pattern string, backslash can
be specified as \\. At the end of the
string, backslash stands for itself because there is
nothing following to escape. Suppose that a table contains
the following values:

The query SELECT COUNT(*) FROM foo WHERE bar LIKE
'%baz%'; returns 0. You might
assume that SELECT COUNT(*) FROM foo WHERE bar
NOT LIKE '%baz%'; would return
2. However, this is not the case: The
second query returns 0. This is because
NULL NOT LIKE
expr always returns
NULL, regardless of the value of
expr. The same is true for
aggregate queries involving NULL and
comparisons using
NOT
RLIKE or NOT
REGEXP. In such cases, you must test explicitly
for NOT NULL using
OR (and not
AND), as shown here:

12.5.2 Regular Expressions

A regular expression is a powerful way of specifying a pattern
for a complex search.

MySQL uses Henry Spencer's implementation of regular
expressions, which is aimed at conformance with POSIX 1003.2.
MySQL uses the extended version to support pattern-matching
operations performed with the
REGEXP operator in SQL statements.

This section summarizes, with examples, the special characters
and constructs that can be used in MySQL for
REGEXP operations. It does not
contain all the details that can be found in Henry Spencer's
regex(7) manual page. That manual page is
included in MySQL source distributions, in the
regex.7 file under the
regex directory. See also
Section 3.3.4.7, “Pattern Matching”.

Regular Expression Operators

Performs a pattern match of a string expression
expr against a pattern
pat. The pattern can be an
extended regular expression, the syntax for which is
discussed later in this section. Returns
1 if expr
matches pat; otherwise it returns
0. If either
expr or
pat is NULL,
the result is NULL.
RLIKE is a
synonym for REGEXP.

The pattern need not be a literal string. For example, it
can be specified as a string expression or table column.

Note

Because MySQL uses the C escape syntax in strings (for
example, \n to represent the newline
character), you must double any \ that
you use in your REGEXP
strings.

The REGEXP and
RLIKE
operators work in byte-wise fashion, so they are not
multibyte safe and may produce unexpected results with
multibyte character sets. In addition, these operators
compare characters by their byte values and accented
characters may not compare as equal even if a given
collation treats them as equal.

Syntax of Regular Expressions

A regular expression describes a set of strings. The simplest
regular expression is one that has no special characters in it.
For example, the regular expression hello
matches hello and nothing else.

Nontrivial regular expressions use certain special constructs so
that they can match more than one string. For example, the
regular expression hello|word matches either
the string hello or the string
word.

As a more complex example, the regular expression
B[an]*s matches any of the strings
Bananas, Baaaaas,
Bs, and any other string starting with a
B, ending with an s, and
containing any number of a or
n characters in between.

A regular expression for the REGEXP
operator may use any of the following special characters and
constructs:

Matches any character that is (or is not, if ^ is used)
either a, b,
c, d or
X. A - character
between two other characters forms a range that matches all
characters from the first character to the second. For
example, [0-9] matches any decimal digit.
To include a literal ] character, it must
immediately follow the opening bracket [.
To include a literal - character, it must
be written first or last. Any character that does not have a
defined special meaning inside a [] pair
matches only itself.

Within a bracket expression (written using
[ and ]), matches the
sequence of characters of that collating element.
characters is either a single character
or a character name like newline. The
following table lists the permissible character names.

The following table shows the permissible character names
and the characters that they match. For characters given as
numeric values, the values are represented in octal.

Within a bracket expression (written using
[ and ]),
[=character_class=] represents an
equivalence class. It matches all characters with the same
collation value, including itself. For example, if
o and (+) are the
members of an equivalence class, [[=o=]],
[[=(+)=]], and [o(+)]
are all synonymous. An equivalence class may not be used as
an endpoint of a range.

[:character_class:]

Within a bracket expression (written using
[ and ]),
[:character_class:] represents a
character class that matches all characters belonging to
that class. The following table lists the standard class
names. These names stand for the character classes defined
in the ctype(3) manual page. A particular
locale may provide other class names. A character class may
not be used as an endpoint of a range.

These markers stand for word boundaries. They match the
beginning and end of words, respectively. A word is a
sequence of word characters that is not preceded by or
followed by word characters. A word character is an
alphanumeric character in the alnum class
or an underscore (_).

To use a literal instance of a special character in a regular
expression, precede it by two backslash (\) characters. The
MySQL parser interprets one of the backslashes, and the regular
expression library interprets the other. For example, to match
the string 1+2 that contains the special
+ character, only the last of the following
regular expressions is the correct one:

The REPLACE() function, unlike
all other functions, always ignores the collation of the
string input and performs a case-sensitive comparison.

If a string input or function result is a binary string, the
string has the binary character set and
collation. This can be checked by using the
CHARSET() and
COLLATION() functions, both of
which return binary for a binary string
argument:

An exception to the preceding priniciple occurs for expressions
for virtual generated columns. In such expressions, the table
character set is used for
BIN_TO_UUID(),
CONV(), or
HEX() results, regardless of
connection character set.

If there is any question about the character set or collation of
the result returned by a string function, use the
CHARSET() or
COLLATION() function to find out:

12.6.1 Arithmetic Operators

The usual arithmetic operators are available. The result is
determined according to the following rules:

In the case of
-,
+, and
*, the result
is calculated with BIGINT
(64-bit) precision if both operands are integers.

If both operands are integers and any of them are unsigned,
the result is an unsigned integer. For subtraction, if the
NO_UNSIGNED_SUBTRACTION
SQL mode is enabled, the result is signed even if any
operand is unsigned.

If any of the operands of a
+,
-,
/,
*,
% is a real or
string value, the precision of the result is the precision
of the operand with the maximum precision.

In division performed with
/, the scale
of the result when using two exact-value operands is the
scale of the first operand plus the value of the
div_precision_increment
system variable (which is 4 by default). For example, the
result of the expression 5.05 / 0.014 has
a scale of six decimal places
(360.714286).

These rules are applied for each operation, such that nested
calculations imply the precision of each component. Hence,
(14620 / 9432456) / (24250 / 9432456),
resolves first to (0.0014) / (0.0026), with
the final result having 8 decimal places
(0.60288653).

Because of these rules and the way they are applied, care should
be taken to ensure that components and subcomponents of a
calculation use the appropriate level of precision. See
Section 12.10, “Cast Functions and Operators”.

Integer division. Discards from the division result any
fractional part to the right of the decimal point.

If either operand has a noninteger type, the operands are
converted to DECIMAL and
divided using DECIMAL
arithmetic before converting the result to
BIGINT. If the result exceeds
BIGINT range, an error occurs.

Converts numbers between different number bases. Returns a
string representation of the number
N, converted from base
from_base to base
to_base. Returns
NULL if any argument is
NULL. The argument
N is interpreted as an integer,
but may be specified as an integer or a string. The minimum
base is 2 and the maximum base is
36. If
from_base is a negative number,
N is regarded as a signed number.
Otherwise, N is treated as
unsigned. CONV() works with
64-bit precision.

Computes a cyclic redundancy check value and returns a
32-bit unsigned value. The result is NULL
if the argument is NULL. The argument is
expected to be a string and (if possible) is treated as one
if it is not.

This function can be used to obtain a hexadecimal
representation of a decimal number or a string; the manner
in which it does so varies according to the argument's
type. See this function's description in
Section 12.5, “String Functions”, for details.

If called with one parameter, this function returns the
natural logarithm of X. If
X is less than or equal to 0.0E0,
the function returns NULL and a warning
“Invalid argument for logarithm” is reported.

The inverse of this function (when called with a single
argument) is the EXP()
function.

Returns a random floating-point value
v in the range
0 <= v <
1.0. To obtain a random integer
R in the range
i <=
R <
j, use the expression
FLOOR(i
+ RAND() * (j
− i)).
For example, to obtain a random integer in the range the
range 7 <=
R < 12, use
the following statement:

SELECT FLOOR(7 + (RAND() * 5));

If an integer argument N is
specified, it is used as the seed value:

With a constant initializer argument, the seed is
initialized once when the statement is prepared, prior
to execution.

With a nonconstant initializer argument (such as a
column name), the seed is initialized with the value for
each invocation of
RAND().

One implication of this behavior is that for equal argument
values,
RAND(N)
returns the same value each time, and thus produces a
repeatable sequence of column values. In the following
example, the sequence of values produced by
RAND(3) is the same both places it
occurs.

RAND() in a
WHERE clause is evaluated for every row
(when selecting from one table) or combination of rows (when
selecting from a multiple-table join). Thus, for optimizer
purposes, RAND() is not a
constant value and cannot be used for index optimizations.
For more information, see
Section 8.2.1.17, “Function Call Optimization”.

Use of a column with RAND()
values in an ORDER BY or GROUP
BY clause may yield unexpected results because for
either clause a RAND()
expression can be evaluated multiple times for the same row,
each time returning a different result. If the goal is to
retrieve rows in random order, you can use a statement like
this:

SELECT * FROM tbl_name ORDER BY RAND();

To select a random sample from a set of rows, combine
ORDER BY RAND() with
LIMIT:

SELECT * FROM table1, table2 WHERE a=b AND c<d ORDER BY RAND() LIMIT 1000;

RAND() is not meant to be a
perfect random generator. It is a fast way to generate
random numbers on demand that is portable between platforms
for the same MySQL version.

This function is unsafe for statement-based replication. A
warning is logged if you use this function when
binlog_format is set to
STATEMENT. (Bug #49222)

Rounds the argument X to
D decimal places. The rounding
algorithm depends on the data type of
X. D
defaults to 0 if not specified. D
can be negative to cause D digits
left of the decimal point of the value
X to become zero.

ROUND() uses the following
rules depending on the type of the first argument:

For exact-value numbers,
ROUND() uses the
“round half away from zero” or “round
toward nearest” rule: A value with a fractional
part of .5 or greater is rounded up to the next integer
if positive or down to the next integer if negative. (In
other words, it is rounded away from zero.) A value with
a fractional part less than .5 is rounded down to the
next integer if positive or up to the next integer if
negative.

For approximate-value numbers, the result depends on the
C library. On many systems, this means that
ROUND() uses the "round
to nearest even" rule: A value with any fractional part
is rounded to the nearest even integer.

The following example shows how rounding differs for exact
and approximate values:

Returns the number X, truncated
to D decimal places. If
D is 0, the
result has no decimal point or fractional part.
D can be negative to cause
D digits left of the decimal
point of the value X to become
zero.

12.7 Date and Time Functions

This section describes the functions that can be used to
manipulate temporal values. See
Section 11.3, “Date and Time Types”, for a description of the
range of values each date and time type has and the valid formats
in which values may be specified.

Functions that expect date values usually accept datetime values
and ignore the time part. Functions that expect time values
usually accept datetime values and ignore the date part.

Functions that return the current date or time each are evaluated
only once per query at the start of query execution. This means
that multiple references to a function such as
NOW() within a single query always
produce the same result. (For our purposes, a single query also
includes a call to a stored program (stored routine, trigger, or
event) and all subprograms called by that program.) This principle
also applies to CURDATE(),
CURTIME(),
UTC_DATE(),
UTC_TIME(),
UTC_TIMESTAMP(), and to any of
their synonyms.

Some date functions can be used with “zero” dates or
incomplete dates such as '2001-11-00', whereas
others cannot. Functions that extract parts of dates typically
work with incomplete dates and thus can return 0 when you might
otherwise expect a nonzero value. For example:

mysql> SELECT DAYOFMONTH('2001-11-00'), MONTH('2005-00-00');
-> 0, 0

Other functions expect complete dates and return
NULL for incomplete dates. These include
functions that perform date arithmetic or that map parts of dates
to names. For example:

Fractional seconds for TIME,
DATETIME, and TIMESTAMP
values are supported, with up to microsecond precision. Functions
that take temporal arguments accept values with fractional
seconds. Return values from temporal functions include fractional
seconds as appropriate.

DATEDIFF() returns
expr1 −
expr2 expressed as a value in days
from one date to the other. expr1
and expr2 are date or date-and-time
expressions. Only the date parts of the values are used in the
calculation.

These functions perform date arithmetic. The
date argument specifies the
starting date or datetime value.
expr is an expression specifying
the interval value to be added or subtracted from the starting
date. expr is a string; it may
start with a - for negative intervals.
unit is a keyword indicating the
units in which the expression should be interpreted.

The INTERVAL keyword and the
unit specifier are not case
sensitive.

The following table shows the expected form of the
expr argument for each
unit value.

unit Value

Expected expr Format

MICROSECOND

MICROSECONDS

SECOND

SECONDS

MINUTE

MINUTES

HOUR

HOURS

DAY

DAYS

WEEK

WEEKS

MONTH

MONTHS

QUARTER

QUARTERS

YEAR

YEARS

SECOND_MICROSECOND

'SECONDS.MICROSECONDS'

MINUTE_MICROSECOND

'MINUTES:SECONDS.MICROSECONDS'

MINUTE_SECOND

'MINUTES:SECONDS'

HOUR_MICROSECOND

'HOURS:MINUTES:SECONDS.MICROSECONDS'

HOUR_SECOND

'HOURS:MINUTES:SECONDS'

HOUR_MINUTE

'HOURS:MINUTES'

DAY_MICROSECOND

'DAYS HOURS:MINUTES:SECONDS.MICROSECONDS'

DAY_SECOND

'DAYS HOURS:MINUTES:SECONDS'

DAY_MINUTE

'DAYS HOURS:MINUTES'

DAY_HOUR

'DAYS HOURS'

YEAR_MONTH

'YEARS-MONTHS'

The return value depends on the arguments:

DATETIME if the first
argument is a DATETIME (or
TIMESTAMP) value, or if the
first argument is a DATE
and the unit value uses
HOURS, MINUTES, or
SECONDS.

MySQL permits any punctuation delimiter in the
expr format. Those shown in the
table are the suggested delimiters. If the
date argument is a
DATE value and your
calculations involve only YEAR,
MONTH, and DAY parts
(that is, no time parts), the result is a
DATE value. Otherwise, the
result is a DATETIME value.

Date arithmetic also can be performed using
INTERVAL together with the
+ or
- operator:

date + INTERVAL exprunitdate - INTERVAL exprunit

INTERVAL exprunit is permitted on
either side of the
+ operator if
the expression on the other side is a date or datetime value.
For the -
operator, INTERVAL exprunit is permitted only on
the right side, because it makes no sense to subtract a date
or datetime value from an interval.

If you specify an interval value that is too short (does not
include all the interval parts that would be expected from the
unit keyword), MySQL assumes that
you have left out the leftmost parts of the interval value.
For example, if you specify a unit
of DAY_SECOND, the value of
expr is expected to have days,
hours, minutes, and seconds parts. If you specify a value like
'1:10', MySQL assumes that the days and
hours parts are missing and the value represents minutes and
seconds. In other words, '1:10' DAY_SECOND
is interpreted in such a way that it is equivalent to
'1:10' MINUTE_SECOND. This is analogous to
the way that MySQL interprets
TIME values as representing
elapsed time rather than as a time of day.

Because expr is treated as a
string, be careful if you specify a nonstring value with
INTERVAL. For example, with an interval
specifier of HOUR_MINUTE,
6/4 evaluates to 1.5000
and is treated as 1 hour, 5000 minutes:

Returns a representation of the
unix_timestamp argument as a value
in 'YYYY-MM-DD HH:MM:SS' or
YYYYMMDDHHMMSS format, depending on whether
the function is used in a string or numeric context. The value
is expressed in the current time zone.
unix_timestamp is an internal
timestamp value such as is produced by the
UNIX_TIMESTAMP() function.

If format is given, the result is
formatted according to the format
string, which is used the same way as listed in the entry for
the DATE_FORMAT() function.

The possible values for the first and second arguments result
in several possible format strings (for the specifiers used,
see the table in the
DATE_FORMAT() function
description). ISO format refers to ISO 9075, not ISO 8601.

Returns the current date and time as a value in
'YYYY-MM-DD HH:MM:SS' or
YYYYMMDDHHMMSS format, depending on whether
the function is used in a string or numeric context. The value
is expressed in the current time zone.

If the fsp argument is given to
specify a fractional seconds precision from 0 to 6, the return
value includes a fractional seconds part of that many digits.

NOW() returns a constant time
that indicates the time at which the statement began to
execute. (Within a stored function or trigger,
NOW() returns the time at which
the function or triggering statement began to execute.) This
differs from the behavior for
SYSDATE(), which returns the
exact time at which it executes.

In addition, the SET TIMESTAMP statement
affects the value returned by
NOW() but not by
SYSDATE(). This means that
timestamp settings in the binary log have no effect on
invocations of SYSDATE().
Setting the timestamp to a nonzero value causes each
subsequent invocation of NOW()
to return that value. Setting the timestamp to zero cancels
this effect so that NOW() once
again returns the current date and time.

See the description for
SYSDATE() for additional
information about the differences between the two functions.

Returns the seconds argument,
converted to hours, minutes, and seconds, as a
TIME value. The range of the
result is constrained to that of the
TIME data type. A warning
occurs if the argument corresponds to a value outside that
range.

This is the inverse of the
DATE_FORMAT() function. It
takes a string str and a format
string format.
STR_TO_DATE() returns a
DATETIME value if the format
string contains both date and time parts, or a
DATE or
TIME value if the string
contains only date or time parts. If the date, time, or
datetime value extracted from str
is illegal, STR_TO_DATE()
returns NULL and produces a warning.

The server scans str attempting to
match format to it. The format
string can contain literal characters and format specifiers
beginning with %. Literal characters in
format must match literally in
str. Format specifiers in
format must match a date or time
part in str. For the specifiers
that can be used in format, see the
DATE_FORMAT() function
description.

You cannot use format "%X%V" to convert a
year-week string to a date because the combination of a year
and week does not uniquely identify a year and month if the
week crosses a month boundary. To convert a year-week to a
date, you should also specify the weekday:

Returns the current date and time as a value in
'YYYY-MM-DD HH:MM:SS' or
YYYYMMDDHHMMSS format, depending on whether
the function is used in a string or numeric context.

If the fsp argument is given to
specify a fractional seconds precision from 0 to 6, the return
value includes a fractional seconds part of that many digits.

SYSDATE() returns the time at
which it executes. This differs from the behavior for
NOW(), which returns a constant
time that indicates the time at which the statement began to
execute. (Within a stored function or trigger,
NOW() returns the time at which
the function or triggering statement began to execute.)

In addition, the SET TIMESTAMP statement
affects the value returned by
NOW() but not by
SYSDATE(). This means that
timestamp settings in the binary log have no effect on
invocations of SYSDATE().

Because SYSDATE() can return
different values even within the same statement, and is not
affected by SET TIMESTAMP, it is
nondeterministic and therefore unsafe for replication if
statement-based binary logging is used. If that is a problem,
you can use row-based logging.

Alternatively, you can use the
--sysdate-is-now option to
cause SYSDATE() to be an alias
for NOW(). This works if the
option is used on both the master and the slave.

The nondeterministic nature of
SYSDATE() also means that
indexes cannot be used for evaluating expressions that refer
to it.

TIMEDIFF() returns
expr1 −
expr2 expressed as a time value.
expr1 and
expr2 are time or date-and-time
expressions, but both must be of the same type.

The result returned by TIMEDIFF() is
limited to the range allowed for
TIME values. Alternatively, you
can use either of the functions
TIMESTAMPDIFF() and
UNIX_TIMESTAMP(), both of which
return integers.

With a single argument, this function returns the date or
datetime expression expr as a
datetime value. With two arguments, it adds the time
expression expr2 to the date or
datetime expression expr1 and
returns the result as a datetime value.

Adds the integer expression
interval to the date or datetime
expression datetime_expr. The unit
for interval is given by the
unit argument, which should be one
of the following values: MICROSECOND
(microseconds), SECOND,
MINUTE, HOUR,
DAY, WEEK,
MONTH, QUARTER, or
YEAR.

The unit value may be specified
using one of keywords as shown, or with a prefix of
SQL_TSI_. For example,
DAY and SQL_TSI_DAY both
are legal.

Returns datetime_expr2 −
datetime_expr1, where
datetime_expr1 and
datetime_expr2 are date or datetime
expressions. One expression may be a date and the other a
datetime; a date value is treated as a datetime having the
time part '00:00:00' where necessary. The
unit for the result (an integer) is given by the
unit argument. The legal values for
unit are the same as those listed
in the description of the
TIMESTAMPADD() function.

This is used like the
DATE_FORMAT() function, but the
format string may contain format
specifiers only for hours, minutes, seconds, and microseconds.
Other specifiers produce a NULL value or
0.

If the time value contains an hour
part that is greater than 23, the
%H and %k hour format
specifiers produce a value larger than the usual range of
0..23. The other hour format specifiers
produce the hour value modulo 12.

TO_DAYS() is not intended for
use with values that precede the advent of the Gregorian
calendar (1582), because it does not take into account the
days that were lost when the calendar was changed. For dates
before 1582 (and possibly a later year in other locales),
results from this function are not reliable. See
Section 12.8, “What Calendar Is Used By MySQL?”, for details.

Remember that MySQL converts two-digit year values in dates to
four-digit form using the rules in
Section 11.3, “Date and Time Types”. For example,
'2008-10-07' and
'08-10-07' are seen as identical dates:

Like TO_DAYS(),
TO_SECONDS() is not intended for use with
values that precede the advent of the Gregorian calendar
(1582), because it does not take into account the days that
were lost when the calendar was changed. For dates before 1582
(and possibly a later year in other locales), results from
this function are not reliable. See
Section 12.8, “What Calendar Is Used By MySQL?”, for details.

If called with no argument, returns a Unix timestamp (seconds
since '1970-01-01 00:00:00' UTC). The
return value is an integer if no argument is given or the
argument does not include a fractional seconds part, or
DECIMAL if an argument is given
that includes a fractional seconds part.

If UNIX_TIMESTAMP() is called
with a date argument, it returns
the value of the argument as seconds since
'1970-01-01 00:00:00' UTC. The
date argument may be a
DATE,
DATETIME, or
TIMESTAMP string, or a number
in YYMMDD, YYMMDDHHMMSS,
YYYYMMDD, or
YYYYMMDDHHMMSS format. If the argument
includes a time part, it may optionally include a fractional
seconds part. The server interprets
date as a value in the current time
zone and converts it to an internal value in UTC. Clients can
set their time zone as described in
Section 10.6, “MySQL Server Time Zone Support”.

When UNIX_TIMESTAMP() is used
on a TIMESTAMP column, the
function returns the internal timestamp value directly, with
no implicit “string-to-Unix-timestamp”
conversion. If you pass an out-of-range date to
UNIX_TIMESTAMP(), it returns
0. The valid range of values is the same as
for the TIMESTAMP data type:
'1970-01-01 00:00:01.000000' UTC to
'2038-01-19 03:14:07.999999' UTC.

Note: If you use
UNIX_TIMESTAMP() and
FROM_UNIXTIME() to convert
between TIMESTAMP values and
Unix timestamp values, the conversion is lossy because the
mapping is not one-to-one in both directions. For example, due
to conventions for local time zone changes, it is possible for
two UNIX_TIMESTAMP() to map two
TIMESTAMP values to the same
Unix timestamp value.
FROM_UNIXTIME() will map that
value back to only one of the original
TIMESTAMP values. Here is an
example, using TIMESTAMP values
in the CET time zone:

This function returns the week number for
date. The two-argument form of
WEEK() enables you to specify
whether the week starts on Sunday or Monday and whether the
return value should be in the range from 0
to 53 or from 1 to
53. If the mode
argument is omitted, the value of the
default_week_format system
variable is used. See
Section 5.1.5, “Server System Variables”.

The following table describes how the
mode argument works.

Mode

First day of week

Range

Week 1 is the first week …

0

Sunday

0-53

with a Sunday in this year

1

Monday

0-53

with 4 or more days this year

2

Sunday

1-53

with a Sunday in this year

3

Monday

1-53

with 4 or more days this year

4

Sunday

0-53

with 4 or more days this year

5

Monday

0-53

with a Monday in this year

6

Sunday

1-53

with 4 or more days this year

7

Monday

1-53

with a Monday in this year

For mode values with a meaning of
“with 4 or more days this year,” weeks are
numbered according to ISO 8601:1988:

If the week containing January 1 has 4 or more days in the
new year, it is week 1.

Otherwise, it is the last week of the previous year, and
the next week is week 1.

Note that if a date falls in the last week of the previous
year, MySQL returns 0 if you do not use
2, 3,
6, or 7 as the optional
mode argument:

mysql> SELECT YEAR('2000-01-01'), WEEK('2000-01-01',0);
-> 2000, 0

One might argue that WEEK()
should return 52 because the given date
actually occurs in the 52nd week of 1999.
WEEK() returns
0 instead so that the return value is
“the week number in the given year.” This makes
use of the WEEK() function
reliable when combined with other functions that extract a
date part from a date.

If you prefer a result evaluated with respect to the year that
contains the first day of the week for the given date, use
0, 2,
5, or 7 as the optional
mode argument.

Note that the week number is different from what the
WEEK() function would return
(0) for optional arguments
0 or 1, as
WEEK() then returns the week in
the context of the given year.

12.8 What Calendar Is Used By MySQL?

MySQL uses what is known as a proleptic
Gregorian calendar.

Every country that has switched from the Julian to the Gregorian
calendar has had to discard at least ten days during the switch.
To see how this works, consider the month of October 1582, when
the first Julian-to-Gregorian switch occurred.

Monday

Tuesday

Wednesday

Thursday

Friday

Saturday

Sunday

1

2

3

4

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

There are no dates between October 4 and October 15. This
discontinuity is called the
cutover. Any dates before
the cutover are Julian, and any dates following the cutover are
Gregorian. Dates during a cutover are nonexistent.

A calendar applied to dates when it was not actually in use is
called proleptic. Thus, if
we assume there was never a cutover and Gregorian rules always
rule, we have a proleptic Gregorian calendar. This is what is used
by MySQL, as is required by standard SQL. For this reason, dates
prior to the cutover stored as MySQL
DATE or
DATETIME values must be adjusted to
compensate for the difference. It is important to realize that the
cutover did not occur at the same time in all countries, and that
the later it happened, the more days were lost. For example, in
Great Britain, it took place in 1752, when Wednesday September 2
was followed by Thursday September 14. Russia remained on the
Julian calendar until 1918, losing 13 days in the process, and
what is popularly referred to as its “October
Revolution” occurred in November according to the Gregorian
calendar.

For large data sets, it is much faster to load your data into
a table that has no FULLTEXT index and then
create the index after that, than to load data into a table
that has an existing FULLTEXT index.

Full-text searching is performed using
MATCH() ... AGAINST syntax.
MATCH() takes a comma-separated
list that names the columns to be searched.
AGAINST takes a string to search for, and an
optional modifier that indicates what type of search to perform.
The search string must be a string value that is constant during
query evaluation. This rules out, for example, a table column
because that can differ for each row.

There are three types of full-text searches:

A natural language search interprets the search string as a
phrase in natural human language (a phrase in free text).
There are no special operators, with the exception of double
quote (") characters. The stopword list applies. For more
information about stopword lists, see
Section 12.9.4, “Full-Text Stopwords”.

A boolean search interprets the search string using the rules
of a special query language. The string contains the words to
search for. It can also contain operators that specify
requirements such that a word must be present or absent in
matching rows, or that it should be weighted higher or lower
than usual. Certain common words (stopwords) are omitted from
the search index and do not match if present in the search
string. The IN BOOLEAN MODE modifier
specifies a boolean search. For more information, see
Section 12.9.2, “Boolean Full-Text Searches”.

A query expansion search is a modification of a natural
language search. The search string is used to perform a
natural language search. Then words from the most relevant
rows returned by the search are added to the search string and
the search is done again. The query returns the rows from the
second search. The IN NATURAL LANGUAGE MODE WITH
QUERY EXPANSION or WITH QUERY
EXPANSION modifier specifies a query expansion
search. For more information, see
Section 12.9.3, “Full-Text Searches with Query Expansion”.

12.9.1 Natural Language Full-Text Searches

By default or with the IN NATURAL LANGUAGE
MODE modifier, the
MATCH() function performs a
natural language search for a string against a
text collection. A
collection is a set of one or more columns included in a
FULLTEXT index. The search string is given as
the argument to AGAINST(). For each row in
the table, MATCH() returns a
relevance value; that is, a similarity measure between the
search string and the text in that row in the columns named in
the MATCH() list.

By default, the search is performed in case-insensitive fashion.
To perform a case-sensitive full-text search, use a
case-sensitive or binary collation for the indexed columns. For
example, a column that uses the utf8mb4
character set of can be assigned a collation of
utf8mb4_0900_as_cs or
utf8mb4_bin to make it case sensitive for
full-text searches.

When MATCH() is used in a
WHERE clause, as in the example shown
earlier, the rows returned are automatically sorted with the
highest relevance first. Relevance values are nonnegative
floating-point numbers. Zero relevance means no similarity.
Relevance is computed based on the number of words in the row
(document), the number of unique words in the row, the total
number of words in the collection, and the number of rows that
contain a particular word.

Note

The term “document” may be used interchangeably
with the term “row”, and both terms refer to the
indexed part of the row. The term “collection”
refers to the indexed columns and encompasses all rows.

The first query does some extra work (sorting the results by
relevance) but also can use an index lookup based on the
WHERE clause. The index lookup might make the
first query faster if the search matches few rows. The second
query performs a full table scan, which might be faster than the
index lookup if the search term was present in most rows.

For natural-language full-text searches, the columns named in
the MATCH() function must be the
same columns included in some FULLTEXT index
in your table. For the preceding query, note that the columns
named in the MATCH() function
(title and body) are the
same as those named in the definition of the
article table's FULLTEXT
index. To search the title or
body separately, you would create separate
FULLTEXT indexes for each column.

A full-text search that uses an index can name columns only from
a single table in the MATCH()
clause because an index cannot span multiple tables. For
MyISAM tables, a boolean search can be done
in the absence of an index (albeit more slowly), in which case
it is possible to name columns from multiple tables.

The preceding example is a basic illustration that shows how to
use the MATCH() function where
rows are returned in order of decreasing relevance. The next
example shows how to retrieve the relevance values explicitly.
Returned rows are not ordered because the
SELECT statement includes neither
WHERE nor ORDER BY
clauses:

The following example is more complex. The query returns the
relevance values and it also sorts the rows in order of
decreasing relevance. To achieve this result, specify
MATCH() twice: once in the
SELECT list and once in the
WHERE clause. This causes no additional
overhead, because the MySQL optimizer notices that the two
MATCH() calls are identical and
invokes the full-text search code only once.

A phrase that is enclosed within double quote
(") characters matches only rows that contain
the phrase literally, as it was typed. The
full-text engine splits the phrase into words and performs a
search in the FULLTEXT index for the words.
Nonword characters need not be matched exactly: Phrase searching
requires only that matches contain exactly the same words as the
phrase and in the same order. For example, "test
phrase" matches "test, phrase". If
the phrase contains no words that are in the index, the result
is empty. For example, if all words are either stopwords or
shorter than the minimum length of indexed words, the result is
empty.

The MySQL FULLTEXT implementation regards any
sequence of true word characters (letters, digits, and
underscores) as a word. That sequence may also contain
apostrophes ('), but not more than one in a
row. This means that aaa'bbb is regarded as
one word, but aaa''bbb is regarded as two
words. Apostrophes at the beginning or the end of a word are
stripped by the FULLTEXT parser;
'aaa'bbb' would be parsed as
aaa'bbb.

The built-in FULLTEXT parser determines where
words start and end by looking for certain delimiter characters;
for example, (space),
, (comma), and . (period).
If words are not separated by delimiters (as in, for example,
Chinese), the built-in FULLTEXT parser cannot
determine where a word begins or ends. To be able to add words
or other indexed terms in such languages to a
FULLTEXT index that uses the built-in
FULLTEXT parser, you must preprocess them so
that they are separated by some arbitrary delimiter.
Alternatively, you can create FULLTEXT
indexes using the ngram parser plugin (for Chinese, Japanese, or
Korean) or the MeCab parser plugin (for Japanese).

It is possible to write a plugin that replaces the built-in
full-text parser. For details, see Section 28.2, “The MySQL Plugin API”.
For example parser plugin source code, see the
plugin/fulltext directory of a MySQL source
distribution.

Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum
length of words that are found by full-text searches is
three characters for InnoDB search
indexes, or four characters for MyISAM.
You can control the cutoff by setting a configuration option
before creating the index:
innodb_ft_min_token_size
configuration option for InnoDB search
indexes, or ft_min_word_len
for MyISAM.

Note

This behavior does not apply to
FULLTEXT indexes that use the ngram
parser. For the ngram parser, token length is defined by
the ngram_token_size
option.

Words in the stopword list are ignored. A stopword is a word
such as “the” or “some” that is so
common that it is considered to have zero semantic value.
There is a built-in stopword list, but it can be overridden
by a user-defined list. The stopword lists and related
configuration options are different for
InnoDB search indexes and
MyISAM ones. Stopword processing is
controlled by the configuration options
innodb_ft_enable_stopword,
innodb_ft_server_stopword_table,
and
innodb_ft_user_stopword_table
for InnoDB search indexes, and
ft_stopword_file for
MyISAM ones.

Every correct word in the collection and in the query is
weighted according to its significance in the collection or
query. Thus, a word that is present in many documents has a
lower weight, because it has lower semantic value in this
particular collection. Conversely, if the word is rare, it
receives a higher weight. The weights of the words are combined
to compute the relevance of the row. This technique works best
with large collections.

MyISAM Limitation

For very small tables, word distribution does not adequately
reflect their semantic value, and this model may sometimes
produce bizarre results for search indexes on
MyISAM tables. For example, although the
word “MySQL” is present in every row of the
articles table shown earlier, a search for
the word in a MyISAM search index produces
no results:

The search result is empty because the word
“MySQL” is present in at least 50% of the rows,
and so is effectively treated as a stopword. This filtering
technique is more suitable for large data sets, where you
might not want the result set to return every second row from
a 1GB table, than for small data sets where it might cause
poor results for popular terms.

The 50% threshold can surprise you when you first try
full-text searching to see how it works, and makes
InnoDB tables more suited to
experimentation with full-text searches. If you create a
MyISAM table and insert only one or two
rows of text into it, every word in the text occurs in at
least 50% of the rows. As a result, no search returns any
results until the table contains more rows. Users who need to
bypass the 50% limitation can build search indexes on
InnoDB tables, or use the boolean search
mode explained in Section 12.9.2, “Boolean Full-Text Searches”.

12.9.2 Boolean Full-Text Searches

MySQL can perform boolean full-text searches using the
IN BOOLEAN MODE modifier. With this modifier,
certain characters have special meaning at the beginning or end
of words in the search string. In the following query, the
+ and - operators indicate
that a word must be present or absent, respectively, for a match
to occur. Thus, the query retrieves all the rows that contain
the word “MySQL” but that do
not contain the word
“YourSQL”:

In implementing this feature, MySQL uses what is sometimes
referred to as implied Boolean
logic, in which

+ stands for AND

- stands for NOT

[no operator] implies
OR

Boolean full-text searches have these characteristics:

They do not automatically sort rows in order of decreasing
relevance.

InnoDB tables require a
FULLTEXT index on all columns of the
MATCH() expression to perform
boolean queries. Boolean queries against a
MyISAM search index can work even without
a FULLTEXT index, although a search
executed in this fashion would be quite slow.

InnoDB full-text search does not support
the use of multiple operators on a single search word, as in
this example: '++apple'. Use of multiple
operators on a single search word returns a syntax error to
standard out. MyISAM full-text search will successfully
process the same search ignoring all operators except for
the operator immediately adjacent to the search word.

InnoDB full-text search only supports
leading plus or minus signs. For example,
InnoDB supports
'+apple' but does not support
'apple+'. Specifying a trailing plus or
minus sign causes InnoDB to report a
syntax error.

InnoDB full-text search does not support
the use of a leading plus sign with wildcard
('+*'), a plus and minus sign combination
('+-'), or leading a plus and minus sign
combination ('+-apple'). These invalid
queries return a syntax error.

InnoDB full-text search does not support
the use of the @ symbol in boolean
full-text searches. The @ symbol is
reserved for use by the @distance
proximity search operator.

They do not use the 50% threshold that applies to
MyISAM search indexes.

The boolean full-text search capability supports the following
operators:

+

A leading or trailing plus sign indicates that this word
must be present in each row that is
returned. InnoDB only supports leading
plus signs.

-

A leading or trailing minus sign indicates that this word
must not be present in any of the rows
that are returned. InnoDB only supports
leading minus signs.

Note: The - operator acts only to exclude
rows that are otherwise matched by other search terms. Thus,
a boolean-mode search that contains only terms preceded by
- returns an empty result. It does not
return “all rows except those containing any of the
excluded terms.”

(no operator)

By default (when neither + nor
- is specified), the word is optional,
but the rows that contain it are rated higher. This mimics
the behavior of MATCH() ...
AGAINST() without the IN BOOLEAN
MODE modifier.

@distance

This operator works on InnoDB tables
only. It tests whether two or more words all start within a
specified distance from each other, measured in words.
Specify the search words within a double-quoted string
immediately before the
@distance
operator, for example, MATCH(col1) AGAINST('"word1
word2 word3" @8' IN BOOLEAN MODE)

> <

These two operators are used to change a word's contribution
to the relevance value that is assigned to a row. The
> operator increases the contribution
and the < operator decreases it. See
the example following this list.

( )

Parentheses group words into subexpressions. Parenthesized
groups can be nested.

~

A leading tilde acts as a negation operator, causing the
word's contribution to the row's relevance to be negative.
This is useful for marking “noise” words. A row
containing such a word is rated lower than others, but is
not excluded altogether, as it would be with the
- operator.

*

The asterisk serves as the truncation (or wildcard)
operator. Unlike the other operators, it is
appended to the word to be affected.
Words match if they begin with the word preceding the
* operator.

If a word is specified with the truncation operator, it is
not stripped from a boolean query, even if it is too short
or a stopword. Whether a word is too short is determined
from the
innodb_ft_min_token_size
setting for InnoDB tables, or
ft_min_word_len for
MyISAM tables. These options are not
applicable to FULLTEXT indexes that use
the ngram parser.

The wildcarded word is considered as a prefix that must be
present at the start of one or more words. If the minimum
word length is 4, a search for
'+word +the*'
could return fewer rows than a search for
'+word +the',
because the second query ignores the too-short search term
the.

"

A phrase that is enclosed within double quote
(") characters matches only rows that
contain the phrase literally, as it was
typed. The full-text engine splits the phrase
into words and performs a search in the
FULLTEXT index for the words. Nonword
characters need not be matched exactly: Phrase searching
requires only that matches contain exactly the same words as
the phrase and in the same order. For example,
"test phrase" matches "test,
phrase".

If the phrase contains no words that are in the index, the
result is empty. The words might not be in the index because
of a combination of factors: if they do not exist in the
text, are stopwords, or are shorter than the minimum length
of indexed words.

The following examples demonstrate some search strings that use
boolean full-text operators:

'apple banana'

Find rows that contain at least one of the two words.

'+apple +juice'

Find rows that contain both words.

'+apple macintosh'

Find rows that contain the word “apple”, but
rank rows higher if they also contain
“macintosh”.

'+apple -macintosh'

Find rows that contain the word “apple” but not
“macintosh”.

'+apple ~macintosh'

Find rows that contain the word “apple”, but if
the row also contains the word “macintosh”,
rate it lower than if row does not. This is
“softer” than a search for '+apple
-macintosh', for which the presence of
“macintosh” causes the row not to be returned
at all.

'+apple +(>turnover <strudel)'

Find rows that contain the words “apple” and
“turnover”, or “apple” and
“strudel” (in any order), but rank “apple
turnover” higher than “apple strudel”.

'apple*'

Find rows that contain words such as “apple”,
“apples”, “applesauce”, or
“applet”.

'"some words"'

Find rows that contain the exact phrase “some
words” (for example, rows that contain “some
words of wisdom” but not “some noise
words”). Note that the "
characters that enclose the phrase are operator characters
that delimit the phrase. They are not the quotation marks
that enclose the search string itself.

Relevancy Rankings for InnoDB Boolean Mode Search

InnoDB full-text search is
modeled on the
Sphinx full-text
search engine, and the algorithms used are based on
BM25
and
TF-IDF
ranking algorithms. For these reasons, relevancy rankings for
InnoDB boolean full-text search may differ
from MyISAM relevancy rankings.

InnoDB uses a variation of the “term
frequency-inverse document frequency”
(TF-IDF) weighting system to rank a
document's relevance for a given full-text search query. The
TF-IDF weighting is based on how frequently
a word appears in a document, offset by how frequently the
word appears in all documents in the collection. In other
words, the more frequently a word appears in a document, and
the less frequently the word appears in the document
collection, the higher the document is ranked.

How Relevancy Ranking is Calculated

The term frequency (TF) value is the number
of times that a word appears in a document. The inverse
document frequency (IDF) value of a word is
calculated using the following formula, where
total_records is the number of records in
the collection, and matching_records is the
number of records that the search term appears in.

${IDF} = log10( ${total_records} / ${matching_records} )

When a document contains a word multiple times, the IDF value
is multiplied by the TF value:

${TF} * ${IDF}

Using the TF and IDF
values, the relevancy ranking for a document is calculated
using this formula:

${rank} = ${TF} * ${IDF} * ${IDF}

The formula is demonstrated in the following examples.

Relevancy Ranking for a Single Word Search

This example demonstrates the relevancy ranking calculation
for a single-word search.

There are 8 records in total, with 3 that match the
“database” search term. The first record
(id 6) contains the search term 6 times and
has a relevancy ranking of
1.0886961221694946. This ranking value is
calculated using a TF value of 6 (the
“database” search term appears 6 times in record
id 6) and an IDF value
of 0.42596873216370745, which is calculated as follows (where
8 is the total number of records and 3 is the number of
records that the search term appears in):

${IDF} = log10( 8 / 3 ) = 0.42596873216370745

The TF and IDF values
are then entered into the ranking formula:

${rank} = ${TF} * ${IDF} * ${IDF}

Performing the calculation in the MySQL command-line client
returns a ranking value of 1.088696164686938.

You may notice a slight difference in the ranking values
returned by the SELECT ... MATCH ...
AGAINST statement and the MySQL command-line
client (1.0886961221694946 versus
1.088696164686938). The difference is due
to how the casts between integers and floats/doubles are
performed internally by InnoDB (along
with related precision and rounding decisions), and how they
are performed elsewhere, such as in the MySQL command-line
client or other types of calculators.

Relevancy Ranking for a Multiple Word Search

This example demonstrates the relevancy ranking calculation
for a multiple-word full-text search based on the
articles table and data used in the
previous example.

If you search on more than one word, the relevancy ranking
value is a sum of the relevancy ranking value for each word,
as shown in this formula:

${rank} = ${TF} * ${IDF} * ${IDF} + ${TF} * ${IDF} * ${IDF}

Performing a search on two terms ('mysql tutorial') returns
the following results:

In the first record (id 8), 'mysql' appears
once and 'tutorial' appears twice. There are six matching
records for 'mysql' and two matching records for 'tutorial'.
The MySQL command-line client returns the expected ranking
value when inserting these values into the ranking formula for
a multiple word search:

The slight difference in the ranking values returned by the
SELECT ... MATCH ... AGAINST statement
and the MySQL command-line client is explained in the
preceding example.

12.9.3 Full-Text Searches with Query Expansion

Full-text search supports query expansion (and in particular,
its variant “blind query expansion”). This is
generally useful when a search phrase is too short, which often
means that the user is relying on implied knowledge that the
full-text search engine lacks. For example, a user searching for
“database” may really mean that
“MySQL”, “Oracle”, “DB2”,
and “RDBMS” all are phrases that should match
“databases” and should be returned, too. This is
implied knowledge.

Blind query expansion (also known as automatic relevance
feedback) is enabled by adding WITH QUERY
EXPANSION or IN NATURAL LANGUAGE MODE WITH
QUERY EXPANSION following the search phrase. It works
by performing the search twice, where the search phrase for the
second search is the original search phrase concatenated with
the few most highly relevant documents from the first search.
Thus, if one of these documents contains the word
“databases” and the word “MySQL”, the
second search finds the documents that contain the word
“MySQL” even if they do not contain the word
“database”. The following example shows this
difference:

Another example could be searching for books by Georges Simenon
about Maigret, when a user is not sure how to spell
“Maigret”. A search for “Megre and the
reluctant witnesses” finds only “Maigret and the
Reluctant Witnesses” without query expansion. A search
with query expansion finds all books with the word
“Maigret” on the second pass.

Note

Because blind query expansion tends to increase noise
significantly by returning nonrelevant documents, use it only
when a search phrase is short.

12.9.4 Full-Text Stopwords

The stopword list is loaded and searched for full-text queries
using the server character set and collation (the values of the
character_set_server and
collation_server system
variables). False hits or misses might occur for stopword
lookups if the stopword file or columns used for full-text
indexing or searches have a character set or collation different
from character_set_server or
collation_server.

Case sensitivity of stopword lookups depends on the server
collation. For example, lookups are case insensitive if the
collation is utf8mb4_0900_ai_ci, whereas
lookups are case sensitive if the collation is
utf8mb4_0900_as_cs or
utf8mb4_bin.

Stopwords for InnoDB Search Indexes

InnoDB has a relatively short list of default
stopwords, because documents from technical, literary, and other
sources often use short words as keywords or in significant
phrases. For example, you might search for “to be or not
to be” and expect to get a sensible result, rather than
having all those words ignored.

To define your own stopword list for all
InnoDB tables, define a table with the same
structure as the
INNODB_FT_DEFAULT_STOPWORD table,
populate it with stopwords, and set the value of the
innodb_ft_server_stopword_table
option to a value in the form
db_name/table_name
before creating the full-text index. The stopword table must
have a single VARCHAR column
named value. The following example
demonstrates creating and configuring a new global stopword
table for InnoDB.

By default, words less than 3 characters in length or greater
than 84 characters in length do not appear in an
InnoDB full-text search index. Maximum and
minimum word length values are configurable using the
innodb_ft_max_token_size and
innodb_ft_min_token_size
variables. This default behavior does not apply to the ngram
parser plugin. ngram token size is defined by the
ngram_token_size option.

To create stopword lists on a table-by-table basis, create other
stopword tables and use the
innodb_ft_user_stopword_table
option to specify the stopword table that you want to use before
you create the full-text index.

Stopwords for MyISAM Search Indexes

The stopword file is loaded and searched using
latin1 if
character_set_server is
ucs2, utf16,
utf16le, or utf32.

To override the default stopword list for MyISAM tables, set the
ft_stopword_file system
variable. (See Section 5.1.5, “Server System Variables”.) The
variable value should be the path name of the file containing
the stopword list, or the empty string to disable stopword
filtering. The server looks for the file in the data directory
unless an absolute path name is given to specify a different
directory. After changing the value of this variable or the
contents of the stopword file, restart the server and rebuild
your FULLTEXT indexes.

The stopword list is free-form, separating stopwords with any
nonalphanumeric character such as newline, space, or comma.
Exceptions are the underscore character (_)
and a single apostrophe (') which are treated
as part of a word. The character set of the stopword list is the
server's default character set; see
Section 10.1.3.2, “Server Character Set and Collation”.

The following table shows the default list of stopwords for
MyISAM search indexes. In a MySQL source
distribution, you can find this list in the
storage/myisam/ft_static.c file.

a's

able

about

above

according

accordingly

across

actually

after

afterwards

again

against

ain't

all

allow

allows

almost

alone

along

already

also

although

always

am

among

amongst

an

and

another

any

anybody

anyhow

anyone

anything

anyway

anyways

anywhere

apart

appear

appreciate

appropriate

are

aren't

around

as

aside

ask

asking

associated

at

available

away

awfully

be

became

because

become

becomes

becoming

been

before

beforehand

behind

being

believe

below

beside

besides

best

better

between

beyond

both

brief

but

by

c'mon

c's

came

can

can't

cannot

cant

cause

causes

certain

certainly

changes

clearly

co

com

come

comes

concerning

consequently

consider

considering

contain

containing

contains

corresponding

could

couldn't

course

currently

definitely

described

despite

did

didn't

different

do

does

doesn't

doing

don't

done

down

downwards

during

each

edu

eg

eight

either

else

elsewhere

enough

entirely

especially

et

etc

even

ever

every

everybody

everyone

everything

everywhere

ex

exactly

example

except

far

few

fifth

first

five

followed

following

follows

for

former

formerly

forth

four

from

further

furthermore

get

gets

getting

given

gives

go

goes

going

gone

got

gotten

greetings

had

hadn't

happens

hardly

has

hasn't

have

haven't

having

he

he's

hello

help

hence

her

here

here's

hereafter

hereby

herein

hereupon

hers

herself

hi

him

himself

his

hither

hopefully

how

howbeit

however

i'd

i'll

i'm

i've

ie

if

ignored

immediate

in

inasmuch

inc

indeed

indicate

indicated

indicates

inner

insofar

instead

into

inward

is

isn't

it

it'd

it'll

it's

its

itself

just

keep

keeps

kept

know

known

knows

last

lately

later

latter

latterly

least

less

lest

let

let's

like

liked

likely

little

look

looking

looks

ltd

mainly

many

may

maybe

me

mean

meanwhile

merely

might

more

moreover

most

mostly

much

must

my

myself

name

namely

nd

near

nearly

necessary

need

needs

neither

never

nevertheless

new

next

nine

no

nobody

non

none

noone

nor

normally

not

nothing

novel

now

nowhere

obviously

of

off

often

oh

ok

okay

old

on

once

one

ones

only

onto

or

other

others

otherwise

ought

our

ours

ourselves

out

outside

over

overall

own

particular

particularly

per

perhaps

placed

please

plus

possible

presumably

probably

provides

que

quite

qv

rather

rd

re

really

reasonably

regarding

regardless

regards

relatively

respectively

right

said

same

saw

say

saying

says

second

secondly

see

seeing

seem

seemed

seeming

seems

seen

self

selves

sensible

sent

serious

seriously

seven

several

shall

she

should

shouldn't

since

six

so

some

somebody

somehow

someone

something

sometime

sometimes

somewhat

somewhere

soon

sorry

specified

specify

specifying

still

sub

such

sup

sure

t's

take

taken

tell

tends

th

than

thank

thanks

thanx

that

that's

thats

the

their

theirs

them

themselves

then

thence

there

there's

thereafter

thereby

therefore

therein

theres

thereupon

these

they

they'd

they'll

they're

they've

think

third

this

thorough

thoroughly

those

though

three

through

throughout

thru

thus

to

together

too

took

toward

towards

tried

tries

truly

try

trying

twice

two

un

under

unfortunately

unless

unlikely

until

unto

up

upon

us

use

used

useful

uses

using

usually

value

various

very

via

viz

vs

want

wants

was

wasn't

way

we

we'd

we'll

we're

we've

welcome

well

went

were

weren't

what

what's

whatever

when

whence

whenever

where

where's

whereafter

whereas

whereby

wherein

whereupon

wherever

whether

which

while

whither

who

who's

whoever

whole

whom

whose

why

will

willing

wish

with

within

without

won't

wonder

would

wouldn't

yes

yet

you

you'd

you'll

you're

you've

your

yours

yourself

yourselves

zero

12.9.5 Full-Text Restrictions

Full-text searches can be used with most multibyte character
sets. The exception is that for Unicode, the
utf8 character set can be used, but not
the ucs2 character set. Although
FULLTEXT indexes on
ucs2 columns cannot be used, you can
perform IN BOOLEAN MODE searches on a
ucs2 column that has no such index.

The remarks for utf8 also apply to
utf8mb4, and the remarks for
ucs2 also apply to
utf16, utf16le, and
utf32.

Ideographic languages such as Chinese and Japanese do not
have word delimiters. Therefore, the built-in full-text
parser cannot determine where words begin and end
in these and other such languages.

A character-based ngram full-text parser that supports
Chinese, Japanese, and Korean (CJK), and a word-based MeCab
parser plugin that supports Japanese are provided for use
with InnoDB and MyISAM
tables.

Although the use of multiple character sets within a single
table is supported, all columns in a
FULLTEXT index must use the same
character set and collation.

The MATCH() column list must
match exactly the column list in some
FULLTEXT index definition for the table,
unless this MATCH() is
IN BOOLEAN MODE on a
MyISAM table. For
MyISAM tables, boolean-mode searches can
be done on nonindexed columns, although they are likely to
be slow.

The argument to AGAINST() must be a
string value that is constant during query evaluation. This
rules out, for example, a table column because that can
differ for each row.

For InnoDB, all DML operations
(INSERT,
UPDATE,
DELETE) involving columns
with full-text indexes are processed at transaction commit
time. For example, for an INSERT
operation, an inserted string is tokenized and decomposed
into individual words. The individual words are then added
to full-text index tables when the transaction is committed.
As a result, full-text searches only return committed data.

The '%' character is not a supported wildcard character for
full-text searches.

12.9.6 Fine-Tuning MySQL Full-Text Search

MySQL's full-text search capability has few user-tunable
parameters. You can exert more control over full-text searching
behavior if you have a MySQL source distribution because some
changes require source code modifications. See
Section 2.8, “Installing MySQL from Source”.

Full-text search is carefully tuned for effectiveness. Modifying
the default behavior in most cases can actually decrease
effectiveness. Do not alter the MySQL sources unless
you know what you are doing.

Most full-text variables described in this section must be set
at server startup time. A server restart is required to change
them; they cannot be modified while the server is running.

Some variable changes require that you rebuild the
FULLTEXT indexes in your tables. Instructions
for doing so are given later in this section.

Minimum and maximum word length full-text parameters do not
apply to FULLTEXT indexes created using
the ngram parser. ngram token size is defined by the
ngram_token_size option.

After changing any of these options, rebuild your
FULLTEXT indexes for the change to take
effect. For example, to make two-character words searchable,
you could put the following lines in an option file:

[mysqld]
innodb_ft_min_token_size=2
ft_min_word_len=2

Then restart the server and rebuild your
FULLTEXT indexes. For
MyISAM tables, note the remarks regarding
myisamchk in the instructions that follow
for rebuilding MyISAM full-text indexes.

Configuring the Natural Language Search Threshold

For MyISAM search indexes, the 50%
threshold for natural language searches is determined by the
particular weighting scheme chosen. To disable it, look for
the following line in
storage/myisam/ftdefs.h:

#define GWS_IN_USE GWS_PROB

Change that line to this:

#define GWS_IN_USE GWS_FREQ

Then recompile MySQL. There is no need to rebuild the indexes
in this case.

Note

By making this change, you severely
decrease MySQL's ability to provide adequate relevance
values for the MATCH()
function. If you really need to search for such common
words, it would be better to search using IN
BOOLEAN MODE instead, which does not observe the
50% threshold.

Modifying Boolean Full-Text Search Operators

To change the operators used for boolean full-text searches on
MyISAM tables, set the
ft_boolean_syntax system
variable. (InnoDB does not have an
equivalent setting.) This variable can be changed while the
server is running, but you must have the
SYSTEM_VARIABLES_ADMIN or
SUPER privilege to do so. No
rebuilding of indexes is necessary in this case. See
Section 5.1.5, “Server System Variables”, which describes the
rules governing how to set this variable.

Character Set Modifications

For the built-in full-text parser, you can change the set of
characters that are considered word characters in several
ways, as described in the following list. After making the
modification, rebuild the indexes for each table that contains
any FULLTEXT indexes. Suppose that you want
to treat the hyphen character ('-') as a word character. Use
one of these methods:

Modify the MySQL source: In
storage/innobase/handler/ha_innodb.cc
(for InnoDB), or in
storage/myisam/ftdefs.h (for
MyISAM), see the
true_word_char() and
misc_word_char() macros. Add
'-' to one of those macros and
recompile MySQL.

Modify a character set file: This requires no
recompilation. The true_word_char()
macro uses a “character type” table to
distinguish letters and numbers from other characters. .
You can edit the contents of the
<ctype><map> array in one
of the character set XML files to specify that
'-' is a “letter.” Then
use the given character set for your
FULLTEXT indexes. For information about
the <ctype><map> array
format, see Section 10.3.1, “Character Definition Arrays”.

To avoid lengthy rebuild times for full-text indexes on large
tables, you can use the
innodb_ft_num_word_optimize
option to perform the optimization in stages. The
innodb_ft_num_word_optimize option defines
the number of words that are optimized each time
OPTIMIZE TABLE is run. The
default setting is 2000, which means that 2000 words are
optimized each time OPTIMIZE
TABLE is run. Subsequent
OPTIMIZE TABLE operations
continue from where the preceding
OPTIMIZE TABLE operation ended.

Rebuilding MyISAM Full-Text Indexes

If you modify full-text variables that affect indexing
(ft_min_word_len,
ft_max_word_len, or
ft_stopword_file), or if you
change the stopword file itself, you must rebuild your
FULLTEXT indexes after making the changes
and restarting the server.

To rebuild the FULLTEXT indexes for a
MyISAM table, it is sufficient to do a
QUICK repair operation:

mysql> REPAIR TABLE tbl_name QUICK;

Alternatively, use ALTER TABLE
as just described. In some cases, this may be faster than a
repair operation.

Each table that contains any FULLTEXT index
must be repaired as just shown. Otherwise, queries for the
table may yield incorrect results, and modifications to the
table will cause the server to see the table as corrupt and in
need of repair.

If you use myisamchk to perform an
operation that modifies MyISAM table
indexes (such as repair or analyze), the
FULLTEXT indexes are rebuilt using the
default full-text parameter values for
minimum word length, maximum word length, and stopword file
unless you specify otherwise. This can result in queries
failing.

The problem occurs because these parameters are known only by
the server. They are not stored in MyISAM
index files. To avoid the problem if you have modified the
minimum or maximum word length or stopword file values used by
the server, specify the same
ft_min_word_len,
ft_max_word_len, and
ft_stopword_file values for
myisamchk that you use for
mysqld. For example, if you have set the
minimum word length to 3, you can repair a table with
myisamchk like this:

shell> myisamchk --recover --ft_min_word_len=3 tbl_name.MYI

To ensure that myisamchk and the server use
the same values for full-text parameters, place each one in
both the [mysqld] and
[myisamchk] sections of an option file:

12.9.7 Adding a Collation for Full-Text Indexing

This section describes how to add a new collation for full-text
searches using the built-in full-text parser. The sample
collation is like latin1_swedish_ci but
treats the '-' character as a letter rather
than as a punctuation character so that it can be indexed as a
word character. General information about adding collations is
given in Section 10.4, “Adding a Collation to a Character Set”; it is assumed that
you have read it and are familiar with the files involved.

To add a collation for full-text indexing, use the following
procedure. The instructions here add a collation for a simple
character set, which as discussed in
Section 10.4, “Adding a Collation to a Character Set”, can be created using a
configuration file that describes the character set properties.
For a complex character set such as Unicode, create collations
using C source files that describe the character set properties.

Add a collation to the Index.xml file.
The collation ID must be unused, so choose a value different
from 1000 if that ID is already taken on your system.

Modify the ctype array in
latin1.xml. Change the value
corresponding to 0x2D (which is the code for the
'-' character) from 10 (punctuation) to
01 (small letter). In the following array, this is the
element in the fourth row down, third value from the end.

12.9.8 ngram Full-Text Parser

The built-in MySQL full-text parser uses the white space between
words as a delimiter to determine where words begin and end,
which is a limitation when working with ideographic languages
that do not use word delimiters. To address this limitation,
MySQL provides an ngram full-text parser that supports Chinese,
Japanese, and Korean (CJK). The ngram full-text parser is
supported for use with InnoDB and
MyISAM.

An ngram is a contiguous sequence of
n characters from a given sequence of
text. The ngram parser tokenizes a sequence of text into a
contiguous sequence of n characters.
For example, you can tokenize “abcd” for different
values of n using the ngram full-text
parser.

Configuring ngram Token Size

The ngram parser has a default ngram token size of 2 (bigram).
For example, with a token size of 2, the ngram parser parses the
string “abc def” into four tokens:
“ab”, “bc”, “de” and
“ef”.

ngram token size is configurable using the
ngram_token_size configuration
option, which has a minimum value of 1 and maximum value of 10.

Typically, ngram_token_size is
set to the size of the largest token that you want to search
for. If you only intend to search for single characters, set
ngram_token_size to 1. A
smaller token size produces a smaller full-text search index,
and faster searches. If you need to search for words comprised
of more than one character, set
ngram_token_size accordingly.
For example, “Happy Birthday” is
“生日快乐” in simplified Chinese, where
“生日” is “birthday”, and
“快乐” translates as “happy”. To
search on two-character words such as these, set
ngram_token_size to a value of
2 or higher.

As a read-only variable,
ngram_token_size may only be
set as part of a startup string or in a configuration file:

Creating a FULLTEXT Index that Uses the ngram Parser

The following example demonstrates creating a table with an
ngramFULLTEXT index,
inserting sample data (Simplified Chinese text), and viewing
tokenized data in the
INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE
table.

ngram Parser Space Handling

The ngram parser eliminates spaces when parsing. For example:

“ab cd” is parsed to “ab”,
“cd”

“a bc” is parsed to “bc”

ngram Parser Stopword Handling

The built-in MySQL full-text parser compares words to entries in
the stopword list. If a word is equal to an entry in the
stopword list, the word is excluded from the index. For the
ngram parser, stopword handling is performed differently.
Instead of excluding tokens that are equal to entries in the
stopword list, the ngram parser excludes tokens that
contain stopwords. For example, assuming
ngram_token_size=2, a document
that contains “a,b” is parsed to “a,”
and “,b”. If a comma (“,”) is defined
as a stopword, both “a,” and “,b” are
excluded from the index because they contain a comma.

By default, the ngram parser uses the default stopword list,
which contains a list of English stopwords. For a stopword list
applicable to Chinese, Japanese, or Korean, you must create your
own. For information about creating a stopword list, see
Section 12.9.4, “Full-Text Stopwords”.

ngram Parser Term Search

For natural language mode search, the
search term is converted to a union of ngram terms. For example,
the string “abc” (assuming
ngram_token_size=2) is
converted to “ab bc”. Given two documents, one
containing “ab” and the other containing
“abc”, the search term “ab bc” matches
both documents.

For boolean mode search, the search term is
converted to an ngram phrase search. For example, the string
'abc' (assuming
ngram_token_size=2) is
converted to '“ab bc”'. Given two documents, one
containing 'ab' and the other containing 'abc', the search
phrase '“ab bc”' only matches the document
containing 'abc'.

ngram Parser Wildcard Search

Because an ngram FULLTEXT index contains only
ngrams, and does not contain information about the beginning of
terms, wildcard searches may return unexpected results. The
following behaviors apply to wildcard searches using ngram
FULLTEXT search indexes:

If the prefix term of a wildcard search is shorter than
ngram token size, the query returns all indexed rows that
contain ngram tokens starting with the prefix term. For
example, assuming
ngram_token_size=2, a
search on “a*” returns all rows starting with
“a”.

If the prefix term of a wildcard search is longer than ngram
token size, the prefix term is converted to an ngram phrase
and the wildcard operator is ignored. For example, assuming
ngram_token_size=2, an
“abc*” wildcard search is converted to
“ab bc”.

ngram Parser Phrase Search

The search phrase “abc def” is converted to
“ab bc de ef”, which returns documents containing
“abc def” and “ab bc de ef”. A
document that contains “abcdef” is not returned.

12.9.9 MeCab Full-Text Parser Plugin

The built-in MySQL full-text parser uses the white space between
words as a delimiter to determine where words begin and end,
which is a limitation when working with ideographic languages
that do not use word delimiters. To address this limitation for
Japanese, MySQL provides a MeCab full-text parser plugin. The
MeCab full-text parser plugin is supported for use with
InnoDB and
MyISAM.

The MeCab full-text parser plugin is a full-text parser plugin
for Japanese that tokenizes a sequence of text into meaningful
words. For example, MeCab tokenizes
“データベース管理” (“Database
Management”) into “データベース”
(“Database”) and “管理”
(“Management”). By comparison, the
ngram full-text
parser tokenizes text into a contiguous sequence of
n characters, where
n represents a number between 1 and
10.

In addition to tokenizing text into meaningful words, MeCab
indexes are typically smaller than ngram indexes, and MeCab
full-text searches are generally faster. One drawback is that it
may take longer for the MeCab full-text parser to tokenize
documents, compared to the ngram full-text parser.

The full-text search syntax described in
Section 12.9, “Full-Text Search Functions” applies to the MeCab parser
plugin. Differences in parsing behavior are described in this
section. Full-text related configuration options are also
applicable.

Installing the MeCab Parser Plugin

The MeCab parser plugin requires mecab and
mecab-ipadic.

On supported Fedora, Debian and Ubuntu platforms (except Ubuntu
12.04 where the system mecab version is too
old), MySQL dynamically links to the system
mecab installation if it is installed to
the default location. On other supported Unix-like platforms,
libmecab.so is statically linked in
libpluginmecab.so, which is located in the
MySQL plugin directory. mecab-ipadic is
included in MySQL binaries and is located in
MYSQL_HOME\lib\mecab.

On Windows, libmecab.dll is found in the
MySQL bin directory.
mecab-ipadic is located in
MYSQL_HOME/lib/mecab.

To install and configure the MeCab parser plugin, perform the
following steps:

In the MySQL configuration file, set the
mecab_rc_file configuration
option to the location of the mecabrc
configuration file, which is the configuration file for
MeCab. If you are using the MeCab package distributed with
MySQL, the mecabrc file is located in
MYSQL_HOME/lib/mecab/etc/.

[mysqld]
loose-mecab-rc-file=MYSQL_HOME/lib/mecab/etc/mecabrc

The loose prefix is an
option modifier. The
mecab_rc_file option is not
recognized by MySQL until the MeCaB parser plugin is
installed but it must be set before attempting to install
the MeCaB parser plugin. The loose prefix
allows you restart MySQL without encountering an error due
to an unrecognized variable.

If you use your own MeCab installation, or build MeCab from
source, the location of the mecabrc
configuration file may differ.

Also in the MySQL configuration file, set the minimum token
size to 1 or 2, which are the values recommended for use
with the MeCab parser. For InnoDB tables,
minimum token size is defined by the
innodb_ft_min_token_size
configuration option, which has a default value of 3. For
MyISAM tables, minimum token size is
defined by ft_min_word_len,
which has a default value of 4.

[mysqld]
innodb_ft_min_token_size=1

Modify the mecabrc configuration file
to specify the dictionary you want to use. The
mecab-ipadic package distributed with
MySQL binaries includes three dictionaries
(ipadic_euc-jp,
ipadic_sjis, and
ipadic_utf-8). The
mecabrc configuration file packaged
with MySQL contains and entry similar to the following:

dicdir = /path/to/mysql/lib/mecab/lib/mecab/dic/ipadic_euc-jp

To use the ipadic_utf-8 dictionary, for
example, modify the entry as follows:

dicdir=MYSQL_HOME/lib/mecab/dic/ipadic_utf-8

If you are using your own MeCab installation or have built
MeCab from source, the default dicdir
entry in the mecabrc file will differ,
as will the dictionaries and their location.

Note

After the MeCab parser plugin is installed, you can use
the mecab_charset status
variable to view the character set used with MeCab. The
three MeCab dictionaries provided with the MySQL binary
support the following character sets.

The ipadic_euc-jp dictionary
supports the ujis and
eucjpms character sets.

The ipadic_sjis dictionary supports
the sjis and
cp932 character sets.

The ipadic_utf-8 dictionary
supports the utf8 and
utf8mb4 character sets.

mecab_charset only
reports the first supported character set. For example,
the ipadic_utf-8 dictionary supports
both utf8 and
utf8mb4.
mecab_charset always
reports utf8 when this dictionary is in
use.

MeCab Parser Space Handling

The MeCab parser uses spaces as separators in query strings. For
example, the MeCab parser tokenizes 'データベース管理'
as 'データベース' and '管理'.

MeCab Parser Stopword Handling

By default, the MeCab parser uses the default stopword list,
which contains a short list of English stopwords. For a stopword
list applicable to Japanese, you must create your own. For
information about creating stopword lists, see
Section 12.9.4, “Full-Text Stopwords”.

MeCab Parser Term Search

For natural language mode search, the search term is converted
to a union of tokens. For example, 'データベース管理' is
converted to 'データベース 管理' .

For boolean mode search, the search term is converted to a
search phrase. For example, 'データベース管理' is
converted to '"データベース 管理"'.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース管理' IN BOOLEAN MODE);

MeCab Parser Wildcard Search

Wildcard search terms are not tokenized. A search on
'データベース管理*' is performed on the prefix,
'データベース管理'.

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('データベース*' IN BOOLEAN MODE);

MeCab Parser Phrase Search

Phrases are tokenized. For example, "データベース管理"
is tokenized as "データベース 管理".

SELECT COUNT(*) FROM articles WHERE MATCH(title,body) AGAINST('"データベース管理"' IN BOOLEAN MODE);

Installing MeCab From a Binary Distribution (Optional)

This section describes how to install mecab
and mecab-ipadic from a binary distribution
using a native package management utility. For example, on
Fedora, you can use Yum to perform the installation:

yum mecab-devel

On Debian or Ubuntu, you can perform an APT installation:

apt-get install mecab
apt-get install mecab-ipadic

Installing MeCab From Source (Optional)

If you want to build mecab and
mecab-ipadic from source, basic
installation steps are provided below. For additional
information, refer to the MeCab documentation.

Download the tar.gz packages for mecab
and mecab-ipadic from
http://taku910.github.io/mecab/#download. As
of February, 2016, the latest available packages are
mecab-0.996.tar.gz and
mecab-ipadic-2.7.0-20070801.tar.gz.

A COLLATE clause is not permitted within a
CONVERT() or
CAST() call, but you can apply it
to the function result. For example, this is legal:

SELECT CAST('test' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin;

But this is illegal:

SELECT CAST('test' AS CHAR CHARACTER SET utf8 COLLATE utf8_bin);

Normally, you cannot compare a BLOB
value or other binary string in case-insensitive fashion because
binary strings use the binary character set,
which has no collation with the concept of lettercase. To perform
a case-insensitive comparison, use the
CONVERT() or
CAST() function to convert the
value to a nonbinary string. Comparisons of the resulting string
use its collation. For example, if the conversion result character
set has a case-insensitive collation, a
LIKE operation is not case sensitive:

SELECT 'A' LIKE CONVERT(blob_col USING latin1)
FROM tbl_name;

To use a different character set, substitute its name for
latin1 in the preceding statement. To specify a
particular collation for the converted string, use a
COLLATE clause following the
CONVERT() call:

SELECT 'A' LIKE CONVERT(blob_col USING latin1) COLLATE latin1_german1_ci
FROM tbl_name;

CONVERT() and
CAST() can be used more generally
for comparing strings that are represented in different character
sets. For example, a comparison of these strings results in an
error because they have different character sets:

For string literals, another way to specify the character set is
to use a character set introducer (_latin1 and
_latin2 in the preceding example are instances
of introducers). Unlike conversion functions such as
CAST(), or
CONVERT(), which convert a string
from one character set to another, an introducer designates a
string literal as having a particular character set, with no
conversion involved. For more information, see
Section 10.1.3.8, “Character Set Introducers”.

Character set conversion is also useful preceding lettercase
conversion of binary strings.
LOWER() and
UPPER() are ineffective when
applied directly to binary strings because the concept of
lettercase does not apply. To perform lettercase conversion of a
binary string, first convert it to a nonbinary string:

The cast functions are useful for sorting
ENUM columns in lexical order.
Normally, sorting of ENUM columns
occurs using the internal numeric values. Casting the values to
CHAR results in a lexical sort:

MySQL supports arithmetic with both signed and unsigned 64-bit
values. For numeric operators (such as
+ or
-) where one of the
operands is an unsigned integer, the result is unsigned by default
(see Section 12.6.1, “Arithmetic Operators”). To override this,
use the SIGNED or UNSIGNED
cast operator to cast a value to a signed or unsigned 64-bit
integer, respectively.

If either operand is a floating-point value, the result is a
floating-point value and is not affected by the preceding rule.
(In this context, DECIMAL column
values are regarded as floating-point values.)

The BINARY operator converts the
expression to a binary string. A common use for
BINARY is to force a character
string comparison to be done byte by byte rather than
character by character, in effect becoming case sensitive. The
BINARY operator also causes
trailing spaces in comparisons to be significant.

The BINARY operator in
expressions differs in effect from the
BINARY attribute in character column
definitions. A character column defined with the
BINARY attribute is assigned table default
character set and the binary (_bin)
collation of that character set. Every nonbinary character set
has a _bin collation. For example, the
binary collation for the utf8 character set
is utf8_bin, so if the table default
character set is utf8, these two column
definitions are equivalent:

CHAR(10) BINARY
CHAR(10) CHARACTER SET utf8 COLLATE utf8_bin

The use of CHARACTER SET binary in the
definition of a CHAR,
VARCHAR, or
TEXT column causes the column
to be treated as the corresponding binary string data type.
For example, the following pairs of definitions are
equivalent:

CHAR(10) CHARACTER SET binary
BINARY(10)
VARCHAR(10) CHARACTER SET binary
VARBINARY(10)
TEXT CHARACTER SET binary
BLOB

CONVERT() with
USING converts data between different
character sets. In MySQL, transcoding names are the same as
the corresponding character set names. For example, this
statement converts the string 'abc' in the
default character set to the corresponding string in the
utf8 character set:

SELECT CONVERT('abc' USING utf8);

CONVERT() without
USING and
CAST() take an expression and a
type value specifying the result
type. These type values are
permitted:

BINARY[(N)]

Produces a string with the
BINARY data type. See
Section 11.4.2, “The BINARY and VARBINARY Types” for a description of
how this affects comparisons. If the optional length
N is given,
BINARY(N)
causes the cast to use no more than
N bytes of the argument. Values
shorter than N bytes are padded
with 0x00 bytes to a length of
N.

CHAR[(N)]
[charset_info]

Produces a string with the
CHAR data type. If the
optional length N is given,
CHAR(N)
causes the cast to use no more than
N characters of the argument.
No padding occurs for values shorter than
N characters.

With no charset_info clause,
CHAR produces a string with the default
character set. To specify the character set explicitly,
these charset_info values are
permitted:

CHARACTER SET
charset_name:
Produces a string with the given character set.

ASCII: Shorthand for
CHARACTER SET latin1.

UNICODE: Shorthand for
CHARACTER SET ucs2.

In all cases, the string has the default collation for the
character set.

Two functions providing basic XPath 1.0 (XML Path Language,
version 1.0) capabilities are available. Some basic information
about XPath syntax and usage is provided later in this section;
however, an in-depth discussion of these topics is beyond the
scope of this manual, and you should refer to the
XML Path Language (XPath)
1.0 standard for definitive information. A useful resource
for those new to XPath or who desire a refresher in the basics is
the Zvon.org
XPath Tutorial, which is available in several languages.

Note

These functions remain under development. We continue to improve
these and other aspects of XML and XPath functionality in MySQL
8.0 and onwards. You may discuss these, ask
questions about them, and obtain help from other users with them
in the MySQL XML User
Forum.

XPath expressions used with these functions support user variables
and local stored program variables. User variables are weakly
checked; variables local to stored programs are strongly checked
(see also Bug #26518):

User variables (weak checking).
Variables using the syntax
$@variable_name
(that is, user variables) are not checked. No warnings or
errors are issued by the server if a variable has the wrong
type or has previously not been assigned a value. This also
means the user is fully responsible for any typographical
errors, since no warnings will be given if (for example)
$@myvariable is used where
$@myvariable was intended.

Variables in stored programs (strong checking).
Variables using the syntax
$variable_name
can be declared and used with these functions when they are
called inside stored programs. Such variables are local to
the stored program in which they are defined, and are
strongly checked for type and value.

Parameters.
Variables used in XPath expressions inside stored routines
that are passed in as parameters are also subject to strong
checking.

Expressions containing user variables or variables local to stored
programs must otherwise (except for notation) conform to the rules
for XPath expressions containing variables as given in the XPath
1.0 specification.

Note

A user variable used to store an XPath expression is treated as
an empty string. Because of this, it is not possible to store an
XPath expression as a user variable. (Bug #32911)

ExtractValue() takes two string
arguments, a fragment of XML markup
xml_frag and an XPath expression
xpath_expr (also known as a
locator); it returns the
text (CDATA) of the first text node which
is a child of the element or elements matched by the XPath
expression.

If multiple matches are found, the content of the first child
text node of each matching element is returned (in the order
matched) as a single, space-delimited string.

If no matching text node is found for the expression
(including the implicit /text())—for
whatever reason, as long as
xpath_expr is valid, and
xml_frag consists of elements which
are properly nested and closed—an empty string is
returned. No distinction is made between a match on an empty
element and no match at all. This is by design.

If you need to determine whether no matching element was found
in xml_frag or such an element was
found but contained no child text nodes, you should test the
result of an expression that uses the XPath
count() function. For example, both of
these statements return an empty string, as shown here:

This function uses the current SQL collation for making
comparisons with contains(), performing the
same collation aggregation as other string functions (such as
CONCAT()), in taking into
account the collation coercibility of their arguments; see
Section 10.1.8.4, “Collation Coercibility in Expressions”, for an
explanation of the rules governing this behavior.

This function replaces a single portion of a given fragment of
XML markup xml_target with a new
XML fragment new_xml, and then
returns the changed XML. The portion of
xml_target that is replaced matches
an XPath expression xpath_expr
supplied by the user.

If no expression matching
xpath_expr is found, or if multiple
matches are found, the function returns the original
xml_target XML fragment. All three
arguments should be strings.

A discussion in depth of XPath syntax and usage are beyond the
scope of this manual. Please see the
XML Path Language
(XPath) 1.0 specification for definitive information. A
useful resource for those new to XPath or who are wishing a
refresher in the basics is the
Zvon.org
XPath Tutorial, which is available in several languages.

Descriptions and examples of some basic XPath expressions follow:

/tag

Matches
<tag/> if
and only if
<tag/> is
the root element.

Example: /a has a match in
<a><b/></a> because it
matches the outermost (root) tag. It does not match the inner
a element in
<b><a/></b> because in
this instance it is the child of another element.

/tag1/tag2

Matches
<tag2/> if
and only if it is a child of
<tag1/>,
and
<tag1/> is
the root element.

Example: /a/b matches the
b element in the XML fragment
<a><b/></a> because it is
a child of the root element a. It
does not have a match in
<b><a/></b> because in
this case, b is the root element
(and hence the child of no other element). Nor does the XPath
expression have a match in
<a><c><b/></c></a>;
here, b is a descendant of
a, but not actually a child of
a.

This construct is extendable to three or more elements. For
example, the XPath expression /a/b/c
matches the c element in the
fragment
<a><b><c/></b></a>.

//tag

Matches any instance of
<tag>.

Example: //a matches the
a element in any of the following:
<a><b><c/></b></a>;
<c><a><b/></a></b>;
<c><b><a/></b></c>.

// can be combined with
/. For example, //a/b
matches the b element in either of
the fragments <a><b/></a>
or
<c><a><b/></a></c>.

Note

//tag is the
equivalent of
/descendant-or-self::*/tag.
A common error is to confuse this with
/descendant-or-self::tag,
although the latter expression can actually lead to very
different results, as can be seen here:

The * operator acts as a
“wildcard” that matches any element. For example,
the expression /*/b matches the
b element in either of the XML
fragments <a><b/></a> or
<c><b/></c>. However, the
expression does not produce a match in the fragment
<b><a/></b> because
b must be a child of some other
element. The wildcard may be used in any position: The
expression /*/b/* will match any child of a
b element that is itself not the
root element.

You can match any of several locators using the
| (UNION)
operator. For example, the expression
//b|//c matches all
b and c
elements in the XML target.

It is also possible to match an element based on the value of
one or more of its attributes. This done using the syntax
tag[@attribute="value"].
For example, the expression //b[@id="idB"]
matches the second b element in the
fragment <a><b id="idA"/><c/><b
id="idB"/></a>. To match against
any element having
attribute="value",
use the XPath expression
//*[attribute="value"].

To filter multiple attribute values, simply use multiple
attribute-comparison clauses in succession. For example, the
expression //b[@c="x"][@d="y"] matches the
element <b c="x" d="y"/> occurring
anywhere in a given XML fragment.

To find elements for which the same attribute matches any of
several values, you can use multiple locators joined by the
| operator. For example, to match all
b elements whose
c attributes have either of the
values 23 or 17, use the expression
//b[@c="23"]|//b[@c="17"]. You can also use
the logical or operator for this purpose:
//b[@c="23" or @c="17"].

Note

The difference between or and
| is that or joins
conditions, while | joins result sets.

XPath Limitations.
The XPath syntax supported by these functions is currently
subject to the following limitations:

Nodeset-to-nodeset comparison (such as
'/a/b[@c=@d]') is not supported.

All of the standard XPath comparison operators are supported.
(Bug #22823)

Relative locator expressions are resolved in the context of
the root node. For example, consider the following query and
result:

“Up-and-down” navigation is not supported in
cases where the path would lead “above” the root
element. That is, you cannot use expressions which match on
descendants of ancestors of a given element, where one or more
of the ancestors of the current element is also an ancestor of
the root element (see Bug #16321).

The following XPath functions are not supported, or have known
issues as indicated:

id()

lang()

local-name()

name()

namespace-uri()

normalize-space()

starts-with()

string()

substring-after()

substring-before()

translate()

The following axes are not supported:

following-sibling

following

preceding-sibling

preceding

XPath expressions passed as arguments to
ExtractValue() and
UpdateXML() may contain the colon
character (:) in element selectors, which
enables their use with markup employing XML namespaces notation.
For example:

This is similar in some respects to what is permitted by
Apache Xalan and
some other parsers, and is much simpler than requiring namespace
declarations or the use of the namespace-uri()
and local-name() functions.

Error handling.
For both ExtractValue() and
UpdateXML(), the XPath locator
used must be valid and the XML to be searched must consist of
elements which are properly nested and closed. If the locator is
invalid, an error is generated:

The replacement XML used as the third argument to
UpdateXML() is
not checked to determine whether it
consists solely of elements which are properly nested and
closed.

XPath Injection. code injection occurs when
malicious code is introduced into the system to gain
unauthorized access to privileges and data. It is based on
exploiting assumptions made by developers about the type and
content of data input from users. XPath is no exception in this
regard.

A common scenario in which this can happen is the case of
application which handles authorization by matching the
combination of a login name and password with those found in an
XML file, using an XPath expression like this one:

//user[login/text()='neapolitan' and password/text()='1c3cr34m']/attribute::id

This is the XPath equivalent of an SQL statement like this one:

SELECT id FROM users WHERE login='neapolitan' AND password='1c3cr34m';

No checks are performed on the input. This means that a malevolent
user can “short-circuit” the test by entering
' or 1=1 for both the login name and password,
resulting in $xpath being evaluated as shown
here:

//user[login/text()='' or 1=1 and password/text()='' or 1=1]/attribute::id

Since the expression inside the square brackets always evaluates
as true, it is effectively the same as this
one, which matches the id attribute of every
user element in the XML document:

//user/attribute::id

One way in which this particular attack can be circumvented is
simply by quoting the variable names to be interpolated in the
definition of $xpath, forcing the values passed
from a Web form to be converted to strings:

This is the same strategy that is often recommended for preventing
SQL injection attacks. In general, the practices you should follow
for preventing XPath injection attacks are the same as for
preventing SQL injection:

Never accepted untested data from users in your application.

Check all user-submitted data for type; reject or convert data
that is of the wrong type

Test numeric data for out of range values; truncate, round, or
reject values that are out of range. Test strings for illegal
characters and either strip them out or reject input
containing them.

Do not output explicit error messages that might provide an
unauthorized user with clues that could be used to compromise
the system; log these to a file or database table instead.

Just as SQL injection attacks can be used to obtain information
about database schemas, so can XPath injection be used to traverse
XML files to uncover their structure, as discussed in Amit
Klein's paper
Blind
XPath Injection (PDF file, 46KB).

It is also important to check the output being sent back to the
client. Consider what can happen when we use the MySQL
ExtractValue() function:

Because ExtractValue() returns
multiple matches as a single space-delimited string, this
injection attack provides every valid ID contained within
users.xml to the user as a single row of
output. As an extra safeguard, you should also test output before
returning it to the user. Here is a simple example:

In MySQL 8.0, bit functions and operators permit
binary string type arguments
(BINARY,
VARBINARY, and the
BLOB types) and return a value of
like type, which enables them to take arguments and produce return
values larger than 64 bits. Nonbinary string arguments are
converted to BIGINT and processed
as such, as before.

An implication of this change in behavior is that bit operations
on binary string arguments might produce a different result in
MySQL 8.0 than in 5.7. For information
about how to prepare in MySQL 5.7 for potential
incompatibilities between MySQL 5.7 and 8.0, see
Bit Functions and Operators, in
MySQL 5.7 Reference Manual.

Handling of bit-value literals in bit operations is similar to
hexadecimal literals (that is, as numbers).

Bit Operations in MySQL 8.0

MySQL 8.0 extends bit operations to handle binary string arguments
directly (without conversion) and produce binary string results.
(Arguments that are not integers or binary strings are still
converted to integers, as before.) This extension enhances bit
operations in the following ways:

Bit operations become possible on values longer than 64 bits.

It is easier to perform bit operations on values that are more
naturally represented as binary strings than as integers.

For example, consider UUID values and IPv6 addresses, which have
human-readable text formats like this:

It is cumbersome to operate on text strings in those formats. An
alternative is convert them to fixed-length binary strings without
delimiters. UUID_TO_BIN() and
INET6_ATON() each produce a value
of data type BINARY(16), a binary
string 16 bytes (128 bits) long. The following statements
illustrate this (HEX() is used to produce
displayable values):

Those binary values are easily manipulable with bit operations to
perform actions such as extracting the timestamp from UUID values,
or extracting the network and host parts of IPv6 addresses. (For
examples, see later in this discussion.)

Arguments that count as binary strings include column values,
routine parameters, local variables, and user-defined variables
that have a binary string type:
BINARY,
VARBINARY, or one of the
BLOB types.

What about hexadecimal literals and bit literals? Recall that
those are binary strings by default in MySQL, but numbers in
numeric context. How are they handled for bit operations in MySQL
8.0? Does MySQL continue to evaluate them in numeric context, as
is done prior to MySQL 8.0? Or do bit operations evaluate them as
binary strings, now that binary strings can be handled
“natively” without conversion?

Answer: It has been common to specify arguments to bit operations
using hexadecimal literals or bit literals with the intent that
they represent numbers, so MySQL continues to evaluate bit
operations in numeric context when all bit arguments are
hexadecimal or bit literals, for backward compatility. If you
require evaluation as binary strings instead, that is easily
accomplished: Use the _binary introducer for at
least one literal.

These bit operations evaluate the hexadecimal literals and bit
literals as integers:

Padding or stripping can also be accomplished using functions such
as LPAD(),
RPAD(),
SUBSTR(), or
CAST(). In such cases, the
expression arguments are no longer all literals and
_binary becomes unnecessary. Examples:

Binary String Bit-Operation Examples

The following example illustrates use of bit operations to extract
parts of a UUID value, in this case, the timestamp and IEEE 802
node number. This technique requires bitmasks for each extracted
part.

Convert the text UUID to the corresponding 16-byte binary value so
that it can be manipulated using bit operations in binary-string
context:

Construct bitmasks for the timestamp and node number parts of the
value. The timestamp comprises the first three parts (64 bits,
bits 0 to 63) and the node number is the last part (48 bits, bits
80 to 127):

The CAST(... AS BINARY(16)) function is used
here because the masks must be the same length as the UUID value
against which they are applied. The same result can be produced
using other functions to pad the masks to the required length:

The preceding example uses these bit operations: right shift
(>>)
and bitwise AND
(&).

Note

UUID_TO_BIN() takes a flag that
causes some bit rearrangement in the resulting binary UUID
value. If you use that flag, modify the extraction masks
accordingly.

The next example uses bit operations to extract the network and
host parts of an IPv6 address. Suppose that the network part has a
length of 80 bits. Then the host part has a length of 128 −
80 = 48 bits. To extract the network and host parts of the
address, convert it to a binary string, then use bit operations in
binary-string context.

Convert the text IPv6 address to the corresponding binary string:

mysql> SET @ip = INET6_ATON('fe80::219:d1ff:fe91:1a72');

Define the network length in bits:

mysql> SET @net_len = 80;

Construct network and host masks by shifting the all-ones address
left or right. To do this, begin with the address
::, which is shorthand for all zeros, as you
can see by converting it to a binary string like this:

The preceding example uses these bit operations: Complement
(~), left
shift
(<<),
and bitwise AND
(&).

The remaining discussion provides details on argument handling for
each group of bit operations, more information about literal-value
handling in bit operations, and potential incompatibilities
between MySQL 8.0 and older MySQL versions.

Bitwise AND, OR, and XOR Operations

For &,
|, and
^ bit
operations, the result type depends on whether the arguments are
evaluated as binary strings or numbers:

Binary-string evaluation occurs when the arguments have a
binary string type, and at least one of them is not a
hexadecimal literal, bit literal, or NULL
literal. Numeric evaluation occurs otherwise, with argument
conversion to unsigned 64-bit integers as necessary.

Binary-string evaluation produces a binary string of the same
length as the arguments. If the arguments have unequal
lengths, an
ER_INVALID_BITWISE_OPERANDS_SIZE
error occurs. Numeric evaluation produces an unsigned 64-bit
integer.

Bitwise Complement and Shift Operations

For ~,
<<, and
>> bit
operations, the result type depends on whether the bit argument is
evaluated as a binary string or number:

Binary-string evaluation occurs when the bit argument has a
binary string type, and is not a hexadecimal literal, bit
literal, or NULL literal. Numeric
evaluation occurs otherwise, with argument conversion to an
unsigned 64-bit integer as necessary.

Binary-string evaluation produces a binary string of the same
length as the bit argument. Numeric evaluation produces an
unsigned 64-bit integer.

For shift operations, bits shifted off the end of the value are
lost without warning, regardless of the argument type. In
particular, if the shift count is greater or equal to the number
of bits in the bit argument, all bits in the result are 0.

BIT_AND(), BIT_OR(), and BIT_XOR() Operations

For the BIT_AND(),
BIT_OR(), and
BIT_XOR() bit functions, the result
type depends on whether the function argument values are evaluated
as binary strings or numbers:

Binary-string evaluation occurs when the argument values have
a binary string type, and the argument is not a hexadecimal
literal, bit literal, or NULL literal.
Numeric evaluation occurs otherwise, with argument value
conversion to unsigned 64-bit integers as necessary.

NULL values do not affect the result unless all
values are NULL. In that case, the result is a
neutral value having the same length as the length of the argument
values (all bits 1 for BIT_AND(),
all bits 0 for BIT_OR(), and
BIT_XOR()).

Special Handling of Hexadecimal Literals, Bit Literals, and NULL
Literals

For backward compatibility, MySQL 8.0 evaluates bit operations in
numeric context when all bit arguments are hexadecimal literals,
bit literals, or NULL literals. That is, bit
operations on binary-string bit arguments do not use binary-string
evaluation if all bit arguments are unadorned hexadecimal
literals, bit literals, or NULL literals. (This
does not apply to such literals if they are written with a
_binary introducer,
BINARY operator, or other way of
specifying them explicitly as binary strings.)

The literal handling just described is the same as prior to MySQL
8.0. Examples:

These bit operations evaluate the literals in numeric context
and produce a BIGINT result:

b'0001' | b'0010'
X'0008' << 8

These bit operations evaluate NULL in
numeric context and produce a BIGINT result
that has a NULL value:

NULL & NULL
NULL >> 4

In MySQL 8.0, you can cause those operations to evaluate the
arguments in binary-string context by indicating explicitly that
at least one argument is a binary string:

The result of the last two expressions is NULL,
just as without the BINARY operator, but the
data type of the result is a binary string type rather than an
integer type.

Bit-Operation Incompatibilities with MySQL 5.7

Because bit operations can handle binary string arguments natively
in MySQL 8.0, some expressions produce a different result in MySQL
8.0 than in 5.7. The five problematic expression types to watch
out for are:

The result type depends on whether the arguments are evaluated
as binary strings or numbers:

Binary-string evaluation occurs when the arguments have a
binary string type, and at least one of them is not a
hexadecimal literal, bit literal, or
NULL literal. Numeric evaluation occurs
otherwise, with argument conversion to unsigned 64-bit
integers as necessary.

Binary-string evaluation produces a binary string of the
same length as the arguments. If the arguments have
unequal lengths, an
ER_INVALID_BITWISE_OPERANDS_SIZE
error occurs. Numeric evaluation produces an unsigned
64-bit integer.

For more information, see the introductory discussion in this
section.

The result type depends on whether the arguments are evaluated
as binary strings or numbers:

Binary-string evaluation occurs when the arguments have a
binary string type, and at least one of them is not a
hexadecimal literal, bit literal, or
NULL literal. Numeric evaluation occurs
otherwise, with argument conversion to unsigned 64-bit
integers as necessary.

Binary-string evaluation produces a binary string of the
same length as the arguments. If the arguments have
unequal lengths, an
ER_INVALID_BITWISE_OPERANDS_SIZE
error occurs. Numeric evaluation produces an unsigned
64-bit integer.

For more information, see the introductory discussion in this
section.

The result type depends on whether the arguments are evaluated
as binary strings or numbers:

Binary-string evaluation occurs when the arguments have a
binary string type, and at least one of them is not a
hexadecimal literal, bit literal, or
NULL literal. Numeric evaluation occurs
otherwise, with argument conversion to unsigned 64-bit
integers as necessary.

Binary-string evaluation produces a binary string of the
same length as the arguments. If the arguments have
unequal lengths, an
ER_INVALID_BITWISE_OPERANDS_SIZE
error occurs. Numeric evaluation produces an unsigned
64-bit integer.

For more information, see the introductory discussion in this
section.

The result type depends on whether the bit argument is
evaluated as a binary string or number:

Binary-string evaluation occurs when the bit argument has
a binary string type, and is not a hexadecimal literal,
bit literal, or NULL literal. Numeric
evaluation occurs otherwise, with argument conversion to
an unsigned 64-bit integer as necessary.

Binary-string evaluation produces a binary string of the
same length as the bit argument. Numeric evaluation
produces an unsigned 64-bit integer.

Bits shifted off the end of the value are lost without
warning, regardless of the argument type. In particular, if
the shift count is greater or equal to the number of bits in
the bit argument, all bits in the result are 0.

For more information, see the introductory discussion in this
section.

The result type depends on whether the bit argument is
evaluated as a binary string or number:

Binary-string evaluation occurs when the bit argument has
a binary string type, and is not a hexadecimal literal,
bit literal, or NULL literal. Numeric
evaluation occurs otherwise, with argument conversion to
an unsigned 64-bit integer as necessary.

Binary-string evaluation produces a binary string of the
same length as the bit argument. Numeric evaluation
produces an unsigned 64-bit integer.

Bits shifted off the end of the value are lost without
warning, regardless of the argument type. In particular, if
the shift count is greater or equal to the number of bits in
the bit argument, all bits in the result are 0.

For more information, see the introductory discussion in this
section.

The result type depends on whether the bit argument is
evaluated as a binary string or number:

Binary-string evaluation occurs when the bit argument has
a binary string type, and is not a hexadecimal literal,
bit literal, or NULL literal. Numeric
evaluation occurs otherwise, with argument conversion to
an unsigned 64-bit integer as necessary.

Binary-string evaluation produces a binary string of the
same length as the bit argument. Numeric evaluation
produces an unsigned 64-bit integer.

For more information, see the introductory discussion in this
section.

Many encryption and compression functions return strings for which
the result might contain arbitrary byte values. If you want to
store these results, use a column with a
VARBINARY or
BLOB binary string data type. This
will avoid potential problems with trailing space removal or
character set conversion that would change data values, such as
may occur if you use a nonbinary string data type
(CHAR,
VARCHAR,
TEXT).

For versions in which functions such as MD5()
or SHA1() return a string of hexadecimal digits
as a binary string, the return value cannot be converted to
uppercase or compared in case-insensitive fashion as is. You must
convert the value to a nonbinary string. See the discussion of
binary string conversion in Section 12.10, “Cast Functions and Operators”.

If an application stores values from a function such as
MD5() or
SHA1() that returns a string of hex
digits, more efficient storage and comparisons can be obtained by
converting the hex representation to binary using
UNHEX() and storing the result in a
BINARY(N)
column. Each pair of hexadecimal digits requires one byte in
binary form, so the value of N depends
on the length of the hex string. N is
16 for an MD5() value and 20 for a
SHA1() value. For
SHA2(),
N ranges from 28 to 32 depending on the
argument specifying the desired bit length of the result.

The size penalty for storing the hex string in a
CHAR column is at least two times,
up to eight times if the value is stored in a column that uses the
utf8 character set (where each character uses 4
bytes). Storing the string also results in slower comparisons
because of the larger values and the need to take character set
collation rules into account.

Suppose that an application stores
MD5() string values in a
CHAR(32) column:

Applications should be prepared to handle the very rare case that
a hashing function produces the same value for two different input
values. One way to make collisions detectable is to make the hash
column a primary key.

Note

Exploits for the MD5 and SHA-1 algorithms have become known. You
may wish to consider using another one-way encryption function
described in this section instead, such as
SHA2().

Caution

Passwords or other sensitive values supplied as arguments to
encryption functions are sent in cleartext to the MySQL server
unless an SSL connection is used. Also, such values will appear
in any MySQL logs to which they are written. To avoid these
types of exposure, applications can encrypt sensitive values on
the client side before sending them to the server. The same
considerations apply to encryption keys. To avoid exposing
these, applications can use stored procedures to encrypt and
decrypt values on the server side.

AES_ENCRYPT() and
AES_DECRYPT() implement
encryption and decryption of data using the official AES
(Advanced Encryption Standard) algorithm, previously known as
“Rijndael.” The AES standard permits various key
lengths. By default these functions implement AES with a
128-bit key length. Key lengths of 196 or 256 bits can be
used, as described later. The key length is a trade off
between performance and security.

AES_ENCRYPT() encrypts the
string str using the key string
key_str and returns a binary string
containing the encrypted output.
AES_DECRYPT() decrypts the
encrypted string crypt_str using
the key string key_str and returns
the original cleartext string. If either function argument is
NULL, the function returns
NULL.

The str and
crypt_str arguments can be any
length, and padding is automatically added to
str so it is a multiple of a block
as required by block-based algorithms such as AES. This
padding is automatically removed by the
AES_DECRYPT() function. The
length of crypt_str can be
calculated using this formula:

16 * (trunc(string_length / 16) + 1)

For a key length of 128 bits, the most secure way to pass a
key to the key_str argument is to
create a truly random 128-bit value and pass it as a binary
value. For example:

Do not pass a password or passphrase directly to
crypt_str, hash it first. Previous
versions of this documentation suggested the former approach,
but it is no longer recommended as the examples shown here are
more secure.

If AES_DECRYPT() detects
invalid data or incorrect padding, it returns
NULL. However, it is possible for
AES_DECRYPT() to return a
non-NULL value (possibly garbage) if the
input data or the key is invalid.

AES_ENCRYPT() and
AES_DECRYPT() permit control of
the block encryption mode and take an optional
init_vector initialization vector
argument:

The block_encryption_mode
system variable controls the mode for block-based
encryption algorithms. Its default value is
aes-128-ecb, which signifies encryption
using a key length of 128 bits and ECB mode. For a
description of the permitted values of this variable, see
Section 5.1.5, “Server System Variables”.

For modes that require the optional
init_vector argument, it must be 16
bytes or longer (bytes in excess of 16 are ignored). An error
occurs if init_vector is missing.

For modes that do not require
init_vector, it is ignored and a
warning is generated if it is specified.

A random string of bytes to use for the initialization vector
can be produced by calling
RANDOM_BYTES(16). For
encryption modes that require an initialization vector, the
same vector must be used for encryption and decryption.

Compresses a string and returns the result as a binary string.
This function requires MySQL to have been compiled with a
compression library such as zlib.
Otherwise, the return value is always NULL.
The compressed string can be uncompressed with
UNCOMPRESS().

Nonempty strings are stored as a 4-byte length of the
uncompressed string (low byte first), followed by the
compressed string. If the string ends with space, an extra
. character is added to avoid problems
with endspace trimming should the result be stored in a
CHAR or
VARCHAR column. (However,
use of nonbinary string data types such as
CHAR or
VARCHAR to store compressed
strings is not recommended anyway because character set
conversion may occur. Use a
VARBINARY or
BLOB binary string column
instead.)

Calculates an MD5 128-bit checksum for the string. The value
is returned as a string of 32 hexadecimal digits, or
NULL if the argument was
NULL. The return value can, for example, be
used as a hash key. See the notes at the beginning of this
section about storing hash values efficiently.

The return value is a nonbinary string in the connection
character set.

mysql> SELECT MD5('testing');
-> 'ae2b1fca515949e5d54fb22b8ed95575'

This is the “RSA Data Security, Inc. MD5 Message-Digest
Algorithm.”

See the note regarding the MD5 algorithm at the beginning this
section.

This function is deprecated, will be removed in a future
MySQL release, and should no longer be used.

Returns a hashed password string calculated from the cleartext
password str. The return value is a
nonbinary string in the connection character set, or
NULL if the argument is
NULL. This function is the SQL interface to
the algorithm used by the server to encrypt MySQL passwords
for storage in the mysql.user grant table.

The old_passwords system
variable controls the password hashing method used by the
PASSWORD() function. It also
influences password hashing performed by
CREATE USER and
GRANT statements that specify a
password using an IDENTIFIED BY clause.

The following table shows the permitted values of
old_passwords, the password
hashing method for each value, and which authentication
plugins use passwords hashed with each method.

Value

Password Hashing Method

Associated Authentication Plugin

0

MySQL native hashing

mysql_native_password

2

SHA-256 hashing

sha256_password

SHA-256 password hashing
(old_passwords=2) uses a
random salt value, which makes the result from
PASSWORD() nondeterministic.
Consequently, statements that use this function are not safe
for statement-based replication.

Encryption performed by
PASSWORD() is one-way (not
reversible). It is not the same type of encryption used for
Unix passwords.

Under some circumstances, statements that invoke
PASSWORD() may be recorded in
server logs or on the client side in a history file such as
~/.mysql_history, which means that
cleartext passwords may be read by anyone having read access
to that information. For information about the conditions
under which this occurs for the server logs and how to
control it, see Section 6.1.2.3, “Passwords and Logging”. For
similar information about client-side logging, see
Section 4.5.1.3, “mysql Logging”.

This function returns a binary string of
len random bytes generated using
the random number generator of the SSL library (OpenSSL or
yaSSL). Permitted values of len
range from 1 to 1024. For values outside that range,
RANDOM_BYTES() generates a
warning and returns NULL.

RANDOM_BYTES() can be used to
provide the initialization vector for the
AES_DECRYPT() and
AES_ENCRYPT() functions. For
use in that context, len must be at
least 16. Larger values are permitted, but bytes in excess of
16 are ignored.

RANDOM_BYTES() generates a
random value, which makes its result nondeterministic.
Consequently, statements that use this function are unsafe for
statement-based replication.

Calculates an SHA-1 160-bit checksum for the string, as
described in RFC 3174 (Secure Hash Algorithm). The value is
returned as a string of 40 hexadecimal digits, or
NULL if the argument was
NULL. One of the possible uses for this
function is as a hash key. See the notes at the beginning of
this section about storing hash values efficiently. You can
also use SHA1() as a
cryptographic function for storing passwords.
SHA() is
synonymous with SHA1().

The return value is a nonbinary string in the connection
character set.

Calculates the SHA-2 family of hash functions (SHA-224,
SHA-256, SHA-384, and SHA-512). The first argument is the
cleartext string to be hashed. The second argument indicates
the desired bit length of the result, which must have a value
of 224, 256, 384, 512, or 0 (which is equivalent to 256). If
either argument is NULL or the hash length
is not one of the permitted values, the return value is
NULL. Otherwise, the function result is a
hash value containing the desired number of bits. See the
notes at the beginning of this section about storing hash
values efficiently.

The return value is a nonbinary string in the connection
character set.

Uncompresses a string compressed by the
COMPRESS() function. If the
argument is not a compressed value, the result is
NULL. This function requires MySQL to have
been compiled with a compression library such as
zlib. Otherwise, the return value is always
NULL.

The password is subjected to increasingly strict tests and the
return value reflects which tests were satisfied, as shown in
the following table. In addition, if the
validate_password_check_user_name
system variable is enabled and the password matches the user
name,
VALIDATE_PASSWORD_STRENGTH()
returns 0 regardless of how other
validate_password system variables are set.

The BENCHMARK() function
executes the expression expr
repeatedly count times. It may be
used to time how quickly MySQL processes the expression. The
result value is always 0. The intended use
is from within the mysql client, which
reports query execution times:

The time reported is elapsed time on the client end, not CPU
time on the server end. It is advisable to execute
BENCHMARK() several times, and
to interpret the result with regard to how heavily loaded the
server machine is.

BENCHMARK() is intended for
measuring the runtime performance of scalar expressions, which
has some significant implications for the way that you use it
and interpret the results:

Only scalar expressions can be used. Although the
expression can be a subquery, it must return a single
column and at most a single row. For example,
BENCHMARK(10, (SELECT * FROM
t)) will fail if the table t
has more than one column or more than one row.

Executing a SELECT
expr statement
N times differs from executing
SELECT BENCHMARK(N,
expr) in terms of the
amount of overhead involved. The two have very different
execution profiles and you should not expect them to take
the same amount of time. The former involves the parser,
optimizer, table locking, and runtime evaluation
N times each. The latter
involves only runtime evaluation
N times, and all the other
components just once. Memory structures already allocated
are reused, and runtime optimizations such as local
caching of results already evaluated for aggregate
functions can alter the results. Use of
BENCHMARK() thus measures
performance of the runtime component by giving more weight
to that component and removing the “noise”
introduced by the network, parser, optimizer, and so
forth.

Returns a utf8 string containing the
current active roles for the current session, separated by
commas, or NONE if there are none. The
value reflects the setting of the
sql_quote_show_create system
variable.

Suppose that an account is granted roles as follows:

GRANT 'r1', 'r2' TO 'u1'@'localhost';
SET DEFAULT ROLE ALL TO 'u1'@'localhost';

In sessions for u1, the initial
CURRENT_ROLE() value names the
default account roles. Using SET
ROLE changes that:

Returns the user name and host name combination for the MySQL
account that the server used to authenticate the current
client. This account determines your access privileges. The
return value is a string in the utf8
character set.

The example illustrates that although the client specified a
user name of davida (as indicated by the
value of the USER() function),
the server authenticated the client using an anonymous user
account (as seen by the empty user name part of the
CURRENT_USER() value). One way
this might occur is that there is no account listed in the
grant tables for davida.

Within a stored program or view,
CURRENT_USER() returns the
account for the user who defined the object (as given by its
DEFINER value) unless defined with the
SQL SECURITY INVOKER characteristic. In the
latter case, CURRENT_USER()
returns the object's invoker.

Triggers and events have no option to define the SQL
SECURITY characteristic, so for these objects,
CURRENT_USER() returns the
account for the user who defined the object. To return the
invoker, use USER() or
SESSION_USER().

The following statements support use of the
CURRENT_USER() function to take
the place of the name of (and, possibly, a host for) an
affected user or a definer; in such cases,
CURRENT_USER() is expanded
where and as needed:

Returns the default (current) database name as a string in the
utf8 character set. If there is no default
database, DATABASE() returns
NULL. Within a stored routine, the default
database is the database that the routine is associated with,
which is not necessarily the same as the database that is the
default in the calling context.

A SELECT statement may include
a LIMIT clause to restrict the number of
rows the server returns to the client. In some cases, it is
desirable to know how many rows the statement would have
returned without the LIMIT, but without
running the statement again. To obtain this row count, include
an SQL_CALC_FOUND_ROWS option in the
SELECT statement, and then
invoke FOUND_ROWS() afterward:

The second SELECT returns a
number indicating how many rows the first
SELECT would have returned had
it been written without the LIMIT clause.

In the absence of the SQL_CALC_FOUND_ROWS
option in the most recent successful
SELECT statement,
FOUND_ROWS() returns the number
of rows in the result set returned by that statement. If the
statement includes a LIMIT clause,
FOUND_ROWS() returns the number
of rows up to the limit. For example,
FOUND_ROWS() returns 10 or 60,
respectively, if the statement includes LIMIT
10 or LIMIT 50, 10.

The row count available through
FOUND_ROWS() is transient and
not intended to be available past the statement following the
SELECT SQL_CALC_FOUND_ROWS statement. If
you need to refer to the value later, save it:

If you are using SELECT
SQL_CALC_FOUND_ROWS, MySQL must calculate how many
rows are in the full result set. However, this is faster than
running the query again without LIMIT,
because the result set need not be sent to the client.

SQL_CALC_FOUND_ROWS and
FOUND_ROWS() can be useful in
situations when you want to restrict the number of rows that a
query returns, but also determine the number of rows in the
full result set without running the query again. An example is
a Web script that presents a paged display containing links to
the pages that show other sections of a search result. Using
FOUND_ROWS() enables you to
determine how many other pages are needed for the rest of the
result.

The use of SQL_CALC_FOUND_ROWS and
FOUND_ROWS() is more complex
for UNION statements than for
simple SELECT statements,
because LIMIT may occur at multiple places
in a UNION. It may be applied
to individual SELECT statements
in the UNION, or global to the
UNION result as a whole.

The intent of SQL_CALC_FOUND_ROWS for
UNION is that it should return
the row count that would be returned without a global
LIMIT. The conditions for use of
SQL_CALC_FOUND_ROWS with
UNION are:

The SQL_CALC_FOUND_ROWS keyword must
appear in the first SELECT
of the UNION.

With no argument,
LAST_INSERT_ID() returns a
BIGINT UNSIGNED (64-bit) value representing
the first automatically generated value successfully inserted
for an AUTO_INCREMENT column as a result of
the most recently executed
INSERT statement. The value of
LAST_INSERT_ID() remains
unchanged if no rows are successfully inserted.

For example, after inserting a row that generates an
AUTO_INCREMENT value, you can get the value
like this:

mysql> SELECT LAST_INSERT_ID();
-> 195

The currently executing statement does not affect the value of
LAST_INSERT_ID(). Suppose that
you generate an AUTO_INCREMENT value with
one statement, and then refer to
LAST_INSERT_ID() in a
multiple-row INSERT statement
that inserts rows into a table with its own
AUTO_INCREMENT column. The value of
LAST_INSERT_ID() will remain
stable in the second statement; its value for the second and
later rows is not affected by the earlier row insertions.
(However, if you mix references to
LAST_INSERT_ID() and
LAST_INSERT_ID(expr),
the effect is undefined.)

If the previous statement returned an error, the value of
LAST_INSERT_ID() is undefined.
For transactional tables, if the statement is rolled back due
to an error, the value of
LAST_INSERT_ID() is left
undefined. For manual
ROLLBACK,
the value of LAST_INSERT_ID()
is not restored to that before the transaction; it remains as
it was at the point of the
ROLLBACK.

Within the body of a stored routine (procedure or function) or
a trigger, the value of
LAST_INSERT_ID() changes the
same way as for statements executed outside the body of these
kinds of objects. The effect of a stored routine or trigger
upon the value of
LAST_INSERT_ID() that is seen
by following statements depends on the kind of routine:

If a stored procedure executes statements that change the
value of LAST_INSERT_ID(),
the changed value is seen by statements that follow the
procedure call.

For stored functions and triggers that change the value,
the value is restored when the function or trigger ends,
so following statements will not see a changed value.

The ID that was generated is maintained in the server on a
per-connection basis. This means that the
value returned by the function to a given client is the first
AUTO_INCREMENT value generated for most
recent statement affecting an
AUTO_INCREMENT column by that
client. This value cannot be affected by other
clients, even if they generate
AUTO_INCREMENT values of their own. This
behavior ensures that each client can retrieve its own ID
without concern for the activity of other clients, and without
the need for locks or transactions.

The value of LAST_INSERT_ID()
is not changed if you set the
AUTO_INCREMENT column of a row to a
non-“magic” value (that is, a value that is not
NULL and not 0).

Important

If you insert multiple rows using a single
INSERT statement,
LAST_INSERT_ID() returns the
value generated for the first inserted
row only. The reason for this is to
make it possible to reproduce easily the same
INSERT statement against some
other server.

Although the second INSERT
statement inserted three new rows into t,
the ID generated for the first of these rows was
2, and it is this value that is returned by
LAST_INSERT_ID() for the
following SELECT statement.

If you use INSERT
IGNORE and the row is ignored, the
LAST_INSERT_ID() remains
unchanged from the current value (or 0 is returned if the
connection has not yet performed a successful
INSERT) and, for non-transactional tables,
the AUTO_INCREMENT counter is not
incremented. For InnoDB tables, the
AUTO_INCREMENT counter is incremented if
innodb_autoinc_lock_mode is
set to 1 or 2, as
demonstrated in the following example:

If expr is given as an argument to
LAST_INSERT_ID(), the value of
the argument is returned by the function and is remembered as
the next value to be returned by
LAST_INSERT_ID(). This can be
used to simulate sequences:

You can generate sequences without calling
LAST_INSERT_ID(), but the
utility of using the function this way is that the ID value is
maintained in the server as the last automatically generated
value. It is multi-user safe because multiple clients can
issue the UPDATE statement and
get their own sequence value with the
SELECT statement (or
mysql_insert_id()), without
affecting or being affected by other clients that generate
their own sequence values.

Returns a utf8 string containing a GraphML
document representing memory role subgraphs. The
ROLE_ADMIN or
SUPER privilege is required to
see content in the <graphml> element.
Otherwise, the result shows only an empty element:

SELECT: -1 if the statement
returns a result set, or the number of rows
“affected” if it does not. For example, for
SELECT * FROM t1,
ROW_COUNT() returns -1. For
SELECT * FROM t1 INTO OUTFILE
'file_name',
ROW_COUNT() returns the
number of rows written to the file.

For UPDATE statements, the
affected-rows value by default is the number of rows actually
changed. If you specify the
CLIENT_FOUND_ROWS flag to
mysql_real_connect() when
connecting to mysqld, the affected-rows
value is the number of rows “found”; that is,
matched by the WHERE clause.

For REPLACE statements, the
affected-rows value is 2 if the new row replaced an old row,
because in this case, one row was inserted after the duplicate
was deleted.

For
INSERT
... ON DUPLICATE KEY UPDATE statements, the
affected-rows value per row is 1 if the row is inserted as a
new row, 2 if an existing row is updated, and 0 if an existing
row is set to its current values. If you specify the
CLIENT_FOUND_ROWS flag, the affected-rows
value is 1 (not 0) if an existing row is set to its current
values.

Returns a string that indicates the MySQL server version. The
string uses the utf8 character set. The
value might have a suffix in addition to the version number.
See the description of the
version system variable in
Section 5.1.5, “Server System Variables”.

This function is unsafe for statement-based replication. A
warning is logged if you use this function when
binlog_format is set to
STATEMENT.

12.15.2 Argument Handling by Spatial Functions

Spatial values, or geometries, have the properties described at
Section 11.5.2.2, “Geometry Class”. The following discussion
lists general spatial function argument-handling characteristics.
Specific functions or groups of functions may have additional
argument-handling characteristics, as discussed in the sections
where those function descriptions occur.

Spatial functions are defined only for valid geometry values.

The spatial reference identifier (SRID) of a geometry identifies
the coordinate space in which the geometry is defined. In MySQL,
the SRID value is an integer associated with the geometry value.
The maximum usable SRID value is
232−1. If a larger value is
given, only the lower 32 bits are used.

In MySQL, all computations are done assuming SRID 0, regardless of
the actual SRID value. SRID 0 represents an infinite flat
Cartesian plane with no units assigned to its axes. In the future,
computations may use the specified SRID values. To ensure SRID 0
behavior, create geometry values using SRID 0. SRID 0 is the
default for new geometry values if no SRID is specified.

Geometry values produced by any spatial function inherit the SRID
of the geometry arguments.

Spatial functions that take multiple geometry arguments require
those arguments to have the same SRID value (that is, same value
in the lower 32 bits). Assuming equal SRIDs, spatial functions do
nothing with them after performing the equality check; geometry
values are implicitly handled using Cartesian coordinates (SRID
0). If a spatial function returns
ER_GIS_DIFFERENT_SRIDS, it means
that the geometry arguments did not all have the same SRID. You
must modify them to have the same SRID.

The Open Geospatial Consortium guidelines require that input
polygons already be closed, so unclosed polygons are rejected as
invalid rather than being closed.

Empty geometry-collection handling is as follows: An empty WKT
input geometry collection may be specified as
'GEOMETRYCOLLECTION()'. This is also the output
WKT resulting from a spatial operation that produces an empty
geometry collection.

During parsing of a nested geometry collection, the collection is
flattened and its basic components are used in various GIS
operations to compute results. This provides additional
flexibility to users because it is unnecessary to be concerned
about the uniqueness of geometry data. Nested geometry collections
may be produced from nested GIS function calls without having to
be explicitly flattened first.

12.15.3 Functions That Create Geometry Values from WKT Values

These functions take as arguments a Well-Known Text (WKT)
representation and, optionally, a spatial reference system
identifier (SRID). They return the corresponding geometry.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options argument
may be given to override the default axis order.

ST_GeomFromText() accepts a WKT
value of any geometry type as its first argument. Other functions
provide type-specific construction functions for construction of
geometry values of each geometry type.

Constructs a GeometryCollection value using
its WKT representation and SRID.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

Constructs a geometry value of any type using its WKT
representation and SRID.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

Constructs a MultiLineString value using
its WKT representation and SRID.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

In MySQL 8.0, spatial functions such as
ST_MPointFromText() and
ST_GeomFromText() that accept
WKT-format representations of MultiPoint
values permit individual points within values to be surrounded
by parentheses. For example, both of the following function
calls are valid:

Constructs a MultiPolygon value using its
WKT representation and SRID.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The result is NULL if the geometry argument
is NULL or not a syntactically well-formed
geometry, or if the SRID argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

12.15.4 Functions That Create Geometry Values from WKB Values

These functions take as arguments a
BLOB containing a Well-Known Binary
(WKB) representation and, optionally, a spatial reference system
identifier (SRID). They return the corresponding geometry.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options argument
may be given to override the default axis order.

ST_GeomFromWKB() accepts a WKB
value of any geometry type as its first argument. Other functions
provide type-specific construction functions for construction of
geometry values of each geometry type.

Constructs a GeometryCollection value using
its WKB representation and SRID.

The result is NULL if the WKB or SRID
argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

Constructs a geometry value of any type using its WKB
representation and SRID.

The result is NULL if the WKB or SRID
argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

Constructs a MultiLineString value using
its WKB representation and SRID.

The result is NULL if the WKB or SRID
argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

Constructs a MultiPolygon value using its
WKB representation and SRID.

The result is NULL if the WKB or SRID
argument is NULL.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

The functions interpret geographic coordinates (latitude,
longitude) as in the order specified by the spatial reference
system. An optional options
argument may be given to override the default axis order.

The result is NULL if the
options argument is
NULL. If the
options argument is invalid, an
error occurs to indicate why.

12.15.5 MySQL-Specific Functions That Create Geometry Values

MySQL provides a set of useful nonstandard functions for creating
geometry values. The functions described in this section are MySQL
extensions to the OpenGIS specification.

These functions produce geometry objects from either WKB values or
geometry objects as arguments. If any argument is not a proper WKB
or geometry representation of the proper object type, the return
value is NULL.

For example, you can insert the geometry return value from
Point() directly into a
POINT column: