MySQL :: MySQL 5.5 Reference Manual :: 9.1.14.1 Unicode Character Sets

MySQL 5.5 Reference Manual :: 9 Internationalization and Localization :: 9.1 Character Set Support :: 9.1.14 Character Sets and Collations That MySQL Supports :: 9.1.14.1 Unicode Character Sets

« 9.1.14 Character Sets and Collations That MySQL Supports

9.1.14.2 West European Character Sets »

Section Navigation [Toggle]

9.1.14.1. Unicode Character Sets

MySQL 5.5 supports these Unicode character sets:

ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
utf16, the UTF-16 encoding for the Unicode character set; like ucs2 but with an extension for supplementary characters
utf32, the UTF-32 encoding for the Unicode character set using 32 bits per character
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character
utf8mb4, a UTF-8 encoding of the Unicode character set using one to four bytes per character

ucs2 and utf8 support BMP characters. utf8mb4, utf16, and utf32 support BMP and supplementary characters. The utf8mb4, utf16, and utf32 character sets (and their collations) were added in MySQL 5.5.3.

You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set. For general information about the character sets, see Section 9.1.10, “Unicode Support”.

A similar set of collations is available for each Unicode character set. These are shown in the following list, where xxx represents the character set name. For example, xxx_danish_ci represents the Danish collations, the specific names of which are ucs2_danish_ci, utf16_danish_ci, utf32_danish_ci, utf8_danish_ci, and utf8mb4_danish_ci.

xxx_bin
xxx_czech_ci
xxx_danish_ci
xxx_esperanto_ci
xxx_estonian_ci
xxx_general_ci (default)
xxx_hungarian_ci
xxx_icelandic_ci
xxx_latvian_ci
xxx_lithuanian_ci
xxx_persian_ci
xxx_polish_ci
xxx_roman_ci
xxx_romanian_ci
xxx_sinhala_ci
xxx_slovak_ci
xxx_slovenian_ci
xxx_spanish2_ci
xxx_spanish_ci
xxx_swedish_ci
xxx_turkish_ci
xxx_unicode_ci

MySQL implements the xxx_unicode_ci collations according to the Unicode Collation Algorithm (UCA) described at http://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys: http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. Currently, the xxx_unicode_ci collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. The following discussion uses utf8_unicode_ci for concreteness.

For any Unicode character set, operations performed using the xxx_general_ci collation are faster than those for the xxx_unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect this has in comparisons or when doing searches, see Section 9.1.7.7, “Examples of the Effect of Collation”):

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci:

ß = ss

MySQL implements language-specific collations for the utf8 character set only if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German and French, so there is no need to create special utf8 collations for these two languages.

utf8_general_ci also is satisfactory for both German and French, except that “ß” is equal to “s”, and not to “ss”. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

utf8_swedish_ci, like other utf8 language-specific collations, is derived from utf8_unicode_ci with additional language rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

The xxx_spanish_ci and xxx_spanish2_ci collations correspond to modern Spanish and traditional Spanish, respectively. In both collations, “ñ” (n-tilde) is a separate letter between “n” and “o”. In addition, for traditional Spanish, “ch” is a separate letter between “c” and “d”, and “ll” is a separate letter between “l” and “m”

In the xxx_roman_ci collations, I and J compare as equal, and U and V compare as equal.

For all Unicode collations except the “binary” (xxx_bin) collations, MySQL performs a table lookup to find a character's collating weight. But if a character is not in the table (for example, because it is a “new” character), collating weight determination becomes more complex:

For BMP characters in general collations (for example, utf8_general_ci), weight = code point.
For BMP characters in UCA collations (for example, utf8_unicode_ci), the following algorithm applies:
```
if (code >= 0x3400 && code <= 0x4DB5)
  base= 0xFB80; /* CJK Ideograph Extension */
else if (code >= 0x4E00 && code <= 0x9FA5)
  base= 0xFB40; /* CJK Ideograph */
else
  base= 0xFBC0; /* All other characters */
aaaa= base +  (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;
```
The result is a sequence of two collating elements, aaaa followed by bbbb.

Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA currently is, with all UCA collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA. Eventually, after further collation tuning, all palochkas will sort together.
For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA collations, their collating weight is 0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.

An example with Deseret characters and COUNT(DISTINCT):
```
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
INSERT INTO t VALUES (0xfffd);   /* REPLACEMENT CHARACTER */
INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */
INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */
SELECT COUNT(DISTINCT s1) FROM t;
```
The result is 2 because in the MySQL Unicode collation, the replacement character has a weight of 0x0dc6, whereas Deseret Bee and Deseret Tee both have a weight of 0xfffd. (Were the utf32_general_ci collation used instead, the result would be 1 because all three characters have a weight of 0xfffd in that collation.)

The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:

ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

The utf16_bin Collation

There is a difference between “ordering by the character's code value” and “ordering by the character's binary representation,” a difference that appears only with utf16_bin, because of surrogates.

Suppose that utf16_bin (the binary collation for utf16) was a binary comparison “byte by byte” rather than “character by character.” If that were so, then the order of characters in utf16_bin would differ from the order in utf8_bin. For example, the following chart shows two rare characters. The first character is in the range E000-FFFF, so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.

Code point  Character                    utf8         utf16
----------  ---------                    ----         -----
0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D
10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84

The two characters in the chart are in order by code point value because 0xff9d < 0x10384. And they are in order by utf8 value because 0xef < 0xf0. But they are not in order by utf16 value, if we use byte-by-byte comparison, because 0xff > 0xd8.

So MySQL's utf16_bin collation is not “byte by byte.” It is “by code point.” When MySQL sees a supplementary-character encoding in utf16, it converts to the character's code-point value, and then compares. Therefore, utf8_bin and utf16_bin are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation: “UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11 — The Unicode scalar value of a character is its code point treated as an unsigned integer.”

If the character set is ucs2, comparison is byte-by-byte, but ucs2 strings should not contain surrogates, anyway.

For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).

Previous / Next / Up / Table of Contents

User Comments

Posted by Erel Segal on December 9 2006 7:42pm

[Delete] [Edit]

Note for Hebrew speakers: in utf8_general_ci, dots (Niqqud symbols) are treated as seperate characters, so
a. If you have dotted words in your table, they won't be ordered correctly.
b. You can have words with the same letters and different dots in a unique index column.

On the other hand, in utf8_unicode_ci, dots are igonred, so:
a. The order will be correct;
b. Words with the same letters and different dots will be regarded as equal, and you won't be able to have them in a unique index column.

I still didn't find a collation that treats both issues correctly.

Add your own comment.

Top / Previous / Next / Up / Table of Contents