Unicode collation algorithm

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Standard #10 (UTS #10). It is a customizable method for producing binary sort keys from strings of text in any writing system and language that Unicode can represent. These keys can then be compared efficiently, byte by byte, to collate or sort the strings according to the rules of the language, with options for ignoring case, accents, and other distinctions.
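The idea of turning a string into a binary key that can be compared byte by byte can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tiny weight table, the function names `collation_elements` and `sort_key`, and the single shared weight for all accents are hypothetical and are not the real collation data or the full algorithm.

```python
import unicodedata

# Hypothetical miniature weight table for a-z; NOT the real collation data.
PRIMARY = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}

def collation_elements(text):
    """Yield toy (primary, secondary, tertiary) weights per character."""
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accents are ignorable at the primary level in this toy model;
            # all accents share one secondary weight here (a simplification).
            yield (0, 2, 1)
        else:
            base = ch.lower()
            primary = PRIMARY.get(base, 0xFF)   # unknown characters sort last
            tertiary = 2 if ch != base else 1   # uppercase after lowercase
            yield (primary, 1, tertiary)

def sort_key(text, strength=3):
    """Build a byte key: all primary weights, then secondary, then tertiary.

    strength=1 ignores accents and case; strength=2 ignores case only.
    """
    elements = list(collation_elements(text))
    key = bytearray()
    for level in range(strength):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(w[level] for w in elements if w[level])
    return bytes(key)

words = ["rule", "rôle", "Role", "role"]
ordered = sorted(words, key=sort_key)
```

Because each key is a plain `bytes` object, ordinary byte comparison yields the collation order, and lowering `strength` makes strings that differ only in accents or case produce identical keys.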

Unicode Technical Standard #10 also specifies the Default Unicode Collation Element Table (DUCET), a data file that defines a default collation ordering. The DUCET can be customized (tailored) for different languages. Some such customizations can be found in the Unicode Common Locale Data Repository (CLDR).
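The effect of customizing a default ordering for a language can be sketched as follows. This is a toy model under stated assumptions: `DEFAULT_ORDER`, `DEFAULT_VARIANTS`, and `make_key` are hypothetical names, the default table is a stand-in rather than the actual DUCET, and the language-specific rule shown (Swedish treating å, ä, ö as separate letters after z) is hard-coded instead of being read from CLDR data.

```python
import unicodedata

# Toy stand-in for a default table: primary order only (not the real DUCET).
DEFAULT_ORDER = "abcdefghijklmnopqrstuvwxyz"
# In the toy default, these letters are accented variants of a base letter.
DEFAULT_VARIANTS = {"å": "a", "ä": "a", "ö": "o"}

def make_key(order, variants):
    """Return a key function for the given letter order and variant map."""
    rank = {ch: i + 1 for i, ch in enumerate(order)}
    unknown = len(order) + 1  # letters outside the table sort last

    def key(word):
        word = unicodedata.normalize("NFC", word.lower())
        # Primary level: rank of the base letter.
        primary = tuple(rank.get(variants.get(ch, ch), unknown) for ch in word)
        # Secondary level: accented variants sort after their base letter.
        secondary = tuple(1 if ch in variants else 0 for ch in word)
        return (primary, secondary)

    return key

default_key = make_key(DEFAULT_ORDER, DEFAULT_VARIANTS)
# Hypothetical tailoring: å, ä, ö become distinct letters after z.
swedish_key = make_key(DEFAULT_ORDER + "åäö", {})

words = ["zebra", "äpple", "apa", "öra", "orm"]
default_sorted = sorted(words, key=default_key)
swedish_sorted = sorted(words, key=swedish_key)
```

With the toy default, "äpple" sorts among the words beginning with "a"; with the tailored order it moves after "zebra". The underlying comparison logic is unchanged, and only the table differs.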

An open-source implementation of the UCA is included in International Components for Unicode (ICU). ICU supports tailoring and includes the collation tailorings from CLDR. The effects of tailoring, along with many language-specific tailorings, are displayed in the online ICU Locale Explorer.

External links

Unicode Collation Algorithm: Unicode Technical Standard #10
Mimer SQL Unicode Collation Charts

Tools

ICU Locale Explorer: An online demonstration of the Unicode Collation Algorithm using International Components for Unicode
msort: A sort program that provides an unusual level of flexibility in defining collations and extracting keys.


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
