UTF-1

UTF-1 is a method of transforming ISO 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes substring searching and error recovery difficult. It reuses bytes in the ASCII printable range within multi-byte encodings, which makes it unsuitable for some uses (for instance, Unix filenames cannot contain the byte value used for the forward slash). UTF-1 is also slow to encode and decode because it relies on division and multiplication by 190, a number that is not a power of 2. Because of these issues, it did not gain acceptance and was quickly replaced by UTF-8.

UTF-1
MIME / IANA: ISO-10646-UTF-1
Language(s): International
Current status: Obscure, of mainly historical interest.
Classification: Unicode Transformation Format, extended ASCII, variable-width encoding
Extends: US-ASCII
Transforms / Encodes: ISO 10646 (Unicode)
Succeeded by: UTF-8

    Design

    Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. ASCII is supported via the single-byte encodings, which, unlike those of UTF-8, also include the non-ASCII code points U+0080 through U+009F.

    UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. This design, with its 66 protected characters, was intended to be compatible with ISO 2022.
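The count of 66 can be checked directly from the two byte ranges given above. The following is a minimal sketch; the names `PROTECTED` and `UNPROTECTED` are made up for illustration.

```python
# Bytes that never appear inside a UTF-1 multi-byte sequence,
# taken from the two ranges quoted in the text: 0x00-0x20 and 0x7F-0x9F.
PROTECTED = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))

# Every other byte value remains available as part of a multi-byte sequence.
UNPROTECTED = [b for b in range(256) if b not in PROTECTED]

print(len(PROTECTED))        # 66
print(len(UNPROTECTED))      # 190
print(0x2F in UNPROTECTED)   # True: the ASCII slash is not protected
```

The last line shows why the encoding is awkward for Unix filenames: the byte 0x2F is one of the values that may occur in the middle of a multi-byte character.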

    UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
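The practical difference between the two radices can be illustrated with a small sketch. The helper names below are hypothetical, and the functions only split a number into digits; they are not complete encoders. Radix 190 requires a true division for each digit, whereas radix 64 needs only shifts and masks.

```python
def digits_base_190(value, count):
    """Split value into `count` base-190 digits, most significant first."""
    digits = []
    for _ in range(count):
        value, digit = divmod(value, 190)  # real division: 190 is not a power of 2
        digits.append(digit)
    return digits[::-1]


def digits_base_64(value, count):
    """Split value into `count` base-64 digits, most significant first."""
    digits = []
    for _ in range(count):
        digits.append(value & 0x3F)  # low six bits
        value >>= 6                  # shift instead of divide
    return digits[::-1]


# The radix arithmetic quoted in the text.
assert 256 - 66 == 190
assert 2 ** (8 - 2) == 64
assert 256 - 13 == 243
```

For example, `digits_base_190(190, 2)` gives `[1, 0]` and `digits_base_64(0x7FF, 2)` gives `[31, 63]`.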

    code point    UTF-8                UTF-1
    U+007F        7F                   7F
    U+0080        C2 80                80
    U+009F        C2 9F                9F
    U+00A0        C2 A0                A0 A0
    U+00BF        C2 BF                A0 BF
    U+00C0        C3 80                A0 C0
    U+00FF        C3 BF                A0 FF
    U+0100        C4 80                A1 21
    U+015D        C5 9D                A1 7E
    U+015E        C5 9E                A1 A0
    U+01BD        C6 BD                A1 FF
    U+01BE        C6 BE                A2 21
    U+07FF        DF BF                AA 72
    U+0800        E0 A0 80             AA 73
    U+0FFF        E0 BF BF             B5 48
    U+1000        E1 80 80             B5 49
    U+4015        E4 80 95             F5 FF
    U+4016        E4 80 96             F6 21 21
    U+D7FF        ED 9F BF             F7 2F C3
    U+E000        EE 80 80             F7 3A 79
    U+F8FF        EF A3 BF             F7 5C 3C
    U+FDD0        EF B7 90             F7 62 BA
    U+FDEF        EF B7 AF             F7 62 D9
    U+FEFF        EF BB BF             F7 64 4C
    U+FFFD        EF BF BD             F7 65 AD
    U+FFFE        EF BF BE             F7 65 AE
    U+FFFF        EF BF BF             F7 65 AF
    U+10000       F0 90 80 80          F7 65 B0
    U+38E2D       F0 B8 B8 AD          FB FF FF
    U+38E2E       F0 B8 B8 AE          FC 21 21 21 21
    U+FFFFF       F3 BF BF BF          FC 21 37 B2 7A
    U+100000      F4 80 80 80          FC 21 37 B2 7B
    U+10FFFF      F4 8F BF BF          FC 21 39 6E 6C
    U+7FFFFFFF    FD BF BF BF BF BF    FD BD 2B B9 40

    Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
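The UTF-1 column of the table can be reproduced with a short encoder. The following is a minimal sketch rather than a reference implementation: the range boundaries (0xA0, 0x100, 0x4016, 0x38E2E), the lead-byte bases (0xA0, 0xA1, 0xF6, 0xFC) and the digit-to-byte offsets are not stated in the text above. They follow the commonly described UTF-1 definition and should be treated as assumptions, checked here only against rows of the table. The names `utf1_encode` and `_trail` are made up for illustration.

```python
def _trail(digit):
    """Map a base-190 digit (0..189) onto one of the 190 unprotected bytes."""
    if not 0 <= digit < 190:
        raise ValueError("digit out of range")
    # Digits 0..93 land in 0x21..0x7E, digits 94..189 land in 0xA0..0xFF.
    return digit + 0x21 if digit < 0x5E else digit + 0x42


def utf1_encode(cp):
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 bytes (sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:                      # one byte: ASCII plus U+0080..U+009F
        return bytes([cp])
    if cp < 0x100:                     # two bytes with the fixed lead byte 0xA0
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes: lead byte plus one digit
        hi, lo = divmod(cp - 0x100, 190)
        return bytes([0xA1 + hi, _trail(lo)])
    if cp < 0x38E2E:                   # three bytes: lead byte plus two digits
        rest, d0 = divmod(cp - 0x4016, 190)
        hi, d1 = divmod(rest, 190)
        return bytes([0xF6 + hi, _trail(d1), _trail(d0)])
    rest = cp - 0x38E2E                # five bytes: lead byte plus four digits
    trail = []
    for _ in range(4):
        rest, digit = divmod(rest, 190)
        trail.append(_trail(digit))
    return bytes([0xFC + rest] + trail[::-1])


# Spot checks against rows of the table above.
assert utf1_encode(0x015E) == bytes.fromhex("A1A0")
assert utf1_encode(0xD7FF) == bytes.fromhex("F72FC3")
assert utf1_encode(0x10FFFF) == bytes.fromhex("FC21396E6C")
assert utf1_encode(0x7FFFFFFF) == bytes.fromhex("FDBD2BB940")
```

The result for U+D7FF contains the byte 0x2F, the ASCII forward slash. This is the kind of sequence that makes UTF-1 unsuitable for Unix filenames.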

    This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.