Concept and Encoding for UTF-8

Concept of UTF-8

UTF-8 is one of the most widely used encodings for Unicode characters. It is a variable-length encoding that uses 1 to 4 bytes per character. This allows characters from all the world's writing systems to be represented in a single character set, and because it is compatible with ASCII, ASCII characters take just 1 byte.

Key Features of UTF-8

  1. Variable-length character encoding : Each Unicode character is encoded in 1 to 4 bytes, so the storage required depends on the character.
  2. ASCII compatibility : UTF-8 includes the ASCII character set unchanged. ASCII characters are encoded as the same single byte in UTF-8, so an existing ASCII text file is already valid UTF-8 and can be read without problems.
  3. Self-synchronization : The first byte of a character and the bytes that follow it have distinct bit patterns, so a decoder can start at any position in a byte string and find a valid character boundary.
  4. Network efficiency : It is widely used in e-mail and on the web. It is compact for text that consists mainly of ASCII characters, such as English, so there is little overhead when transmitting such data.
  5. Error detection : Because the encoding rules are strict, it is easy to detect data that has been damaged or encoded incorrectly.
  6. General purpose : It can express the characters of any language in the world. For this reason, it is used as a standard in Internet and software development worldwide.
  7. Backward compatibility : It is designed to remain compatible with ASCII, so the ASCII portion of a UTF-8 document can be processed exactly as it would be under ASCII encoding.
  8. Economical : Since a large share of web content is in English, it is a very economical encoding for such text, at 1 byte per character.
  9. Scalability : It can express every character supported by Unicode, and characters added in future Unicode versions can be encoded with the same scheme.
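A few of these features can be checked directly. The following is only a minimal Python sketch: the helper names char_start and is_valid_utf8 are made up for this illustration, and strict decoding with the built-in bytes.decode stands in for a real validator.

```python
def char_start(data: bytes, pos: int) -> int:
    """Step back from pos to the first byte of the character that contains it."""
    # Continuation bytes always look like 10xxxxxx, so skip back over them.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "A한😂".encode("utf-8")     # 1 + 3 + 4 = 8 bytes

# Self-synchronization: from the middle of a character, find where it starts.
print(char_start(text, 2))          # 1 (index 2 is inside '한')
print(char_start(text, 6))          # 4 (index 6 is inside '😂')

# Error detection: a character cut off mid-sequence is rejected.
print(is_valid_utf8(text))          # True
print(is_valid_utf8(text[:3]))      # False

# ASCII compatibility: the same bytes decode identically either way.
print(b"hello".decode("ascii") == b"hello".decode("utf-8"))  # True
```

Looking only at the top two bits of each byte is enough to find a boundary, which is why a reader never has to rescan a string from the beginning.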

How UTF-8 Encoding Works

The number of bytes and the bit layout depend on the Unicode code point of the character, as follows.

  1. One-byte characters (U+0000 ~ U+007F) : ASCII characters are represented by one byte of the form 0xxxxxxx.
    • For example, ‘A’ is Unicode U+0041 and is encoded as 0x41.
  2. Two-byte characters (U+0080 ~ U+07FF) : The first byte has the form 110xxxxx, and the second byte has the form 10xxxxxx. Each ‘x’ is a bit taken from the Unicode code point.
    • For example, ‘¢’ (cent symbol) is Unicode U+00A2 and is encoded as 0xC2 0xA2.
  3. Three-byte characters (U+0800 ~ U+FFFF) : The first byte has the form 1110xxxx, and each of the next two bytes has the form 10xxxxxx. The surrogate range U+D800 ~ U+DFFF is reserved for UTF-16 and is not encoded in UTF-8.
    • For example, ‘한’ is Unicode U+D55C and is encoded as 0xED 0x95 0x9C.
  4. Four-byte characters (U+10000 ~ U+10FFFF) : The first byte has the form 11110xxx, and each of the next three bytes has the form 10xxxxxx.
    • For example, the emoji ‘😂’ is Unicode U+1F602 and is encoded as 0xF0 0x9F 0x98 0x82.
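The four cases above can be written out as a small hand-rolled encoder. This is only a sketch for illustration: encode_utf8 is a hypothetical name, and in practice Python's built-in str.encode("utf-8") does the same job. It is used here only to cross-check the result.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point using the bit patterns listed above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "A¢한😂":
    encoded = encode_utf8(ord(ch))
    # Cross-check against Python's own encoder.
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:02X}" for b in encoded))
# U+0041 41
# U+00A2 C2 A2
# U+D55C ED 95 9C
# U+1F602 F0 9F 98 82
```

Each branch shifts the code point right by multiples of 6 bits and masks with 0x3F, which is exactly the "fill the x positions" step in the bit patterns.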

Hangul Encoding and My Thoughts

The Unicode range of Korean syllables is U+AC00 ~ U+D7A3, which falls in the three-byte range, so each syllable takes 3 bytes when encoded with the method above. In the past, I assumed that a Hangul character was always 2 bytes, but after learning how UTF-8 works, I realized that the size depends on the encoding and that it is 3 bytes in UTF-8. I would like to look into this more deeply when I have the opportunity.
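The byte counts are easy to confirm. The sketch below assumes that the "2 bytes" impression comes from legacy or fixed-width encodings such as EUC-KR or UTF-16. The text above does not say so, so the comparison is only an illustration. It uses Python's standard "euc-kr" and "utf-16-le" codecs.

```python
first, last = 0xAC00, 0xD7A3
syllables = [chr(cp) for cp in range(first, last + 1)]

print(len(syllables))                                        # 11172
# Every precomposed Hangul syllable takes exactly 3 bytes in UTF-8.
print(all(len(s.encode("utf-8")) == 3 for s in syllables))   # True

# The same characters in two other encodings, for comparison.
for ch in "한글":
    print(ch,
          len(ch.encode("utf-8")),       # 3
          len(ch.encode("euc-kr")),      # 2
          len(ch.encode("utf-16-le")))   # 2
```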
