Unicode and You

I’m a Unicode newbie. But like many newbies, I had an urge to learn once my interest was piqued by an introduction to Unicode.

Unicode isn’t hard to understand, but it does cover some low-level CS concepts, like byte order. Reading about Unicode is a nice lesson in design tradeoffs and backwards compatibility.

My thoughts are below. Read them alone, or as a follow-up to Joel’s unicode article above. If you’re like me, you’ll get an itch to read about the details in the Unicode specs or in Wikipedia. Really, it can be cool, I swear.

Key concepts

Let’s level set on some ideas:

Ideas and data are different. The idea of “A” is different from the marks on paper, different from the sound “aaay”, and different from the number 65 stored in a computer.

One idea has many possible encodings. An encoding is just a method to transform an idea (like the letter “A”) into raw data (bits and bytes). The idea of “A” can be encoded many different ways. Encodings differ in efficiency and compatibility.

Know thy encoding. When reading data, you must know the encoding used in order to interpret it properly. This is a simple but important concept. If you see the number 65 in binary, what does it really mean? “A” in ASCII? Your age? Your IQ? Unless there is some context, you’d never know. Imagine if someone came up to you and said “65”. You’d have no idea what they were talking about. Now imagine they came up and said “The following number is an ASCII character: 65”. Weird, yes, but see how much clearer it is?
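To make this concrete, here’s a quick Python sketch (Python is just my illustration tool, not part of the original point): the same byte only becomes “A” once we decide to read it as ASCII.

    # The byte value 65 is just data until we pick an interpretation for it.
    raw = bytes([65])

    print(raw.decode("ascii"))         # read as ASCII text -> "A"
    print(int.from_bytes(raw, "big"))  # read as a number   -> 65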

Embrace the philosophy that a concept and the data that stores it are different. Let it rustle around in your mind…

Got it? Let’s dive in.

Back to ASCII and Code Pages

You may have heard of the ASCII/ANSI character sets. They map the numeric values 0-127 to various Western characters and control codes (newline, tab, etc.). Note that the values 0-127 fit in the lower 7 bits of an 8-bit byte. ASCII does not explicitly define what the values 128-255 map to.

Now, ASCII encoding works great for English text (using Western characters), but the world is a big place. What about Arabic, Chinese and Hebrew?

To solve this, computer makers defined “code pages” that used the undefined space from 128-255 in ASCII, mapping it to various characters they needed. Unfortunately, 128 additional characters aren’t enough for the entire world: code pages varied by country (Russian code page, Hebrew code page, etc.).

If people with the same code page exchanged data, all was good. Character #200 on my machine was the same as Character #200 on yours. But if codepages mixed (Russian sender, Hebrew receiver), things got strange.

The character mapped to #200 was different in Russian and Hebrew, and you can imagine the havoc this caused for emails and birthday invitations. Whether anyone could read your message hinged on a big “if”: that they used the same code page you wrote the text with. If you visit an international website, for example, your browser could try to guess the codepage if it was not specified (“Hrm… this text has a lot of character #213 and #218… probably Hebrew”). But clearly this method was error-prone: codepages needed to be rescued.
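Here’s a small Python sketch of the problem (I’m using byte 0xE0 rather than #200, since its mapping in both code pages is easy to show): the same byte decodes to completely different letters depending on which code page the reader assumes.

    # One byte from the "undefined" 128-255 range...
    raw = bytes([0xE0])

    # ...decoded with the Windows Russian and Hebrew code pages.
    print(raw.decode("cp1251"))  # a Cyrillic letter
    print(raw.decode("cp1255"))  # a Hebrew letter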

Unicode to the Rescue

The world had a conundrum: they couldn’t agree on what numbers mapped to what letters in ASCII. The Unicode group went back to the basics: letters are abstract concepts. Unicode labeled each abstract character with a “code point”. For example, “A” mapped to code point U+0041 (the code point is written in hex; it’s 65 in decimal).
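If you have Python handy, you can poke at code points directly; a quick sketch (again, just an illustration of the idea):

    # ord() gives a character's code point; chr() goes the other way.
    print(hex(ord("A")))   # 0x41 -> code point U+0041
    print(chr(0x41))       # "A"
    print(hex(ord("é")))   # 0xe9 -> U+00E9, beyond plain ASCII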

The Unicode group did the hard work of mapping each character in every language to some code point (not without fierce debate, I am sure). When all was done, the Unicode standard left room for over 1 million code points, enough for all known languages with room to spare for undiscovered civilizations. For fun, you can browse the codepoints with the charmap utility (Start Menu > Run > Charmap) or online at Unicode.org.

This brings us to our first design decision: compatibility.

For compatibility with ASCII, code points U+0000 to U+007F (0-127) were the same as ASCII. Purists probably didn’t like this, because the full Latin character sets were defined elsewhere, and now one letter had 2 codepoints. Also, this put Western characters “first”, whereas Chinese, Arabic and the “nonstandard” languages were stuck in the non-sleek codepoints that require 2 bytes to store.

However, this design was necessary – ASCII was a standard, and if Unicode was to be adopted by the Western world it needed to be compatible, without question. Now, the majority of common languages fit into the first 65535 codepoints, which can be stored as 2 bytes.

Phew. The world was a better place, and everyone agreed on what codepoint mapped to what character.

But a problem remained: how do we store these code points as data?

Encoding to the Rescue

From above, an encoding transforms an idea into raw data. In this case, the idea is a code point.

For example, let’s look at the ASCII “encoding” scheme to store Unicode codepoints. The rules are pretty simple:

  • Code points from U+0000 to U+007F are stored in a single byte
  • Code points above U+0080 are dropped on the floor, never to be seen again

Simple, eh?

As you can see, ASCII isn’t great for storing Unicode – in fact, it ignores most Unicode codepoints altogether. If you have a Unicode document and save it as ASCII -wham- all your special characters are gone. You’ll often see this as a warning in some text editors when you save Unicode data in a file originally saved as ASCII.

But the example has a purpose. An encoding is a system to convert an idea into data. In this case, the conversion can be politely called “lossy”.
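Here’s a rough Python sketch of that lossy conversion (illustration only): encoding Unicode text as ASCII either fails outright or silently drops the characters it can’t represent.

    text = "Héllo"

    # Strict mode refuses to encode the non-ASCII character.
    try:
        text.encode("ascii")
    except UnicodeEncodeError as e:
        print("can't represent:", e.object[e.start:e.end])

    # "ignore" mode drops it on the floor, never to be seen again.
    print(text.encode("ascii", errors="ignore"))  # b'Hllo'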

I did Unicode experiments with Notepad (which can read/write Unicode) and Programmer’s Notepad, which has a hex view. I wanted to see the raw bytes Notepad saved. To try the examples yourself (there’s a Python sketch after the steps if you’d rather script it):

  • Open notepad and type “Hello”
  • Save file separately as ANSI, Unicode, Unicode Big Endian, UTF-8
  • Open file with Programmer’s Notepad and do View > View Hex
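And the scripted version: a small Python sketch that writes “Hello” in a few encodings and prints the raw bytes (note that the -le/-be variants below don’t write the BOM that Notepad adds; “utf-8-sig” is UTF-8 with a BOM):

    text = "Hello"

    for name in ("ascii", "utf-16-le", "utf-16-be", "utf-8-sig"):
        print(f"{name:10} {text.encode(name).hex(' ')}")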

All about ASCII

Let’s write “Hello” in Notepad, save it as ANSI (ASCII), and open it in the hex editor. It looks like this:

Byte:   48 65 6C 6C 6F
Letter: H  e  l  l  o

ASCII is important because many tools and communication protocols only accept ASCII characters. It’s a generally accepted minimum bar for text. Because of its universal acceptance, some Unicode encodings will transform codepoints into series of ASCII characters so they can be transmitted without issue.

Now, in the example above, we know the data is text because we authored it. If we randomly found the file, we could assume its contents were ASCII text, but it might be an account number or some other data that just happens to look like “Hello” in ASCII.

Usually we can make a good guess about what data is supposed to be, based on headers or “magic numbers” (special character sequences) that appear in certain places. But you can never be sure, and sometimes you will guess wrong.

Don’t believe me? Ok, do the following:

  • Open notepad
  • Write “this program can break”
  • Save the file as “blah.txt” (or anything else)
  • Open the file in notepad

Whoa, what happened? I’ll leave this as an exercise for the reader.

UCS-2 / UTF-16

This is the encoding I first thought of when I heard “Unicode” – store every character as 2 bytes (what a waste!). At a base level, this can handle codepoints 0x0000 to 0xFFFF, or 0-65535 for you humans out there. And 65,535 should be enough characters for anybody (there are ways to store codepoints above 65535, but read the spec for more details).

Storing data in multiple bytes leads to my favorite conundrum: byte order! Some computers store the little byte first, others the big byte.

To resolve the problem, we can do the following:

  • Option 1: Choose a convention that says all text data must be big- or little-endian. This won’t happen – computers on the wrong side of the decision would suffer a conversion penalty every time they opened a file, since they would have to swap it into their native byte order.
  • Option 2: Everyone agrees to a byte order mark (BOM), a header at the top of every file. If you open a file and the BOM is backwards, it means it was encoded with a different byte order and needs to be converted.

The solution was the BOM header: UCS-2 encodings could write codepoint U+FEFF as a file header. If you open a UCS-2 string and see FEFF, the data is in the right byte order and can be used directly. If you see FFFE, the data came from another type of machine, and needs to be converted to your architecture. This involves swapping every byte in the file.
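A minimal Python sketch of that BOM sniffing (the constants come from Python’s codecs module; this is just an illustration, not a full detector):

    import codecs

    def guess_utf16_order(data: bytes) -> str:
        # Check the first two bytes against the two possible BOMs.
        if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
            return "little-endian"
        if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
            return "big-endian"
        return "no BOM -- you have to guess"

    print(guess_utf16_order(codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")))
    print(guess_utf16_order("Hello".encode("utf-16-be")))  # no BOM here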

But unfortunately, things are not that simple. The BOM is actually a valid Unicode character – what if someone sent a file without a header, and that character was actually part of the file?

This is an open issue in Unicode. The suggestion is to avoid U+FEFF except as a header, and to use an alternative character instead (it has an equivalent).

This opens up design observation #2: Multi-byte data will have byte order issues!

ASCII never had to worry about byte order – each character was a single byte, and could not be misinterpreted. But realistically, if you see bytes 0xFEFF or 0xFFFE at the start of a file, there’s a good chance it’s a BOM in a Unicode text file. It’s probably an indication of byte order. Probably.

(Aside: UCS-2 stores data in flat 16-bit chunks. UTF-16 allows up to 20 bits to be split across two 16-bit characters, called a surrogate pair. Each character in a surrogate pair is an invalid Unicode character on its own, but together the pair yields a valid Unicode character.)
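A quick Python sketch of a surrogate pair (illustration only): the musical G clef, U+1D11E, is above U+FFFF and needs two 16-bit units in UTF-16.

    # U+1D11E encoded in big-endian UTF-16 becomes the surrogate pair D834 DD1E.
    clef = "\U0001D11E"
    print(clef.encode("utf-16-be").hex(" "))  # d8 34 dd 1e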

UCS-2 Example

Type “Hello” in notepad and save it as Unicode (little-endian UCS-2 is the native format on Windows):

Hello-little-endian:

Byte:    FF FE  48 00  65 00  6C 00  6C 00  6F 00
Meaning: header H      e      l      l      o

Save it again as Unicode Big Endian, and you get:

Hello-big-endian:

Byte:    FE FF  00 48  00 65  00 6C  00 6C  00 6F
Meaning: header H      e      l      l      o
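You can reproduce both dumps in Python (the BOM is added by hand here, since the -le/-be codecs leave it out):

    import codecs

    text = "Hello"
    little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    big    = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

    print(little.hex(" "))  # ff fe 48 00 65 00 6c 00 6c 00 6f 00
    print(big.hex(" "))     # fe ff 00 48 00 65 00 6c 00 6c 00 6f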

Observations

  • The header BOM (U+FEFF) shows up as expected: FF FE for little-endian, FE FF for big-endian
  • Letters use 2 bytes no matter what: “H” is 0x48 in ASCII, and 0x0048 in UCS-2
  • Encoding is simple. Take the codepoint in hex and write it out in 2 bytes. No extra processing is required.
  • The encoding is too simple. It wastes space for plain ASCII text that does not use the high-order byte. And ASCII text is very common.
  • The encoding inserts null bytes (0x00) which can be a problem. Old-school ASCII programs may think the Unicode string has ended when it gets to the null byte. On a little-endian machine, reading one byte at a time, you’d get to H (H = 0x4800) and then hit the null and stop. On a big endian machine, you’d hit the null first (H is 0x0048) and not even see the H in ASCII. Not good.

Design observation #3: Consider backwards compatibility. How will an old program read new data? Ignoring new data is good. Breaking on new data is bad.

UTF-8

UCS-2 / UTF-16 is nice and simple, but boy does it waste some bits. Not only does it double the size of ASCII text, but the converted ASCII might not even be readable due to the null characters.

Enter UTF-8. Its goal is to encode Unicode characters in a single byte where possible (ASCII), and to avoid breaking ASCII applications with null characters. It is the default encoding for XML.

Read the UTF-8 specs for more detail, but at a high level:

  • Code points 0 – 007F are stored as regular, single-byte ASCII.
  • Code points 0080 and above are converted to binary and stored (encoded) in a series of bytes.
  • The first “count” byte indicates the number of bytes for the codepoint, including the count byte itself. These bytes start with a run of 1s followed by a 0:

    110xxxxx (The leading “11” indicates 2 bytes in the sequence, including the “count” byte)

    1110xxxx (1110 -> 3 bytes in sequence)

    11110xxx (11110 -> 4 bytes in sequence)

  • Bytes starting with 10… are “data” bytes and contain information for the codepoint. A 2-byte example looks like this:

    110xxxxx 10xxxxxx

This means there are 2 bytes in the sequence. The x’s represent the binary value of the code point, which needs to be squeezed into the remaining bits.
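Here’s a small Python sketch of that 2-byte pattern (illustration only), using “£” (U+00A3): the code point’s bits get spread across the x positions of 110xxxxx 10xxxxxx.

    # U+00A3 encodes to the two UTF-8 bytes C2 A3.
    pound = "£".encode("utf-8")
    print(pound.hex(" "))                       # c2 a3
    print(" ".join(f"{b:08b}" for b in pound))  # 11000010 10100011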

Observations about UTF-8

  • There are no null bytes. All ASCII characters (0-127) are the same. Non-ASCII characters all start with “1” as the highest bit.
  • ASCII text is stored identically, and efficiently.
  • Unicode characters start with “1” as the high bit and can be ignored by ASCII-only programs (however, they may get dropped in some cases! See UTF-7 for more details).
  • There is a time-space tradeoff. There is processing to be done on every Unicode character, but this is a reasonable tradeoff.

Design principle #4

  • UTF-8 nicely handles the 80% case (ASCII) while making the other cases possible (Unicode). UCS-2 treats every case equally, but is inefficient in the 80% case in order to cover the rest. But UCS-2 is less processing-intensive than UTF-8, which requires bit manipulation on all Unicode characters.
  • Why does XML store data in UTF-8 instead of UCS-2? Is space or processing power more important when reading XML documents?
  • Why does Windows XP store strings as UCS-2 natively? Is space or processing power more important for the OS internals?

In any case, UTF-8 still needs a header to indicate how the text was encoded. Otherwise, it could be interpreted as straight ASCII with some codepage to handle values above 127. It still uses the U+FEFF codepoint as a BOM, but the BOM itself is encoded in UTF-8 (clever, eh?).

UTF-8 Example

Hello-UTF-8:

Byte:    EF BB BF  48 65 6C 6C 6F
Meaning: header    H  e  l  l  o

Again, the ASCII text is not changed in UTF-8. Feel free to use charmap to copy in some Unicode characters and see how they are stored in UTF-8. Or, you can experiment online.
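In Python, that UTF-8 BOM is exposed as a constant, and the “utf-8-sig” codec writes it for you (a quick sketch, illustration only):

    import codecs

    # The UTF-8 BOM is just U+FEFF encoded in UTF-8: three bytes, EF BB BF.
    print(codecs.BOM_UTF8.hex(" "))               # ef bb bf
    print("Hello".encode("utf-8-sig").hex(" "))   # ef bb bf 48 65 6c 6c 6f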

UTF-7

While UTF-8 is great for ASCII, it still stores Unicode data as non-ASCII characters with the high bit set. Some email protocols do not allow non-ASCII values, so UTF-8 data would not be sent properly. Systems that can handle data with anything in the high bit are “8-bit clean”; systems that require data to have values 0-127 (like SMTP) are not. So how do we send Unicode data through them?

Enter UTF-7. The goal is to encode Unicode data in 7 bits (0-127), which is compatible with ASCII. UTF-7 works like this:

  • Codepoints in the ASCII range are stored as ASCII, except for certain symbols (+, -) that have special meaning
  • Codepoints above ASCII are converted to binary, and stored in base64 encoding (stores binary information in ASCII)

How do you know which ASCII letters are real ASCII, and which are base64 encoded? Easy. ASCII characters between the special symbols “+” and “-” are considered base64 encoded.

The “-” acts like an escape suffix. If it follows a character, that item is interpreted literally. So “+-” is interpreted as “+”, with no special encoding. That’s how you store an actual “+” sign in UTF-7.

UTF-7 Example

Wikipedia has some UTF-7 examples, since Notepad can’t save as UTF-7.

“Hello” is the same as ASCII — we are using all ASCII characters and no special symbols:

Byte:   48 65 6C 6C 6F
Letter: H  e  l  l  o

“£1” (1 British pound) becomes:

+AKM-1

The characters “+AKM-” mean AKM should be decoded from base64 and converted to a codepoint, which maps to 0x00A3, the British pound symbol. The “1” is kept the same, since it is an ASCII character.
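Python happens to ship a “utf-7” codec, so you can check this example directly (a quick sketch, illustration only):

    # "£" (U+00A3) becomes the base64 run +AKM-, while plain ASCII passes through.
    print("£1".encode("utf-7"))        # b'+AKM-1'
    print("+".encode("utf-7"))         # b'+-'  (a literal plus sign, escaped)
    print(b"+AKM-1".decode("utf-7"))   # £1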

UTF-7 is clever, right? It’s essentially a Unicode-to-ASCII conversion that removes any characters with the high bit set. Most ASCII characters look the same, except for the special characters (- and +) that need to be escaped.

Wrapping it up – what I’ve learned

I’m still a newbie but have learned a few things about Unicode:

  • Unicode does not mean 2 bytes. Unicode defines code points that can be stored in many different ways (UCS-2, UTF-8, UTF-7, etc.). Encodings vary in simplicity and efficiency.
  • Unicode has more than 65,535 (16 bits’ worth of) characters. Encodings can represent these extra characters, but the first 65,535 cover most common languages.
  • You need to know a file’s encoding to read it correctly. You can often guess that a file is Unicode based on the Byte Order Mark (BOM), but confusion can still arise unless you know the exact encoding. Even text that looks like ASCII could actually be encoded with UTF-7; you just don’t know.

Unicode is an interesting study. It opened my eyes to design tradeoffs, and the importance of separating the core idea from the encoding used to save it.

Other Posts In This Series

  1. Number Systems and Bases
  2. The Quick Guide to GUIDs
  3. Understanding Quake's Fast Inverse Square Root
  4. A Simple Introduction To Computer Networking
  5. Swap two variables using XOR
  6. Understanding Big and Little Endian Byte Order
  7. Unicode and You
  8. A little diddy about binary file formats
  9. Sorting Algorithms
