
How computers store text

written 2023-04-02

So you know how computers store numbers: by representing them in binary using a bunch of on/off transistors. They store text by representing each character as a number. There are several schemes for doing this, called character encodings. First I'll talk about the simplest one: ASCII (American Standard Code for Information Interchange).

ASCII

In ASCII, each character is one byte, which would allow for 256 possible characters. But ASCII actually only has 128 characters, because it was originally designed as a 7-bit code. They're represented as the numbers 0-127, so bytes with values >= 128 are invalid in ASCII.
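For example, here's how that looks in Python (just a quick illustration using its standard ascii codec): decoding bytes 0-127 works fine, while any byte 128 or higher raises an error.

```
# Bytes 0-127 are valid ASCII and decode to characters.
print(bytes([72, 105, 33]).decode('ascii'))  # prints "Hi!"

# A byte >= 128 is not valid ASCII, so decoding fails.
try:
    bytes([200]).decode('ascii')
except UnicodeDecodeError as error:
    print(error)
```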

The 128 ASCII characters include everything you can easily type on an American keyboard: uppercase letters, lowercase letters (those are separate characters), digits, punctuation, and several "control codes" like newline (enter), tab, backspace, delete, "NUL" (value 0), and a whole bunch more that are never used nowadays.
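If you want to look these numbers up yourself, Python's built-in ord and chr functions (to pick one language) convert between a character and its number:

```
print(ord('A'))   # 65
print(ord('a'))   # 97 (lowercase letters are separate characters)
print(ord('0'))   # 48 (the digit 0, not the NUL control code)
print(ord('\n'))  # 10 (the newline control code)
print(chr(36))    # $
```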

Of course, the crucial limitation of ASCII is that it only supports English text, and really only American English (for example, there's a dollar sign but no British pound sign). Even if the values 128-255 had been used, that's obviously not enough characters to support all the world's languages.

Wikipedia has a full list of the ASCII characters:

https://wikipedia.org/wiki/ASCII

Encoding zoo

This isn't really relevant to modern computing, but I think it's worth knowing the historical context.

Before 1992 (when UTF-8 was introduced), computing was full of other encodings designed to address the limitations of ASCII. For example, Latin-1 is an extension of ASCII that uses the values 128-255 to support other languages that use the Latin alphabet. Shift JIS is a completely separate character encoding meant for Japanese.

The problem was that these encodings were all incompatible with each other. To read a text document, you had to know which character encoding it used. If you opened a Shift JIS document in a text editor that thought it was Latin-1, it would just show up as gibberish. This also meant you couldn't easily have a document containing both Latin and Japanese characters (like a document teaching English speakers how to speak Japanese).
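You can reproduce this mismatch today. Here's a sketch in Python, using its shift_jis and latin-1 codecs to play the part of a Latin-1 text editor opening a Shift JIS file:

```
# Japanese text saved with the Shift JIS encoding...
data = 'こんにちは'.encode('shift_jis')

# ...opened by a program that assumes Latin-1. Latin-1 assigns a
# character to every byte value, so there's no error - just gibberish.
print(data.decode('latin-1'))  # something like '±ñÉ¿Í' mixed with control characters
```

This kind of corruption was so common it has a Japanese name: mojibake.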

Unicode

Unicode is a huge standard that assigns a number (called a "code point") to every character used by every language, as well as thousands of other characters like mathematical symbols, miscellaneous symbols, and emoji. This is great because it lets text in all languages share one character set, so software can render a document without having to guess which one it uses. You still need to have fonts installed that support the characters used, of course.
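The ord function from earlier also shows Unicode code points, since Python characters are Unicode:

```
print(ord('A'))   # 65 (same as its ASCII value)
print(ord('é'))   # 233
print(ord('日'))  # 26085
print(ord('😀'))  # 128512
```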

There are actually multiple ways of encoding Unicode text, but the one basically everything uses is called UTF-8.

UTF-8 encodes each character as 1-4 bytes. The first 128 Unicode characters - which are the same as the ASCII characters, for backwards compatibility - are encoded as one byte each. This is great because it means all ASCII text is automatically valid UTF-8 as well, so existing ASCII documents never had to be rewritten.
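You can see the variable length by encoding a few characters (again sketched in Python):

```
for character in 'A', 'é', '日', '😀':
    encoded = character.encode('utf-8')
    print(character, len(encoded), encoded)
# A 1 b'A'
# é 2 b'\xc3\xa9'
# 日 3 b'\xe6\x97\xa5'
# 😀 4 b'\xf0\x9f\x98\x80'
```

Notice that every byte of a multi-byte character is 128 or higher. That's a deliberate design choice: no byte of a multi-byte character can ever be mistaken for an ASCII character.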
