A Long Explanation of Character Encodings and UTF-8 and the IMC Software

This was posted as a correction to a note I made about some character encoding errors that appeared on the LA Indymedia website. It's poorly written. If you need me to edit and clarify, send an email to johnk@riceball.com.

It's ISO 8859, not 8890.

There's another relevant encoding called Windows-1252, aka ANSI (which I just learned is a misnomer).

Here's an explanation of character sets (charsets). Computers can deal only with 1s and 0s. A signal is high or low, on or off. To do math, numbers are (more or less) represented as aggregates of 1s and 0s, using binary numbers. So there's a mapping between binary numbers and what we think of as numbers - the numbers on the left are in binary, and the numbers on the right are in decimal:

0 = 0
1 = 1
10 = 2
11 = 3
100 = 4
101 = 5
110 = 6
111 = 7
etc.
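Since this is just base conversion, you can check the table above in any language; here's a quick sketch in Python (which I'll use for these illustrations because it makes number- and byte-level checks concise):

```python
# Verify the binary-to-decimal mapping from the table above.
for binary, decimal in [("0", 0), ("1", 1), ("10", 2), ("11", 3),
                        ("100", 4), ("101", 5), ("110", 6), ("111", 7)]:
    assert int(binary, 2) == decimal       # binary string -> decimal
    assert format(decimal, "b") == binary  # decimal -> binary string
```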

Digression : how binary works (you may skip this section)

The way binary works is identical to the way we represent values larger than 9: we use the ten numerical symbols we have, but arrange them in "places".

Look at this sequence of statements:
x is 1 "x".
xx is 2 "x"s.
xxx is 3 "x"s.
xxxxxxxxx is 9 "x"s.
xxxxxxxxxx is 10 "x"s.

Notice that there is no symbol to represent "ten". Rather, we use the "one" symbol in the tens place, followed by the "zero" symbol in the ones place.

Likewise, think about this:

xxxxxxxxxxxxxx is 14 "x"s.

Again, there is no symbol to represent "fourteen". Instead, we use "14", which is "one" in the tens place, and "four" in the ones place. It means 10 + 4, which is fourteen.
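The place-value arithmetic above can be written out as a one-line check, sketched in Python:

```python
# "14" means one in the tens place plus four in the ones place.
assert 1 * 10 + 4 * 1 == 14

# Binary works identically, but each place is worth a power of two:
# 1110 means 8 + 4 + 2 + 0, which is also fourteen.
assert 1 * 8 + 1 * 4 + 1 * 2 + 0 * 1 == 14
assert int("1110", 2) == 14
```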

Though the computer doesn't really "know" anything, much less numbers, it can simulate that knowledge very well by building on the binary representation of numbers. Entire sections of computer chips are dedicated to doing math.

Mapping numbers to characters

In the same way that computers aggregate bits into numbers, they map numbers to characters or glyphs. A glyph is a drawing of a letter. Think of "hieroglyphs."

The mapping of numbers to glyphs is like this:

A = 65
B = 66
C = 67
etc.

So a data file with the word "CAB" is saved as: 67, 65, 66.

(That, in turn, is converted to electrical signals that alter the magnetism on a disk, and those are recorded in binary, and look like this, more or less: 010000110100000101000010)

When the file is read in, the numbers are read, and then translated back to "CAB".
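You can watch that round trip in Python, where `ord()` and `chr()` expose the number behind each character:

```python
# "CAB" is stored as the numbers 67, 65, 66...
assert [ord(c) for c in "CAB"] == [67, 65, 66]

# ...and reading the file turns the numbers back into "CAB".
assert "".join(chr(n) for n in [67, 65, 66]) == "CAB"

# On disk, each number occupies 8 binary digits.
assert format(67, "08b") == "01000011"  # "C"
```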

Why is A 65 and not 1? Well, that's an interesting question, but the best answer is "someone picked 65". The mapping is basically arbitrary but has its roots in history. I'm sure that, at some point, someone built a computer that set "A" at the value of "1".

Of course, imagine what happens if you copy data from a computer where A=1 over to a computer where A=65.

The data is stored as: 3, 1, 2 on the first computer.
It's read as, 3, 1, 2 on the second computer... but that doesn't map to any characters!

Thus, 3,1,2 is, at best, meaningless. At worst, imagine that on the second computer, there was this mapping:

NULL = 0
SOH = 1
STX = 2
ETX = 3

Now, that's weird. I'll explain these codes, which are meaningless today but had meaning in the 1950s. SOH means "start of heading", STX means "start of text", and ETX means "end of text". These were data transmission control signals between a computer and a teletypewriter.

So, these things called mappings matter. Once upon a time, different telegraph and computer companies used different mappings, and they could not communicate with each other. The problem was eventually worked out and a standard mapping called ASCII was created in the early 1960s. It's pronounced "ass-key". Seriously.

(Despite this standard, companies like Telex didn't conform. Why? Probably because the incompatible hardware and character set was a way to lock customers into their data service. If you know what Telex is, you're probably older than I. It was a text messaging system similar to email or SMS, except to use it, you needed to be wired into the Telex system, rent a 100 pound teletypewriter, and pay a lot of money to Telex. Since it was so expensive, regular people used services like Western Union to send "telegrams" which were basically like very short printed-out emails or SMS, delivered to your door by a bike courier.)

ASCII was awesome. Now, people could take a data tape from a Sperry-Univac and load it into a Burroughs or a Control Data. Well, in theory... if it was all text. But a great breakthrough is still a great breakthrough.

Of course, in a capitalist system, commodification leads to a decline in prices due to market competition, so companies always seek to maintain a monopoly by making their products unique. Note that I didn't say "better." Customers like standards, but capitalist bankers (if they knew what encoding standards were) would abhor them after watching standards decimate profit margins as cheaper competition entered the market.

That's why you couldn't necessarily load that ASCII-encoded data tape described above onto an IBM until the late 1970s. They used a mapping called EBCDIC, and it wasn't ASCII. If you had IBM, you had to buy more IBM equipment. (EBCDIC is pronounced ehb-s'dick.)

Furthermore, over time, the different computer companies extended ASCII with additional characters. They added drawing characters or accented characters, for example. These arbitrary mappings were, again, incompatible with each other. Despite these problems, they all had ASCII within them, so life was tolerable. English can be adequately represented in ASCII, and will be for the foreseeable future.

(By the way, you're probably wondering why one couldn't simply convert files from one encoding to another. You could only if you knew all the data was textual. The way data files are constructed, numeric, textual, and other data are mixed, so conversion requires being able to identify the parts that are textual.)

Digression again, into more encoding history

IBM is an old company. Their encoding of numbers to letters predates computers, and goes back to their tabulating machines in the early twentieth century. This punched card encoding, and the physical limitations of punched cards, which became delicate if too many holes were punched, influenced the design of EBCDIC. This is described in the Wikipedia article about Punched Cards.

Punched cards had 80 characters per card, which is why computer screens, for many years, had 80 characters per line, and why printers printed 80 characters per line.

The Mac

Everything was cool with the ASCII standard, until the Macintosh happened. (That peace lasted around twenty years.) The Mac pioneered low-cost desktop publishing. It didn't "invent" it - there were a number of different electronic typesetting systems in existence since the 60s, and in the 70s, computers were controlling these typesetters. In the late 70s, a system called TeX was invented to typeset math books. What Apple did was meld a low-cost laser printer running Adobe PostScript with the Macintosh; both had fonts.

The main thing that distinguished Mac printouts from other computer printouts was the fonts. And one feature that made desktop publishers orgasmic was "smart quotes" or "curly quotes", which are the quotes that look like 6 and 9, and are paired up. Those are typesetters' quotes, and they look way better than the straight quotes that typewriters produced.

ASCII didn't have curly quotes.

I know this sounds stupid, but at the time it was really important. If you find a copy of The Little Mac Book at a thrift shop, you can read about the issue. It merited a small section.

So Apple invented their own encoding, called Mac Roman. Quotes were at codes 210-213.

Microsoft was lagging, and so they created Windows-1252. Quotes were at 145-148.

These mappings also had a lot of other glyphs like accented letters. As you may have guessed, these glyphs were at different codes, too.
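Python still ships codecs for both mappings (named `mac_roman` and `cp1252`), so the mismatch is easy to demonstrate:

```python
# In Mac Roman, byte 212 is the left curly quote...
assert bytes([212]).decode("mac_roman") == "\u2018"
# ...but in Windows-1252, the same byte is an accented O.
assert bytes([212]).decode("cp1252") == "\u00d4"  # Ô

# Windows-1252 puts the left curly quote at byte 145 instead,
# where Mac Roman has an accented e.
assert bytes([145]).decode("cp1252") == "\u2018"
assert bytes([145]).decode("mac_roman") == "\u00eb"  # ë
```

This is exactly why a document written on a Mac showed garbage accents when opened on Windows, and vice versa.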

The World

The quote situation was bad, but even more challenging was the fact that non-Western countries were also buying computers. The demand for typesetting in Asian and Arabic languages was huge because, unlike with Roman lettering, it was difficult to build typewriters for hand-drawn scripts and huge character sets.

Yes, you could do typesetting, usually by hand in the old manner, but typing was a much tougher thing. The Chinese invented moveable type. Setting Chinese in type is easy, because the characters are square. Making a typewriter that contains 2000+ Chinese symbols is not easy.

Arabic has only 28 letters, but there are dots above and below the letters, and the shapes of the letters change depending on where they fall in a word. The letters are different widths, too. So the Arabic typewriter is more complex than the Roman typewriter, which typically has around 70 characters (upper and lower case and numbers), all the same width, and no accents.

American typing needs were simple, so we could enjoy typewriters. Asian and Middle Eastern typing needs were more complex, so computers were a compelling technology. In the 90s, I remember seeing ads in English language computer magazines for software that supported producing printing in Chinese, Japanese, Korean, Hebrew, and different Arabic scripts. Imagine that.

So computer software companies rushed to implement new technologies to allow for different character sets.

So if you wanted to write in Arabic, you would use an Arabic encoding.

A lot of different encodings were created, and they caused a lot of headaches. For example, the UK variant of ASCII puts a pound sign (£) at the code for #. Not cool. Imagine if that happened with the yen? How would you print out exchange rates? Standards were created, and eventually something like ISO 8859-1 emerged, which includes currency symbols for pounds, yen, dollars, and cents. No peso, no euro, no baht, no won. It didn't matter much because there were millions of computer users in the first world, and not many in the third world. It was the 80s. Globalization wasn't huge yet. Computers were still expensive, and the Third World was still poor.
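The currency gap is still visible in any Latin-1 codec; a quick check in Python:

```python
# ISO 8859-1 (Latin-1) has codes for pounds, yen, and cents...
assert "£".encode("latin-1") == bytes([0xA3])
assert "¥".encode("latin-1") == bytes([0xA5])
assert "¢".encode("latin-1") == bytes([0xA2])

# ...but the euro sign simply has no code in Latin-1.
try:
    "€".encode("latin-1")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised
```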

At this point, I need to get back into binary.

Characters, at the time, could have codes ranging from 0 to 255. That limit was imposed by the fact that the basic numeric unit in computers, then as now, was the "byte". A byte is 8 binary digits: 00000000. As we noted above, numbers don't really exist inside a computer; it only has electrical signals representing 1 and 0, called "bits". To make numbers, you use groups of bits. The normal size of a group of bits is 8.

Why 8? Let's just say there are a lot of economic reasons going back to the 1950s. There were computers with larger and smaller groupings. It ended up at 8 mainly because of the need for text, and because of standards like ASCII, which is 7 bits but was generally extended by companies to an 8th bit to add more characters.
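The arithmetic behind that is simple: seven bits cover all of ASCII, and the eighth bit doubles the space, leaving room for each vendor's extensions. In Python:

```python
assert 2 ** 7 == 128   # 7 bits: codes 0-127, all of ASCII
assert 2 ** 8 == 256   # 8 bits: codes 0-255, 128 extra "extended" codes

# Plain ASCII text never sets the 8th bit:
assert all(b < 128 for b in b"CAB")
```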

8 bits is not enough to encode characters in Chinese, meaning it's also inadequate for Japanese communications, which use Chinese characters. To handle Chinese, you needed to use multiple bytes. Apple worked out this problem and implemented Japanese support, and sold a ton of Macs into Japan. Back then, Japan had emerged as not only an economic superpower, but also a client state within the United States sphere of influence. So they had money and could buy Macs. China was still trying to break into western markets.

More languages worked, but the bilingualism problem persisted. You could not easily mix languages, especially if you wanted to move the data off of a Mac, but even within a Mac it wasn't reliable.

So in the late 80s, engineers at Apple and Xerox convened to invent Unicode, a single encoding for all characters. With 16 bits per character, there were potentially 65,536 code points mapping glyphs to numbers. Later, in 1996, it was expanded to just over a million potential code points.

Guess where the quotes are in Unicode?

Before I give you the number, here's another long explanation.

This is where it gets complicated again. The original Unicode was two bytes, or 16 bits, per character. However, that's not the Unicode that got popular. In fact, Unicode was pretty unpopular with the masses. A special encoding called UTF-8 was created that overlapped exactly with ASCII (and Unicode's first 256 code points match ISO-8859-1). Because old documents were still valid under the new encoding, adoption was very easy -- if you, as a programmer, didn't really understand UTF-8 and Unicode, and your software didn't understand UTF-8, you could just force everything into ASCII for a few years until you and your software tools got the hang of UTF-8.

Basically, a regular ASCII file is a valid UTF-8 file.
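You can see this compatibility directly: a pure-ASCII byte string decodes to the same text whether you call it ASCII, Latin-1, or UTF-8. A sketch in Python:

```python
data = b"Plain ASCII text survives unchanged."

# All three interpretations agree on pure ASCII bytes...
assert data.decode("ascii") == data.decode("utf-8") == data.decode("latin-1")

# ...and re-encoding the text as UTF-8 reproduces the original bytes.
assert data.decode("ascii").encode("utf-8") == data
```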

In UTF-8, if you need a character beyond ASCII, you use a multi-byte sequence: a lead byte that signals how long the sequence is, followed by one or more continuation bytes. For a typical special character, the whole sequence is two or three bytes.

So the difference between UTF-16 and UTF-8 is drastic. In UTF-16, all your text is at least two bytes per character. In UTF-8, most of your text is one byte per character, except when you need a special character, when it generally expands to two or three bytes. If you need a character outside the basic 16-bit set - rare ancient scripts, or nowadays emoji - it takes four bytes.
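The variable lengths are easy to observe in Python:

```python
assert len("A".encode("utf-8")) == 1            # plain ASCII: 1 byte
assert len("é".encode("utf-8")) == 2            # accented Latin: 2 bytes
assert len("\u2018".encode("utf-8")) == 3       # curly quote: 3 bytes
assert len("\U0001F600".encode("utf-8")) == 4   # emoji: 4 bytes

# UTF-16, by contrast, spends two bytes even on plain ASCII:
assert len("A".encode("utf-16-le")) == 2
```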

These sequences are decoded, and then map back to code points, which then point to specific glyphs.

The sequence for the first curly quote is:
E2 80 98, which, numerically, looks like 226, 128, 152
That sequence maps to the Unicode code point U+2018, which in decimal is 8216.
(Note that these values are expressed in hexadecimal, which I won't explain, but it's just another way to write numbers.)
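A quick verification of those numbers in Python:

```python
quote = "\u2018"  # LEFT SINGLE QUOTATION MARK

# Code point U+2018 is 8216 in decimal...
assert ord(quote) == 0x2018 == 8216

# ...and its UTF-8 encoding is the 3-byte sequence E2 80 98.
assert quote.encode("utf-8") == b"\xe2\x80\x98"
assert list(quote.encode("utf-8")) == [226, 128, 152]
```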

It's a bit more complex than everything it replaced, but it solves pretty much all our encoding needs for the foreseeable future. UTF-8 supports emoji/emoticons, so you have to figure it's an encoding that can keep up with our needs.

How this relates to Indymedia

Unicode was starting to become widely implemented in the late 1990s, when IMC was starting. Microsoft pushed Unicode in Windows NT 4, which came out in 1996. Unix systems started using Unicode in the early 2000s. Windows 98/Me didn't do Unicode, but Microsoft discontinued that product line. The classic Mac OS didn't really support it, but Apple let that die and replaced it with OS X, which was a Unix that supported Unicode (or allegedly did). This was around 2000.

The support tended to migrate upward into programming languages throughout the early 2000s. Perl finished (more or less) adding it around 2002. PHP added some kind of support around then, too. Web browsers still support old encodings, but Unicode was recommended with HTML 4.0, which was released in 1997.

Support for UTF-8 became common around 2005, but the thing about rolling out new technology is that the existing data will always hold you back.

There were millions of pages encoded in everything except Unicode (specifically UTF-8). It would take the better part of a decade for websites to start using UTF-8.

So, in 1999, in this environment, the IMC programmers decided to stick with ISO-8859-1, aka Latin-1, and, when possible, to specify no encoding at all, so that text wasn't damaged through conversion. Latin-1 was well known, the PHP language supported it, the databases supported it, and so forth. You don't get to develop idealized software - you have to deploy it on what's called a software stack, and if any layer of the stack doesn't support Unicode, you can't really use Unicode.

Today

Today, some 25 years after its invention, Unicode support, specifically UTF-8, is basically universal. Probably all the new web pages are in UTF-8. However, we face a whole other problem, which is old systems with new data.

IMC sites are old systems being filled with new data.

To handle this, I went and tried to change all the right settings to make UTF-8 support universal. This involved writing small programs that examined each article, made a guess about the encoding based on what sequences and characters were being used, and then performed a conversion on the data to turn it into UTF-8. (This was done only on the newswire, which is why features and calendar items look screwed up.)

Everything was pretty cool, but then a problem emerged and I needed to restore the database from a backup file. That file was encoded as UTF-8, and I was pushing the data into UTF-8 tables, but at some step the data was assumed to be in some other encoding, and it got re-encoded into UTF-8 again.

So those 3-byte sequences ended up being converted into even longer sequences that, displayed as UTF-8, show up as 3 garbage characters.
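Here's a sketch of that corruption in Python, assuming the wrong encoding was Windows-1252 (which the addendum confirms); it works because every byte of this particular sequence happens to be defined in CP1252:

```python
quote = "\u2018"                 # one curly quote
good = quote.encode("utf-8")     # its correct 3-byte UTF-8 form
assert good == b"\xe2\x80\x98"

# The mistake: treat those UTF-8 bytes as CP1252 text, then re-encode
# that text as UTF-8. One character becomes three garbage characters.
mangled = good.decode("cp1252").encode("utf-8")
assert mangled.decode("utf-8") == "â\u20ac\u02dc"  # â€˜ is what readers see
assert len(mangled) > len(good)

# Each accidental round grows the data further, and each round is
# reversible, which is what makes repair possible.
assert mangled.decode("utf-8").encode("cp1252") == good
```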

You see similar problems all over the web, because there are still old pages and old data being served, and old computers still in operation. They say technology changes fast, but in my experience, it's the economy that changes fast, and when a computer system exits the fast economy and moves into a non-economy or is separated from market forces, technology changes slowly.

Conclusion

Don't panic. These errors can be fixed, pretty reliably. They've been fixed before. I just don't think it's worth doing until we move onto a new computer system where absolutely everything defaults to UTF-8.

Addendum:

I checked the problem, and the data going in was actually assumed to be coded as Windows-1252, aka ANSI. So the fix is to take the data and convert it from UTF-8 back to 1252.

The original mistake was that the data (really UTF-8) was assumed to be 1252, so each byte of a UTF-8 sequence got re-encoded as its own character. This happened 2 or 3 times. So to undo it, you decode each round as if it were 1252 that had been encoded into UTF-8. That gets back your original bytes.

Attached is the script I'm using to do that.

Note that it uses a function guess_decode() that tries to decode until the text won't "shrink" anymore. Then it backs off two steps to get the original UTF-8 version.


<?php
//display.php is used for displaying a single
//article from the DB

include("shared/global.cfg");

function main( $start ) {
    $count = 5000;
    $id = $start;
    echo "<p>$start</p>";
    for($id=$start; $id < $start+$count; $id++) {
        if ($id) {
            $db_obj = new DB;
            $query = "SELECT heading, summary, article FROM webcast WHERE id=".$id;
            $result = $db_obj->query($query);
            $article_fields = array_pop($result);
            $article = $article_fields['article'];
            $article = guess_decode( $article );
            $summary = guess_decode( $article_fields['summary'] );
            $heading = guess_decode( $article_fields['heading'] );
            $h = mysql_real_escape_string($heading);
            $s = mysql_real_escape_string($summary);
            $a = mysql_real_escape_string($article);
            $query = "UPDATE webcast SET heading='$h', summary='$s', article='$a' WHERE id=$id";
            $db_obj->query($query);
        }
    }
    echo "<p><a href='?id=$id'>$id</a></p>";
}

/*
 * Fixes data that's been erroneously reencoded assuming   
 * it is CP1252 when it's really UTF-8, multiple times.
 * Decodes the input.  Repeats until the size of the 
 * string stops shrinking.  The string stops shrinking 
 * when all the CP1252 8-bit chars vanish.  Then you have
 * to back off two steps to get to the original UTF-8.
 * It goes from UTF-8 -> CP1252, then CP1252->no-8-bits.
 * I'm hoping it handles edge cases correctly.
 */
function guess_decode( $a ) { 
    $new = mb_convert_encoding( $a, 'CP1252', 'UTF-8');
    $b = $a;
    $c = $a;
    while( strlen($a) > strlen($new) ) {
        $c = $b;
        $b = $a;
        $a = $new;
        $new = mb_convert_encoding( $a, 'CP1252', 'UTF-8');
    }
    return $c;
}

$id = intval($_GET['id']);
main($id);
?>

Also note - this is not good code. Don't make your queries like that in production code. I'm just doing it this way because I needed to fix this thing quickly, and there's that old db abstraction object to deal with. Use PDO instead.