Chant Down Babylon! Character Set Conversion from Latin-1 ISO-8859-1 cp1252 to UTF-8

I swiped this code from php.net.

Character set conversion is one of those things I’ve avoided over the years. Just use UTF-8 from the start. But IMC has thousands of articles stored in a BLOB column, so the text sits there in whatever character set it was submitted in. The software in front of the data was using ISO-8859-1, but PHP wasn’t really mangling the data: it just passed the bytes through unchanged, until I installed the mbstring extension (or more accurately, it was baked into PHP). That caused some problems, and it snowballed into converting everything to UTF-8.

There are five dominant character sets used to enter this data: ascii, iso-8859-1 (aka latin-1), windows-cp1252, utf-8, and MacRoman (from older Macs, discussed further down).

As you probably know, ascii is a subset of the other four, so we can ignore that.

Latin-1 is nominally the charset used in the app, but most users produce data in Windows, and seem to have pasted cp1252 codes into the app. cp1252 is an extension to latin-1 that includes things like curly quotes and em-dashes. Word produces these automatically, so a lot of these characters got pasted into the app. Fortunately, cp1252 puts these characters at 0x80-0x9F, a range that latin-1 reserves for non-printable control codes, so any byte in that range can safely be read as cp1252.
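To make the 0x80-0x9F point concrete, here is a small sketch. It assumes the mbstring extension is loaded; the variable names are made up for illustration.

```php
<?php
// Byte 0x93 is what Word emits for a left curly double quote under cp1252.
$byte = "\x93";

// Read strictly as ISO-8859-1, 0x93 is the invisible control character U+0093.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// What the author actually meant: U+201C LEFT DOUBLE QUOTATION MARK.
$wanted = "\xE2\x80\x9C";

echo bin2hex($asLatin1), "\n"; // c293
echo bin2hex($wanted), "\n";   // e2809c
var_dump($asLatin1 === $wanted); // bool(false)
```

A plain latin-1 conversion produces valid UTF-8 here, but it yields a control character instead of the quote, which is why the cp1252 range needs its own mapping.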

People also pasted UTF-8 encoded text into the app. UTF-8 has all the glyphs of latin-1, but everything above ascii is encoded as two bytes instead of one, so the byte values differ.
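A quick illustration of the byte difference, assuming mbstring is available (PHP 7.2 or later for mb_ord):

```php
<?php
$latin1 = "\xE9"; // "é" as a single ISO-8859-1 byte
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo strlen($latin1), ' ', bin2hex($latin1), "\n"; // 1 e9
echo strlen($utf8), ' ', bin2hex($utf8), "\n";     // 2 c3a9

// The Unicode code point is still 0xE9; only the byte encoding changed.
echo dechex(mb_ord($utf8, 'UTF-8')), "\n";         // e9

// Plain ascii is byte-for-byte identical in both encodings.
var_dump(mb_convert_encoding('abc', 'UTF-8', 'ISO-8859-1') === 'abc'); // bool(true)
```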

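So converting the data requires something that converts from cp1252 to UTF-8 while leaving text that is already UTF-8 alone. Unfortunately, PHP doesn’t include a function for that hybrid. It has utf8_encode, which converts from strict latin-1 to utf8 (and is deprecated as of PHP 8.2). mb_convert_encoding and iconv can convert from a single named charset, but they assume the whole string is in that one encoding. So someone wrote fix_latin, which deals with this hybrid. Code is below.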

fix_latin has enough logic to avoid converting utf-8 encoded data, which would result in mangled data.
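Here is a sketch of the mangling that this check avoids, assuming mbstring is loaded. The helper name to_utf8_once is hypothetical and is not part of fix_latin.

```php
<?php
$utf8 = "\xC3\xA9"; // "é", already encoded as UTF-8

// Blindly treating it as latin-1 encodes each of its two bytes again.
$mangled = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($mangled), "\n"; // c383c2a9, which displays as "Ã©"

// Hypothetical guard: convert only when the input is not valid UTF-8.
function to_utf8_once(string $s): string
{
    return mb_check_encoding($s, 'UTF-8')
        ? $s
        : mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(to_utf8_once($utf8)), "\n";   // c3a9 (left alone)
echo bin2hex(to_utf8_once("\xE9")), "\n";  // c3a9 (converted from latin-1)
```

The guard only works on whole strings. fix_latin goes further and makes the same decision sequence by sequence, which is what lets it handle text that mixes encodings.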

The Mac is a whole other problem. Today it uses utf-8, so it’s ok, but back in the 80s and 90s it had a different character set, MacRoman. Unlike cp1252, which extends latin-1, MacRoman’s mappings above ascii are totally different. More info here. And see the mapping at madore.org.

Mac text creates a real problem – how to identify if the text was produced on a Mac or on Windows. Conversion isn’t the problem. Identification is much harder, because I’m not there to look at each file and determine if it’s MacRoman or latin-1. Here’s a stackoverflow post about this.
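To show why identification is the hard part, here are three bytes decoded both ways. The two arrays are tiny hand-written excerpts for illustration, not full charset tables, and the variable names are made up.

```php
<?php
// The same byte means different things in each legacy charset.
$macRoman = [
    "\x8E" => "\u{00E9}", // é
    "\xD2" => "\u{201C}", // left curly double quote
    "\xD3" => "\u{201D}", // right curly double quote
];
$cp1252 = [
    "\x8E" => "\u{017D}", // Ž
    "\xD2" => "\u{00D2}", // Ò
    "\xD3" => "\u{00D3}", // Ó
];

foreach ($macRoman as $byte => $asMac) {
    printf("0x%s  MacRoman: %s  cp1252: %s\n",
        strtoupper(bin2hex($byte)), $asMac, $cp1252[$byte]);
}
```

Both readings are valid characters, so nothing at the byte level says which one is right. Any detection has to guess from context, for example which reading produces more plausible words.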

$byte_map=array();
init_byte_map();
$nibble_good_chars = '@^([\x00-\x7F]+|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|[\xF8-\xFB][\x80-\xBF]{4})(.*)$@s';
function init_byte_map(){
  global $byte_map;
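  // Default: treat every high byte as latin-1.
  // utf8_encode is deprecated as of PHP 8.2;
  // mb_convert_encoding(chr($x),'UTF-8','ISO-8859-1') gives the same result.
  for($x=128;$x<256;++$x){
    $byte_map[chr($x)]=utf8_encode(chr($x));
  }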
  $cp1252_map=array(
    "\x80" => "\xE2\x82\xAC",  // EURO SIGN
    "\x82" => "\xE2\x80\x9A",  // SINGLE LOW-9 QUOTATION MARK
    "\x83" => "\xC6\x92",      // LATIN SMALL LETTER F WITH HOOK
    "\x84" => "\xE2\x80\x9E",  // DOUBLE LOW-9 QUOTATION MARK
    "\x85" => "\xE2\x80\xA6",  // HORIZONTAL ELLIPSIS
    "\x86" => "\xE2\x80\xA0",  // DAGGER
    "\x87" => "\xE2\x80\xA1",  // DOUBLE DAGGER
    "\x88" => "\xCB\x86",      // MODIFIER LETTER CIRCUMFLEX ACCENT
    "\x89" => "\xE2\x80\xB0",  // PER MILLE SIGN
    "\x8A" => "\xC5\xA0",      // LATIN CAPITAL LETTER S WITH CARON
    "\x8B" => "\xE2\x80\xB9",  // SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    "\x8C" => "\xC5\x92",      // LATIN CAPITAL LIGATURE OE
    "\x8E" => "\xC5\xBD",      // LATIN CAPITAL LETTER Z WITH CARON
    "\x91" => "\xE2\x80\x98",  // LEFT SINGLE QUOTATION MARK
    "\x92" => "\xE2\x80\x99",  // RIGHT SINGLE QUOTATION MARK
    "\x93" => "\xE2\x80\x9C",  // LEFT DOUBLE QUOTATION MARK
    "\x94" => "\xE2\x80\x9D",  // RIGHT DOUBLE QUOTATION MARK
    "\x95" => "\xE2\x80\xA2",  // BULLET
    "\x96" => "\xE2\x80\x93",  // EN DASH
    "\x97" => "\xE2\x80\x94",  // EM DASH
    "\x98" => "\xCB\x9C",      // SMALL TILDE
    "\x99" => "\xE2\x84\xA2",  // TRADE MARK SIGN
    "\x9A" => "\xC5\xA1",      // LATIN SMALL LETTER S WITH CARON
    "\x9B" => "\xE2\x80\xBA",  // SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    "\x9C" => "\xC5\x93",      // LATIN SMALL LIGATURE OE
    "\x9E" => "\xC5\xBE",      // LATIN SMALL LETTER Z WITH CARON
    "\x9F" => "\xC5\xB8"       // LATIN CAPITAL LETTER Y WITH DIAERESIS
  );
  foreach($cp1252_map as $k=>$v){
    $byte_map[$k]=$v;
  }
}
function fix_latin($instr){
  if(mb_check_encoding($instr,'UTF-8'))return $instr; // no need for the rest if it's all valid UTF-8 already
  global $nibble_good_chars,$byte_map;
  $outstr='';
  $char='';
  $rest='';
  while((strlen($instr))>0){
    if(1==preg_match($nibble_good_chars,$instr,$match)){
      $char=$match[1];
      $rest=$match[2];
      $outstr.=$char;
    }elseif(1==preg_match('@^(.)(.*)$@s',$instr,$match)){
      $char=$match[1];
      $rest=$match[2];
      $outstr.=$byte_map[$char];
    }
    $instr=$rest;
  }
  return $outstr;
}
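The loop above runs a regex with a trailing (.*) for every chunk, which gets slow on long articles. Below is a sketch of the same idea done in one pass with preg_replace_callback. It is a swapped-in variant, not the php.net code: the function name fix_latin_single_pass is hypothetical, it computes the latin-1 fallback by hand instead of calling utf8_encode, and its pattern is somewhat tighter than the original (lead bytes C2-F4 only), so it can treat a few odd byte sequences differently.

```php
<?php
// Hypothetical single-pass variant of fix_latin; not the original php.net code.
function fix_latin_single_pass(string $in): string
{
    // cp1252 bytes in 0x80-0x9F that have printable characters, as UTF-8.
    static $cp1252 = [
        "\x80" => "\xE2\x82\xAC", "\x82" => "\xE2\x80\x9A", "\x83" => "\xC6\x92",
        "\x84" => "\xE2\x80\x9E", "\x85" => "\xE2\x80\xA6", "\x86" => "\xE2\x80\xA0",
        "\x87" => "\xE2\x80\xA1", "\x88" => "\xCB\x86",     "\x89" => "\xE2\x80\xB0",
        "\x8A" => "\xC5\xA0",     "\x8B" => "\xE2\x80\xB9", "\x8C" => "\xC5\x92",
        "\x8E" => "\xC5\xBD",     "\x91" => "\xE2\x80\x98", "\x92" => "\xE2\x80\x99",
        "\x93" => "\xE2\x80\x9C", "\x94" => "\xE2\x80\x9D", "\x95" => "\xE2\x80\xA2",
        "\x96" => "\xE2\x80\x93", "\x97" => "\xE2\x80\x94", "\x98" => "\xCB\x9C",
        "\x99" => "\xE2\x84\xA2", "\x9A" => "\xC5\xA1",     "\x9B" => "\xE2\x80\xBA",
        "\x9C" => "\xC5\x93",     "\x9E" => "\xC5\xBE",     "\x9F" => "\xC5\xB8",
    ];

    // Ascii runs and UTF-8-shaped sequences first, then any stray high byte.
    $pattern = '/[\x00-\x7F]+|[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3}|[\x80-\xFF]/';

    $out = preg_replace_callback($pattern, function (array $m) use ($cp1252): string {
        $chunk = $m[0];
        // Multi-byte matches and ascii are already fine as UTF-8.
        if (strlen($chunk) > 1 || ord($chunk) < 0x80) {
            return $chunk;
        }
        // Stray high byte: cp1252 mapping if there is one, else latin-1,
        // encoded by hand as a two-byte UTF-8 sequence.
        $code = ord($chunk);
        return $cp1252[$chunk] ?? chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }, $in);

    if ($out === null) {
        throw new RuntimeException('PCRE error ' . preg_last_error());
    }
    return $out;
}

echo bin2hex(fix_latin_single_pass("caf\xE9")), "\n"; // 636166c3a9
```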

Update: Attached is a script that was used to convert an sf-active article database from a mix of encodings into UTF-8. The code is not quite straightforward, because shoving a bunch of updates into the queue caused an unintentional DoS. The system has a watchdog script, so the db comes back up. The script accounts for this: it logs successes and failures, and checks each field to see whether the change has already been saved. It’s not throttling back its requests, because I figure one DoS out of 160k requests was OK, since the db starts back up in a few minutes.
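For anyone who would rather pause than hammer the db, here is a rough sketch of the skip-if-done and retry ideas. Everything in it is an assumption for illustration: the function names, the retry count, and the pause length are made up and are not taken from the attached script.

```php
<?php
// Hypothetical helpers; not taken from the attached script.

// A field that is already valid UTF-8 is treated as already converted,
// so re-running the conversion skips it.
function already_converted(string $text): bool
{
    return mb_check_encoding($text, 'UTF-8');
}

// Run $update and retry after a pause if it throws, for example while
// the watchdog is restarting the db. Returns true on success.
function update_with_retry(callable $update, int $maxAttempts = 5, int $pauseSeconds = 60): bool
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $update();
            return true;
        } catch (Exception $e) {
            // Log the failure so the run can be audited afterwards.
            error_log("attempt $attempt failed: " . $e->getMessage());
            if ($attempt < $maxAttempts) {
                sleep($pauseSeconds);
            }
        }
    }
    return false;
}
```

In use, $update would be a closure that performs a single UPDATE for one article, and the caller would skip any row where already_converted returns true.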