Hispanic Surname Extraction with Regular Expressions

The challenge with Hispanic surnames is twofold. First, they follow a European convention of using a particle meaning "of" to denote the family, e.g. De La Cruz. This is similar to the Irish O'Connor or the Italian del Vecchio. Second, some people use their Catholic-style names, where the maternal name is combined with the paternal name, e.g. Lucy Lopez Valdez. Lucy may be known as Lucy Lopez, as Lucy Valdez, or by the full name (a married woman doesn't drop her own name).

Here's a little JavaScript code that splits a name field and makes a guess at the last name.

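    // Split the name field into space-separated tokens.
    var parts = d['Name'].split(' ');

    // Assumed initial step: take the final token as the first guess at the last name.
    lname.value = parts.pop();

    // Pull in particles such as "de", "la", or "dela" that precede the last name.
    while (parts.length > 1 && /^(de|dela|la)$/i.test(parts[parts.length - 1]))
    {
        lname.value = parts.pop() + ' ' + lname.value;
    }
    // Matches many common Spanish names, to handle the "two last names" case.
    if (parts.length > 1 && /^[CPMSGRH].+[aoz]$/.test(parts[parts.length - 1]))
    {
        lname.value = parts.pop() + ' ' + lname.value;
    }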
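
For experimenting outside a form, the same steps can be wrapped in a standalone function. This is only a sketch: the name `guessLastName` is hypothetical, and taking the final token as the starting guess is an assumption about how the last name field gets its initial value.

```javascript
// Hypothetical helper: returns a guess at the last name from a full-name string.
function guessLastName(fullName) {
    var parts = fullName.trim().split(/\s+/);

    // Assumption: the final token is always part of the last name.
    var lname = parts.pop();

    // Prepend particles such as "de", "la", or "dela".
    while (parts.length > 1 && /^(de|dela|la)$/i.test(parts[parts.length - 1])) {
        lname = parts.pop() + ' ' + lname;
    }
    // Prepend one more token if it looks like a Spanish surname.
    if (parts.length > 1 && /^[CPMSGRH].+[aoz]$/.test(parts[parts.length - 1])) {
        lname = parts.pop() + ' ' + lname;
    }
    return lname;
}

console.log(guessLastName('Maria De La Cruz'));     // De La Cruz
console.log(guessLastName('Ana Garcia Martinez'));  // Garcia Martinez
console.log(guessLastName('John Smith'));           // Smith
```

The `parts.length > 1` guard keeps the first remaining token out of the last name, so a two-word name never loses its first name to the surname guess.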

The second conditional is a hack to check for the "Spanishness" of a name. It is based on the fact that many Mexican surnames end in z, a, or o. They usually start with a consonant, but I only chose a few initials (C, P, M, S, G, R, and H). You don't want to make the regex too general, or it will catch many non-Spanish names. Italian names will tend to get matched, but Italian Americans tend to follow the English convention of not using the maternal name. In some other cultures, such as Chinese, multi-part names are hyphenated or merged into a single word when romanized, e.g. Mao Tse-Tung or Mao Zedong.
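
To get a feel for what the pattern does and does not catch, it can be run against a handful of names. The sample names below are illustrative choices for this sketch, not data from a real list, and the variable names are hypothetical.

```javascript
// The surname pattern from the second conditional.
var spanishLike = /^[CPMSGRH].+[aoz]$/;

// Illustrative sample names (chosen for this sketch).
var samples = ['Garcia', 'Martinez', 'Romano', 'Maria', 'Lopez', 'Smith'];

samples.forEach(function (name) {
    console.log(name + ': ' + spanishLike.test(name));
});
// Garcia: true
// Martinez: true
// Romano: true   (Italian surname, matched anyway)
// Maria: true    (a given name, matched anyway)
// Lopez: false   ("L" is not among the chosen initials)
// Smith: false
```

The misses and false hits are the price of keeping the pattern narrow: widening the initial-letter set would pick up Lopez, but it would also match more non-Spanish names.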

Back in the 80s, when I was learning programming, this was called "heuristics".

http://en.wikipedia.org/wiki/Heuristic

http://www.springerlink.com/content/tfpg9bhlq5d49vb2/