Heuristic Hispanic Surname Extraction with Regular Expressions

Here in Los Angeles, and in the Southwest in general, a lot of people have Mexican or other Latin American heritage, and have Spanish surnames.

The challenge with these names is twofold.

First, they follow a European convention of using “of” to denote the family, e.g. De La Cruz, del Valle, de la Rosa. This is like the Irish O’Connor or Italian del Vecchio. So the surname may be two or three words. (Some Arabic surnames are multi-part as well, but in Spanish they have been combined into one word: Alvarez, Alamo, Alvarenga, etc.)

Second, some people use their Catholic-style names, where the maternal name is combined with the paternal name, e.g. Lucy Lopez Valdez. Lucy may be known as Lucy Lopez, as Lucy Valdez, or by the full name (a wife doesn’t drop her name when she marries). This is why Catholics have long names.
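If you are matching records, it can help to list the forms one person might appear under. Here is a minimal sketch; `nameVariants` is a made-up helper, and it assumes the usual Spanish order of paternal name first, maternal name second.

```javascript
// Hypothetical helper: the forms a two-surname person may go by.
// Assumes the paternal surname comes first and the maternal second.
function nameVariants(first, paternal, maternal) {
    return [
        `${first} ${paternal}`,
        `${first} ${maternal}`,
        `${first} ${paternal} ${maternal}`,
    ];
}

console.log(nameVariants('Lucy', 'Lopez', 'Valdez'));
// [ 'Lucy Lopez', 'Lucy Valdez', 'Lucy Lopez Valdez' ]
```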

Here’s a little JavaScript that splits a name field and makes a guess at the last name.

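    // d is the record and lname is the form field, both from the surrounding page
    var parts = d['Name'].trim().split(/\s+/);

    // start with the final word; the checks below prepend to it
    lname.value = parts.pop();
    // pull in particles in front of it: de, del, de la, dela
    while (parts.length > 1 && /^(de|del|dela|la)$/i.test(parts[parts.length - 1]))
        lname.value = parts.pop() + ' ' + lname.value;
    // matches many common Spanish names, to handle the "two last names"
    if (parts.length > 1 && /^[CPMSGRH].+[aoz]$/.test(parts[parts.length - 1]))
        lname.value = parts.pop() + ' ' + lname.value;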

The second conditional is a hack to check for the “Spanishness” of a name. It is based on the fact that many Mexican surnames end in z, a, or o. They usually start with a consonant, but I only chose a few. L is not among them, so the Lopez in Lucy Lopez Valdez would not be caught as written.
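To try the heuristic outside the page, the same steps can be wrapped in a standalone function. This is a sketch under a few assumptions, not the code from the form: `guessLastName` is a made-up name, it takes the final word of the name as the starting point, and it adds `del` to the particle list because del Valle is one of the examples above.

```javascript
// Hypothetical standalone version of the heuristic.
function guessLastName(fullName) {
    const parts = fullName.trim().split(/\s+/);
    // The final word is always part of the surname.
    let last = parts.pop();
    // Prepend particles (de, del, de la, dela), but never eat the first word.
    while (parts.length > 1 && /^(de|del|dela|la)$/i.test(parts[parts.length - 1])) {
        last = parts.pop() + ' ' + last;
    }
    // "Spanishness" check: if the word before looks like a Spanish surname,
    // treat it as the first of two last names.
    if (parts.length > 1 && /^[CPMSGRH].+[aoz]$/.test(parts[parts.length - 1])) {
        last = parts.pop() + ' ' + last;
    }
    return last;
}

console.log(guessLastName('Juan de la Cruz'));      // de la Cruz
console.log(guessLastName('Maria Garcia Sanchez')); // Garcia Sanchez
console.log(guessLastName('Lucy Lopez Valdez'));    // Valdez (L is not in the list)
console.log(guessLastName('John Smith'));           // Smith
```

The `parts.length > 1` guard is what keeps a first name like Rosa from being swallowed when the name has only two words.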

You don’t want to make the regex too general, or it’ll catch many non-Spanish names. Italian names will tend to get matched, but Italian Americans tend to follow the English convention of not using the maternal name. In some other cultures, such as Chinese, a multi-part name is hyphenated or merged into a single word when it is romanized, e.g. Mao Tse-Tung or Mao Zedong.
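A quick way to see the tradeoff is to run a few words through the pattern and through a looser one. The `broad` pattern here is a hypothetical variant that accepts any capital first letter, and the sample list is just for illustration.

```javascript
const narrow = /^[CPMSGRH].+[aoz]$/; // the pattern used above
const broad = /^[A-Z].+[aoz]$/;      // hypothetical looser variant

const samples = ['Garcia', 'Lopez', 'Romano', 'Linda', 'Joshua', 'Smith'];
const results = samples.map(name => ({
    name,
    narrow: narrow.test(name),
    broad: broad.test(name),
}));

for (const r of results) {
    console.log(r.name, 'narrow:', r.narrow, 'broad:', r.broad);
}
// Garcia narrow: true broad: true
// Lopez narrow: false broad: true
// Romano narrow: true broad: true
// Linda narrow: false broad: true
// Joshua narrow: false broad: true
// Smith narrow: false broad: false
```

The looser pattern picks up Lopez, but it also picks up given names like Linda and Joshua, which would turn "Mary Linda Smith" into the last name "Linda Smith". Even the narrow pattern matches the Italian Romano.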

