LibreOffice is a Faster Tool to Load Excel Data into an SQL Database

We wanted to know how Bernie Sanders did in several electoral districts.

What are our data sources.

There are two public databases that can help us discover these vote totals. One is the election results broken down by voting precinct. The other is a database that maps the precinct numbers to district numbers.

lavote.net's precinct reports

The Los Angeles County Registrar Recorder publshes vote tallies aggregated by precinct. The problem is, these are reports, not raw or cooked data. The spreadsheets have a two-line header, and each precinct has a line for a total of votes at the polls and votes by mail.

We need to extract the metadata in the headers, and ignore the rows that are totals.

Spreadsheet from lavote.net

Importing this data will take some effort, and I'll go into this below.

Los Angeles County's precinct files

The County of Los Angeles publishes a GIS dataset that includes a table that lists all the precincts, and what State Assembly and State Senage and Congressional district the precinct is in.

It's in a file that ends with .DBF, the old XBASE database file format. LibreOffice will open this easily.

This data is a lot easier to import. I cleaned up the headers, deleted columns I didn't want, and saved it as a CSV file.

MySQL Workbench can perform the import.

Extracting data from the precinct reports

The precinct reports are delivered as dozens of XLS spreadsheets.

While it would be possible to fix up each data table by hand, and save it as a CSV file, and then import the table into the database, it would be a lot of work.

Not only that, importing CSV files takes a long time if you use MySQL Workbench.

A much faster solution is to write a script for LibreOffice that extracts the data in the spreadsheet and produces SQL statements that will populate the database.

Writing the script takes a while, though.

LibreOffice Calc

LibreOffice has a built-in macro language that's basicallly the same as Visual Basic for Applications. The IDE provided is not as good, but it suffices.

There's a good tutorial and reference manual for LibreOffice's Basic at https://wiki.openoffice.org/wiki/Documentation/BASIC_Guide.

And also at https://wiki.documentfoundation.org/Macros.

This script, below, is a module with a subroutine that will open the spreadsheet, extract some metadata about the report, write an SQL CREATE statement, and then write INSERT statements for the lines we wish to import. It skips over the "total" lines.

You can paste the code into LibreOffice.

REM  *****  BASIC  *****

Sub ConvertExcelToSQL(cFile)
    Dim cURL as string
    Dim oDoc as object
    Dim oSheet as object
    Dim oCell as object
    Dim Dummy() 
    cURL = ConvertToURL(cFile)
    oDoc = StarDesktop.loadComponentFromURL(cURL, "_blank", 0, Dummy)

    '
    ' Create our output file from the source filename.
    '   
    cOutput = Left(cFile, Len(cFile) - 4) + ".sql"
    iNumber = Freefile
    open cOutput for output as #iNumber

    '
    ' Extract header info
    '
    oSheet = oDoc.Sheets(0)
    ' row 2 col 1 Contest
    sContest = oSheet.getCellByPosition(0,1).String

    '
    ' Identify column offsets by names.
    '   
    iLocation = 0
    iPrecinct = 1
    iSerial = 2
    iBallotGroup = 3
    iVBM = 4
    iRegistration = 5
    iType = 6
    iBallots = 7
    '
    ' There are a variable number of candidate columns.
    ' Read them all.
    '
    Dim aCandidates(40) as string
    idx = 8
    iCandidateCount = 0
    do
        sCandidate = oSheet.getCellByPosition(idx, 2).String
        if sCandidate <> "" Then
            aCandidates(idx - 8) = Replace(LCase(sCandidate), " ", "_")
            idx = idx + 1
            iCandidateCount = iCandidateCount + 1
        end if
        ' Print sCandidate
    loop while (sCandidate <> "")

    '
    ' The RRCC wasn't very consistent in their naming conventions.
    ' So we fix it here. These mangle the table name.
    '
    sParty = Trim(oSheet.getCellByPosition(1,1).String)
    if sParty <> "" Then sParty = "_" & sParty

    sTableName = "t_" + LCase(Replace(sContest & sParty, " ", "_"))
    sTableName = Replace(sTableName, "#", "")
    sTableName = Replace(sTableName, "-", "_")
    sTableName = Replace(sTableName, "__", "_")
    sTableName = Replace(sTableName, "__", "_")
    sTableName = Replace(sTableName, "dem_democratic", "democratic")
    sTableName = Replace(sTableName, "rep_republican", "republican")
    sTableName = RTrim(sTableName, "_")

    '
    ' SQL CREATE statement.
    '
    sOut = "create table " + SLQuote(sTableName) + " ("
    sOut = sOut + "location varchar(255), "
    sOut = sOut + "precinct varchar(255), "
    sOut = sOut + "serial int(11), "
    sOut = sOut + "ballotgroup varchar(255), "
    sOut = sOut + "vbm varchar(255), "
    sOut = sOut + "registration int(11), "
    sOut = sOut + "type varchar(255), "
    sOut = sOut + "ballots int(11) "
    for i = 0 to iCandidateCount - 1
            sOut = sOut + ", " + SLQuote(aCandidates(i)) + " int(11)"
    next
    sOut = sOut + ");"
    print #iNumber, sOut

    '
    ' Walk through the document, copying out the data.
    ' Skipping over totals.
    '
    row = 3
    do
        sType = oSheet.getCellByPosition(iType, row).String
        if (sType <> "TOTAL" and sType <> "") Then
            sLocation = oSheet.getCellByPosition(iLocation, row).String
            sPrecinct = oSheet.getCellByPosition(iPrecinct, row).String
            sSerial = oSheet.getCellByPosition(iSerial, row).String
            sBallotGroup = oSheet.getCellByPosition(iBallotGroup, row).String
            sVBM = oSheet.getCellByPosition(iVBM, row).String
            sRegistration = oSheet.getCellByPosition(iRegistration, row).String
            sType = oSheet.getCellByPosition(iType, row).String
            sBallots = oSheet.getCellByPosition(iBallots, row).String
            sOut = "insert into " + SLQuote(sTableName) + " values ("
            sOut = sOut + SQuote(sLocation)
            sOut = sOut + CSQuote(sPrecinct)
            sOut = sOut + CNoQUote(sSerial)
            sOut = sOut + CNoQUote(sBallotGroup)
            sOut = sOut + CSQuote(sVBM)
            sOut = sOut + CNoQUote(sRegistration)
            sOut = sOut + CSQuote(sType)
            sOut = sOut + CNoQUote(sBallots)
            ' candidate dump
            for i = 0 to iCandidateCount - 1
                sOut = sOut + CNoQUote(oSheet.getCellByPosition(i + 8, row).String)     
            next
            sOut = sOut + ");"
            print #iNumber, sOut

        end if
        row = row + 1
    loop while sType <> ""
    close #iNumber
    oDoc.close(True)
end sub


'
' Utility functions
'

Function Replace(Source As String, Search As String, NewPart As String)
  Dim Result As String  
  Result = join(split(Source, Search), NewPart)
  Replace = Result
End Function

'
' Quoting functions help us avoid confusing '"' and "'" in our SQL.
'

Function SLQuote(s As String)
    SLQuote = "`" + s + "`"
end function

Function SQuote(s As String)
    SQuote = "'" + s + "'"
end function

Function CSQuote(s As String)
    CSQuote = ",'" + s + "'"
end function

Function CNoQUote(s As String)
    CNoQuote = "," + s
end function


sub test
    ConvertExcelToSQL("/home/johnk/Desktop/980_SVCExcel/JUDGE-SUPERIOR_COURT_#120_06-07-16_by_Precinct_980-4211.xls")
end sub

A sample export file looks like this.

21ST_STATE_SENATE_DIST_06-07-16_Voter_Nominated_by_Precinct_980-4178.sql:

create table `t_21st_state_senate_dist_voter_nominated` (location varchar(255), precinct varchar(255), serial int(11), ballotgroup varchar(255), vbm varchar(255), registration int(11), type varchar(255), ballots int(11) , `johnathon_l_ervin` int(11), `scott_wilk` int(11), `steve_hill` int(11), `star_moffatt` int(11));
insert into `t_21st_state_senate_dist_voter_nominated` values ('ACTON','0050003A',1,2,'N',1047,'POLLING PLACE',302,46,171,28,27);
insert into `t_21st_state_senate_dist_voter_nominated` values ('ACTON','0050003A',1,2,'N',0,'VBM PORTION',169,24,110,9,13);
insert into `t_21st_state_senate_dist_voter_nominated` values ('ACTON','0050004B',2,2,'N',569,'POLLING PLACE',149,27,88,9,10);
insert into `t_21st_state_senate_dist_voter_nominated` values ('ACTON','0050004B',2,2,'N',0,'VBM PORTION',131,20,85,12,7);
insert into `t_21st_state_senate_dist_voter_nominated` values ('ACTON','0050005A',3,2,'N',733,'POLLING PLACE',215,41,116,8,30);
insert into `t_21st_state_senate_dist_voter_nominated` values ('ACTON','0050005A',3,2,'N',0,'VBM PORTION',145,16,88,13,12);

The macro can be run from the command line, like this:

A "Magic Folder" written in Perl

Scripting out the commands to process each XLS file is tedious, so I used an old technique I call the "magic folder", which processes all the files that get dropped into a folder.

This is an old script, but I keep reusing it. It's in Perl. I don't even know how to read it, accurately, anymore.

Remember to quit the "magic folder" script when it's done.

#! /usr/bin/perl

# This script monitors the directory and exports
# XLS files to SQL. This works only with
# the LA County Election results by precinct reports.

chdir '/home/johnk/Desktop/980_SVCExcel';

use Cwd;

%docs = ();
%pdfs = ();

sub main() {
  print "checking\n";
  opendir DH,'.';
  while( $file = readdir(DH) ) {
    if ($file =~ /.xls$/) {
      $docs{$file} = 1;
    }
    if ($file =~ /.sql$/) {
      $pdfs{$file} = 1;
    }
  }
  closedir DH;

  foreach $d (keys(%docs))  {
    print "$d\n";
    $p = $d;
    $p =~ s/.xls/.sql/;
    next if ($pdfs{$p});

    $filename = getcwd().'/'.$d;
    print "$filename\n";

    system('libreoffice -invisible "macro:///Standard.Module1.ConvertExcelToSQL('.$filename.')"');
  }
}

for(;;) {
  &main();
  sleep 10;
}

Loading the data tables

Once all the SQL files are created, you can sort the file list by type, create a new folder for these data files, and copy them.

Then, create a new database. I called mine "elections". Set up a user with permissions, or just use your root user account to load the data.

mysql elections < 21ST_STATE_SENATE_DIST_06-07-16_Voter_Nominated_by_Precinct_980-4178.sql

To load up a large batch, use the FOR construct:

for i in *ass*.sql; do mysql elections < $i; done

This executes the mysql command for every file that matches ass.sql.

You can also do this:

cat *.sql | mysql elections

That combines all the SQL files and sends the commands to mysql.

Making a query

Once all the data is loaded, you can perform a query.

To find out the election results in Assembly District 15, you would write the following query.

SELECT sum(bernie_sanders), sum(hillary_clinton) 
FROM t_presidential_preference_democratic p 
JOIN (
    SELECT precinct FROM 
    elections.rrcc_precincts 
    WHERE stateassembly=39
    ) s 
ON p.precinct=s.precinct;

This subquery returns a table of precincts that are in state assembly district 15.

    SELECT precinct FROM 
    elections.rrcc_precincts 
    WHERE stateassembly=39

That gets joined with the presidential preference results, by precinct, including a row only if precinct on both tables matches.

The totals are calculated and the result is ready in a couple seconds.

AttachmentSize
datafile.png44.61 KB