Turning California WARN PDFs into Text

This was an odd project: taking several PDFs of layoff data and turning them into text so they could be used more like a database. The state should offer this information as a database, but it doesn't (at least it didn't to me). I ended up using a PDF-to-text application to generate text files, then wrote these scripts to scrape the data out of the text. My goal was to dig up all the unionized workplaces.

The WARN Act is a law that requires employers to give 60 days' notice of any coming mass layoffs. It applies only to larger employers: the federal law covers businesses with 100 or more employees, and California's version covers establishments with 75 or more.

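These scripts are basically complete, but they have to be run from the right directories. split.pl writes one file per record into a splits/ subdirectory, which must already exist. parse.pl and make.sh are then run from inside that directory, where make.sh feeds each .txt file to ./parse.pl and appends the output to report.csv.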
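
The directory juggling can also be wrapped in one function. The following is only a sketch under assumptions: the function name `run_warn_pipeline` and the input file name are made up for illustration, and it calls parse.pl by its full path instead of copying it into splits/. It also empties report.csv first, because the per-file loop appends.

```shell
#!/bin/sh
# Sketch only: run from the directory that holds split.pl and parse.pl.
# run_warn_pipeline and its argument are hypothetical names.
run_warn_pipeline() {
  input="$1"                      # text produced by the PDF-to-text tool
  top="$(pwd)"                    # remember where the scripts live
  mkdir -p splits || return 1     # split.pl writes to splits/NAME.txt
  ./split.pl < "$input" || return 1
  : > splits/report.csv           # start with an empty report
  (
    cd splits || exit 1
    for f in *.txt; do
      [ -e "$f" ] || continue     # nothing was split out
      "$top/parse.pl" < "$f" >> report.csv
    done
  )
}
```

With the text file saved as, say, warn.txt, calling `run_warn_pipeline warn.txt` should leave the combined output in splits/report.csv.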

split.pl: (splits the text file into individual records)

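#! /usr/bin/perl
# Lines seen before the first record header are discarded.
open FH, ">/dev/null";
while (<>)
{
	# A header line holds the company name, a worker count, and a date.
	if ($_ =~ m#([^ ]+.*[^ ]+?).+?(\d+?)\s+(\d+/\d+/\d+)\s#)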
	{
		$comp = $1;
		$count = $2;
		$date = $3;
		
		$comp =~ s/[^\d\w]+/-/g;
		$comp =~ s/[-]+$//;
		$date =~ s/[^\d]/-/g;
		$name = "$comp.$count.$date.txt";
		open FH, ">splits/$name" or die "cannot write splits/$name: $!";
		print FH $_;
	}
	else
	{
		print FH $_;
	}
}
close FH;
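
Because split.pl replaces every non-word character in the company name and date with a dash, the only dots left in a file name are the ones separating company, count, and date. That makes the names sortable without opening the files. A small sketch, with `largest_layoffs` as a made-up helper name:

```shell
#!/bin/sh
# Sketch only: list the N largest records by worker count.
# Relies on the COMPANY.COUNT.DATE.txt names that split.pl produces.
largest_layoffs() {
  dir="${1:-splits}"              # directory of split files
  n="${2:-10}"                    # how many names to print
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue       # directory has no split files
    basename "$f"
  done | sort -t . -k 2,2nr | head -n "$n"
}
```

For example, `largest_layoffs splits 5` prints the five file names with the highest counts.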

parse.pl: (read each file and extract the interesting parts)

#! /usr/bin/perl

$line = <STDIN>; ## 1st line is the company, count, date, and part of the
                 ## location.  split on ctl-U character.
$line =~ s/[\r\n]//g;
$line =~ s/ $//g;
chomp $line;
#print "**$line**\n";
$company = ( $line =~ /(.+?)\s+?\d+/ )[0];
if ($company !~ /\cU\cU/)
{
	$company =~ s/\s+$//g;
	$line2 = <STDIN>;
	$line2 =~ s/\s\cU\cU\s*//g;
	$company .= " $line2";
	$company =~ s/\s+$//g;
}
$company =~ s/\s+\cU\cU//g;
#print "**$company**\n";
## \Q...\E keeps punctuation in the company name from being read as regex syntax.
($count, $date, $location) = ( $line =~ m#\Q$company\E[\s\cU]+?(\d+?)\s+?(\d+?/\d+?/\d+?)\s+?(\w.+?)$# );
#print "**$count**$date**$location**\n";

$line = <STDIN>;
$line =~ s/[\r\n]//g;
chomp $line;

($street, $location2) = ( $line =~ /(.+?)\s+?\cU\cU\s+([A-Z ]+?)$/ );
##print "**$street**$location2**\n";
if (! $location2)
{
	($street) = ( $line =~ /(.+?)\s+?\cU\cU/ );
	##print "**$street**\n";
}
else
{
	$location = $location . ' ' . $location2;
}
#print "**$street**$location**\n";

$line = <STDIN>;
$line =~ s/[\r\n]//g;
chomp $line;

($city, $state, $zip) = ( $line =~ /^([\w ]+?), (\w\w) ([\d-]+?)$/ );
#print "**$city**$state**$zip**\n";

if (! $zip)
{
	($city, $state, $zip, $extra) = ( $line =~ /^([\w ]+?), (\w\w) ([\d-]+?)\s+(.+)$/ );
	$location .= ' '.$extra;
}

while($line = <STDIN>) 
{
	goto BAILOUT if ($line =~ /^Company Contact Name and Telephone Number/ );
	$line =~ s/[\r\n]//g;
	chomp $line;
	#print "**$line**\n";
}

BAILOUT:
$line = <STDIN>;
$line =~ s/[\r\n]//g;
chomp $line;
($cname, $layoff_or_closure) = ( $line =~ /^(.+?)\s+?\cU\cU\s+?Layoff or Closure:  (\w+?)$/ );
#print "**$cname**$layoff_or_closure**\n";

$company_contact = <STDIN>;
$company_contact =~ s/[\r\n]//g;
chomp $company_contact;
#print "**$company_contact**\n";

while( defined($line = <STDIN>) && $line !~ /^Union Representation/ ) 
{
	## accumulate contact info here
};
$line =~ s/[\r\n]//g;
chomp $line;
#print "1**$line**\n";

while($line = <STDIN>) 
{
	goto CONT if ($line =~ /^Name and Address of Union/);
}

CONT:
$union_contact = "";
while($line = <STDIN>)
{
	goto CONT2 if $line =~ /^Job Title/ ; 
	
	$line="" if ($line =~ /Name and Address of Union Representing Employees/);
	$line =~ s/[\r\n]//g;
	chomp $line;
	$union_contact = "$union_contact\r\n$line" if ($union_contact ne "");
	$union_contact = $line if ($union_contact eq "");
};
#print "**$union_contact**\n";
CONT2:

print "\"$company\",$layoff_or_closure,$count,\"$date\",\"$location\",\"$street\",$city,$state,$zip,\"$cname\",$company_contact,\"$union_contact\"\r\n";

make.sh: (runs parse.pl on each record file and collects the output in report.csv)

#! /bin/bash

for i in *.txt ; do
  echo "$i"
  ./parse.pl < "$i" >> report.csv
done
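
Since the goal was to find unionized workplaces, the split files can also be screened directly, before looking at report.csv. The sketch below mirrors how parse.pl reads the union section: everything between the "Name and Address of Union" line and the "Job Title" line. It assumes a record with no union simply leaves that section blank, which may not hold for every record (some might say "None" or similar). The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: succeeds when a split record has text in its union section.
# Assumption: records without a union leave the section blank.
has_union_info() {
  awk '
    /^Job Title/ { in_union = 0 }
    in_union && /[^[:space:]]/ && !/Name and Address of Union/ { found = 1 }
    /^Name and Address of Union/ { in_union = 1 }
    END { exit !found }
  ' "$1"
}

# Print the names of split files that appear to list a union.
list_union_files() {
  for f in "${1:-.}"/*.txt; do
    [ -e "$f" ] || continue       # directory has no split files
    if has_union_info "$f"; then
      echo "$f"
    fi
  done
}
```

Run from inside the splits directory, `list_union_files .` prints one file name per record that has something in its union section.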