Salsa Petition Counts with Scrapy the Python Web Scraper

I was feeling like crap, so after taking the day off and taking a nap, I spent several hours learning Scrapy. Scrapy is a pretty awesome website scraper. This example logs into a website, extracts some data, and stores it locally.

It's different from WWW-Mechanize in that Scrapy's main mode of operation is multithreaded, so requests to websites are parallelized. To sequence requests, you use callbacks (which are called after a request concludes).

The other thing is, requests aren't performed when the Request object is created. I think what you do is create Requests or lists of Requests and return them from the callbacks. You end up with a structure full of Requests, and then evaluate it, a process that might take a long time.

This script was based on the Scrapy tutorial, but it has a login sequence, and after logging in, makes multiple requests, extracts petition results, and then saves the results to a MySQL database table.

Because requests are potentially parallelized, you either need to use an object to total up the values, or use an external database. There isn't an easy way to pass values up along the call chain, and I wanted to record the values to make charts, anyway.

This script goes into the spiders/ directory.

from scrapy.spider import Spider
from scrapy.http import FormRequest
from scrapy.http import Request
from scrapy.http import Response
from scrapy.selector import Selector
from datetime import date
import MySQLdb as mdb
import re

class SalsaSpider(Spider):
    name = "salsa"
    allowed_domains = [
        "wfc2.salsalabs.com",
        "hq.salsalabs.com",
        "hq-afl.salsalabs.com"
    ]
    start_urls = [
        "https://wfc2.salsalabs.com/dia/hq/sso/",
    ]
    data_urls = [
        "https://hq-afl.salsalabs.com/salsa/hq/p/dia/action3/common/hq/report?action_KEY=8164",
        "https://hq-afl.salsalabs.com/salsa/hq/p/dia/action3/common/hq/report?action_KEY=8132",
        "https://hq-afl.salsalabs.com/salsa/hq/p/dia/action3/common/hq/report?action_KEY=7894"
    ]
    count = 0
    timestamp = ''
    dbhost = '127.0.0.1'
    dbuser = 'dbusername'
    dbpass = '----'
    dbname = 'salsadata'
    salsaemail = 'email@example.com'
    salsapass = '----'

    def parse(self, response):
        self.timestamp = date.today().isoformat()
        con = mdb.connect( self.dbhost, self.dbuser, self.dbpass, self.dbname )
        cur = con.cursor()
        cur.execute('DELETE FROM petitioncounts WHERE date=%s', ( self.timestamp ))
        con.commit()
        con.close()
        return [FormRequest.from_response( response, 
                formname='f',
                formdata={'email': self.salsaemail,
                        'password': self.salsapass },
                callback=self.after_login)]
    
    def after_login( self, response ):
        requests = []
        for data_url in self.data_urls:
                requests.append( Request(data_url, self.save_data) )
        return requests
        
    def save_data( self, response ):
        url = response.url
        m = re.search('=([0-9]+?)$', url)
        petition_id = int( m.group(1) )
        sel = Selector(response)
        count = sel.xpath('//td[@class="last"]').xpath('//b/text()').extract()[1]
        print '----count'
        print count
        print '----------'
        try:
                con = mdb.connect( self.dbhost, self.dbuser, self.dbpass, self.dbname )
                cur = con.cursor()
                cur.execute('INSERT INTO petitioncounts (`date`,`petition_id`,`count`) VALUES (%s, %s, %s)', ( self.timestamp, petition_id, count ))
                con.commit()
        except mdb.Error, e:
                print "Error:", e
                exit(1)
        finally:
                if con:
                        con.close()

Here's the database schema:

CREATE TABLE `petitioncounts` (
  `petition_id` int(6) DEFAULT NULL,
  `count` int(7) DEFAULT NULL,
  `date` date DEFAULT NULL,
  KEY `petition_ids` (`petition_id`),
  KEY `dates` (`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

(Latin-1? How did that end up in there?)

I ended up learning how to write CGI scripts in Python, too.

Here's a CGI program that will dump the data to an HTML table:

#! /usr/bin/python

import cgi
import MySQLdb as mdb

dbhost = '127.0.0.1'
dbuser = 'user'
dbpass = '----'
dbname = 'salsadata'

sql = 'select date,sum(count) from (select distinct * from petitioncounts) as pc group by pc.date order by pc.date'

try:
	con = mdb.connect( dbhost, dbuser, dbpass, dbname )
	cur = con.cursor()
	cur.execute(sql)
	results = cur.fetchall()
except mdb.Error, e:
	print "Error:", e
	exit(1)
finally:
	con.close()

print "Content-type: text/html"
print
print "<html><head><title>pope2la petition counts</title></head><body>"
print "<table>"
for row in results:
	date = row[0]
	count = row[1]
	print "<tr><td>", date, "</td><td>", count, "</td></tr>"
print "</table>"
print "</body></html>"

And here's a script that will total the latest numbers and email them to you.

#! /usr/bin/python

import MySQLdb as mdb
import smtplib

dbhost = '127.0.0.1'
dbuser = 'user'
dbpass = '-----'
dbname = 'salsadata'
mailhost = dbhost
sender = "johnk@riceball.com"
receivers = [ "johnk@riceball.com" ]

sql = 'select date,sum(count) from (select distinct * from petitioncounts) as pc group by pc.date order by pc.date DESC LIMIT 1'

try:
	con = mdb.connect( dbhost, dbuser, dbpass, dbname )
	cur = con.cursor()
	cur.execute(sql)
	results = cur.fetchall()
except mdb.Error, e:
	print "Error:", e
	exit(1)
finally:
	con.close()

for row in results:
	date = row[0]
	count = row[1]
	thecount = '%s : %s' % (date, count)

smtpObj = smtplib.SMTP( mailhost )

message = """From: %s
Subject: Count for %s

%s

A list of daily totals is at:
http://192.168.1.55/cgi-bin/peti...

This is accessible only in the office.
""" % ( sender, thecount, thecount )

smtpObj.sendmail( sender, receivers, message )