Advanced Screen Scraping With wget (and Mailarchiva)

I was testing a new product called Mailarchiva, and I misunderstood the instructions. The upshot was that a mailbox full of messages was moved into Mailarchiva, and I wanted to restore them to the mailbox.

Mailarchiva comes with a tool to decrypt its message store, but it didn’t work. The main product and the utility package had gotten out of sync, and the one tool I needed stopped working (because a method’s type signature changed). Also, despite this being an open source project, the sources for the utilities weren’t up on sf.net, so I couldn’t rebuild the program to make it work.

Not being much of a Java programmer, I had a hard time coaxing the system to the point where it would run without an exception. The problem was that the utility’s libraries expected one format for the message store, and the server’s expected another. It was getting really difficult.

I had some manually produced backups, but not of the current month. (I didn’t follow my own advice not to test with live data.)

You just can’t win, sometimes.

The solution, sort of, was to use the website downloader, wget, to interact with the app via its web interface, and use that to download the messages to files. Screen scraping.

First, I found a page with great examples:
http://drupal.org/node/118759#comment-286253

Then, a quick visit to the wget manual:
http://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html#Types-of-Files

Here’s the short version of how to do it:

The first step is to figure out how to log in and get a cookie.

The second step is to figure out how to download the messages.

The third is to figure out the range of pages in the results, and then write a loop to recursively download the messages from each set.

Then, finally, copy the .EML files up to the server via Outlook Express.

Here’s the long version:

First, you have to submit a web form, and get a session id in a cookie. Here’s the command I used:

wget -S --post-data='j_username=admin&j_password=fakepass' http://192.168.1.103:8090/mailar...

192.168.1.103 is the IP address of my test installation.

The --post-data option lets you submit the login form, as if you were typing it in and submitting it. To find the URL to submit to, you look at the source of the login form.
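Reading the form source by eye works, but a rough sketch like this can pull out the interesting bits from a saved copy of the login page. The function names form_actions and input_names are my own, and the patterns assume lowercase, double-quoted attributes, so treat it as a starting point rather than a real HTML parser.

```shell
#!/bin/sh
# Sketch: list form targets and field names from a saved HTML page.
# Assumes attributes are written as action="..." and name="..." (lowercase,
# double-quoted). Relies on grep -o and sed -E (available in GNU and BSD tools).

form_actions() {
    # $1 = saved HTML file; prints the action attribute of each <form> tag
    grep -i -o -E '<form[^>]*>' "$1" |
        sed -n -E 's/.*[[:space:]]action="([^"]*)".*/\1/p'
}

input_names() {
    # $1 = saved HTML file; prints the name attribute of each <input> tag
    grep -i -o -E '<input[^>]*>' "$1" |
        sed -n -E 's/.*[[:space:]]name="([^"]*)".*/\1/p'
}
```

Save the login page with wget -O login.html and the page's URL, then run form_actions login.html and input_names login.html. A relative action still has to be resolved against the URL of the page it came from.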

Then, you inspect the output, looking for the cookie in the response headers (the -S option prints them). Then, concoct a longer, more complex command to submit the search form:

wget --header="Cookie: JSESSIONID=62141726A04B7C8BDE24C32514EB19F3; Path=/mailarchiva" --post-data='criteria[0].field=subject&criteria[0].method=all&criteria[0].query=&dateType=sentdate&after=1/1/09 1:00 AM&before=12/18/09 11:59 PM&submit.search=Search' http://192.168.1.103:8090/mailar...

Note that we’re passing the cookie back.
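Copying the session ID by hand gets old fast. As an alternative sketch (not what I ran), wget can keep the cookie in a file with --save-cookies, --keep-session-cookies and --load-cookies. LOGIN_PATH and SEARCH_PATH below are placeholders, not the real paths; they have to come from the action attributes of the two forms. The output file names are also just my choices.

```shell
#!/bin/sh
# Sketch: let wget manage the session cookie instead of pasting JSESSIONID.
# LOGIN_PATH and SEARCH_PATH are placeholders (assumptions); fill them in
# from the forms' action attributes.
BASE_URL="${BASE_URL:-http://192.168.1.103:8090}"
LOGIN_PATH="${LOGIN_PATH:-/REPLACE/with/login/action}"
SEARCH_PATH="${SEARCH_PATH:-/REPLACE/with/search/action}"
COOKIE_JAR="${COOKIE_JAR:-cookies.txt}"

login() {
    # $1 = user name, $2 = password
    # --keep-session-cookies is needed because session cookies have no
    # expiry date and would otherwise not be written to the jar.
    wget -S -O login.html \
        --save-cookies "$COOKIE_JAR" --keep-session-cookies \
        --post-data="j_username=$1&j_password=$2" \
        "$BASE_URL$LOGIN_PATH"
}

search() {
    # $1 = complete POST body for the search form
    wget -O results.html \
        --load-cookies "$COOKIE_JAR" \
        --post-data="$1" \
        "$BASE_URL$SEARCH_PATH"
}
```

With the placeholders filled in, login admin fakepass followed by search with the same POST body as above should behave like the hand-built Cookie header.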

Inspecting the resultant file will reveal that the search worked!
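One quick way to do that inspection, sketched here with helper names I made up, is to count the distinct message links in the saved results page. The two page names are the ones the spidering step targets; the exact shape of the links is an assumption.

```shell
#!/bin/sh
# Sketch: check a saved results page for links to messages.
# Assumes the links contain viewmail.do or downloadmessage.do.

message_links() {
    # $1 = saved results page; prints each distinct message link once
    grep -o -E '(viewmail|downloadmessage)\.do[^"<> ]*' "$1" | sort -u
}

count_message_links() {
    # $1 = saved results page; prints the number of distinct message links
    message_links "$1" | wc -l | tr -d ' '
}
```

A count of zero usually means the login or the search didn't take, and the saved page is probably the login form again.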

Then, you try to download the messages by spidering the links from the results, and downloading the files that end in .eml.

wget -r -l 2 -A "*viewmail.do*" -A "*downloadmessage.do*" -R "signoff.do" -R "search.do" -R "configurationform.do" --header="Cookie: JSESSIONID=62141726A04B7C8BDE24C32514EB19F3; Path=/mailarchiva" --post-data='criteria[0].field=subject&criteria[0].method=all&criteria[0].query=&dateType=sentdate&after=1/1/09 1:00 AM&before=12/18/09 11:59 PM&submit.search=Search' http://192.168.1.103:8090/mailar...

That pretty much does what I want, but I need to do it for a bunch of pages. The quick solution is to use the browser to find out what the last page of results is, and then write the following shell script:

for i in 1 2 3 4 5 ; do
wget -r -l 2 -A '*viewmail.do*' -A '*downloadmessage.do*' -R 'signoff.do' -R 'configurationform.do' --header='Cookie: JSESSIONID=62141726A04B7C8BDE24C32514EB19F3; Path=/mailarchiva' --post-data='criteria[0].field=subject&criteria[0].method=all&criteria[0].query=&dateType=sentdate&after=1/1/01 1:00 AM&before=12/18/09 11:59 PM&page='$i http://192.168.1.103:8090/mailar...
done
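For more than a handful of pages, a slightly more general sketch might look like this. LAST_PAGE, the function names, and the percent-encoding of the dates are my additions; the script above sent the dates raw and that worked, so the encoding is only there in case a server is stricter. The search URL and session ID are passed in rather than guessed.

```shell
#!/bin/sh
# Sketch: loop over result pages with a configurable last page.
# AFTER, BEFORE and LAST_PAGE are illustrative values; take the real last
# page number from the browser.
AFTER="${AFTER:-1/1/01 1:00 AM}"
BEFORE="${BEFORE:-12/18/09 11:59 PM}"
LAST_PAGE="${LAST_PAGE:-5}"

urlencode() {
    # Percent-encode everything except unreserved characters (ASCII only).
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
        s=$rest
    done
    printf '%s\n' "$out"
}

page_post_data() {
    # $1 = page number. Same fields as the search, without submit.search,
    # plus the page parameter.
    printf '%s' 'criteria[0].field=subject&criteria[0].method=all&criteria[0].query=&dateType=sentdate'
    printf '&after=%s&before=%s&page=%s\n' \
        "$(urlencode "$AFTER")" "$(urlencode "$BEFORE")" "$1"
}

fetch_pages() {
    # $1 = search URL, $2 = session ID from the login step
    page=1
    while [ "$page" -le "$LAST_PAGE" ]; do
        # Path is a Set-Cookie attribute, so it is left out of the request.
        wget -r -l 2 -A '*viewmail.do*' -A '*downloadmessage.do*' \
            -R 'signoff.do' -R 'configurationform.do' \
            --header="Cookie: JSESSIONID=$2" \
            --post-data="$(page_post_data "$page")" \
            "$1"
        page=$((page + 1))
    done
}
```

Running LAST_PAGE=12 and then fetch_pages with the search URL and the session ID would walk pages 1 through 12.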

Note that a parameter was added to the POST data: page.

A parameter was also removed: the submit value (submit.search). Submitting the old value seemed to prevent the paging. There’s probably a branch in the code based on the type of “submit” you’re sending, because there are a few different buttons, with different effects.

Again, that’s discovered by reading the sources and experimenting.

So, I ran the script and waited a long time. Then, I shared the data via Samba (I coded this on a Linux box, but ran the application on Windows). A nice side effect was that the shared files displayed DOS 8.3 filenames. So, the messages, which were originally named “blah.do?id=21341342334.eml” became “BADJFU~5.EML”.
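If you'd rather not rely on the Samba name mangling, a sketch like this could rename the files on the Linux side before sharing them. It assumes the file names have the shape quoted above (something, then ?id=, then a number and .eml); rename_messages is a name I made up.

```shell
#!/bin/sh
# Sketch: strip everything up to and including "id=" from downloaded names,
# so "blah.do?id=21341342334.eml" becomes "21341342334.eml".
# Assumes names of the form <anything>?id=<number>.eml in one directory.

rename_messages() {
    # $1 = directory holding the downloaded files
    dir=$1
    for f in "$dir"/*'?id='*.eml; do
        [ -e "$f" ] || continue          # glob matched nothing
        id=${f##*id=}                    # keep the part after the last "id="
        [ -e "$dir/$id" ] && continue    # don't overwrite an existing file
        mv -- "$f" "$dir/$id"
    done
}
```

Run it on a copy of the download directory first, since wget's recursive mode may put the files several directories deep.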

To upload, I used Outlook Express. Despite its bad reputation, OE is good at interacting with IMAP mailboxes, and its support for the .EML file format seems to be good.

Wget saved the day (but it was a long day).

Lessons learned. Or really, “lessons refreshed” is what happened.

I should have set up a test account, put mail into it, then archived it.

Additionally, I should remember that when dealing with “enterprise” software, it’s not going to work like Windows or Mac (or even Linux) software. Larger businesses are assumed to have certain processes that SOHO businesses don’t.

This would be a perfect application for a web service. It would avoid all the program execution problems. Instead of accessing the data through a command-line application, you’d access it over the network, using a simple interface.

Additionally, this kludgy rescue would have been impossible if the application had been written to use a Swing GUI or a native GUI. The web interface made it possible to scrape the data out of the system.

As for Mailarchiva – if you are trying to archive your own mail server, it seems to be a good product. The docs could use some work 🙂 I found others, but Mailarchiva running on a Linux box would probably be the most stable solution. The bad news is that it’s not intended for archiving personal email accounts like Gmail, AOL and ISP accounts. So, it wasn’t the right tool for me.

What I really need is a free/cheap archiver for products like Gmail. It would both mirror and archive the IMAP folders, but allow the user to hold on to emails for as long as they wanted. So far, what I’ve found either doesn’t do folders, or doesn’t do archiving. Archiving is just saving every single email it sees, and retaining messages even if they’re deleted.
