Screen Scraping a Cell Phone Photo

This is a simple screen scraper that pulls an image from a cell-phone picture-messaging page. When someone sends you a picture this way, you can use the script to download the file. The site appears to belong to a Canadian internet company.

There were a few bumps on the road to success. The first was finding exactly where the photo's URL was. To find it, I opened the page, did a "view source", then went back to the page and did a "view image". Then I searched the source for parts of the image URL. It turned out the URL was in a bit of JavaScript, not in an IMG tag.

Plucking the URL out is simple. Copy the text surrounding the URL, then turn it into a regex: escape every ( and ) character, escape each / and \, and escape any quote characters that would otherwise end the PHP string holding the pattern. Then replace the URL part with (.+?) and follow it with some predictable, unique text. Finally, wrap the whole pattern in / and / delimiters.
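The escaping steps can also be left to PHP's preg_quote(). What follows is only a minimal sketch: the $source line and the file name in it are made up for illustration, and $before and $after are hypothetical names for the text copied from either side of the URL.

```php
<?php
// Hypothetical line of page source, modelled on a JavaScript call that carries the image path.
$source = "slideshowObjectInfo(1, 'image','/mmps/RECIPIENT/abc123.jpg?size=1&amp;id=2', 'caption');";

// Literal text copied from just before and just after the URL part.
$before = "slideshowObjectInfo(1, 'image','/mmps/RECIPIENT/";
$after  = "'";

// preg_quote() escapes regex metacharacters, plus the "/" delimiter given as its second argument.
$pattern = '/' . preg_quote($before, '/') . '(.+?)' . preg_quote($after, '/') . '/';

// The lazy (.+?) stops at the first occurrence of the trailing text.
$captured = preg_match($pattern, $source, $matches) === 1 ? $matches[1] : null;

echo $pattern, "\n";
var_dump($captured);
```

Building the pattern from literal pieces keeps the hand-escaping mistakes out of the regex; only the capture group is written by hand.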

Next, prepare to fetch the page. It turns out that their app server checks the USER_AGENT, probably to deliver content to mobile devices or to decide what kind of script code to send. I wanted to look like Mozilla, so I found an appropriate user-agent string on the web.
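If you would rather not change the user agent for the whole script, PHP's HTTP stream context accepts a user_agent option per request. This is a sketch under assumptions: makeBrowserContext() is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```php
<?php
// Build a stream context that sends the chosen User-Agent on requests that use it.
function makeBrowserContext(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,
            'timeout'    => $timeoutSeconds, // assumed value, adjust to taste
        ],
    ]);
}

$context = makeBrowserContext('Mozilla/5.001 (Windows NT5; N; x86; ja) Gecko/25250101 MegaCorpBrowser/1.0');

// Usage (not executed here, since it needs network access):
// $text = file_get_contents($pageName, false, $context);
```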

Fetch the page and run the regex against it. If the match bears fruit, fetch the image URL (prepending the site's path to the captured part) and save the result to a file.


<?php
	// your url
	$pageName= "http://picturemessaging.rogers.com/share.do?invite=lEYrALTEREDh1U";
	// this matches the bit of code in the page with the image url
	$match = "/slideshowObjectInfo\(1, 'image','\/mmps\/RECIPIENT\/(.+?)'/";
	
	// pose as a desktop browser, since the server checks the user agent
	ini_set('user_agent','Mozilla/5.001 (Windows NT5; N; x86; ja) Gecko/25250101 MegaCorpBrowser/1.0');
	
	// this reads the page into a string
	$text = file_get_contents($pageName);
	// this tries to pull out the URL for the picture
	preg_match($match, $text, $matches);
	// this tries to pull down the picture
	if (!empty($matches[1]))
	{
		$image = file_get_contents('http://picturemessaging.rogers.com/mmps/RECIPIENT/'.html_entity_decode($matches[1]));
		file_put_contents( 'image.jpg', $image);
	}
?>
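One detail in the script above is the html_entity_decode() call. URLs embedded in page source often have & written as &amp;, and the server expects the plain form. A small sketch of that step follows; buildImageUrl() is a hypothetical helper, and the captured value is made up.

```php
<?php
// Join the site's base path with a captured path, undoing HTML entity encoding first.
function buildImageUrl(string $basePath, string $captured): string
{
    return $basePath . html_entity_decode($captured);
}

// Hypothetical captured value; the file name and query parameters are invented.
$url = buildImageUrl(
    'http://picturemessaging.rogers.com/mmps/RECIPIENT/',
    'abc123.jpg?size=1&amp;id=2'
);
echo $url, "\n";
```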

After a couple of days, I did some more work on this spider.

Here's a new version of the script. It is a little better, and it tells you what it's doing. It works when it only has to fetch the large version of the first image, but it might fail if the message has more than one photo. For that, you'll need something that can parse the relevant values out of the JavaScript and simulate the JavaScript's URL-generating code to produce the URLs to follow.
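As a rough idea of what simulating the URL-generating code might look like, the sketch below builds one query-string suffix per picture. It assumes the JavaScript simply fills in a slide number and a picture count, and that slides are numbered from 0 (the single-photo suffix slide=0 suggests this, but I have not confirmed it). largeImageSuffixes() is a hypothetical name.

```php
<?php
// Build the query-string suffix for each picture in a message.
// Assumption: slide numbers start at 0 and run up to pictureCount - 1.
function largeImageSuffixes(int $pictureCount): array
{
    $suffixes = [];
    for ($slide = 0; $slide < $pictureCount; $slide++) {
        $suffixes[] = http_build_query(
            ['slide' => $slide, 'pictureCount' => $pictureCount, 'fromMessage' => 'true'],
            '',
            '&'
        );
    }
    return $suffixes;
}

print_r(largeImageSuffixes(2));
```

Each suffix could then stand in for the fixed linkSuffix value when following the large-image link.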


<?php
$patterns = array(
    array(    'match' => "#viewLargeURL = '(.+?)slide=' \+ whichImg \+ '&pictureCount=' \+ pictureCount \+ '&fromMessage=#",
              'postProcess' => '',
              'linkPrefix' => 'http://picturemessaging.rogers.com',
           'linkSuffix' => 'slide=0&pictureCount=1&fromMessage=true',
              'fileName' => 'imagepage.txt' ),
    array(    'match' => "# '',
              'linkPrefix' => 'http://picturemessaging.rogers.com',
           'linkSuffix' => '',
              'fileName' => 'image.jpg' )
);
$startPageUrl= "http://picturemessaging.rogers.com/share.do?invite=lEYr42JXYkkCVY8zkh1U";

spider( $startPageUrl, $patterns );

function spider( $pageName, $patterns )
{
    // pose as a desktop browser, since the server checks the user agent
    ini_set('user_agent','Mozilla/5.001 (Windows NT5; N; x86; ja) Gecko/25250101 MegaCorpBrowser/1.0');

    echo 'getting: '.$pageName."\n\n";
    $text = file_get_contents($pageName);

    foreach( $patterns as $rule )
    {
        echo "matching ".htmlspecialchars($rule['match'])."\n\n";
        preg_match($rule['match'], $text, $matches);
        if (!empty($matches[1]))
        {
            $match = $matches[1];
            echo "match succeeded, found $match\n\n";
            if ($rule['postProcess'])
                $match = $rule['postProcess']($match);
            $url = $rule['linkPrefix'].$match.$rule['linkSuffix'];
            echo 'getting: '.$url."\n\n";
            // the fetched page becomes the text the next rule is matched against
            $text = file_get_contents($url);
            if ($rule['fileName'])
                file_put_contents( $rule['fileName'], $text);
        }
        else
        {
            echo "match failed\n\n";
        }
    }
}
?>

Comments

The repetitive task of getting URLs and following them was turned into a kind of "engine". Each step is basically like the last. The real difficulty in generalizing this is that the most important parts of the page are generated by JavaScript.
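The "engine" idea, where each step matches, builds a URL, and fetches, can be tried without a network by passing the fetch step in as a function. This is a sketch only: followRules(), the $fetch callable, and the example.test pages are all made up for illustration.

```php
<?php
// Follow a chain of rules: match in the current text, build a URL, fetch it, repeat.
// $fetch is a hypothetical callable that takes a URL and returns its body, or false.
function followRules(string $startUrl, array $rules, callable $fetch): array
{
    $visited = [$startUrl];
    $text = $fetch($startUrl);
    foreach ($rules as $rule) {
        if ($text === false || preg_match($rule['match'], $text, $m) !== 1) {
            break; // the chain is broken, so stop following links
        }
        $url = $rule['linkPrefix'] . $m[1] . $rule['linkSuffix'];
        $visited[] = $url;
        $text = $fetch($url);
    }
    return $visited;
}

// Stand-in for the network: an invented three-page site held in an array.
$pages = [
    'http://example.test/start'      => "var next = '/page2?id=7';",
    'http://example.test/page2?id=7' => "var img = '/img/7.jpg';",
    'http://example.test/img/7.jpg'  => 'JPEGDATA',
];
$fetch = function (string $url) use ($pages) {
    return $pages[$url] ?? false;
};

$rules = [
    ['match' => "#var next = '(.+?)'#", 'linkPrefix' => 'http://example.test', 'linkSuffix' => ''],
    ['match' => "#var img = '(.+?)'#",  'linkPrefix' => 'http://example.test', 'linkSuffix' => ''],
];

$visited = followRules('http://example.test/start', $rules, $fetch);
print_r($visited);
```

Swapping $fetch for a function that wraps file_get_contents() would turn the same loop loose on a real site.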

The correct way to handle this is to run the spider inside a browser with JavaScript. That way, you can get the final rendered page and spider that output instead of the code that produces it. Easier described than implemented, as usual :-).