Keyword Analysis or Discovery (a first try)

I was messing around with some textual analysis, trying to figure out how to do a “related articles” feature in Drupal. The problem with most systems is that they require someone to choose tags, which is additional work on top of the writing and initial categorization.

This script uses a simple SEO technique: it counts how often each word, word pair, and word triplet appears in the text. The output is not a “related stories” list, but it’s a starting point.
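
Pairs and triplets are the same sliding window at two different widths. Here is a minimal sketch of that idea on a toy sentence. The helper name `count_ngrams` and the sample sentence are made up for illustration, and unlike the full script it does no filtering of common or connector words.

```php
<?php
// Hypothetical helper (not part of the script below): counts every run of
// $n consecutive words in $words.
function count_ngrams(array $words, int $n): array
{
    $counts = array();
    $last = count($words) - $n;           // last index where a full window fits
    for ($i = 0; $i <= $last; $i++) {
        $gram = implode(' ', array_slice($words, $i, $n));
        if (!isset($counts[$gram])) {
            $counts[$gram] = 0;
        }
        $counts[$gram]++;
    }
    return $counts;
}

// Made-up sample text, already lowercased and free of punctuation.
$words = explode(' ', 'safe schools need safe schools policy');

$pairs = count_ngrams($words, 2);
arsort($pairs);                           // most frequent pair first
print_r($pairs);
```

With this input, “safe schools” is the only pair that occurs twice, so it sorts to the top. Calling the helper with 1 or 3 gives single-word or triplet counts in the same way.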

<?php
/* reads text from stdin, prints keywords (words, pairs, triplets) with counts */
// entries are lowercase because calc() lowercases the text before matching
$common_words = array('the','be','to','of','and','a','in','that','have','i','it','for','not','on','with',
'he','as','you','do','at','this','but','his','by','from','they','we','say','her','she','or','an','will',
'my','one','all','would','there','their','what','so','up','out','if','about','who','get','which','go',
'me','when','make','can','like','time','no','just','him','know','take','people','into','year','your',
'good','some','could','them','see','other','than','then','now','look','only','come','its','over','think',
'also','back','after','use','two','how','our','work','first','well','way','even','new','want','because',
'any','these','give','day','most','us','is','are',"don't",'has','was',
'by the','of a','to the','of the','on the','and the','in the','has also','for this',
'which is','in a','not to','but it','is that','that is');

$topic_words = array('immigration'=>1,'immigrant'=>1,'action'=>1,'police'=>1,'environment'=>1,'liberation'=>1,
'undocumented'=>1,'election'=>1,'gay'=>1,'pesticides'=>1);

$connector_words = array('in'=>1,'is'=>1,'which'=>1,'that'=>1,'for'=>1,'and'=>1,'but'=>1,'to'=>1,'the'=>1,'from'=>1);

$text = '';
$fh = fopen("php://stdin","r");
// compare against false so a line containing only "0" does not end the loop
while( ($line = fgets($fh)) !== false )
    $text .= $line;

$result = calc($text);
print_r($result);

function calc( $text ) 
{
    global $common_words,$topic_words, $connector_words;
    foreach($common_words as $word)
        $common[$word] = 1;
    $o = array();
    $text = strtolower($text);
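    // the u flag makes PCRE treat the curly quotes as whole UTF-8 characters (input must be valid UTF-8)
    $text = preg_replace("/['’]s/u",'',$text); // no possessives
    $text = preg_replace('/[&;:"“”(),.~]/u','',$text); // strip punctuation
    $text = preg_replace("/[\n\r]+/",' ',$text); // line breaks become spaces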
    $words = explode(' ',$text);
    
    // now make an array of pairs
    $pairs = array();
    $last_word = '';
    foreach($words as $word)
    {
        // keep a pair only if both words are non-empty and neither is a connector word
        if ($last_word !== '' and $word !== '' and !isset($connector_words[$word]) and !isset($connector_words[$last_word])) array_push($pairs, $last_word . ' ' . $word);
        $last_word = $word;
    }
    foreach($pairs as $pair)
    {
        if (!isset($common[$pair]))
        {
            if (!isset($o[$pair])) $o[$pair]=0;
            $o[$pair]++;
        }
    }

    // now make an array of triplets
    $triplets = array();
    $word1 = $word2 = '';
    foreach($words as $word3)
    {
        // keep a triplet only if all three words are non-empty and none is a connector word
        if ($word1 !== '' and $word2 !== '' and $word3 !== '' and !isset($connector_words[$word1]) and !isset($connector_words[$word2]) and !isset($connector_words[$word3])) array_push($triplets, $word1 . ' ' . $word2 . ' ' . $word3);
        $word1 = $word2;
        $word2 = $word3;
    }
    foreach($triplets as $word)
    {
        if (!isset($common[$word]))
        {
            if (!isset($o[$word])) $o[$word]=0;
            $o[$word]++;
        }
    }

    // count single words that are not in the common-word list
    foreach($words as $word)
    {
        if (!isset($common[$word]))
        {
            if (isset($o[$word]))
                $o[$word]++;
            else
                $o[$word] = 1;
        }
    }
    unset($o['']);
    foreach($o as $key=>$value)
        if ($value == 1)
            unset($o[$key]);
    asort($o);
    // count how often each topic word appears
    $t = array();
    foreach($words as $word)
    {
        if (isset($topic_words[$word]))
        {
            if (isset($t[$word]))
                $t[$word]++;
            else
                $t[$word] = 1;
        }
    }
    return array($o,$t);
}

Sample output:

johnk@johnk-desktop:~/Sites/test$ php keyword.php < text8
Array
(
    [0] => Array
        (
            [environmental] => 2
            [parents] => 2
            [pesticides] => 2
            [robina suwol] => 2
            [organizations] => 2
            [internationally] => 2
            [pest] => 2
            [safety] => 2
            [6] => 2
            [more] => 2
            [heart of] => 2
            [california safe] => 2
            [heart] => 2
            [california safe schools] => 2
            [robina] => 2
            [nation] => 2
            [students] => 2
            [green] => 3
            [health] => 3
            [safe] => 3
            [suwol] => 3
            [safe schools] => 3
            [schools] => 3
            [children] => 3
             => 3
            [california] => 4
            [policy] => 4
            [school] => 7
        )

    [1] => Array
        (
            [pesticides] => 2
        )
)
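
The script stops at keyword counts. One possible next step toward a “related articles” list is to score two articles by the keywords they share. This is only a sketch of that idea, not something the script above does. The function `keyword_overlap` and both sample arrays are hypothetical. The inputs are assumed to be shaped like the first array that `calc()` returns.

```php
<?php
// Hypothetical scoring helper: for every keyword that both articles share,
// add the smaller of the two counts.
function keyword_overlap(array $a, array $b): int
{
    $score = 0;
    // array_intersect_key keeps the entries of $a whose keys also exist in $b
    foreach (array_intersect_key($a, $b) as $key => $count) {
        $score += min($count, $b[$key]);
    }
    return $score;
}

// Made-up keyword counts for two articles.
$article1 = array('safe schools' => 3, 'pesticides' => 2, 'policy' => 4);
$article2 = array('pesticides' => 5, 'policy' => 1, 'election' => 2);

echo keyword_overlap($article1, $article2), "\n"; // prints 3
```

Taking the smaller count keeps one heavily repeated word from dominating the score. Whether that is the right weighting for real articles is an open question.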