Object-Oriented Parser for iCalendar

I’ve been working on a parser for ICS files, written in an OO style: parts of the data are instantiated as objects, and the parse tree is a hierarchy of objects. Searches for “OO parser” and “object oriented parser” mostly turned up OO parser generators, or YACC for OO parsers. Very meta. Most hits were abstracts of CS papers, but one was an easy-to-read article about a parser for Pascal.

Searching for iCalendar parsers was more fruitful, but didn’t turn up much code. The File_IMC module in PEAR hasn’t been maintained since 2003, and it produces arrays, not object trees. It seems complete, though, so it’s a one-stop solution. One highlight was qCal, an OO library for iCalendar. It’s supposed to have a parser, but the parser wasn’t finished. Then again, neither is mine. It’s a first draft, but it has one really cool feature: there’s not much code, and it uses subclassing magic.

The way it’s done is a little unorthodox. Typically, you tokenize based on whitespace and punctuation, but iCalendar has very little syntax and grammar (despite what the standard might look like). It’s like XML in that way: little syntax, lots of semantics. ICS files are written as lines, and a long logical line can be folded across several physical lines, with each continuation line starting with a space or tab. So, I treat a logical “line” as the basic unit, then break it into parts to produce a “token”. These tokens are assembled into the parse tree.
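
To make the folding rule concrete before the full listing, here is a minimal sketch that unfolds an ICS string held in memory. The function name ics_unfold and the sample event are made up for illustration; the tokenizer in the listing works on a file handle instead.

```php
<?php
// Hypothetical helper, not part of the parser in this post: unfold an
// ICS string held in memory into logical lines.
function ics_unfold(string $text): array
{
    // A line break followed by a space or tab is a fold: drop the break
    // together with that single whitespace character.
    $unfolded = preg_replace('/\r?\n[ \t]/', '', $text);

    // Split what remains into logical lines, skipping empty ones.
    return preg_split('/\r?\n/', $unfolded, -1, PREG_SPLIT_NO_EMPTY);
}

// Sample data: the SUMMARY property is folded across two physical lines.
$ics = "BEGIN:VEVENT\r\n"
     . "SUMMARY:Lunch with the whole\r\n"
     . "  team\r\n"
     . "END:VEVENT\r\n";

print_r(ics_unfold($ics));
```

Only the first whitespace character of a continuation line belongs to the fold, so the second space in the sample survives as the space between “whole” and “team”.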


<?php // vim:ai:ts=4 sw=4: 
include_once('core/Database.class.php');

# this is pseudocode

# http://www.ietf.org/rfc/rfc2445.txt

class ICSImporter 
{
        function __construct( $parent )
        {
                // Accept either the tokenizer itself or a parent node that holds it.
                if ($parent instanceof ICSTokenizer)
                        $this->tokenizer = $parent;
                else
                        $this->tokenizer = $parent->tokenizer;
                $this->read();
                unset($this->tokenizer);
        }
        function read()
        {
                while( $token = $this->tokenizer->next() )
                {
                        switch( $token['key'] )
                        {
                                case 'BEGIN':
                                        $classname = 'ICS'.rtrim($token['value']);
                                        $this->tree[] = new $classname( $this );
                                break;
                                case 'END':
                                        unset($this->tokenizer);
                                        return $this;
                                break;
                        }
                }
        }
}

class ICSVCALENDAR extends ICSImporter
{
}
class ICSVTIMEZONE extends ICSImporter
{
}
class ICSVEVENT extends ICSImporter
{
}
class ICSSTANDARD extends ICSImporter
{
}
class ICSDAYLIGHT extends ICSImporter
{
}
class ICSVALARM extends ICSImporter
{
}

/**
 * Lexer/Tokenizer
 *
 * iCalendar is basically a line-based language with little grammar, 
 * so we'll treat a "line" as a single token.  A "line" can extend
 * over multiple lines, as specified in the RFC.
 */
class ICSTokenizer 
{
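        function __construct( $fhandle )
        {
                $this->filehandle = $fhandle;
                // No lookahead line has been read yet.
                $this->nextline = Null;
                // Expose the same ->tokenizer property that parser nodes have.
                $this->tokenizer = $this;
        }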
        /**
         * Reads a line, and breaks it up into its parts.
         */
        function next()
        {
                $t = array();

                $line = $this->read();
                if ($line == Null) return Null;

                preg_match( '/^(.+?):(.*)/', $line, $matches );
                $t['value'] = $matches[2];
                $left = $matches[1];

                preg_match( '/^([A-Z_-]+);*(.*)/', $left, $matches );
                $t['key'] = $matches[1];
                $paramPairs = $matches[2];
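                // explode() replaces split(), which was removed in PHP 7.
                $params = explode( ';', $paramPairs );
                $p = array();
                foreach( $params as $param )
                {
                        // Pad to two elements so a parameter without "=" gets a Null value.
                        list( $pkey, $pval ) = array_pad( explode( '=', $param, 2 ), 2, Null );
                        if ($pkey) $p[$pkey] = $pval;
                }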
                if (count($p)>0) $t['params'] = $p;

                return $t;
        }
        /**
         * Read with a lookahead of one line, to handle tokens
         * that span more than one physical line.
         */
        function read()
        {
                // lookahead part - use previous line if it's available.
                if ($this->nextline)
                {
                        $line = $this->nextline;
                        $this->nextline = Null;
                }
                else
                {
                        if (feof($this->filehandle)) return Null;
                        $line = fgets($this->filehandle);
                }

                // load up nextline with the next line
                $this->nextline = fgets($this->filehandle);

                // if the nextline is a continuation of the line (it starts
                // with a space or tab), concatenate it to the current line.
                while (preg_match('/^[ \t]/', $this->nextline))
                {
                        $line = rtrim($line,"\r\n") . substr($this->nextline, 1);
                        $this->nextline = fgets($this->filehandle);
                }

                return $line;
        }
}
?>

The tokenizer could use some tightening up and debugging, but, it works so far.
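
To see what kind of token next() hands to the parser, here is a standalone sketch of the same split applied to one unfolded line. The name ics_split_line and the sample DTSTART line are made up for illustration. It mirrors the two regular expressions from the listing, but uses explode(), since split() is gone in PHP 7 and later.

```php
<?php
// Hypothetical standalone version of the split done in ICSTokenizer::next().
function ics_split_line(string $line): ?array
{
    // "NAME;PARAMS:VALUE" -> left side and value, split at the first colon.
    if (!preg_match('/^(.+?):(.*)/', rtrim($line, "\r\n"), $m)) {
        return null; // not a content line
    }
    $token = array('value' => $m[2]);

    // Left side: the property name, then optional ;key=value parameters.
    if (!preg_match('/^([A-Z_-]+);*(.*)/', $m[1], $left)) {
        return null;
    }
    $token['key'] = $left[1];

    $params = array();
    foreach (explode(';', $left[2]) as $pair) {
        if ($pair === '') continue;          // no parameters at all
        $parts = explode('=', $pair, 2);
        $params[$parts[0]] = $parts[1] ?? null;
    }
    if (count($params) > 0) $token['params'] = $params;

    return $token;
}

$token = ics_split_line("DTSTART;TZID=America/Los_Angeles:20090301T120000\r\n");
print_r($token);
```

One thing this makes easy to spot: splitting at the first colon will probably go wrong for a quoted parameter value that itself contains a colon, such as a URL, which is part of the tightening up the tokenizer still needs.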

The parser can be extended to handle the different parameters by altering the code like this:

                        switch( $token['key'] )
                        {
                                case 'BEGIN':
                                        $classname = 'ICS'.rtrim($token['value']);
                                        $this->tree[] = new $classname( $this );
                                break;
                                case 'END':
                                        unset($this->tokenizer);
                                        return $this;
                                break;
                                default:
                                        $this->_set( $token );
                                break;
                        }

Then define _set($token) as a generic setter, or as a setter that does something special with the token. _set() can be redefined in each of the child classes to do the right thing for each token. This is where you can go through contortions to handle the difficult semantics of calendars.
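
Here is a rough sketch of what that could look like. Node and EventNode are stand-ins for ICSImporter and ICSVEVENT, and the props array is an assumption of mine; the date handling only covers the simplest local date-time form.

```php
<?php
// Stand-in for ICSImporter: only the setter hook is shown.
class Node
{
    public $props = array();

    // Generic setter: store the raw value under the property name.
    function _set(array $token)
    {
        $this->props[$token['key']] = rtrim($token['value']);
    }
}

// Stand-in for ICSVEVENT: special-case the date properties.
class EventNode extends Node
{
    function _set(array $token)
    {
        $value = rtrim($token['value']);
        if (in_array($token['key'], array('DTSTART', 'DTEND'))) {
            // Assumes the plain local "YYYYMMDDTHHMMSS" form only. UTC ("Z")
            // values, DATE-only values and TZID parameters are not handled
            // here and fall through to the generic setter.
            $date = DateTime::createFromFormat('Ymd\THis', $value);
            if ($date !== false) {
                $this->props[$token['key']] = $date;
                return;
            }
        }
        parent::_set($token);
    }
}

$event = new EventNode();
$event->_set(array('key' => 'SUMMARY', 'value' => "Lunch\r\n"));
$event->_set(array('key' => 'DTSTART', 'value' => "20090301T120000\r\n"));

echo $event->props['SUMMARY'], "\n";
echo $event->props['DTSTART']->format('Y-m-d H:i'), "\n";
```

Falling back to parent::_set() keeps the child class short: it only has to know about the tokens it treats specially.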

This idea could be extended to both the BEGIN and END blocks by adding calls to $this->_begin() and $this->_end(). Then each child class could be subclassed again, to create classes that serialize the objects to the database. (This is probably not the right way, though. A Visitor would probably do a better job.)
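
For comparison, a Visitor version might look something like the sketch below. None of these names (NodeVisitor, TreeNode, EventCollector) exist in the parser above, and the collector just gathers rows in an array where a real one would write to the database.

```php
<?php
// Hypothetical visitor interface: called on the way into and out of each node.
interface NodeVisitor
{
    function enter(TreeNode $node);
    function leave(TreeNode $node);
}

// Stand-in for a parsed node: a type name, its properties and its children.
class TreeNode
{
    public $type;
    public $props;
    public $children = array();

    function __construct($type, array $props = array())
    {
        $this->type = $type;
        $this->props = $props;
    }

    // Walk the subtree depth-first, notifying the visitor at each node.
    function accept(NodeVisitor $visitor)
    {
        $visitor->enter($this);
        foreach ($this->children as $child) {
            $child->accept($visitor);
        }
        $visitor->leave($this);
    }
}

// Example visitor: collects one row per VEVENT instead of touching a database.
class EventCollector implements NodeVisitor
{
    public $rows = array();

    function enter(TreeNode $node)
    {
        if ($node->type === 'VEVENT') {
            $this->rows[] = $node->props;
        }
    }

    function leave(TreeNode $node)
    {
        // Nothing to do here; a database writer might commit at this point.
    }
}

$calendar = new TreeNode('VCALENDAR');
$calendar->children[] = new TreeNode('VEVENT', array('SUMMARY' => 'Lunch'));
$calendar->children[] = new TreeNode('VTIMEZONE', array('TZID' => 'US/Pacific'));

$collector = new EventCollector();
$calendar->accept($collector);
print_r($collector->rows);
```

The appeal is that the node classes stay free of storage code: a second visitor could export the same tree in another format without another layer of subclasses.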