Object-Oriented Parser for iCalendar

I've been working on a parser for ICS files, written in an OO style: parts of the data are instantiated as objects, and the parse tree is a hierarchy of objects. Searches for "OO Parser" and "object oriented parser" turned up a lot of OO parser generators, or YACC for OO parsers. Very meta. Most were abstracts of CS papers, but one was an easy-to-read article about a parser for Pascal.

Searching for iCalendar parsers was more fruitful, but didn't turn up much code. The File_IMC module in PEAR hasn't been maintained since 2003, and it produces arrays, not object trees. It seems complete, though, so it's a one-stop solution. One highlight was qCal, an OO library for iCalendar. It's supposed to have a parser, but it wasn't finished. Then again, neither is mine. It's a first draft, but it has one really cool feature: there's not much code, because it leans on subclassing magic.

The way it's done is a little unorthodox. Typically, you tokenize on whitespace and punctuation, but iCalendar has very little syntax and grammar (despite what the standard might look like). It's like XML in that way: little syntax, lots of semantics. ICS files are written as lines, and a long logical line may be folded across several physical lines, with each continuation line starting with whitespace. So I treat a logical "line" as the basic unit, then break it into its parts (key, parameters, value) to produce a "token". These tokens are assembled into the parse tree.
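To make the "line as token" idea concrete, here is a minimal, self-contained sketch that is separate from the parser listing. The function names unfold_ics and split_content_line and the sample data are made up for illustration. It assumes a continuation line starts with a single space or tab, and it does not handle colons or semicolons inside quoted parameter values.

```php
<?php
// Hypothetical helpers for illustration only; not part of the parser listing.

// Join folded physical lines back into logical lines. A physical line that
// starts with a space or tab continues the previous one.
function unfold_ics(string $text): array
{
    $unfolded = preg_replace('/\r?\n[ \t]/', '', $text);
    return preg_split('/\r?\n/', $unfolded, -1, PREG_SPLIT_NO_EMPTY);
}

// Break one logical line into key, optional params, and value.
// Assumes no colon or semicolon appears inside a quoted parameter value.
function split_content_line(string $line): array
{
    list($left, $value) = array_pad(explode(':', $line, 2), 2, '');
    $parts = explode(';', $left);
    $token = array('key' => strtoupper(array_shift($parts)), 'value' => $value);
    foreach ($parts as $part) {
        list($pkey, $pval) = array_pad(explode('=', $part, 2), 2, '');
        if ($pkey !== '') $token['params'][$pkey] = $pval;
    }
    return $token;
}

// Made-up sample: the SUMMARY line is folded across two physical lines.
$sample = "BEGIN:VEVENT\r\n"
        . "DTSTART;TZID=America/New_York:20090301T090000\r\n"
        . "SUMMARY:Planning meeting for the\r\n"
        . "  combined calendar\r\n"
        . "END:VEVENT\r\n";

foreach (unfold_ics($sample) as $line) {
    print_r(split_content_line($line));
}
```

Each token comes out as an array with a 'key', a 'value', and, when the line carries any, a 'params' array, which is the same shape the tokenizer in this post hands to the parser.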


<?php // vim:ai:ts=4 sw=4: 
include_once('core/Database.class.php');

# this is pseudocode

# http://www.ietf.org/rfc/rfc2445.txt

class ICSImporter 
{
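        public $tokenizer;
        public $tree = array();

        // $parent is either the tokenizer itself (for the root node) or the
        // enclosing ICSImporter, whose tokenizer this node shares.
        function __construct( $parent )
        {
                if ($parent instanceof ICSTokenizer)
                        $this->tokenizer = $parent;
                else
                        $this->tokenizer = $parent->tokenizer;
                $this->read();
                unset($this->tokenizer);
        }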
        function read()
        {
                while( $token = $this->tokenizer->next() )
                {
                        switch( $token['key'] )
                        {
                                case 'BEGIN':
                                        // e.g. "BEGIN:VEVENT" instantiates ICSVEVENT, which
                                        // consumes tokens up to the matching END.
                                        $classname = 'ICS'.rtrim($token['value']);
                                        $this->tree[] = new $classname( $this );
                                break;
                                case 'END':
                                        unset($this->tokenizer);
                                        return $this;
                                break;
                        }
                }
        }
}

class ICSVCALENDAR extends ICSImporter
{
}
class ICSVTIMEZONE extends ICSImporter
{
}
class ICSVEVENT extends ICSImporter
{
}
class ICSSTANDARD extends ICSImporter
{
}
class ICSDAYLIGHT extends ICSImporter
{
}
class ICSVALARM extends ICSImporter
{
}

/**
 * Lexer/Tokenizer
 *
 * iCalendar is basically a line-based language with little grammar, 
 * so we'll treat a "line" as a single token.  A "line" can extend
 * over multiple lines, as specified in the RFC.
 */
class ICSTokenizer 
{
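        public $filehandle;
        public $tokenizer;
        public $nextline = null;   // one-line lookahead buffer used by read()

        function __construct( $fhandle )
        {
                $this->filehandle = $fhandle;
                // Lets an ICSImporter treat the tokenizer like a parent node.
                $this->tokenizer = $this;
        }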
        /**
         * Reads a line, and breaks it up into its parts.
         */
        function next()
        {
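                $t = array();

                // Skip blank or malformed lines until one has a "name:value" shape.
                // (A colon inside a quoted parameter value is not handled here.)
                do
                {
                        $line = $this->read();
                        if ($line === null || $line === false) return null;
                }
                while (!preg_match( '/^(.+?):(.*)/', $line, $matches ));
                $t['value'] = rtrim($matches[2], "\r\n");
                $left = $matches[1];

                // Property name, then optional ";PARAM=VALUE" pairs.
                if (!preg_match( '/^([A-Za-z0-9_-]+);*(.*)/', $left, $matches ))
                {
                        $t['key'] = $left;      // unexpected name; pass it through as-is
                        return $t;
                }
                $t['key'] = strtoupper($matches[1]);
                $paramPairs = $matches[2];
                // split() was removed in PHP 7; explode() does the same job here.
                $params = explode( ';', $paramPairs );
                $p = array();
                foreach( $params as $param )
                {
                        list( $pkey, $pval ) = array_pad( explode( '=', $param, 2 ), 2, '' );
                        if ($pkey !== '') $p[$pkey] = $pval;
                }
                if (count($p)>0) $t['params'] = $p;

                return $t;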
        }
        /**
         * Read with a lookahead of one line, to handle tokens
         * that span more than one physical line.
         */
        function read()
        {
                // lookahead part - use previous line if it's available.
                if (isset($this->nextline) && $this->nextline !== false)
                {
                        $line = $this->nextline;
                        $this->nextline = null;
                }
                else
                {
                        $line = fgets($this->filehandle);
                        if ($line === false) return null;   // end of file
                }

                // load up nextline with the next line
                $this->nextline = fgets($this->filehandle);

                // if the nextline is a continuation of the line (it starts with
                // a space or a tab), drop the line break and that one leading
                // character, and concatenate it to the current line.
                while (is_string($this->nextline) && preg_match('/^[ \t]/', $this->nextline))
                {
                        $line = rtrim($line,"\r\n") . substr($this->nextline, 1);
                        $this->nextline = fgets($this->filehandle);
                }

                return $line;
        }
}
?>

The tokenizer could use some tightening up and debugging, but it works so far.

The parser can be extended to handle the remaining properties (every token other than BEGIN and END) by altering the code like this:

                        switch( $token['key'] )
                        {
                                case 'BEGIN':
                                        $classname = 'ICS'.rtrim($token['value']);
                                        $this->tree[] = new $classname( $this );
                                break;
                                case 'END':
                                        unset($this->tokenizer);
                                        return $this;
                                break;
                                default:
                                         $this->_set( $token );
                                break;
                        }

Then define _set($token) as a generic setter, or as a setter that does something special with the token. _set() can be redefined in each of the child classes to do the right thing for each token. This is where you can go through contortions to handle the difficult semantics of calendars.
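As a rough sketch of what that could look like, the following defines a generic _set() and one override. SketchNode and SketchEvent are hypothetical stand-ins for ICSImporter and ICSVEVENT, reduced to the _set() hook so the example runs on its own. The date handling assumes the basic "YYYYMMDDTHHMMSS" form and a TZID that PHP's timezone database recognizes, which real ICS files do not guarantee.

```php
<?php
// Hypothetical stand-ins for ICSImporter and ICSVEVENT, for illustration only.
class SketchNode
{
    public $props = array();

    // Generic setter: store the raw value under the property name.
    function _set(array $token)
    {
        $this->props[$token['key']] = $token['value'];
    }
}

class SketchEvent extends SketchNode
{
    // Special handling for date-time properties; everything else falls
    // through to the generic setter.
    function _set(array $token)
    {
        if ($token['key'] === 'DTSTART' || $token['key'] === 'DTEND') {
            // Assumption: the TZID names a zone PHP knows; UTC if none is given.
            $tzid = isset($token['params']['TZID']) ? $token['params']['TZID'] : 'UTC';
            // A trailing "Z" (the UTC marker) is stripped before parsing.
            $this->props[$token['key']] = DateTime::createFromFormat(
                'Ymd\THis', rtrim($token['value'], 'Z'), new DateTimeZone($tzid)
            );
            return;
        }
        parent::_set($token);
    }
}

$event = new SketchEvent();
$event->_set(array('key' => 'SUMMARY', 'value' => 'Planning meeting'));
$event->_set(array(
    'key'    => 'DTSTART',
    'params' => array('TZID' => 'America/New_York'),
    'value'  => '20090301T090000',
));
echo $event->props['SUMMARY'], "\n";
echo $event->props['DTSTART']->format('Y-m-d H:i T'), "\n";
```

Timezones defined inside the file itself (VTIMEZONE blocks) are exactly the kind of contortion this leaves out.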

This idea could be extended to the BEGIN and END cases by adding calls to $this->_begin() and $this->_end(). Then each child class could be subclassed again to create classes that serialize the objects to the database. (This is probably not the right way, though. A Visitor would probably do a better job.)
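For comparison, here is a simplified sketch of the Visitor alternative. All class names are hypothetical, the tree is built by hand rather than parsed, and the visitor only records what it would do instead of talking to a database. It also skips the per-type dispatch a full Visitor would have.

```php
<?php
// Hypothetical node and visitor classes, for illustration only.
class SketchTreeNode
{
    public $type;
    public $props;
    public $tree = array();   // child nodes, like $this->tree in the parser

    function __construct($type, array $props = array())
    {
        $this->type = $type;
        $this->props = $props;
    }

    // Walk this node and its children, calling the visitor's hooks on the
    // way in and on the way out (the same points as BEGIN and END).
    function accept($visitor)
    {
        $visitor->begin($this);
        foreach ($this->tree as $child) {
            $child->accept($visitor);
        }
        $visitor->end($this);
    }
}

// Stand-in for a database writer: it only logs the calls it receives.
class LoggingVisitor
{
    public $log = array();
    private $depth = 0;

    function begin($node)
    {
        $this->log[] = str_repeat('  ', $this->depth) . 'begin ' . $node->type;
        $this->depth++;
    }

    function end($node)
    {
        $this->depth--;
        $this->log[] = str_repeat('  ', $this->depth) . 'end ' . $node->type;
    }
}

$calendar = new SketchTreeNode('VCALENDAR');
$event = new SketchTreeNode('VEVENT', array('SUMMARY' => 'Planning meeting'));
$event->tree[] = new SketchTreeNode('VALARM');
$calendar->tree[] = $event;

$visitor = new LoggingVisitor();
$calendar->accept($visitor);
echo implode("\n", $visitor->log), "\n";
```

The point of the design is that the node classes stay unaware of the database: swapping LoggingVisitor for a class that issues INSERTs would not require another layer of subclasses.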

Comments

This is the author of qCal again. I just wanted to let you know that I have released the very first pre-alpha version of qCal. You can check it out here: http://lukevisinoni.com/2009/12/31/qcal-v001-released/

I will be releasing a new version every other Thursday from now on, so you can expect v0.0.2 on January 14th. I hope you are still interested :)

Hello,
This is the author of qCal. I just wanted to let you guys know that the project is back from the dead and I expect a first release in the next few weeks. I would LOVE to get your input if you are interested in the library. It has grown quite a bit since the last time you looked at it, so please give it another chance. :)

Thanks for the update. I'll definitely check it out next time I need an ical library. Thanks for putting it out there!

I've been looking for an iCalendar (/vcalendar/ical/ics) parser for PHP for quite some time now. Seeing that this blog post is over a year old, I'm wondering how your parser project is doing today.

As a sidenote, I did notice qCal, too, but it's practically missing the implementation... not quite usable yet. I also found Bennu (http://bennu.sourceforge.net/) which unfortunately does look dead regardless of what it says on the front page. Looking at the source code makes me happy, though. This one is lacking a real parser, too, but overall the code looks more readable than qCal. If I had to choose one to expand, I'd probably select Bennu.

Why does pretty much every ical library out there focus on generating ical files only? That's the easy part and does not really require a huge library. The hard parts are parsing the file and interpreting the RRULEs, TIMEZONEs and stuff. I'm still looking for a nice library to read an iCalendar file and output a list of all events, with UTC timestamps for the start of the event and the end of the event, in a given time window (say from 2009-3-1 to 2009-3-31).

The project this was for got diverted, and the whole "distributed calendars combined into one" wasn't going to happen. It was just too complex for the audience in question (and the volume of data was low). So, unfortunately, I haven't worked on the code since.

One problem I remember looking at, and being intimidated by, was the support for timezones. If you're going to use ical, your program has to handle ical's idea of timezones, which are linked from events. So, you'll need a library to go along with that parser. Then, that makes writing the larger app that much more complex, because you're dealing with ical's idea of timezones.

The calendar system I defined was a lot simpler than ical, so it would have required simplifying the ical data to fit into the local calendar system. Anyway, it was looking kind of hairy.