Clean Up Your HTML: A Tag-Closing Function

User-submitted HTML often contains small markup errors that can affect other parts of the page. The most common are unclosed tags that cause text to be bolded, italicized, or linked all the way down the page. The visual effect is catastrophic, though the error is really minor.

The html_close_tags() function scans HTML code and returns a string that closes every tag left open. The easiest way to use it is to append that string to the original markup:


  $html = $html.html_close_tags($html);

The string analysis is done “C” style, by iterating over characters, rather than by using regexes (Perl style) or by breaking the data into parts, parsing it, and then concatenating the output (Lisp style). C style means reading across the data one character at a time, accumulating substrings as needed. Because PHP lacked goto at the time (it was only added in PHP 5.3), I settled for a series of loops to implement the state machine. (Sometimes, goto is the right way to do something.)
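
For comparison, here is a minimal sketch of the same character-at-a-time idea written with an explicit state variable instead of consecutive loops. It is not the function from this post: scan_tag_names() and its state names are made up for illustration, and it only lists the tag names it sees, without handling quotes or closing anything.

```php
<?php
// Hypothetical illustration only: list the tag names found in $html by
// reading one character at a time and switching on an explicit state.
function scan_tag_names($html)
{
    $names = array();
    $state = 'text';   // 'text' = outside a tag, 'name' = in a tag name, 'attrs' = rest of the tag
    $mark  = 0;        // start of the tag name currently being read
    $size  = strlen($html);

    for ($i = 0; $i < $size; $i++) {
        $ch = $html[$i];
        switch ($state) {
            case 'text':
                if ($ch == '<') {
                    $state = 'name';
                    $mark  = $i + 1; // the name starts after '<'
                }
                break;
            case 'name':
                if ($ch == ' ' || $ch == '>') {
                    // cut the name out in one piece rather than concatenating
                    $names[] = strtolower(substr($html, $mark, $i - $mark));
                    $state = ($ch == '>') ? 'text' : 'attrs';
                }
                break;
            case 'attrs':
                if ($ch == '>') {
                    $state = 'text';
                }
                break;
        }
    }
    return $names;
}

print_r(scan_tag_names('Hi <B>there</b> <a href="#">x</a>'));
```

Closing tags show up here with their leading slash (for example "/b"); the real function below splits that off into a separate flag before reading the name.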

I chose scanning because I expected it to be faster than any technique that requires multiple concatenations. It’s also very straightforward compared to regexes. I used a couple of extraneous variables to add some documentation, as recommended by older programming texts.

One somewhat serious deficiency is that quoted attribute strings aren’t well supported. An attribute like onClick='pop(100,300,'bar')' will fail to parse correctly because nested or escaped quotes are not handled.

This should be a relatively rare situation, because user input should not be allowed to contain JavaScript. A separate function should already have sanitized the input of JavaScript before this one is called.
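
To make the quoting limitation concrete, here is a small sketch of how a scanner with no escape handling pairs quotes. The helper find_closing_quote() is hypothetical and not part of html_close_tags(); it simply stops at the next identical quote character, which is roughly what a simple quote-skipping loop does.

```php
<?php
// Hypothetical helper, not part of html_close_tags(): given the index of an
// opening quote in $s, return the index of the next identical quote
// character, or strlen($s) if there is none. No escape handling, by design.
function find_closing_quote($s, $open)
{
    $size  = strlen($s);
    $quote = $s[$open];
    for ($i = $open + 1; $i < $size; $i++) {
        if ($s[$i] === $quote) {
            return $i;
        }
    }
    return $size;
}

$attr  = "onClick='pop(100,300,'bar')'";
$open  = strpos($attr, "'");                 // position of the first quote
$close = find_closing_quote($attr, $open);   // where a naive scanner stops
$value = substr($attr, $open + 1, $close - $open - 1);

echo $value, "\n"; // the attribute value as the naive scanner sees it
```

The scanner treats the quote before bar as the end of the value, so everything after it is read as further attribute text rather than as part of the JavaScript call.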

To use the code, remember to strip out the test cases.

<?php
// vim:set ts=4 sw=4:
/**
 * A function to close tags in user-supplied HTML.
 * It's written to handle code with improperly nested tags.
 * This does not sanitize the data for tricks like using html entities
 * to encode tag names.
 */
function html_close_tags($html)
{
    $ignoretags = array( 'p', 'br' );
    $tagstack = array();
    $size = strlen( $html );
    $i = 0;
    $ch = ($size > 0) ? $html[$i] : ''; // guard against empty input
    $mark = $i;
    while( $i < $size )
    {
        // outside the tag state (1)
        while( $i < $size and $ch != '<' )
        {
            $i++; // advance one char
            $ch = ($i < $size) ? $html[$i] : '';
        } // while not '<'
        if( $i >= $size ) break; // trailing text with no more tags

        // inside the tag state (2)
        // get the tag type (open or close)
        $i++; // advance one char
        $ch = ($i < $size) ? $html[$i] : '';
        // a '/' right after '<' marks a closing tag
        $closeTag = ( $ch == '/' );
        if ($closeTag) $i++; // skip the '/'
        $mark = $i; // mark start of name

        // get the tag name (2.1)
        while( $i < $size and $ch != ' ' and $ch != '>' )
        {
            $i++; // advance one char
            $ch = ($i < $size) ? $html[$i] : '';
        }
        $tagname = strtolower( substr( $html, $mark, $i-$mark ) );
        // Don't advance char after this state.

        // get the rest of the tag attributes (2.2)
        while( $i < $size and $ch != '>' )
        {
            $i++; // advance one char
            $ch = ($i < $size) ? $html[$i] : '';

            // special case within quotes: skip ahead to the matching
            // quote so that a '>' inside an attribute value is ignored.
            // note that this does not handle complex quoting or escapes
            if( $ch == '"' or $ch == "'" )
            {
                $quote = $ch;
                do
                {
                    $i++; // advance one char
                    $ch = ($i < $size) ? $html[$i] : '';
                } while( $i < $size and $ch != $quote );
            }
        }

        $last = $html[$i-1];
        // If tag attribute part contains a trailing slash
        // assume it's self-closing and don't add to tag stack.
        if ( $last=='/' ) 
        {
        }
        // If the tag is an opening tag, put on tagstack.
        else if ( $closeTag == false )
        {
            // unless it's in the ignoretags array, add it
            if (!in_array( $tagname, $ignoretags ))
                $tagstack[] = $tagname;
        }
        // If the tag is a closing tag, pop a matching tag off stack
        // by searching for the first matching tag and removing 
        // that element.
        else if ( $closeTag == true )
        {
            for($c=count($tagstack)-1; $c >= 0; $c--)
            {
                if( $tagstack[$c]==$tagname )
                {
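                    // remove the element and reindex, so count() - 1
                    // is still the top of the stack next time
                    array_splice($tagstack, $c, 1);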
                    break; // stop the for loop
                }
            }
        }
        $i++; // advance one char
        $ch = ($i < $size) ? $html[$i] : '';

    } // main loop
    // Scan remaining elements, building up a string of closing tags.
    // Note that string is built up "backwards" to close tags in order
    // (because the loop reads from the bottom of the stack).
    $result = ''; // start empty so the first concatenation is defined
    foreach( $tagstack as $tag )
        $result = '</'.$tag.'>'.$result;
    return $result;
}

echo "<pre>";

$test = "<p><br><b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "test<p>test<br>test<b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "</p></br></b></i></ul></li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br/><b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br /><b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br><b><i><ul><li></li></ul></i></b>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br><b><i><ul><li></ul></li></i></b>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";
?>