Clean Up Your HTML: A Tag Closing Function

User-submitted HTML often contains small markup errors that can affect other parts of the page. The most common are unclosed tags that cause text to be bolded, italicized, or linked all the way down the page. The visual effect is catastrophic, though the error is really minor.

The html_close_tags() function scans HTML code, and generates a string that will close all the open tags. An easy way to use it is like this:


  $html = $html.html_close_tags($html);

The string analysis is done "C" style, by iterating over characters, rather than using regexes (Perl style), or by breaking the data into parts, parsing it, and then concatenating the output (Lisp style). C style is to just read across the data a character at a time, accumulating substrings as needed. Because PHP lacked goto (it was only added in PHP 5.3), I settled for using a series of loops to implement the state machine. (Sometimes, goto is the right way to do something.)
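As a rough illustration of that scanning style (not the function itself), here is a minimal sketch that walks a string one character at a time and collects tag names, using one loop per state. The name sketch_tag_names() is made up for this example, and unlike the listing further down it checks the string length in every loop.

```php
<?php
// Hypothetical helper, for illustration only: collect the lowercased
// names of all tags in $html by reading one character at a time.
function sketch_tag_names($html)
{
    $names = array();
    $size = strlen($html);
    $i = 0;
    while ($i < $size) {
        // State 1: outside a tag, skip ahead to the next '<'.
        while ($i < $size && $html[$i] != '<') {
            $i++;
        }
        if ($i >= $size) {
            break; // no more tags
        }
        $i++; // step past '<'
        if ($i < $size && $html[$i] == '/') {
            $i++; // closing tag: step past '/'
        }
        // State 2: inside a tag, accumulate the name.
        $mark = $i;
        while ($i < $size && $html[$i] != ' ' && $html[$i] != '>' && $html[$i] != '/') {
            $i++;
        }
        if ($i > $mark) {
            $names[] = strtolower(substr($html, $mark, $i - $mark));
        }
        // State 3: skip the rest of the tag up to '>'.
        while ($i < $size && $html[$i] != '>') {
            $i++;
        }
        $i++; // step past '>'
    }
    return $names;
}

print_r(sketch_tag_names('a <B>bold</B> and <br/> text'));
```

For the sample string this collects b, b and br. Each inner loop stands in for one state, and falling out of a loop is the transition to the next.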

Scanning was used because I thought it would be faster than any technique that would require multiple concatenations. It's also very straightforward compared to regexes. I used a couple extraneous variables to add some documentation, as recommended by older programming texts.

One somewhat serious deficiency is that quoted attribute strings aren't well-supported. An attribute like onClick='pop(100,300,\'bar\')' will fail to parse correctly because escaping is not supported.
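One way that gap could be narrowed is sketched below. It is not part of html_close_tags(); the helper name sketch_skip_quoted() is made up for illustration, and it assumes that a backslash escapes the character after it, as in the onClick example.

```php
<?php
// Hypothetical helper: $i must point at an opening quote in $html.
// Returns the index of the matching closing quote, or strlen($html)
// if the quoted value is never closed.
function sketch_skip_quoted($html, $i)
{
    $size = strlen($html);
    $quote = $html[$i]; // remember which quote character opened the value
    $i++;
    while ($i < $size && $html[$i] != $quote) {
        if ($html[$i] == '\\' && $i + 1 < $size) {
            $i++; // skip the escaped character as well
        }
        $i++;
    }
    return $i;
}

// The attribute value from the example above, followed by more text.
$attr = "'pop(100,300,\\'bar\\')' rest";
echo sketch_skip_quoted($attr, 0);
```

A scanner could call something like this whenever it meets a quote inside a tag and then continue from the returned index.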

This is a relatively rare situation, because user input should not be allowed to contain JavaScript. Of course, that means another function should already have been called to strip JavaScript from the input.

To use the code, remember to strip out the test cases.

<?php
// vim:set ts=4 sw=4:
/**
 * A function to close tags in user-supplied HTML.
 * It's written to handle code with improperly nested tags.
 * This does not sanitize the data for tricks like using html entities
 * to encode tag names.
 */
function html_close_tags($html)
{
    $ignoretags = array( 'p', 'br' );
    $tagstack = array();
    $size = strlen( $html );
    $i = 0;
    $ch = ($size > 0) ? $html[0] : ''; // empty input has no first char
    $mark = $i;
    while( $i < $size )
    {
        // outside the tag state (1)
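        while( $i < $size and $ch != '<' )
        {
            $i++;
            $ch = ($i < $size) ? $html[$i] : ''; // advance one char
        } // while not '<'
        if( $i >= $size ) break; // trailing text with no further tags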

        // inside the tag state (2)
        // get the tag type (open or close)
        $i++; // advance one char
        if( $i >= $size ) break; // input ended right after '<'
        $ch = $html[$i];
        if( $ch == '/' )
        {
            $closeTag = true;
        }
        else // it's an opening tag
        {
            $closeTag = false;
        }
        if ($closeTag) $i++; // advance one char
        $mark = $i; // mark start of name

        // get the tag name (2.1)
        while( $i < $size and $ch != ' ' and $ch != '>' )
        {
            $i++;
            $ch = ($i < $size) ? $html[$i] : ''; // advance one char
        }
        $tagname = strtolower( substr( $html, $mark, $i-$mark ) );
        // Don't advance char after this state.

        // get the rest of the tag attributes (2.2)
        while( $i < $size and $ch != '>' )
        {
            $i++;
            $ch = ($i < $size) ? $html[$i] : ''; // advance one char

            // special case within quotes: skip ahead to the matching quote
            // note that this does not handle complex quoting or escapes
            if ($ch == '"' or $ch == "'")
            {
                $quote = $ch;
                do
                {
                    $i++;
                    $ch = ($i < $size) ? $html[$i] : '';
                } while( $i < $size and $ch != $quote );
            }
        }
        if( $i >= $size ) break; // unterminated tag at end of input

        $last = $html[$i-1];
        // If tag attribute part contains a trailing slash
        // assume it's self-closing and don't add to tag stack.
        if ( $last=='/' ) 
        {
        }
        // If the tag is an opening tag, put on tagstack.
        else if ( $closeTag == false )
        {
            // unless it's in the ignoretags array, add it
            if (!in_array( $tagname, $ignoretags ))
                $tagstack[] = $tagname;
        }
        // If the tag is a closing tag, pop a matching tag off stack
        // by searching for the first matching tag and removing 
        // that element.
        else if ( $closeTag == true )
        {
            for($c=count($tagstack)-1; $c >= 0; $c--)
            {
                if( $tagstack[$c]==$tagname )
                {
                    // remove and reindex, so later count()-based scans
                    // never hit a missing index
                    array_splice($tagstack, $c, 1);
                    break; // stop the for loop
                }
            }
        }
        $i++;
        $ch = ($i < $size) ? $html[$i] : ''; // advance one char

    } // main loop
    // Scan remaining elements, building up a string of closing tags.
    // Note that string is built up "backwards" to close tags in order
    // (because the loop reads from the bottom of the stack).
    $result = ''; // start empty so input with nothing to close returns ''
    foreach( $tagstack as $tag )
        $result = '</'.$tag.'>'.$result;
    return $result;
}

echo "<pre>";

$test = "<p><br><b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "test<p>test<br>test<b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "</p></br></b></i></ul></li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br/><b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br /><b><i><ul><li>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br><b><i><ul><li></li></ul></i></b>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";

$test = "<p><br><b><i><ul><li></ul></li></i></b>";
echo htmlspecialchars(html_close_tags($test));
echo "\n\n";
echo "</pre>"; // close the <pre> opened before the tests
?>
Attachment: HtmlCloseTags.inc.php.txt (3.32 KB)


Great! Big problem

Great!

Big problem, though:

If the string ends in anything but a tag, it crashes. That could be a letter, number, punctuation mark, or symbol (even an angle bracket if it isn't part of a tag). A quick fix might be to append a dummy tag or a <br> at the end.

As for the issue with quotes and attributes, my app was already stripping that stuff out. I am taking job descriptions that people pasted from Word (which creates a really nasty tag soup) and sanitizing them for basic display. I want to leave breaks, paragraphs, bold and such, but everything else can go. Here's the regex to remove all tag attributes:

$string = preg_replace('/<([^\s>]*)(\s[^>]*)>/',"<\\1>",$string);

Also, if anybody wants it: Word makes a bunch of capitalized tags... for shame... let's fix that (the /e modifier was removed in PHP 7, so this uses a callback):

$string = preg_replace_callback('/(<\/?)(\w+)([^>]*>)/', function ($m) { return $m[1].strtolower($m[2]).$m[3]; }, $string);
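To see the two clean-up steps working together, here is a small sketch run on a made-up sample string (the sample is an assumption, loosely modeled on pasted Word markup). The attribute pattern here stops at the first '>', and the lowercasing step uses preg_replace_callback, which works on current PHP versions.

```php
<?php
// Made-up sample resembling pasted Word markup (an assumption for this demo).
$string = '<P CLASS="MsoNormal" style="margin:0">Hello <B>world</B></P>';

// Step 1: drop all attributes, keeping only the tag name.
$string = preg_replace('/<([^\s>]*)(\s[^>]*)>/', '<\\1>', $string);

// Step 2: lowercase the tag names with a callback.
$string = preg_replace_callback(
    '/(<\/?)(\w+)([^>]*>)/',
    function ($m) {
        return $m[1] . strtolower($m[2]) . $m[3];
    },
    $string
);

echo $string;
```

Running the attribute step first keeps the second pattern simple, since by then each tag holds little more than its name.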

Thanks for the comment. I didn't catch that bug. (Sorry for the lateness of this approval, too.)