Here's a first-draft of some code that tokenizes SQL statements and returns an array of all the tokens.
The logic is pretty simple. We scan the string left to right, and look for barewords, single quoted strings, double quoted strings, back quoted strings, or special symbols like "=", "(", "*", etc. These chunks of text may abut each other without intervening whitespace. Thus, "FOO(BAR)" is tokenized into four individual tokens: FOO, (, BAR, ). Quoted strings are scanned and tokenized with quotes intact; the parser must strip the quotes. The functions to scan quoted strings are buggy, and don't handle escaped quote characters.
See bottom of page for parsing notes.
/**
* Tokenizes text that looks something like SQL.
*/
function tokenizeSQL( $SQL )
{
$functions = array ( 'concat', 'if' );
$token = '\\(|\\)|[\']|"|\140|[*]|,|<|>|<>|=|[+]';
$terminal = $token.'|;| |\\n';
$result = array();
$string = $SQL;
$string = ltrim($string);
$string = rtrim($string,';').';'; // always ends with a terminal
$string = preg_replace( "/[\n\r]/s", ' ', $string );
while(
preg_match( "/^($token)($terminal)/s", $string, $matches ) ||
preg_match( "/^({$token})./s", $string, $matches ) ||
preg_match( "/^([a-zA-Z0-9_.]+?)($terminal)/s", $string, $matches)
)
{
$t = $matches[1];
if ($t=='\'')
{
// it's a string
$t = tokSingleQuoteString( $string );
array_push($result, $t);
}
else if ($t=="\140")
{
// it's a backtick string (a name)
$t = tokBackQuoteString( $string );
array_push($result, $t);
}
else if ($t=='"')
{
// it's a double quoted string (a name in normal sql)
$t = tokDoubleQuoteString( $string );
array_push($result, $t);
}
else
{
array_push($result, $t);
}
$string = substr( $string, strlen($t) );
$string = ltrim($string);
}
return $result;
}
function tokSingleQuoteString( $string )
{
// matches a single-quoted string in $string
// $string starts with a single quote
preg_match('/^(\'.*?\').*$/s', $string, $matches );
return $matches[1];
}
function tokBackQuoteString( $string )
{
// matches a back-quoted string in $string
// $string starts with a back quote
preg_match('/^([\140].*?[\140]).*$/s', $string, $matches );
return $matches[1];
}
function tokDoubleQuoteString( $string )
{
// matches a back-quoted string in $string
// $string starts with a back quote
preg_match('/^(".*?").*$/s', $string, $matches );
return $matches[1];
}
Here's some code that can be used to make a page that will let you see the parser's output.
if ($_GET['sql'])
{
print "<pre>";
print_r(tokenizeSQL($_GET['sql']));
print "</pre>";
}
?>
<form>
<textarea name=sql><?=$_GET['sql']?></textarea>
<input type="submit">
</form>
The parser must be able to disabiguate the uses of parens in this string: "select concat( a, b ) from ( select * from x where y=1 )".
The first use of parens is to delimit the parameters to concat. The second use is to delimit a subselect.
The tokenizer is insensitive to this issue, and treats all parens equally. I think this is okay, but I may be wrong. All information about whitespace is lost.