Languages, IMC Keywords, Zend Framework, and Bad Data (random caffeinated thoughts)

I'm getting back to setting up Drupal for LA IMC, but keep hitting little walls. I've decided I definitely want to extract proper nouns and keywords from the articles. We keep losing traffic because the big sites are so much better at internal linking. (I guess having staff and budget is a real edge, LOL.) Lately, I've been disliking PHP, largely because of a little exercise I'm doing: porting a toy program to different programming languages.

Different languages create different cultures. Perl's is still pretty mindblowing. There's a module that would be perfect for extracting information from English text. When that was written, Perl was a good choice for the project. Not only did it have regexes - it also had fast hashes, closures, and terse syntax for dealing with text. (I think that copy I linked is pirated. If you want to read the whole book, go to O'Reilly's Safari and buy a copy or subscribe, or get a used copy from one of the resale sites.)

I'm reading a pretty good book about Zend Framework. (I'll find the link - it's an ebook that cost a ton of $.) ZF is kind of cool. It's yet another "I hate SQL" framework. Sad. SQL isn't great, but forming relations between tables using functions and elaborate arrays as parameters is really lame. Compared to a confusing array() parameter, a JOIN is a thing of beauty. Other than that, it's a pretty nice framework, and it feels more like traditional OO than other frameworks like Cake or CodeIgniter. It has nice auth and form libraries.

Some things make me wonder, though. Why does ZF need a database wrapper that's even more elaborate than PDO? PDO is fine. It feels like Perl's DBI and Java's JDBC. PDO makes it easier to use prepared statements than to build your own query strings through concatenation and interpolation. That increases security significantly, and it's a compelling reason to use PDO in all projects. In my experience, porting across database platforms isn't going to happen. You pick a database, and that's it - if you want to switch, you have to check and rewrite queries. Making the code work across two databases is twice the work -- and unless you're planning to switch databases mid-project, what's the point of making it cross-platform? It's only useful for canned software products that need to be database independent.
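Here's a rough sketch of the difference between interpolating a value into the query string and passing it through a placeholder. It's in Python, with the sqlite3 module standing in for PDO (both have the DBI/JDBC feel), and the articles table, the titles, and the function names are all made up for the example.

```python
import sqlite3

# Hypothetical table and rows, only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [("Rent strike",), ("Port shutdown",)],
)


def find_unsafe(title):
    # Interpolation: the input becomes part of the SQL text itself.
    sql = "SELECT id FROM articles WHERE title = '%s'" % title
    return conn.execute(sql).fetchall()


def find_safe(title):
    # Placeholder: the input travels separately and is only ever a value.
    sql = "SELECT id FROM articles WHERE title = ?"
    return conn.execute(sql, (title,)).fetchall()


hostile = "nope' OR '1'='1"
unsafe_rows = find_unsafe(hostile)  # the injected OR clause matches every row
safe_rows = find_safe(hostile)      # no article has that literal title
```

With the hostile string, the interpolated query returns both rows and the placeholder query returns none, because the quote characters never get a chance to be parsed as SQL.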

Getting back to prepared statements... Back in the Slaptech Framework daze, Josh created a function called sqlprintf which took care of quoting and interpolating in printf-like strings, and we used it everywhere. I ran fuzzers on some of our code, and they didn't find any SQL injection holes. I don't know if I ran the fuzzers right, so there's that possibility, but the point is, bad data tended to terminate the program rather than get through. Manual attempts at injection were also foiled. PDO's prepared statements, with named or positional parameters, will do the same thing. The only problem is, raw error messages and program termination are considered bad form in web apps today - so it's necessary to wrap the database work in exception handlers that percolate errors up through the framework rather than terminate the program.
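A minimal sketch of that wrapping, again in Python with sqlite3 standing in for PDO. StorageError, add_user, and the users table are hypothetical names; the idea is just that the driver's error gets translated into something the framework can catch and report.

```python
import sqlite3


class StorageError(Exception):
    """Hypothetical application-level error the framework knows how to report."""


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)"
)


def add_user(email):
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    except sqlite3.Error as exc:
        # Translate the driver error instead of letting it kill the request.
        raise StorageError("could not save user") from exc


add_user("a@example.org")
try:
    add_user("a@example.org")  # violates the UNIQUE constraint
    outcome = "saved"
except StorageError as err:
    outcome = "rejected: %s" % err
```

The second insert fails inside the database, but the caller only ever sees StorageError, so the layer above can decide how to show it.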

The issue of data validation and error reporting in web apps is kind of complex because of these new Ajax apps and mobile apps. It's important to switch from "special return values" to standardizing on exception handling that throws error objects. Some errors have to be converted into Ajaxy errors - and in jQuery, that means sending HTTP error status codes rather than returning error objects in a normal response. I still see more code returning special error objects than sending HTTP error statuses. It's better to use HTTP, because it exists and is standard (and gets logged), and jQuery treats an error status as a failure.
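One way this could look, sketched in Python with no web framework at all: handlers throw exceptions, and a single dispatcher turns them into an HTTP status plus a small JSON body. The class names, the dispatch function, and the ARTICLES data are invented for the example.

```python
import json
import re


class AppError(Exception):
    status = 500  # anything unclassified is a server error


class BadInput(AppError):
    status = 400


class NotFound(AppError):
    status = 404


ARTICLES = {1: "Rent strike"}  # hypothetical data


def get_article(params):
    raw = params.get("id", "")
    if not re.fullmatch(r"[0-9]+", raw):
        raise BadInput("id must be digits")
    article_id = int(raw)
    if article_id not in ARTICLES:
        raise NotFound("no article %d" % article_id)
    return {"id": article_id, "title": ARTICLES[article_id]}


def dispatch(handler, params):
    """Return (status, body) the way a front controller might."""
    try:
        return 200, json.dumps(handler(params))
    except AppError as err:
        # The status code carries the error; the body is just detail.
        return err.status, json.dumps({"error": str(err)})
```

The handler never builds a special error value. On the browser side, a 400 or 404 lands in jQuery's failure callback instead of the success callback, and the status shows up in the server log.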

Validating input is getting more complex - mainly because I'm getting better at it. I used to just use preg_match(), but, looking back, it was a pretty unstructured way to validate data. More advanced efforts tended toward elaborate specification schemes and a validation engine. I'm going away from that, for now, writing logic instead of configuration, to make better validation code. Right now, data validation is more like writing little programs for each piece of data. PHP's filter functions, like filter_var(), are verbose, and a pain to type, but are good for readability.

My reasoning is that it's actually easier to read the logic in the script than it is to learn the validation engine. While there aren't that many different cases with data validation, it's not always one-size-fits-all. For each input parameter, we end up asking questions like this:

Is data present?
Does data conform to a format?
Does data point to a resource that really exists? (Is that email address real? Is the address a real address?)
Is there a default value?
What's the type of this data? Number, text, object, object that points to a file?
How do we handle this error? Record error to a collection of errors, throw an error immediately, or what?

They aren't answered identically for all your input parameters.
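To make that concrete, here's a small sketch of "logic instead of configuration" for two made-up parameters, written in Python. The parameter names, the limits, and the loose email pattern are assumptions for illustration. Each parameter answers the questions differently: one is required, the other has a default, and both collect errors instead of throwing.

```python
import re

# Deliberately loose format check; it does not prove the address exists.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(params):
    clean = {}
    errors = {}

    # email: required, must match a format, no default
    email = params.get("email", "").strip()
    if not email:
        errors["email"] = "required"
    elif not EMAIL_RE.match(email):
        errors["email"] = "bad format"
    else:
        clean["email"] = email

    # per_page: optional, defaults to 20, must be an integer from 1 to 100
    raw = params.get("per_page", "").strip()
    if not raw:
        clean["per_page"] = 20
    elif re.fullmatch(r"[0-9]{1,3}", raw) and 1 <= int(raw) <= 100:
        clean["per_page"] = int(raw)
    else:
        errors["per_page"] = "must be 1-100"

    return clean, errors
```

That's already a dozen lines of real logic for two parameters, and it skips the "does the resource really exist" question entirely.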

That's at least 2^6 combinations, or 64 possible outcomes (assuming 2 possible answers for each - an undercount because there are more possible answers for some of these questions). If you use the "configuration not code" technique, you end up needing to specify 6 or more lines. If you code rather than configure, you end up using 6 or more lines of code. You can't win either way -- you will have many lines of code. So if your web page takes ten parameters, you should expect to have plenty of input validation code.

And, if you don't.... well, you're not doing it right.
