Generating Word Files with WordML (xml) as a Template

Background

I've been spending the past day messing around with WordML to get it to generate Word files from PHP. The goal was to create printable tables. As George Bush said as the Iraq War slid toward failure: "Mission Accomplished."

My strategy was to start with a Word file, save it as XML (aka WordML), and then use that file as an XML template for the table layout. This is analogous to designing your output forms in NVU or Dreamweaver, and then turning them into templates.

Note that I saved the document out as an .xml file, not an .htm file. The .htm file could also be used as a template, but it doesn't retain all the Word features. There's also an alternate format, .mht, that bundles the .htm and its enclosures into a single file, but the entire thing is MIME encoded, and decoding (and re-encoding) is a pain. XML is clean and relatively simple, and allows you to embed graphics.

Step by Step

1. To convert a document into a template, first, save it out as xml.

2. Then, use HTMLTidy to reformat the xml into something readable. Use the command "tidy -i -xml filename > outputfile". That will indent the xml code so you can edit it.

3. Cut and paste the XML into your PHP page, just like you might with HTML. Drop in and out of PHP mode with <? and ?> around your logic, and leave the XML as literal text rather than wrapping it in 'echo', please. There is a LOT of XML.

4. You must set the content type to application/msword. This will tell the browser to open the file with Word. In PHP, above the template, add a header line:

header("Content-type: application/msword");

5. You will need to use echo to emit the xml headers, before the rest of the page, if you are using the short tags mode in PHP. Otherwise PHP sees the <? at the start of each header and tries to run it as code:

echo '<?xml version="1.0" encoding="utf-8" standalone="yes"?>'."\n";
echo '<?mso-application progid="Word.Document"?>'."\n";

So, in PHP, you might have this block at the top of the page, in addition to your own code:

<?php
header("Content-type: application/msword");
echo '<?xml version="1.0" encoding="utf-8" standalone="yes"?>'."\n";
echo '<?mso-application progid="Word.Document"?>'."\n";
?>

That's all there is to it.
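
To make steps 3 to 5 a bit more concrete, here is a minimal sketch of a template with one table in it. The XML is heavily trimmed compared to what Word really saves, so treat it as an outline of the approach rather than a file Word is guaranteed to accept as-is. The sample data and the function names xml_text() and render_document() are made up for this example. I've also added escaping for the cell text, on the assumption that your data may contain characters like & or <.

```php
<?php
// Escape cell text so characters like & and < do not break the XML.
function xml_text($text)
{
    return htmlspecialchars((string) $text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Hypothetical helper: runs the template below and returns the result as a string.
function render_document(array $rows)
{
    ob_start();
?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:tbl>
<?php foreach ($rows as $row): ?>
      <w:tr>
        <w:tc><w:p><w:r><w:t><?php echo xml_text($row['name']); ?></w:t></w:r></w:p></w:tc>
        <w:tc><w:p><w:r><w:t><?php echo xml_text($row['qty']); ?></w:t></w:r></w:p></w:tc>
      </w:tr>
<?php endforeach; ?>
    </w:tbl>
  </w:body>
</w:wordDocument>
<?php
    return ob_get_clean();
}

// Made-up sample data; in a real page this comes from your own code.
$rows = [
    ['name' => 'Widget', 'qty' => 3],
    ['name' => 'Nuts & bolts', 'qty' => 12],
];

// Steps 4 and 5: content type first, then the two XML headers via echo.
header('Content-Type: application/msword');
echo '<?xml version="1.0" encoding="utf-8" standalone="yes"?>' . "\n";
echo '<?mso-application progid="Word.Document"?>' . "\n";
echo render_document($rows);
```

Wrapping the template in a function is just my choice for the sketch; pasting the XML straight into the page body works the same way. The part that matters is that the XML stays literal and only the loop and the cell values are PHP.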

Tips

  • I work in the Cygwin environment. Cygwin's installer can install HTMLTidy for you; the package is named 'tidy'.
  • Start with a fresh document, and turn off the spell check and grammar check features. Word wraps every flagged error in extra XML tags, and those tags end up in your template.
  • If you're using tables, and want table headers on each page, do not drag the tables so they float. Keep them in the text flow, or you won't get table headers on each page. This applies to regular word processing too.
  • Floating anything will introduce more XML code.
  • Use styles instead of formatting the text directly. Formatting directly will introduce more XML code.
  • When you have blocks of base64 encoded data, Tidy will indent it. Remember to unindent it, making it flush with the left edge, so it'll be parsed back into binary data correctly.
  • You can delete a lot of tags, and slim down the code. Just do it bit by bit, checking the output.
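
If you'd rather not unindent the base64 blocks by hand, something like the following could do it for you. This is a sketch, not a tested tool: it assumes the base64 data sits between <w:binData ...> and </w:binData> tags, so check your own file first, and the function name unindent_bindata() is made up for this example.

```php
<?php
// Remove the indentation Tidy added inside base64 blocks.
// Assumption: the base64 data sits inside w:binData elements.
function unindent_bindata($xml)
{
    return preg_replace_callback(
        '~(<w:binData\b[^>]*>)(.*?)(</w:binData>)~s',
        function ($m) {
            // Strip leading and trailing spaces and tabs from every payload line.
            $payload = preg_replace('~^[ \t]+|[ \t]+$~m', '', $m[2]);
            return $m[1] . $payload . $m[3];
        },
        $xml
    );
}

// Small inline sample standing in for a Tidy-indented file.
$sample = "  <w:binData w:name=\"wordml://01.png\">\n"
        . "      AAAA\n"
        . "      BBBB\n"
        . "    </w:binData>";

echo unindent_bindata($sample), "\n";
```

To run it over a real file, read the file with file_get_contents(), pass the contents through the function, and write the result back with file_put_contents(). Everything outside the binData blocks is left alone, so the rest of the indentation stays readable.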

Reference Material

http://www.xmlw.ie/aboutxml/wordml.htm
http://rep.oio.dk/Microsoft.com/officeschemas/wordprocessingml_article.htm
http://www.informit.com/guides/content.asp?g=xml&seqNum=176&rl=1
http://www.tkachenko.com/blog/archives/000024.html
http://en.wikipedia.org/wiki/Microsoft_Office_Open_XML