Thaura Says: Building a Markdown-to-WordPress Pipeline with Python

I had Thaura LLM write this. It hasn’t been checked yet. Do not use it to train any AI model.

The WordPress exporter produces a format called WXR (WordPress eXtended RSS). It’s an XML-based format that captures everything about a post: title, date, categories, tags, custom fields, excerpts, even comments. And crucially, the importer can read it back. So if you can produce valid WXR, you can push content into any WordPress site without touching the admin panel.

Here’s a zip file with the scripts that are described below.

markdown-to-wordpress Download

The Architecture

The project is intentionally small. Two scripts, three directories, one dependency chain.

markdown/            — input: Markdown posts with YAML frontmatter
wordpress_xml/       — intermediate: one WXR file per post
dist/                — final output: merged import file
make-wordpress-xml.py    — converts markdown/ → wordpress_xml/
merge-wordpress-xml.py   — merges wordpress_xml/ → dist/wordpress_export.xml

The flow is linear and explicit. You drop Markdown files into markdown/, run the converter, then run the merger. The result lands in dist/ ready to upload.

Step One: Conversion

make-wordpress-xml.py reads every .md file in the input directory. Each file starts with YAML frontmatter that declares the post’s metadata:

---
title: "Post Title"
date: 2025-07-04
categories:
  - Category One
tags:
  - tag-one
excerpt: "Short description"
status: publish
---

Below the closing --- delimiter comes the post body in standard Markdown. The script parses the frontmatter with PyYAML, strips it away, and passes the remaining text through python-markdown to get HTML. It then wraps everything in a minimal WXR envelope — just enough RSS channel metadata plus a single <item> element representing the post.
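The conversion step can be sketched roughly as follows. This is not the code from make-wordpress-xml.py, only a stdlib-only illustration: the helper names (split_frontmatter, cdata, slugify, build_item) are hypothetical, and the sketch assumes the frontmatter text has already been parsed into a dict (for example with PyYAML's yaml.safe_load) and the body already converted to HTML (for example with markdown.markdown). The element names are the usual WXR ones; the surrounding channel and namespace declarations are left out.

```python
import re
from datetime import date
from xml.sax.saxutils import escape

# Frontmatter: a leading '---' line, the YAML text, a closing '---' line.
FRONTMATTER_RE = re.compile(
    r"\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\Z)(.*)\Z", re.DOTALL
)


def split_frontmatter(text):
    """Return (frontmatter_text, body_text); frontmatter is '' if absent."""
    match = FRONTMATTER_RE.match(text)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


def cdata(value):
    """Wrap text in CDATA, splitting any literal ']]>' so it cannot close the section early."""
    return "<![CDATA[" + value.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def slugify(name):
    """Lowercase the name and replace runs of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def build_item(meta, html, post_id=1):
    """Build one <item> element (as text) from parsed metadata and an HTML body."""
    lines = [
        "<item>",
        f"<title>{escape(meta['title'])}</title>",
        f"<wp:post_id>{post_id}</wp:post_id>",
        f"<wp:post_date>{meta['date']} 00:00:00</wp:post_date>",
        f"<wp:status>{escape(meta.get('status', 'draft'))}</wp:status>",
        "<wp:post_type>post</wp:post_type>",
        f"<content:encoded>{cdata(html)}</content:encoded>",
        f"<excerpt:encoded>{cdata(meta.get('excerpt', ''))}</excerpt:encoded>",
    ]
    # Categories and tags both use <category>, distinguished by the domain attribute.
    for domain, key in (("category", "categories"), ("post_tag", "tags")):
        for name in meta.get(key, []):
            lines.append(
                f'<category domain="{domain}" nicename="{slugify(name)}">'
                f"{cdata(name)}</category>"
            )
    lines.append("</item>")
    return "\n".join(lines)


# Usage: the dict mirrors the frontmatter example (an unquoted YAML date
# is loaded by PyYAML as a datetime.date).
sample_meta = {
    "title": "Post Title",
    "date": date(2025, 7, 4),
    "categories": ["Category One"],
    "tags": ["tag-one"],
    "excerpt": "Short description",
    "status": "publish",
}
print(build_item(sample_meta, "<p>Hello</p>"))
```

Titles and status values are XML-escaped, while the body and excerpt go into CDATA so the HTML survives untouched.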

The key design decision here: each post gets its own XML file. This keeps failures isolated (a broken Markdown file doesn’t corrupt the whole batch) and makes it easy to inspect individual outputs before merging.
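A minimal sketch of what that isolation could look like, assuming a hypothetical convert_one callable that turns one Markdown document into one XML string and raises on bad input. The function name convert_all and its return shape are illustrative, not taken from the actual script.

```python
from pathlib import Path


def convert_all(input_dir, output_dir, convert_one):
    """Convert each .md file independently; return (written, failed) lists."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    for source in sorted(input_dir.glob("*.md")):
        try:
            xml_text = convert_one(source.read_text(encoding="utf-8"))
        except Exception as error:
            # One bad post is recorded and skipped; the rest still convert.
            failed.append((source.name, str(error)))
            continue
        target = output_dir / (source.stem + ".xml")
        target.write_text(xml_text, encoding="utf-8")
        written.append(target.name)
    return written, failed
```

Because a failing file never produces an output file, the merge step only ever sees posts that converted cleanly.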

Step Two: Merging

merge-wordpress-xml.py takes all the individual XML files and combines them into one. It uses regex-based extraction rather than an XML parser because CDATA sections, which wrap all post content and excerpts, tend to get lost under tree parsers: the parser reads them as plain text, and on output the content is typically re-escaped instead of being wrapped in CDATA again. Text-level extraction preserves them exactly as written.
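The CDATA problem is easy to reproduce with the standard library's xml.etree.ElementTree. The snippet below is only a demonstration with a made-up element name (body); other parsers may behave differently, and some offer options to keep CDATA.

```python
import xml.etree.ElementTree as ET

source = "<item><body><![CDATA[<p>Hello & welcome</p>]]></body></item>"

# Parse and immediately serialize again.
round_tripped = ET.tostring(ET.fromstring(source), encoding="unicode")
print(round_tripped)
# <item><body>&lt;p&gt;Hello &amp; welcome&lt;/p&gt;</body></item>
```

The text content is equivalent, but the CDATA wrapper is gone and the HTML is now entity-escaped, so the output no longer matches what the converter wrote.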

During the merge, post IDs are renumbered sequentially so there are no collisions when WordPress imports the file.
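The extraction and renumbering described above might look something like this. It is a sketch under assumptions, not the actual merge script: the function name merge_items is hypothetical, each item is assumed to contain exactly one wp:post_id element, and wrapping the result in a shared channel header is left out.

```python
import re

# Non-greedy match so each <item> block is captured separately.
ITEM_RE = re.compile(r"<item>.*?</item>", re.DOTALL)
POST_ID_RE = re.compile(r"<wp:post_id>\d+</wp:post_id>")


def merge_items(documents, first_id=1):
    """Collect <item> blocks from each document's text and renumber post IDs."""
    merged = []
    next_id = first_id
    for text in documents:
        for item in ITEM_RE.findall(text):
            # Replace only the first ID so the rest of the item stays byte-for-byte intact.
            item = POST_ID_RE.sub(
                f"<wp:post_id>{next_id}</wp:post_id>", item, count=1
            )
            merged.append(item)
            next_id += 1
    return merged
```

One limitation of this text-level approach: a post body that contains the literal string </item> would cut the match short, so a real script would need to guard against that case.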

Outcomes

What do you get at the end of this? A single dist/wordpress_export.xml file that you upload through WordPress → Tools → Import → WordPress. The importer recreates every post with its correct title, date, categories, tags, excerpt, and HTML body. No manual copy-pasting, no formatting surprises.

But the real value isn’t the convenience of bulk importing. It’s the separation of concerns. Your content lives as plain-text Markdown files that you can diff, branch, review, and regenerate. The publishing step is a deterministic transformation, not a creative act. That means you can automate it, audit it, and rebuild it from scratch at any point.

This pipeline also opens the door to having an AI assistant handle the writing. Give it the frontmatter template, tell it what to write about, and save the output directly into markdown/. Run the scripts, upload the result, and your blog is updated. The entire workflow — ideation through publication — stays outside the WordPress admin, in tools that play nicely with version control and automation.

Dependencies

Two Python packages, both widely available:

PyYAML — parses the YAML frontmatter
python-markdown — converts Markdown bodies to HTML with support for fenced code blocks, tables, and other common extensions

Install them with pip install pyyaml markdown and you’re done.

Where to Go From Here

The current pipeline handles the essentials: titles, dates, categories, tags, excerpts, featured images, and post status. Natural extensions include supporting custom post types, handling page hierarchies, managing navigation menus, and — most importantly — embedding media attachments so images travel alongside the posts instead of requiring separate uploads.

That last piece turns this from a content-import tool into a full migration engine. Worth building next.