Archiving Another Drupal Site and Redirecting URLs

I’ve started another project to move a site off of Drupal. Save Our Community is an old community group whose website is out of date and obsolete. It still runs Drupal 6. The site should be kept up for historical reasons.

This article builds on the experience from Fixing a Static Archive of a Drupal Site to Redirect Old URLs with Mod Rewrite, where I documented part of how I fixed this site’s old archive.

This archiving job should be easier because the site hasn’t yet been rebuilt on another platform, so there are no leftover “dead” URLs lying around.

Preparing Drupal

Remove any blocks you don’t want in the archive.
Delete any lingering spam.
Clear out caches.
Reduce the post trimming to 200 characters.
Write a post explaining what you’re doing.
If you want to retain any analytics, make sure the codes are present.
If you have anonymous comments enabled, disable them. This hides the comment form.
If you want to remove the “Login or Register to post comments” text, and the “Comment viewing options”, see below.
Remove RSS feeds and sortable table links. See below.

I archived the site with this command:

wget -nH -E -k -r --restrict-file-names=windows http://saveourcommunity.us

I’m not sure it’s quite right, but it worked. My worry is that the filenames have “?” characters that could affect how Google sees the site.
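
As far as I can tell from the wget manual, the windows mode of --restrict-file-names writes “@” where the URL had “?”, which would explain names like node@page=15.html in the archive. Here is a minimal sketch for confirming that no literal “?” slipped through. The function name is my own invention, and the argument is whatever directory wget wrote to:

```shell
# Hypothetical helper: list downloaded files whose names contain a literal "?".
# Prints nothing when the archive is clean.
find_question_marks() {
    # "${1:-.}" falls back to the current directory when no argument is given.
    # The bracket expression [?] matches a literal question mark.
    find "${1:-.}" -type f -name '*[?]*'
}
```

Run it as find_question_marks . from the top of the download directory. Any output is a file worth renaming or crawling again.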

Cleaning up the Text to Login, Register, or Sort Comments

You may want to remove some of the text that helps you post comments, and decorates comments, for a cleaner appearance.

In modules/comment/comment.module, around line 1806,
find the line that links to the login and registration pages.

Short circuit it by returning an empty string just before that line:

[code]
if (variable_get('user_register', 1)) {
  // Users can register themselves.
  return ''; // add this line, to prevent printing the next line
  return t('<a href="@login">Login</a> or <a href="@register">register</a> to post comments', array('@login' => url('user/login', array('query' => $destination)), '@register' => url('user/register', array('query' => $destination))));
[/code]

Then, return an empty string from this function:

[code]
function theme_comment_controls($form) {
  return ''; // add this line to short circuit this
  $output = '<div class="container-inline">';
  $output .= drupal_render($form);
  $output .= '</div>';
  $output .= '<div class="description">'. t('Select your preferred way to display the comments and click "Save settings" to activate your changes.') .'</div>';
  return theme('box', t('Comment viewing options'), $output);
}
[/code]

And short circuit this function:

[code]
function comment_controls(&$form_state, $mode = COMMENT_MODE_THREADED_EXPANDED, $order = COMMENT_ORDER_NEWEST_FIRST, $comments_per_page = 50) {
  return ''; // add this line
[/code]

Now all three functions return empty strings, leaving your pages cleaner.

Cleaning up the Forum Headers

The table headers on the forum pages, and on any other pages with sortable tables, contain links for each sort order. Following them can cause hundreds of extra pages to be downloaded. You can remove these links by editing the file includes/tablesort.inc.

In the function tablesort_header(), comment out one line and add another:

[code]
// $cell['data'] = l($cell['data'] . $image, $_GET['q'], array('attributes' => array('title' => $title), 'query' => 'sort='. $ts['sort'] .'&order='. urlencode($cell['data']) . $ts['query_string'], 'html' => TRUE));
$cell['data'] = $cell['data'] . $image;
[/code]

Removing RSS Feeds

Drupal adds RSS feeds to all the lists of articles. To remove them, short circuit the function that adds the feed icon, in includes/common.inc:

[code]
function drupal_add_feed($url = NULL, $title = '') {
  return ''; // add this line
[/code]

Bad Links

After downloading the site once, I found numerous files with names like this:

node/1234@q=node%2F345.html

Somewhere on the site, there was a link that looked like this “?q=node/345”.

It should have been written as “/?q=node/345” or “/node/345”. The relative form worked when the website used only ?q= URLs, and it still works now that the pretty URLs are enabled.

The problem is, they produce duplicate pages.

So, you need to get rid of these. It turned out that I had manually coded up a link in a block, and it was adding extra pages all over the site.

Some links within articles also followed this old convention. I had to fix those by hand.
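
To check whether any of these duplicates are left after another crawl, something like the following sketch may help. It assumes the duplicates all carry “@q=” in their filenames, as in the example above. The function name is hypothetical:

```shell
# Hypothetical helper: list duplicate pages that wget saved from relative
# "?q=" links. Their names look like node/1234@q=node%2F345.html.
find_q_duplicates() {
    # "${1:-.}" falls back to the current directory when no argument is given.
    find "${1:-.}" -type f -name '*@q=*'
}
```

An empty result from find_q_duplicates . suggests the hand-coded links are all fixed. Each file it does list is named after the page that carried the bad link, which points to where to look in Drupal.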

The New Location

The old website will be stored in /archive on the webserver.

Creating a Sitemap

Though most sites use XML sitemaps, the search engines also support sitemap.txt, a text file that contains one URL per line. All the content we wish to include in the sitemap is in the /node and /content directories, so we do this:

ls node/*.html > sitemap.txt
ls content/*.html >> sitemap.txt

Then, run this one-liner to prepend the URL to the filenames:

cat sitemap.txt | perl -n -e 'print "http://saveourcommunity.us/archive/$_";' > sitemap.tmp

Substitute in your address, of course.

If the output looks right, replace the original file:

mv sitemap.tmp sitemap.txt
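
The same steps can be folded into one pass without the temporary file. This is only a sketch, with a function name I made up. It assumes it is run from the archive directory that contains node/ and content/:

```shell
# Hypothetical helper: print one absolute URL per archived page.
# $1 is the base URL of the archive, without a trailing slash.
build_sitemap() {
    base_url=$1
    for f in node/*.html content/*.html; do
        # Skip a pattern that matched nothing (the shell leaves it unexpanded).
        if [ -e "$f" ]; then
            printf '%s/%s\n' "$base_url" "$f"
        fi
    done
}
```

Usage would be build_sitemap http://saveourcommunity.us/archive > sitemap.txt, again substituting your own address.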

Adding a Sitemap Link in robots.txt

Add a line like this to the robots.txt file:

Sitemap: http://saveourcommunity.us/archive/sitemap.txt
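
If you script the archive steps, the line can be appended only when it is missing, so that running the script again doesn’t duplicate it. A minimal sketch with a hypothetical function name, assuming robots.txt ends with a newline:

```shell
# Hypothetical helper: append a Sitemap line to robots.txt unless it is
# already there. $1 is the sitemap URL, $2 the path to robots.txt
# (defaults to robots.txt in the current directory).
add_sitemap_line() {
    line="Sitemap: $1"
    file=${2:-robots.txt}
    # -x matches whole lines, -F treats the pattern as a fixed string,
    # -q suppresses output. A missing file counts as "not found".
    if ! grep -qxF "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

For example: add_sitemap_line http://saveourcommunity.us/archive/sitemap.txt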

Rewriting URLs with 301 Redirects

Old URLs are of the form:

/node/123
/node?page=15
/content/foo-bar

New URLs look like this:

/archive/node/123.html
/archive/node@page=15.html
/archive/content/foo-bar.html

Apache’s mod_rewrite rules are pretty simple once you have examples and know regular expressions. Add these to the top of .htaccess. The trailing ? in the last rule drops the original query string from the redirect target:

[code]
RewriteEngine On
RewriteRule ^node/(\d+)$ /archive/node/$1.html [L,R=301]
RewriteRule ^content/(.*)$ /archive/content/$1.html [L,R=301]

RewriteCond %{QUERY_STRING} page=(\d*)$
RewriteRule ^node$ /archive/node@page=%1.html? [L,R=301]
[/code]
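
To test the redirects, it helps to know what each old URL should turn into. The sketch below approximates the rules above with sed, so the expected target can be compared against what the server really returns. Both function names are hypothetical. The sed patterns are slightly stricter than the Apache ones: they only accept page= as the sole query parameter.

```shell
# Hypothetical helper: print the archive path that the rewrite rules should
# produce for an old path. Paths that match no rule are printed unchanged.
expected_target() {
    printf '%s\n' "$1" | sed -E \
        -e 's#^/node/([0-9]+)$#/archive/node/\1.html#' \
        -e 's#^/node\?page=([0-9]*)$#/archive/node@page=\1.html#' \
        -e 's#^/content/(.*)$#/archive/content/\1.html#'
}

# Hypothetical helper: ask the live server where a full URL redirects.
# Prints the HTTP status code followed by the redirect target, if any.
actual_target() {
    curl -s -o /dev/null -I -w '%{http_code} %{redirect_url}\n' "$1"
}
```

For instance, compare expected_target /node/123 with actual_target http://saveourcommunity.us/node/123, which should report a 301 pointing at the same path.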

My main goal was to get the redirects for /node and /content correct. The other pages can still be crawled, but none of them need to be indexed, and it may be worthwhile to prevent them from being indexed.

What matters is that the content in the sitemap gets indexed and that all inbound links to articles are retained.

Other URLs

There are a couple of places to find more URLs to test. The first is to search for the site on Google. You’ll find some URLs to fix.

The other source for URLs is the Search Console’s report of URLs indexed. Download it and start looking for unexpected URLs. I got over a thousand URLs, and a lot of bad ones were in there.
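
With over a thousand URLs, a filter saves some squinting. This sketch assumes the export has been reduced to a plain text file with one URL per line, and it treats “unexpected” as anything that is neither covered by the redirects above nor already under /archive. Both are my assumptions, and so is the function name:

```shell
# Hypothetical helper: read URLs on stdin, one per line, and print those
# that are neither old-style paths handled by the redirects nor archive pages.
unexpected_urls() {
    grep -v -E '^https?://[^/]+/(node/[0-9]+|node\?page=[0-9]*|content/.+|archive/.*)$'
}
```

Run it as unexpected_urls < urls.txt, where urls.txt is a stand-in name for the reduced export. Whatever it prints is a candidate for another redirect rule or for removal from the index.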

Shutting Off the Old Site

I moved everything except /files into a subdirectory. I’ll need to deal with /files later. A 301 redirect to the new files area may be enough.

Then I created a new index.html that links to the archive.

And that’s it.

SEO Issues

Here’s a snapshot of the first day, Feb. 25.

Index size: 1402
Valid: 1212
Errors: 0
Duplicate page without canonical tag: 1667
Crawl Anomaly: 154
Crawled Not Indexed: 45

I’m expecting the index size to shrink. There should be fewer duplicate pages.