Update, February 1, 2018
The Search Console’s index is reporting only until Jan. 22, which was nine days ago, and five days before this blog post was originally written. So the effects of the above repair on the index aren’t yet visible. The Search Traffic section is showing data up to the 30th, and it shows an improvement in the search position for pages: 23, versus an average of 30. The position for queries was also 23, versus an average of 30. It’s still too little data to know whether anything has really improved.
Looking through the “Duplicate page without canonical tag” report (in the new Search Console), I found a bunch of category and taxonomy term (tag) pages being flagged as duplicates and removed. Fortunately, Google was usually removing the correct page, but rather than suffer potential problems, I just wiped out the contents of the category/ and taxonomy/ directories. They’re already blocked by robots.txt, but I may unblock those paths in the future, in case those pages don’t drop from the index.
I also removed the blogs/ directory, which just repeated content from the homepage.
I have blocked comment/ with robots.txt, but haven’t deleted it yet.
Robots.txt and these deletions should make for a less “bushy” graph of pages over the next couple of weeks.
What to do with Old URLs
I don’t know what to do with these dead URLs. My first thought was to point them to the home page, but I wondered if that kind of shaping would harm the site’s authority.
Right now, the site has 951 pages indexed, and 1770 pages excluded. (That’s actually as of 9 days ago.) That’s 2721 URLs in total. I think I’ve got most of the inbound links, so these are Google Index entries. They’re a mix of old URLs from the Drupal site, new URLs from the archive of the Drupal site, and static pages that are now gone.
I’ll see what the previous changes did to the indexation, and then come up with a strategy for redirecting old URLs, trying to get 100% coverage of the old URLs.
I hope all these problem URLs will eventually become 301-redirected URLs. As the list of 301s grows, and the others shrink, Google should delete the original URLs from the index.
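One way to track progress toward that 100% coverage is to run the list of old URLs against the redirect patterns and see what is still unmatched. The following is only a rough sketch: the three patterns and the sample paths are hypothetical, chosen to mirror the directories mentioned above, and they are written without a leading slash, the way mod_rewrite sees paths in an .htaccess file.

```python
import re

# Hypothetical redirect patterns (assumptions, not the site's real rules).
# Paths are matched without their leading slash, as in .htaccess context.
REDIRECT_PATTERNS = [
    re.compile(r"^category/"),
    re.compile(r"^taxonomy/"),
    re.compile(r"^blogs/"),
]


def is_covered(path, patterns=REDIRECT_PATTERNS):
    """Return True if any redirect pattern matches the path."""
    relative = path.lstrip("/")
    return any(p.search(relative) for p in patterns)


def coverage(paths, patterns=REDIRECT_PATTERNS):
    """Return (fraction of paths covered, list of uncovered paths)."""
    uncovered = [p for p in paths if not is_covered(p, patterns)]
    if not paths:
        return 1.0, uncovered
    return 1 - len(uncovered) / len(paths), uncovered


if __name__ == "__main__":
    # Hypothetical sample of old URLs, e.g. exported from Search Console.
    sample = ["/category/news", "/taxonomy/term/5", "/blogs/", "/node/12"]
    fraction, missing = coverage(sample)
    print(f"{fraction:.0%} covered; still missing: {missing}")
```

The list of uncovered paths is the useful part: each entry is a URL that still needs a rule or a decision.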
Aha! Drupal Reply Forms are Identified as Duplicate Content
On each old Drupal article, there’s an “Add new comment” link that opens up a comment form. The entire article is displayed above the form. Because the form is opened by clicking on a link, rather than a form submission button, the wget program downloaded a comment form for each article.
Which version of the article is considered the canonical article? I don’t know, but Google made a choice.
To fix up the index, I created a new rewrite rule:
# All these old reply links end up looking like duplicate content.
RewriteRule ^comment/reply/.*$ / [L,R=301]
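To sanity-check which paths the rule catches before relying on it, the same regular expression can be tried out in a few lines of Python. This is a sketch, not a substitute for testing against the real server: it assumes the rule lives in an .htaccess file (so the leading slash is stripped before matching), and the sample paths are made up.

```python
import re

# Same pattern as the RewriteRule above. In .htaccess context, mod_rewrite
# matches against the path without its leading slash (an assumption here).
REPLY_RULE = re.compile(r"^comment/reply/.*$")


def redirect_for(path):
    """Return (status, location) if the rule would redirect, else None."""
    if REPLY_RULE.match(path.lstrip("/")):
        return (301, "/")
    return None


if __name__ == "__main__":
    # Hypothetical paths for illustration only.
    for path in ["/comment/reply/123", "/comment/123", "/node/123"]:
        print(path, "->", redirect_for(path))
```

Only paths under comment/reply/ are redirected to the home page; everything else under comment/ is left alone.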