Digression: My Current Perspective on Cleaning up the Graph
After doing some of this search engine work, my mental model of search engines has improved. They have databases of pages, and of URLs: two different things. They have an index of pages – another database which maps words to pages.
So a search goes through the word index to the pages. The pages map to URLs, but many pages have more than one URL. The search engine tries to collapse similar pages into one page, and then picks one canonical URL.
Websites should be organized so that each page has one canonical URL, and the search engines know it.
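To make that mental model concrete, here is a minimal Python sketch of it. This is only my toy version, not how any real search engine works: the page IDs, the example.org URLs, and the "shortest URL wins" guess are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical data: page IDs, text, and URLs are invented for this sketch.
PAGE_TEXT = {
    "page-1": "foo is the first word",
    "page-2": "foo appears here too",
}
PAGE_URLS = {
    "page-1": ["http://example.org/foo"],
    "page-2": [
        "http://example.org/bar",
        "http://example.org/bar?page=1",
        "http://example.org/archive/bar",
    ],
}


def build_word_index(page_text):
    """Map each word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in page_text.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index


def pick_canonical(urls, declared=None):
    """Use the URL the site declared, if any; otherwise guess.

    The guess here (shortest URL, ties broken alphabetically) is an
    assumption standing in for whatever a real engine does.
    """
    if declared in urls:
        return declared
    return min(urls, key=lambda u: (len(u), u))


def search(word, index, page_urls, declared_canonicals=None):
    """Word -> pages -> one canonical URL per page."""
    declared_canonicals = declared_canonicals or {}
    results = []
    for page_id in sorted(index.get(word.lower(), ())):
        declared = declared_canonicals.get(page_id)
        results.append(pick_canonical(page_urls[page_id], declared))
    return results


INDEX = build_word_index(PAGE_TEXT)
```

Calling search("foo", INDEX, PAGE_URLS) walks the word index to both pages and returns one URL for each. When the site declares a canonical URL for page-2, the guessing step is skipped and the declared URL is used instead.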
Here’s a picture of the ideal graph of words, pages, and URLs, where there’s a clear, 1:1 correspondence of pages to URLs. Imagine this is the result of a search for “foo”:
Here’s a picture of what’s more common: the top page is perfect, but below it the engine sees two similar pages with five different URLs. There is no clear mapping between them, so the search engine makes a guess about the URLs in the lower part:
Google thinks the two pages are really the same page, because they look so similar, and it then chooses a single URL to represent both.
To fix the URLs, use rel=”canonical” links. Then, to eliminate the duplicate content problem, edit the pages to be more different, or, if one of the pages is a list of articles, use a “noindex” meta tag to keep that page out of the search engine.
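A quick way to check whether a page actually carries these hints is to parse its HTML and look for the two tags. The sketch below uses Python's standard html.parser module; the sample pages and the example.org URL are hypothetical, and the checker only looks at the link rel="canonical" tag and the robots meta tag, nothing else.

```python
from html.parser import HTMLParser


class HeadHints(HTMLParser):
    """Collect the canonical URL and the noindex flag from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (e.g. <meta name>), so default to "".
        a = {name: (value or "") for name, value in attrs}
        if tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href") or None
        elif tag == "meta" and a.get("name", "").lower() == "robots":
            content = a.get("content", "").lower()
            directives = [d.strip() for d in content.split(",")]
            if "noindex" in directives:
                self.noindex = True


def read_hints(html):
    """Return (canonical_url_or_None, noindex_flag) for an HTML string."""
    parser = HeadHints()
    parser.feed(html)
    parser.close()
    return parser.canonical, parser.noindex


# Hypothetical sample pages: an article that declares its canonical URL,
# and a listing page that asks to stay out of the index.
ARTICLE = (
    '<html><head><title>Article</title>'
    '<link rel="canonical" href="http://example.org/news/123">'
    '</head><body>...</body></html>'
)
LISTING = (
    '<html><head><title>Archive</title>'
    '<meta name="robots" content="noindex, follow">'
    '</head><body>...</body></html>'
)
```

Running read_hints over every URL variant of a page should give the same canonical URL each time; if it doesn't, the search engine is back to guessing.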
Here’s the picture with fixes:
Ahhhh. That’s better.
Why didn’t I know about this before?
I wondered why I hadn’t known this before, and came to the conclusion that the old CMSs were doing a very good job of organizing my site, so it would get indexed correctly. Even my bespoke CMSs were doing things correctly, by following the plain-old-info-architecture rules.
As search engine spamming has gotten more sophisticated, the search engines have become more selective. The effect is that the indexes for old sites are shrinking as bad or duplicate pages are eliminated from the search engines. Now, I’m finally seeing some of these effects at the LA Indymedia site.
When I ported the old site to WordPress, I ran into a whole other issue: publishing the archives under different URLs.