LA Indymedia’s Google Index… is sucking

Wednesday, February 14

I was expecting dramatic changes, quickly, but that hasn’t been happening. It’s probably because the site is so large, and the index is so large, that even a good week of crawling and indexing isn’t going to change very much. Crawl stats seem to indicate an upturn in pages per day – but it was already averaging over 1600. Also, the size of the index is growing slightly.

The slight increase is fine – I expected a decline, because I canonicalized URLs (which reduces the index size), allowed crawls into some NOINDEX pages (to cause them to drop out), and added the NOINDEX tag to more pages. Perhaps this slight increase meant that pages considered spam, due to duplication, were allowed back into the index.

Making the JS App Search Friendly

I also implemented canonical URL tags for the /js/ pages. This was an ugly hack, but it was interesting. The original JS app was a one-page app on an HTML page cranked out by a JS toolchain. It was the archetypal JS app, where the page delivers some markup and code, but all the content is retrieved from a server and inserted into the page. To a search engine, all these pages look like one page. That’s not good for us: search engines want to scan the HEAD contents to look for the canonical URL and NOINDEX (and probably other things).

So, even for one-page JS apps, it seems like you need to have a page generated on the server side, to emit the correct meta information about the post.
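A minimal sketch of what that server-generated HEAD might look like for one article. The domain and path here are hypothetical placeholders, not the site's actual URL scheme:

```html
<head>
  <title>Example article title</title>
  <!-- Point every variant of this page (the /js/ path, tracking
       parameters, etc.) at one canonical URL -->
  <link rel="canonical" href="https://example.org/news/2018/02/12345.php">
  <!-- Or, for pages that should drop out of the index instead: -->
  <!-- <meta name="robots" content="noindex"> -->
</head>
```

The key point is that this markup has to arrive in the initial HTML response, not be injected later by JavaScript, or the crawler may never see it.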

As long as you’re generating the HEAD, you may as well deliver the entire page content as HTML, within the one-page app, so it can be scanned and previewed by sites like Facebook. And as long as you’re delivering the entire page, you may as well render the PHP page to a file on disk, so it’s delivered faster, without touching the database. Just like the rest of the site. (The IMC site is fast because every article renders PHP and HTML versions to disk. Download time averages 216ms on plain old disks.)

So much for the archetypal JS app!

Fixing the Calendar

What I did notice was that the more recent excluded pages report didn’t show many news article URLs, but did include many calendar URLs. So I worked on that a little bit.

The most glaring problem was the URLs to edit and delete calendar events. Crawling these had no effect, because making a change requires a PIN, but each event had unique “edit” and “delete” links, and the spiders had crawled them all.

I changed those links to FORMs, thinking they wouldn’t get crawled. A FORM can submit to a URL, and instead of a link it displays a button. (It turns out that Google will still crawl forms that use method=get.)
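A sketch of that change, with a hypothetical path and event ID. Since Google may crawl method=get forms (which produce an ordinary URL with query parameters), method=post is the safer choice here:

```html
<!-- Before: a plain link, giving each event a unique crawlable URL -->
<a href="/calendar/edit.php?event=12345">edit</a>

<!-- After: a FORM that renders as a button; with method=post there is
     no URL-with-parameters for a spider to construct and crawl -->
<form action="/calendar/edit.php" method="post">
  <input type="hidden" name="event" value="12345">
  <input type="submit" value="edit">
</form>
```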

Then, I added NOINDEX to each of the form pages, and set the canonical URL to the page’s own URL, stripped of any parameters. (This required a little new programming.) These URLs should eventually drop out of the index.
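In the HEAD of each form page, that combination might look like this (domain and path are hypothetical):

```html
<!-- Keep this page out of the index entirely -->
<meta name="robots" content="noindex">
<!-- Canonical URL is the page's own path with the query string dropped,
     so /calendar/edit.php?event=12345 and friends collapse to one URL -->
<link rel="canonical" href="https://example.org/calendar/edit.php">
```

With both signals in place, the already-crawled parameterized URLs should consolidate onto the bare path and then fall out of the index as the NOINDEX is honored.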

Duck Duck Go

One of the collective members uses Duck Duck Go, so I started looking at those search results. I can’t see any patterns yet, so I’ll need to read some articles about how that search engine works.
