Preserving Your Users' Privacy

If you operate a website, or wish to retain privacy while visiting websites, you need to be aware of how websites share their traffic data with other sites.

Data is typically *not* shared by transferring lists directly between companies, as one might expect. Most sharing happens through "embeds", or bits of code in web pages that call out to another company's server. This site, for example, usually runs embeds for two Google ads, one PageRank counter, a hidden embed for Google Analytics, and another hidden embed for Kontera ads. In the past I've used other page counters.

There are also embeds for content from YouTube and other sites.

The code for each embed contains a reference to a different server. That causes the browser to issue a request to that server. The server processes the request and logs your information, and might set a cookie to track you each time your browser makes a request.
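
To make that request-and-cookie cycle concrete, here is a minimal sketch of what a third-party server might do with each embed request. It is an illustration, not any real company's code: the function name `handle_embed_request`, the cookie name `visitor_id`, and the in-memory `log` list are all assumptions made up for this example.

```python
import uuid
from http.cookies import SimpleCookie


def handle_embed_request(headers, remote_ip, log):
    """Simulate a third-party server answering one embed request.

    `headers` is a dict of request headers, `log` is a plain list that
    stands in for the server's log file. All names here are hypothetical.
    """
    cookie = SimpleCookie(headers.get("Cookie", ""))
    if "visitor_id" in cookie:
        # Returning browser: reuse the ID we handed out earlier.
        visitor_id = cookie["visitor_id"].value
        response_headers = {}
    else:
        # New browser: mint an ID and ask the browser to store it.
        visitor_id = uuid.uuid4().hex
        response_headers = {"Set-Cookie": f"visitor_id={visitor_id}; Path=/"}

    # Record what the browser sent along with the request.
    log.append({
        "visitor_id": visitor_id,
        "ip": remote_ip,
        "user_agent": headers.get("User-Agent", ""),
        "referer": headers.get("Referer", ""),
    })
    return response_headers


log = []
first = handle_embed_request(
    {"User-Agent": "ExampleBrowser/1.0", "Referer": "http://site-a.example/"},
    "203.0.113.5",
    log,
)
print(first)  # contains a Set-Cookie header with a fresh visitor_id
```

On the next request the browser sends the cookie back, so the server can tie the new log entry to the same visitor without knowing a name or email address.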

The information transferred for each request generally does NOT include personal information, but it does include information like browser type, version, operating system, IP address (which reveals your ISP and rough geographic location), and the cookie for that remote server. Additionally, if the embed returns JavaScript (like the Google Analytics embed does), the JavaScript code can interrogate your system further, revealing things like screen size, fonts installed, plugins installed, and so forth. JavaScript includes security restrictions, so code can't steal data that belongs to another server, but it can still reveal a lot about a system.

Above and beyond JavaScript, there are newer cookies stored by Flash. Because these are harder to remove, they're called "super cookies".

All these bits of information, taken together, can form a "fingerprint" for you. For example, I use Ubuntu 10.04 and Firefox 3.6.13, and connect to the internet via AT&T in Los Angeles. With just those settings, you can narrow millions of users down to thousands. Now, let's suppose I visit a website that gets only a couple thousand visitors per day. With only four pieces of information, you can assume that I might be the only visitor with this "fingerprint".
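
The narrowing-down argument is just multiplication, and a rough sketch shows why four attributes can be enough. The shares below are HYPOTHETICAL placeholders, not measured statistics, and the calculation assumes the attributes are independent of each other, which real data only approximates. The function name `expected_matches` is likewise made up for this example.

```python
from math import prod


def expected_matches(daily_visitors, attribute_shares):
    """Expected number of visitors who share every listed attribute,
    assuming the attributes are statistically independent."""
    return daily_visitors * prod(attribute_shares.values())


# HYPOTHETICAL shares of visitors having each attribute.
# Illustration only; these are not real measurements.
shares = {
    "os_version": 0.01,
    "browser_version": 0.05,
    "isp": 0.10,
    "city": 0.02,
}

# "A couple thousand visitors per day", as in the example above.
estimate = expected_matches(2000, shares)
print(f"{estimate:.4f} expected matching visitors per day")
```

When the estimate comes out well below 1, a visitor who does match all four attributes is probably the only one who does.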

So even if I have cookies turned off, you can probably assume that anyone with my fingerprint is me.

Now, let's mix in some other information: my list of installed plugins, and list of installed fonts. If you can get that, you pretty much have my fingerprint. You might even have the fingerprint across computers, because I install a couple dozen fonts on most of my personal computers, and a few are fairly unusual.
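
One simple way a tracker could turn such a collection of attributes into a single comparable value is to hash them. The sketch below is only one possible approach, assumed for illustration: the `fingerprint` function, the attribute names, and the font and plugin names are all hypothetical.

```python
import hashlib


def fingerprint(attributes):
    """Hash a dict of browser/system attributes into one hex string.

    Keys are sorted, and list-like values (fonts, plugins) are sorted too,
    so the same set of attributes always gives the same fingerprint.
    """
    parts = []
    for name in sorted(attributes):
        value = attributes[name]
        if isinstance(value, (list, tuple, set)):
            value = ",".join(sorted(value))
        parts.append(f"{name}={value}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


me = fingerprint({
    "os": "Ubuntu 10.04",
    "browser": "Firefox 3.6.13",
    "isp": "AT&T",
    "city": "Los Angeles",
    # Hypothetical font and plugin names, for illustration only.
    "fonts": ["Example Sans", "Example Serif", "Unusual Display Font"],
    "plugins": ["Example Flash Plugin"],
})
print(me)
```

Nothing in this value depends on a cookie, and if the font list is the same on two of my computers, leaving out the machine-specific attributes would give the same fingerprint on both.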

Steve Gibson of Security Now! reported that with this fingerprinting, you could identify a single user on the internet with over 90% accuracy. (It was something like 95%.) I'm not sure what that statistic means, but I'll assume it means that given a fingerprint, the odds that you will correctly identify the user are 90%. Perhaps it means that for 90% of fingerprints there's only one match, while the other 10% match multiple users (assuming that no fingerprint matches no users).
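
The two readings of that statistic give different numbers, which a small sketch can show. This is not how the reported figure was computed (I don't know that); it only contrasts "share of fingerprints that match exactly one user" with "share of users who have a fingerprint all to themselves". The function name and the toy data are hypothetical.

```python
from collections import Counter


def uniqueness_stats(fingerprint_by_user):
    """Return (share of fingerprints that are unique,
               share of users who can be singled out)."""
    if not fingerprint_by_user:
        raise ValueError("need at least one user")
    counts = Counter(fingerprint_by_user.values())
    unique = sum(1 for n in counts.values() if n == 1)
    return unique / len(counts), unique / len(fingerprint_by_user)


# Toy data: four users, two of whom share fingerprint "z".
toy = {"alice": "x", "bob": "y", "carol": "z", "dave": "z"}
per_fingerprint, per_user = uniqueness_stats(toy)
print(per_fingerprint, per_user)
```

In the toy data, two of the three distinct fingerprints are unique, but only two of the four users can be singled out, so it matters which of the two a quoted percentage refers to.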

On top of this, embeds are used across sites, so these servers keep track of your visits across sites. A widely deployed embed like Google Analytics knows not only each site's stats, but also how users move between the sites that use the embed. The most widely deployed embeds of this type are ad servers, and they profile not only your identity but also which pages you visit, to discern what content you like to read.
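
Here is a rough sketch of how an embed's server could reconstruct movement between sites from its own request log. It assumes each log entry records a visitor ID (from a cookie or a fingerprint) and the Referer header, which names the page that contained the embed. The entry layout, the function name `browsing_profiles`, and the example hosts are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def browsing_profiles(log):
    """Group the sites seen in Referer headers by visitor, in log order."""
    profiles = defaultdict(list)
    for entry in log:
        host = urlsplit(entry["referer"]).hostname
        if host:
            profiles[entry["visitor_id"]].append(host)
    return dict(profiles)


# Hypothetical log entries from one embed deployed on three sites.
log = [
    {"visitor_id": "v1", "referer": "http://news.example/politics"},
    {"visitor_id": "v2", "referer": "http://cooking.example/bread"},
    {"visitor_id": "v1", "referer": "http://health.example/symptoms"},
    {"visitor_id": "v1", "referer": "http://cooking.example/soup"},
]
for visitor, sites in browsing_profiles(log).items():
    print(visitor, sites)
```

None of the three sites sees the whole trail, but the server behind the shared embed does.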

If you wish to preserve the anonymity of your users, you must not use embeds. You must also figure out how to fund your project, because the value of ads online is increasing, and the increase is due to the ability of sites to track and profile you.
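
If you want to check a page for embeds, one starting point is to list every outside host its HTML refers to. The sketch below uses Python's standard `html.parser` and is deliberately simple: it only looks at a handful of tags and attributes, it won't see embeds that scripts add after the page loads, and the class name, function name, and example hosts are all made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class EmbedFinder(HTMLParser):
    """Collect hosts other than our own that the page loads resources from."""

    TAGS = {"script", "img", "iframe", "embed", "object", "link"}
    ATTRS = {"src", "href", "data"}

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.third_party_hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in self.TAGS:
            return
        for name, value in attrs:
            if name in self.ATTRS and value:
                # Relative URLs have no hostname and are skipped.
                host = urlsplit(value).hostname
                if host and host != self.own_host:
                    self.third_party_hosts.add(host)


def find_third_party_hosts(html, own_host):
    finder = EmbedFinder(own_host)
    finder.feed(html)
    finder.close()
    return sorted(finder.third_party_hosts)


page = """
<html><body>
<img src="/logo.png">
<script src="http://tracker.example/counter.js"></script>
<iframe src="http://video.example/embed/123"></iframe>
</body></html>
"""
print(find_third_party_hosts(page, "mysite.example"))
```

Every host in that list receives a request, along with the visitor's IP address and browser details, each time someone loads the page.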