Tea Total: On Anonymity

Have you seen the old New Yorker cartoon in which two dogs are sitting at a computer, and one is telling the other, "On the Internet, nobody knows you're a dog"?

Hmm. Let me share some reasons why online anonymity is a little more complicated than that.

First, let's dispense with the obvious. If you write a large body of information drawn from your personal experiences, then anyone reading with a reasonably analytical turn of mind will be able to deduce a great deal about you (unless you're deliberately deceptive). Gender, profession, hobbies, and location are the obvious things. This information may give a reader a shrewd guess as to who you are; if not, it may at least narrow the guess to a small group of people.

Now, let's discuss the more interesting piece: personal style.

If you are reading this as an academic, you will appreciate that writing style, together with a few contextual clues, is often enough to render blind review a polite fiction. You send me a review in which you say "not enough attention is paid to the following references" and list twelve references with a common co-author; I may well guess who you are. Or I write a review of your paper after we have communicated verbally or in writing; you may well recognize me simply from the way that I choose my words. The blind review is really only hazy, but that's fine. I may still have doubts about my guess, as might you. Similarly, if I were to read an anonymous blog written by someone I knew, I might well be able to guess that person's identity, but it would be hard for me to be sure.

What is it that you identify as a particular writer's style? My guess is that a large fraction of what you identify as style has to do with statistical patterns in language use. The most obvious things involve sentence length and complexity, the sort of metrics that lead one to declare that some author writes at an "nth grade level." But many of us have our own personal catch-phrases, too. In music, people have discovered that by looking at the probabilities associated with very short sequences of notes, one can distinguish the music of Bach, say, from the music of Mozart. On Amazon, you may have noticed the introduction of Statistically Improbable Phrases (SIPs) on the book description pages. Many of the best spam filters work by training a classifier to distinguish spam from legitimate mail on the basis of automatically determined statistical properties of the text. A blog is a large volume of text, which is exactly what's needed for such statistical text classifiers; given the text of a blog and a set of texts (e-mail messages or class papers, perhaps) written by possible authors of said blog, I dare say it would take me about a week to put together a good tool to decide which, if any, of my suspects was actually responsible for the writing. Given that there are people out there who know much more about this sort of thing than I do, I would be surprised to find that such a tool does not already exist. And I haven't even said anything about tracing the link structure!
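To make the idea concrete, here is a minimal sketch of the crudest version of such a tool. It is not the good tool I have in mind, just a toy under stated assumptions: each text is reduced to the relative frequencies of its two-word sequences, and suspects are ranked by the cosine similarity between their profile and the blog's. The names ngram_profile, cosine, and rank_suspects are made up for this illustration, and a serious attempt would need far more text, smoothing, and some way of saying "none of the above."

```python
import math
import re
from collections import Counter


def ngram_profile(text, n=2):
    """Map each word n-gram in the text to its relative frequency."""
    words = re.findall(r"[a-z']+", text.lower())
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def rank_suspects(blog_text, samples, n=2):
    """Rank candidate authors by similarity to the blog, most similar first.

    samples is a dict mapping a (hypothetical) author name to known
    writing by that author.
    """
    blog = ngram_profile(blog_text, n)
    scores = {name: cosine(blog, ngram_profile(text, n))
              for name, text in samples.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented for the example; real use needs much longer texts.
    suspects = {
        "alice": "I dare say it would take me about a week",
        "bob": "the quick brown fox jumps over the lazy dog",
    }
    for name, score in rank_suspects("I dare say it would take a week or so", suspects):
        print(name, round(score, 3))
```

With samples this short the scores mean very little; the point is only that the whole pipeline fits in a few dozen lines, which is why I would expect better versions to exist already.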

I think it would be interesting, actually, to try to build classifiers in this way to see if there are any two distinct authors who write in such a similar manner that it's impossible to distinguish their writing based only on the probabilities of two- or three-word sequences. Or could such a classifier be trained to automatically recognize regional dialects, identifying from what part of the US a particular writer might hail? How reliable would such a classification be? But I digress.

I suspect that most people would be shocked at how easily their privacy and anonymity could be removed if anyone cared to make the effort (though anyone who has stood in line at a grocery store knows full well that the private lives of the rich and the famous are often anything but private). But usually nobody cares to make the effort. If you wish to be anonymous, the polite thing for me to do when I recognize you is to pretend that I know nothing. But this assumes that I've paid enough attention to recognize you in the first place. It's a little like an inexpensive lock: a deterrent that tells most people that it would take some bother to steal the protected item, and tells the few people who could pick the lock with no difficulty that you'd really consider it rude for them to run off with your stuff.

Currently drinking: Green tea

Monday, July 18, 2005