Monday, January 17, 2005

Markov chains

I said a while back that I would write something about Markov chains. I just finished a tricky debugging session, and I don't think I'll be good for much else for the next half hour; so here goes.

Remember Frogger? Suppose our hero, the intrepid frog, is exploring a major intersection. From any of the corners, median strips, or other safe havens, he can venture forth to explore one of the other havens. The frog rolls a die (badly weighted, since frogs are not good dice-makers) to tell which corner to hop to at each step. Supposing the frog does not get squashed, we might ask: what fraction of his time is the frog likely to spend at each corner?

More generally, we might have some graph that we're moving around on. Each vertex in the graph (each state) is connected to a few other vertices, and at every time step we move to one of the adjacent vertices or stay still, based on a roll of the dice. The important thing is that whenever we're at node X, the probability that we'll move to another node Y must be fixed -- it can't change over time or depend on past history (the latter condition is the more important of the two). Typical questions we might then ask are: if we let the chain run for an arbitrarily long time, will the probability of landing on a particular node settle down? If so, how quickly?
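
To make the frog question concrete, here is a minimal simulation sketch. The graph, the corner names, and the die weights are all made up for illustration: hop around for a long time and count what fraction of the steps land at each haven.

```python
import random
from collections import Counter

# Made-up intersection: from each safe haven, the (badly weighted) die picks
# the next haven with fixed probabilities that never change and never depend
# on where the frog has been before.
hop = {
    'NE': (['NW', 'SE', 'NE'], [0.5, 0.3, 0.2]),
    'NW': (['NE', 'SW', 'NW'], [0.4, 0.4, 0.2]),
    'SE': (['SW', 'NE', 'SE'], [0.6, 0.3, 0.1]),
    'SW': (['SE', 'NW', 'SW'], [0.5, 0.4, 0.1]),
}

visits = Counter()
corner = 'NE'
steps = 100_000
for _ in range(steps):
    havens, weights = hop[corner]
    corner = random.choices(havens, weights)[0]   # roll the weighted die
    visits[corner] += 1

for c in sorted(visits):
    print(c, visits[c] / steps)   # empirical fraction of time at each haven
```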

Notice that I've changed something fundamental as soon as I start talking about probabilities. If I watch a foolhardy frog hopping around a major intersection, I'll only see the frog at one corner at a time -- barring gruesome accidents, of course. But if I look at the frog's behavior over long time spans, or if I look at the aggregate behavior of a bushel-basket full of frogs that I picked up at Ramses Fish Market and House of Plaques, then I can talk about a distribution over all the corners. That is, I've gone from a discrete value to something continuous.

Notice also that I've made a critical assumption, the one that makes a Markov chain a Markov chain: the transition probabilities do not depend on prior history. One of the profound things about physical systems is that, even for complicated systems, this is often a good assumption! It's almost a scientific article of faith that, underneath it all, the universe plays by a fixed set of rules. When a system does appear to have memory, it's often because we weren't looking closely enough to see some hidden variable. For instance, if I hit a ball with a bat, there's some probability (very small, in my case) that the bat will break. That probability changes over time, but that's not because of the passage of time per se; it's because of the accumulated damage to the bat. The probability that a bat with a particular amount of damage will break does not depend on time. The trick here, which applies in general, is that the memory of the system is encoded in additional variables. So I now have a bigger graph -- not just bat broken and bat not broken, but also bat slightly damaged (or moderately damaged, or heavily damaged) -- but on this larger graph, I can call the process memoryless.
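
Here is a small sketch of that trick, with made-up numbers: a toy "weather" process in which tomorrow's weather depends on the last two days is not memoryless if the state is just today's weather, but it becomes memoryless once the state is the pair (yesterday, today).

```python
import random

# Hypothetical second-order process: the chance of sun tomorrow depends on
# the last TWO days, so tracking only today's weather leaves hidden memory.
p_sun = {('sun', 'sun'): 0.8, ('sun', 'rain'): 0.4,
         ('rain', 'sun'): 0.6, ('rain', 'rain'): 0.2}

def step(state):
    """One step on the enlarged state space of (yesterday, today) pairs.
    On pairs, the next state depends only on the current state, so the
    process is Markov again."""
    _, today = state   # only the current pair matters, nothing further back
    tomorrow = 'sun' if random.random() < p_sun[state] else 'rain'
    return (today, tomorrow)

state = ('sun', 'sun')
for _ in range(5):
    state = step(state)
    print(state)
```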

The problem with adding variables to represent memory is that they make the number of possible states in the system much larger. At the logical extreme, you could call the entire history part of the state -- but then you'd have an infinite state space, which is inconvenient. The remarkable thing about the natural world is that the history of very complex systems can often be summarized with such a small number of variables. Human memories are not so easily summarized; that is part of why statistical mechanics produces very useful descriptions of nature, while the analogous statistical mechanics of human behavior (psychohistory in Isaac Asimov's Foundation books) is never likely to be seen outside science fiction.

Back to the matter at hand: Markov chains. If I have a finite state space, I can represent a Markov chain by a matrix, which I will call A. The Aij entry (row i, column j) represents the probability of moving to state i from state j, so that each column of A sums to one. If pk is a vector representing the probability that I'm in each state at the kth step, then I can write the probability at step k+1 as pk+1 = A pk. The solution to this recurrence, starting at some initial probability distribution p0, is pk = A^k p0.
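
In code, with a made-up four-state matrix following this convention (each column sums to one), the recurrence is just a loop of matrix-vector products:

```python
import numpy as np

# Made-up column-stochastic matrix: A[i, j] is the probability of moving
# to state i from state j, so pk+1 = A pk.
A = np.array([[0.2, 0.4, 0.3, 0.5],
              [0.5, 0.2, 0.3, 0.4],
              [0.3, 0.4, 0.1, 0.0],
              [0.0, 0.0, 0.3, 0.1]])
assert np.allclose(A.sum(axis=0), 1.0)   # columns sum to one

p = np.array([1.0, 0.0, 0.0, 0.0])       # p0: start in state 0 for sure
for k in range(50):
    p = A @ p                            # pk+1 = A pk
print(p)                                 # approximately A^50 p0
```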

What began as a question about randomly hopping frogs has turned into a question of linear algebra -- as many interesting statistical problems ultimately do. The trick to analyzing linear systems is to choose your variables right. If we use the obvious parameterization of the probability distribution, the entry pki represents the probability of ending up in state i at step k. If we ask instead how some combination of the entries of pk behaves, then we can solve an easier problem. This is the second time we've relaxed the problem this way: the first time, we went from a discrete set (the frog is at the northeast corner of Fourth and Broadway) to a continuous set of probability distributions. Now we're taking things a step further and looking at coordinates that don't even represent being in a particular state! But by choosing those vectors right -- in particular, by taking the combinations defined by the eigenvectors of A, so that each step just multiplies each coordinate by the corresponding eigenvalue -- we can (usually) turn a messy matrix equation into a bunch of one-dimensional equations, which are easy to analyze.

Huzzah for the Jordan canonical form! No, that's not the way Michael Jordan throws a basketball.
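
A quick numerical illustration of the change of coordinates, reusing the made-up matrix from the sketch above and assuming it is diagonalizable (so the Jordan form is just a diagonal matrix): expand p0 in the eigenvectors of A, and each coordinate evolves on its own, multiplied by its eigenvalue at every step.

```python
import numpy as np

# Same made-up column-stochastic matrix as in the previous sketch.
A = np.array([[0.2, 0.4, 0.3, 0.5],
              [0.5, 0.2, 0.3, 0.4],
              [0.3, 0.4, 0.1, 0.0],
              [0.0, 0.0, 0.3, 0.1]])

lam, V = np.linalg.eig(A)       # A V = V diag(lam); assumes A is diagonalizable
p0 = np.array([1.0, 0.0, 0.0, 0.0])
c = np.linalg.solve(V, p0)      # coordinates of p0 in the eigenvector basis

k = 20
pk_direct = np.linalg.matrix_power(A, k) @ p0   # iterate the matrix recurrence
pk_eig = (V * lam**k) @ c       # n scalar recurrences instead, one per eigenvalue
print(np.allclose(pk_direct, pk_eig))           # True, up to roundoff
```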

At any rate, this sort of analysis tells you a number of things. In particular:

  • There is at least one stationary distribution: that is, a ps such that A ps = ps. If the graph of transitions is strongly connected -- every state can eventually be reached from every other -- then there is only one such distribution.
  • If there is one stationary distribution, then in the long run (as k goes to infinity), there are only two things pk can do. The first is that the chain can slosh back and forth between distributions. For instance, suppose I put a stone on a checkerboard and, at each step, allow myself to move it up, down, left, or right by one square. If I start at a white square, then I'll be back on a white square after every second move, and on a black square on the moves in between. In the long run, I'll have two distributions for how often I visit each square on the board: one for how often I visit the white squares, and one for how often I visit the black squares. The second thing that pk can do is to converge to ps -- in my checkerboard example, this is what happens if at each step I allow some nonzero probability that the stone will stay where it is. A Markov chain which always converges to a single stationary distribution is called ergodic -- a simple enough idea with a fancy name.
  • If the chain is ergodic, we can ask how long it takes to get to the stationary distribution. The answer is that you can characterize the speed of convergence in terms of differences between eigenvalues of A (usually the gap between the eigenvalue 1 and the eigenvalue which is second-largest in magnitude); there is a sketch of this after the list. Yet another reason for people to care about eigenvalue calculations!
  • The finiteness of the chain is not only quantitatively important: it's also qualitatively important. If I allow an infinite number of states, then in general there does not need to be any stationary distribution! In a finite, ergodic Markov chain, the probability of landing in a particular state tends to diffuse from the initial distribution (this is not a false analogy: many diffusion equations can be derived from considering Markovian behavior of particles at the microscale). If the state space is infinite, then the probability can keep diffusing away forever, so that the chain visits ever more states with ever-decreasing probability. Much of what I've said generalizes to some continuous state spaces, but these Markov chains are characterized by a different sort of finiteness condition.
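
To tie the first three bullets together, here is a sketch that uses a walk on a ring of six squares as a stand-in for the checkerboard (all numbers made up). The pure left/right walk is periodic, so its second-largest eigenvalue has magnitude 1 and pk never settles; the "lazy" version that sometimes stays put pulls that eigenvalue inside the unit circle, converges, and has a stationary distribution you can read off the eigenvector for eigenvalue 1.

```python
import numpy as np

# Walk on a ring of n squares; column-stochastic convention as in the text.
n = 6
A = np.zeros((n, n))
for j in range(n):
    A[(j - 1) % n, j] = 0.5   # step one square "left"
    A[(j + 1) % n, j] = 0.5   # step one square "right"

# Like the checkerboard, the parity of the square alternates every move, so
# the chain is periodic.  Allowing the stone to stay put breaks the period.
A_lazy = 0.5 * (A + np.eye(n))

for name, M in [("periodic", A), ("lazy", A_lazy)]:
    lam = np.linalg.eigvals(M)
    lam = lam[np.argsort(-np.abs(lam))]                 # sort by magnitude
    print(name, "second-largest |eigenvalue|:", abs(lam[1]))
    # magnitude 1 for the periodic walk (no convergence), 0.75 for the lazy walk

# Stationary distribution of the lazy walk: the eigenvector for eigenvalue 1,
# rescaled to sum to one.  By symmetry it is uniform over the six squares.
lam, V = np.linalg.eig(A_lazy)
ps = np.real(V[:, np.argmax(np.real(lam))])
ps = ps / ps.sum()
print("stationary distribution:", ps)
```
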
Currently drinking: Gen mai cha