I said a while back that I would write something about Markov chains. I just finished a tricky debugging session, and I don't think I'll be good for much else for the next half hour; so here goes.
Remember Frogger? Suppose our hero, the intrepid frog, is exploring a major intersection. From any of the corners, median strips, or other safe havens, he can venture forth to explore one of the other havens. The frog rolls a die (badly weighted, since frogs are not good dice-makers) to tell which corner to hop to at each step. Supposing the frog does not get squashed, we might ask what fraction of its time the frog is likely to spend at each corner.
More generally, we might have some graph that we're moving around on. Each vertex in the graph (each state) is connected to a few other vertices, and at every time step we move to one of the adjacent vertices or stay still based on a roll of the dice. The important thing is that whenever we're at node X, the probability that we'll move to another node Y must be fixed -- it can't change over time or depend on past history (the latter condition is more important than the former). Typical questions that we might then ask are: if we let the chain run for an arbitrarily long time, will the probability associated with landing on a particular node settle down? If so, how quickly?
Notice that I've changed something fundamental as soon as I start talking about probabilities. If I watch a foolhardy frog hopping around a major intersection, I'll only see the frog at one corner at a time -- barring gruesome accidents, of course. But if I look at the frog's behavior over long time spans, or if I look at the aggregate behavior of a bushel-basket full of frogs that I picked up at Ramses Fish Market and House of Plaques, then I can talk about a distribution over all the corners. That is, I've gone from a discrete value to something continuous.
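To make that hop from a single trajectory to a distribution concrete, here's a minimal Python sketch. The four corners and the badly weighted die are entirely made up for illustration: it follows one frog for many hops and tallies the fraction of time spent at each corner.

```python
import random

# Hypothetical four-corner intersection; row i lists the probabilities of
# hopping from corner i to corners 0..3 (each row sums to 1).
P = [
    [0.1, 0.5, 0.2, 0.2],
    [0.3, 0.1, 0.3, 0.3],
    [0.25, 0.25, 0.25, 0.25],
    [0.4, 0.2, 0.2, 0.2],
]

def hop(state):
    """Pick the next corner according to the badly weighted die for `state`."""
    return random.choices(range(len(P)), weights=P[state])[0]

# Follow one frog for many hops and tally where it spends its time.
counts = [0] * len(P)
state = 0
for _ in range(100_000):
    state = hop(state)
    counts[state] += 1

print([c / sum(counts) for c in counts])  # long-run fraction of time at each corner
```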
Notice also that I've made a critical assumption, the one that makes a Markov chain a Markov chain: the transition probabilities do not depend on prior history. One of the profound things about physical systems is that, even for complicated systems, this is often a good assumption! It's almost a scientific article of faith that, underneath it all, the universe plays by a fixed set of rules. When a system does appear to have memory, it's often because we weren't looking closely enough to see some hidden variable. For instance, if I hit a ball with a bat, there's some probability (very small, in my case) that the bat will break. That probability changes over time, but that's not because of the passage of time per se; it's because of the accumulated damage to the bat. The probability that a bat with a particular amount of damage will break does not depend on time. The trick here, which applies in general, is that the memory of the system is encoded in additional variables. So I now have a bigger graph -- not just "bat broken" and "bat not broken," but also "bat slightly damaged" (or moderately damaged, or heavily damaged) -- but on this larger graph, I can call the process memoryless.
The problem with adding variables to represent memory is that they make the number of possible states in the system much larger. At the logical extreme, you could call the entire history part of the state -- but then you'd have an infinite state space, which is inconvenient. The remarkable thing about the natural world is that the history of very complex systems can often be summarized with such a small number of variables. Human memories are not so easily summarized, which is part of why statistical mechanics produces very useful descriptions of nature, but the analogous statistical mechanics of human behavior (psychohistory in Isaac Asimov's Foundation books) is never likely to be seen outside science fiction.
Back to the matter at hand: Markov chains. If I have a finite state space, I can represent a Markov chain by a matrix, which I will call A. The A_ij entry (row i, column j) represents the probability of moving to state i from state j. If p_k is a vector representing the probability that I'm in each state at the kth step, then I can write the probability at step k+1 as p_{k+1} = A p_k. The solution to this recurrence, starting at some initial probability distribution p_0, is p_k = A^k p_0.
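Here's a small numerical sketch of that recurrence, reusing the made-up frog probabilities from above but rearranged so that column j holds the probabilities of leaving state j (i.e., the transpose of the earlier row-oriented table):

```python
import numpy as np

# Column-stochastic transition matrix: A[i, j] = probability of moving
# to state i from state j (so each column sums to 1). Numbers are made up.
A = np.array([
    [0.1, 0.3, 0.25, 0.4],
    [0.5, 0.1, 0.25, 0.2],
    [0.2, 0.3, 0.25, 0.2],
    [0.2, 0.3, 0.25, 0.2],
])

p = np.array([1.0, 0.0, 0.0, 0.0])  # p_0: start at state 0 with certainty
for k in range(50):
    p = A @ p                        # p_{k+1} = A p_k
print(p)                             # approximately the stationary distribution
```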
What began as a question about randomly hopping frogs has turned into a question of linear algebra -- as many interesting statistical problems ultimately do. The trick to analyzing linear systems is to choose your variables right. In the obvious parameterization of the probability distribution, the entry (p_k)_i represents the probability of ending up in state i at step k. If we instead ask how certain linear combinations of the entries of p_k behave, we can solve an easier problem. This is the second time we've relaxed the problem this way: the first time, we went from a discrete set of states (the frog is at the northeast corner of Fourth and Broadway) to a continuous set of probability distributions. Now we're taking things a step further and looking at coordinates that don't even represent being in a particular state! But by choosing those coordinate vectors right, we can (usually) turn a messy matrix equation into a bunch of one-dimensional equations -- which are easy to analyze.
Huzzah for the Jordan canonical form! No, that's not the way Michael Jordan throws a basketball.
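A rough numerical illustration of that change of coordinates, assuming A happens to be diagonalizable (so I can dodge the Jordan form entirely): expand p_0 in the eigenvectors of A, and each coefficient then evolves on its own as c_i multiplied by the ith eigenvalue raised to the kth power.

```python
import numpy as np

# Same made-up column-stochastic matrix as above.
A = np.array([
    [0.1, 0.3, 0.25, 0.4],
    [0.5, 0.1, 0.25, 0.2],
    [0.2, 0.3, 0.25, 0.2],
    [0.2, 0.3, 0.25, 0.2],
])
p0 = np.array([1.0, 0.0, 0.0, 0.0])

# Eigendecomposition A V = V diag(lam); assumes A is diagonalizable.
lam, V = np.linalg.eig(A)

# Coordinates of p0 in the eigenvector basis: solve V c = p0.
c = np.linalg.solve(V, p0)

# Each coordinate evolves independently: p_k = sum_j (c_j * lam_j**k) v_j.
k = 50
pk = (V * (c * lam**k)).sum(axis=1)
print(pk.real)                          # matches A^k p0 computed directly
print(np.linalg.matrix_power(A, k) @ p0)
```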
At any rate, this sort of analysis tells you a number of things. In particular:
- There is at least one stationary distribution: that is, a p_s such that A p_s = p_s. If the graph of transitions is strongly connected (every state can be reached from every other state), then there is only one such distribution.
- If there is one stationary distribution, then in the long run (as k goes to infinity), there are only two things p_k can do. The first is that the chain can slosh back and forth between distributions. For instance, suppose I put a stone on a checkerboard and, at each step, allow myself to move up, down, left, or right by one square. If I start at a white square, then I'll always be on a white square every second move, and on black squares during the moves between. In the long run, I'll have two distributions for how often I visit each square on the board: one for how often I visit the white squares, and one for how often I visit the black squares. The second thing that p_k can do is to converge to p_s -- in my checkerboard example, this is what happens if at each step I allow some nonzero probability that the stone will stay where it is. A Markov chain which always converges to a single stationary distribution is called ergodic -- a simple enough idea with a fancy name.
- If the chain is ergodic, we can ask how long it takes to reach the stationary distribution. The answer is that you can characterize the speed of convergence in terms of differences between eigenvalues of A (usually between the eigenvalue 1 and the eigenvalue which is second-largest in magnitude). Yet another reason for people to care about eigenvalue calculations! (A small numerical sketch after this list works this out for a toy version of the checkerboard.)
- The finiteness of the chain is not only quantitatively important: it's also qualitatively important. If I allow an infinite number of states, then in general there does not need to be any stationary distribution! In a finite, ergodic Markov chain, the probability of landing in a particular state tends to diffuse from the initial distribution (this is not a false analogy: many diffusion equations can be derived by considering Markovian behavior of particles at the microscale). If the state space is infinite, then the probability can keep diffusing away forever, so that the chain visits ever more states with ever-decreasing probability. Much of what I've said generalizes to some continuous state spaces, but those Markov chains are characterized by a different sort of finiteness condition.
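As the promised concluding sketch, here is a two-state stand-in for the checkerboard (the matrices and the helper function are my own illustration, not anything canonical): a chain that deterministically swaps states sloshes forever, because it has an eigenvalue of -1 alongside the eigenvalue 1; mixing in a probability of staying put makes it ergodic, and the second-largest eigenvalue magnitude then governs how fast p_k approaches p_s.

```python
import numpy as np

def stationary_and_gap(A):
    """Stationary distribution (eigenvector for the eigenvalue closest to 1)
    and the largest magnitude among the remaining eigenvalues of a
    column-stochastic matrix A."""
    lam, V = np.linalg.eig(A)
    i_one = np.argmin(np.abs(lam - 1.0))   # locate the eigenvalue 1
    ps = np.real(V[:, i_one])
    ps = ps / ps.sum()                     # normalize so the entries sum to 1
    rest = np.delete(np.abs(lam), i_one)
    return ps, rest.max()

# "Checkerboard" in miniature: two states, and the stone must move each step.
slosh = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
# Lazy version: with probability 1/2 the stone stays put, which breaks the
# back-and-forth sloshing and makes the chain ergodic.
lazy = 0.5 * slosh + 0.5 * np.eye(2)

for name, A in [("slosh", slosh), ("lazy", lazy)]:
    ps, second = stationary_and_gap(A)
    print(name, "stationary:", ps, "second eigenvalue magnitude:", second)

# slosh: the second magnitude is 1.0, so p_k never settles (it oscillates);
# lazy:  the second magnitude is 0.0 here, so convergence is immediate;
# in general, the error shrinks roughly like (second magnitude)**k.
```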
- Currently drinking: Gen mai cha