This is part one of a planned five-part sequence. This article and
the next are on *network tomography*; then there will be two
articles on coverage testing and *code tomography*, a simple
application of the same ideas to a completely different problem; and
the fourth article will be on a physical tomography problem with the
same flavor (I have two in mind, but haven't yet decided which to
use). Yes, these are things I think about as part of my research; I
recently spent some time on a paper on a variation of today's topic,
which is why I'm thinking about it; and no, it's not going into the
thesis, which already has too much stuff in it. I'm writing this
because this stuff is a lot of fun, because it ultimately relies on
mathematical ideas that I think are very broadly accessible, and
because in this forum I can be a smart-ass without being taken to
task for informality. I'm also writing this because the combination
of jsMath and a little custom AWK script lets me embed mathematical
TeX directly into my HTML, which I think is very cool indeed. Shall
we begin?

A couple years ago, Yan Chen knocked on my door to ask a numerical
question. He was using the code from Numerical Recipes to compute
something (a singular value decomposition), and it wasn't working.
I told him that he certainly shouldn't be using Numerical Recipes --
there are better codes out there -- and asked why he needed that
particular decomposition, anyhow. He told me, and thus started an
interesting collaboration.

Yan's problem was one of *network tomography*. Suppose you
have a bunch of computers attached to the Internet, and they're all
cooperating on something. From time to time, there will be links in
the network that fail, or get so congested that almost no packets
get through them. How should the computers arrange communications
between themselves in order to avoid these congested links?

The obvious thing to do is to measure every possible path between
hosts in the network. But if there are n hosts in the network,
then there will be n(n-1) total paths. That quickly adds up: if
you had a hundred cooperating computer, you'd need to monitor 9900
paths. Is there a better way?

A path through the network consists of a sequence of links. To
successfully traverse a path, a packet needs to successfully
traverse each link in the path in turn. We write the probability
that a packet will make it through the path as P_{\mathrm{path}},
and the probability that it will go through network link j as
P_j. Then, assuming the link packet losses are independent events
(a good assumption in general),

P_{\mathrm{path}} = \prod_{j \in \mathrm{path}} P_{j}

Because sums are even easier than products, let me take logarithms
of the above equation. If I define

b_{\mathrm{path}} =
-\log(P_{\mathrm{path}}) and

x_j = -\log(P_{j}), then

b_{\mathrm{path}} = \sum_{j \in \mathrm{path}} x_j

Now let me consider all

n(n-1) paths simultaneously. If I define

b_i as the log loss rate on the

ith path, and define

G_{ij} to
be one if path

i uses link

j (zero otherwise), then I can write

b_i = \sum_j G_{ij} x_j

Ah! A linear system of equations -- one of my favorite things.
In matrix form, I would write

b = Gx.

Here's the cool thing: paths overlap each other, and that ultimately
leads to *linear dependencies*. You can see the effect for
yourself in a very simple network, shaped like a Y, with a router R
in the center and hosts A, B, and C connected to R. If I want to go
from A to B, I use the path AR+RB (I go from A to R, and then from R
to B). I can also write AR+RB in terms of the other end-to-end
paths in the system:

\mathrm{AR}+\mathrm{RB} = \left\{
\begin{array}{c}
+(\mathrm{CR}+\mathrm{RB}) \ -(\mathrm{CR}+\mathrm{RA}) \ +(\mathrm{BR}+\mathrm{RA}) \ -(\mathrm{BR}+\mathrm{RC}) \ +(\mathrm{AR}+\mathrm{RC})
\end{array}
\right\}

What this means is that the matrix

G which for this network is a
six-by-six matrix is

*rank-deficient* (in this case the rank
of

G is 5).

Let k be the rank of G. If I pick n hosts in a typical
(large) computer network, then k is typically much less than
n(n-1); in fact, it typically seems to grow much more slowly than
O(n^2). I can prove this for unrealistically simple networks, and
I can argue about it heuristically for real networks, but I don't
really have anything resembling a good proof. I don't even have
anything resembling a clearly stated *conjecture*: I'm not
sure how to legitimately employ the usual asymptotic arguments
involving the case n \rightarrow \infty, because
the network isn't infinite in extent. Nevertheless, I'll abuse
notation slightly and say that for intermediate

sizes of n
-- not too small, but not approaching the size of the network --
typically k seems to grow something like O(n \log n).

Fortunately, I don't have to understand all the details of why k
is small in order to take advantage of its smallness. First,
though, let me clear away a common piece of mental clutter: the
notion that in order to compute a sum, I need to know all the
summands.

Cleve Moler wrote an article on The World's Simplest Impossible
Problem

some time ago (I don't have the article with me, so I
may have bungled the title). The article began with a question,
something like I'm thinking of two numbers that sum to five; what
are they?

Of course, I can't answer that question, since the
first few words (I'm thinking of

) suggest that Cleve had a
*unique* solution in mind out of an infinite number of
possible solutions. But if he had then called those numbers a and
b and asked me what a + b + 5 was, I could easily answer
ten.

I don't know the values of a and b, but I do know a
*relationship* between them, and that's the mathematically
important thing.

Because G is rank-deficient, I can't, in general, use the
equations Gx = b to determine the link losses x_j from the path
losses b_i: the system has an infinite number of possible
solutions. Depending on your early training, this may be the cause
for consternation, excitement, or an outburst of fervent mysticism.
But I don't actually *need* the values of x_j for the
problem I described. Remember: I want to know the packet loss rates
along every end-to-end path between my n networked computers, but
I don't want to be bothered with measuring all those paths directly.
Clearly the information I want -- the vector b -- can be
determined by direct measurement! I just want to make the
measurement less expensive.

The key realization is that I can get all the information I might
desire from k measurements. If I can find a *basis* of k
independent paths -- paths that can't be written in terms of each
other -- then I can write *every end-to-end path* as a linear
combination of paths from that basis. This is exactly what I did
earlier when I mentioned the example of the Y-shaped network. So I
don't really need to worry about every path in the network; I only
need to worry about some basis of k paths. Let me suppose I can
find k such paths, putting aside for the moment the surprisingly
delicate question of how I accomplish this task in practice. I'll
put these paths together in a matrix G^M with one row in each
path, and I'll put the measurements of the log loss rates on those
paths into a vector b^M. Then I seek a solution -- *any*
solution -- to the system

G^M y = b^M

Computing a solution

y requires relatively few measurements; I
only need direct measurements of the

k loss rates in

b^M. But
because every row of

G can be written as a linear combination of
rows from

G^M, the solution

y automatically satisfies

Gy = b.
Which means that I can

*infer* the loss rates on all

n(n-1)
paths from only

k = O(n \log n) direct measurements.

Cool, no?

I've glossed over a number of technical mathematical and system
issues here:

- How does one measure the network topology to get the matrix G?
- Internet routing is dynamic; doesn't G change over time? If
it does, how can you efficiently update your solutions?
- How do you find k independent paths to measure?
- How do you make sure that one host in the network doesn't end up
doing a lion's share of the measurement work?
- What do you do with computers that want to leave or join the system?
- How does measurement noise affect the results?
- Once you have information about all the possible end-to-end loss
rates, how do use it to systematically build a reliable
communication system between the network hosts?

Some (not all) of these issues are treated in papers that Yan and I
have written; you can find links to papers and presentations from
either of our web pages.

**Next time:** What can we say about individual links?
**Currently drinking:** Coffee