Tuesday, July 05, 2005

Network Tomography

This is part two in a planned five-part sequence on some interesting applications of linear algebra to tomography for computer networks, programs, and a yet-to-be-decided physical example. Today: understanding link properties from end-to-end measurements.

I'm going to start with the same setup that we had yesterday. We have n computers attached to the Internet, and they need to communicate in order to cooperate on something. We can measure the log loss rates on any of the n(n-1) end-to-end network paths; and with a mild assumption about independence of failures, we found that the log loss rate of a path is just the sum of the log loss rates for all the links on the path. We summarized the relation between the log loss rates for all the paths and for all the links with the compact equation:

Gx = b
in which

  • G_{ij} is one if path i uses link j, zero otherwise;
  • b_i is the log loss rate for path i; and
  • x_j is the log loss rate for link j.
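To make this concrete, here is a minimal sketch in Python. The three-link, two-path topology and the loss probabilities are invented for illustration; the point is just the bookkeeping that turns per-link success probabilities into the equation Gx = b.

    import numpy as np

    # Invented toy topology: path 0 uses links 0 and 1;
    # path 1 uses links 1 and 2.
    G = np.array([[1, 1, 0],
                  [0, 1, 1]], dtype=float)

    # Made-up per-link probabilities of successful transmission.
    P = np.array([0.99, 0.90, 0.999])

    # Link log loss rates: x_j = -log(P_j).
    x = -np.log(P)

    # Independent link failures mean a path's success probability is
    # the product of its link probabilities, so log loss rates add:
    b = G @ x
    print(b)                 # path log loss rates
    print(np.exp(-b))        # path success probabilities

(Check: exp(-b) should reproduce the products 0.99 * 0.90 and 0.90 * 0.999.)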

The matrix G is typically very rank-deficient: many of its rows and columns are linearly dependent on one another. In practice, the rank k of G seems to grow roughly like O(n \log n), though we still have only a fuzzy understanding of how k behaves as a function of n and of the overall network topology. Yesterday, I described how we could use the low rank of G to infer loss rates on all n(n-1) end-to-end network paths from measurements of only k of those paths. Today, I'll describe a different problem: what can we tell about the loss rates of the individual links? This information can be useful if, for example, we would like to diagnose where there is a network problem, so that we can route around it, fix it, or complain loudly about it to someone who is contractually obligated to fix it for us.

The answer, as it turns out, is that we can't tell a whole lot about the individual links just using the equations I've written so far. The problem can be illustrated by a simple example: if I know c + d = 5, can I find c? No, c could be anything! If you give me d, I can find c, and vice-versa; but without more information, I cannot say anything more about c or d. Similarly, I know a lot about sums of link properties, but the linear system that I wrote down generally provides too little information for me to find the individual link properties. The rank deficiency that was so useful for yesterday's monitoring application is now getting in our way.
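The rank deficiency makes this non-uniqueness easy to see numerically, too. Reusing the toy G from above, here's a minimal sketch (the perturbation size 0.005 is arbitrary) showing that adding any null-space vector to x leaves the path measurements b unchanged:

    import numpy as np
    from scipy.linalg import null_space

    G = np.array([[1, 1, 0],
                  [0, 1, 1]], dtype=float)
    x = -np.log([0.99, 0.90, 0.999])    # "true" link log loss rates
    b = G @ x

    print(np.linalg.matrix_rank(G))     # 2, but there are 3 unknowns

    # Any null-space direction can be added to x without changing b,
    # so the individual links are not identifiable from b alone.
    N = null_space(G)                   # one direction: (1, -1, 1)/sqrt(3)
    x_alt = x + 0.005 * N[:, 0]
    print(np.allclose(G @ x_alt, b))    # True: same b, different links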

How can we get the extra information we need to make progress? Well, there is a natural set of inequality constraints which I have so far ignored. Remember from yesterday that I actually described the link loss rates using negative log probabilities:

x_j = -\log(P_j)
where P_j is the probability of successful transmission across link j. Since P_j is a probability, it lies between zero and one; therefore x_j is non-negative. More generally, I have x \geq 0 and b \geq 0, where the inequalities should be interpreted component by component. This tells us a lot! In the example where c + d = 5, if we know that c and d are non-negative, then we can immediately say that 0 \leq c, d \leq 5. We still can't get exact values for c and d without more information, but we can get upper and lower bounds, which is a major improvement.

So while the equations Gx = b don't generally provide enough information to determine the link loss rates x_j, I can get possibly-useful bounds on x_j because I know that x \geq 0. For example, the loss rate on any network path can be no smaller than the maximum loss rate over all links used by that path; or, put the other way around, the loss rate of any link is bounded above by the loss rate of every path that uses the link. So if path i uses link j (G_{ij} = 1), then I know x_j \leq b_i. But here's a remarkable thing about the Internet: most of the time, the packets you send make it through intact! Most links and paths in the network lose few packets; only a few links, and the paths that go through them, suffer from excessive packet loss. So even with this simple bound, we can show that many links in the network have very low loss rates.
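This simple bound is also cheap to compute: for each link j, take the minimum of b_i over the paths i that use it. A sketch, reusing the toy G and b from above:

    import numpy as np

    # Toy data matching the earlier sketch: G is paths-by-links.
    G = np.array([[1, 1, 0],
                  [0, 1, 1]], dtype=float)
    b = G @ -np.log([0.99, 0.90, 0.999])

    def simple_link_bounds(G, b):
        """Upper-bound each link's log loss rate by the smallest log
        loss rate among the paths that traverse it."""
        x_ub = np.full(G.shape[1], np.inf)
        for j in range(G.shape[1]):
            uses = G[:, j] > 0              # which paths use link j?
            if uses.any():
                x_ub[j] = b[uses].min()
        return x_ub

    print(simple_link_bounds(G, b))         # one upper bound per link

On a mostly healthy network, most entries of b are tiny, so most links immediately get tiny upper bounds.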

The tightest possible upper bound for x_i based on what I've discussed so far is x_i \leq x_i^{\mathrm{max}}, where

x_i^{\mathrm{max}} := \max\{ y_i : Gy = b, \; y \geq 0 \}
Similarly, I can get a lower bound on x_i by replacing the max with a min. The bound x_i^{\mathrm{max}} is defined by a maximization subject to linear equality and inequality constraints; that is, it's the solution to a linear programming problem. The theory of linear programming is well developed, so we might hope for a cheap algorithm that computes good bounds for every component of x, which is about the best we can hope for so far. Unfortunately, these linear programs can get unwieldy unless we use heuristics or exploit special structure in the system.
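In the generic case, each bound is a small linear program of its own: two per link, one for the max and one for the min. Here is a minimal sketch using the general-purpose solver scipy.optimize.linprog; this is the brute-force version, not a structure-exploiting approach:

    import numpy as np
    from scipy.optimize import linprog

    def lp_link_bounds(G, b):
        """Solve min/max y_i subject to Gy = b, y >= 0, for each link i.
        linprog minimizes, so the max is found by minimizing -y_i."""
        n = G.shape[1]
        lo, hi = np.zeros(n), np.zeros(n)
        for i in range(n):
            c = np.zeros(n)
            c[i] = 1.0
            lo[i] = linprog(c, A_eq=G, b_eq=b, bounds=(0, None)).fun
            hi[i] = -linprog(-c, A_eq=G, b_eq=b, bounds=(0, None)).fun
        return lo, hi

    G = np.array([[1, 1, 0],
                  [0, 1, 1]], dtype=float)
    b = G @ -np.log([0.99, 0.90, 0.999])
    print(lp_link_bounds(G, b))             # x_i^min and x_i^max per link

With n(n-1) paths and thousands of links, solving two dense linear programs per link this way quickly becomes expensive; hence the interest in heuristics and structure.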

I like linear problems with special structure. They give me a sense of purpose.

This is where the joint work with Yan currently stands. The paper we're finishing now is essentially about how to compute good, practical bounds on link loss rates in order to find where there are problems in the network. I think this approach has a lot of promise, though a lot of fine-tuning remains, both in the mathematical theory and in the implementation. There are alternatives, but they usually involve either measurements that are not end-to-end (and are hard to make reliably), or statistical models in which errors in the modeling assumptions can degrade prediction accuracy.

  • Next time: What has tomography to do with software testing?
  • Currently drinking: Black tea with vanilla