This is part two in a planned five-part sequence on some interesting applications of linear algebra to tomography for computer networks, programs, and a yet-to-be-decided physical example. Today: understanding link properties from end-to-end measurements.
I'm going to start with the same setup that we had yesterday. We have n computers attached to the Internet, and they need to communicate in order to cooperate on something. We can measure the log loss rates on any of the n(n-1) end-to-end network paths; and with a mild assumption about independence of failures, we found that the log loss rate of a path is just the sum of the log loss rates for all the links on the path. We summarized the relation between the log loss rates for all the paths and for all the links with the compact equation Gx = b (there's a small numerical sketch after the definitions below), where:
- G_{ij} is one if path i uses link j, zero otherwise;
- b_i is the log loss rate for path i; and
- x_j is the log loss rate for link j.
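To make the notation concrete, here's a minimal numpy sketch. The three-host star topology and the loss numbers are invented for illustration; nothing here comes from a real measurement system.

```python
import numpy as np

# Toy topology (invented for illustration): three hosts, each joined to a
# central switch by an "up" link (host -> switch) and a "down" link
# (switch -> host).  The path from host i to host j uses up-link i and
# down-link j, so there are n(n-1) = 6 paths over 2n = 6 directed links.
n = 3
paths = [(i, j) for i in range(n) for j in range(n) if i != j]

G = np.zeros((len(paths), 2 * n))
for row, (i, j) in enumerate(paths):
    G[row, i] = 1      # up-link of the source host
    G[row, n + j] = 1  # down-link of the destination host

# Made-up link log loss rates: one lossy up-link, the rest nearly clean.
x = np.array([0.01, 0.50, 0.02, 0.01, 0.02, 0.03])
b = G @ x  # each path's log loss rate is the sum over its links
print(b)
```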
The matrix G is typically very rank-deficient: there are many linear dependencies among both its rows and its columns. In practice the rank k of G seems to grow roughly like O(n \log n), though we still only have a fuzzy understanding of the behavior of k as a function of n and of the overall network topology. Yesterday, I described how we could use the low rank of G to infer loss rates on all n(n-1) end-to-end network paths from measurements of only k end-to-end paths. Today, I'll describe a different problem: what can we tell about the loss rates of the individual links? This information can be useful if, for example, we would like to diagnose where there is a network problem, so that we can route around it, fix it, or complain loudly about it to someone who is contractually obligated to fix it for us.
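You can check the rank deficiency directly on the toy star example above: the 6-by-6 routing matrix has rank 2n - 1 = 5, so one path measurement is redundant and the six link values are not pinned down.

```python
import numpy as np

# Rebuild the toy star-topology routing matrix from the sketch above.
n = 3
paths = [(i, j) for i in range(n) for j in range(n) if i != j]
G = np.zeros((len(paths), 2 * n))
for row, (i, j) in enumerate(paths):
    G[row, i] = 1
    G[row, n + j] = 1

# Six paths over six links, but the rows are linearly dependent:
# the rank is only 2n - 1 = 5.  One path measurement is redundant,
# and the six link values are not uniquely determined by Gx = b.
print(np.linalg.matrix_rank(G))  # 5
```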
The answer, as it turns out, is that we can't tell a whole lot about the individual links just using the equations I've written so far. The problem can be illustrated by a simple example: if I know c + d = 5, can I find c? No, c could be anything! If you give me d, I can find c, and vice-versa; but without more information, I cannot say anything more about c or d. Similarly, I know a lot about sums of link properties, but the linear system that I wrote down generally provides too little information for me to find the individual link properties. The rank deficiency that was so useful for yesterday's monitoring application is now getting in our way.
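Here is the same c + d = 5 example in matrix terms, a hypothetical two-variable sketch: any multiple of a null-space vector can be added to one solution to get another, so the data alone cannot single out c or d.

```python
import numpy as np

# The toy system c + d = 5 in matrix form: one equation, two unknowns.
A = np.array([[1.0, 1.0]])
rhs = np.array([5.0])

sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)  # minimum-norm solution: [2.5, 2.5]
null_vec = np.array([1.0, -1.0])               # spans the null space of A

# Adding any multiple of the null-space vector gives another valid solution.
for t in (-2.0, 0.0, 3.0):
    candidate = sol + t * null_vec
    print(candidate, A @ candidate)  # the product is always [5.]
```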
How can we get the extra information we need to make progress? Well, there is a natural set of inequality constraints which I have so far ignored. Remember from yesterday that I actually described the link loss rates using negative log probabilities: if p_j is the probability that a packet is dropped on link j, then

x_j = -\log(1 - p_j) \geq 0.
So while the equations Gx = b don't generally provide sufficient information to determine the individual link losses x_j, I can get possibly-useful bounds on x_j because I know that x \geq 0. For example, I know that the log loss rate on any network path can be no smaller than the maximum log loss rate over all links used by that path; or, put the other way around, I know that the loss rate of any link is bounded above by the loss rates of all paths that use the link. So if path i uses link j (G_{ij} = 1), then I know x_j \leq b_i. But here's a remarkable thing about the Internet: most of the time, the packets you send make it through intact! Most links and paths in the network lose few packets; only a few links, and the paths that go through them, suffer from excessive packet loss. So even with this simple bound, we can show that many links in the network have very low loss rates.
The tightest possible upper bound for x_j based on what I've discussed so far is x_j \leq x_j^{\mathrm{max}}, where

x_j^{\mathrm{max}} = \min \{ b_i : G_{ij} = 1 \}.
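In code, this bound is just a minimum over the paths covering each link. Here is a sketch; the helper name and the two-path example are mine, invented for illustration.

```python
import numpy as np

def link_upper_bounds(G, b):
    """For each link j, compute x_j^max = min { b_i : G_ij = 1 },
    the tightest bound obtainable from one path at a time.
    Links not covered by any measured path stay unbounded (inf)."""
    bounds = np.full(G.shape[1], np.inf)
    for j in range(G.shape[1]):
        covering = G[:, j] > 0
        if covering.any():
            bounds[j] = b[covering].min()
    return bounds

# Tiny made-up example: two paths over three links.  The shared middle
# link inherits the small bound from the nearly loss-free first path.
G = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([0.03, 0.40])
print(link_upper_bounds(G, b))  # [0.03 0.03 0.4 ]
```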
I like linear problems with special structure. They give me a sense of purpose.
This is where the joint work with Yan currently stands. The paper we're finishing now is essentially about how to compute good, practical bounds on link loss rates in order to find where there are problems in the network. I think this approach has a lot of promise, though plenty of fine-tuning remains, both in the mathematical theory and in the implementation. There are alternatives, but those alternatives usually involve either measurements which are not end-to-end (and are hard to make reliably), or statistical models in which errors in the model assumptions can affect prediction accuracy.
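For the curious: one standard way to compute the tightest bound implied by all the constraints simultaneously, rather than one path at a time, is a small linear program per link. The sketch below (using scipy's linprog) is illustrative only and not necessarily the algorithm from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def lp_link_bound(G, b, j):
    """Tightest upper bound on x_j consistent with Gx = b and x >= 0,
    computed by maximizing x_j over the feasible set."""
    c = np.zeros(G.shape[1])
    c[j] = -1.0  # linprog minimizes, so minimize -x_j to maximize x_j
    result = linprog(c, A_eq=G, b_eq=b, bounds=(0, None))
    return -result.fun if result.success else np.inf

# Same two-path example as before; here the LP bounds happen to match
# the simple per-path minimum, though in general they can be tighter.
G = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([0.03, 0.40])
print([lp_link_bound(G, b, j) for j in range(3)])
```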
- Next time: What has tomography to do with software testing?
- Currently drinking: Black tea with vanilla