Friday, July 29, 2005

Turtles in Austin

Texans believe in air conditioning.

I should rephrase that: there exist at least some Texans who believe in air conditioning at least part of the time. Presumably one of those Texans is involved in the administration of the Austin convention center. It was hot and humid outside, but I wished more than once that I'd paid closer attention to the advice to bring a blazer.

When I opened my bag on the first evening, I found the expected pad of paper, program, and pen, but there was a twist. The pen is a particularly fancy four-in-one widget: at one end there is a pen and a PDA stylus, and at the other end there is a penlight and a laser pointer. I'll have to hold on to this one. Assuming I don't wear down the batteries while playing with the cat, I imagine it will come in useful for future presentations.

It was a productive meeting. I learned a lot, I met some people, I caught up with a surprising number of Berkeley and ex-Berkeley folks, and I scribbled my way through several more pages of potential research leads. And I still haven't finished clean copies of my notes on leads from the other three workshops! An embarrassment of riches isn't a cause for complaints, but I'm frustrated that I have such difficulty keeping up with myself.

I also spent some time exploring Austin with Tsuyoshi. He took the picture of the turtle that now graces the top of this page. When I think of Texas, turtles are usually not the first thing that comes to mind -- but it's cute, no?

I'm glad to be done for a while with the business of traveling and giving talks. Time to buckle down and get some real work done!

  • Currently drinking: Black tea with strawberry

Saturday, July 23, 2005

In a Barrel

Searching for LISP-based implementations of different programming languages is a task akin to shooting a mixture of metaphors and pre-hatched chickens in a barrel.

Let's see who Google brings to this post based on that sentence.

I think I've written already that next semester I'll be the GSI (graduate student instructor -- a UC-ism for teaching assistant) for the local compilers class, CS 164. The plan is to implement everything in Common Lisp. That means that all the compiler stages (scanner, parser, optimizer, code generator) will be written in Lisp, as will be the virtual machine on which the compiled programs are to run. The language that we're compiling, though, is going to be a Java variant.

I'm looking forward to using Lisp for this class. The last time I used Lisp for very much was when I took an artificial intelligence course as an undergraduate, but I have more than a passing familiarity with it. Indeed, I think that someone who graduates with a degree in computer science and isn't familiar with Lisp and Forth has been robbed. These two languages are fundamental. The fact that they are so fundamental is actually the source of my current frustration, but let me return to that after I explain briefly what I mean when I say that these languages are fundamental.

Like human languages, programming languages have grammar. The grammar of a programming language describes the types of constructions which make up acceptable programs. Like human languages, the grammar is far from the whole picture; beyond the syntax, or the structure of an utterance, there is the semantics, or the formal meaning of the utterance. The sentence The blue flu blogs at midnight is syntactically valid, but semantically questionable. Programming language syntax is usually described in BNF (Backus-Naur form -- you may insert your own pun on Backus-Bacchus here). It's easiest to describe the notation by example. Suppose I wanted to write down the syntax of simple arithmetic expressions involving plus, minus, times, and divide. I might write the grammar like this:

  expression := 
    addend |
    expression + addend |
    expression - addend
  addend :=
    multiplicand |
    addend * multiplicand |
    addend / multiplicand
  multiplicand :=
    number |
    variable |
    ( expression ) 

In words, what I've written is that an expression consists of sums and differences of addends; an addend consists of products and quotients of multiplicands; and a multiplicand is a number, a variable, or a parenthesized expression. So if I wanted to decide whether the expression 1 + 2*(3+4) was valid, I'd have to see if I could describe it according to these rules. The description of the expression in terms of syntactic rules gives us a parse tree, which is sort of like a grade school sentence diagram, but more useful. Since I'm limited by HTML, let me use the following convention for writing down parse trees: when I use a particular rule, I'll write the name of the thing I produce first, then the list of inputs to the rule; and I'll wrap the whole thing in parentheses to prevent confusion about where I am. In that case, the parse tree for 1 + 2*(3+4) is

  (expression
    (expression
      (addend (multiplicand (number 1))))
    +
    (addend
      (multiplicand (number 2))
      *
      (multiplicand
        oparen
        (expression
          (expression
            (addend (multiplicand (number 3))))
          +
          (addend (multiplicand (number 4))))
        cparen)))

I say that I write things this way because of the limitations of HTML, but even if I were a little more willing to draw sketches for you, I might decide to write the parse tree in the above way. Why? Because it's convenient for computer manipulation, and people have developed powerful tools for dealing with this sort of representation of expressions. It will not surprise you to learn that the above expression is a sort of proto-Lisp.

Of course, if I actually wanted to write the above expression in any dialect of Lisp that someone would use, it would look like

  (+ 1
    (* 2
       (+ 3 4)))

Nevertheless, the point remains: in essence, a Lisp programmer writes a direct representation of the syntax trees for his programs. And trees are things that computer scientists are happy manipulating as data, which means that Lisp programmers can easily write programs that analyze and rewrite programs. And a compiler is essentially a tool for analyzing and rewriting programs, so lots of compiler writers use Lisp, or something very Lisp-like, inside their compiler.

Forth is what you get by turning Lisp on its head, so that instead of writing (+ 1 1) you write 1 1 +. The Forth model will be familiar to anyone who has used an HP calculator or looked at the specs for the Java Virtual Machine.

The problem: since Lisp and Forth are so damn easy to parse, everyone writes parsers for them -- and Google isn't smart enough to distinguish writing about parsers written in Lisp for other languages from writing about parsers written in other languages for Lisp. The latter are ubiquitous, the former a little less so -- hence my problem.

Since my battery is running low, ruminations on the resolution to this problem will have to wait for another day.

Friday, July 22, 2005

Its Own Worst Enemy

I like The Best Software Writing I (ed. Joel Spolsky) more the more of it I read. One of the things I like about it is that most of the selected pieces are available in electronic form, so that I can point people to them without saying Oh, but you need to buy the book to see it. That said, this is a book worth checking out, even if you're not a software developer.

I just finished reading Clay Shirky's essay A Group Is Its Own Worst Enemy, a thoughtful analysis of certain aspects of social software design. I may have more comments on this later, but right now I'll just say go read it! I will also mention that I hope Pete has the opportunity to read and respond to this one, since I suspect his comments on this essay would be far more entertaining than my own.

Thursday, July 21, 2005

Pen trip

I will fly to Austin on Sunday so that I can give a talk on MEMS resonator simulation on Monday morning at the computational mechanics conference. I'll be back in Berkeley come Friday, and that will be it for my summer travels. I'd originally planned to go home at the start of August, but I'm tired of travel, and want to spend the rest of my summer enjoying my own bed and my own cooking. And, of course, writing.

To my family and Maryland friends: I hope I'll get to see you -- at least some of you -- for Christmas or the New Year.

On an unrelated note, I am now the proud owner of one new fountain pen and one rehabilitated fountain pen. And I'm cackling over these things, I am. I've had the older pen for quite a while, but for the past couple years I've not used it. It has a fine nib, and it gets quirky when it's not very clean; unfortunately, it's hard to clean the nib when you only use cartridge ink. I got a converter for it and spent a quarter of an hour using the converter to flush water through the nib; at the end of that time, the water was running clear and I was able to write again. Hooray!

The new pen also has a cartridge converter. I filled both pens with blue ink. One day, I hope to be able to perform this task without turning my hands blue.

Tuesday, July 19, 2005

PowerBook accelerometers

The Sudden Motion Sensor (SMS) on the PowerBook is used to turn off the hard drive in the case of sudden accelerations (like those that occur when you drop the computer, for instance). The ThinkPads have a similar system. The device is almost certainly using a microelectromechanical system (MEMS) sensor, probably one of the ADXL series from Analog Devices -- the same types of sensors that are probably used to control the air bags in your car.

It's possible to read the accelerometers at a user level, too. In general, this seems to be of dubious utility, unless you want to tilt your computer around as a way to control something. But I think it's utterly cool that I can download a little applet that will, in real time, show the orientation of my laptop based on the accelerometer readings. This demo is going into my where are MEMS used? slide for my talk on Monday, and probably for several other talks after that.

Monday, July 18, 2005

On Anonymity

Have you seen the old New Yorker cartoon in which two dogs are sitting at a computer, and one is telling the other On the Internet, nobody knows you're a dog?

Hmm. Let me share some reasons why online anonymity is a little more complicated than that.

First, let's dispense with the obvious. If you write a large body of information drawn from your personal experiences, then anyone reading with a reasonably analytical turn of mind will be able to deduce a great deal about you (unless you're deliberately deceptive). Gender, profession, hobbies, and location are the obvious things. This information may give a reader a shrewd guess as to who you are; if not, it may at least narrow you down to a small group of people.

Now, let's discuss the more interesting piece: personal style.

If you are reading this as an academic, you will appreciate that writing style, together with a few contextual clues, is often enough to render blind review a polite fiction. You send me a review in which you say not enough attention is paid to the following references and list twelve references with a common co-author; I may well guess who you are. Or I write a review of your paper after we have communicated verbally or in writing; you may well recognize me simply from the way that I choose my words. Blind review is really only hazy review, but that's fine. I may still have doubts about my guess, as might you. Similarly, if I were to read an anonymous blog written by someone I knew, I might well be able to guess that person's identity, but it would be hard for me to be sure.

What is it that you identify as a particular writer's style? My guess is that a large fraction of what you identify as style has to do with statistical patterns in language use. The most obvious things involve sentence length and complexity, the sort of metrics that lead one to declare that some author writes at an nth grade level. But many of us have our own personal catch-phrases, too. In music, people have discovered that by looking at the probabilities associated with very short sequences of notes, one can distinguish the music of Bach, say, from the music of Mozart. On Amazon, you may have noticed the introduction of Statistically Improbable Phrases (SIPs) on the book description pages. Many of the best spam filters work by training a classifier to distinguish spam from legitimate mail on the basis of automatically-determined statistical properties in the text. A blog is a large volume of text, which is exactly what's needed for such statistical text classifiers; given the text of a blog and a set of texts (e-mail messages or class papers, perhaps) written by possible authors of said blog, I dare say it would take me about a week to put together a good tool to decide who -- if any -- of my suspects was actually responsible for the writing. Given that there are people out there who know much more about this sort of thing than I do, I would be surprised to find that such a tool does not already exist. And I haven't even said anything about tracing the link structure!
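For concreteness, here is roughly what such a fingerprint comparison might look like in MATLAB (my usual tool). This is only a sketch: it uses single-word frequencies rather than the two- or three-word sequences I described, and anon_txt and cand_txt are hypothetical stand-ins for the anonymous text and a candidate author's writing.

  % Cosine similarity of word-frequency fingerprints (crude, but illustrative).
  tokenize = @(s) regexp(lower(s), '[a-z]+', 'match');
  wa = tokenize(anon_txt);            % words in the anonymous text
  wc = tokenize(cand_txt);            % words in the candidate's text
  vocab = unique([wa, wc]);           % combined vocabulary
  fa = zeros(numel(vocab), 1);
  fc = zeros(numel(vocab), 1);
  for i = 1:numel(vocab)
    fa(i) = sum(strcmp(wa, vocab{i}));
    fc(i) = sum(strcmp(wc, vocab{i}));
  end
  similarity = (fa' * fc) / (norm(fa) * norm(fc))   % near 1 => similar usage

A serious tool would use short word sequences and a properly trained classifier rather than raw word counts, but the flavor is the same.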

I think it would be interesting, actually, to try to build classifiers in this way to see if there are any two distinct authors who write in such a similar manner that it's impossible to distinguish their writing based only on the probabilities of two- or three-word sequences. Or could such a classifier be trained to automatically recognize regional dialects, identifying from what part of the US a particular writer might hail? How reliable would such a classification be? But I digress.

I suspect that most people would be shocked at how easily their privacy and anonymity could be removed if anyone cared to take the effort (though anyone who has stood in line at a grocery store knows full well that the private lives of the rich and the famous are often anything but private). But usually nobody cares to make the effort. If you wish to be anonymous, the polite thing for me to do when I recognize you is to pretend that I know nothing. But this assumes that I've paid enough attention to recognize you in the first place. It's a little like an inexpensive lock: a deterrent that tells most people that it would take some bother to steal the protected item, and tells the few people who might be able to pick the locks with no difficulty that you'd really consider it rude for them to run off with your stuff.

  • Currently drinking: Green tea

Sunday, July 17, 2005

Reading list

I have not read the new Harry Potter book. Even if I had, I doubt I'd have much to say about it. Good book reviews either alert the reader to a book he may not otherwise have noticed or chosen to read, or they put the book under review into a wider perspective, or they become a sort of essay draped on the structure of the contents of a book. For a book so heralded as Rowling's installment, there is no chance I would be alerting you to something you might not have known about; and given the recent state of things here, any more detailed essay I would write would be either terse or polemical.

That said, here are some books that I've either recently finished, recently acquired, or recently re-examined:

  • Providence of a Sparrow, Chris Chester.

    My brother Scott gave me this book as a Christmas gift. I read the first couple chapters, but got distracted well before I finished. The book was buried in my shelves, and so it was a pleasant surprise to find it again recently, just before I traveled. Chester's memoir revolves around B, an orphaned house sparrow that he adopted, but it contains much more. Like Sterling North's Rascal, Providence of a Sparrow is about the details of an everyday life in which a not-so-everyday animal plays an important role. The incidents Chester describes are entertaining, and the writing is full of bits of humor and lyric. I recommend this one.

  • It Must Be Beautiful, ed. Graham Farmelo.

    I already reviewed this one. Together with Providence of a Sparrow, it was my airplane reading for the most recent trip.

  • University, Inc: The Corporate Corruption of American Higher Education, Jennifer Washburn.

    I heard a description of this book on an NPR program -- was it an interview on Fresh Air? -- and was sufficiently intrigued that I picked it up. I have not read it yet, but it's in the short queue.

  • The Meaning of It All, Richard Feynman.

    Based on a three-lecture series Feynman gave in 1963, this book is characteristically Feynman: enthusiastic, entertaining, interesting, and sharp. Let me quote two paragraphs that seem particularly relevant:

    Admitting that we do not know, and maintaining perpetually the attitude that we do not know the direction necessarily to go, permit a possibility of alteration of thinking, of new contributions and new discoveries for the problem of developing a way to do what we want ultimately, even when we do not know what we want.

    Looking back at the worst times, it always seems that they were times in which there were people who believed with absolute faith and absolute dogmatism in something. And they were so serious in this matter that they insisted that the rest of the world agree with them. And then they would do things that were directly inconsistent with their own beliefs in order to maintain that what they said was true.

  • The Best Software Writing I, ed. Joel Spolsky.

    The introduction to the book makes the point perfectly:

    The software development world desperately needs better writing. If I have to read another 2000 page book about some class library written by 16 separate people in broken ESL, I’m going to flip out. If I see another hardback book about object oriented models written with dense faux-academic pretentiousness, I’m not going to shelve it any more in the Fog Creek library: it’s going right in the recycle bin. If I have to read another spirited attack on Microsoft’s buggy code by an enthusiastic nine year old Trekkie on Slashdot, I might just poke my eyes out with a sharpened pencil. Stop it, stop it, stop it!

    Damn straight.

    This book, like many other well-written engineering books, is accessible to a broader audience than just professionals. Based on what I've read so far, I'd recommend this book to most of my friends, at least at the level of you should stand in the aisle at the local book store and flip through this book. The same holds for Spolsky's earlier book, Joel on Software.

  • ANSI Common Lisp, Paul Graham.

    I've had this book since I took artificial intelligence as an undergrad (eight years?). I'm re-reading it now in order to re-load my Lisp programming skills into my brain, since this is the implementation language for the compilers course for which I'm the TA in the fall. I'm impressed all over again. It's good.

  • Spectral/hp Element Methods for CFD, Karniadakis and Sherwin.

    I picked up the latest edition at the SIAM meeting. Now I can return the library's copy, which I've had out on loan for a year or so. Obviously, this is a specialist book; however, if you're interested in spectral elements, by all means grab a copy (if only from your library). There's a lot of good solid technical information there, and much of it is hard to find in other texts (Trefethen's excellent Spectral Methods in MATLAB and Boyd's thorough and entertaining Chebyshev and Fourier Spectral Methods both focus on pure spectral methods, and give relatively little emphasis to the hybrid spectral element methods which are necessary to effectively apply spectral ideas to the complicated geometries found in typical engineering applications).

  • Spectra and Pseudospectra: The Behavior of Non-Normal Matrices and Operators, Trefethen and Embree.

    It's out! It's out! I don't have my copy yet, but I made the order at the SIAM meeting (conference discounts are a Good Thing). I've been looking forward to this book for some time now.

Wednesday, July 13, 2005

From New Orleans

My talk went fine, and was far from the most technically interesting thing I've seen or heard while here. The weather is muggy, the food is good, the coffee and tea are respectively mediocre and dreadful. Tensor the Laptop has continued to function beautifully, even during the required night-before slide editing session. It's been fun, but I'm glad to head home tomorrow. And I'll be glad to finish my summer travels at the end of the month.

I picked up two tech books. I've waited patiently for a year for one of them to come out; the other is a revised and expanded edition of a text I've had on loan from the library for a year or so. Yes, they're both related to my research.

I finished It Must Be Beautiful (ed. Graham Farmelo). It's good reading. Some essays involved beautiful theories, others involved entertaining stories, still others involved both. None of the essays are technical, but neither are the presentations dumbed down. Let me tease you with a few of the essay titles and authors:

  • The Rediscovery of Gravity. Roger Penrose
  • Erotica, Aesthetics, and Schrodinger's Wave Equation. Arthur Miller
  • The Equations of Life. John Maynard Smith
  • The Best Possible Time to be Alive. Robert May

I enjoyed the book as much for the writing as for the material. The authors do not bludgeon the reader with halting prose, nor do they produce gee whiz pieces like those sometimes found in newspapers and popular magazines. I don't uniformly disapprove of gee-whiz pieces, mind you, but it does please me to see well-crafted compound-complex sentences.

Saturday, July 09, 2005

MEX and OS X

If it isn't clear by now, half the point of these last few posts is to record architectural quirks as I encounter them while re-building my software under OS X. This is useful for me (it's sometimes easier for me to search old blog entries than it is to search through my old research notebooks); it may or may not also be useful for the people who stumble in here on the guidance of a search engine.

Today's adventure: mixed language MEX (Matlab EXternal interface) files under OS X.

I usually build MEX files by calling a command line utility called mex, which is provided as part of the standard MATLAB installation. mex is a shell script that calls the compiler with flags saying where various MATLAB include files are, then calls the linker with all the options and libraries needed to make a shared library that MATLAB knows how to load.

Unfortunately, the MathWorks often doesn't set up their MEX configuration in the way that I might wish. On my own system, I can modify the build options to my heart's content, either by modifying the system defaults (in mexopts.sh) or by providing my own options using the -f flag. The problem with this is that, in general, other people will have the default options set if they try to download and compile my software. A natural way around this trouble is to only invoke the mex utility at the last moment, compiling as much as possible by invoking the compiler directly. And that is now what I do with FEAPMEX.

Before the most recent round of patches, the FEAPMEX build system used mex to compile everything that explicitly invoked any of the MATLAB helper functions. But when I tried to compile FEAPMEX on OS X, the system complained that many of the common block variables used by FEAP were multiply defined. Private external definitions, the linker called them. I did a quick search through the linker documentation for the Mach-O object format used by OS X, and found out what a private external symbol is (it seemed like a terrible oxymoron to me when I first saw it). And then I got confused, because it seemed there was no way I could get private external symbols.

And then I realized that the symbols that the linker complained about were only showing up in files that I was handling through mex and not compiling directly. Hmm.

So, lesson the first: if you are building a MEX file under OS X and you get errors involving private external definitions, try invoking the compiler directly to create the object files, and use mex only to invoke the linker. All you will need to do is to specify the location of the MATLAB include paths using the -I compiler command line option. Lesson the second: even if you have no errors in building your MEX file, this might be a good idea. When I recompiled, I found a bunch of places in which I'd forgotten to explicitly declare routines; usually the compiler would warn me about such things, but the mex script suppressed that warning. In one or two places, the lack of an explicit declaration wasn't just a bad idea, but clearly would lead to an error (under certain circumstances). For example, there was one place where a function that returned void (a subroutine) was used without an explicit declaration; that's not a good thing in C code, since implicitly-declared functions are assumed to return type int.

After I fixed the problem with the FEAPMEX build system and corrected the errors that the compiler helped me find, I was able to build the code and run the test suite with no issues. Building the standalone version of FEAP proved even more straightforward, though it took me two tries to discover that under OS X the history library (used in my enhanced version of the command line interface) is built into libreadline rather than being a separate library file.

It took me about an hour to install and compile FEAP and to install, fix, and compile FEAPMEX. And now I'm very pleased, not only because the ports went more smoothly than expected, but also because the code runs fast on this machine. Especially compared to my poor old ThinkPad.

Friday, July 08, 2005

Fixing things

HiQLab now compiles and runs on the PowerBook. The problem, as it turns out, had to do with a compiler issue in the Xcode version of gcc-4. Somehow the default destructor was getting bungled for classes that contained STL vectors. In a way, I'm glad that this happened, because it meant that I was forced to do something I'd intended to do for a while anyhow (default destructors have this creepy way of screwing up builds if you aren't careful). It also meant that I had the chance to check out some of the Xcode debugging tools.

It might be more accurate to write that I'm not glad that I had to make the changes now, but I'm glad that it took me so little time to figure out what changes to make. Besides the destructor problem, which is less a bug in my code than it is a compiler problem, I also found a place where I'd managed to deallocate two things out of order. I need to add a test case to my automated system so that I can check for that more carefully in the future.

I always have mixed feelings when I can find a bug quickly. On the one hand, computers are notoriously more exacting and literal-minded than people are. Consequently, I doubt I'll ever reach the state where I write bug-free code the first time through, except when writing the simplest programs. Given that I will make stupid mistakes, I rejoice in the fact that I can learn enough to recognize and rapidly compensate for certain classes of errors. But given that the whole name of the game is to offload as much mechanical work as possible onto the computer, I wonder if I could improve my tools so that such bugs don't plague me in the first place.

Yes, I know that by programming in a different language, I can get rid of an awful lot of resource management errors -- and that's part of why my front-end is in Lua, which is a garbage-collected language. As for the other stuff, I refuse to write matrix code in Java, and I refuse to rewrite large volumes of Fortran, so the Frankensteinian mixture of C++, Fortran, Lua, and MATLAB that I use for HiQLab still seems like a good engineering solution. Perhaps one day I'll change my mind and add in a few more languages.

On an unrelated note, I've also decided that my experiment with jsMath, while exciting, ultimately didn't work out as I'd planned. If you're interested in that handful of articles, you can read the equations in TeX form -- which is not such a terrible thing. The remainder of my planned five-part series will probably be written as PDF files.

It's alive!

I'm typing this on the new PowerBook. I had good timing -- the backlight on the Thinkpad simply wouldn't come on after I returned to the office this afternoon, and so it is attached to an external monitor. I'm in the process of transferring my files over to the Mac. It's going smoothly at the Mac end, less smoothly at the Thinkpad end. This is perhaps to be expected.

I'm impressed. Out of the box, things just worked. I didn't have to fiddle at all to get the networking to go; I didn't have to do any tricks to attach to the mail server; I didn't once dive into the /etc subtree. I'm still getting used to the general feel, and will probably do some things inefficiently for a while. On the other hand, I actually used the NeXT labs at UM for a while, and I've on-and-off run the GNUStep system on top of Linux, so it's not like the system setup is totally unfamiliar. All in all, there are only a few flies in the ointment, and they're pretty minor:

  • Why does OS X Tiger come with a non-graphical Emacs? I can download a nice flashy Emacs, but including only a terminal window version as the default seems perverse.
  • The development kit doesn't include any flavor of Fortran compiler. I understand why -- the GNU Fortran tools are in a transition phase where g77 is leaving, to be replaced by the F95-friendly gfortran. But it's still inconvenient.
  • I compiled HiQLab, and it failed to pass my system tests! I don't know what's going wrong, exactly, but I got my first MATLAB crash the first time I tried to run the MATLAB version of HiQLab. The standalone version didn't make it through the test suite, either -- but it made it part through in a really weird way. Compiling the MEX file was an enormous pain, but I knew it would be.
  • It seems strange to me that you'd drop the CD into the trash to eject it. Now, in this version of the OS the trash can turns into an eject button, so I have a reminder, but it's still not the metaphor I'd have chosen.
  • I can't see jsMath on my web pages! It gets filled in by blank space under Safari. Maybe the thing to do is to go back to Firefox, but I'm sad that my super-portable method of incorporating math into web pages is not so portable as I thought.
  • I've discovered during the process of the data transfer that along with all its other woes, the Thinkpad suffered some file system corruption. This has complicated the transfer.

I just wish I'd done this a little sooner so that I could have had more leisure to get switched over. I leave Monday for my next trip, and I really have things that I ought to do before then other than monkeying around with a new computer. Not more entertaining things, mind, just things that pose imminent threats that might eat me if I ignore them.

Wednesday, July 06, 2005

Hamming Talk

Courtesy Revolutionibus, I spent some time reading Hamming's talk on You and Your Research. There's a lot of good stuff in there -- and I think it's still good stuff even if you're not planning a career in research.

Code Tomography

This is part three in a planned five-part sequence on some interesting applications of linear algebra to tomography for computer networks, programs, and a yet-to-be-decided physical example. Today: how to get more utility from conventional code coverage tools by analyzing a matrix that traces out the different execution paths exercised by a test suite.

If you compile programs with GCC and don't know what code coverage is, look at the manual page for gcov. The introductory section is written in a much clearer style than I usually associate with the Unix man pages. Then come back and read on.

If you don't have access to the gcov man pages, or don't want to bother reading them, here's the short version. gcov is a coverage testing tool. You compile your program with certain options, so that when you run it, it writes out a bunch of trace information to a file. With this trace information, you can use gcov to produce an annotated listing of your program which says how many times each line has been executed. Software testers sometimes say that a test suite has complete coverage if every line of code in the program has been executed at least once.

Obtaining complete test coverage is difficult, particularly if the programmer conscientiously checks every status code returned by every call she makes to a library or to the system. How do you test that the program gives an appropriate error message when the system runs out of memory, or when the user decides to pull the disk out of the drive in the middle of a save operation? Furthermore, it is easy to construct obviously flawed programs which pass test suites with complete coverage; the suite exercises every line of code, but not every possible path through the code.

Paths through the code. Does this start to sound familiar?

Suppose that I write down traces of program executions as vectors of numbers: x_i is the number of times line i was executed in a particular run. If I have traces from many runs, then I can assemble them into a matrix G, where G_{ij} is the number of times line i was executed in run j. In linear algebraic terms, total coverage can be represented as the fact that Ge has no zero entries, where e is a vector of all ones.

Does that sound like a contrived way of expressing the idea of total coverage? It should! Total coverage is useful, but in some ways it's a very shallow metric of how good a job a test suite is doing. There are other, more natural, things that we can do with the matrix G which give us more subtle, and potentially more useful, information.
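In MATLAB, which is where I would naturally poke at such a matrix, the contrived phrasing above is only a couple of lines (assuming G has already been assembled from the coverage output):

  % Complete coverage: every line (row of G) executed at least once,
  % i.e. G*e has no zero entries, where e is a vector of ones.
  e = ones(size(G, 2), 1);
  complete_coverage = all(G * e > 0)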

First, consider a few statements that one might make about a program flow:

  • A is executed if and only if B is executed
  • In any given run, exactly one of C or D is executed (and only once); they cannot both occur
  • When E occurs, one of F or H must follow

These statements reflect certain invariant properties of the program flow. Unsurprisingly, the trace vectors reflect those invariants, too: they appear as relations between the rows of G, so that for all j we have

  • G_{Aj} = G_{Bj}
  • G_{Cj} + G_{Dj} = 1
  • G_{Ej} = G_{Fj} + G_{Hj}

By writing down the traces in the form of a matrix, and expressing invariant properties as linear dependencies among the rows, we open the door for automatic analysis. In particular, the first and the third invariants above correspond to left null vectors for G: if e_i is the vector which is zero everywhere except in component i, then I know that (e_A - e_B)^T G = 0 and (e_E - e_F - e_H)^T G = 0. The second invariant is a little trickier, but I can detect it if I artificially add a row of ones to G (a useful trick, since the case of mutually exclusive execution is pretty common). Now suppose G represents the traces of program runs in a test suite, where everything went fine, and x represents a program run which consistently crashes. If I have some vector v such that v^T G = 0 but v^T x \neq 0, that suggests that I should check whether there is some unusual interaction involving the statements corresponding to nonzero entries in v.
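Here is a toy MATLAB illustration of the first and third invariants. The little trace matrix below is made up for the example (rows correspond to lines A, B, E, F, H; columns to four runs); it is not the output of a real coverage run.

  G = [1 2 0 3;    % line A
       1 2 0 3;    % line B  (executed exactly when A is)
       2 1 1 0;    % line E
       1 1 0 0;    % line F
       1 0 1 0];   % line H  (each E is followed by an F or an H)
  vAB  = [1; -1; 0;  0;  0];   % candidate invariant: A iff B
  vEFH = [0;  0; 1; -1; -1];   % candidate invariant: E = F + H
  norm(vAB'  * G)              % 0 => the invariant holds on these traces
  norm(vEFH' * G)              % 0 => the invariant holds on these traces
  size(null(G'), 2)            % dimension of the left null space (here 2)
  % For mutual-exclusion invariants like G_C + G_D = 1, append a row of
  % ones to G before looking for left null vectors, as described above.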

Dependencies among the rows of G tell me something about invariants among the test case runs (which may or may not be invariants among all possible program runs). What about dependencies among the columns of G? These are just as interesting, because they say something about a deeper form of coverage. Suppose G represents a test suite with complete coverage, and x is a trace vector which is independent of the columns of G. Then even though the new run doesn't exercise any individual lines of code that weren't exercised by the test suite, it very likely exercises a path which wasn't previously exercised.
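That deeper notion of coverage is also just a rank computation. A quick sketch, with x a new trace vector as above:

  % Does the new trace lie in the span of the suite's existing traces?
  if rank([G, x]) > rank(G)
    disp('new trace is independent: it exercises a new mix of paths');
  else
    disp('new trace is a combination of paths the suite already took');
  end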

It's possible to vary the definition of G in order to shine light on different parts of the program structure. In the next installment, I will discuss some additional applications of execution path matrices to problems in software engineering.

Pitfalls in Computation

From the introduction to Pitfalls in Computation, or Why a Math Book Isn't Enough, by George Forsythe (Stanford CS Tech Report 147, 1970):

Why do students take mathematics in college and university? I see two reasons: (i) To learn the structure of mathematics itself, because they find it interesting. (ii) To apply mathematics to the solution of problems they expect to encounter in their own fields, whether it be engineering, physics, economics, or whatever.

I am sure that (ii) motivates far more students than (i). Moreover, most solutions of major mathematical problems involve the use of automatic digital computers. Hence we may justifiably ask what mathematics courses have to say about carrying out mathematical work on a computer. This question motivates my paper.

I am not in a mathematics department, and tend to moralize about them. If the reader prefers not to be lectured to, I invite him to ignore the preaching and just pay attention to the numerical phenomena for their own sake.

And from the conclusion:

It might be noted as a digression that, just as mathematics departments mainly ignore modern numerical analysis, so also the newly created computer science departments often give the subject little attention, since they are so busy with a variety of important nonnumerical fields. Thus numerical analysts remain a small corps of specialists whose greatest appreciation probably comes from the users of mathematical programs.

Students of mathematics are well equipped to read about numerical methods. Why should they repeat the classical blunders of generations past? Why aren't they informed of the existence of good numerical methods, and roughly where to find them?

Remembering that most students take mathematics in order to apply it on computers, I ask why mathematics courses shouldn't reflect a true awareness of how computing is done? Why shouldn't students demand in their mathematics courses a greater awareness of the points of contact of pure mathematics and its practice on a computer?

Of course, a mathematics instructor can shrug his shoulders and say that actual computing problems don't interest him, and suggest that his students contact a numerical analyst sometime. If the instructor actually says this out loud, it at least has the virtue that the students may realize immediately that the mathematics is not applicable directly, instead of having to discover it for themselves. It still sounds irresponsible to me. After all, society has been supporting mathematicians pretty well for the past 25 years -- not because mathematics is a beautiful art form, which it is -- but because mathematics is useful, which it also is. But this would seem to imply that a mathematician should convey some awareness of the main ways in which his subject is used.

On the other hand, a mathematics course cannot really include very much numerical analysis. Wilkinson's treatise on computing eigenvalues is 700 pages long, and can hardly be summarized in every course on linear algebra! As a practical matter, then, the mathematics instructor's main responsibility is to be aware of the main features of practical computing in the areas of his mathematics courses, and mention occasional points of contact, while giving his students important references to important algorithmic materials in other books.

If one just ignores the relations between mathematics and its important applications, I fear that an instructor is running the risk of being exposed by some technological chapter of the Students for Democratic Society for not being relevant, and that is a very nasty accusation nowadays. Why risk it?

I might not have put it that strongly, but it seems to me that many of the points Forsythe made in 1970, he might as easily have made today. For that matter, the numerical examples from Forsythe's paper are still worth reading today. He treats elementary problems (among them the solution to a quadratic equation and the solution of a two-by-two linear system), illuminating the difficulties that can occur and pointing out types of errors people still constantly make today. So much for progress.
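I won't reproduce Forsythe's numbers here, but one classic quadratic-equation pitfall (whether or not it is exactly the one he uses) is easy to replay in a few lines of MATLAB: when b^2 dwarfs 4ac, the textbook formula loses the small root to cancellation, while computing that root from the product of the roots does not.

  % Roots of x^2 + b*x + c with b^2 >> 4*a*c; the small root is about -c/b.
  a = 1;  b = 1e9;  c = 1;
  r_textbook = (-b + sqrt(b^2 - 4*a*c)) / (2*a)   % cancellation: comes out 0
  q = -0.5 * (b + sign(b)*sqrt(b^2 - 4*a*c));
  r_stable = c / q                                % about -1e-9, nearly exact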

  • Currently drinking: Gen mai cha

Tuesday, July 05, 2005

Apple of my i

I ordered a 15 inch PowerBook today. Sadly, it turns out that the right grant to pay for this is called my savings account; fortunately, I found out in time to get the new machine before my next trip. It should arrive in the next day or two. I'm excited. I'd better start thinking of a name...

I played on a Commodore 64 when I was a kid, then moved to PCs. One of my best friends in high school was a Mac devotee, but I never did use an Apple myself. This will be the first time in years that I'll (willingly) be using something other than Linux on my main work machine (the qualifier is important, since I've had occasion to use Sun workstations, the old Cray machine, HP machines, Alpha machines, and various other machines running UNIX -- but they're all things I used remotely). I'm curious whether it will live up to its reputation for ease of use. The underlying system is BSD-based, and the whole thing is POSIX compliant, so I have high hopes that compiling on the Mac won't turn into the nightmare it can sometimes become under Windows. The processor and (more importantly) the memory bus are both rather faster than those on my current machine, of course, so I'm looking forward to spending less time drumming my fingers waiting for compiles to finish and for short simulations to run. I'm also looking forward to having a machine with a monitor that consistently turns on, to not having a hard lock-up every two hours, and to having a power system that doesn't periodically zap me. (Yes, I've been getting electrical shocks from my laptop, though only rarely and only recently.)

It would also seem that I'm getting a coupon for an iPod Mini. I'm not sure I would have ever invested in an iPod otherwise, but I suppose I'll try it out.

Network Tomography

This is part two in a planned five-part sequence on some interesting applications of linear algebra to tomography for computer networks, programs, and a yet-to-be-decided physical example. Today: understanding link properties from end-to-end measurements.

I'm going to start with the same setup that we had yesterday. We have n computers attached to the Internet, and they need to communicate in order to cooperate on something. We can measure the log loss rates on any of the n(n-1) end-to-end network paths; and with a mild assumption about independence of failures, we found that the log loss rate of a path is just the sum of the log loss rates for all the links on the path. We summarized the relation between the log loss rates for all the paths and the log loss rates for all the links with the compact equation:

Gx = b
in which

  • G_{ij} is one if path i uses link j, zero otherwise;
  • b_i is the log loss rate for path i; and
  • x_j is the log loss rate for link j.

The matrix G is typically very rank-deficient: both rows and columns in the matrix are linearly dependent upon each other. In practice the rank k of G seems to grow sort of like O(n \log n), though we still only have a fuzzy understanding of the behavior of k as a function of n and of the overall network topology. Yesterday, I described how we could use the low rank of G to infer loss rates on all n(n-1) end-to-end network paths by measurements of only k end-to-end paths. Today, I'll describe a different problem: what can we tell about the loss rates of the individual links? This information can be useful if, for example, we would like to diagnose where there is a network problem, so that we can route around it, fix it, or complain loudly about it to someone who is contractually obligated to fix it for us.

The answer, as it turns out, is that we can't tell a whole lot about the individual links just using the equations I've written so far. The problem can be illustrated by a simple example: if I know c + d = 5, can I find c? No, c could be anything! If you give me d, I can find c, and vice-versa; but without more information, I cannot say anything more about c or d. Similarly, I know a lot about sums of link properties, but the linear system that I wrote down generally provides too little information for me to find the individual link properties. The rank deficiency that was so useful for yesterday's monitoring application is now getting in our way.

How can we get the extra information we need to make progress? Well, there is a natural set of inequality constraints which I have so far ignored. Remember from yesterday that I actually described the link loss rates using negative log probabilities:

x_j = -\log(P_j)
where P_j is the probability of successful transmission across link j. Since P_j is a probability, it lies between zero and one; therefore x_j is non-negative. More generally, I have x \geq 0 and b \geq 0, where the inequalities should be interpreted component-by-component. This tells us a lot! In the example where c+d = 5, if we know that c and d are non-negative, then we can immediately say that 0 \leq c,d \leq 5. We still can't get exact values for c and d without more information, but we can get upper and lower bounds, which is a major improvement.

So while the equations Gx = b don't generally provide sufficient information to determine the link loss x_i, I can get possibly useful bounds on x_i because I know that x \geq 0. For example, I know that the loss rate on any network path can be no smaller than the maximum loss rate over all links used by that path; or, put the other way around, I know that the loss rate of any link is bounded above by the loss rates of all paths that use the link. So if path i uses link j (G_{ij} = 1), then I know x_j \leq b_i. But here's a remarkable thing about the Internet: most of the time, the packets you send make it through intact! Most links and paths in the network lose few packets; only a few links, and the paths that go through them, suffer from excessive packet loss. So even with this simple bound, we can show that many links in the network have very low loss rates.
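In MATLAB terms, this simple bound is only a few lines, with G and b the routing matrix and the measured path log loss rates from before:

  % Simple upper bound: x_j is at most the smallest log loss rate among
  % the measured paths that use link j.
  nlinks = size(G, 2);
  xmax_simple = inf(nlinks, 1);
  for j = 1:nlinks
    paths_using_j = find(G(:, j) == 1);
    if ~isempty(paths_using_j)
      xmax_simple(j) = min(b(paths_using_j));
    end
  end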

The tightest possible upper bound for x_i based on what I've discussed so far is x_i \leq x_i^{\mathrm{max}}, where

x_i^{\mathrm{max}} := \max\{ y_i : Gy = b, \; y \geq 0 \}
Similarly, I can get a lower bound on x_i by replacing the max by a min. The bound x_i^{\mathrm{max}} is defined by a maximization involving linear equalities and inequalities; that is, it's the solution to a linear programming problem. The theory of linear programming is pretty well-developed, so we might hope that there's a cheap algorithm to compute good bounds for every component of x, which is about all we can hope for so far. Unfortunately, these linear programs can get unwieldy unless we use heuristics or special structures in the system.
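For what it's worth, the bound above maps directly onto an off-the-shelf solver. Here is a brute-force sketch using linprog from the Optimization Toolbox, with G, b, and a link index i as above; it does one solve per bound and exploits none of the structure I just alluded to.

  % Bounds on x_i:  maximize / minimize  y(i)  subject to  G*y = b,  y >= 0.
  nlinks = size(G, 2);
  ei = zeros(nlinks, 1);  ei(i) = 1;
  lb = zeros(nlinks, 1);                               % y >= 0
  [ymax, fmax] = linprog(-ei, [], [], G, b, lb, []);   % maximize  ei'*y
  xi_max = -fmax;
  [ymin, xi_min] = linprog(ei, [], [], G, b, lb, []);  % minimize  ei'*y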

I like linear problems with special structure. They give me a sense of purpose.

This is where the joint work with Yan currently stands. The paper which we're finishing now is essentially about how to compute good, practical bounds on link loss rates in order to find where there are problems in the network. I think this approach has a lot of promise, though a lot of fine-tuning remains, both in the mathematical theory and in the implementation. There are alternatives, but those alternatives usually involve either measurements which are not end-to-end (and are hard to make reliably); or statistical models in which errors in the model assumptions can affect prediction accuracy.

  • Next time: What has tomography to do with software testing?
  • Currently drinking: Black tea with vanilla

Monday, July 04, 2005

Network Monitoring

This is part one of a planned five-part sequence. This article and the next are on network tomography; then there will be two articles on coverage testing and code tomography, a simple application of the same ideas to a completely different problem; and the fifth article will be on a physical tomography problem with the same flavor (I have two in mind, but haven't yet decided which to use). Yes, these are things I think about as part of my research; I recently spent some time on a paper on a variation of today's topic, which is why I'm thinking about it; and no, it's not going into the thesis, which already has too much stuff in it. I'm writing this because this stuff is a lot of fun, because it ultimately relies on mathematical ideas that I think are very broadly accessible, and because in this forum I can be a smart-ass without being taken to task for informality. I'm also writing this because the combination of jsMath and a little custom AWK script lets me embed mathematical TeX directly into my HTML, which I think is very cool indeed. Shall we begin?

A couple years ago, Yan Chen knocked on my door to ask a numerical question. He was using the code from Numerical Recipes to compute something (a singular value decomposition), and it wasn't working. I told him that he certainly shouldn't be using Numerical Recipes -- there are better codes out there -- and asked why he needed that particular decomposition, anyhow. He told me, and thus started an interesting collaboration.

Yan's problem was one of network tomography. Suppose you have a bunch of computers attached to the Internet, and they're all cooperating on something. From time to time, there will be links in the network that fail, or get so congested that almost no packets get through them. How should the computers arrange communications between themselves in order to avoid these congested links?

The obvious thing to do is to measure every possible path between hosts in the network. But if there are n hosts in the network, then there will be n(n-1) total paths. That quickly adds up: if you had a hundred cooperating computers, you'd need to monitor 9900 paths. Is there a better way?

A path through the network consists of a sequence of links. To successfully traverse a path, a packet needs to successfully traverse each link in the path in turn. We write the probability that a packet will make it through the path as P_{\mathrm{path}}, and the probability that it will go through network link j as P_j. Then, assuming the link packet losses are independent events (a good assumption in general),

P_{\mathrm{path}} = \prod_{j \in \mathrm{path}} P_{j}
Because sums are even easier than products, let me take logarithms of the above equation. If I define b_{\mathrm{path}} = -\log(P_{\mathrm{path}}) and x_j = -\log(P_{j}), then
b_{\mathrm{path}} = \sum_{j \in \mathrm{path}} x_j
Now let me consider all n(n-1) paths simultaneously. If I define b_i as the log loss rate on the ith path, and define G_{ij} to be one if path i uses link j (zero otherwise), then I can write
b_i = \sum_j G_{ij} x_j
Ah! A linear system of equations -- one of my favorite things. In matrix form, I would write b = Gx.

Here's the cool thing: paths overlap each other, and that ultimately leads to linear dependencies. You can see the effect for yourself in a very simple network, shaped like a Y, with a router R in the center and hosts A, B, and C connected to R. If I want to go from A to B, I use the path AR+RB (I go from A to R, and then from R to B). I can also write AR+RB in terms of the other end-to-end paths in the system:

\mathrm{AR}+\mathrm{RB} = (\mathrm{CR}+\mathrm{RB}) - (\mathrm{CR}+\mathrm{RA}) + (\mathrm{BR}+\mathrm{RA}) - (\mathrm{BR}+\mathrm{RC}) + (\mathrm{AR}+\mathrm{RC})
What this means is that the matrix G, which for this network is a six-by-six matrix, is rank-deficient (in this case the rank of G is 5).
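If you want to see the rank deficiency for yourself, here is the routing matrix for the Y-shaped network in MATLAB, with the links ordered AR, RA, BR, RB, CR, RC and one row per directed end-to-end path:

  G = [1 0 0 1 0 0;   % A -> B  uses AR, RB
       1 0 0 0 0 1;   % A -> C  uses AR, RC
       0 1 1 0 0 0;   % B -> A  uses BR, RA
       0 0 1 0 0 1;   % B -> C  uses BR, RC
       0 1 0 0 1 0;   % C -> A  uses CR, RA
       0 0 0 1 1 0];  % C -> B  uses CR, RB
  rank(G)             % returns 5, not 6: one path is a combination of the rest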

Let k be the rank of G. If I pick n hosts in a typical (large) computer network, then k is typically much less than n(n-1); in fact, it typically seems to grow much more slowly than O(n^2). I can prove this for unrealistically simple networks, and I can argue about it heuristically for real networks, but I don't really have anything resembling a good proof. I don't even have anything resembling a clearly stated conjecture: I'm not sure how to legitimately employ the usual asymptotic arguments involving the case n \rightarrow \infty, because the network isn't infinite in extent. Nevertheless, I'll abuse notation slightly and say that for intermediate sizes of n -- not too small, but not approaching the size of the network -- typically k seems to grow something like O(n \log n).

Fortunately, I don't have to understand all the details of why k is small in order to take advantage of its smallness. First, though, let me clear away a common piece of mental clutter: the notion that in order to compute a sum, I need to know all the summands.

Cleve Moler wrote an article on The World's Simplest Impossible Problem some time ago (I don't have the article with me, so I may have bungled the title). The article began with a question, something like I'm thinking of two numbers that sum to five; what are they? Of course, I can't answer that question, since the first few words (I'm thinking of) suggest that Cleve had a unique solution in mind out of an infinite number of possible solutions. But if he had then called those numbers a and b and asked me what a + b + 5 was, I could easily answer ten. I don't know the values of a and b, but I do know a relationship between them, and that's the mathematically important thing.

Because G is rank-deficient, I can't, in general, use the equations Gx = b to determine the link losses x_j from the path losses b_i: the system has an infinite number of possible solutions. Depending on your early training, this may be the cause for consternation, excitement, or an outburst of fervent mysticism. But I don't actually need the values of x_j for the problem I described. Remember: I want to know the packet loss rates along every end-to-end path between my n networked computers, but I don't want to be bothered with measuring all those paths directly. Clearly the information I want -- the vector b -- can be determined by direct measurement! I just want to make the measurement less expensive.

The key realization is that I can get all the information I might desire from k measurements. If I can find a basis of k independent paths -- paths that can't be written in terms of each other -- then I can write every end-to-end path as a linear combination of paths from that basis. This is exactly what I did earlier when I mentioned the example of the Y-shaped network. So I don't really need to worry about every path in the network; I only need to worry about some basis of k paths. Let me suppose I can find k such paths, putting aside for the moment the surprisingly delicate question of how I accomplish this task in practice. I'll put these paths together in a matrix G^M with one row for each path, and I'll put the measurements of the log loss rates on those paths into a vector b^M. Then I seek a solution -- any solution -- to the system

G^M y = b^M
Computing a solution y requires relatively few measurements; I only need direct measurements of the k loss rates in b^M. But because every row of G can be written as a linear combination of rows from G^M -- and because the true link rates x satisfy both systems, so the same combinations relate the entries of b to the entries of b^M -- the solution y automatically satisfies Gy = b. Which means that I can infer the loss rates on all n(n-1) paths from only k = O(n \log n) direct measurements.
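Put together, the monitoring scheme fits in a few lines of MATLAB. In this sketch, measure_paths is a hypothetical stand-in for actually probing the selected paths, and I lean on column-pivoted QR to choose the independent rows; that is one reasonable choice among several, not necessarily the one we use in practice.

  % Pick k independent paths, measure only those, infer all the rest.
  k = rank(G);
  [Q, R, perm] = qr(G', 0);    % column-pivoted QR of G' orders the rows of G
  basis = perm(1:k);           % indices of k linearly independent paths
  GM = G(basis, :);            % measurement matrix (k rows)
  bM = measure_paths(basis);   % hypothetical: probe only these k paths
  y  = GM \ bM;                % any solution of GM*y = bM will do
  b_all = G * y;               % inferred log loss rates for every path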

Cool, no?

I've glossed over a number of technical mathematical and system issues here:

  1. How does one measure the network topology to get the matrix G?
  2. Internet routing is dynamic; doesn't G change over time? If it does, how can you efficiently update your solutions?
  3. How do you find k independent paths to measure?
  4. How do you make sure that one host in the network doesn't end up doing a lion's share of the measurement work?
  5. What do you do with computers that want to leave or join the system?
  6. How does measurement noise affect the results?
  7. Once you have information about all the possible end-to-end loss rates, how do you use it to systematically build a reliable communication system between the network hosts?

Some (not all) of these issues are treated in papers that Yan and I have written; you can find links to papers and presentations from either of our web pages.

  • Next time: What can we say about individual links?
  • Currently drinking: Coffee

Book Review

In flipping through the June 2004 issue of SIAM Review (vol 46, no 2), I happened to re-read Emma Previato's review of the CRC Concise Encyclopedia of Mathematics. If you have electronic access to the SIAM journals (through a university library, perhaps), go read it for yourself! It's a wonderful example of what a book review can be.

Sunday, July 03, 2005

Testing Euler

If you can see this, then jsMath is working as promised:

e^{\pi i} + 1 = 0

UPDATE: It works! At least, it works for me. If you're reading this and not seeing Euler's formula, please let me know. And expect to see more mathematics from me, now that I can actually typeset it.

Saturday, July 02, 2005

Miscellaneous

Go read Keith Devlin's article on Staying the Course. It has nothing to do with the rest of what I'm going to write, and a great deal to do with my recent day-to-day life.

I'm still enjoying the novelty of glasses that allow me to read the subtitles and authors from the spines of my books without getting out of my chair; I'm getting used to having a cell phone; and I sincerely hope that I'll be able to get the new laptop I've been wishing for before my next conference trip in mid-July. Yes, the plan is still to retire my 5+ year old Thinkpad and move over to a PowerBook. I'll feel nostalgic for the Thinkpad, but the power cable to the monitor is giving up the ghost (a fixable woe), the power distribution system has Issues (a less fixable woe), and some other unspecified but worrisome hardware problem has been steadily decreasing my mean time between crashes. I suspect one of the memory chips is failing -- I don't think the memory on these machines is parity-checked, unfortunately -- but it's possible that there's something else wrong. According to my officemate Jason, there was a mechanical design flaw in this particular model which made it particularly easy to warp the motherboard over time. This may be why most of my peers have not made their machines last nearly twice the design lifetime.

I have had a classical guitar riff playing intermittently in my head for the past three days, and it's starting to drive me batty. I really like that piece, but I can't remember the name or the composer. I'm pretty sure that either John Williams or Andre Segovia was playing when I heard it, though I'm not sure whether I heard it on the radio or CD.

If you want to read about electromechanical couplings, I highly recommend Maugin's books (but don't recall the Berkeley copy of Nonlinear Electromechanical Couplings until I'm done with it, okay?). It's really helpful to me to see a full set of nonlinear equations written down by someone who is happy to use the language and conventions of continuum mechanics. Whether you care about electromechanical couplings or not, check out It Must be Beautiful: Great Equations of Modern Science (ed. Graham Farmelo). I've only skimmed so far, but what I saw was good. I'm sure I'll have more to say about it after I've read.

It looks like I will be the TA for the local compilers course this fall. Yes, I know it will be a great deal of work; but I expect (or at least hope) that it will also be fun. Compiler implementations rely on some wonderfully clever ideas, both from the perspective of mathematics and of software engineering. I enjoyed the class when I took it as an undergrad; I enjoyed related classes on object-oriented language implementation; and I've enjoyed hacking on compilers and interpreters for domain-specific languages a couple times since then. Fun or not, I need to teach again if I want to graduate (and I do want to graduate). Since it was a graduate course, the parallel computing class for which I was a TA only counted for quarter-time, and consequently didn't do the trick.

HiQLab now has bindings for a variety of dense matrix operations, including basic operations like addition, subtraction, and multiplication by a scalar or by another matrix as well as linear solves and eigendecompositions. Unsurprisingly, these are things I do all the time in MATLAB and wished I could do in the standalone version of HiQLab as well. More surprisingly, the basic infrastructure is sufficiently flexible that it took about one work day to write and test all the features I just mentioned -- and it was a work day I probably would have spent on other things if I felt like I had any stamina worth mentioning for writing (or for developing more subtle code, for that matter). It's useful to sometimes have a day or two when I feel too dumb or bewildered to do anything but sit and code in a straight line for a while.

Before writing this, I was proofreading a paper draft a colleague sent. I've foolishly agreed to send comments by tomorrow morning. Going through one section in particular, I felt like I was getting utterly lost in the details -- which is a bad sign, since I'm pretty sure I drafted the original text on which that section was modeled! I quit reading when I realized that I was gritting my teeth, and my jaw was starting to hurt. I'm sure I'll have more of a sense of humor about it tomorrow.

Time to head home, I think.