Saturday, July 23, 2005

In a Barrel

Searching for LISP-based implementations of different programming languages is a task akin to shooting a mixture of metaphors and pre-hatched chickens in a barrel.

Let's see who Google brings to this post based on that sentence.

I think I've written already that next semester I'll be the GSI (graduate student instructor -- a UC-ism for teaching assistant) for the local compilers class, CS 164. The plan is to implement everything in Common Lisp. That means that all the compiler stages (scanner, parser, optimizer, code generator) will be written in Lisp, as will be the virtual machine on which the compiled programs are to run. The language that we're compiling, though, is going to be a Java variant.

I'm looking forward to using Lisp for this class. The last time I used Lisp for very much was when I took an artificial intelligence course as an undergraduate, but I have more than a passing familiarity with it. Indeed, I think that someone who graduates with a degree in computer science and isn't familiar with Lisp and Forth has been robbed. These two languages are fundamental. The fact that they are so fundamental is actually the source of my current frustration, but let me return to that after I explain briefly what I mean when I say that these languages are fundamental.

Like human languages, programming languages have grammar. The grammmar of a programming language describes the types of constructions which make up acceptable programs. Like human languages, the grammar is far from the whole picture; beyond the syntax, or the structure of an utterance, there is the semantics, or the formal meaning of the utterance. The sentence The blue flu blogs at midnight is syntactically valid, but semantically questionable. Programming language syntax is usually described in BNF (Backus-Naur form -- you may insert your own pun on Backus-Bacchus here). It's easiest to describe the notation by example. Suppose I wanted to write down the syntax of simple arithmetic expressions involving plus, minus, times, and divide. I might write the grammar like this:

  expression := 
    addend |
    expression + addend |
    expression - addend
  addend :=
    multiplicand |
    addend * multiplicand |
    addend / multiplicand
  multiplicand :=
    number |
    variable |
    ( expression ) 

In words, what I've written is that an expression consists of a sums and differences of addends; an addend consists of products and quotients of multiplicands; and a multiplicand is a number, a variable, or a parenthesized expression. So if I wanted to decide whether the expression 1 + 2*(3+4) was valid, I'd have to see if I could describe it according to these rules. The description of the expression in terms of syntactic rules gives us a parse tree, which is sort of like a grade school sentence diagram, but more useful. Since I'm limited by HTML, let me use the following convention for writing the parse tree above: when I use a particular rule, I'll write the name of the thing I produce first, then the list of inputs to the rule; and I'll wrap the whole thing in parentheses to prevent confusion about where I am. In that case, the parse tree for 1 + 2*(3+4) is

  (expression
    (expression
      (addend (multiplicand (number 1))))
    +
    (addend
      (multiplicand (number 2))
      *
      (multiplicand
        oparen
        (expression
          (expression
            (addend (multiplicand (number 3))))
          +
          (addend (multiplicand (number 4))))
        cparen)))

I say that I write things this way because of the limitations of HTML, but even if I were a little more willing to draw sketches for you, I might decide to write the parse tree in the above way. Why? Because it's convenient for computer manipulation, and people have developed powerful tools for dealing with this sort of representation of expressions. It will not surprise you to learn that the above expression is a sort of proto-Lisp.

Of course, if I actually wanted to write the above expression in any dialect of Lisp that someone would use, it would look like

  (+ 1
    (* 2
       (+ 3 4)))

Nevertheless, the point remains: in essence, a Lisp programmer writes a direct representation of the syntax trees for his programs. And trees are things that computer scientists are happy manipulating as data, which means that Lisp programmers can easily write programs that analyze and rewrite programs. And a compiler is essentially a tool for analyzing and rewriting programs, so lots of compiler writers use Lisp, or something very Lisp-like, inside their compiler.

Forth is what you get by turning Lisp on its head, so that instead of writing (+ 1 1) you write 1 1 +. The Forth model will be familiar to anyone who has used an HP calculator or looked at the specs for the Java Virtual Machine.

The problem: since Lisp and Forth are so damn easy to parse, everyone writes parsers for them -- and Google isn't smart enough to distinguish writing about parsers written in Lisp for other languages from writing about parsers written in other languages for Lisp. The latter are ubiquitous, the former a little less so -- hence my problem.

Since my battery is running low, ruminations on the resolution to this problem will have to wait for another day.