Monday, March 07, 2005

Programming Languages

I use several programming languages day-to-day, and a few more languages occasionally. I have a passing familiarity with several more languages that I uses rarely, as are most students with recent computer science degrees. They are our basic tools. They are so basic that I sometimes assume everyone knows about them, until I get brought up short with a question. What's a lambda function, David? Uh...

Well. Let me pick a few of my favorite and least favorite languages and, at least for once, write about them rather than the things I write with them.

  • C

    UNIX is written in C. Windows is written in C. I'm having a hard time thinking of an operating system in current use which was not written in C. Okay, BeOS may have been written in C++, and NeXTStep was written in Objective-C; but those both are children of C, and neither operating system is exactly a major concern any more.

    The C language and the UNIX operating system co-evolved. The first C compiler ran on a DEC PDP-7 machine, and the language has been (perhaps uncharitably) characterized as high-level PDP assembly language. The language did not really depend on the machine, though; this seems like a no-brainer now, but at the time the UNIX operating system was remarkable because it was written at a sufficiently high level that it could be ported from machine to machine. The prevailing wisdom before that was that operating system kernels had to be written in machine-specific assembly language. The C language, though, worked well -- and works well -- as a systems programming language. C lets us program close to the metal, enough so that it's pretty easy to work out the instruction sequences the compiler will generate for a given piece of C code (it's not at all easy to predict what an optimizing compiler will generate, particularly on some modern architectures, but that inconvenient fact isn't really stressed in lower-division CS coursework). At the same time, there is enough stuff built into the language to support a more abstract programming style, one in which the programmer doesn't think that much about what the hardware is doing underneath. Also, C is a small language. It often takes students a long time to understand pointers, or to feel comfortable with some common C idioms, but it's possible to become acquianted with most of the syntax in the course of a couple days. The definitive reference -- The C Programming Language (Kernighan and Ritchie) -- takes scarcely more shelf space than the Elements of Style.

    By way of comparison, The C++ Programming Language (Stroustrop) is over 800 pages of smallish print. C is a small language; most of C's successors are not.

    C is alive and well. Standard C runs on almost anything, and it is still the language of choice for programming operating systems, embedded systems, and other low-level things. Some of us like it for higher-level libraries as well.

    I used C for the core of SUGAR 2.0 and 3.0, for CLAPACK, for ECFS (an Extended Cryptographic File System that I co-authored as a class project), and to write a kernel for my undergraduate OS class. I also use it for a variety of small side projects.

  • C++

    C++ is sometimes lumped together with C in a single breath: C/C++. However convenient it is to write, and however much some people try to think of C++ as a better C, the impression that the phrase C/C++ conveys is inaccurate: C and C++ are very different programming languages. It is true that C++ grew out of C, and that most C compilers now are also C++ compilers. But it's possible to identify a C programmer writing in C++ for the first time, or vice-versa. I really learned C++ before I learned C well, and I was immensely frustrated when I first switched. I know other people who have had the opposite experience, and I know some people who manage to really write C in C++. Reading their code is like reading prose written by a writer familiar with some European language -- Russian, perhaps -- with a good-but-imperfect command of English. Even if the syntax is perfect, the idioms get mixed up.

    C++ is sometimes called an object-oriented language. The idea of object-oriented design is that a program is partitioned into objects which contain both data and routines to act on the data. For example, in one program I might use several different dictionaries: a dictionary object would consist of a data structure which contained both keys and data -- words and associated definitions, perhaps, or employee numbers and employee records -- and associated operations like add an entry, delete an entry, or look up a key. The details of how the data is stored are hidden, since what's really important is what you can do with the data; this is the principle known as encapsulation. For some types of dictionary data structures, all that really matters is that I need some way to compare keys: the exact details of what goes into the keys are secondary. I can use the same code to delete an entry whether I have keys that are strings (words) or numbers: this is a type of polymorphism. Or I can take an existing dictionary type and -- without modifying the original code -- write a dictionary with new functions, like the ability to guess which word is meant when the user is a bad speller. I can still use my dictionary-with-spell-check in any place I use a dictionary; it's just that I've given it an extra new ability. This is an example of inheritance.

    C++ really doesn't force any sort of object-oriented thinking: it's just that it provides a syntax that makes it easier to express (some) object-oriented designs. It's possible to write object-oriented code in C, too; it's just that there is more supporting syntax in C++. Even more important than the supporting syntax is the fact that a C++ compiler will enforce the semantics of an object system: if you tell the compiler that you're going to pass a dictionary to some function, it will let you pass dictionaries and only dictionaries. If you try to pass around penguins in place of dictionaries, the compiler will yell at you; in the same situation, a C compiler will sometimes let you substitute a penguin for a dictionary. Then the penguin will bite you when you actually run the program. Birds do not take well to being force-fed employee records.

    C++ sometimes gets a bad reputation for being slow, clumsy, and complicated. It is complicated, but the charge that it's slow and clumsy is misdirected. C++ can actually be faster than C, if it's used properly; because of strong typing (the property that the compiler won't allow you to confuse a dictionary with a penguin), the optimizer has some opportunities that it wouldn't otherwise have. The trouble is that in order to write really fast code -- in any language -- it's necessary to have a good mental model of how the programming language structures are actually implemented. C++ supports much more complicated constructions than C does, and it takes more time and training to really understand not only the semantics of those added features, but also to understand the implementation well enough to know how much the features will cost. Similarly, object-oriented methods are a supplement to -- not a substitute for -- good design taste. It's possible to make a bad generalization which causes confusion and extra work with little benefit: for example, partial differential equations often behave very differently in 2D than in 3D, and a framework in which 2D and 3D problems are treated too similarly will lead to headaches. But some things about 2D and 3D problems really are the same, and should be treated with the same code. How do you tell whether to treat 2D or 3D the same or differently? It's a matter of good taste, and that's a hard thing to teach.

    I used C++ at my job at Tri-Analytics (and later at Digital Innovation) during my high school and college years. I have used it for various other projects since. The core of the HiQLab simulator is written in C++.

  • FORTRAN

    FORTRAN 77 was standardized in 1977, the same year that I was born. I sincerely hope that FORTRAN 77 dies out before I do. The language does support complex arithmetic -- something that C lacked until very recently -- and often it's easier for an optimizer to generate fast code from FORTRAN 77 than from C. But the language syntax is outmoded and irritating. Why should I have to format my program for a punch card? And why should I have to restrict the length of my function names to six characters? FORTRAN also has very weak type checking; think penguins and dictionaries again, except now I can get away with calling a penguin an integer, which even C will disallow. Also, FORTRAN 77 provides no facilities for dynamic memory allocation: if I make a dictionary, I have to make a maximum size at compile time. It's possible to get around this limitation (and doing so was one of my major contributions to the FORTRAN-language finite element code FEAP), but it takes a degree of cleverness and some mental gymnastics. FORTRAN 77 does not support recursion, does not have pass-by-value arguments, does not have pointers (or any good substitutes for pointers), and generally does not have the modern amenities.

    I like having the modern amenities. I enjoy running water and flush toilets; and I enjoy dynamic memory allocation and long function names. I do not generally think either preference is a weakness. That being said, there is a lot of well-designed numerical code in the world which is, alas, written in FORTRAN 77. And I am willing to use and modify such codes, which means that I am willing to continue to use FORTRAN 77.

    There are more recent dialects of FORTRAN than F77: we have F90, F95, FORTRAN 2000, and now FORTRAN 2003. I actually rather like these languages, though they are more complicated than the older FORTRAN dialects, and they show some of the symptoms of committee design. I'm not sure why these languages haven't caught on in the US, but so far they seem to be more popular in Europe. I see only one reason to continue using FORTRAN 77 rather than a more recent dialect: the free GNU compiler, which a lot of people use and like, currently only supports FORTRAN 77. But that will change with the next major release of the GNU compiler suite, which pleases me immensely.

    The FORTRAN language is dead. Long live the FORTRAN language!

  • MATLAB

    MATLAB is originally short for MATrix LABoratory, and the language is ubiquitous among those of us who do any sort of matrix computations. Not only is MATLAB often used to develop numerical algorithms, but MATLAB's syntax is often used in a purely mathematical context to express indexing operations which are common in numerical linear algebra. In addition to providing a concise syntax for linear algebra operations, MATLAB provides basic libraries for a variety of numerical tasks, including numerical quadrature (computing integrals), numerical integration (computing solutions to ordinary differential equations), and graphics and plotting in many flavors.

    Of course, there aren't that many numerical linear algebra specialists in the world, and most of us don't make that much money; the MathWorks makes most of its money from electronic designers, and particularly from DSP designers. Besides the basic capabilities to which every MATLAB user has access, the MathWorks sells toolboxes of routines for DSP design, filter design, control theory, system modeling, statistics, symbolic math, and PDEs. There are also a variety of third-party toolboxes (both free and commercial).

    I love MATLAB for its matrix syntax. I like the fact that it includes a wide range of numerical routines. But I'm not so attached to the rest of the language. MATLAB only allows one globally-visible function to be provided per file; that quickly becomes annoying for large programs. MATLAB does have explicit language support for OO design, but it's awkward. And if MATLAB strings seem well-designed compared to strings in C and FORTRAN, that's only because string manipulation in C and FORTRAN is so miserable.

    Almost every numerical code I've written, large or small, has started life as a MATLAB program. My MATLAB interface to FEAP (FEAPMEX) and the SUGAR and HiQLab simulators all combine a computational core in some compiled language with a MATLAB user interface. I've also used a mixture of MATLAB and compiled languages in my work on structured eigensolvers and in my work on network tomography. I sometimes use Octave, which is a free numerical system with a MATLAB-like syntax; but Octave only recently added decent sparse matrix support (which I use a lot), and its graphics aren't as nice. So I still mostly use MATLAB.

  • LISP

    LISP is a LISt Processing language; or is it Lots of Irritating Strings of Parentheses? In any event, LISP is one of the oldest high-level lenguages. It is also one of the simplest. LISP programs work on lists, or lists of lists nested in various ways. LISP programs also are lists themselves: the main data structure is not only used to represent data, but also the instructions that operate on the data. Consequently, for a LISP programmer it is very natural to write a function which takes a function as input and produces another function as output. Deep, no? The elegance and simplicity of such a language is rarely appreciated by students who learn Scheme (a LISP dialect) in a first programming course.

    At a superficial level, LISP confuses the uninitiated because it is a prefix language. For instance, in LISP I would write 1+2*3 as (+ 1 (* 2 3)): the operator comes before the operands, not between them. If Shakespeare wrote in LISP, Hamlet might have said (or to-be (not to-be)) (and in a post-fix language like Forth, it would be tobe tobe not or). The difference between prefix and infix notation is not deep, though, and most people figure it out pretty quickly, particularly if they ever used an HP calculator.

    At a deeper level, LISP is a functional language. A well-written LISP program typically will not involve assignments and loops; instead, operations are expressed recursively. If I wanted to sum the numbers from 1 to n in LISP (and didn't remember the formula n(n+1)/2), I might write something like

      (defun sumit(n) 
          (if (eq n 1) 
               1 
               (+ n (sumit (- n 1)))))
    

    That is, if n is 1, then the sum is 1; otherwise, the sum is n plus the sum through n-1. In contrast, if I was writing in an imperative style in a language like MATLAB, I might write

      function sum = sumit(n)
      sum = 0;
      for k = 1:n
        sum = sum + k;
      end
    

    It's possible to write in a functional style in a language like C, and it's possible to write in an imperative style in LISP. As with the distinction between C and C++, the difference is not in what the language can do; it is in the idioms to which the language lends itself. The same can probably be said of the technical jargon of different disciplines, but that's a topic for a different time.

    I used LISP in my Artificial Intelligence course (like every other person to take such a course). I also use Emacs LISP from time to time, and on a few occasions I wrote short functions in a LISP dialect which was used as one of the configuration languages for Collector (the flagship software product produced by the company I worked with in high school and college).

  • Pascal and Delphi

    I took AP Computer Science in high school -- sort of. You see, I'd already learned to program at the time, and it quickly became clear that it would be pointless for me to walk through the usual exercises. So I got to pursue my own projects in a semi-independent study; and I helped the other students in the class, which was also entertaining. The class was still taught in Pascal -- and the AP exam was still given in Pascal -- so I taught myself Pascal. I have few things to say about it, except that the support for character strings was pretty lame (again), and that I always found the C style of using semicolons to end statements to be more natural than the Pascal style of using semicolons to separate statements.

    Pascal was an adequate language when I used it, and I suppose it's fine for instructions. But Delphi was a completely different matter. We used TurboPascal for class, and whatever I thought of the basic Pascal language, the Turbo compiler was a delight -- and the folks who wrote the Turbo compiler extended the language in a good way (with better string support, for one thing). But Delphi! Delphi was -- is -- Borland's answer to Visual BASIC. The programming language is an extended version of Pascal which bears roughly the same relation to standard Pascal as C++ bears to standard C. Like Visual BASIC, a lot of Delphi programming was centered around GUI-driven events: click on this button and that piece of code is executed, for instance. But Pascal is a nicer programming language than BASIC, and the whole business was so much cleaner in Delphi that there was hardly a comparison. The Delphi compiler, like the Turbo compiler, was amazingly fast; the code it generated was pretty fast, too. As far as I could tell, it was in almost all ways technically superior to VB.

    I used Delphi to write user interface code, both for Digital Innovation and for the Department of Communication Services (DCS) at the University of Maryland.

  • Lua

    One of my favorite languages for the past couple years has been a little scripting language called Lua. The name comes from the Portuguese word for moon; the language and the name evolved from SOL, a Simple Object Language (sol is sun in Portuguese). SOL, in turn, was a descendant of DEL, a Data Entry Language. DEL, SOL, and Lua were all inspired by the needs of engineers to set up some sort of petroleum engineering simulations, so one of the criteria in the minds of the designers was can someone who only knows FORTRAN 77 pick this up? At the same time, the language was designed by computer scientists, and it has a number of lovely and useful features: functions as first-class objects, syntactic sugar to support object-oriented programming, and a very natural table-based data description syntax. At the same time, it's not a complex language: the complete syntax description fits on one page, and the language manual -- including the description of the extension API and standard libraries -- is about sixty pages. The interpreter is small, fast, and portable.

    One of the nice things about Lua is how well it serves as an extension language. Python, Perl, Tcl/Tk, and various other high-level scripting languages are easy to extend, but they are not as easy to use as extensions. Size is one issue: Python, Perl, and Tcl are not little languages. Syntax is another issue: when I start programming in Perl after a hiatus, I always forget half the dollar signs; and in Tcl, I always forget which sort of braces I should use. There are a few Scheme interpreters that are billed as extension languages, and from personal experience I know that I like Scheme for extensions. But -- the examples of AutoCAD and CADENCE notwithstanding -- I'd rather not spend my time teaching mechanical engineers how to use LISP. Besides, I like infix notation. Lua wins because it supports a variety of programming styles; because it is easy to call from C and can easily be extended from C; because it has an intuitive syntax; and because it's small and fast. I think the language underappreciated, though some of the video game industry has realized how nice Lua is: Lua is used to script the most recent Monkey Island games, among others.

    I use Lua in the HiQLab simulator both to describe my meshes and to script out analysis tasks. I used Lua for the same purpose when I designed SUGAR 3, though at that time I felt obliged to extend the Lua syntax in order to support some features I'd built into the input language I wrote for SUGAR 2. Let me give an example of why Lua works so nicely as a mesh description language. When you solve a differential equation -- partial or ordinary -- one of the things you always have to worry about is the behavior at the boundary. In HiQLab, I have a very flexible way to define boundary conditions: I write a Lua function to do the job. For instance, the code fragment

      mesh:set_bc(function(x,y)
        if x == 0 then return 'uu', 0, 0; end
      end)
    

    tells my code that all of the points along the line x = 0 in my two-dimensional mesh should not move. The 'uu' means that I'm specifying displacements in both the x and y directions; the two zeros are the values of the displacements. I can write my code this way because Lua treats functions as first class values: I can define a function and pass it as an argument to another function, just as I can in LISP. Also, I've defined an object-oriented interface for my geometry definitions: I have a mesh object (called mesh, strangely enough), and that object has a method called set_bc.

    I generate the interfaces between Lua and C/C++ code using an interface compiler that generates the glue code I need from a cleaned-up header file. I liked the features of the tolua interface compiler well enough that I added some of them to the matwrap interface compiler that I use to generate interfaces between MATLAB and C/C++ code.

    One summer at Digital Innovation, a co-worker and I worked on an interpreter for a language called EEL (Expression Evaluation Language) which we hoped would replace the LISP dialect used to script the Collector database system. I think Lua is what we might have hoped to design, but didn't quite achieve -- though when I mentioned this to Guy, he thought that EEL would still kick ass. I suppose it's a moot point, now.

  • AWK and Perl

    AWK was the first language I learned with really good text string processing. I suppose it's reasonable that I would be impressed by AWK's text processing: after all, the whole point of the language is pattern scanning and text processing. I use AWK for a lot of little things, like massaging text files full of data as I pass them from one program to another. I have a pair of shell scripts that I use a lot, awk-process and awk-unprocess, which let me quickly apply complicated edits to several programs and then undo those edits if I make a mistake.

    Perl (Practical Extraction and Reporting Language or Pathologically Eclectic Rubbish Lister, depending who you ask) includes the capabilities of AWK and much more. However, I rarely write an AWK script that's more than a page, and so I rarely find I want the extra capabilities I could get from Perl. I know enough Perl to be able to read and edit other people's code, but I write almost as little new Perl as I write new Fortran 77. If I want to write new scripts and I need something more than AWK, I'll use Python. Why Python? Because, unlike Perl, it doesn't look like someone accidentally transcribed modem line noise onto the screen. Perl's very powerful, but the syntax is something only a camel could love.

    I suppose Tcl/Tk is also an option for the same sort of high-level scripts that I might write in Perl or Python. Tcl/Tk is an older scripting language; its best point, in my opinion (and in the opinion of some others) is its integrated GUI toolkit (which is what Tk is). Tcl/Tk makes it possible to very quickly write graphical applications for Windows or UNIX, and I used it to make a graphical user interface to the program I wrote for my undergraduate research (TkBTG, a code for statistical prediction of spatial data). But while Tk is nice, I always had trouble debugging Tcl. You see, in Tcl there are specific meanings associated with {}, (), and []; and at critical times, I kept forgetting which was which. That said, I know that OpenSEES -- a locally-devloped open-source simulator for determining the response of structures in earthquakes -- uses Tcl as a scripting language.

  • PostScript

    It seems only fair to end with PostScript. PostScript is primarily a document description language, but it's a full-featured language which is capable of more than drawing documents (though I think there's a certain perversity involved in using PostScript to write a SETI@Home client, for instance). PostScript is the lingua franca for UNIX printer drivers, and many high-end printers will accept PostScript directly.

    PostScript is a postfix language. That is, instead of putting the functions and operators before the data on which they operate (as happens in LISP), one puts the operators after the data. The PostScript interpreter maintains a stack, and whenever it sees an operation -- like + -- it will take an appropriate number of input operands off the top of the stack and push a result onto the top of the stack. The fact that the interpreter hardly needs anything more than a stack and a table of variable definitions means that the interpreter can be pretty simple. Also, because the interpreter can usually do something immediately when it sees a word -- like push something onto the stack, or run an operation -- a PostScript interpreter needs to include only a minimal parser.

    To give an idea how this all works, let me give a simple example. The following program prints the numbers 1-20 in a column running down the side of the screen:

      /Times-Roman findfont 15 scalefont setfont
      1 1 20 {
        dup                % stack: n, n
        -18 mul 720 add    % stack: n, -18*n+720
        72 exch            % stack: n, 72, -18*n+720
        moveto             % stack: n
        3 string cvs show  % stack:
      } for
      showpage
    

    The first line sets the font (15 point Times Roman), and the last line shows the page. The lines in between are the interesting part. In PostScript braces denote a procedure: these are the only things that aren't immediately processed in full when the interpreter sees them. Here, the interpreter pushes the loop bounds (1 through 20 in steps of 1), then the loop body procedure. Then it calls the for command, which says pop three elements from the stack and use them as loop bounds; for each value of the index (from 1-20, here), push the index value on the stack and then execute the procedure inside. The procedure inside computes a position on the page (the point 72,-18*n+720), moves there, and prints a three-character string containing the value of n.

    If you think that's intuitive, you've either programmed in PostScript before or your mind has been warped by too much time with an HP calculator. Or, possibly, your mind works better with postfix notation than mine does. Postfix languages are attractive because they are simple to implement and because it's simple to automatically generate code for them -- not necessarily because it's simple for a human being to write code for them.

    Not only are stack-based interpreters simple to implement, but it is also simple to generate code for them. Most of the time, I only work with PostScript through automated tools (sometimes produced by others, sometimes deviced by me). But the idea of a stack-based machine isn't good just for generating printer commands. The Java Virtual Machine is stack-based; so is the virtual machine that Lua uses internally. The x87 floating point coprocessors and the integrated floating-point units in some later Intel chips are stack-based; this last is a mixed blessing, but that's a story for some other day.