Tuesday, September 06, 2005

Honest exercise

Alas, we didn't get through all the Common Lisp exercises I'd planned for section today. This is too bad, because I would have liked to have people think about the last exercise: given a collection of (programmer program) pairs, how can you find the sets of programmers with structurally identical codes? It took me about five minutes to do this one (though this may be because I already had a solution in mind when I wrote the question). I mention the solution here:


(defun code-to-struct (code)
  "Replace all atoms in a code with 'foo in order to compare structures"
  (when code
    (if (listp code)
        `(,(code-to-struct (car code)) . ,(code-to-struct (cdr code)))
        'foo)))

(defun cluster-structure (codes)
  "Find programmers who have structurally identical codes."
  (let ((ht (make-hash-table :test #'equal))
        (groups nil))
    (dolist (code codes)
      (push (car code) (gethash (code-to-struct (cdr code)) ht)))
    (maphash #'(lambda (key val) (push val groups)) ht)
    groups)))

Let me restate that in case it went by too fast: it's possible to write a Common Lisp code which will find cheaters, and to do it in under 20 lines of code. Furthermore, that code is reasonably efficient: unless the Lisp implementation chooses a really terrible hash function, the cost of the above code is linear in the sums of the lengths of the programs. It doesn't take long to write, it doesn't take long to run, it doesn't take long to improve.

When I took compilers as an undergrad, a bunch of folks in the class got caught plagiarizing each others codes. This is, I think, one of the stupidest plagiarism cases I can think of: here you are in a class, the whole point of which is to automatically analyze programs. You are being taught by someone who is, presumably, very good at writing tools to automatically analyze programs. What, it seems like writing a program to find similar programs would be an impossible task for this guy? In Common Lisp, it takes an overworked graduate student five minutes to write a detector for structurally identical codes; it would take me another fifteen minutes to modify code-to-struct to catch workarounds (e.g. re-ordering of instructions or other superficial changes to program structure). It would take somewhat longer for other languages, but not so long as you might think -- this sort of detector doesn't need to know about every syntactic bell and whistle in a given language (and anyhow, it's possible to find complete grammars for many languages for free on the net).

Don't cheat in a compilers class seems to me to be only slightly more difficult to figure out than don't piss on the electric fence. Here's hoping that this class feels the same...