Planet Lisp

Zach BeaneCommon Lisp bits

· 11 hours ago
Heinrich Apfelmus has updated to the source code from Computer Models of Musical Creativity and put it on github. Looks like it's meant to work with RMCL.

"CEPL is an extension for common lisp that makes working with OpenGL simple and familiar," according to Baggers. There is a blog and a number of videos about CEPL. The readme cautions: PRE-ALPHA.

"BG" gives a take on the history of Macintosh Common Lisp. Rainer Joswig responded to a number of points in the ensuing /r/lisp discussion.

3bgl-shader is "a Common Lisp DSL for generating GLSL shaders," by Bart Botta. Needs people to try it out and provide feedback.

Pseudo is a Lisp-powered roguelike multiplayer browser game, with AGPLv3-licensed source code available. Created by Matthew Carter.

The Infected is a roguelke survival horror game in Common Lisp, by Jan Tatham.

Mariano Montone writes about embedding Python syntax (and functionality) in Common Lisp sources.

Quicklisp newsSeptember 2014 Quicklisp dist update now available

· 3 days ago
New projects:
Updated projects: bknr-datastore, caveman, cl-ana, cl-async, cl-conspack, cl-css, cl-gendoc, cl-gss, cl-inflector, cl-oauth, cl-olefs, cl-quickcheck, cl-redis, cl-sdl2, cl-tld, clip, closer-mop, coleslaw, colleen, crane, crypto-shortcuts, function-cache, gbbopen, hermetic, hu.dwim.walker, let-over-lambda, lisp-unit2, lquery, mel-base, mexpr, mgl-pax, modularize, modularize-hooks, modularize-interfaces, mpc, open-vrp, pgloader, plump, policy-cond, protobuf, qmynd, repl-utilities, restas, scriptl, shelly, smug, software-evolution, south, staple, stumpwm, trivial-mimes, weblocks-tree-widget.

To get this update, use (ql:update-dist "quicklisp").

Just as a friendly reminder, Quickdocs is a great way to find libraries in Quicklisp. I don't run the site and it's not an official part of Quicklisp, it's just a great project that uses Quicklisp's metadata to build a really useful service. So check it out!

Timofei ShatrovWho needs graph theory anyway?

· 3 days ago

In my last post I discussed how to make a Japanese->English transliterator and outlined some problems that limited its usefulness. One problem is that there’s no obvious way to segment a sentence into words. I looked up existing solutions, and a lightweight Javascript implementation caught my eye. I quickly ported it to Common Lisp and to the surprise of absolutely no one, the results were awful

It was clear that I needed an actual database of Japanese words to do segmentation properly. This would also solve the “kanji problem” since this database would also include how to pronounce the words. My first hunch was Wiktionary, but it’s dump format turned out to be pretty inefficient for parsing.

Fortunately I quickly discovered a free JMDict database which was exactly what I needed. It even had open-source code in Python for parsing and loading its XML dumps. Naturally, I wrote my own code to parse it since its database schema looked too complex for my needs. But I’m not going to discuss that in this post, as it is quite boring.

Since now I had a comprehensive Postgres database of every word in Japanese language (not really, as it doesn’t include conjugations) it was only a matter of identifying the words in the sentence. To do this, for every substring of a sentence look up the database for exact matches. There are n(n+1)/2 substrings in a string, so we aren’t doing too badly in terms of performance (and the string wouldn’t be too long anyway since prior to running this procedure I’ll be splitting it by punctuation etc.)

(defstruct segment
  start end word))

(defun find-substring-words (str)
  (loop for start from 0 below (length str)
       (loop for end from (1+ start) upto (length str)
          for substr = (subseq str start end)
            nconcing (mapcar 
                      (lambda (word)
                        (make-segment :start start :end end :word word))
                      (find-word substr)))))

The problem is that there’s a lot of words, and many of them are spelled identically. I decided to assign a score to each word based on its length (longer is better), whether it’s a preferred spelling of the word, how common the word is and whether it’s a particle (which tend to be short and thus need a boost to increase their prominence).

Now we have the following problem: for a sentence, find the set of non-intersecting segments with the maximum total score. Now, you might have better mathematical intuition than I, but my first thought was:

This looks NP-hard, man. This problem has “travelling salesman" written all over it.

My first attempt to crack it was to calculate score per letter for each word and select words with the highest scores. But a counterexample comes to mind rather easily: in a sentence “ABC” with words “AB” (score=5), “BC” (score=5) and “ABC” (score=6), words “AB” and “BC” have a higher score per letter (2.5), but the optimal covering is provided by the word “ABC” with its score per letter a measly 2.

At this point I was working with the most convenient mathematical instrument, which is pen and paper. The breakthrough came when I started to consider a certain relation between two segments: the segment a can be followed by the segment iff (segment-start b) is greater or equal to (segment-end a). Under this relation our segments form transitive directed acyclic graph. The proof is left as an exercise for the reader. Clearly we just need to do a transitive reduction and use something similar to Dijkstra’s algorithm to find the path with the maximal score! This problem is clearly solvable in polynomial time!

Pictured: actual notes drawn by me


In reality the algorithm turns out to be quite simple. Since find-substring-words always returns segments sorted by their start and then by their end, every segment can only be followed by the segments after it. We can then accumulate the largest total score and the path used for it for every segment by using a nested loop:

(defstruct segment
  start end word (score nil) (accum 0) (path nil))

(defun find-best-path (segments)
  ;;assume segments are sorted by (start, end) (as is the result of find-substring-words)
  (let ((best-accum 0)
        (best-path nil))
    (loop for (seg1 . rest) on segments
       when (> (segment-score seg1) (segment-accum seg1))
         do (setf (segment-accum seg1) (segment-score seg1)
                  (segment-path seg1) (list seg1))
            (when (> (segment-accum seg1) best-accum)
              (setf best-accum (segment-accum seg1)
                    best-path (segment-path seg1)))
       when (> (segment-score seg1) 0)
         do (loop for seg2 in rest
               if (>= (segment-start seg2) (segment-end seg1))
               do (let ((accum (+ (segment-accum seg1) (segment-score seg2))))
                    (when (> accum (segment-accum seg2))
                      (setf (segment-accum seg2) accum
                            (segment-path seg2) (cons seg2 (segment-path seg1)))
                      (when (> accum best-accum)
                        (setf best-accum accum
                              best-path (segment-path seg2)))))))
    (values (nreverse best-path) best-accum)))

Of course when I actually tried to run this algorithm, SBCL just crashed. How could that be? It took me a while to figure out, but notice how segment-path contains a list that includes the segment itself. A recursive self-referential structure! When SBCL tried to print that in the REPL, it didn’t result in dragons flying out of my nose but a crash still happened. Interestingly, Common Lisp has a solution to this: if *print-circle* is set to t, it will actually print the structure using referential tokens. Anyway, I just added the following before returning the result to remove self-references:

    (dolist (segment segments)
      (setf (segment-path segment) nil))

So, did it work? Yes, it did, and the result was impressive! Even though my scoring system is pretty barebones, it’s on par or even better than Google Translate’s romanization on a few test sentences I tried. I still need to add conjugations, and it can’t do personal names at all, but considering how little code there is and the fact that it doesn’t even attempt grammatical analysis of the sentence (due to me not knowing the language) I am very happy with the result. Also I plan to add a web interface to it so that it’s possible to hover over words and see the translation. That would be pretty useful. The work in progress code is on my Github.

Paul KhuongDoodle: Hybridising SBCL's GENCGC With Mark and Sweep

· 4 days ago

Meta-note: this is more of a journal entry than the usual post here. I’ll experiment with the format and see if I like publishing such literal and figurative doodles.

Garbage collection is in the air. My friend Maxime is having issues with D’s garbage collector, and Martin Cracauer has a large patch to improve SBCL’s handling of conservative references. I started reviewing that patch today, and, after some discussion with Alastair Bridgewater, I feel like adding a mark-and-sweep component to SBCL’s GC might be easier than what the patch does, while achieving its goal of reducing the impact of conservative references. That lead to the whiteboarding episode below and a plan to replace the garbage collecting half of SBCL’s generational GC. But first, a summary of the current implementation.

The present, and how we got here

CMUCL started out with a Cheney-style two-space collector. Two-space collectors free up space for more allocations by copying objects that might still be useful (that are reachable from “roots,” e.g., variables on the stack) from the old space to the new space. Cheney’s algorithm elegantly simplifies this task by storing bookkeeping information in the data itself. When we copy an object to the new space (because it is reachable), we want to make sure that all other references to that object are also replaced with references to the copy. Cheney’s solution to that desideratum is obvious: overwrite the old object with a broken heart (forwarding pointer), a marker that

  1. the object has already been copied to the new space;
  2. the copy lives at address x.

This adds a constraint that heap-allocated objects can never be smaller than a broken heart, but they’re usually one or two words (two in SBCL’s case) so the constraint is rarely binding.

When the garbage collector traverses the roots (the stack, for example) and finds a pointer, the code only has to dereference that pointer to determine if the objects it points to has been moved. If so, the GC replaces the root pointer with a pointer to the copy in the new space. Otherwise, the GC copies the object to the new space, repoints to that copy, and overwrites the old object with a broken heart.

We also need to traverse objects recursively: when we find that an object is live and copy it to the new space, we must also make sure that anything that objects points to is also preserved, and that any pointer in that object is updated with pointers to copies in the new space.

That’s a graph traversal, and the obvious implementation maintains a workset of objects to visit which, in the worst case, could include all the objects in old space. The good news is we don’t have to worry about objects re-entering that workset: we always overwrite objects (in old space) with a broken heart when we visit them for the first time.

Cheney proposed a clever trick to implement this workset. Whenever an object enters the workset, it has just been copied to the new space; as long as we allocate in the new space by incrementing an allocation pointer, the new space itself can serve as the workset! In addition to the allocation pointer, we now need a “to-scan” pointer. Any object in the new space that’s below the to-scan pointer has already been scanned for pointers and fixed to point in the new space; any object between the to-scan pointer and the allocation pointer must be scanned for pointers to the old space. We pop an element from the workset by looking at the next object (in terms of address) after the to-scan pointer and incrementing that pointer by the object’s size. When the to-scan and the allocation pointers meet, the workset is empty and GC terminates.

Some SBCL platforms still use this two-space collector, but it doesn’t scale very well to large heaps (throughput is usually fine, but we waste a lot of space and GC pauses can be long). The generational conservative garbage collector (GENCGC, GENGC on precise/non-conservative platforms) is a hack on top of that Cheney GC.

The GC is “generational” because most passes only collect garbage from a small fraction of the heap, and “conservative” because we have to deal with values that may or may not be pointers (e.g., we don’t always know if the value in a register is a Lisp reference or just a machine integer) by considering some objects as live (not up for collection) while pinning them at their current address.

The runtime uses mprotect to record writes to the heap, except for the nursery (newly allocated objects) where we expect most writes to land. The heap is partitioned in pages, and the first write to a page after a GC triggers a protection fault; the signal handler marks that page as mutated and changes the protection to allow writes.

When a GC is triggered, we usually want to collect only the nursery, i.e., only objects that were allocated after the previous GC pass. GEN(C)GC adapts Cheney to this use case by building the set of all pages that might have been mutated to point somewhere in the nursery (thanks to the mprotect write barrier) and scanning them for roots, like the stack in Cheney GC. The default GENGC configuration has 7 generations and we extend this scheme by flagging pages with pointers to younger generations (newer objects), without noting what these generations might be.

Pinned objects are also handled by abusing the root set: pages that contain at least one pinned object don’t undergo garbage collection and are directly scanned for pointers, like the stack in Cheney GC.

Instead of having two heaps, an old space and a new space, we now have a lot of pages, and each page belongs to a generation. When we want to collect a given generation, pages in that generation form the old space, and pages allocated during GC the new space. This means that we lose the simplicity of Cheney’s new-space-is-the-workset trick: the new space isn’t contiguous, so a single to-scan pointer doesn’t cut it anymore! GENGC works around that by scanning the page table, but it’s not pretty and I really don’t know if Cheney is a good fit anymore.

Martin Cracauer’s patch

GENCGC’s approach to pinned objects is stupid. If a page has no reference except for one conservative pointer, the whole page is considered live and scanned for references.

Martin’s solution is to allocate additional temporary metadata only for pinned pages and track the pinned status of individual objects. When the GC encounters a pointer to a page with pinned objects, it checks if it’s a pointer to a pinned object. If so, the pointee is left in place. Otherwise, it’s copied normally.

The patch has code to mark objects as live (pinned) and to overwrite objects once they have been copied. Basically, it is half of a mark-and-sweep garbage collector. The main difference is that the set of pinned objects doesn’t grow (being pinned isn’t a contagious property), so we don’t need a worklist for pinned objects. However, I already noted that I’m not convinced the worklist hack in GENGC is a good idea.

A hybrid collector!

Instead of marking pages as containing pinned objects, I feel it may be simpler to collect some pages by copying, and others by marking. Any pinned page would have the “mark” GC policy, while pages that likely contain few live objects (e.g., the nursery and pages with a lot of unused memory) would be collected by copying. This too would avoid the issue with considering whole pages as live when pinned, and I think that having the choice of copying or marking at a page granularity will be simpler than toggling at the level of individual object.

Each “mark” page now has two (bit)sets, one for live objects and another for live objects that have already been scanned. We can maintain a worklist at the page granularity with an embedded linked list: whenever a “mark” page gains a new live object and it’s not already in the worklist, that page is enqueued for scanning.

Instead of emulating Cheney’s trick by looking for newly allocated pages in our page table, we can add pages in new space to the worklist whenever they become full.

Finally, at the end of the pass, we traverse all “mark” pages and clear dead objects.

That’s pretty simple (arguably simpler than the current implementation!), and shouldn’t involve too many changes to the rest of the code. Mostly, I’d have to adapt the weak pointer machinery to stop assuming that it can use forwarding pointers to determine when objects have been marked as live.

However, we might lose the ability to run medium GCs, to collect more than the nursery but less than the whole heap. If we only want to GC the nursery, the mprotect write barrier gives us all the information we need to find references from the rest of the heap to the nursery. If we wish to collect the whole heap, we only have to consider stacks and some statically allocated space as roots.

For medium GCs, e.g., collect only generations 1-4 out of 7, GENGC exploits the way that garbage collection (almost) always copies to easily track pages with pointers to younger generations. It’s coarse, but usually acceptable thanks to the copying. I don’t know that it would work as well if the default is to only copy the nursery. Moreover, if we have a hybrid GC, it probably makes sense to focus copying on pages that are mostly empty, regardless of their age. If we do want medium GCs, we might have to track, for each page, the set of pages that point there. This set can include false positives, so it’s probably easiest to clear it before major GCs, and otherwise only add to that set (removing pages that were emptied by a GC pass sounds reasonable). I also expect that some pages will have many refererrers; I’m thinking we might use a distinguished value to mean “referred by every pages” and not consider them for medium GC.

What’s next

Martin’s patch clearly addresses an important weakness in SBCL’s garbage collector. If I can’t make good progress on the hybrid GC soon, I’ll make sure the patch is cleaned up for master, hopefully by Thanksgiving.

Clozure CL BlogClozure CL 1.10 is available

· 5 days ago

Clozure CL 1.10 is now available.  See for instructions on how to get it.

Christophe Rhodesnaive vs proper code-walking

· 9 days ago

I said in my discussion about backquote representations that some utilities had defects made manifest by SBCL 1.2.2's new internal representation for backquote and related operators, and that those defects could have been avoided by using a code-walker. I'm going to look at let-over-lambda code here, to try to demonstrate what I meant by that, and show how a proper code-walker can quite straightforwardly be used for the code transformations that have been implemented using a naïve walker (typically walking over a tree of conses), removing whole classes of defects in the process.

The let-over-lambda code I'm discussing is from, specifically this version. This isn't intended to be a hatchet job on the utility - clearly, it is of use to its users - but to show up potential problems and offer solutions for how to fix them. I should also state up front that I haven't read the Let over Lambda book, but it's entirely possible that discussing and using a full code-walker would have been out of scope (as it explicitly was for On Lisp).

Firstly, let's deal with how the maintainer of the let-over-lambda code is dealing with the change in backquote representations, since it's still topical:

;; package definition here just in case someone decides to paste
;; things into a Lisp session, and for private namespacing
(defpackage "LOL" (:use "CL"))
(in-package "LOL")
;; actual excerpts from let-over-lambda code from
;; <>
;; begins here:
(if (string-lessp (lisp-implementation-version) "1.2.2")
    (pushnew :safe-sbcl *features*)
    (setq *features* (remove :safe-sbcl *features*)))
(defun flatten (x)
  (labels ((rec (x acc)
             (cond ((null x) acc)
                   #+(and sbcl (not safe-sbcl))
                   ((typep x 'sb-impl::comma) (rec (sb-impl::comma-expr x) acc))
                   ((atom x) (cons x acc))
                   (t (rec (car x) (rec (cdr x) acc))))))
    (rec x nil)))

The issues around the (*features*) handling here have been reported at github; for the purpose of this blog entry, I will just say that I wrote about them in Maintaining Portable Lisp Programs, a long time ago, and that a better version might look a bit like this:

(eval-when (:compile-toplevel :execute)
  (defun comma-implementation ()
    (typecase '`,x
      (symbol 'old)
      ((cons symbol (cons structure-object)) 'new)))
  (if (eql (comma-implementation) 'old)
      (pushnew 'cons-walkable-backquote *features*)
      (setq *features* (remove 'cons-walkable-backquote *features*))))
(defun flatten (x)
  (labels ((rec (x acc)
             (cond ((null x) acc)
                   ((typep x 'sb-impl::comma) (rec (sb-impl::comma-expr x) acc))
                   ((atom x) (cons x acc))
                   (t (rec (car x) (rec (cdr x) acc))))))
    (rec x nil)))

With these changes, the code is (relatively) robustly testing for the particular feature it needs to know about at the time that it needs to know, and recording it in a way that doesn't risk confusion or contention with any other body of code. What is the let-over-lambda library using flatten for?

(defun g!-symbol-p (thing)
  (and (symbolp thing)
       (eql (mismatch (symbol-name thing) "G!") 2)))
(defmacro defmacro/g! (name args &rest body)
  (let ((syms (remove-duplicates
               (remove-if-not #'g!-symbol-p (flatten body)))))
    `(defmacro ,name ,args
       (let ,(mapcar
              (lambda (s)
                `(,s (gensym ,(subseq (symbol-name s) 2))))

The intent behind this macro-defining macro, defmacro/g!, appears to be automatic gensym generation: being able to write

(defmacro/g! with-foo ((foo) &body body)
  `(let ((,g!foo (activate-foo ,foo)))
         (progn ,@body)
       (deactivate-foo ,g!foo))))

without any explicit calls to gensym but retaining the protection that gensyms give against name capture:

(macroexpand-1 '(with-foo (3) 4))
; => (let ((#1=#:FOO1 (activate-foo 3)))
;      (unwind-protect
;          (progn 4)
;        (deactivate-foo #1#)))

That's fine; it's reasonable to want something like this. Are there any issues with this, apart from the one exposed by SBCL's new backquote implementation? In its conventional use, probably not - essentially, all uses of g! symbols are unquoted (i.e. behind commas) - but there are a couple of more theoretical points. One issue is that flatten as it currently stands will look for all symbols beginning with g! in the macroexpander function source, whether or not they are actually variable evaluations:

(defmacro/g! with-bar ((bar) &body body)
  `(block g!block
     (let ((,g!bar ,bar)) ,@body)))
; unused variable G!BLOCK
(macroexpand-1 '(with-bar (3) 4))
; => (block g!block (let ((#:BAR1 3)) 4))

In this example, that's fair enough: it's probably user error to have those g! symbols not be unquoted; this probably only becomes a real problem if there are macro-defining macros, with both the definer and the definition using g! symbols. It's not totally straightforward to demonstrate other problems with this simple approach to Lisp code transformation using just this macro; the transformation is sufficiently minimal, and the symptoms of problems relatively innocuous, that existing programming conventions are strong enough to prevent anything seriously untoward going wrong.

Before getting on to another example where the problems with this approach become more apparent, how could this transformation be done properly? By "properly" here I mean that the defmacro/g! should arrange to bind gensyms only for those g! symbols which are to be evaluated by the macroexpander, and not for those which are used for any other purpose. This is a task for a code-walker: a piece of code which exploits the fact that Lisp code is made up of Lisp data structures, all of which are introspectable, and the semantics of which in terms of effect on environment and execution are known. It is tedious, though possible, to write a mostly-portable code-walker (there needs to be some hook into the implementation's representation of environments); I'm not going to do that here, but instead will use SBCL's built-in code-walker.

The sb-walker:walk-form function takes three arguments: a form to walk, an initial environment to walk it in, and a walker function to perform whatever action is necessary on the walk. That walker function itself takes three arguments, a form, context and environment, and the walker arranges for it to be called on every macroexpanded or evaluated subform in the original form. The walker function should return a replacement form for the subform it is given (or the subform itself if it doesn't want to take any action), and a secondary value of t if no further walking of that form should take place.

To do g! symbol detection and binding is fairly straightforward. If a symbol is in a context for evaluation, we collect it, and here we can take the first benefit from a proper code walk: we only collect g! symbols if the code-walker deems that they will be evaluated and there isn't an already-existing lexical binding for it:

(defmacro defmacro/g!-walked (name args &body body)
  (let* (g!symbols)
    (flet ((g!-walker (subform context env)
             (declare (ignore context))
             (typecase subform
                (when (and (g!-symbol-p subform)
                           (not (sb-walker:var-lexical-p subform env)))
                  (pushnew subform g!symbols))
               (t subform))))
      (sb-walker:walk-form `(progn ,@body) nil #'g!-walker)
      `(defmacro ,name ,args
         (let ,(mapcar (lambda (s) (list s `(gensym ,(subseq (symbol-name s) 2))))

The fact that we only collect symbols which will be evaluated deals with the problem exhibited by with-bar, above:

(defmacro/g!-walked with-bar/walked ((bar) &body body)
  `(block g!block
     (let ((,g!bar ,bar)) ,@body)))
(macroexpand-1 '(with-bar/walked (3) 4))
; => (block g!block (let ((#:BAR1 3)) 4))

Only gathering symbols which don't have lexical bindings (testing sb-walker:var-lexical-p) deals with another minor problem:

(defmacro/g!-walked with-baz ((baz) &body body)
  (let ((g!sym 'sym))
    `(let ((,g!sym ,baz)) ,@body)))
(macroexpand-1 '(with-baz (3) 4))
; => (let ((sym 3)) 4)

(the cons-walker - flatten - would not be able to detect that there is already a binding for g!sym, and would introduce another one, again leading to an unused variable warning.)

OK, time to recap. So far, we've corrected the code that tests for particular backquote implementations, which was used in flatten, which itself was used to perform a code-walk; we've also seen some low-impact or theoretical problems with that simple code-walking technique, and have used a proper code-walker instead of flatten to deal with those problems. If the odd extra unused variable binding were the worst thing that could happen, there wouldn't be much benefit from using a code-walker (other than the assurance that the walker is dealing with forms for execution); however, let us now turn our attention to the other macro in let-over-lambda's code which does significant codewalking:

(defun dollar-symbol-p (thing)
  (and (symbolp thing)
       (char= (char (symbol-name thing) 0) #\$)
       (ignore-errors (parse-integer (subseq (symbol-name thing) 1)))))
(defun prune-if-match-bodies-from-sub-lexical-scope (tree)
  (if (consp tree)
      (if (or (eq (car tree) 'if-match)
              (eq (car tree) 'when-match))
          (cddr tree)
          (cons (prune-if-match-bodies-from-sub-lexical-scope (car tree))
                (prune-if-match-bodies-from-sub-lexical-scope (cdr tree))))
;; WARNING: Not %100 correct. Removes forms like (... if-match ...) from the
;; sub-lexical scope even though this isn't an invocation of the macro.
(defmacro! if-match ((test str) conseq &optional altern)
  (let ((dollars (remove-duplicates
                  (remove-if-not #'dollar-symbol-p
                                 (flatten (prune-if-match-bodies-from-sub-lexical-scope conseq))))))
    (let ((top (or (car (sort (mapcar #'dollar-symbol-p dollars) #'>)) 0)))
      `(let ((,g!str ,str))
         (multiple-value-bind (,g!s ,g!e ,g!ms ,g!me) (,test ,g!str)
           (declare (ignorable ,g!e ,g!me))
           (if ,g!s
               (if (< (length ,g!ms) ,top)
                   (error "ifmatch: too few matches")
                   ;; lightly edited here to remove irrelevant use of #`
                   (let ,(mapcar (lambda (a1) `(,(symb "$" a1)
                                                (subseq ,g!str (aref ,g!ms ,(1- a1))
                                                               (aref ,g!me ,(1- a1)))))
                                 (loop for i from 1 to top collect i))
(defmacro when-match ((test str) conseq &rest more-conseq)
  `(if-match (,test ,str)
     (progn ,conseq ,@more-conseq)))

What's going on here? We have a prune-if-match-bodies-from-sub-lexical-scope function which, again, performs some kind of cons-based tree walk, removing some conses whose car is if-match or when-match. We have a trivial macro when-match which transforms into an if-match; the if-match macro is more involved. Any symbols named as a $ sign followed by an integer (in base 10) are treated specially; the intent is that they will be bound to capture groups of the cl-ppcre match. So it would be used in something like something like

(defun key-value (line)
  (if-match ((lambda (s) (scan "^\\(.*\\): \\(.*\\)$" s)) line)
      (list $1 $2)
      (error "not actually a key-value line: ~S" line)))

and that would macroexpand to, roughly,

(defun key-value (line)
  (multiple-value-bind (s e ms me)
      ((lambda (s) (scan "^\\(.*\\): \\(.*\\)$" s)) line)
    (if s
        (if (< (length ms) 2)
            (error "if-match: not enough matches)
            (let (($1 (subseq line (aref ms 0) (aref me 0)))
                  ($2 (subseq line (aref ms 1) (aref me 1))))
              (list $1 $2)))
        (error "not actually a key-value line: ~S" line))))

(there's additional reader macrology in let-over-lambda to make that lambda form unnecessary, but we can ignore that for our purposes).

Now, if-match has a similar problem that defmacro/g! had: since the tree walker doesn't make a distinction between symbols present for evaluation and symbols for any other purpose, it is possible to confuse the walker. For example:

(if-match (scanner string)
    (if (> (length $1) 6)

This form, if macroexpanded, will attempt to bind one million variables to matched groups; even if the compiler doesn't choke on that, evaluation will go wrong, as the matcher is unlikely to match one million groups (so the "not enough matches" error branch will be taken) - whereas of course the quoted one million dollar symbol is not intended for evaluation.

But the nesting problems are more obvious in this case than for defmacro/g!. Firstly, take the simple case:

(if-match (scanner string)
    (list $1
          (if-match (scanner2 string)

Here, the $2 is in the scope of the inner if-match, and so mustn't be included for the macroexpansion of the outer if-match. This case is handled in let-over-lambda's implementation by the prune-if-match-bodies-from-sub-lexical-scope: the consequent of the inner if-match is pruned from the dollar-symbol accumulator. However, there are several issues with this; the first is that the test is pruned:

(if-match (scanner string)
    (if-match (scanner2 $2)

In this example, the $2 is 'invisible' to the outer if-match, and so won't get a binding. That's straightforwardly fixable, along with the mishandling of when-let's syntax (the entire body of when-let should be pruned, not just the first form), and what I think is an error in the pruning of if-match (it should recurse on the cdddr, not the cddr; github issue).

Not fixable at all while still using naïve code-walking are two other problems, one of which is noted in the comment present in the let-over-lambda code: the pruner doesn't distinguish between if-match forms for evaluation and other conses whose car is if-match. Triggering this problem does involve some contortions - in order for it to matter, we need an if-match not for evaluation followed by a dollar symbol which is to be evaluated; but, for example:

(defmacro list$/q (&rest args)
  `(list ,@(mapcar (lambda (x) (if (dollar-symbol-p x) x `',x)) args)))
(if-match (scanner string)
    (list$/q foo if-match $2)

Here, although the $2 is in a position for evaluation (after macroexpansion), it will have no binding because it will have been pruned when naïvely walking the outer if-match macro. The if-match symbol argument to `list$/q ends up quoted, and should not be treated as a macro call.

Also, the pruner function must have special knowledge not just about the semantics of if-match, but also of any macro which can expand to if-match - see the attempt to handle when-match in the pruner. If a user were to have the temerity to define case-match

(defmacro case-match (string &rest clauses)
  (if (null clauses)
      `(if-match (,(caar clauses) ,string)
           (progn ,@(cdar clauses))
           (case-match string ,@(cdr clauses)))))

any attempt to nest a case-match inside an outer if-match is liable to fail, as the pruner has no knowledge of how to handle the case-match form.

All of these problems are solvable by using a proper code-walker. The code-walker should collect up all dollar symbols to be evaluated in the consequent of an if-match form, so that bindings for them can be generated, except for those with already existing lexical bindings within the if-match (not those from outside, otherwise nesting won't work). For testing purposes, we'll also signal a diagnostic condition within the macroexpander to indicate which dollar symbols we've found.

(define-condition if-match/walked-diagnostic (condition)
  ((symbols :initarg :symbols :reader if-match-symbols)))
(defmacro if-match/walked ((test string) consequent &optional alternative)
  (let* (dollar-symbols)
    (flet ((dollar-walker (subform context env)
             (declare (ignore context))
             (typecase subform
                (when (and (dollar-symbol-p subform)
                           (not (sb-walker:var-lexical-p subform env)))
                  (pushnew subform dollar-symbols))
               (t subform))))
      (handler-bind ((if-match/walked-diagnostic #'continue))
        (sb-walker:walk-form consequent nil #'dollar-walker))
      (let* ((dollar-symbols (sort dollar-symbols #'> :key #'dollar-symbol-p))
             (top (dollar-symbol-p (car dollar-symbols))))
        (with-simple-restart (continue "Ignore diagnostic condition")
          (signal 'if-match/walked-diagnostic :symbols dollar-symbols))
        (sb-int:with-unique-names (start end match-start match-end)
          (sb-int:once-only ((string string))
            `(multiple-value-bind (,start ,end ,match-start ,match-end)
                 (,test ,string)
               (declare (ignore ,end) (ignorable ,match-end))
               (if ,start
                   (if (< (length ,match-start) ,top)
                       (error "~S: too few matches: needed ~D, got ~D." 'if-match
                              ,top (length ,match-start))
                       (let ,(mapcar (lambda (s)
                                       (let ((i (1- (dollar-symbol-p s))))
                                         `(,s (subseq ,string (aref ,match-start ,i) (aref ,match-end ,i)))))
                                     (reverse dollar-symbols))

(I'm using sb-int:once-only and sb-int:with-unique-names to avoid having to include their definitions in this post, which is getting a bit lengthy). Testing this looks like

(defmacro test-if-match (form expected-symbols)
  `(handler-case (macroexpand-1 ',form)
     (if-match/walked-diagnostic (c)
       (assert (equal (if-match-symbols c) ',expected-symbols)))
     (:no-error (&rest values) (declare (ignore values)) (error "no diagnostic"))))
(test-if-match (if-match/walked (test string) (list $1 $2) 'foo) ($2 $1))
(test-if-match (if-match/walked (test string) (if (> (length $1) 6) '$10 '$8) nil) ($1))
(test-if-match (if-match/walked (scanner string)
                   (list $1
                         (if-match/walked (scanner2 string)
(test-if-match (if-match/walked (scanner string) (list$/q foo if-match/walked $3) nil) ($3))
(defmacro case-match/walked (string &rest clauses)
  (if (null clauses)
      `(if-match/walked (,(caar clauses) ,string)
           (progn ,@(cdar clauses))
           (case-match/walked string ,@(cdr clauses)))))
(test-if-match (if-match/walked (scanner string)
                   (case-match/walked $1
                     (foo $2)
                     (bar $3)))

To summarize: I've shown here how to make use of a full code-walker to make a couple of code transforming macros more robust. Full code-walkers can do more than just what I've shown here: the sb-walker:walk-form interface can also inhibit macroexpansion, transform function calls into calls to other functions, while respecting the semantics of the Lisp operators in the code that is being walked and allowing some introspection of the lexical environment. Here, we have called sb-walker:walk-form for side effects from the walker function we've provided; it is also possible to use its value (that's how sb-cltl2:macroexpand-all is implemented, for example). I hope that this can help users affected by the change in internal representation of backquote, as well as others who want to write advanced code-transforming macros. If the thought of using an SBCL-internal code-walker makes you a bit queasy (as well it might), you could instead start by looking at one or two other more explicitly-portable code-walkers out there, for example John Fremlin's macroexpand-dammit, the walker in Alex Plotnick's CLWEB literate programming system (github link), or the code walker in iterate.

Pascal Costanza"Why I like Common Lisp"

· 10 days ago
In a recent email exchange discussion, Charlotte Herzeel gave a summary of Common Lisp that I believe is worth repeating publicly. With her permission, I repeat her statements here.

"An important reason why I like Common Lisp a lot is that the language has a layered design that supports incremental development. The language provides very high-level programming abstractions, such as object-oriented programming, dynamic multiple dispatch, garbage collection, a meta-object protocol, and so on. These abstractions are typically open implementations, built on top of more efficient low-level abstractions the user can also choose to access directly.

Common Lisp is typically implemented as a compiled language, compiling directly to machine code. The runtime components are sparse, the garbage collector being an important one. Common Lisp provides the means to steer the compiler and runtime components to do low-level optimizations. Examples of this include: type declarations to remove type-checking at runtime; inline declarations to avoid dispatch; dynamic extent declarations to perform stack allocation instead of heap allocation; disassembly of code snippets; tuning of the garbage collector to switch between collection strategies; and so on. Optimizations such as these are optional and localized. Hence it is very easy in Common Lisp to rapidly prototype and then incrementally optimize the code by identifying the hotspots through profiling. This way you can often be as efficient as with C code, without being forced to program in a low-level style from the start throughout your whole program.

Hence in contrast to C/C++, Common Lisp allows you to optimize code incrementally and locally for a particular snippet of code. In contrast to Java - or any other language with an implementation that performs optimization at runtime through tracing or JIT compiling or so - Common Lisp implementations employ in a sense a more classic compilation approach. In this sense, Common Lisp makes it easier to 'control' what you are measuring when profiling programs.

The Common Lisp Object System (CLOS) is a library in Common Lisp for object-oriented programming. Common Lisp is a multi-paradigm language, so it depends on your problem whether it is a good idea to use object-oriented programming or not. That said, CLOS is very different from mainstream object-oriented programming. It allows multiple inheritance, multiple dispatch, and is based on generic functions, i.e. classes define types, and methods are defined separately as part of generic functions. The CLOS implementation performs a lot of clever optimizations at runtime, for example for method lookup. What is of course special about CLOS, is that it has a meta-object protocol, which allows you to extend/modify CLOS in an organized way. For example, you have hooks into the method dispatch protocol, the slot (= field) access protocol, etc. If you want to know more about the CLOS implementation and the meta-object protocol, read 'The Art of the Meta-Object Protocol' by Kiczales, des Rivieres, Bobrow.

Common Lisp just has a lot of advanced language features that you just don't find in other languages.

From a practical point of view, I can recommend LispWorks as a Common Lisp implementation. LispWorks is very user-friendly because it comes with an integrated development environment. This means you get Smalltalk-like features such as code browsers and inspector tools. Another user-friendly implementation that is free is Clozure Common Lisp. The most widely used open-source implementation is SBCL, which is very stable and very efficient. There are lots of other Common Lisp implementations out there, but I recommend one of these three.

If you want to learn about Common Lisp, I can recommend "Ansi Common Lisp" by Graham. Maybe also interesting: 'Pascal Costanza's highly opinionated guide to Common Lisp' ;-). If you want a funny introduction to Common Lisp, check out the Lisperati. A good place to snoop for Common Lisp war stories is Planet Lisp. If you want to get an idea about libraries, see quicklisp."

Timofei ShatrovMy little transliterator can't be this CLOS

· 10 days ago

If you are reading this blog, you are probably able to read Latin script. It is pretty widespread in the world, and used by 70% of the world’s population according to Wikipedia. Perhaps, like me, your native language uses a different script. There are many writing systems in the world, some are related, and some are wildly different from each other. Fortunately with the advent of the Internet and tools like Google Translate it is increasingly possible to read text not only in the language you don’t understand, but even the languages where you don’t even understand their writing system.

Well, Google is Google, but is it possible for a mere mortal to create something like that? Not to translate, but just to present some unknown writing system in your preferred alphabet (the process is called transliteration or transcription)? There’s no reason why not.

In this post I’ll talk about the process of romanization of Japanese language, which is transcription from Japanese to Latin script. For example “ありがとうございます" is romanized to "arigatō gozaimasu" under Hepburn romanization method (there are many of those).

First off, the basics of Japanese writing are as follows:

  1. There are several scripts used to write in Japanese language.
  2. Hiragana is a syllabary (a writing system where each character represents a syllable) that is used for words of Japanese origin.
  3. Katakana is another syllabary that is used for loan words. Every possible syllable in Japanese language has a hiragana and katakana form, which usually are completely different. Both scripts have about 50 characters in them.
  4. Chinese characters (kanji) are used for words of Japanese and Chinese origin. There are thousands of such characters. Furthermore, most of them could be read in several different ways, which makes transcribing them difficult. We’re going to ignore those for now.

If we focus on romanization of hiragana and katakana (both systems are called kana for short) then the process seems pretty simple. It’s just a matter of replacing each kana with the syllable it represents, written in roman letters. However there are some characters that do not represent a syllable, but rather modify a syllable before or after that character. This includes sokuon, which doubles the consonant of the next syllable and yoon characters, which are a small version of normal kana and are used to modify a vowel of a preceding syllable.

Ok, so the first thing we must do is to somehow bring order to this madness. Since there is hiragana and katakana version of each character, it doesn’t make sense to work with the characters directly. Instead I’m going to replace each character with a keyword.

(defparameter *sokuon-characters* '(:sokuon "っッ"))

(defparameter *iteration-characters* '(:iter "ゝヽ" :iter-v "ゞヾ"))

(defparameter *modifier-characters* '(:+a "ぁァ" :+i "ぃィ" :+u "ぅゥ" :+e "ぇェ" :+o "ぉォ"
                                      :+ya "ゃャ" :+yu "ゅュ" :+yo "ょョ"
                                      :long-vowel "ー"))

(defparameter *kana-characters*
  '(:a "あア"     :i "いイ"     :u "うウ"     :e "えエ"     :o "おオ"
    :ka "かカ"    :ki "きキ"    :ku "くク"    :ke "けケ"    :ko "こコ"
    :sa "さサ"    :shi "しシ"   :su "すス"    :se "せセ"    :so "そソ"
    :ta "たタ"    :chi "ちチ"   :tsu "つツ"   :te "てテ"    :to "とト"
    :na "なナ"    :ni "にニ"    :nu "ぬヌ"    :ne "ねネ"    :no "のノ"
    :ha "は" :hha "ハ" :hi "ひヒ" :fu "ふフ"  :he "へヘ"    :ho "ほホ"
    :ma "まマ"    :mi "みミ"    :mu "むム"    :me "めメ"    :mo "もモ"
    :ya "やヤ"                  :yu "ゆユ"                 :yo "よヨ"
    :ra "らラ"    :ri "りリ"    :ru "るル"    :re "れレ"    :ro "ろロ"
    :wa "わワ"    :wi "ゐヰ"                 :we "ゑヱ"    :wo "を" :wwo "ヲ"
    :n "んン"

    :ga "がガ"    :gi "ぎギ"    :gu "ぐグ"    :ge "げゲ"    :go "ごゴ"
    :za "ざザ"    :ji "じジ"    :zu "ずズ"    :ze "ぜゼ"    :zo "ぞゾ"
    :da "だダ"    :dji "ぢヂ"   :dzu "づヅ"   :de "でデ"    :do "どド"
    :ba "ばバ"    :bi "びビ"    :bu "ぶブ"    :be "べベ"    :bo "ぼボ"
    :pa "ぱパ"    :pi "ぴピ"    :pu "ぷプ"    :pe "ぺペ"    :po "ぽポ"

(defparameter *all-characters* (append *sokuon-characters*

(defparameter *char-class-hash*
  (let ((hash (make-hash-table)))
    (loop for (class chars) on *all-characters* by #'cddr
         do (loop for char across chars
               do (setf (gethash char hash) class)))

(defun get-character-classes (word)
  (map 'list (lambda (char) (gethash char *char-class-hash* char)) word))

 This creates a hash table that maps every kana to a keyword that describes it and we can now trivially convert a word into a list of “character classes” (or the characters themselves for non-kana characters). Then we need to transform this list into a kind of AST where modifier characters have the role of functions.

(defun process-modifiers (cc-list)
  (loop with result
       for (cc . rest) on cc-list
       if (eql cc :sokuon)
         do (push (cons cc (process-modifiers rest)) result) (loop-finish)
       else if (member cc *modifier-characters*)
         do (push (list cc (pop result)) result)
       else do (push cc result)
       finally (return (nreverse result))))

This is your basic push/nreverse idiom with some extra recursiveness added. Sokuon is applied to everything to the right of it, because I wanted it to have lower precedence, i.e. (:sokuon :ka :+yu) is parsed as (:sokuon (:+yu :ka)) instead of the other way around. Now we can write the outline of our algorithm:

(defun romanize-core (method cc-tree)
  (with-output-to-string (out)
    (dolist (item cc-tree)
      (cond ((null item)) 
            ((characterp item) (princ item out))
            ((atom item) (princ (r-base method item) out))
            ((listp item) (princ (r-apply (car item) method (cdr item)) out))))))

The functions r-base and r-apply are generic functions that will depend on the method of romanization. Another generic function will be r-simplify that will “pretty up” the result. It is easy to write some reasonable fallback methods for them:

(defgeneric r-base (method item)
  (:documentation "Process atomic char class")
  (:method (method item)
    (string-downcase item)))

(defgeneric r-apply (modifier method cc-tree)
  (:documentation "Apply modifier to something")
  (:method ((modifier (eql :sokuon)) method cc-tree)
    (let ((inner (romanize-core method cc-tree)))
      (if (zerop (length inner)) inner
          (format nil "~a~a" (char inner 0) inner))))
  (:method ((modifier (eql :long-vowel)) method cc-tree)
    (romanize-core method cc-tree))
  (:method ((modifier symbol) method cc-tree)
    (format nil "~a~a" (romanize-core method cc-tree) (string-downcase modifier))))
(defgeneric r-simplify (method str)
  (:documentation "Simplify the result of transliteration")
  (:method (method str) str))

Of course relying on symbol names isn’t flexible at all. It’s better to have a mapping from each keyword to a string that represents it. This is where we have to resort to classes to store this mapping in a slot.

(defclass generic-romanization ()
  ((kana-table :reader kana-table
               :initform (make-hash-table))))

(defmethod r-base ((method generic-romanization) item)
  (or (gethash item (kana-table method)) (call-next-method)))

(defmethod r-apply ((modifier symbol) (method generic-romanization) cc-tree)
  (let ((yoon (gethash modifier (kana-table method))))
    (if yoon
        (let ((inner (romanize-core method cc-tree)))
          (format nil "~a~a" (subseq inner 0 (max 0 (1- (length inner)))) yoon))

(defmacro hash-from-list (var list)
  (alexandria:with-gensyms (hash key val)
    `(defparameter ,var
       (let ((,hash (make-hash-table)))
         (loop for (,key ,val) on ,list
              do (setf (gethash ,key ,hash) ,val))

(hash-from-list *hepburn-kana-table*
                '(:a "a"      :i "i"      :u "u"      :e "e"      :o "o"
                  :ka "ka"    :ki "ki"    :ku "ku"    :ke "ke"    :ko "ko"
                  :sa "sa"    :shi "shi"  :su "su"    :se "se"    :so "so"
                  :ta "ta"    :chi "chi"  :tsu "tsu"  :te "te"    :to "to"
                  :na "na"    :ni "ni"    :nu "nu"    :ne "ne"    :no "no"
         :ha "ha" :hha "ha"   :hi "hi"    :fu "fu"    :he "he"    :ho "ho"
                  :ma "ma"    :mi "mi"    :mu "mu"    :me "me"    :mo "mo"
                  :ya "ya"                :yu "yu"                :yo "yo"
                  :ra "ra"    :ri "ri"    :ru "ru"    :re "re"    :ro "ro"
                  :wa "wa"    :wi "wi"                :we "we"    :wo "wo" :wwo "wo"
                  :n "n"
                  :ga "ga"    :gi "gi"    :gu "gu"    :ge "ge"    :go "go"
                  :za "za"    :ji "ji"    :zu "zu"    :ze "ze"    :zo "zo"
                  :da "da"    :dji "ji"   :dzu "zu"   :de "de"    :do "do"
                  :ba "ba"    :bi "bi"    :bu "bu"    :be "be"    :bo "bo"
                  :pa "pa"    :pi "pi"    :pu "pu"    :pe "pe"    :po "po"
                  :+a "a"     :+i "i"     :+u "u"     :+e "e"     :+o "o"
                  :+ya "ya"               :+yu "yu"               :+yo "yo"
(defclass generic-hepburn (generic-romanization) ((kana-table :initform (alexandria:copy-hash-table *hepburn-kana-table*))))

I’m going for a rather versatile class hierarchy here, starting with a completely empty kana-table for generic-romanization method, but defining the methods on it that will work for any table. Then I define a class generic-hepburn that will be the basis for different hepburn variations. The table is taken from Wikipedia article on Hepburn romanization, which is pretty detailed. By carefully reading it, we can identify the exceptions that the above functions can’t handle. For example a :sokuon before :chi is romanized as “tchi” and not as “cchi” as it would by the simple consonant-doubling method. Another exception is that, for example, :chi followed by :+ya is romanized as “cha”, not “chya”. CLOS makes it easy to handle these irregularities before passing the torch to a less specific method.

(defmethod r-apply ((modifier (eql :sokuon)) (method generic-hepburn) cc-tree)
  (if (eql (car cc-tree) :chi)
      (concatenate 'string "t" (romanize-core method cc-tree))

(defmethod r-apply ((modifier (eql :+ya)) (method generic-hepburn) cc-tree)
  (case (car cc-tree)
    (:shi "sha")
    (:chi "cha")
    ((:ji :dji) "ja")
    (t (call-next-method))))
... and the same for :+yu and :+yo

Another thing Hepburn romanizations do is simplifying double vowels like “oo”, “ou” and “uu”. For example, our generic-hepburn will romanize “とうきょう” as “toukyou”, while most people are more familiar with the spelling “Tokyo” or “Tōkyō”.

(defun simplify-ngrams (str map)
  (let* ((alist (loop for (from to) on map by #'cddr collect (cons from to)))
         (scanner (ppcre:create-scanner (cons :alternation (mapcar #'car alist)))))
    (ppcre:regex-replace-all scanner str 
                             (lambda (match &rest rest)
                               (declare (ignore rest))
                               (cdr (assoc match alist :test #'equal)))
                             :simple-calls t)))

(defclass simplified-hepburn (generic-hepburn)
  ((simplifications :initform nil :initarg :simplifications :reader simplifications
                    :documentation "List of simplifications e.g. (\"ou\" \"o\" \"uu\" \"u\")"

(defmethod r-simplify ((method simplified-hepburn) str)
  (simplify-ngrams (call-next-method) (simplifications method)))

(defclass traditional-hepburn (simplified-hepburn)
  ((simplifications :initform '("oo" "ō" "ou" "ō" "uu" "ū"))))

I’m using the “parse tree" feature of CL-PPCRE here to create a complex :alternation regex on the fly and then use regex-replace-all with a custom replacing function. It’s probably not the most efficient method, but sometimes outsourcing string manipulations to a well-tested regex engine is the least painful solution. Anyway, we’re really close now, and all that’s left is to chain up our functions for a useful API.

(defparameter *hepburn-traditional* (make-instance 'traditional-hepburn))

(defvar *default-romanization-method* *hepburn-traditional*)

(defun romanize-list (cc-list &key (method *default-romanization-method*))
  "Romanize a character class list according to method"
  (let ((cc-tree (process-modifiers cc-list)))
    (values (r-simplify method (romanize-core method cc-tree)))))

(defun romanize-word (word &key (method *default-romanization-method*))
  "Romanize a word according to method"
  (romanize-list (get-character-classes word) :method method))

>>> (romanize-word "ありがとうございます")

At my Github you can find an unabridged version of the above code. However there are still some difficult problems with romanization of Japanese that can’t be solved as easily. Even leaving kanji aside, the hiragana character は is pronounced either as “ha” or “wa” depending on whether it is used as a particle. For example a common greeting “こんにちは” is romanized as “konnichiwa” and not “konnichiha” because は plays the role of a particle. Which brings us to another problem: there are no spaces between the words, so it’s not possible to determine whether は is a part of a word or a standalone particle without a dictionary, and even then it can be ambiguous! I’m ending the post on this note, since I’m still not sure how to solve this. さようなら!

Joe MarshallAnother stupid homographic function trick

· 12 days ago
In my last post I showed that if you take a homographic function and apply it to a fraction, you can partly apply the function to the integer part of the fraction and get a new homographic function. The new function can be applied to the non-integer part of the fraction to generate an answer equivalent to the original function applied to the original fraction.

It turns out that you can go in the other direction as well. You can partly evaluate a homographic function. For example, consider this homographic function:
((lambda (t)
   (/ (+ (* 70 t) 29)
      (+ (* 12 t)  5))) n)
Which we intend to apply to some positive number n. Even if all we know is that n is positive, we can deduce that the value of the homographic function is between 29/5 (when t is 0) and 70/12 (as t goes to infinity). The integer part of those values are the same, so we can factor that out:
(+ 5 (/ 1 ((lambda (t)
               (/ (+ (* 12 t) 5)
                  (+ (* 10 t) 4))) n)))
The partial answer has an integer value of 5 and a fractional part that contains a new homographic function applied to our original n. We can do it again:
(+ 5 (/ 1
        (+ 1 (/ 1
                ((lambda (t)
                   (/ (+ (* 10 t) 4)
                      (+ (* 2 t) 1))) n)))))
The fractional part of the answer can itself be factored into another integer and a new homographic function applied to our original n.

A generalized continued fraction is a number of the form:
If all the bi are 1, then it is a simple continued fraction. You can turn generalized continued fractions into a simple continued fraction by doing the algebra.

What happens if you partly apply a homographic function to a continued fraction? The algebra is tedious, but here's what happens:
((lambda (t)
   (/ (+ (* 2 t) 1)
      (+ (* 1 t) 3))) (+ 3 (/ 1 (+ 7 (/ 1 16)))))

;; after one step
((lambda (t)
   (/ (+ (* 7 t) 2)
      (+ (* 6 t) 1))) (+ 7 (/ 1 16)))

;; after two steps
((lambda (t)
   (/ (+ (* 51 t) 7)
      (+ (* 43 t) 6))) 16)
By partly apply a homographic function to a continued fraction, we can process the integer part separately and before the fractional part. By partly evaluating the application of a homographic function, we can often determine the integer part without fully evaluating the argument to the function. For example, after step one above, we could instead partially evaluate the application:
;; after one step
((lambda (t)
   (/ (+ (* 7 t) 2)
      (+ (* 6 t) 1))) (+ 7 (/ 1 16)))

;; Alternatively, partially evaluate first term
(+ 1 (/ 1
       ((lambda (t)
           (/ (+ (* 6 t) 1)
              (+ (* 1 t) 1))) (+ 7 (/ 1 16)))))

Joe MarshallStupid homographic function trick

· 14 days ago
A function of the form
is called a homographic function.  Here is one in Scheme:
(lambda (t)
   (/ (+ (* 3 t) 4)
      (+ (* 5 t) 2)))
And here is what it's graph looks like:
If you multiply all the coefficients (a, b, c, and d) by the same number, it doesn't change the function. For instance, this homographic function:
(lambda (t)
   (/ (+ (* 21 t) 28)
      (+ (* 35 t) 14)))
is the same as the one above. If one of your coefficients isn't an integer, don't worry, you can multiply everything by the denominator and get an equivalent homographic function. On the other hand, you can divide all your coefficients by their greatest common divisor and get an equivalent homographic function with smaller coefficients. We'll keep our homographic functions in smallest integer form.

A rational number can be written as the sum of an integer and a fraction less than one. For example, 23/5 = 4 + 3/5.

Let's apply a homographic function to a rational number:
((lambda (t)
   (/ (+ (* a t) b)
      (+ (* c t) d))) (+ x y/z))

;; substitute
(/ (+ (* a (+ x y/z)) b)
   (+ (* c (+ x y/z)) d))

;; distribute the multiplication
(/ (+ (* a x) (* a y/z) b)
   (+ (* c x) (* c y/z) d))

;; multiply top and bottom by z/y
(/ (* z/y (+ (* a x) (* a y/z) b))
   (* z/y (+ (* c x) (* c y/z) d)))

;; distribute the multiplication
(/ (+ (* a x z/y) (* a y/z z/y) (* b z/y))
   (+ (* c x z/y) (* c y/z z/y) (* d z/y)))

;; simplify
(/ (+ (* a x z/y) a (* b z/y))
   (+ (* c x z/y) c (* d z/y)))

;; rearrange terms
(/ (+ (* a x z/y) (* b z/y) a)
   (+ (* c x z/y) (* d z/y) c))

;; factor out z/y
(/ (+ (* (+ (* a x) b) z/y) a)
   (+ (* (+ (* c x) d) z/y) c))

Now we do something tricky. We abstract out the z/y term:
((lambda (t)
   (/ (+ (* (+ (* a x) b) t) a)
      (+ (* (+ (* c x) d) t) c))) (/ z y))
If introduce a let form, we can see something interesting:
((lambda (t)
   (let ((a1 (+ (* a x) b)) 
         (b1 a)
         (c1 (+ (* c x) d))
         (d1 c))
     (/ (+ (* a1 t) b1)
        (+ (* c1 t) d1)))) (/ z y))
We find a new homographic function being applied to a new rational number. The new homographic function has coefficients related to the original one, and the new rational number is the reciprocal of the fractional part of the original rational number. So if we have a homographic function hf applied to a fraction of the form x + y/z, we can easily find a new homographic function hf' that when applied to z/y will produce the same answer as the original. Applying a homographic function to a fraction has the effect of "eating" the integer part of the fraction, which generates a new homographic function that is applied to the reciprocal of the fractional part.

Vsevolod DyomkinHow to write an English POS tagger with CL-NLP

· 14 days ago


The problem of POS tagging is a sequence labeling task: assign each word in a sentence the correct part of speech. For English, it is considered to be more or less solved, i.e. there are taggers that have around 95% accuracy. It also has a rather high baseline: assigning each word its most probable tag will give you up to 90% accuracy to start with. Also, only around 10% of the words have more than 1 possible tag. However, their usage accounts for around 40% of total text volume.

Let's start our exercise by first verifying these facts in a data-driven manner. This will also allow us to examine the basic building blocks for our solution.

POS taggers are usually built as statistical models trained on some existing pre-labeled data. For English language there is quite a lot of it already available. For other languages, that don't possess such data, an unsupervised or a rule-based approach can be applied.

Data sources

The standard dataset that is used not only for training POS taggers, but, most importantly, for evaluation is the Penn Tree Bank Wall Street Journal dataset. It contains of not only POS tag, but also noun phrase and parse tree annotations.

Here's an example of the combined POS tag and noun phrase annotations from this corpus:

[ Pierre/NNP Vinken/NNP ]
[ 61/CD years/NNS ]
old/JJ ,/, will/MD join/VB
[ the/DT board/NN ]
[ a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ]

The tagset used in the annotation contains such symbols as NNP for proper nouns, , for commas, and CD for cardinal numbers. The whole set is provided for CL-NLP in the file src/syntax/word-tags.txt. Here's a snippet from it:

X Unknown, uncertain, or unbracketable. X is often used for bracketing typos and in bracketing the...the-constructions.
CC Coordinating conjunction

It's, obviously, possible to extend it with other tags if necessary. All of them are, finally, available as symbols of the tag package in CL-NLP.

Available data and tools to process it

What we're interested in, is obtaining a structured representation of this data. The ncorp package implements interfaces to various raw representations, such as this one.

Different NLP corpora exist in various formats:

Most of these representations are supported by the ncorp adapters at least to some extent. The interface of this module consists of the following entities:

For our task we'll be able to utilize just the tokens slot of the ptb-tagged-text structure, produced with map-corpus. Let's collect the tag distribution for each word from the WSJ section of the PTB:

NLP> (let ((words-dist #h(equal))
(map-corpus :ptb-tagged (corpus-file "ptb/TAGGED/POS/WSJ")
#`(dolist (sent (text-tokens %))
(dolist (tok sent)
(unless (in# (token-word tok) words-dist)
(:= (get# (token-word tok) words-dist) #h()))
(:+ (get# (token-tag tok)
(get# (token-word tok) words-dist)
:ext "POS")
#<HASH-TABLE :TEST EQUAL :COUNT 51457 {10467E6543}>
NLP> (reduce #'+ (mapcan #'ht-vals (ht-vals *)))

So, we have around 50k distinct words and 1,3m tokens.

But, actually, the resources in the field has made some progress in the last decades, and there's a bigger corpus now available that contains not only the whole Penn Tree Bank, but also some more data from other domains. The annotations of the WSJ section in it were also improved. It is called OntoNotes. Let's do the same with its data:

NLP> (let ((words-dist #h(equal)))
(map-corpus :treebank (corpus-file "ontonotes/")
#`(dolist (sent (text-tokens %))
(dolist (tok sent)
(with-accessors ((tag token-tag) (word token-word)) tok
(unless (eql tag 'tag:-NONE-)
(unless (in# word words-dist)
(:= (get# word words-dist) #h()))
(:+ (get# tag (get# word words-dist) 0))))))
:ext "parse")

So, in the OntoNotes 4.0 there are 60925 distinct words. 50925 of them (~84%) are tagged with a single tag. I.e. we have a 16% of multi-tag words which corresponds well with the theoretical data. Also, there are 2,1m tokens in the corpus in total.

Calculating the number of words with distinct tags:

(count-if-not #`(= 1 (ht-count (rt %)))
(ht->pairs words-dist))

And what about the total volume?

NLP> (let ((total1 0)
(total 0))
(map-corpus :treebank "ontonotes"
#`(dolist (sent (text-tokens %))
(dolist (tok sent)
(unless (eql (token-tag tok) 'tag:-NONE-)
(:+ total)
(when (in# (token-word tok) *single-tag-words*)
(:+ total1)))))
:ext "parse")
(float (/ total1 total)))

Only 24% instead of 60%! What's wrong?

OK, here's the trick: let's add words that have more than 1 tag, but >99% of their occurrences are labeled with a single tag. For instance, the word "the" has 9 distinct tags in OntoNotes, but 0.9997 of the times it's a DT.

If we consider such words to have a single tag, we'll get just a slight increase in the number of single-tag words (+386: 51302 instead of 50916), but a dramatic increase in the volume of their occurrence - now it's 63%! Just as the literature tells us.

(NB. Taking such shortcut will only increase the quality of the POS tagger as 99% is above the accuracy it will be able to achieve anyhow, which is at most 97% on the same-domain data and even lower for out-of-domain data.)

Here's how we can determine such set of words:

(remove-if-not #`(let ((tag-dist (ht-vals (rt %))))
(> (/ (reduce #'max tag-dist)
(reduce #'+ tag-dist))
(ht->pairs tag-dist))

NB. The above code samples contain some non-standard utilities and idioms that may look somewhat alien to some Lisp programmers. All of them are from my RUTILS library, and you'll see more below. Mostly, these include some hash-table-specific operators, new iteration constructs, a few radical abbreviations for common operations, and literal syntax for hash-tables (#h()) and anonymous functions (#`()).

Some of them are:

Building the POS tagger

We have explored how to access different corpus data that we'll need to train the POS tagger. To actually do that, we'll re-implement the approach described by Matthew Honnibal in "A good POS tagger in about 200 lines of Python". In fact, due to the expressiveness of Lisp and efficiency of SBCL, we'll need even less than 200 lines, and we'll get the performance comparable to a much more complex Cyton implementation of the parser (6s against 4s on 130k tokens), but that's details... ;)

Here's the source code we'll be discussing below on github.

Our tagger will use a greedy averaged perceptron model with single-tag words dictionary lookup:

(defclass greedy-ap-tagger (avg-perceptron tagger)
((dict :initform #h(equal) :initarg :dict :accessor tgr-dict)
(single-tag-words :initform #h(equalp) :initarg :single-tag-words
:accessor tgr-single-tag-words))
"A greedy averaged perceptron tagger with single-tag words dictionary lookup."))

As you see, it is derived from a generic class tagger and an avg-perceptron learning model. It also has a dict slot that holds all the words known to the tagger.

Every tagger has just one generic function associated with it. You guessed its name - tag :)

(defmethod tag ((tagger greedy-ap-tagger) (sentence sentence))
(let ((prev :-START-)
(prev2 :-START2-)
(ctx (sent-ctx sentence)))
(doindex (i token (sent-tokens sentence))
(:= (token-tag token)
(classify tagger
(extract-fs tagger i (token-word token) ctx prev prev2))
prev2 prev
prev (token-tag token)))

It accepts an already tokenized sentence and (destructively) assigns tags to each of its tokens.

The main job is performed by the call to classify method that is defined for every statistical learning model in CL-NLP. Another model-associated method here is extract-fswhich produces a list of features that describe the current sample.

Now, let's take a look at the implementation of these learning model-related methods.

(defmethod classify ((model greedy-ap-tagger) fs)
(or (get# (first fs) (tgr-single-tag-words model))
(call-next-method model (rest fs))))

For the tagger, we first check the current word against the dictionary of single-tag-words that we've built in the previous part and then call the classify method of the avg-perceptron model itself. That one is a matter of simply returning a class that is an argmax of a dot product between model weights and sample features fsthat in this case can only have values of 1 or 0.

(defmethod classify ((model greedy-ap) fs)
(let ((scores #h()))
(dotable (class weights (m-weights model))
(dolist (f fs)
(:+ (get# class scores 0) (get# f weights 0))))
(keymax scores))) ; keymax is argmax for hash-tables

As you see, averaged perceptron is very straightforward - a simple linear model that has a weights slot which is a mapping of feature weights for every class. In the future this method will probably be assigned to a linear-model class, but it hasn't been added to CL-NLP so far.


Let's take a look at the training part. It consists of 2 steps. extract-fs performs feature extraction. What it, basically, does in our case of a simple perceptron model is returning a list of features preceded by the word we're currently tagging.

(cons word (make-fs model
("i pref1" (char word 0))
("i word" word)
("i-1 tag" prev-tag)

The make-fs macro is responsible for interning the features as symbols in package f by concatenating the provided prefixes and calculated variables. This is a standard Lisp practice to use symbols instead of raw strings for such things. So, in the above example for the word "the" preceded by a word tagged as VBZwill get the following list of features:

'("the" f:|bias| f:|i pref1 t| f:|word the| f:|i-1 tag VBZ| ...)

The second part of learning is training. It is the most involved procedure here, yet still very simple conceptually. Just like with the tag method, we're iterating over all tokens preceded by a dummy :-START2- and :-START- ones, and guessing the current tag using classify. Afterwards we're updating the model's weights in train1. The only difference is that we need to explicitly first consider the case of single-tag-words not to run the model update step for it.

This is how it all looks modulo debugging instrumentation. Note the use of psetf to update prev and prev2 simultaneously.

(defmethod train ((model greedy-ap-tagger) sents &key (epochs 5))
(with-slots (single-tag-words dict) model
;; expand dict
(dolist (sent sents)
(dolist (tok (sent-tokens sent))
(set# (token-word tok) dict nil)))
;; expand single-tag-words
(dotable (word tag (build-single-tag-words-dict (mapcar #'sent-tokens sents)
:ignore-case? t))
(unless (in# word single-tag-words)
(set# word single-tag-words tag)))
;; train
(dotimes (epoch epochs)
(dolist (sent (mapcar #'sent-tokens sents))
(let* ((prev :-START-)
(prev2 :-START2-)
(ctx (sent-ctx sent)))
(doindex (i token sent)
(let ((word (token-word token)))
(psetf prev
(or (get# word single-tag-words)
(let* ((fs (extract-fs model i word ctx prev prev2))
(guess (classify model fs)))
(train1 model (rest fs) (token-tag token) guess)
prev2 prev)))))
(:= sents (shuffle sents))))

Note the additional expansion of the single-tag-words dict of the model (as well as of the normal dict).

An interesting feature of the problem's object-oriented decomposition in this case is that we have a generic perceptron machinery we'd like to capture and reuse for different concrete models, and a model-specific implementation details.

This dichotomy is manifested in the training phase:

Evaluation & persisting the model

We have reached the last part of every machine learning exercise - evaluation. Usually it's about measuring precision/recall/f-measure, but in the tagger case both precision and recall are the same, because the sets of relevant and retrieved items are the same, so we can calculate just the accuracy:

NLP> (accuracy *tagger* *gold-test*)

A "gold" corpus is used for evaluation. This one was performed on the standard evaluation set which is the Wall Street Journal corpus (parts 22-24), OntoNotes 4.0 version. The model was also trained on the standard training set (0-18). Its results are consistent with the performance of the reference model from the blog post. The "gold" features where obtained by calling the extract-gold method of our model on the data from the treebank.

But wait, we can do more.

First, on the evaluation part. It's not being a secret already for a long time in the NLP community that WSJ corpus is far from representative to the real-world use cases. And I'm not even talking of twitter here, but just various genres of writing have different vocabularies and distributions of sentence structures. So, the high baselines shown by many results on the WSJ corpus may not be that robust. To help with such kind of evaluation Google and Yahoo have recently released another treebank called WebText that collect 5 different types of texts seen on the web: from dialogues to blog posts. It's smaller than Penn Treebank: 273k tokens isntead of 1,3m with 23k distinct word types. If we evaluate on it the accuracy drops substantially: only 89.74406!

What we can do is train on more data with better variability. Let's retrain our model on the whole OntoNotes (minus the evaluation set of WSJ 22-24). Here are the results:

So, broader data helps. What else can we do?

Another aspect we haven't touched is normalization. There are some variants of generating arbitrary tokens in English which lend themselves well to normalization to some root form. These include numbers, emails, urls, and hyphenated words. The normalization variant proposed by Honnibal is rather primitive and can be improved.

Here's an original variant:

(defmethod normalize ((model greedy-ap-tagger) (word string))
((and (find #\- word) (not (char= #\- (char word 0))))
((every #'digit-char-p word)
(if (= 4 (length word)) "!YEAR" "!DIGITS"))
(t (string-downcase word))))

And here's a modified one:

(defmethod normalize ((model greedy-ap-tagger) (word string))
((re:scan *number-regex* word) (make-string (length word) :initial-element #\0))
((re:scan *email-regex* word) "!EMAIL")
((re:scan *url-regex* word) "!URL")
((in# word (tgr-dict model)) (string-downcase word))
((position #\- word :start 1 :from-end t)
(let ((suffix (slice word (1+ it))))
(if (in# suffix (tgr-dict model))
(string-downcase suffix)
(t (string-downcase word))))

Such change allows to gain another 0.06% accuracy on the Webtext corpus. So, normalization improvement doesn't help that much. However, I think it should be more useful in real-world scenarios.

Now, as we finally have the best model we need a way to persist and restore it. The corresponding save-model/load-model methods exist for any categorical model. They use the handy ZIP and USERIALlibraries to save models into a single zip file, serializing textual (categories and feature names) and binary data (floating point weights) into separate files. Here's how our serialized POS tagger model looks like:

  Length  File
-------- --------------------
552 classes.txt
4032099 fs.txt
2916012 fs.bin
2916012 weights.bin
35308 single-tag-words.txt
484712 dict.txt
-------- --------------------
10384695 6 files

Finally, I believe, it's an essential practice to make all results we post online reproducible, but, unfortunately, there are restrictions on the use of the Pen Treebank corpus data, so we can't just add an automated test that will reproduce the contents of this post. Still, a small portion of OntoNotes WSJ corpus can be used under the fair use policy, and it is provided with CL-NLP for evaluation purposes.

Let's add such a test to give the users confidence in the performance of our model. For testing CL-NLP I'm using yet another my own library which is called SHOULD-TEST - I'll have another blog devoted to it some time in the future.

Here's a test we need:

(defun extract-sents (text)
(mapcar #`(make 'ncore:sentence :tokens (ncorp:remove-dummy-tokens %))
(ncore:text-tokens text)))

(defvar *tagger* (load-model (make 'greedy-ap-tagger)
(models-file "pos-tagging/")
:classes-package :tag))
(defvar *gold*
(let (test)
(ncorp:map-corpus :treebank (corpus-file "onf-wsj/")
#`(appendf test (extract-sents %)))
(extract-gold *tagger* test)))

(deftest greedy-ap-tagger-quality ()
(should be = 96.31641
(accuracy *tagger* *gold*)))

Summing up

In this article I've tried to describe the whole process of creating a new statistics-based model using CL-NLP. As long as you have the necessary data, it is quite straightforward and commonplace.

If you want to use one of the existing models (namely, greedy averaged perceptron, as of now) you can reuse almost all of the machinery and just add a couple of functions to reflect the specifics of your task. I think, it's a great demonstration of the power of the generic programming capabilities of CLOS.

Obviously, feature engineering is on you, but training/evaluation/saving/restoring the model can be handled transparently by CL-NLP tools. There's also support for common data processing and calculation tasks.

We have looked at some of the popular corpora in this domain (all of which, unfortunately, have some usage restrictions and are not readily available, but can be obtained for research purposes). And we've observed some of factors that impact the performance and robustness of machine learning models. I'd say that our final model is of the production-ready state-of-the-art level, so you can safely use it for your real-world tasks (under the licensing restrictions of the OntoNotes dataset used for training it). Surely, if you have your own data, it should be straightforward to re-train the model with it.

You can also add your own learning algorithms, and I'm going to be continue doing the same likewise.

Stay tuned and have fun!

Christophe Rhodesbackquote and pretty printing

· 18 days ago

There was a bit of a kerfuffle following the 1.2.2 release of SBCL, regarding the incompatible change in the internals of the backquote reader macro.

Formally, implementations can choose how to implement the backquote reader macro (and its comma-based helpers): the semantics of backquote are defined only after evaluation:

An implementation is free to interpret a backquoted form F1 as any form F2 that, when evaluated, will produce a result that is the same under equal as the result implied by the above definition, provided that the side-effect behavior of the substitute form F2 is also consistent with the description given above.

(CLHS 2.4.6; emphasis mine)

There are also two advisory notes about the representation:

Often an implementation will choose a representation that facilitates pretty printing of the expression, so that (pprint '`(a ,b)) will display `(a ,b) and not, for example, (list 'a b). However, this is not a requirement.

(CLHS; added quote in example mine), and:

Implementors who have no particular reason to make one choice or another might wish to refer to IEEE Standard for the Scheme Programming Language, which identifies a popular choice of representation for such expressions that might provide useful to be useful compatibility for some user communities.

(CLHS; the Scheme representation reads `(foo ,bar) as (quasiquote (foo (unquote bar))))

The problem the new implementation of backquote is attempting to address is the first one: pretty printing. To understand what the problem is, an example might help: imagine that we as Common Lisp programmers (i.e. not implementors, and aiming for portability) have written a macro bind which is exactly equivalent to let:

(defmacro bind (bindings &body body)
  `(let ,bindings ,@body))

and we want to implement a pretty printer for it, so that (pprint '(progn (bind ((x 2) (z 3)) (if *print-pretty* (1+ x) (1- y))))) produces

  (bind ((x 2)
         (z 3))
    (if *print-pretty*
        (1+ x)
        (1- y))))

What does that look like? Writing pretty-printers is a little bit of a black art; a first stab is something like:

(defun pprint-bind (stream object)
  (pprint-logical-block (stream object :prefix "(" :suffix ")")
    (write (pprint-pop) :stream stream)
    (write-char #\Space stream)
    (pprint-logical-block (stream (pprint-pop) :prefix "(" :suffix ")")
        (write (pprint-pop) :stream stream)
        (pprint-newline :mandatory stream)))
    (pprint-indent :block 1 stream)
    (pprint-newline :mandatory stream)
      (write (pprint-pop) :stream stream)
      (pprint-newline :mandatory stream))))
(set-pprint-dispatch '(cons (eql bind)) 'pprint-bind)

The loop noise is necessary because we're using :mandatory newlines; a different newline style, such as :linear, might have let us use a standard utility function such as pprint-linear. But otherwise, this is straightforward pretty-printing code, doing roughly the equivalent of SBCL's internal pprint-let implementation, which is:

(formatter "~:<~^~W~^ ~@_~:<~@{~:<~^~W~@{ ~_~W~}~:>~^ ~_~}~:>~1I~:@_~@{~W~^ ~_~}~:>")

A few tests at the repl should show that this works with nasty, malformed inputs ("malformed" in the sense of not respecting the semantics of bind) as well as expected ones:

(pprint '(bind))
(pprint '(bind x))
(pprint '(bind x y))
(pprint '(bind (x y) z))
(pprint '(bind ((x 1) (y 2)) z))
(pprint '(bind ((x 1) (y 2)) z w))
(pprint '(bind . 3))
(pprint '(bind x . 4))
(pprint '(bind (x . y) z))
(pprint '(bind ((x . 0) (y . 1)) z))
(pprint '(bind ((x) (y)) . z))
(pprint '(bind ((x) y) z . w))

Meanwhile, imagine a world where the backquote reader macro simply wraps (quasiquote ...) around its argument, and comma likewise wraps (unquote ...):

(set-macro-character #\` (defun read-backquote (stream char)
                           (list 'quasiquote (read stream t nil t))))
(set-macro-character #\, (defun read-comma (stream char)
                           (list 'unquote (read stream t nil t))))

Writing pretty-printer support for that is easy, right?

(defun pprint-quasiquote (stream object)
  (write-char #\` stream)
  (write (cadr object) :stream stream))
(defun pprint-unquote (stream object)
  (write-char #\, stream)
  (write (cadr object) :stream stream))
(set-pprint-dispatch '(cons (eql quasiquote) (cons t null)) 'pprint-quasiquote)
(set-pprint-dispatch '(cons (eql unquote) (cons t null)) 'pprint-unquote)

(ignoring for the moment what happens if the printed representation of object happens to start with a @ or .)

(pprint '(quasiquote (x (unquote y))))

The problem arises when we try to combine these two things. In particular, what happens when we attempt to print backquoted forms:

(pprint '`(bind ,y z))

What we would hope to see is something like

`(bind ,y

but what we actually get is

`(bind (unquote

because each of the bindings in bind is printed individually, rather than the bindings being printed as a whole. And, lest there be hopes that this can be dealt with by a slightly different way of handling the pretty printing in pprint-bind, note that it's important that (pprint '(bind (function y) z)) print as

(bind (function

and not as

(bind #'y

so the only way to handle this is to know the magical symbols involved in backquote and comma reader macros - but that is not portably possible. So, we've come to the point where the conclusion is inevitable: it is not possible for an implementation to support list-structured quasiquote and unquote reader macros and general pretty printing for user-defined operators. (This isn't the only failure mode for the combination of unquote-as-list-structure and pretty-printing; it's surprisingly easy to write pretty-printing functions that fail to print accurately, not just cosmetically as above but catastrophically, producing output that cannot be read back in, or reads as a structurally unequal object to the original.)

The new implementation, by Douglas Katzman, preserves the implementation of the backquote reader macro as a simple list, but comma (and related reader macros) read as internal, literal structures. Since these internal structures are atoms, not lists, they are handled specially by pprint-logical-block and friends, and so their own particular pretty-printing routines always fire. The internal quasiquote macro ends up extracting and arranging for appropriate evaluation and splicing of unquoted material, and everything ends up working.

Everything? Well, not quite: one or two programmer libraries out there implemented some utility functionality - typically variable renaming, automatic lambda generation, or similar - without performing a full macroexpansion and proper codewalk. That code was in general already broken, but it is true that in the past generating an example to demonstrate the breakage would have to violate the general expectation of what "normal" Lisp code would look like, whereas as a result of the new implementation of backquote in SBCL the symptoms of breakage were much easier to generate. Several of these places were fixed before the new implementation was activated, such as iterate's #l macro; among the things to be dealt with after the new implementation was released was the utility code from let-over-lambda (a workaround has been installed in the version distributed from github), and there is still a little bit of fallout being dealt with (e.g. a regression in the accuracy of source-location tracking). But overall, I think the new implementation of backquote has substantial advantages in maintainability and correctness, and while it's always possible to convince maintainers that they've made a mistake, I hope this post explains some of why the change was made.

Meanwhile, I've released SBCL version 1.2.3 - hopefully a much less "exciting" release...

Timofei ShatrovLiving on the edge

· 19 days ago

Lately my primary Lisp has been a SBCL fork for Windows which is based on SBCL 1.1.4 and is now pretty old. The official release of SBCL for Windows is 1.2.1 so I decided to try it out. The installer managed to delete my old version of SBCL, so there was no way back now. I tried to run it, but it still tried to use .core from the old SBCL. Strange, I’m pretty sure the system environment variables have been updated. Yep, I go to system settings and SBCL_HOME points at the correct directory. I run “cmd” and nope, SBCL_HOME points at the old directory. How could that be? After some mucking about, I save the environment variables again and now it has updated. SBCL now runs from command line. Success?

Ok, so I run SLIME and it tries to use some symbol from SBCL system package which has clearly been removed at some point. My SLIME isn’t even that old, last updated in 2013. I actually installed it via Quicklisp, wonder if this will work? I run SBCL from command line and do (ql:update-all-dists). Lots of libraries get updated, including SLIME 2014-08-01. Oh, this is good stuff.

I start up Emacs, load SLIME and face a certain bug I already faced on another computer. At some point SLIME became, let’s say, not very compatible with Emacs 24.1 and 24.2 series, because Emacs developers did something with ‘cl package and SLIME relies on that change. Guess I’ll have to update Emacs too.

As a result I have been forced to update to a shiny new Lisp stack from 2014. To compare, at work we have to use Python 2.6 (released in 2008) and Django 1.3 (released in 2011 and already deprecated). It’s actually amazing how many libraries still run on Python 2.6. Meanwhile Common Lisp as the language hasn’t changed since like the 80s and yet you must always watch out for compatibility issues! Keep up with the times!

Joe MarshallA use of Newton's method

· 21 days ago
I've seen more than one book claim that computing with real numbers inevitably involves round-off errors because real numbers can have an infinite number of digits after the decimal point and no finite representation can hold them. This is false. Instead of representing a real number as a nearby rational number with an error introduced by rounding, we'll represent a real number as computer program that generates the digits. The number of digits generated is potentially infinite, but the program that generates them is definitely finite.

Here is Gosper's algorithm for computing the square root of a rational number.
(define (gosper-sqrt a b c d)
  ;;   Solve for
  ;; ax + b 
  ;; ------  = x
  ;; cx + d
  (define (newtons-method f f-prime guess)
    (let ((dy (f guess)))
      (if (< (abs dy) 1)
          (let ((dy/dx (f-prime guess)))
            (newtons-method f f-prime (- guess (/ dy dy/dx)))))))

  (define (f x)
    (+ (* c x x)
       (* (- d a) x)
       (- b)))

  (define (f-prime x)  
    (+ (* 2 c x)
       (- d a)))

  (let ((value (floor (newtons-method f f-prime b))))
    (cons-stream value
                 (gosper-sqrt (+ (* c value) d)
                              (+ (* (- a (* value c)) value) 
                                 (- b (* value d)))
                              (- a (* value c))))))

1 ]=> (cf:render (gosper-sqrt 0 17 10 0))

;; base 10, 100 digits
1 ]=> (cf:render (gosper-sqrt 0 17 10 0) 10 100)

Clozure CL BlogClozure CL 1.10 pre-release available for testing

· 21 days ago

Please see this openmcl-devel message if you would be interested in testing a pre-release version of CCL 1.10.

Ben HydeGraphical Programming and yEd

· 21 days ago

Graphical programming languages are like red sports cars.  They have lots of curb appeal, but they are rarely safe and reliable.

I long worked for a company whose product featured a very rich graphic programming. It allowed an extremely effective sales process.  The salesman would visit the customer who would sketch a picture of his problem on the whiteboard, and the salesman would enquire about how bad things would get if the problem didn’t get solved.

Meanwhile in the corner the sales engineer would copy the drawing into his notebook.  That night he would create an app in our product who’s front page looked as much like that drawing as possible.  It didn’t really matter if it did anything, but it usually did a little simulation and some icons would animate and some charts’ would scroll.  The customers would be very excited by these little demos.

I consider those last two paragraphs a delightful bit of sardonic humor.  But such products do sell well.   Customers like how pretty they look.  Sales likes them.  Engineering gets to have mixed feelings.  The maintenance contracts can be lucrative.  Thathelps with buisness model volatility.  So yeah, there is plenty of value in graphical programming.

So one of the lightning talks at ILC 2014 caught my attention.  The speaker, Paul Tarvydas, mentioned in passing that he had a little hack based on a free drawing application called yEd.  That evening I wrote a similar little hack.

Using yEd you can make an illustrations, like this one showing the software release process for most startups.

My few lines of code will extract the topology from the drawing, at which point you can build whatever strikes your fancy: code, ontologies, data structures.  (Have I mentioned how much fun it is to use Optima to digest into a glob of XML?  Why yes I have.)

I was also provoked by Fare Rideaus‘ talk.  Fare is evangelizing the idea that we ought to start using Lisp for scripting.   He has a package, cl-launch, intended to support this.  Here’s an example script.   Let’s dump the edges in that drawing:

bash-3.2$ ./ abc.graphml
Alpha -> Beta
Beta -> Cancel
Beta -> Beta
Beta -> Beta

I’ve noticed, dear Reader, that you are very observant.  It’s one of the things I admire about you.  So you wondering: “Yeah Ben, you found too many edges!”   Well, I warned you that these sports cars are rarely safe.  Didn’t I?

Joe MarshallSolutions in search of problems

· 22 days ago
Suppose you have a function like (define foo (lambda (x) (- (* x x x) 30))) and you want to find x such that (foo x) = 0. There are a few ways to go about this. If you can find two different x such that (foo x) is positive for one and negative for the other, then (foo x) must be zero somewhere in between. A simple binary search will find it.
(define (bisection-method f left right)
  (let* ((midpoint (average left right))
         (fmid     (f midpoint)))
    (if (< (abs fmid) 1e-8)
        (let ((fl (f left))
              (fr (f right)))
          (cond ((same-sign? fl fr) (error "Left and right not on opposite sides."))
                ((same-sign? fmid fr) (bisection-method f left midpoint))
                ((same-sign? fl fmid) (bisection-method f midpoint right))
                (else (error "shouldn't happen")))))))

(define (average l r) (/ (+ l r) 2))

(define (same-sign? l r)
  (or (and (positive? l)
           (positive? r))
      (and (negative? l)
           (negative? r))))

1 ]=> (cos 2)

;Value: -.4161468365471424

1 ]=> (cos 1)

;Value: .5403023058681398

1 ]=> (bisection-method cos 1.0 2.0)
1. 2.
1.5 2.
1.5 1.75
1.5 1.625
1.5625 1.625
1.5625 1.59375
1.5625 1.578125
1.5703125 1.578125
1.5703125 1.57421875
1.5703125 1.572265625
1.5703125 1.5712890625
1.5703125 1.57080078125
1.570556640625 1.57080078125
1.5706787109375 1.57080078125
1.57073974609375 1.57080078125
1.570770263671875 1.57080078125
1.5707855224609375 1.57080078125
1.5707931518554687 1.57080078125
1.5707931518554687 1.5707969665527344
1.5707950592041016 1.5707969665527344
1.570796012878418 1.5707969665527344
1.570796012878418 1.5707964897155762
1.570796251296997 1.5707964897155762
1.570796251296997 1.5707963705062866
1.5707963109016418 1.5707963705062866
1.5707963109016418 1.5707963407039642
;Value: 1.570796325802803
Rather than selecting the midpoint between the two prior guesses, you can pretend that your function is linear between the guesses and interpolate where the zero should be. This can converge quicker.
(define (secant-method f x1 x2)
  (display x1) (display " ") (display x2) (newline)
  (let ((f1 (f x1))
        (f2 (f x2)))
    (if (< (abs f1) 1e-8)
        (let ((x0 (/ (- (* x2 f1) (* x1 f2))
                     (- f1 f2))))
          (secant-method f x0 x1)))))

1 ]=> (secant-method cos 0.0 4.0)
0. 4.
2.418900874126076 0.
1.38220688493168 2.418900874126076
1.5895160570280047 1.38220688493168
1.5706960159120333 1.5895160570280047
1.5707963326223677 1.5706960159120333
;Value: 1.5707963326223677
If you know the derivative of f, then you can use Newton's method to find the solution.
(define (newtons-method f f-prime guess)
  (display guess) (display " ") (newline)
  (let ((dy (f guess)))
    (if (< (abs dy) 1e-8)
        (let ((dy/dx (f-prime guess)))
          (newtons-method f f-prime (- guess (/ dy dy/dx)))))))

1 ]=> (newtons-method cos (lambda (x) (- (sin x))) 2.0)
;Value: 1.5707963267948966
Here's another example. We'll find the cube root of 30 by solving (lambda (x) (- (* x x x) 30)).
(define (cube-minus-thirty x) (- (* x x x) 30))

1 ]=> (bisection-method cube-minus-thirty 0.0 4.0)
0. 4.
2. 4.
3. 4.
3. 3.5
3. 3.25
3. 3.125
3.0625 3.125
3.09375 3.125
3.09375 3.109375
3.1015625 3.109375
3.10546875 3.109375
3.10546875 3.107421875
3.1064453125 3.107421875
3.10693359375 3.107421875
3.107177734375 3.107421875
3.107177734375 3.1072998046875
3.107177734375 3.10723876953125
3.107208251953125 3.10723876953125
3.1072235107421875 3.10723876953125
3.1072311401367187 3.10723876953125
3.1072311401367187 3.1072349548339844
3.1072311401367187 3.1072330474853516
3.107232093811035 3.1072330474853516
3.107232093811035 3.1072325706481934
3.1072323322296143 3.1072325706481934
3.107232451438904 3.1072325706481934
3.107232451438904 3.1072325110435486
3.107232481241226 3.1072325110435486
3.1072324961423874 3.1072325110435486
3.107232503592968 3.1072325110435486
3.107232503592968 3.1072325073182583
3.107232505455613 3.1072325073182583
3.107232505455613 3.1072325063869357
;Value: 3.1072325059212744

1 ]=> (secant-method cube-minus-thirty 0.0 4.0)
0. 4.
1.875 0.
8.533333333333333 1.875
2.1285182547245376 8.533333333333333
2.341649751209406 2.1285182547245376
3.4857887202177547 2.341649751209406
3.0068542655016235 3.4857887202177547
3.0957153766467633 3.0068542655016235
3.1076136741672546 3.0957153766467633
3.1072310897513415 3.1076136741672546
3.1072325057801455 3.1072310897513415
;Value: 3.1072325057801455

1 ]=> (define (cube-minus-thirty-prime x) (* 3 x x))

1 ]=> (newtons-method cube-minus-thirty cube-minus-thirty-prime 4.0)
;Value: 3.107232505953859

Quicklisp newsAugust 2014 dist update now available

· 22 days ago
New projects:
Updated projects: 3bmd, access, alexandria, architecture.hooks, architecture.service-provider, arnesi+, asdf-finalizers, bit-smasher, black-tie, caveman, chanl, cl+ssl, cl-ana, cl-autowrap, cl-charms, cl-cli, cl-csv, cl-custom-hash-table, cl-dbi, cl-dot, cl-gss, cl-indeterminism, cl-larval, cl-launch, cl-libevent2, cl-opengl, cl-project, cl-qrencode, cl-rethinkdb, cl-sdl2, cl-slp, cl-spark, cl-syntax, cl-test-more, cl-voxelize, cl-yaclyaml, clack, clfswm, clhs, clobber, closer-mop, closure-html, clsql-orm, clss, coleslaw, collectors, colleen, common-lisp-stat, conduit-packages, crane, css-selectors, daemon, datafly, defmacro-enhance, djula, docbrowser, drakma-async, equals, esrap, f2cl, fare-quasiquote, function-cache, gbbopen, graph, hash-set, hermetic, http-parse, hu.dwim.perec, inferior-shell, iolib, let-over-lambda, listoflist, lla, lol-re, marching-cubes, memoize, more-conditions, multiple-value-variants, ningle, nuclblog, optima, packet, parse-number, petit.package-utils, plump, postmodern, purl, qmynd, racer, rutils, scriptl, sdl2kit, shelly, slime, software-evolution, spatial-trees, stumpwm, sxql, trivial-gray-streams, uiop, weblocks-stores, weblocks-utils.

Not included in the summary above: I was able to get gtk3 into my test build environment, so gtk-cffi should now be fully available in Quicklisp. If you have the gtk3 C libraries installed, give it a try and let me know how it goes.

To get this update, use (ql:update-dist "quicklisp").


Timofei ShatrovWeb scraping with Common Lisp: cookies and stuff

· 22 days ago

It’s been a long time since my last post, but let’s pick up where I left off.

Read part 1 here!

Often you want to grab some data from a website but you can’t just send a get request to a page, you need to log in first. How does the site even know you’re logged in? Well, attached to each request is a bunch of cookies, which are essentialy name-value pairs. Moreover, the server’s response may update or add new cookies, which are faithfully stored by your browser for some period of time. So when you are logging in to a site the following happens:

  1. You send a POST request with your name and password as parameters.
  2. The server responds by setting up your cookies in a way that allows it to recognize your future requests. Usually it sets a cookie that contains your “session id”, which uniquely identifies your current browser session.
  3. When you make any requests after that, the cookie that contains session id is sent along with them, and the server assumes you are logged in.

As you can see, the whole algorithm hinges on the fact that your browser must store and resend the cookie that the server has set up. And when you are making requests through a library or a command-line tool such as curl or wget, the responsibility to store and pass the cookies lies upon you.

Ok, so with Common Lisp we’re using the DRAKMA library. By default it will not send any cookies, or do anything with received cookies. However if you pass a special cookie jar object as a keyword parameter to http-request, it will send cookies from it, and update them based on the server’s response. If you use the same cookie jar object to POST a login request and later to retrieve some data, usually this will be enough to trick the server into serving you the content behind the authentication wall.

    (let ((cookie-jar (make-instance 'drakma:cookie-jar)))
        (drakma:http-request login-url :method :post :parameters login-parameters :cookie-jar cookie-jar)
        (drakma:http-request data-url :cookie-jar cookie-jar))

I think it’s annoying to always write “:cookie-jar cookie-jar” for every request, so in my library webgunk a special variable *webgunk-cookie-jar* is passed as the requests’ cookie jar (it’s nil by default). So you can instead:

    (let ((*webgunk-cookie-jar* (make-instance 'drakma:cookie-jar)))
        (http-request login-url :method :post :parameters login-parameters)
        (http-request data-url))

Special variables are sure handy. In webgunk/modules I created an object-oriented API that uses this feature and webgunk/reddit is a simple reddit browser based on it. Here’s the code for authorization:

(defmethod authorize ((obj reddit-browser) &key login password)
  (let ((login-url ""))
    (with-cookie-jar obj
      (http-request login-url :method :post
                    :parameters `(("api_type" . "json")
                                  ("user" . ,login)
                                  ("passwd" . ,password))))))

where with-cookie-jar is just

(defmacro with-cookie-jar (obj &body body)
  `(let ((*webgunk-cookie-jar* (get-cookie-jar ,obj)))

Note that logging in isn’t always as easy. Sometimes the server’s procedure for setting the cookies is rather tricky (e.g. involving Javascript and redirects). However you almost always can trick the server that you’re logged in by logging in with your browser and then copying the cookie values from your browser (this is known as session hijacking, except you’re only hijacking your own session so it’s ok).

For example, I used to play an online game called SaltyBet, in which you place imaginary money bets on which character will win in a fight. The outcome could be predicted by analyzing the past fights of each character. After losing a million of SaltyBucks due to suboptimal betting, I have built a system that would collect the results of past fights from SaltyBet’s own website, calculate and display the stats for each character and also show their most recent fights, and the biggest upsets that they have been involved in. This was incredibly effective and I was able to recoup my lost money twice over!

But anyway, the data was available only to paid members so I needed to log in to scrape it. And the method described above did not work. In the end what worked was a simple:

(defparameter *cookies* 
  '(("PHPSESSID" . "4s5u76vufh0gt9hs6mrmjpioi0")
    ("__cfduid" . "dc683e3c2eb82b6c050c1446d5aa203dd1376731139271")))

(defmethod authorize ((obj saltybet-browser) &key)
  (clear-cookies obj)
  (loop for (name . value) in *cookies*
       do (add-cookie obj (make-instance 'drakma:cookie :name name :value value
                                         :domain ""))))

How did I get these values? I just copy-pasted them from my Firefox! They were good for a few days, so it wasn’t much hassle at all. Sometimes a stupid solution is the most effective one.

PixelThe Terrible, Horrible, No-Good, Very-Bad Things I Do To PHP (Or: Greenspunning in PHP)

· 23 days ago

Controller Methods

What we call a “controller method”, at least, is simply a function which is called on the web side via AJAX and returns some value back (generally either JSON or a snippet of HTML).

When I started at my current employer, controller methods looked like this:

class MyController extends Controller {
  function f() {
    if (!$this->required('foo', 'bar', 'baz')) return;
    if (!is_numeric($this->requestvars['foo'])) return;
    // lots of code
    if (isset($this->requestvars['quux'])) {
      // do something with quux
    // more code
    return $someString;

Aside from being hideous, this has a number of glaring problems:

  1. The error handling is terrible.
  2. Validation is hard, and thus incredibly easy to screw up or forget entirely.
  3. Even something that should be easy―merely figuring out how to call the method―requires reading and understanding the entire method.

This just won’t do.

Surely it would be much nicer if controller methods could simply be defined like a regular function. Fortunately, PHP offers some manner of reflexive capabilities, meaning we can ask it what arguments a function takes. We can then match up GET/POST parameters by name, and send the function the proper arguments.

In other words, we can define the function more like:

class MyController extends Controller {
  function f($foo, $bar, $baz, $quux = null) {
    // lots of code
    if (isset($quux)) {
      // do something with quux
    // more code
    return new Response\HTML($someString);

And have it actually work. That’s much nicer. Now, we can call the method from PHP as easily as we call it from JavaScript, and we don’t have to read the entire function to figure out what arguments it takes.

(The astute reader will also notice I’ve moved to returning an object, so the response has a type. This is super-handy, because now it’s easy to ensure we send the apropriate content-type, enabling the JS side to do more intelligent things with it.)

Of course, this only tells us which arguments it takes, and whether they’re optional or required. We still need easier data validation. PHP provides type hints, but they only work for classes. Or do they?

Type Hints

In a brazen display of potentially ill-advised hackery (our code is a little more involved, but that should give you the general idea), I added an error handler that enables us to define non-class types to validate things.

So now we can do this:

class MyController extends Controller {
  function f(
    int $foo,
    string $bar,
    string $baz,
    int $quux = null
  ) {
    // lots of code
    if (isset($quux)) {
      // do something with quux
    // more code
    return new Response\HTML($someString);

And all the machinery ensures that by the time f() is executing, $foo looks like an integer, as does $quux if it was provided.

Now the caller of the code can readily know what the value of the variables should look like, and the programmer of the function doesn’t really have an excuse for not picking a type because it’s so easy.

Of course, this isn’t sufficient yet either. For instance, if I’d like to be able to pass a date into the controller, it has to be a string. Then the writer of the controller has to convert it to an appropriate class. Surely it’d be much nicer if the author of the controller method could say “I want a DateTime object”, which would be automagically converted from a specially-formatted string sent by the client.

Type Conversion via Typehints

Because PHP provides references via the backtrace mechanism, we can modify the parameters a function was called with.

class MyController extends Controller {
  function f(
    int $foo,
    string $bar,
    DateTime $baz,
    int $quux = null
  ) {
    // lots of code
    if (isset($quux)) {
      // do something with quux
    // more code
    return new Response\HTML($someString);

So while $baz might be POSTed as baz=2014-08-16, what f() gets is a PHP DateTime object representing that date. Due to the implementation mechanism, even something as simple as:

$mycontroller->f(1, “bar”, “2014-08-16”);

Will result in $baz being a DateTime object inside f().


There is an unfortunate caveat, and I have yet to figure out if it’s a quirk of the way I implemented things, or a quirk in the way PHP is implemented, but optional arguments do not change. That is, SomeClass $var = null will result in $var still being a string. func_get_args() will contain the altered value, however.

Multiple Inheritance and Method Combinations

PHP is a single inheritance language. Traits add some ability to build mixins, which is super-handy, but has some annoying restrictions. Particularly around calling methods―in particular, you can’t define a method in a trait, override it in a class which uses a trait, and then call the trait method from the class method. At least, not easily and generally.

Plus there’s no concept of method combinations. It’d be really handy to be able to say “hey, add this stuff to the return value” (e.g., by appending to an array) and have it just happen, rather than having to know how to combine your stuff with the parent method’s stuff.

While I’m sad to say I don’t have this working generally across any class, I have managed to get it working for a particular base class where it’s most useful to our codebase. Subclasses and traits can define certain methods, and when called, the class heirarchy will be automatically walked and the results of calling each method in the heirarchy will be combined.

trait BobsJams {
  static function BobsJams_getAdditionalJams() {
    return [ new CranberryJam(), new StrawberryJam() ];

trait JimsJams {
  static function JimsJams_getAdditionalJams() {
    return [ new BlackberryJam() ];

class Jams {
  function getJams() {
    return (new MethodCombinator([], 'array_merge'))
      ->execute(new ReflectionClass(get_called_class()), "getAdditionalJams");

class FewJams extends Jams {
  static function getAdditionalJams() {
    return [ new PineappleJam() ];

class LotsOJams extends FewJams {
  use BobsJams;
  use JimsJams;

  static function getAdditionalJams() {
    return [ new OrangeJam() ];

(new LotsOJams())->getJams();
// => [ OrangeJam, CranberryJam, StrawberryJam, BlackberryJam, PineappleJam ]

(The somewhat annoying prefix on the traits’ method names is to avoid forcing users of a trait to deal with name collisions.)

Naturally, all the magic of the getJams() method is hidden away in the MethodCombinator class, but it just walks the class hierarchy―traits included―using the C3 Linearization algorithm, calls those methods, and then combines them all using the combinator function (in this case, array_merge).

This, as you might imagine, greatly simplifies some code.

Oh, but you’re not impressed by shoehorning some level of multiple inheritance into a singly-inherited language? Fine, how about…

Context-Sensitive Object Behavior

Web code tends to be live, while mobile code is harshly asynchronous (as in: still needs to function when you have no signal, and then do something reasonable with data changes when you do have signal again), so what we care about changes between our Mobile API and our Web code, and yet we’d still like to share the basic structure of any given piece of data so we don’t have to write things twice or keep twice as much in our heads.

Heavily inspired by Pascal Costanza’s Context-Oriented-Programming, we define our data structures something like this:

class MyThing extends Struct {
  public $partA;
  public $userID;
  // ...
  function getAdditionalDefaultContextualComponents() {
    return [ new MyThingWebUI(), new MyThingMobileAPI() ];

class MyThingWebUI extends Contextual {
  public $isReadOnly;
  // ...
  function getApplicableLayer() { return "WebUI"; }

class MyThingMobileAPI extends Contextual {
  public $partB;
  // ...
  function getApplicableLayer() { return "MobileAPI"; }

The two Contextual subclasses define things that are only available within particular contexts (layers). Thus, within the context of WebUI, MyThing appears from the outside to look like:

  "partA": "foo",
  "userID": 12,
  "isReadOnly": false,

But within the Mobile API, that same $myThing object looks like:

  "partA": "foo",
  "userID": 12,
  "partB": "bar",

In addition to adding new properties, each layer can also exclude keys from JSON serialization, add aliases for keys (thus allowing mobiles to send/fetch data using old_key, when we rename something to new_key), and probably a few other things I’m forgetting.


PHP is remarkably malleable. error_handlers can be used as a poor-man’s handler-bind (unlike exceptions, they run before the stack is unwound, but you’re stuck dispatching on regular expressions if you want more than one); scalar type hints can be provided as a library; and traits can be abused to provide a level of multiple inheritance well beyond what was intended. While this malleability is certainly handy, I miss writing code in a language that doesn’t require jumping through hoops to provide what feel like basic facilities. But I’m also incredibly glad I can draw from the well of ideas in Common Lisp and bring some of that into the lives of developers with less exposure to the fantastic facilities Lisp provides.


My employer is desperate for user feedback, and as such is offering a free eight week trial. So if you want to poke at stuff and mock me when things don’t work very well (my core areas are nutritional analysis for recipes and food-related search results), that’s a thing you can do.

If you're outside the US, I should warn you that we have a number of known bugs and shortcomings you're much more likely to hit (we use a US-based product database; searching for things outside ASCII doesn't work due to MySQL having columns marked as the wrong charset; and there's a lot of weirdness around time because most user times end up stored as unix timestamps). The two bugs will be fixed eventually, but since they're complicated and as the US is our target market they're not exactly at the top of the list.

Don't you just love it when people move to a new blog? I'd 301 redirect you if I could, but since I can't you'll have to click through to read comment count unavailable comments or leave your own.

Dimitri FontaineTurn your PostgreSQL queries into Charts

· 23 days ago

Earlier this year we did compare compare Aggregating NBA data, PostgreSQL vs MongoDB then talked about PostgreSQL, Aggregates and histograms where we even produced a nice Histogram chart directly within the awesome psql console. Today, let's get that same idea to the next level, with pgcharts:

The new pgcharts application

The application's specifications are quite simple: edit an SQL query, set your categories and your data series, add in some legends, and get a nice chart. Currently supported are bar, column, pie and donut charts, and we should be able to add anything that has support for.

Currently, you need to compile the application yourself, and for that you need to install the SBCL compiler. Soon enough you will have a debian package to play with! The README at the pgcharts github place has the details to get you started. Enjoy!

Zach BeaneWhere to get help with Common Lisp

· 24 days ago
There are several options for getting help with Common Lisp.

First, some general sources of help.
Second, some implementation-specific help.

Every implementation has a mailing list. Some have multiple lists, depending on the topic of discussion. Here are some of the lists I know about:
Finally, there's a Quicklisp mailing list for Quicklisp issues. All questions about Quicklisp are welcome.

When appealing to any of these sources for help, asking good questions is critical to getting good results. In many cases, that requires three key bits of information:
Sometimes this info isn't applicable, like for "What is a good book for learning Common Lisp?" (Practical Common Lisp, Paradigms of AI Programming). But in many cases, having those three things readily available will save a lot of time.

There are dozens of people ready and willing to help you with your Common Lisp questions. This can work very well as you're getting oriented and learning the basics. After the basics, though, you may find questions going unanswered: "I'm compiling commonqt for ecls on raspberry pi and I'm getting a segfault in my custom gcc fork..." nets no helpful answers anywhere you ask. Don't worry! This is your chance to be a pioneer! When you do find a solution to your problem, share it in public so those in the future with the same, weird problem need not feel so lonely. And answer the question yourself the next time it comes up on a mailing list.

You may also get conflicting help as well. Some people believe that nobody should ever use LOOP, and others feel that for the right situation, nothing's better. Some people will tell you that you must use Emacs to write Common Lisp, and others feel there are other reasonable options. Some love the Common Lisp Object System, and others are wrong. It's up to you to sift through the information and decide for yourself what makes the most sense. And how boring would it be if everyone agreed about everything all the time?

I love to help people with Common Lisp. It's most rewarding when the questions show an interest in Common Lisp and a desire for progress and understanding. It's least rewarding when the questions show disdain for Common Lisp ("Why is this so convoluted and dumb?") and no interest in anything but a quick answer ("I'm taking this dumb course and after it's done I will never use LISP again. So can you just explain how to remove quote marks from a list?").

Have fun, and I hope the next time you need help with Common Lisp, you get what you need. Good luck!

Hans HübnerBerlin Lispers Meetup: Tuesday August 26th, 2014, 8.00pm

· 24 days ago
You are kindly invited to the next "Berlin Lispers Meetup", an informal gathering for anyone interested in Lisp, beer or coffee:

Berlin Lispers Meetup
Tuesday, August 26th, 2014
8 pm onwards

St Oberholz, Rosenthaler Straße 72, 10119 Berlin
U-Bahn Rosenthaler Platz

We will try to occupy a large table on the first floor, but in case you don't see us,
please contact Christian: 0157 87 05 16 14.

Please join for another evening of parentheses!

Zach BeaneStarting a Lisp project in 2014

· 26 days ago
If you're starting a project today and you want to make it easy to interact with the outside world, there are many, many, many services available. If you're feeling overwhelmed with all the options, here's what I'd recommend to get started:After a while you may find one or more of these options lacking. That's a good time to start shopping around and exploring other options. But when you're just getting started and you want to get up and running quickly, give them a shot.

Joe MarshallSmall puzzle solution

· 26 days ago
Before I give my solution, I'd like to describe the leftmost digit algorithm in a bit more detail.
(define (leftmost-digit base n)
  (if (< n base)
      (let ((leftmost-pair (leftmost-digit (* base base) n)))
        (if (< leftmost-pair base)
            (quotient leftmost-pair base)))))
The idea is this: if we have a one digit number, we just return it, otherwise we recursively call leftmost-digit with the square of the base. Squaring the base will mean gobbling up pairs of digits in the recursive call, so we'll get back either a one or two digit answer from the recursion. If it is one digit, we return it, otherwise it's two digits and we divide by the base to get the left one.

For example, if our number is 12345678 and the base is 10, we'll make a recursive call with base 100. The recursive call will deal with number as if it were written 12 34 56 78 in base 100 and return the answer 12. Then we'll divide that by 10 to get the 1.

Since we're squaring the base, we're doubling the number of digits we're dealing with on each recursive call. This leads to the solution in O(log log n) time. If we instrument quotient, you can see:
(leftmost-digit 10 816305093398751331727331379663195459013258742431006753294691576)
816305093398751331727331379663195459013258742431006753294691576 / 100000000000000000000000000000000
8163050933987513317273313796631 / 10000000000000000
816305093398751 / 100000000
8163050 / 10000
816 / 100
A sixty-three digit number trimmed down to one digit with only five divisions.

So a simple solution to the puzzle is:
(define (leftmost-digit+ base n)
  (if (< n base)
      (values n 0)
      (call-with-values (lambda () (leftmost-digit+ (* base base) n))
        (lambda (leftmost-pair count)
          (if (< leftmost-pair base)
              (values leftmost-pair (* count 2))
              (values (quotient leftmost-pair base) (+ (* count 2) 1)))))))
The second value is the count of how many digits we discard. If the number is less than the base, we return it and we discarded nothing. Otherwise, we make the recursive call with the base squared and get back two values, the leftmost pair and the number of pairs it discarded. If the leftmost pair is a single digit, we return it, otherwise we divide by the base. The number of digits discarded is simply twice the number discarded by the recursive call, plus one more if we had to divide.

But I don't see an easy way to separate finding the digit from finding the position. At first it seemed straightforward to just count the digits being discarded, but you can't decide whether to increment the count at each stage without determining if the leftmost part of the recursive call contains one or two digits.

Nicolas HafnerAdding Class-Bound Methods to CLOS - Confession 25

· 26 days ago

header This confession is out of sequence because I wrote the initial sketch for it earlier and can only now publish it because it is rather heavy in code for a change and that required the new blog version. Anyway.

One of the big changes when coming from more mainstream OOP languages to CL is that CLOS uses generic dispatch rather than methods that are bound to classes. I've now come to love and rejoice generic dispatch because it's both a simpler and more powerful system.

As I was browsing about my mind for ideas, I suddenly had the thought of implementing the other OOP idiom in CL as well. After all, it had to be possible, right? And surely enough it is. It didn't take me long to figure out how either, but instead of presenting a direct solution for you I'll walk through a basic prototyping approach to implement something like this in CL.

First we need to figure out what we want. For this, the answer to that is rather simple: We want methods that are bound to a specific class. Different classes can have methods with the same name but differing argument lists. Methods should be inheritable and overridable.

Next is the actual implementation difficulty. A first idea might be to create a ‘main superclass’ that has a class-allocated methods slot with a hash-table that contains the class method functions. This is problematic for two reasons. First, it leaks class information into the instances. Second, it doesn't give us a good way to compute the inheritance.

Enter MOP. The Meta Object Protocol gives us a wealth of extensions to CLOS that make the whole system a true wonder machine. Indeed, MOP allows us to do what we want and a lot more without too much of a hassle and it all integrates with the rest of CLOS. First we'll want to define our own metaclass that contains the methods slot.

(defclass method-class (standard-class)
  ((methods :initform (make-hash-table) :accessor class-methods)))

That's right, classes themselves are instances of classes. Before I found out about this I never even thought it possible, it never crossed my mind. But yet here it is and it is immensely powerful. In order to use our metaclass, we need to tell CLOS a bit more about it so that it can handle inheritance properly. From here on out we'll be using CLOSER-MOP, which is QL-able.

(defmethod c2mop:validate-superclass ((class method-class) (superclass t)) 
(defmethod c2mop:validate-superclass ((class standard-class) (superclass method-class)) 
(defmethod c2mop:validate-superclass ((class method-class) (superclass standard-class)) 
(defmethod c2mop:validate-superclass ((class method-class) (superclass method-class)) 

This basically tells CLOS what kind of superclass or inheritance order is allowed, since some things wouldn't quite make sense for our case. Next we'll add a few helper functions to add and remove methods for our classes.

(defun class-method (class-name method-name)
  (gethash method-name (class-methods (find-class class-name))))

(defun (setf class-method) (function class-name method-name)
  (setf (gethash method-name (class-methods (find-class class-name)))

(defun remove-class-method (class-name method-name)
  (remhash method-name (class-methods (find-class class-name))))

(defmacro define-class-method (class name args &body body)
  `(setf (class-method ',class ',name)
         #'(lambda ,args ,@body)))

Using these we can, although in a more lispy way, already define class-bound methods. Let's do some quick tests to see if it's all working:

(defclass test-class () ()
  (:metaclass method-class))

(define-class-method test-class greet (&optional name)
  (format T "Hello~@[, ~a~]!" name))

(funcall (class-method 'test-class 'greet))

(funcall (class-method 'test-class 'greet) "Lisper")

Now, calling funcall directly is a bit unwieldy and we'd also like to be able to call class methods on instances, so let's add a generic CALL function to do all that for us:

(defun call (instance method &rest args)
  (let ((class (etypecase instance
                 (standard-object (class-of instance))
                 (symbol (find-class instance))
                 (method-class instance))))
    (assert (typep class 'method-class))
    (let ((method (or (gethash method (class-methods class))
                      (error "No such class-method!"))))
      (apply method args))))

(call 'test-class 'greet "Reader")
(call (make-instance 'test-class) 'greet "You")

Neat. Now, one aspect that's missing is that, within a class-method you cannot refer to your own instance. That's not exactly great and something that is certainly direly needed in order to properly utilise classes and methods. We'll circumvent this by adding a special *THIS* that will be bound to the instance in CALL.

(defvar *this*)

(defun call (instance method &rest args)
  (let ((class (etypecase instance
                 (standard-object (class-of instance))
                 (symbol (find-class instance))
                 (method-class instance))))
    (assert (typep class 'method-class))
    (let ((method (or (gethash method (class-methods class))
                      (error "No such class-method!"))))
      (let ((*this* (typecase instance
                      (standard-object instance)
                      (T (c2mop:class-prototype class)))))
        (apply method args)))))

Thanks to CLASS-PROTOTYPE and the typecases we can also use our class-methods on classes themselves and still have access to class-allocated slots. Sort of like static fields and methods work in Java & co, except every method can be used statically and not.

Now, to make this all a bit less awkward to use, we'll add some more macros.

(defmacro -> (instance method &rest args)
  `(call ,instance ',method ,@args))

(defmacro <- (method &rest args)
  `(call *this* ',method ,@args))

(defmacro <s (slot)
  `(slot-value *this* ',slot))

Now I realise that this is all a lot of cosmetics and probably not the best style from a lisp viewpoint, but I'll excuse this with the fact that we're trying to emulate other languages anyway, so we might as well add some sugar to the mix.

“Ok, cool” I hear you think “but what about inheritance? Surely that isn't just already done for us, right?” And it certainly isn't, but we'll get to it now. If you subclass your test-class now, the new subclass won't have any of its parent's methods. In order to get this all rolling we'll first define a new slot on our class.

(defclass method-class (standard-class)
  ((direct-methods :initform (make-hash-table) :accessor class-direct-methods)
   (methods :initform (make-hash-table) :accessor class-methods)))

The DIRECT-METHODS slot will hold methods that belong directly to this class and the METHODS slot will hold all of the effective methods available to it, both from its own and the inherited. This means we have to change our simple accessor functions from earlier to direct to DIRECT-METHODS.

(defun class-method (class-name method-name)
  (gethash method-name (class-direct-methods (find-class class-name))))

(defun (setf class-method) (function class-name method-name)
  (setf (gethash method-name (class-direct-methods (find-class class-name)))

(defun remove-class-method (class-name method-name)
  (remhash method-name (class-direct-methods (find-class class-name))))

Next we need to have a function that can actually compute the effective methods.

(defun compute-effective-methods (class)
  (setf (class-methods class) (make-hash-table))
  (flet ((set-method (name method)
           (setf (gethash name (class-methods class)) method)))
    ;; Compute superclass combination
    (dolist (superclass (c2mop:class-direct-superclasses class))
      (when (typep superclass 'method-class)
        (maphash #'set-method (class-methods superclass))))
    ;; Compute override
    (maphash #'set-method (class-direct-methods class))))

As you can see this first maps all the direct superclasses' methods to the table and then the direct-methods on top. Since each superclass has their own effective methods slot we don't need to go up further than one level. Now we need to tie this into the actual inheritance computation of CLOS.

(defun cascade-method-changes (class)
  (compute-effective-methods class)
  (loop for sub-class in (c2mop:class-direct-subclasses class)
        when (and (typep sub-class 'method-class)
                  (c2mop:class-finalized-p sub-class))
          do (cascade-method-changes sub-class)))

(defmethod c2mop:finalize-inheritance :after ((class method-class))
  (dolist (super (c2mop:class-direct-superclasses class))
    (unless (c2mop:class-finalized-p super)
      (c2mop:finalize-inheritance super)))
  (cascade-method-changes class))

Let's test this:

(defclass sub-class (test-class) ()
  (:metaclass method-class))

(-> (make-instance 'sub-class) greet)

And it indeed works! If you are someone who likes to experiment first before following instructions you might have come across the following problem after defining the sub-class:

(-> 'sub-class greet)
; Evaluation aborted on #<SIMPLE-ERROR "No such class-method!" {1005558A33}>.

So it seems directly using the class after defining it doesn't work, but using an instance does. And after creating the instance, the above command will work as well. What happens here is that the inheritance of a class is not necessarily finalised until immediately before a class instance is made. This means that when we access the class before then, the inheritance might not have been computed and thus our method computation never actually happened! We can fix this by adding a check in our CALL function.

(defun call (instance method &rest args)
  (let ((class (etypecase instance
                 (standard-object (class-of instance))
                 (symbol (find-class instance))
                 (method-class instance))))
    (assert (typep class 'method-class))
    (unless (c2mop:class-finalized-p class)
      (c2mop:finalize-inheritance class))
    (let ((method (or (gethash method (class-methods class))
                      (error "No such class-method!"))))
      (let ((*this* (typecase instance
                      (standard-object instance)
                      (T (c2mop:class-prototype class)))))
        (apply method args)))))

Great, so with a bit of fiddling about we have added class-method capabilities to CLOS, full with inheritance and everything, in about 65 lines of simple code. Amazing. Now as a last treat I want to add a small extension to the class definition itself so that we can put methods in there directly, as you might know it from other languages as well. In order to do this we're going to define our own methods on RE/INITIALIZE-INSTANCE.

(defun initialize-method-class (class next-method &rest args &key &allow-other-keys)
  (let ((methods (getf args :methods)))
    (setf (class-direct-methods class) (make-hash-table))
    (dolist (definition methods)
      (destructuring-bind (name lambda-list &rest body) definition
        (let ((function (compile NIL `(lambda (,@lambda-list)
          (setf (gethash name (class-direct-methods class)) function)))))
  (remf args :methods)
  (apply next-method class args))

(defmethod initialize-instance :around ((class method-class) &rest args)
  (apply #'initialize-method-class class #'call-next-method args))

(defmethod reinitialize-instance :around ((class method-class) &rest args)
  (apply #'initialize-method-class class #'call-next-method args))

Sadly I am not aware of any other way of doing this that doesn't require invoking EVAL or COMPILE. However, this should be fine for most cases since most of the time your class definitions will be top-level forms, so the lexical environment should not be of consequence. As you can see though, by simply extending these two generic functions and grabbing the :METHODS argument from it we can change the way DEFCLASS is interpreted.

(defclass sub-class (test-class) ()
  (:metaclass method-class)
  (:methods (scream () (format NIL "AAAAAAA!!"))))

(-> 'sub-class scream)

CLOS and MOP are amazing creatures and there's still so much I have yet to explore of it. I hope this brief venture into the depths of Common Lisp have been enjoyable and informative to you. As always with these posts, if there are corrections, additions or questions, please do let me know and I'll see how I can help.

You can view the full code here.

edit: Thanks to Jean-Philippe Paradis for pointing out that using COMPILE instead of EVAL to create the lambda form is probably a better idea.

Zach BeaneCommon Lisp bits

· 26 days ago
The International Lisp Conference 2014 is history. I didn't make it, but Julian Squires did and provided a nice write-up. So did Nick Levine.

Reddit user FightAnArmedBearP started a discussion titled "Should my startup attempt to use ABCL in a production environment?" The ensuing feedback from Mikel Evins, Cyrus Harmon, Mark Evenson, and others is of extremely high quality, with excellent information about using ABCL and other solutions for working with Lisp on the JVM. Highly recommended reading even if you think you'll never be in the same situation.

cl-autowrap by Ryan Pavlik looks to me like a genius way to do FFI wrappers without needing to use stub C programs and a compiler at build time. It uses c2ffi, a standalone C++ program, to generate specification data files you can distribute with your project, after which neither c2ffi nor a C compiler is involved in using the FFI wrapper. It's a work in progress, one of the Lisp In Summer Projects projects, but I'm excited about the potential.

Speaking of Lisp In Summer Projects, Janne Nykopp's Notewhacker sounds fun: it's a game for learning sheet music notation, with MIDI instrument support.

elPrep is a LispWorks-olnly, BSD-style-licensed "high-performance tool for preparing .sam/.bam files for variant calling in sequencing pipelines." I don't know what that means, but it's a full Common Lisp application for the purpose, and it looks neat. From Charlotte Herzeel and Pascal Costanza.

Ezra Rush has written networked, OpenGL battleship and pong games. They use Patrick Stein's userial game protocol example.

Slugmage is a work-in-progress game from Alyssa Carter. "I hope to turn Slugmage into the game it was meant to be!" Uses SDL.

Testament of the White Cypress is "a fantasy role-playing adventure game for your Windows, Mac, or Linux PC. Using the mouse and keyboard, a single player controls a lone monk named Geoffrey as he travels through a post-apocalyptic wilderness." From David O'Toole, using his Xelf game engine. It has split licensing: the code is GPLv3 but the music and art assets are not freely redistributable.

Nicolas Hafner takes a shot at explaining CLOS to the uninitiated.

site-generator is a "command-line static site generator that supports arbitrary content and multiple languages." By Alex Charlton. (Not to be confused with Philip "Don't Call Me Phil" Greenspun's SITE CONTROLLER.)

qlot by Eitaro Fukamachi is a project-local library installer that uses Quicklisp. "This aims to be like Bundler of Ruby or Carton of Perl." Seems interesting, though it comes with a big "ALPHA quality" warning.

Joe MarshallJust a small puzzle

· 27 days ago
You can get the most significant digit (the leftmost) of a number pretty quickly this way
(define (leftmost-digit base n)
  (if (< n base)
      (let ((leftmost-pair (leftmost-digit (* base base) n)))
        (if (< leftmost-pair base)
            (quotient leftmost-pair base)))))
The puzzle is to adapt this code to return the position of the leftmost digit.

(leftmost-digit+ 10 46729885)  would return two values, 4 and 7

Common Lisp TipsDeclining to handle a condition

· 27 days ago

Occasionally it is convenient for a handler to decide whether to deal with a condition after it already has control.

Sometimes I see that idiom written with a `handler-case` that resignals the condition if it decides it can't take care of it. The downside of that approach is that `handler-case` only resignals the condition after unwinding the stack and removing any restarts that were in place.

`handler-bind` avoids those downsides. When a `handler-bind` handler does not transfer control it is said to "decline to handle" the condition. The condition will then percolate up to surrounding handlers or the debugger with the stack and restarts still in tact, as if the handler had never been entered.

(defun baz ()
  (handler-bind ((error (lambda (c)
                          (when (i-got-this-p c)
                            (return-from baz (quux))))))

LispjobsCommon Lisp or Clojure Developer, Adelaide or remote

· 27 days ago

Common Lisp or Clojure Developer
A fantastic opportunity for a Common Lisp developer or Clojure developer that is fast, adaptable and can work independently or fit in well into a team

Experience with Common Lisp or Clojure is a must (could be non-commercial) as well as general knowledge of relational databases and web technologies.

This role could be 100% remote for the right person, to join a top class team and on a great product which could became the Common Lisp application with largest customer base in the world.   The successful applicant will join a small, focused team in maintaining and furthering the development of a leading multivariate testing platform.

Familiarity in the following areas would be considered a plus: backend web server technology, Javascript, PostgreSQL, SQL Server, statistics, distributed computing, Lispworks, any distributed version control system. A high degree of autonomy and self-motivation will be expected.

This is a great career building opportunity and salary package on offer! Although will consider contractors!

If this is of interest I’d be keen to discuss with you, please email me or call 0061 (0)415 344 427

For older items, see the Planet Lisp Archives.

Last updated: 2014-09-17 20:56