Planet Lisp

July 1, 2009

Andy HefnerMiscellaneous Lisp Hackery

Here's a few things I've been working on lately which may be of broader interest:

Bordeaux-FFT is a small library for computing the FFT/IFFT on complex data, originally written by Robert Strandh (and/or Sylvain Marchand and Martin Raspaud), with contributions by Paul Khuong and myself. Last summer I did some audio processing experiments, originally using the FFT code from Sapaclisp. Robert helpfully volunteered his FFT implementation, which I cleaned up slightly and have been cheerfully employing in various audio hacks ever since. Several versions have changed hands through email, lisppaste, and my web server, so I've finally come around to collecting the changes, writing a brief manual, and tarring up a release. It's surprisingly fast, particularly with Paul Khuong's recent work on SBCL, although I don't know how it stacks up against some of the highly tuned assembly language implementations out there. I use it along with my fledgling task queuing code to grind out batches of FFTs across four CPU cores.

Shuffletron is a simple music player in CL, with a few interesting features. I've been running this program full time for several weeks now as my preferred music player. It snuck on to the lisp subreddit before I really announced it, and I'm sure folks on IRC are already sick of hearing about it, so I won't say much. I began with plans for a much more ambitious player, with a fancy graphical interface, and wrote the first version of this one Saturday realizing I needed something simple to put the audio code through its paces while I wrote the full player, intending to include this one as an example program with Mixalot. It worked better than expected, so I ended up fleshing out the feature set instead and decided that it was really all I needed. The code is lean and mean.

I've had a number of problems trying to produce redistributable binaries of this which I believe I've finally solved (although new binaries are forthcoming). I hope to write about these later.

Mixalot is the audio back-end of Shuffletron, factored into its own system(s) because it might be useful for other purposes. It includes a mixer which pulls audio from any number of streamer objects and outputs them to ALSA. It also includes FFI definitions for libmpg123 with some helpers for decoding MP3 files and reading ID3 tags, and a streamer class for decoding and playing MP3 files in real time. The libmpg123 portions are usable independently of the audio mixing/output code.

 

Nick LevineCall for Participation: Commercial Users of Functional Programming Workshop

The organisers of CUFP 2009 (4 September, Edinburgh) are really keen to get lispers involved. Although they didn't get any lisp presentations this year, they'd "sure love to have some lisp community participants". You can read the Call for Participation here.

I'm on the current Program Committee but haven't been very effective. If anyone reading this can recommend potential PC members for future years please get in touch with me directly. They should be well known in the lisp community and also have a scope that encompasses at least some of the other proposals which the workshop might get.  

June 30, 2009

Dan WeinrebCome to the European Common Lisp Meeting!

The European Common Lisp Meeting will be in Hamburg, on the weekend of September 12 and 13, 2009.  I greatly enjoyed last year’s ECLM, in Amsterdam. It’s relaxed and gives you a lot of opportunity to meet great Lisp experts from all over the world. Arthur Lemmens and Edi Weitz did a superb job arranging for entertainment and space, and making sure everyone was happy.

I’m also looking forward to seeing Hamburg; I’ve never been there, and it sounds great.

I’m giving a talk entitled A Highly-Available Large-Scale Transaction Processing System in Common Lisp. It’s about the airline reservation system that we’re building at ITA Software, specifically about the issues involved in using Common Lisp, which is not widely thought of as being a language for writing large-scale transaction processing.

The lightning talks at the International Lisp Conference last March went so well that Edi and Arthur are trying out this format at the ECLM.  After the ILC, someone told me that at another sofware-related conference he had been to, the lightning talks fell flat: few people signed up to give talks, and they weren’t very good.  At the ILC, I thought they were nearly all great.  We learned about new tools, stories, and so on.  There was a great one about using Lisp in a Lisp-unfriendly world.  In a nutshell: if they force you to program in PERL, then run a PERL-coded Scheme interpreter and write in Scheme!  I anticipate more fun lightning talks in Hamburg!

So, I encourage you to join the fun!

 

Nick LevineEuropean Common Lisp Meeting, Hamburg, September 12/13

Arthur Lemmens and Edi Weitz have posted the list of speakers for the European Common Lisp Meeting 2009. The meeting will consist of a day of talks on Sunday September 13, with optional tour of Hamburg on the Saturday afternoon and dinners on Saturday and Sunday evenings.

For a list of speakers and more information including lightning talks and registration info visit http://weitz.de/eclm2009/.  

June 29, 2009

Gábor MelisGlobal Compiler Policy

A quick note to library implementors: the effects of DECLAIM are permitted to persist after the containing file is compiled and it is unkind to mutate your user's settings. Personally, I find DECLAIM too blunt and prefer to add declarations within functions, even going as far as introducing LOCALLY subforms just to have a place on which to hang declarations. But if you are really set on using DECLAIM, please wrap it like this:

 (eval-when (:compile-toplevel)
  (declaim (optimize speed)))

to ensure that the body is evaluated in the dynamic execution context of the compiler which makes a practical difference on Allegro. It goes without saying that you don't want (PROCLAIM '(OPTIMIZE ...)) in a library, either.
 

Nikodemus SiivolaMAKE-INSTANCE optimizations

SBCL has always had a fairly efficient MAKE-INSTANCE -- as long as the class argument is a quoted symbol and all the keywords in initargs are constant as well.

This is due to what are called ctors. Very briefly, a call such as

(make-instance 'point :x x :y y)

is transformed into something along the lines of

(let ((#:ctor (load-time-value (ensure-ctor 'foo :x :y))))
  (funcall #:ctor x y))

where CTOR is a funcallable instance. The first time it is called, an optimized constructor is generated, saved as the funcallable instance function, and finally called with the provided arguments. Later calls get the optimized constructor ready made.

There are multiple efficiency gains here: initargs can be checked for validity at compile time, and (assuming certain things hold true about the metaclass of the class and its methods) essentially the whole MAKE-INSTANCE -> ALLOCATE-INSTANCE | INITIALIZE-INSTANCE -> SHARED-INITIALIZE call tree can be open coded, and slot accesses can be performed using STANDARD-INSTANCE-ACCESS with constant slot locations. This all works with class redefinitions as ctors are updated automatically.

Unfortunately, this made the performance of non-constant class arguments to MAKE-INSTANCE extremely unattractive in comparison: 10-100 times slower is fairly typical. Ouch!

At ELS 2009 Didier Verna asked if we could do something similar for the non-constant case as well. Turns out we can:

(make-instance class :x x :y y)

can become something like

(let* ((#:cache (load-time-value (cons 'ctor-cache nil)))
       (#:store (cdr #:cache)))
  (multiple-value-bind (#:ctor #:new-store)
      (ensure-cached-ctor class :x :y #:store)
    (setf (cdr #:cache) #:new-store)
    (funcall #:ctor x y)))

where the ctor corresponding to the class argument is first looked for in the cache: if it is not found a new one is generated and cached.

This was implemented in SBCL 1.0.29.41; now the non-constant class arguments performs much better, being roughly only 1.15-1.25 times slower than the constant ones. Hopefully this will make a real difference in some applications out there!

A bit later another ctor optimization improvement was implemented, improving cases where the full open coding could not be done due to defined methods, but where we are still able to check initargs at compile time and replace the MAKE-INSTANCE generic function call with a slightly faster normal function call to a simplified constructor function: this makes such calls 4-10 times slower than the fast path, but still considerably faster than the worst case scenario.

The message to take home is actually not about SBCL optimizations -- as much as I like to talk about them: I think ctors are an excellent example of compiler macros doing inline caching and specialization. A nice technique that could probably be used more often than it is.

Note: SBCL inherited its ctor optimizations from CMUCL: they are the work of the estimable Gerd Moellmann. Things described above do not replace his work, just extend it around the edges. Those interested in the gory details can look at src/pcl/ctor.lisp in the SBCL source tree.

 

Paul KhuongComplex float improvements for sbcl 1.0.30/x86-64

Complex single- and double- floats used to be represented as pairs of SSE registers on the x86-64 port. Doing so simplified the porting work, since more code could be shared with other backends. Representing complexes packed in SSE registers as would be obviously preferable was left for the future. Luckily, most of my work on SSE intrinsics is directly applicable to that task: all the SSE{1,2} instructions were finally defined, and modifying a pre-existing primitive object type is much simpler than adding a new one.

The initial work was mostly straightforward, since, by default, the only operations on complexes are moves (to/from registers, the stack/arguments, vectors, or the heap [as boxed objects]), creation from scalar components, and extraction of components. The modifications mostly entailed modifying the way register allocation manipulates complexes (they only need a single SSE register, like scalar values), and making the boxed representation more SSE-friendly (alignment, packing single floats for MOVQ). There's only one problem I can remember in that part of the code: the stack isn't guaranteed to be aligned on 16 byte boundaries! That means that I'm forced to use MOVUPD when moving complex double floats between the stack and an SSE register.

Already, using half the registers and half the memory accesses is interesting. The real pay-off with packed complexes, however, is the ability to use vector instructions in arithmetic operations. The obvious cases are addition and subtraction (both complex-complex and complex-real), which exhibit natural horizontal parallelism: each pair of components (real-real or imaginary-imaginary) can be processed separately. For complex-real multiplication or division, things are only slightly more complex: the real operand has to be duplicated before operating in parallel. I encountered two important bugs in that code.

First, scalar values were still passed around with narrow register-register moves (movsd or movss). When we only ever manipulated scalars, that wasn't an issue. Complex-real addition and subtraction, however, really want to exploit the fact that the high part of a scalar value is filled with 0. That's not the case when a narrow move is used to move a single float value into a register that used to hold a double... Using full register moves instead (for both complex and scalar values) not only made the code correct, but also improved performance, due to their being simpler for the SSE pipeline to handle.

Second, packed single float division executes four divisions, not two. Thus, it's not enough to duplicate the real operand in a complex-real division before executing a packed division: depending on the exception mask, the 0.0/0.0 in the higher half of the result may trigger an arithmetic exception. The simplest solution I found was to perform the same operation in both halves of the operands. Obviously, an exception will be signalled iff operating only on the lower half would have signalled too. The cost is one additional shuffle step for the complex value: the real value always has to be duplicated, and the high half of the result must be reset to 0 anyway.

That leaves complex-complex multiplication, and complex-complex/real-complex division. Complex multiplication is a fairly common operation; common enough for Intel to pretty much dedicate an instruction to that operation (ADDSUBP{S,D}) in SSE3. Even when restricting ourselves to SSE2, the code is fairly simple. The most unnatural part is computing the conjugate, given the absence of ADDSUB (more on that later). Complex division, however, is a much more complicated beast. I decided to stick to an implementation in CL, but rewritten to use block (complex-at-a-time) instead of component-wise arithmetic. In the end, it still wanted an additional operation: swapping the real and imaginary components of a complex.

Finally, that leaves non-arithmetic operations, like equality tests (both bitwise and as numbers), conjugate and negation. Once there's a guarantee that the unused components in a SSE register are 0-filled (a guarantee that is supported by most load-from-memory instructions), testing for equality is very straightforward. Computing the conjugate of a complex or negating one, on the other hand, isn't so nice. One look at the way floats are laid out in register shows that they may be implemented as simple bit-wise operations. That leaves the problem of loading the relevant constants in. That's where a later patch comes in.

Negating and taking the absolute value of a real requires constants for bit-wise operations, like conjugating or negating a complex. Until now, SBCL generated code that loaded the constant in a GPR, and then moved it from the GPR into an SSE register. Generating the constant usually required some shifting (often slow), and moving data between the GPR and SSE register files takes a relatively long time on current microarchitectures. Loading such constants from memory directly into SSE registers would be preferable. However, by the time assembly code is generated, it's too late to add constants, and some of these constants would have to be boxed (be it as fixnums or bignums). In other words, both not feasible and not always a good idea.

A second patch adds support for unboxed constants stored inline, after the executable code. The system must already support code relocation, so the moving GC isn't an issue more than it already is (RIP-relative addressing on x86-64, and post-GC fixups on x86). I haven't bothered to look at the situation for other platforms yet, but, with some luck, things will be similarly simple. Storing constants in the code segments was already possible, with a small bit of hacking. What the patch mostly offers is the ability to pack the constants together. The advantage is that they can be stored far enough away from code (one cache line) to not confuse caches, and that a more global outlook allows the code generator to merge identical constants together and to reorder them to guarantee sufficient alignment without wasting too much space.

With that infrastructure in place, it was easy to generate much simpler and faster code for the previous bitwise operations on floats. It also became interesting to store some unboxed constants inline. On x86-64, I implemented that for constant floats (both real and complex) and fixnums that fit in registers, but can't be assembled as immediates (so +/- 2^29 and wider). On x86, fixnums can always be assembled as immediates, so only floats are stored as unboxed inline constants. Float constants are interesting to store inline because they're frequently found in speed-oriented code. Wider-than-fixnum constants, on the other hand, not so much. There's also a certain trade-off to consider: unboxed constants cause additional consing (and code bloat) each time they must be boxed.

Unboxed constants open the door for another set of improvements for x86oids: many instructions can load one operand directly from memory. Using that feature may save a register, and always makes the code smaller. The latter can often have a heavier impact on performance than the former. Using fewer registers mostly matters when the compiler has to spill values; that does not happen so often in inner loops on the x86-64. Smaller code, on the other hand, helps with memory (fits better in cache, needs less bandwidth) and reduces the pressure on the instruction decoder. Finally, when a constant shows up as an operand, it pretty much always pays off to store is inline, so even wider-than-fixnum integer constants are stored inline in that case.

The union of these changes transparently speeds up most complex float operations by up to ~100% on my Core 2. In the current version of Bordeaux-FFT, that translates into a 110% speed-up for 1024-point FFTs. Scalar operations also show significant improvements. Nikodemus Siivola's implementation of the (broken) mandelbrot routine at http://random-state.net/log/3452921796.html shows a 55% speed-up for hand-rolled complex arithmetic (within 20% of the execution time of the fastest g++ version). The complex-ful version shows a lesser speed-up, 38%, probably because a good portion of the runtime is spent operating element-wise to compute the 2-norm of complexes. If you have performance-sensitive code that performs a lot of floating point computations, you might be pleasantly surprised by this month's release.

P.S. Considering the size of these patches, I would probably be even more pleasantly surprised if they were bug-free. Bug reports (to sbcl-bugs, or, more realistically, directly to sbcl-devel) are always welcome, especially during the code freeze period starting today!

 

June 28, 2009

LispjobsGame developer, Gamerizon


http://www.gamerizon.com/gamerizon-careers.html

Looking for Lisp/Scheme for writing games…

 

June 27, 2009

LispjobsAuphelia Technologies, ERP


Hi everyone,

Auphelia Technologies is looking for highly skilled resources for work
in Gambit / JazzScheme, Web development and Database architecture.

Auphelia Technologies offers integrated management software, entirely
customized for small and medium-sized businesses, using innovative
technology and revolutionary features. Auphelia is located in
Montreal, Canada. See http://www.auphelia.com (under construction).

Gambit is a complete, portable, efficient and reliable implementation
of the Scheme programming language. See
http://dynamo.iro.umontreal.ca/~gambit.

JazzScheme is an open source programming language and cross-platform
framework built on top of Gambit . It includes a sophisticated IDE and
has been used for more than 10 years to develop high-quality
commercial software. See http://www.jazzscheme.org.

DESCRIPTION OF THE PROJECT

We are starting development on a new Enterprise Resource Planning
(ERP) software that we plan on making top of its class. Work will be a
complete rewrite and a major evolution of an existing product already
in use by a large client base.

Note that this project is a highly complex undertaking as we will not
be developing just another ERP software but a complete dynamic ERP
creation platform. Project is planned to last around 1 1/2 year for
the first release. Current team is composed of 10 people and is
planned to grow to around 25.

TECHNOLOGIES

After an exhaustive evaluation period, decision has been made to
develop the backend entirely in Gambit / JazzScheme. We still haven’t
decided between Qt and JazzScheme for our desktop client solution but
if the JazzScheme platform can be evolved to suit our needs, we have a
definite preference for using it accross the whole project.

EMPLOYMENT OPPORTUNITIES

We need resources for the following 4 positions :

LANGUAGE / FRAMEWORK DEVELOPPER

1 or 2 resources to work closely with Gambit and JazzScheme’s authors
to evolve the JazzScheme langage and framework during a period of
about 6 months. Planned work includes :
- Language
- Module / build system
- Performance / memory
- IDE / Remote debugger
- X11 and MacOS X frontends

Note that Gambit’s author Marc Feeley, has confirmed his participation
part time in the project!

Ideally, the resources we are looking for would have expert knowledge
in Lisp, Scheme, Gambit and JazzScheme! But realisticaly, if you don’t
know those but are brilliant, motivated and want to be part of an
exciting project that could put Scheme to the forefront, please send
us your resume!

PROJECT DEVELOPPER

1 or 2 resources to complete our existing team and work on the Desktop
client and / or Server layer. Knowledge of the following is a definite
plus :
- Lisp, Scheme, Gambit, JazzScheme
- GUI development
- Highly scalable server development
- Databases, SQL

WEB DEVELOPPER

1 resource to work on the Web client. We need a world class expert in
advanced web interfaces as the Web client will have to implement all
the functionality of the Desktop client. Knowledge of the following is
critical :
- Dynamic web development with AJAX
- Advanced web interfaces
- Javascript
- PHP

DATABASE ARCHITECT

1 resource to be lead in every database related work. Knowledge of the
following is critical :
- World class expertise in database modeling
- Databases : PostgreSQL (will probably be the first one supported),
Oracle, SQLite (for the offline mode)

Ideally we are looking for full time resources located near Montreal
but are definitely open to remote work and other options. Auphelia
Technologies offers competitive & flexible employment conditions.

PS: Please feel free to pass this announce to your friends!

Cheers,

Guillaume Cartier
gc@auphelia.com

 

June 24, 2009

Luís OliveiraIllustrating SBCL's build process

A while back I read Christophe Rhodes's paper "SBCL: A Sanely-Bootstrappable Common Lisp" which describes SBCL's bootstrap procedures.

The paper includes a bunch of diagrams for each build stage. These were pretty helpful in improving my understanding of the build process. So, I tried to take them a step further and create a single diagram that provides a global overview of the build process:

I'm interested in hearing any comments you might have. If you already know how the build process works, does it make you cringe? If you are vaguely familiar with (parts of) the process, does it provide you with some sort of new insight? Given that I haven't included a legend, does it make any sense at all?

 

Zach BeaneBay Area Lisp meeting on July 19, 2009

This sounds pretty cool:

So finally, here is the skinny on the Sunday, July 19th in the big
conference room at Franz Inc:

6:00pm ~ Meet in front of the Franz office (2201 Broadway, Suite 715,
Oakland, CA 94612). The kind folks at Franz Inc will herd us cats up
to their big conference room before 6:15.

6:15pm ~ Introduction to Clojure

Amit Rathore will give an introduction to Clojure aimed at lispers.

6:45pm ~ Milk and cookies (http://www.meetup.com/balisp/polls/189504/)

7:00pm ~ Coders at Work: The Lisp Perspective

Peter Seibel talks about the interviews from his new book Coders at
Work (http://www.codersatwork.com/),
particularly what his subjects had to say about Lisp and Lisp related
topics. Will include bits that didn't
make it into the book.

7:30pm ~ Coders at Work QA

7:45pm ~ &optional Belgian beer outside and around the corner at Lukas

Thanks for staying tuned, hope to see you there. For more info see:

http://www.meetup.com/balisp/calendar/10589286/
 

Simon AlexanderA quick note about aliasing

Two things quickly:

First: someone (pkhoung) pointed out to me that I had misremembered the atomicity of incf. We can get atomicity using sb-ext:atomic-incf, but the usual one won't lock. I hadn't worried about this much anyway when threading across rows because a) the worst thing a race condition could do was recompute a row and b) this was highly unlikely. Modifying the code to either use a mutex there or to use the atomic version is easy enough, but not really needed. But I should have double checked before adding the "atomic, so no mutex" comment....


Secondly. As I'm sure you're all happy to hear, I'm done ranting about why I think the gcls benchmark actually does a poor job of it's stated goals --- but I haven't finished talking about why approaches like that do a poor job of the computation itself. I won't have anything else to say about the particulars of the benchmark, but while last post talked a bit about addressing the very poor estimates that 50 iterations gives (and there is more to say about estimating the set-membership), there is another big problem hanging over our heads, that I've only obliquely referenced so far.

The fundamental issue is that the set boundary is the interesting part, but the Whittaker (also known as Shannon and/or Nyquist) sampling theorem tells us that we won't be able to reproduce the boundary accurately. More on that later, but here's a couple of pictures.

Consider these 3 panels:

This is a detail near the complex point (-0.74453892,0.12172418) at roughly comparable resolution to that discussed previously (e.g. the 16000x16000 case). I've double the size of all pixels though, to make it easier to see. Hopefully at the posted resolutions this is fairly visible, but you may need to click through to the image itself to see much.

The leftmost panel is what you get from a million iterations deciding set membership. The middle panel is showing, in grayscale rather than black and white, the averaged membership of 100 samples inside the pixel, and the leftmost panel is just thresholding the middle one back to a binary image so you can compare it with the 1st panel.

So this shows how aliasing is affecting the boundary, at least. But last day I mentioned how the Mandelbrot set is actually simply connected .... so we know that the above images are in some sense impossible, because this black region is isolated in the white (out of the set) background. So what's going on here?

I'll discuss this all next time, but for now, here is a larger detail of the same area. It is inverted relative to the above (whit is in set, black out of it). Range of gray values relates to how long it took points within the pixel to escape the disc of radius 2. I've messed with the image a bit though, to make things easier to see, so the densities don't correspond exactly to anything. Solid red shows the (same) central region that is considered in the set, but red circles also show points were at least one random sample in the pixel is also considered in the set.

In this case sampled a few hundred times from each one, and as you can see this wasn't terribly effective at finding the set inside of pixels. However, you can see how the brighter (but not white) pixels contain some information about where it might be.... something I'll take up later.  

June 23, 2009

Ingvar Mattsson23 Jun 2009

So, upgrades happened at home and, as occasionally happens, X stopped working. X stopping working varies from "trivial" to "annoyingly painful" to troubleshoot, especially as there was NOTHING in the xorg.N.log file to indicate what the issue was and it wedged the console to the point where "shutdown, reload" was the only way to get it back (thankfully, I could log on from another machine to do that).

In the end, it was a surprisingly simple fix, after "startx 2>&1 > trace-file" gave me the crucial bits of info. An expected symbol was not around in a dynamic linking stage and chasing that down gave a simple(ish) fix. All I had to do, in the end, was to uninstall the fglrx driver (something I installed in the first place to get working accelerated 3D primitives and direct rendering).

But, it did made me wonder, if the Xorg server can write to stderr, why can't it log the lack of a symbol to the og file? Maybe, I don't know, because that writing happens in a non-X library? I should probably have a poke at that, at some point.  

June 22, 2009

Simon AlexanderOn the limitations of dumb but fast...

Suppose you ask your friend what time it is, and they say "Right *now*, it's 12:43 and 32 seconds, 323 microseconds .... er, plus or minus about twenty minutes. You'd think they were a bit wonky, wouldn't you?

This is roughly the same reaction I have to someone describing the boundary of the Mandelbrot set on a lattice of 16000x16000 positions, with an iteration count of 50. The spatial resolution is very high, but the accuracy is very low. In fact it's only one of two (what I would consider fatal) deep flaws with this approach, but the other one will have to wait until next time.

Now one (overly) simple way to improve the estimate is to increase the iteration count. So, for example, what if we use 50 million iterations instead of 50? Back of the envelope computations tell you that (knowing that the set is about 3/8 of the total area) the fastest benchmarked code takes 6s to do 50 iterations on this grid, so it will take about 26 days to do 50 million. I think I originally used 10 million as my proxy for "lots and lots", so that's still a bit over 5 days....

Of course there is no real analysis there (clearly more than 50 is better, but is 50 million actually appreciably better than 10 million? Better than 10,000? It's actually deeper than that, but I'll discuss this more later). But that wasn't the point. The point was that if you have scale things up to get more accuracy (which you will), you'll quickly run into long runtimes. This is entirely consistent with many "real world" computational situations. Far more so that the ones that are actually cheap. For that matter, 5 days is often fine. It's when your first back-of-the-envelope computation tells you need 5 centuries that you really have to go back to the drawing board.

Bear in mind, too, that I originally had only a single core to use, so things were much, much worse. Even just looking at a small piece of the lattice was going to take hours and hours, time I didn't want to wait for.

So what to do? Giving up on the tactic of just computing as fast as we can (i.e. "dumb, but fast"), we have to step back and thing about what's happening here. You have a lattice where roughly 1/2 the area is cheap to compute, roughly 1/2 very expensive (at these high iteration counts). This sort of thing cries out for a new approach. Examples: some way to avoid blindly iterating (e.g. period detection, convergence tests), some way to do successive refinement and hence avoid spending time where it isn't needed. Some way to avoid doing some computations at all.

All of these techniques could work well, but as I was in a hurry I looked to the latter one. I remembered another theorem about the Mandelbrot set. Recall that the original iteration-based computation is all based on a theorem that if an orbit ever leaves the radius 2 disk about the origin, it will diverge. Well how about this one: Doudy and Hubbard showed in the original modern work on the set (mid 1980s) that the set is actually (simply) connected, something not at all obvious. So one practical implication is that if you draw an unbroken loop anywhere on the complex plane, if all of the points on that contour are in the set, then its interior is well, and the same for points outside the set, nearly. There are no isolated points in the set. You have to be a little bit careful here, because while the set itself is simply connected, the points outside the set includes all of the complex plane outside the disc of radius 2 at the origin. You don't want to draw a line around the entire set and conclude it is empty! (i.e the exterior is connected, the interior simply connected, the boundary, on the other hand, is weird) [update I lost part of this about the distinction originally, thanks to anon for catching it]

This immediately leads to an algorithm for spatial subdivision. Take the region you are interested and look at its boundary. If all points are in (out) of the set, record the interior points as in (out) of the set. If not, subdivide the region into a partition (e.g. four subregions) and apply the same approach to each subregion. Continue recursively until all points are covered.

Now of course in a discrete setting you stop at the point of one pixel, or in practise larger than that. Beyond that though, there is a as source of error. Because we aren't drawing continuous lines, the sampling may miss a small piece of the set, and thus erroneously conclude that the interior is solidly inside (outside) the set. There is a very practical way to mitigate this problem, but not remove entirely as I'll demonstrate.

Spatial subdivision FTW

It's pretty clear why this algorithm will win on computations if it encounters a subregion that is only withing the set. Rather than having to compute all of the interior points, it computes only the perimeter. The bigger this region is, the more you win, as the perimeter grows linearly while the area grows quadratically. The obvious way to keep a bunch of threads busy on this is to have them test a region, and if it fails, put the subregions on a stack somewhere. Then each thread just goes to the "job stack" for a new region when it needs new work to do.

However, there is still the overhead of specifying regions, and farming out these jobs to threads. The former can involve a lot of memory too. Think about our 16000x16000 case, we can subdivided it into 4 about 12 times, and each time you take 4 numbers specifying the regions, and replace it with 16 specifying 4 subregions. Take 64bit fixnums and unless I've screwed up that's about 64 Mb of potential state to be sloshed around just specifying where you are in the lattice, or about what twice the state being used to store the set values themselves (obviously this could be improved a bit, but that's a back-of-the-envelope bound on it)

Of course this isn't a terrible lot of state, but look what happened last time when I only added a function call and made it 4x slower! So while I knew that this approach would be much faster at large iterations, I assumed the overhead of managing regions would mean that with a tiny number of iterations like 50, "dumb but fast" would probably still win.

While writing up the previous post though, I had a thought. All I really had to do was partition the set enough to spread cheap and expensive subregions around enough to keep all the threads busy. Once in a region, the thread could just beaver away there by itself. As long as you don't interact spatially with other threads, you don't need any locking either. So this way I could keep all the cores busy but pay very little cost in organization. It's not as good at spreading the work around, but it will be very cheap.

So I tried this approach, and it's fast. Quite a bit faster than the current champion, actually, even at the 50 iteration case. Once you get into bigger iterations though, it really mops the floor with the "dumb but fast" approach. Here are the gcc (modified only to vary the number of iterations) case and two partitioning implementations:

This plot shows the expense (NB: in user cpu time) as you increase the number of iterations spent at each site. These are 16000x16000 lattices, loglog plot of seconds against the maximum number of iterations. So you can see that while the gcc code starts off at roughly 1/2 the cost of the lisp code, by the time you hit 1000 iterations the gap has widened to an order of magnitude, and stays there. I only ran this out to a million iterations, but that gives the final user time for the gcc code about 20x the user time of the gcls partition lisp code, and 40x the faster qtree2 (more on that to come) and wall clocks of about 12 hours, 40 minutes, 20 minutes, respectively. Wall clocks being a little skewed by other work on the machine over these longish periods. Of course my lisp code implementing a similar algorithm to the c effort would suffer the same fate, it would just be even a bit (roughly 1.75x) slower. This isn't about implementations, after all. Reporting in user time done to avoid problems with other jobs on these machines.

How fast is it on the benchmark case? Let's compare again with the c implementation, same machine and setup as last time, same declaim statement as previously. Again these are typical runs, and again we throw away the output to avoid NFS timing issues.

-bash-3.2$ /usr/bin/time -v ./mandelbrot.gcc-6.gcc_run 16000 > /dev/null
User time (seconds): 19.26
System time (seconds): 0.08
Percent of CPU this job got: 795%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.43
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 8266
Voluntary context switches: 22
Involuntary context switches: 1001
-bash-3.2$ /usr/bin/time -v sbcl --core gcls_partitioned.core --noinform --eval "(gcls/opt/qtree2 16000 16000)" --eval '(quit)' > /dev/null
User time (seconds): 8.27
System time (seconds): 1.08
Percent of CPU this job got: 650%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.44
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 271688
Voluntary context switches: 206
Involuntary context switches: 488

So you can see that the lisp code does still have a bit of a core utilization problem, only getting about 88% of the 8 cores vs. the c codes 99%. And we have orders of magnitude more page faults. But it isn't enough to catch up to the better algorithm, the c runtime is about 1.7x the lisp runtime. Actually we can do better than that. The system time is heavy because of the big cache of uint32s that we don't actually need if we're only counting to fifty. Make those chars (so max iterations is now 255) and you get:

-bash-3.2$ /usr/bin/time -v sbcl --core gcls_partitioned.core --noinform --eval "(gcls/opt/qtree2 16000 16000)" --eval '(quit)' > /dev/null
User time (seconds): 8.09
System time (seconds): 0.42
Percent of CPU this job got: 657%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.29
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 101670
Voluntary context switches: 105
Involuntary context switches: 449

So were getting pretty close to a 2x multiple of the c code. We could improve things a bit more, especially since the system time is still 1/3 of wall time. I also bet there are tricks there I've missed. But I think the points been made; I don't really care on the low iteration count, though it is a bit amusing that it actually wins there as well. But as we move forward a bit trying to generate a good approximation to the set, this stuff will be useful.

Before anyone complains, I did mention the possible source of error, right? Now while the output for the 200x200 at 50 iterations case is identical (which means, unless I've misunderstood, this code would be a valid entry, not that it's important), with 16000x16000 at 50 iterations and this partitioning scheme there are 3 pixels out of 256000000, or about one in one hundred million. To put this in perspective, changing the number of iterations by one, either to 49 or 51, will change on the order of 100,000 pixel values. Note also that this doesn't mean that either algorithm is right about those 6 pixels, only that they differ on a poor estimation of them. The divergence is bigger, and more interesting, with better estimates, but even at 1000000 iterations they only differ by about 100 pixels at this resolution (note results can depend a bit on partitioning, obviously). This error is very small compared to other errors being made here (and still an order of magnitude smaller than the difference between 100000 iterations and 1000000).

So does it do exactly the same thing in less time? No, it doesn't. But it does a comparable job in much less time, which in effect means it can do a much better job in the same time. Which, after all is far more often what you're actually interested in this sort of work.

Of course, the down side is that even at 10,000,000 iterations the output of either method is still kind of crap. Next time, I'll explain why. It may be a while though, since I'm traveling soon.

Anyway, here is some code. "gcls partitioning" is a fairly clean implementation of the algorithm discussed above. "qtree2" is tweaked for a gain of about 2x. Since they are very similar, I'll only show the latter with all its warts. Bear in mind a lot has been sacrificed for speed in this case, and it hurts readability and generalizability. I'm not going to clean it up, sorry.

The first thing is the function count-iterations. I won't include it here because it is identical to iterate/df except that it returns the iteration count, [b]not[/b] zero or one. Make sure it is inlined.

(defun gcls/opt/qtree2 (width height &key; (max-iters 51) (nthreads 8) (partitions 10) (min-size 4)
(real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(declare (type fixnum width height max-iters min-size) (double-float real-min real-max imag-min imag-max))
(let* ((array (make-array (list height (ceiling width 8)) :element-type '(unsigned-byte 8)))
(done (make-array (list height width) :element-type 'bit :initial-element 0))
(cache (make-array (list height width) :element-type '(unsigned-byte 32) :initial-element 0))
(seq (make-array (array-total-size array) :element-type '(unsigned-byte 8) :displaced-to array))
(regions)
(region-mutex (sb-thread:make-mutex))
(real-step (/ (- real-max real-min) width)) ;;with fencepost error in spec
(imag-step (/ (- imag-max imag-min) height)))
(declare (type double-float real-step imag-step))
(labels ((set-bit (i j v)
(multiple-value-bind (byte bit) (truncate j 8)
(declare (fixnum byte bit))
(if (zerop v)
(setf (aref array i byte) (logandc2 (aref array i byte) (ash 128 (- bit))))
(setf (aref array i byte) (logior (aref array i byte) (ash 128 (- bit)))))
(setf (aref done i j) 1)))
(result-at (i j)
(when (zerop (aref done i j))
(iterate i j))
(multiple-value-bind (byte bit) (truncate j 8)
(if (logand (aref array i byte) (ash 128 (- bit))) 1 0)))
(iterate (i j)
(if (zerop (aref done i j))
(let ((res (count-iterations (+ real-min (* j real-step)) (- imag-max (* i imag-step)) max-iters)))
(declare (type (unsigned-byte 8) res))
(setf (aref cache i j) res)
(set-bit i j (if (= res max-iters) 1 0))
res)
(aref cache i j)))
(compute-region (i-min i-max j-min j-max)
(loop for i from i-min upto i-max
do (loop for j from j-min upto j-max
do (result-at i j))))
(fill-region (i-min i-max j-min j-max val)
(loop for i from i-min upto i-max
do (loop for j from j-min upto j-max
do (set-bit i j val))))
(quadtree (i-min i-max j-min j-max)
(declare (type fixnum i-min i-max j-min j-max))
(if (or (<= (- i-max i-min) min-size) (<= (- j-max j-min) min-size))
(compute-region i-min i-max j-min j-max)
(let ((proto (iterate i-min j-min)))
(if (and (loop for i from i-min upto i-max always (= proto (iterate i j-min)))
(loop for i from i-min upto i-max always (= proto (iterate i j-max)))
(loop for j from j-min upto j-max always (= proto (iterate i-min j)))
(loop for j from j-min upto j-max always (= proto (iterate i-max j))))
(fill-region i-min i-max j-min j-max (if (= proto max-iters) 1 0))
(let ((i-mid (+ i-min (truncate (- i-max i-min) 2)))
(j-mid (+ j-min (truncate (- j-max j-min) 2))))
(quadtree i-min i-mid j-min j-mid)
(quadtree i-min i-mid j-mid j-max)
(quadtree i-mid i-max j-min j-mid)
(quadtree i-mid i-max j-mid j-max))))))
(worker ()
(loop for p = (sb-thread:with-mutex (region-mutex) (pop regions)) until (null p)
do (apply #'quadtree p))))
(setf regions (part2d 0 height 0 width partitions))
(mapc #'sb-thread:join-thread (loop repeat nthreads collecting (sb-thread:make-thread #'worker)))
(with-open-stream (stream (sb-sys:make-fd-stream (sb-sys:fd-stream-fd sb-sys:*stdout*)
:element-type '(unsigned-byte 8) :buffering :full :output t :input nil))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) stream)
(write-sequence seq stream)))))

So you can see in addition to the spatial partitioning, I'm remembering if I've already computed a value and not recomputing if I can avoid it. The whole thing is pretty hacky, and I'm not going to go over all the decisions that went into tweaking this. I iteratively improved three times, getting about a factor of 2 each time, but just looking at the code I realize it may not be obvious why things were done this way. Not also the highest possible iteration count here is 2^32 -1, but that's probably enough!

The other thing missing is part2d, a function that takes a region (a b c d) and breaks it into subregions. Here's an implementation:


(defun part2d (i-min i-max j-min j-max steps)
(declare (type fixnum i-min i-max j-min j-max steps))
(labels ((partition (a b steps)
(declare (fixnum a b st eps))
(loop for n of-type fixnum downfrom (1+ steps) above 0
for i of-type fixnum = a then (+ i (truncate (- b i) n))
collecting i)))
(loop repeat steps
for a on (partition i-min i-max steps) nconc
(loop repeat steps
for b on (partition j-min j-max steps)
collect (list (car a) (cadr a) (car b) (cadr b))))))


Obviously in this case the speed of the function isn't really important as it only is used in setup.

Now it is important that I am comparing iteration counts, not in/out of the set judgements here. This is a more conservative approach, and should reduce the chance of chopping off what seems to be an isolated section of the set (obviously it can't really be isolated). More on that in another post.

In a cleaner version of this sort of approach, though, this would also replace the quadtree decomposition, and you really are probably better off using bounded regions, i.e. [a,b)x[c,d) instead of what I've used here [a,b]x[c,d] ... but this way generated faster code.


So there you go, a fast-and-hacky version of something that can even overcome the formidible problems outlined in the previous post. My core utilization here is lower than the benchmark's c code, but I'm still running about twice as fast there, and once we go up to more reasonable iteration counts (i.e. > 1000) it's really no contest. In fact, this can do more than twice the iterations in the same time -- resulting in a vastly improved approximation of the set for the same computation time. And we're still easily in the short and simple program range. Above is about 150 lines or so, which could be shorter, and also a bit simpler without sacrificing much speed.

But is set approximation of 16000x16000 even at 1000000 iterations a good one? No, it isn't. But we're getting better. Next time I'll talk about what's (still) being done wrong, and why. And a little bit about approaches to do better.  

June 21, 2009

Gábor MelisActive Learning for cl-libsvm

Along the lines of active learning with python & libsvm, I added support for calculating distance of a point from the separating hyperplane to cl-libsvm. In binary classification there is only one SVM involved and one hyperplane. However, with N class problems there is a binary SVM for each of the N*(N-1)/2 pairs of classes and there are as many separating hyperplanes, something the linked python code fails to take into account. As per the libsvm FAQ, the absolute value of the decision value (see PREDICT-VALUES, wrapper of svm_predict_values) divided by the norm of the normal vector of the separating hyperplane is the distance. PREDICT-VALUES and MODEL-W2S are sufficient to calculate it. Note that among the distributed binaries only the linux-x86 version has been recompiled with the necessary changes, but patched sources are also included for your recompiling pleasure.
 

Zach BeaneRIP, Erik Naggum

Erik Naggum was the first person I killfiled in GNUS. His style was sometimes shockingly blunt and aggressive. After a while, though, I realized I was missing out, and I came to treasure the information and insight in his messages.

I learned yesterday that Erik died. I'm sorry to hear it; I occasionally contacted him to clarify or expand on some technical matter he wrote about in the past, and he was always helpful. I thought I would just be able to do that whenever I wanted, but now it's too late.

His death has, not surprisingly, led some people to go through the same initial experience I had, seeing some blunt and shocking language and wondering why anyone would care about its author. Here are some links that I hope show a small part of Erik's contributions to knowledge.

I think a newcomer would benefit from reading two in particular:

Here are the rest, taken from my bookmarks:

Lisp

Unix solutions vs. Lisp solutions for the same problems

A lengthy explanation of types as they relate to CL

How much use of CLOS?

Programming in Lisp, delivering in some other language

A cute read macro dispatch scheme

Lazy-loading with SLOT-UNBOUND

Using CHANGE-CLASS for object "deletion"

"Unix quality" vs "Lisp quality", with sockets as an example

"if you can't outperform C in CL, you're too good at C. " (see the whole thread for details)

Alist vs. plist

What Lisp could take from C

Destructors, finalizers, weak pointers

Kitchen hygiene compared to Lisp hygiene

Design patterns for Lisp

Misc

An introduction, written just two months ago

The Long, Painful History of Time

The oil industry in Norway is really big

The "Norwegian Dream" (vs the American Dream) is to win the lottery

"most everything worth doing is associated with effort and some pain"

"Western culture is favorable to mediocre people and hostile to smart ones"

Core ideas behind SGML and XML

Feedback loops of lisp, reward, punishment, psychotic environments

"the market does not in fact lead anything or anywhere"

"Just let other people have their desires and needs. Do not let them affect yours."

The purpose of a newsgroup

"Which is the best car? How do you choose?"

"One general concept of the free exchange of information on the Net is that people are equals in principle and that their differences are the nothing more than accidents of time."

Learning new things

"How come people with the most misguided political ideas believe revolution is the answer and people with reasonable political ideas manage to succeed in slowly transforming their society to their liking? Please think about it."

"If you have to subordinate your defense of truth or what you believe in to who else believes it, I seriously suggest you rethink your value system."

A tribute to Yuri Rabinsky

Adapting emacs for rapid prose editing

"It is a really bad idea to believe that one can learn to get it right from doing it wrong many times."

The purpose of higher education

Erik's computer-oriented biography

Books in Erik's library

Tokenizing/parsing

A bibop-style memory management scheme

"people seem unable to get over the fact that they no longer want to use a language and just move on to something better"  

June 20, 2009

Tobias RittweilerErik Naggum, RIP.

It's sad but seems to be true: Erik Naggum was found dead in his home.

Last September, I wrote him a mail to thank him for the wealth of postings
he had contributed to comp.lang.lisp and for the impact they had on me.
I now feel incredibly glad I did that as I had always postponed doing so
for years.

Following is his response.

RIP, Erik.



* Tobias C. Rittweiler (2008-09-19 00:37)
> for a long time I've wanted to send an e-mail to express the
> gratitude I feel for the countless usenet postings that you have
> written in your comp.lang.lisp history. Now I finally came around
> doing so:

It's been quite a while since I posted my last article to c.l.l, but
very warm and welcome messages like yours still keep coming in at a
rate of about one a week. It amazes me.

> Thank you for the time and energy you spent in writing so many
> sophisticated and thought-provocating articles. I never understood
> where you drew this massive amount of energy and the will to
> continue from, considering the feedback you received back at that
> time.

Even when I was posting at a high rate, I received more mail from
those who appreciated what I had posted. I needed that encouragement
and knowing that despite the increasing amount of insanity on the
newsgroup, lots of people were reading what I wrote with a positive
attitude. However, there are some people whose evil ways are truly
destructive, and those are the moralists and the punishers who are
utterly unable to produce anything good at all, but all the more
hell-bent on making others pay for such things as not living up to
their expectations, doing what they think is right, etc, and I drew a
considerable amount of energy from my life-long desire to rid the
planet of moralists and punishers. It may sound contradictory, but I
hold that the use of force and violence and the lesser evils of
moralizing and punishment should only be used to prevent any of them,
never to seek any other goal: Telling a moralist that moralizing is
wrong, punishing those who punish others, using force and violence to
stop those who initiate the use of force and violence — you get the
picture — is the right thing to do. Unfortunately, this gets the
moralizers and punishers all worked up, and they prove that they
really are the psychopaths they only appeared to be when someone had
the gall to do something against /their/ desires. So even though my
general outlook on life went under-appreciated, namely that if you
know what other people should have done in the past, you never have
any clue what you or anyone else ought to do in the future, and if you
are concerned with what you yourself ought to do in the future, you
generally leave other people alone to figure out what /they/ ought to do
in the future, too, lots of people picked up on the positive message:
Exposing moralists and punishers for what they really are —
psychopaths — does a great service to any community who suffers from
their pernicious effects. Of course, if you are a horribly bad person
like that, you can only see others as reflections of yourself, and you
never understand why anyone would want to moralize against or punish
/you/, because part and parcel of being rabidly insane is never being
able to see yourself from the outside, and you think insane thoughts
like "How /dare/ anyone moralize against or punish /me/, when I'm the sole
moral authority in the whole Universe and everyone, everywhere have a
moral duty to behave the way /I/ tell them!". Or, in other words, people
who partition the world population into "the good" and "the bad"
always make the mistake of believing that they fall into the "the
good" partition, when they are actually among the very few that are
truly evil: No monumental evil act in the history of mankind has been
committed by anyone who thought of themselves as "evil" — on the
contrary, the worse the (objective) evil, the more the perpetrator was
completely convinced of the goodness of himself and of his
"purification". So when the newsgroup became plagued by the evil kind
of moron that has nothing to contributed and no rewards for anyone
doing the right thing, but only harm and punishment for those who do
the wrong thing in their eyes, it was time to quit. Had I had even
more energy, I could have stayed, but I had ran out of steam fighting
false accusations, which is another one of those hallmarks of truly
evil people who think nothing of harming the innocent in their crusade
against the "evildoers". It turns out, as any study of history will
show, that those accused by moralists and punishers are always
innocent. That's why we need courts, so those who accuse are not the
ones to decide on the guilt and the punishment. Stupid people tend not
to grasp this fact, and only see courts as means of letting people
they "know" are guilty avoid punishment.

> And I particularly mean your non-technical contributions. I read all
> those now already several years ago when I was sixteen if I'm
> remembering correctly. I recall how I was shocked at the tone you
> used, and the aggression you seemed to feel towards people, as I
> grew up in feel-good communities myself. [The postings] made me value
> technical expertise and competence over civil masquerade, and they
> revealed how the latter can sometimes even impair the communication
> for the former.

Civility and politeness are extremely useful tools in communication
with people who are more wrong than right, but of very little use with
people who are vastly more right than wrong. This counter-intuitive
observation comes directly from the fact that we simply do not need
civil and polite ways of telling people that they are right about
something. So the people who have most to gain from civility and
politeness are people who know they are and intend to /stay/ wrong while
they force everybody else hold their tongues. That may have been a
very good way of building societies before /anyone/ was usually right
about anything. It is only in the twentieth century that a sizable
fraction of the population had any means to know whether they were in
the right or in the wrong to begin with. Before we invented the
concept of the real world, everybody lived their entire lives in their
own emotional world. After science and technology invented the concept
of the real world and of truth as correspondence between thoughts and
reality, the internal, private world turned out to be /untrue/ almost
all the time. These days, I keep telling people that you only /really/
grow up and become a human being (as opposed to a mere animal) when
you realize that most of what you think and almost all you feel is
/wrong/ and every person, however smart or highly esteemed by their
peers, is utterly and completely incapable of determining where they
are right among all this wrongness on their own. We tend to believe we
are mostly right, however, and only notice when the consequences of
our actions contradict our best expectations. Now, there are many
areas of life where there /is/ no way to sort right from wrong, and it
would certainly be impolite to point out an error that was only
relative to our own personal values, which is where the impoliteness
of moralizing comes in, but modern man enjoys a growing number of
areas where we can unequivocally sort right from wrong, and then it is
impolite /not/ to point out an error, for that means we let someone
believe something that will cause them harm, or at least undesired
consequences, later on. This means that in the areas of life where
people are mostly wrong, it is indeed a good thing to be civil and
polite all the time, as one wouldn't want others constantly to point
out one's own mistakes, either, but in the many areas where it is
possible to be right, and positively harmful to be wrong, allowing
people to hold on to false beliefs in order to protect their feelings
is really, really bad for everyone. The key, therefore, is knowing
which areas can and which still cannot tell right from wrong. It is my
firm position that no areas of science, technology, engineering,
mathematics, and allied disciplines such as medicine, are proper
arenas for politeness and civility. If people are wrong and are
spreading dis- or misinformation in these fields, everybody hurts
because it becomes that much harder to know right from wrong. However,
in areas where no one can really tell, such as ethics, politics,
fashion, etc, even though we may have pretty good ideas and
communities who /choose/ a particular set of beliefs, it behooves people
to be humble and civil and polite because fighting with people who are
wrong, but believe they are right, yet there are no means to prove
that, would be extremely tiresome, as it became on c.l.l when people
who stopped thinking about issues where we /can/ decide right from
wrong, started to bother everyone with their /personal/ problems when
others disagreed with them.

> In retrospect, I'm pretty sure that you helped me become the
> individuum I am now.

I'm very pleased to hear that, and particularly that you took the time
to write and tell me. I wish you the best of luck for the future!

Best regards, Erik Naggum
--
Member of AAAI AAAS ACM AMS APS ASA ASL IEEE IMS MAA NYAS PSA SIAM USENIX
The United States of America still symbolizes individualism, rationality,
and intellectual achievement to me — even though most Americans disagree.
 

June 18, 2009

Simon AlexanderBlather about FPU optimization & threading with SBCL

Still riffing on this Mandelbrot set stuff (bear with me), I find myself in the strange position of having to demonstrate a decently performing version of an implementation for the benchmark I don't think is very good, to avoid claims of unfair comparisons. Ok, fair enough. I'll do this and use it as an excuse to describe a bit at least in an SBCL-specific way, what might be done. I'll have a few comments about what it reveals about the benchmark, too.

Last time I gave a perfectly straightforward implementation of the main loop:

(defun iterate-point (c max-iterations)
(loop for n below max-iterations
for z = (complex 0 0) then (+ (* z z) c)
when (> (abs z) 2) do (loop-finish)
finally (return n)))

This is the bit of code doing most of the work.

In order to get a reasonably fast version, we need to do two things. First is to give the compiler enough information to do what optimizing it can (in SBCL's case, this is quite a decent job). The second is to make an observation that your compiler almost certainly isn't clever/devious enough to see --- the fact that the complex product (* z z) and the computation of (abs z) share common sub computations. I'm also, just for the sake of the benchmark, going to stop returning the number of iterations and return 0 for not in the set, 1 if it's in the set.
Here it is:

(defun iterate/df (re0 im0 bound)
(declare (type double-float re0 im0) (type fixnum bound))
(do ((n 1 (1+ n)) ;the first iterate always takes z->c, so start there
(re re0) (im im0)
(re^2 (* re0 re0) (* re re))
(im^2 (* im0 im0) (* im im)))
((>= n bound) 1)
(declare (double-float re im re^2 im^2) (fixnum n))
(when (> (+ re^2 im^2) 4d0) (return 0))
(psetf re (+ re0 (- re^2 im^2))
im (+ im0 (let ((a (* re im))) (+ a a))))));faster than (* 2 (* re im))

Loop syntax is nice for some things but can make it hard to see exactly what order things are happening in, etc., so I often prefer do for things like this. Note that the loop is starting from n=1 because properly the first iteration is always 0. So I've declared types and unrolled the computation so the commonality of (* re re) and (* im im) is exposed and not recomputed. This shows what the computation boils down to. You need a counter, a bound for it, and six double floats (re,im, re^2, im^2, the constant 4.0 and the temporary a). This last one I've added just because I knew on my platform SBCL would probably generate better code for a product and sum, rather than (* 2 re im). It might not, on your implementation. SBCL could be smarter about things like this though, so it's often worth checking. At the beginning of the loop we have two integer-float products and two float-float adds to set up re0 and im0, which costs next to nothing.

This computation can live on the chip on most systems you might try this with, which is getting towards the holy grail of FPU speed. Anything we can get in-chip like this will be damn fast. To do the absolute best, we have to start looking at things with pipeline utilization and stalls, loop parallelization or unrolling (if we've got some registers free), tricky, tedious, and often machine specific stuff. But we've already come a long way, and what we have here is basically the simplest case you can possible hope for. And in "real world" numerical code, you rarely get it.

Someone pointed out that I hadn't originally talked about threading, and I commented that this stuff is trivially threadable too. This is because we can keep good memory localization etc. by putting a thread on computing one row, but keep the individual thread busy without any possibility of interacting, because they live on separate rows. It literally could not be any simpler to thread than this, we don't even need a mutex.

So lets do both; you've got the central loop, to thread it we just have to make a minor change. I'm also going to amortize the cost of packing the bits over the whole computation, and output the .pbm file to std out. This is a bit unnatural for lisp, but allows a side by side comparison as per the benchmark so there is no question we are measuring the same things. I found that SBCL specific std-output stream approach from the benchmark itself, as there was an SBCL entry (thanks, guys!)

(defun gcls/opt (width height &key; (bound 51) (nthreads 8) (real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(declare (type fixnum width height bound) (double-float real-min real-max imag-min imag-max))
(let* ((array (make-array (list height (ceiling width 8)) :element-type '(unsigned-byte 8)))
(seq (make-array (array-total-size array) :element-type '(unsigned-byte 8) :displaced-to array))
(rows -1)
(real-step (/ (- real-max real-min) width)) ;; with shifted sampling as spec image
(imag-step (/ (- imag-max imag-min) height)))
(declare (type fixnum rows) (type double-float real-step imag-step))
(labels ((worker ()
(loop for i = (incf rows) while (< i height) ;incf atomic, no mutex needed
for im0 of-type double-float = (- imag-max (* i imag-step)) do
(loop with byte of-type fixnum = 0
with code of-type (unsigned-byte 8) = 0
with b of-type bit = 0
for bit-number of-type fixnum upfrom 0
for j below width
for re0 of-type double-float = (+ real-min (* j real-step)) do
(setf b (iterate/df re0 im0 bound))
(if (= bit-number 8)
(setf (aref array i byte) code code b byte (1+ byte) bit-number 0)
(setf code (logior (ash code 1) b)))
finally (unless (= bit-number 1) (setf (aref array i byte) code))))))
(mapc #'sb-thread:join-thread (loop repeat nthreads collecting (sb-thread:make-thread #'worker))))
(with-open-stream (stream (sb-sys:make-fd-stream (sb-sys:fd-stream-fd sb-sys:*stdout*)
:element-type '(unsigned-byte 8) :buffering :full :output t :input nil))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) stream)
(write-sequence seq stream))))

Again with appropriate type declarations (this is starting to look a bit like C code, how surprising). If you throw those two functions in a file with a "(declaim (inline iterate/df) (optimize speed (safety 0) (compilation-speed 0) (space 0) (debug 0)))" at the top, you'll have the exact code I'm timing. It's actually a little bit faster (and a shorter, obv.) if you expand the iteration inside the worker function and get rid of iterate/df entirely, but only slightly and that will obscure a point I want to make.

Now lets see where we are. I don't have any machines identical to the quad-core the benchmark uses so this won't be exactly the same, but I do have a bunch of 8 core Xeons available. I'm going to ignore the file writing here, both because it's not the key area but more because the Xeons are all on NFS storage only. Writing the file locally on my slower machine takes less than 1/10th of a second, which is in the noise over repeated runs so this won't shift anything much. [Note: in all cases here timings shown are typical over many runs, but I'm not averaging them and looking at the variance, it's not important to the discussion here as all of the timings were very stable. Of course to compare over a range of inputs or whatever we'd average a number of runs...]

-bash-3.2$ SBCL --eval '(load (compile-file "gclsopt.lisp"))' --eval '(save-lisp-and-die "inline.core" :purify t)'
-bash-3.2$ /usr/bin/time -v SBCL --core inline.core --noinform --eval "(gcls/opt 16000 16000)" --eval "(quit)" > /dev/null
User time (seconds): 35.38
System time (seconds): 0.11
Percent of CPU this job got: 782%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.53
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 22260
Voluntary context switches: 57
Involuntary context switches: 2817

So we are getting a respectable 98% or so utilization, and a cumulative time of a little under 5s on a basically idle machine. Looks pretty decent.

It really is easy to underestimate how cheap 50 iterations of a small loop like the above is on modern hardware. If you can stay out of memory, and keep out of trouble in the pipelines and context switches, you're golden. Which isn't to say that my code couldn't be faster (it obviously could!) but if there is another factor of 2 or 4 to be had it is necessarily through keeping the pipes all busy and/or being clever about looping over multiple iterations at the same time, that sort of thing. So lets try a little experiment; I've made only one small change. The call to iterate/df is now not inlined within the main loop, so every pixel has a function call before the 50 iterations:

-bash-3.2$ /usr/bin/time -v SBCL --core notinline.core --noinform --eval "(gcls/opt 16000 16000)" --eval "(quit)" > /dev/null
User time (seconds): 65.12
System time (seconds): 8.92
Percent of CPU this job got: 412%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.97
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 164098
Voluntary context switches: 522128
Involuntary context switches: 12716


Ouch! User time nearly doubled, our respectable 98% core utilization plummeted to below 50%, wall time is 4x slower. Page faults and context switches tell the story, I think. SBCL doesn't have particularly expensive function calls or anything but this all really adds up. Fundamentally, the FPU's are starving here. This has nothing to do with SBCL, you'll see similar affects if you reproduce it in other languages. The fact that it's a function call isn't so important, the fact that it's got to come out of the chip is.

Now accepting for the moment that a) the output set is a reasonable goal as stated and b) that the reason we went from 200x200 by 50 iterations up to 16000x16000 at 50 iterations was simply to have enough work to do, consider this: Why not go to 1600x1600 at 5000 iterations? Same nominal change in back-of-the-envelope terms, right? In fact, since the area of the set is something like 1.5 (out of the total of 4) we have to adjust that a little bit. So lets go 2000x2000 and perform the same comparison:

-bash-3.2$ /usr/bin/time -v sbcl --core inline.core --noinform --eval "(gcls/opt 2000 2000 :bound 5001)" --eval "(quit)" > /dev/null
User time (seconds): 41.81
System time (seconds): 0.11
Percent of CPU this job got: 787%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.32
Minor (reclaiming a frame) page faults: 14060
Voluntary context switches: 58
Involuntary context switches: 3305
-bash-3.2$ /usr/bin/time -v sbcl --core notinline.core --noinform --eval "(gcls/opt 2000 2000 :bound 5001)" --eval "(quit)" > /dev/null
User time (seconds): 42.04
System time (seconds): 0.13
Percent of CPU this job got: 784%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:05.37
Minor (reclaiming a frame) page faults: 19336
Voluntary context switches: 2178
Involuntary context switches: 3346

Of course results will vary with different hardware a bit but fundamentally the issue is the same. The function calls here act as a proxy for almost anything else you might want to do that will touch memory, etc. Once you can get the computation on the chip like this for such a small number of iterations, lots of other things may not look worth trying at all. Not that there isn't any hope of doing better (I'll discuss later, or next time) but it sure can look that way.

So even if you have no interest in my discussion of this particular benchmark, I hope this illustrates some issues to consider when threading numerical code. Of course, once we've reached the simplicity of the localized loop, there really isn't much new to say. If you know anything at all about fast numerical code of this type, you would have predicted something like this, and you already have a good idea of at least some of the languages that can do a decent job at it. On the other hand if you are inexperienced in the area, I don't think a good way to learn about it is inside a computation that is only nominally doing something useful --- but doing a lousy job of it. Particularly since this sort of extremely simple and localized loop is exactly the sort you usually can't get in practice, or have to fight quite a bit to get. This Mandelbrot computation gets there trivially, but it does it under false pretenses.

So this is in my opinion another real weakness of the benchmark. It could have had a much more interesting setup though, and another post will explore how you might go about making a decent approximation to the Mandelbrot set boundary at high resolution, and what the real challenges are, and why all of this is a lot more interesting than "if I survive fifty iterations of this simple loop, I'm in".

So how good in my above lisp code? Let's compare with the current champion of that benchmark. Again throwing away the files to avoid the network file system skewing things:

-bash-3.2$ /usr/bin/gcc -pipe -Wall -O3 -fomit-frame-pointer -march=native -D_ISOC9X_SOURCE -mfpmath=sse -msse2 -lm -lpthread mandelbrot.gcc-6.c -o mandelbrot.gcc-6.gcc_run
-bash-3.2$ /usr/bin/time -v ./mandelbrot.gcc-6.gcc_run 16000 > /dev/null
User time (seconds): 19.38
System time (seconds): 0.08
Percent of CPU this job got: 795%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.44
Minor (reclaiming a frame) page faults: 8173
Voluntary context switches: 16
Involuntary context switches: 1605

I'm just using the flags from the benchmark run, perhaps it could do with some tweaking. So this puts my lisp code at roughly 1.85x the fast c code, which is about where I expected to be (within a factor of 2). My utilization is lower than it could be, and I expect it could make up a bit of that with tweaking the code, maybe even reach 1.5 or so if you cared to put the time in. Targets here were 64 bit, gcc-4.3.0 and SBCL 1.0.29, running on 8 by 3Ghz E5450's.

So there you go, pretty straight forward approach to get reasonably fast lisp code to do a bad job of the Mandelbrot set. More importantly, I hope I've made clear why the specification makes the overwhelming priority of this benchmark getting your compiler to generate decent floating point code on that tight inner loop, which I think is fairly uninformative on a cross language comparison, even if it were doing something more useful. Perhaps you differ on that.

A comment about optimizing numerical code in lisp: This is fairly typical of the way I approach it. First a simple, clean and clear implementation I can understand and test for correctness. If it's too slow, after profiling some tweaked functions (like the iterate/df) to take care of the worst of it. Once I've settled on an approach, I'll sometimes use the approach like above of pulling the details into labels functions inside a main function. Especially if I've got problem specific tweaks that aren't very nice, but this also make the job easy for the compiler without too many declarations. The down side is if you are debugging at this point, because it may be hard (due to inlining) to see where the problem actually happened. So about 30 minutes of coding time and way more than that to write this up. Sigh.

Hopefully the optimization tweaks and/or SBCL specific threading stuff is useful to someone. I'll probably discuss a bit the issues of threading when it isn't this trivial, at some point. On the tweaking numerical code front, one perhaps not obvious thing you can do to this is to interleave two iterations. This often helps keep the pipelines full. In this case, we get about a 5% improvement using something like this:

(defun iterate/df/double (a-re0 a-im0 b-re0 b-im0 max-iter)
(declare (type double-float a-re0 a-im0 b-re0 b-im0) (fixnum max-iter))
(do ((n 1 (1+ n ))
(a-re a-re0) (a-im a-im0) (a-re^2 0d0) (a-im^2 0d0)
(b-re b-re0) (b-im b-im0) (b-re^2 0d0) (b-im^2 0d0)
(mask 0) (test 3 (logxor mask 3)))
((or (>= n max-iter) (zerop test)) test)
(declare (double-float a-re a-im a-re^2 a-im^2 b-re b-im b-re^2 b-im^2)
(fixnum n) (type (integer 0 3) mask))
(unless (zerop (logxor mask 2))
(setf a-re^2 (* a-re a-re) a-im^2 (* a-im a-im))
(when (> (+ a-re^2 a-im^2) 4d0) (setf mask (logior mask 2))))
(unless (zerop (logxor mask 1))
(setf b-re^2 (* b-re b-re) b-im^2 (* b-im b-im))
(when (> (+ b-re^2 b-im^2) 4d0) (setf mask (logior mask 1))))
(psetf a-re (+ a-re0 (- a-re^2 a-im^2)) a-im (+ a-im0 (let ((a (* a-re a-im))) (+ a a)))
b-re (+ b-re0 (- b-re^2 b-im^2)) b-im (+ b-im0 (let ((a (* b-re b-im))) (+ a a))))))

And the appropriate modifications to the "worker" function to operate on two consecutive points in a row, will give you something like:

-bash-3.2$ /usr/bin/time -v sbcl --core double.core --noinform --eval "(gcls/opt/double 16000 16000)" --eval "(quit)" > /dev/null
User time (seconds): 33.81
System time (seconds): 0.11
Percent of CPU this job got: 782%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.33
Minor (reclaiming a frame) page faults: 22190
Voluntary context switches: 59
Involuntary context switches: 3005

So at this point we're sitting at 1.75 of the current champion c code. This is nearly a 30% improvement on the existing SBCL entry which gives times of about 6.05s on my machine:

-bash-3.2$ /usr/bin/time -v SBCL --core SBCL.core --noinform --eval "(main)" --eval "(quit)" 16000 8 > /dev/null
User time (seconds): 40.55
System time (seconds): 0.12
Percent of CPU this job got: 671%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.05
Minor (reclaiming a frame) page faults: 25138
Voluntary context switches: 66
Involuntary context switches: 2682

But it may do a bit better on fewer cores, as you can see utilization is a bit of a problem for it here at 84% or so. This just illustrates how much difference you may see in superficially similar code, if you are trying to keep a bunch of cores happy! I'll also note that I wrote my original code (this is a little tweaked relative to that) without looking at any of the implementations, but the narrow logic of the spec. led us to very similar implementations even line by line, which I suppose is also evidence for one of my complaints --- artificial narrowness isn't a good thing in these.

That's nearly the last I'll have to say directly about the "Great Computer Language Shootout" (and there was much rejoicing, I'm sure). Next time though, I'll show you how a different algorithm can walk all over this one, and how that relates to more interesting computations.  

Ingvar Mattsson18 Jun 2009

As Pierre Mai so eloquently writes, the evaluation order corner case is explicitly covered as "it depends" by the Standard, so any code that depended on it is, well, relying on implementation-specific details.</a>  

June 17, 2009

Pierre MaiEvaluation Order in Function Forms

In this blog post Ingvar Mattsson was wondering about the printed line resulting from evaluating:

(defun frob (x) (format t 'Frob: ~a~%' x))
(frob (defun frob (x) (format t 'New frob: ~a~%' x))

This provides an example where I think the ANSI Common Lisp standard is actually very helpful, since it often tries to go out of its way to point out things that are explicitly left undefined, instead of simply leaving them left undefined by not defining them, as many other standards (out of fear of being ambiguous) do. Quoting from the HyperSpec, Section 3.1.2.1.2.3 Function Forms:

Although the order of evaluation of the argument subforms themselves is strictly left-to-right, it is not specified whether the definition of the operator in a function form is looked up before the evaluation of the argument subforms, after the evaluation of the argument subforms, or between the evaluation of any two argument subforms if there is more than one such argument subform. For example, the following might return 23 or 24.

(defun foo (x) (+ x 3))
(defun bar () (setf (symbol-function 'foo) #'(lambda (x) (+ x 4))))
(foo (progn (bar) 20))

Thus we can see that the effect of evaluating the orginal two forms is indeed undefined (as already suspected by Ingvar Mattsson) as to which function definition is called in the second form, without going to the trouble of trying to take into account possible differences between evaluation and compilation, compile-time effects of defun, etc.

Which is my long-winded way of saying a big thank you to the people involved in creating the ANSI Common Lisp standards document (with special thanks to all involved in creating, releasing and maintaining the HyperSpec online text equivalent thereof, especially of course Kent Pitman), which is one of the nicest language standards I have had the pleasure of working with and against.

 

LispjobsAnvita-Health, San Diego


Anvita-Health of San Diego California is seeking programmers to develop and maintain its health care decision support tools.
Prerequisites for the position include:
An advanced degree in Computer Science from a US or UK university
Strong C++ skills; Lisp programming a plus
At least 2 yrs of programming experience
Web skills a plus (PERL, PHP) and familiarity with web UI tools
A willingness to relocate to San Diego, CA
Salary and Benefits commensurate with experience.
Interested persons should send CV to:
Sheldon Ball MD, PhD
Physician Informaticist
Anvita Health
sball@anvitahealth.com
www.anvitahealth.com

Anvita-Health of San Diego California is seeking programmers to develop and maintain its health care decision support tools.

Prerequisites for the position include:

An advanced degree in Computer Science from a US or UK university

Strong C++ skills; Lisp programming a plus

At least 2 yrs of programming experience

Web skills a plus (PERL, PHP) and familiarity with web UI tools

A willingness to relocate to San Diego, CA

Salary and Benefits commensurate with experience.

Interested persons should send CV to:

Sheldon Ball MD, PhD

Physician Informaticist

Anvita Health

sball@anvitahealth.com

www.anvitahealth.com

 

Ingvar Mattsson17 Jun 2009

Intriguing. I have found an interesting corner case, where I believe the Common Lisp standard doesn't have an opinion. I don't think it's really any critical corner case, as I (right now) can't see any legitimate use of the difference, but...

Basically, in the case of the following:

(defun frob (x) (format t "Frob: ~a~%" x))
(frob (defun frob (x) (format t "New frob: ~a~%" x))
does the printed line say "Frob" or "New frob"?

It is, I believe, fully specified what will happen when you do either of (funcall #'frob ...) or (funcall 'frob ...), but out of the two implementations I have tried (SBCL and CLisp), I have two different behaviours. SBCL prints "New frob" and CLisp prints "Frob".

I shall have to ponder this, for a bit, I think.  

François-René RideauBoston Lisp Meeting: Monday 2009-06-29 - Eli Barzilay

A Boston Lisp Meeting will take place on Monday, June 29th 2009 at 1800 at MIT 34-401B, where Eli Barzilay will speak about Implementing Domain Specific Languages with PLT Scheme.

Additionally, we are still accepting proposals for up to two volunteers to each give of a 5-minute Lightning Talk (followed by 2-minute Q&A).

Also, there will be a buffet offered by ITA Software. Registration is not necessary but appreciated. See details below.

*

Eli Barzilay will speak about Implementing Domain Specific Languages with PLT Scheme.

Many problems call for domain-specific languages (DSLs) to express them and their solutions; such languages enable a dialogue between domain experts and software developers. The Lisp and Scheme community has a decades-old tradition of creating and embedding special-purpose languages via macros. Over the last twenty years, we PLT Schemers have continued to develop this technology to the point where making up new languages is so quick and easy that PROGRAMMERS CREATE A LANGUAGE FOR WRITING A SINGLE PROGRAM.

Embedded DSLs are appropriate for a whole range of domains and applications -- in both academia and industry. Notable examples include research languages, teaching languages, and application-specific languages like our text-friendly documentation language.

In this talk Eli will demonstrate how to implement embedded practical DSLs in PLT Scheme.

Eli Barzilay is a Researcher in the PLT group at Northeastern University. He has been a core PLT Scheme developer since 2003, and has used PLT's ability to implement new languages to an extreme. For his Programming Languages undergraduate course, he creates nearly one language per week. In addition to writing new languages, he is involved in helping PLT develop into a multi-lingual environment.

His website is at http://barzilay.org/

* *

Having observed the success of the formula at ILC'2009, we have instituted Lightning Talks at the Boston Lisp Meeting. At every meeting, before the main talk, there are two slots for strictly timed 5-minute talks followed by 2-minute for questions and answers.

The slots for next meeting are still open. Step up and come talk about your pet project!

* * *

The Lisp Meeting will take place on Monday June 29th 2009 at 1800 (6pm) at MIT, Room 34-401B.

As the numbers indicate, the room is in Building 34, on the 4th floor. This is the usual location, on 50 Vassar Street, Cambridge.

MIT map: http://whereis.mit.edu/bin/map?selection=34

Google map: http://maps.google.com/maps?q=50+Vassar+St,+Cambridge,+MA+02139,+USA

Many thanks go to Alexey Radul for arranging for the room, and to MIT for welcoming us.

* * * *

Dinner: ITA Software, a fine employer of Lisp hackers (disclaimer: I work there), is kindly purchasing a buffet to accompany our monthly Boston Lisp meeting. Anyone who attends is welcome to partake.

We appreciate it if you let us know you're coming, and what food taboos you have, so that we can order the correct amount and kind of food. Tell us by sending email to boston-lisp-meeting-register at common-lisp.net. We won't send any acknowledgement unless requested; importantly, we'll keep your identity and address confidential and won't communicate any such information to anyone, not even to our sponsors.

* * * * *

The previous Boston Lisp Meeting on May 26th had 40 participants. Norman Ramsey gave a talk about Using Higher-Order Functions and Continuation-Passing Style to Make Dataflow Optimization Simple.

In the near future, we expect to have Bruce Lewis on 2009-07-27 about BRL and ourdoings.com, Emmanuel Schanzer on 2009-08-31 about bootstrapworld.org, Christine Flood on some undetermined date about Fortress.

We're always looking for more speakers. The call for speakers and all the other details are at http://fare.livejournal.com/120393.html Volunteers to give Lightning Talks are also sought.

For more information, see our web site boston-lisp.org. For posts related to the Boston Lisp meetings in general, follow this link: http://fare.livejournal.com/tag/boston-lisp-meeting or subscribe to our RSS feed: http://fare.livejournal.com/data/rss?tag=boston-lisp-meeting

Please forward this information to people you think would be interested. Please accept my apologies for your receiving this message multiple times. My apologies if this announce gets posted to a list where it shouldn't, or fails to get posted to a list where it should. Feedback welcome by private email reply to fare@tunes.org.

 

June 15, 2009

LispjobsLisp developers: oDesk, UK


Lisp developers positions at oDesk in the UK.

 

Simon AlexanderOnce more, from the top (and a bit about sampling)

The benchmark goal is described as:
Each program should plot the Mandelbrot set [-1.5-i,0.5+i] on an N-by-N bitmap. Write output byte-by-byte in portable bitmap format.
cmp program output N = 200 with this 5KB output file to check your program is correct before contributing.

I'm going to assume everyone understands the basic properties of the Mandelbrot set. If not, MathWorld and/or Wikipedia will fill in the gaps. I'll stick to the straightforward (if a bit sloppy) notation used previously, a quadratic recurrence generates by repeated application.

It can be shown that the set lives in the closed disc around the origin of radius 2 and also that a point c is in the set if and only if for all iterates n. Now we can't test all n greater than 0, but we can certainly test the first few, say . This doesn't prove that a point is in the set, but it does prove if a point is not (also useful!) and as you increase the iterations you at least intuitively hope there is some connection between the number of iterations a point survives, and the probability it is in the set. So now we have something we can translate directly to code (and hope blogger doesn't mess up the indentation too much this time):

(defun iterate-point (c max-iterations)
(loop for n below max-iterations
for z = (complex 0 0) then (+ (* z z) c)
when (> (abs z) 2) do (loop-finish)
finally (return n)))

This code returns an integer in [1, max-iterations]. If the result is equal to max-iterations, the point is at least still a candidate for being in the set, whereas if it is lower than that, we know that the point has escaped and c is not in the Mandelbrot set. Let's agree to ignore the issue of floating point representation of real numbers, although it's relevant to computing this set membership.

Ok, so in order to render an image of the set, now all we need to do is form a lattice of points in the complex plane, and compute the result of the iteration. We'll call a point "in the set" if it survives max-iterations:

(defun mb-set (width height &key (max-iterations 51)
(real-min -1.5d0) (real-max 0.5d0) (imag-min -1d0) (imag-max 1d0))
(let ((array (make-array (list height width))))
(with-square-grid ((im i :steps height :first imag-max :last imag-min)
(re j :steps width :first real-min :last real-max))
(dotimes (i height array)
(dotimes (j width)
(setf (aref array i j)
(if (= (iterate-point (complex re im) max-iterations)
max-iterations)
1
0)))))))

Here I make use of a macro to clarify how the sampling (more on this below) is to be done. I know some people dislike this sort of capturing, but I think it helps in situations like the above. A simple version might look something like this:

(defmacro with-square-grid (((var index &key steps first last) &rest others) &body body)
(let ((var-step (gensym)))
`(let ((,var-step (/ (- ,last ,first) (1- ,steps))))
(symbol-macrolet ((,var (+ ,first (* ,index ,var-step))))
,@(if others
`((with-square-grid ,others ,@body))
body)))))

The result of mb-set is an an array of values, 0 for out of the set, 1 for in the set. For the benchmark, we output a raw PBM file; so we need to put these results into a packed representation of the rows and add a header. I'll do this in two steps. First pack the array into the right data format:

(defun pack-array/byte-aligned (array)
(let ((seq (make-array (ceiling (array-total-size array) 8) :element-type '(unsigned-byte 8))))
(dotimes (n (length seq) seq)
(loop for b from (* 8 n) below (+ (* 8 n) 8)
for code of-type (unsigned-byte 8) = (row-major-aref array b)
then (logior (ash code 1) (row-major-aref array b))
finally (setf (aref seq n) code)))))

and then write it out to a file, with the PBM header, using something like this:

(defun write-pbm (filename width height packed-array)
(with-open-file (s filename :direction :output :element-type '(unsigned-byte 8))
(write-sequence (map 'list #'char-code (format nil "P4~%~D ~D~%" width height)) s)
(write-sequence packed-array s)))

You'll note I've cheated a little bit and assumed that the width of the array is evenly divisible by 8, which matches the test case but isn't general (but suffices for now).

Here is the output of the above code for N=200.

And the validation image given on the benchmark page:


So they look roughly similar but look at the little "spike" leading off to the left. This is the real line (i.e. the imaginary component is equal to 0). But hang on a moment, the specification called for the range [-i,+i] on the imaginary axis, and N=200... but no even regular sampling of a symmetric range like this can contain the point zero. Looks to me like the test file was generated with a fencepost error, and the other implementations have presumably just followed suit.

If the above isn't obvious to you, consider a very coarse sampling: Suppose we want to sample [-1,1] in four places, equally spaced. There are two natural ways to do this, with different properties. We can do it including the points [-1,1]:

CL-USER> (setf x-min -1 x-max 1 dx (- x-max x-min) samples 4)
CL-USER> (loop for n below samples collecting (+ x-min (* n (/ dx (1- samples)))))
(-1 -1/3 1/3 1)

Note zero lies between the central two samples, both endpoints are present, and the steps are each 2/3. Alternatively, we can pick the points as the center points of an equipartition of [-1,1] into sub-intervals:

CL-USER> (loop for n below samples collecting (+ x-min (* (+ n 1/2) (/ dx samples))))
(-3/4 -1/4 1/4 3/4)

Here the step size is 1/2, not 2/3, and of course we still miss the point 0. In order to include the point 0, we'd have to use an odd number of samples.

The code generating the benchmark image does something like this, though:

CL-USER> (loop for n below samples collecting (+ x-min (* n (/ dx samples))))
(-1 -1/2 0 1/2)

So now you will (for any even number of samples) get the point 0 in your sampling, but your points are asymmetrically placed in the interval you've specified, and in fact always cut off the end.

If you change the above code to generate the points on the complex plane with this asymmetric approach, you'll get the identical result to the test image, so I suspect that this is what was done originally. You can interpret it as a simple fencepost error when dividing by samples, or as an attempt to do the second, more natural, approach and failing to shift the samples by 1/2 of the delta in each direction. Or, I suppose, as an unclear specification of the expected sampling.

This sort of thing is very common to run into, and can lead to weird errors/results because the amount it is shifted by depends on the number of samples, not generally a good property to have. Ok, hopefully that's not too tedious about the sampling, but it's important to get right (and often isn't) and it will come up again in other forms the next post or two.

In the above code, there has been not attempt at optimization, just to set up the problem an separate it into three parts. I realize the separation of the packing problem is somewhat artificial because while it is unavoidable, it is better to amortize this over the computations, particularly if threading.

Now after asking the compiler to optimize as best it can, consider the following timings:

CL-USER> (time (mb-set 1000 1000))
Evaluation took:
15.593140 seconds of total run time (14.772507 user, 0.820633 system)
[ Run times consist of 3.651 seconds GC time, and 11.943 seconds non-GC time. ]
39,699,456,062 processor cycles
8,436,938,592 bytes consed
#<(SIMPLE-ARRAY T (1000 1000)) {121AF447}>
CL-USER> (time (pack-array/byte-aligned *))
Evaluation took:
0.025524 seconds of total run time (0.025439 user, 0.000085 system)
64,063,813 processor cycles
125,008 bytes consed
#<(SIMPLE-ARRAY (UNSIGNED-BYTE 8) (125000)) {12479007}>
CL-USER> (time (write-pbm "test1000.pbm" 1000 1000 *))
Evaluation took:
0.002482 seconds of total run time (0.000926 user, 0.001556 system)
6,306,962 processor cycles
7,552 bytes consed

This is the first thing I did for the original post. So what have we learned? First off, and no surprise, this implementation has pretty poor performance compared to what can be done (but it hasn't been tuned at all). Packing the results into an array for the PBM file is orders of magnitude cheaper than computing the values. Writing out the file another order cheaper; these things practically free in this context.

However, we know the iteration code is suboptimal. Can we ignore the cost of packing the array? Keep in mind this is work we have to do somewhere, even if it isn't as a separate step. The above code doesn't tell us much, really, as the array access indirection will kill it. Taking a similar version (not shown) of pack-array that has been notated for types (very much faster) and looking at an array the same size as the benchmark.

CL-USER> (defparameter *array* (make-array (list 16000 16000) :element-type 'fixnum))
CL-USER> (dotimes (n (array-total-size *array*)) (setf (row-major-aref *array* n) (random 2)))
CL-USER> (time (pack-array/byte-aligned/fast *array*))
Evaluation took:
1.518081 seconds of total run time (1.486409 user, 0.031672 system)
[ Run times consist of 0.013 seconds GC time, and 1.506 seconds non-GC time. ]
3,818,926,037 processor cycles
32,000,576 bytes consed
#<(SIMPLE-ARRAY (UNSIGNED-BYTE 8) (32000000)) {4FD44007}>

Ok, we are into territory where we could quibble about details, and I certainly could have optimized this a bit more. I'm assuming a factor of two is there for the asking. Also waving our hands wildly about the rough equivalence of machines if we want to compare this (run on a core2 duo 2.5ghz) with the posted benchmark results .... however, just from an order of magnitude point of view it should be clear that, as was my original contention, this second order of cost might come into play. It's obviously going to be an issue if your language implementation can't pack bits like this effectively. However, if we're talking about threaded run times on the order of 10 seconds, it may well be a factor there as well, and we have to keep an eye on it. So while this is hardly a complete analysis it at least raises the specter of a benchmark that confounds two quite different things, breaking my third characteristic.

So there we go, I've laid out a rather naive solution of the benchmark, and already we've run into trouble with respect to my criteria. The specification of the problem seems to be buggy, and the final goal isn't as concrete as it could be if the expected final output was actually specified.

Now I realize I've opened myself up a bit to a complaint of "apples and oranges" comparisons, so next post I'll try and put things on a firmer footing.

We've yet to talk further about why I believe this benchmark is problematic even if you accept the output as meaningful, and quite separably about why I think the output is fairly pointless, and how that undermines it as a benchmark. So this will be two posts, at least.

Somewhere in all that I'll also talk a bit about what is interesting about the set, and computations like this, and some ideas about how to do a better job (of computing the set, not benchmarking). I'm pretty busy at the moment but I'll try and get them out without too much delay. I'm sure my vast set of reader will be holding your breath.  

Simon AlexanderI may well regret this....

The other day I quickly tossed off a few comments about a benchmark: Lies, Damn Lies, and Benchmarks

I've had a little pushback in comments and email which makes me wish I had taken the time to be clearer. I realized that maybe I could take a post or two and expand on it, hopefully include some stuff more interesting than mere chasing of benchmark numbers, so I'm going to have another go. I'll try not to be too boring about it all, and tie in some commentary on lisp, optimization, and the Mandelbrot set (which is a much more interesting object than many people realize, even many people who have played with and/or written generators).

To start off with, a quick note; Greg Buchholz (who I believe came up with it in the first place) comments on the previous post:
When we created the Mandelbrot benchmark in December of 2004, the machines that were used were much slower ... the image limits were something like 200x200, and so an iteration limit of 50 seemed reasonable... Another interesting thing to note is that I seem to recall that with this benchmark we smoked out compiler bugs in O'Caml and MLton. FWIW.

This sounds plausible, and matches my earlier surmise that the image size had just grown to match faster processors and threading, without much further thought. Greg's latter point is a good one. I think things at least similar to this (the size is a bit overkill) make pretty good regression tests, and even "smoke tests", and it doesn't surprise me at all that you could find compiler issues with this. However, what is interesting to compiler hackers doesn't necessarily make much of a generally useful benchmark... and I'll try and make my reservations about this one in particular more clear.

I'll also note that from a numerical analysis point of view, 50 iterations of the maps on a 200x200 grid looks like something that might be a reasonable coarse approximation; 50 iterations on a 16000x16000 grid just looks perverse. Somewhere in my previous post I noted that you could make an argument that this doesn't really matter. Quite contrary to the misreading that this and admission that it doesn't matter, I think that while the argument could be made it's a bad argument....and I'll try to explain why later.

Before I do that though, I'll note that I'm not really trying to pick on "The Great Computer Language Shootout". They have a pretty reasonable statement about the inherent flaws in any such undertaking, and of course it can be fun to try, anyway. I'm not even really picking on this one benchmark so much as using it as an example. There are plenty of lousy benchmarks out there, after all.

I think what got a (bit of a knee jerk) reaction out of me on this particular one when it fell in my lap was the way that it trivializes something really very interesting, while focusing elsewhere (somewhere I'd argue is not very informative, which isn't to say completely uninteresting.)

Now full disclosure: I don't think benchmarks are all that informative as a way of comparing languages. I'm also not particularly a language booster. This is a lisp-ish blog because I occasionally want to throw bits and pieces up here about a language that is not so well known, and one that I find nearly optimal (or at least, the best of a bad lot if you'd rather) for certain types of research code. At least, that's the theory behind this blog --- obviously I mostly just don't post. So while I'm going to use lisp to illustrate things, I'm not trying to argue it's superiority or anything, and I'm not interested in these posts in why language X might have done something better/faster/more elegantly.

So, given that, if there is any value in a benchmark of this sort, what is it? If you can't learn anything new from it, there's not much point. To this end, I can think of several useful characteristics.

  1. It should be well specified, with a concrete goal.
  2. The task should be broad enough to allow for different approaches, yet narrow enough that it is solvable and implementations manageable (and understandable!)
  3. If it confounds two or more issues, it is less useful. Avoiding this as much as possible is a good idea.
  4. If it only tells you what is already known, it's not very useful (but very common).
  5. If it boils down to a particular algorithm, that should be the stated goal.
  6. It shouldn't reduce to toy code --- realistic and reasonably common coding tasks are far more interesting for this than that which is pointless, obscure, or only esoterically useful.
I'm sure there are other good criteria, but I'd argue (and will, at least partially) that the particular benchmark in question doesn't do so well on any of these points.

In response to Isaac in comments, I'll note that I never meant suggest that nobody could learn anything from even a flawed benchmark, somebody can probably learn something from almost any bit of code. This is especially true if you are naive or inexperienced in the particular area. However, it's equally true that you could learn the same things from a better designed task, with less risk of picking up bad ideas/approaches along the way. And just because you didn't know something doesn't mean it isn't well known.

Ok, on to the benchmark, next post, including a (very) little coding content.  

Tamas K Pappupdates to cl-cairo2 and cl-2d

Following the discussion and the resulting poll, I updated cl-cairo2 and consequently cl-2d to use the special variable *context* as the default context.

This does not affect the interface of cl-2d (except for user-defined functions that draw directly on the cairo context, of course), since in cl2d, the context is always saved in the frame and there is no "default" frame. Nevertheless, updating the cl-2d code was beneficial because I was able to see if the new style makes things easier. It appears that it simplifies the code quite a bit: I would like to thank everyone who argued for the change.

Enjoy. Bug reports and contributed code are appreciated.  

June 12, 2009

Cyrus HarmonNew hunchentoot-auth, hunchentoot-vhost, hunchenoot-cgi and nuclblog releases

To mark the milestone of finally bringing hunchentoot-auth, hunchentoot-vhost, hunchenoot-cgi and nuclblog into the present such that they work properly with the hunchentoot-1.0 release, I've rolled up the following packages:

Of course these are probably best gotten from the git repo's, but for those of you who like released versions, I figured I'd roll new ones since it had been quite some time (almost two years in some cases!).

 

Zach BeaneFirst TC Lispers meeting

It sounds like the first TC Lispers meeting was a success. Patrick Stein really liked it.  

ECL NewsECL 9.6.0 has been released

Important changes related to the usability of the debugger and inspector, debugging of compiled C code, handling of floating point exceptions and creation and manipulation of NaNs and infinities, among other things. Please reade the Changelog.

Support for Solaris/Intel has been added. The windows ports are right now broken. Please stay with 9.4.0 if support for those platforms is needed, until we fix the problems. Remaining ports, such as OSX, Linux, *BSD are working fine. (0 comments)  


For older items, see the Planet Lisp Archives.


Last updated: July 1, 2009 07:09 AM