Planet Lisp

Victor Anyakin: ACM Open access to LFP

· 18 hours ago

At the end of March, Slashdot reported that the “Association for Computing Machinery (ACM) has made the ACM Digital Library open access to help support the computing community during the coronavirus pandemic.”

In a letter on ACM.org, ACM President Cherri Pancake wrote: “We believe that ACM can help support research, discovery and learning during this time of crisis by opening the ACM Digital Library to all.”

And support they do.

After searching for a few things, I was surprised to discover a wealth of LISP-related papers in the proceedings of “LFP: Conference on LISP and Functional Programming”.

This conference produced 287 publications in the span of 1980-1994. Out of them all, 276 publications are available for download.

It featured quite a number of prominent contributors, discussed plenty of topics that are still relevant today, and overall provides a very interesting window into the age of glory, when LISP machines were a thing and the future of AI looked promising. Unfortunately, it also captured artifacts of IT archeology: “Tachyon Common Lisp: an efficient and portable implementation of CLtL2”, which was claimed to be:

an efficient and portable implementation of Common Lisp 2nd Edition. The design objective of Tachyon is to apply both advanced optimization technology developed for RISC processors and Lisp optimization techniques. The compiler generates very fast codes comparable to, and sometimes faster than the code generated by UNIX C compiler

nowadays is nowhere to be found. But “The Python compiler for CMU Common Lisp” fortunately has survived the test of time and in some form or shape can be found in SBCL.

Some articles have abstracts as if written last year, not 1984:

As the need for high-speed computers increases, the need for multi-processors will become more apparent. One of the major stumbling blocks to the development of useful multi-processors has been the lack of a good multi-processing language—one which is both powerful and understandable to programmers.

Overall, “LFP: Conference on LISP and Functional Programming” is a very interesting thing to read. Don’t miss this opportunity, as access will be open only through June 30, 2020.

https://dl.acm.org/conference/lfp

Wimpie Nortje: Picking the database for a web application.

· 2 days ago

Picking a database for your web application can be surprisingly time consuming. There are a variety of databases of various types available and Quicklisp has an interface library for many of them.

Unless you need a very specific capability from the database and you are acutely aware of this fact, chances are that a standard relational database will be more than sufficient to provide for your needs.

Once you have decided to use a SQL database the three main contenders are MySQL, PostgreSQL and SQLite. They are all open-source, mature, actively developed and well-supported by Common Lisp.

For most projects SQLite will work very well, even in production. It can handle database sizes and query volumes far exceeding what most web apps will ever need; it stores the whole database in a single file, so the backup process is simply a normal file copy; and there is no database server application to configure and maintain.

The downside of SQLite is that the database file must be stored on the same machine where the application runs because it is not a client/server system. If you need to run the application and the database on separate machines then SQLite will not work. One such case is when you need to build a highly available service, i.e. your application must run on multiple machines to ensure it remains available when one server stops working.

If you decide that you need a client/server database it comes down to MySQL or PostgreSQL. This is mostly a choice of personal preference. Both have very good support in Common Lisp. There are areas in which one is better than the other but few applications will ever run into those cases.
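
As a minimal sketch of what that support looks like, the cl-dbi library (one of several options in Quicklisp) exposes the same interface over SQLite, MySQL and PostgreSQL drivers. The table and file names below are made up for illustration:

(ql:quickload '(:cl-dbi :dbd-sqlite3))

(defvar *conn*
  ;; swap :sqlite3 for :postgres or :mysql with the matching driver system
  (dbi:connect :sqlite3 :database-name "app.sqlite3"))

(let ((query (dbi:prepare *conn* "SELECT id, name FROM users")))
  (dbi:fetch-all (dbi:execute query)))

(dbi:disconnect *conn*)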

I use PostgreSQL in my projects, both for development and production environments. The idea is sometimes put forward that one should use SQLite in development and PostgreSQL in production. This is a bad idea. Even though both implement the SQL standard there are still distinct differences in how they function and in their capabilities. By using different database systems you must cater for two database variants instead of one. In addition, you will always go to production with untested code that can only be tested in production.

Vsevolod Dyomkin: Programming Algorithms: Synchronization

· 3 days ago

This is the final chapter of the book, in which we will discuss optimization of parallel computations: whether concurrently on a single machine in a shared-memory setting or in a distributed shared-nothing environment. This is a huge topic that spans synchronization itself, parallelization, concurrency, distributed computations, and the functional approach. And every senior software developer should be well-versed in it.

Usually, synchronization is studied in the context of system or distributed programming, but it has a significant algorithmic footprint and is also one of the hottest topics for new algorithm research. In fact, there are whole books that concentrate on it, but, usually, they attack the problem from other angles, not focusing on the algorithmic part. This chapter will be more algorithm-centered, although it will also present an overview of the problem space, so that, in the end, you'll have a good foundation to explore the topic further if the desire or need for that appears.

Let's start from the basics. In the previous chapters of the book, we mainly viewed algorithms as single computations running without external interruptions. This approach, obviously, removes unnecessary complexity, but it also isn't totally faithful to reality. Most of the programs we deal with now run in multiprocessing environments (sometimes, even distributed ones), and even when they don't utilize these capabilities, those are still available, they sometimes have their impact, and, besides, might have improved the performance of the programs had they been utilized. The majority of the backend stuff, which, currently, is comprised of services running in the datacenters, is multithreaded. There's the notorious "Zawinski's Law" that states that every program attempts to expand until it can read mail, and those programs which cannot so expand are replaced by ones which can. Besides being a good joke, it also reflects an important truth about the tendency of all programs over time to become network-aware, and thus distributed to at least some extent.

There are two principally different types of environments in which the programs that need synchronization run: shared-memory and shared-nothing ones.

In a shared-memory setting, there exists some shared storage (not necessarily RAM) that can be directly accessed by all the threads[1] of the application. Concurrent access to data in this shared memory is the principal source of the synchronization challenges, although not the only one. An example of a shared-memory program is a normal application that uses multithreading provided either directly by the OS or, more frequently, by the language runtime[2].

The opposite of shared-memory is a shared-nothing environment, in which the threads[3] don't have any common data storage and can coordinate only by sending messages directly to other processes. The contents of the messages have to be copied from the memory of the sender to that of the receiver. In this setting, some of the synchronization problems disappear, but others still remain: at the fundamental level, some synchronization or coordination still needs to happen. From a performance standpoint, however, the shared-nothing mode is, usually, inferior due to the need for additional data copying. So, both paradigms have their place, and the choice of which one to utilize depends on the context of a particular task.

The main goal of synchronization is ensuring program correctness when multiple computations are running in parallel. The other side of the coin is achieving optimal performance, which is also addressed by parallelization that we discussed somewhat in a couple of prior chapters. Prioritizing performance over correctness, although tempting, is one of the primary sources of bugs in concurrent systems. The trivial example would be building a shared-memory program without explicit use of any synchronization mechanisms. It is, definitely, the most performant approach, but non-coordinated access to the shared data will inevitably result in failures like data corruption.

Synchronization Troubles

So, let's talk in more detail about the most common synchronization problems that the methods we will discuss next are trying to handle. Such situations are called race conditions: multiple threads compete for the same resource — be it data storage or processor — and, in the absence of special coordination, the order of execution is unspecified, which may result in unpredictable and unintended outcomes. There are two main results of this unpredictability (often, both occur simultaneously):

  • data corruption or loss
  • incorrect order of execution up to total failure of the program

Here is the simplest code segment that is amenable to data corruption in multithreaded programs:

(incf i)

It seems like there's just one operation involved — how can there be a race condition? From the point of view of a programmer in a high-level language, indeed, we deal with a single operation, but if we go deeper to the level of machine code we'll see that it is not the case. The relevant assembly snippet will, probably, contain three instructions:

mov i, register
inc register
mov register, i

You've just seen one more piece of convincing evidence of why every programmer should understand how the lower levels of their platform operate. :)

The issue is that modern processors can't directly modify data in the RAM (our variable i). First, the data needs to be moved into a register, only then some operation on it may be performed by the CPU, and, finally, it needs to be put back where the high-level program can find it. If an interrupt occurs (we're talking about multithreaded execution in a single address space, in this context) after mov i, register, the current thread will remember the old value of i (let it be 42) and be put into a waitqueue. If another thread that wants to change i is given processor time next, it may set it to whatever value it wants and continue execution (suppose it will be 0). However, when the turn returns to the first thread, it will increment the value it remembered (42), so i will change its value in the following sequence: 42 - 0 - 43. Hardly the expected behavior.

Such data corruption will only impact the mentioned variable and may not cause catastrophic failures in the program. Its behavior will be incorrect, but in some situations that can be tolerated (for instance, if we gather some statistics, and occasional off-by-one errors will not be noticed). Yet, if i was some counter that impacts the core behavior of the program, it might easily lead to a catastrophe.
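
Here is a minimal sketch of how this shows up in practice on SBCL (assuming its sb-thread API and a multi-core machine); the names are made up for the illustration:

(defparameter *counter* 0)

(defun race-demo (&key (threads 4) (times 100000))
  "Increment *COUNTER* from several threads without any synchronization."
  (setf *counter* 0)
  (mapc #'sb-thread:join-thread
        (loop :repeat threads
              :collect (sb-thread:make-thread
                        (lambda ()
                          (dotimes (i times)
                            (incf *counter*))))))
  *counter*)

;; (race-demo) will, most of the time, return a value noticeably smaller
;; than the expected 400000 because of the lost updates described above.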

Ultimately, incorrect execution order should be considered the root cause of all synchronization problems. And here it is also manifest: we expected the increment to be a single (atomic) operation and thus to finish execution before anything else happens to i.

What are some other common cases of execution order errors? The most well-known and dreaded race condition is a deadlock. It is a situation of mutual blocking among two or more threads. Here is the simplest illustration of how it can occur:

thread 1 ---> acquire resource1 --> try to acquire resource2
thread 2 --> acquire resource2 ------> try to acquire resource1

In other words, two threads need exclusive access to two resources, but the order of access is opposite and the timing of the operations is such that the first thread manages to acquire access to the first resource while the second — to the second. After that, the deadlock is inevitable and both threads will be blocked as soon as they try to access the other resource. The period between a thread acquiring a resource for exclusive access and the release of this resource is called a critical section of the program. Only in the critical sections may a synchronization issue manifest.
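
Here is a minimal sketch of the same situation expressed with SBCL mutexes (an assumed illustration, not code from the book); with unlucky timing, both threads block forever:

(defparameter *resource1* (sb-thread:make-mutex :name "resource1"))
(defparameter *resource2* (sb-thread:make-mutex :name "resource2"))

(defun worker-1 ()
  (sb-thread:with-mutex (*resource1*)
    (sleep 0.1)                          ; give the other thread time to run
    (sb-thread:with-mutex (*resource2*)  ; blocks while worker-2 holds it
      :work)))

(defun worker-2 ()
  (sb-thread:with-mutex (*resource2*)
    (sleep 0.1)
    (sb-thread:with-mutex (*resource1*)  ; blocks while worker-1 holds it
      :work)))

;; (sb-thread:make-thread #'worker-1)
;; (sb-thread:make-thread #'worker-2)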

The only way to untangle such a deadlock situation "from within" is for one of the threads to release the resource it already holds. Another approach, which requires external intervention, is often employed in database management systems — deadlock monitoring. A separate thread periodically examines blocked threads to check for conditions that signify a deadlock situation, and it resets the threads that were spotted in such a condition. Yet, instead of trying to fix deadlock situations, it may be better to prevent them from occurring altogether. The prevention techniques may utilize time-limited exclusive leases on resources or mandate that the threads acquire resources in a specific order. However, such approaches are limited and don't cover all the use cases. It would be nice to find some way to totally exclude deadlocks, but we should remember that the original reason why they may occur at all is the need to prevent data corruption in the case of uncontrolled access to the data. Exclusive access to the resource ensures that this problem will not occur, but results in the possibility of a deadlock, which is a comparatively lesser evil.

A livelock is a dynamic counterpart to deadlock that occurs much more rarely. It is a situation when threads don't constantly hold the resources exclusively (for instance, they might release them after a timeout), but the timing of the operations is such that at the time when the resource is needed by one thread it happens to be occupied by the other, and, ultimately, mutual blocking still occurs.

One more obnoxious race condition is priority inversion — a phenomenon one can frequently observe in real life: when a secondary lane of cars merges into the main road, but, for some extraneous reason (traffic light malfunctioning, an accident that is blocking part of the road, etc.), the cars from it get more time to merge than the ones on the main road get to progress. Priority inversion may be the reason for a more severe problem, which is starvation — a situation when the execution of a thread is stalled as it can't access the resource it needs. Deadlocks result in starvation of all the involved threads, but the issue may occur in other conditions, as well. I would say that starvation or, more generally, underutilization is the most common performance issue of multithreaded applications.

Low-Level Synchronization

I hope, in the previous section, the importance of ensuring proper execution order in the critical sections of the program was demonstrated well enough. How to approach this task? There are many angles of attack. Partially, the problem may be solved by the introduction of atomic operations. Atomic increment/decrement are a common example of those, which may be found in the ecosystems of the majority of programming languages. For instance, SBCL provides an sb-ext:atomic-incf macro that operates on the fixnum slots of structures, array cells, contents of cons pairs or global variables. Some other languages, like Java, provide AtomicInteger and similar structures that guarantee atomic operations on their main slot.

What enables atomic operations are special hardware instructions:

  • TSL — test and set lock
  • CAS — compare and swap
  • LL/SC — load link/store conditional

The most widespread of them is CAS, which has the same effect as if the following code worked as a single atomic operation:

(defmacro cas (place old new)
  `(when (eql ,place ,old)
     (:= ,place ,new)))

Based on this spec, we could define atomic-incf using cas:

(defmacro atomic-incf (place &optional (i 1))
  (let ((cur (gensym "CUR"))
        (rez (gensym "REZ")))
    `(loop :for ,rez := (let ((,cur ,place))
                          (cas ,place ,cur (+ ,cur ,i)))
           :when ,rez :do (return ,rez))))

Here, we read the current value of place, and then try to set it with cas. These two operations happen non-atomically, so there's a chance that cas will return nil. In that case, we redo the whole sequence again. It is clear that the execution time of such an operation is non-deterministic, but, in a reasonably configured multithreaded system, there should be, generally, just a single chance for cas to fail: when the thread is preempted between the assignment and cas. The failure shouldn't repeat the next time this thread gets its time slice, as it will start from these two operations and should have enough time to complete both.

Another important low-level instruction is a memory barrier. It causes the CPU to enforce an ordering constraint on memory operations issued before and after the barrier instruction. I.e. the operations issued prior to the barrier are guaranteed to be performed before operations issued after the barrier. Memory barriers are necessary because most modern CPUs employ performance optimizations that can result in out-of-order execution. The reordering of memory loads and stores goes unnoticed within a single thread of execution but can cause unpredictable behavior in concurrent programs. One more leak from the low level adding to the list of synchronization worries...

On top of CAS and atomic operations, some higher-level synchronization primitives are provided by the OS and the execution runtimes of most programming languages. The most popular of them is the semaphore. It is a counter that is initially set to the number of threads that are allowed to proceed past querying its value. If the counter is above zero, the thread may continue execution, but it also atomically decrements the counter. This operation is usually called wait, acquire or down on the semaphore. However, if the counter is already down to zero, the thread goes to sleep and is put into an OS waitqueue until a wakeup notification arrives. The notification is initiated by some thread calling release/up on the same semaphore. This operation atomically increments the counter value and also allows some of the waiting threads to continue execution. The most used type of semaphore is called the mutex: it allows only a single thread to enter and also mandates that the implementation check that the thread that releases the mutex is the one that has previously acquired it. There are also other types of semaphores or more complex locks built on top of them, such as the read-write lock or a monitor.
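
As a small sketch of the acquire/release discipline, here is SBCL's semaphore API used as a binary semaphore guarding a critical section (an illustration under assumed names, not production code):

(defparameter *sem* (sb-thread:make-semaphore :count 1))  ; one thread may enter
(defparameter *shared* (list))

(defun guarded-push (item)
  (sb-thread:wait-on-semaphore *sem*)      ; down/acquire: may put the thread to sleep
  (unwind-protect
       (push item *shared*)                ; critical section
    (sb-thread:signal-semaphore *sem*)))   ; up/release: wake one waiter, if any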

Semaphores are an alternative to a lower-level spin-lock primitive that uses busy waiting, i.e. constant checking of the counter variable until it rises above zero. Another, more general, name for this method is polling, which refers to constantly querying the state of some resource (a lock, a network socket, a file descriptor) to know when its state changes. Polling has both drawbacks and advantages: it occupies the thread instead of yielding the CPU to other workers, which is a serious downside, but it also avoids the expensive OS context switching required by semaphores.
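
A toy spin-lock can be sketched directly on top of CAS (again an assumed illustration; the lock here is just a cons whose car is nil when free):

(defun make-spin-lock ()
  (cons nil nil))

(defun spin-acquire (lock)
  ;; busy-wait until we manage to atomically flip the CAR from NIL to T
  (loop :until (null (sb-ext:compare-and-swap (car lock) nil t))))

(defun spin-release (lock)
  (setf (car lock) nil))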

So, both semaphores and spin-locks find their place. In the low-level OS code, spin-locks prevail, while semaphores are a default synchronization primitive in the user-space.

Mutual Exclusion Algorithms

Relying on hardware features for synchronization is a common approach taken by most software systems. However, since the beginning of work on this problem, computer scientists, including such famous algorithmists as Dijkstra and Lamport, proposed mutual exclusion algorithms that allow guarding the critical sections without any special support from the platform. One of the simplest of them is Peterson's algorithm. It guarantees mutual exclusion of two threads with the use of two variables: a two-element array interest and a boolean turn. A true value of the interest item corresponding to a thread indicates that it wants to enter the critical section. Entrance is granted if the second thread does not want the same or has yielded priority to the first thread.

(defparameter *interest* (vec nil nil))
(defparameter *turn* nil)

(defun peterson-call (i fn)
  (let ((other (abs (1- i))))
    (:= (? *interest* i) t
        *turn* other)
    ;; busy waiting
    (loop :while (and (? *interest* other) (= *turn* other)))
    ;; critical section
    (call fn)
    (:= (? *interest* i) nil)))

The algorithm satisfies the three essential criteria to solve the critical section problem: mutual exclusion, progress, and bounded waiting. Mutual exclusion means that several competing threads can never be in the critical section at the same time. For Peterson's algorithm, if thread 0 is in its critical section, then (? *interest* 0) is true. In addition, either (? *interest* 1) is nil (meaning thread 1 has left its critical section and isn't interested in coming back into it), or *turn* is 0 (meaning that thread 1 is just now trying to enter the critical section but waiting), or thread 1 is trying to enter its critical section, after setting (? *interest* 1) to true but before setting *turn* to 0. So if both processes are in the critical section then we conclude that the state must satisfy (and (? *interest* 0) (? *interest* 1) (= *turn* 0) (= *turn* 1)), which is, obviously, impossible. I.e. only one of the threads could have entered the section. The condition of progress, basically, says that only those threads that wish to enter the critical section can participate in making the decision as to which one will do it next, and that this selection cannot be postponed indefinitely. In our case, a thread cannot immediately reenter the critical section if the other thread has set its interest flag. Thus the thread that has just left the critical section will not impact the progress of the waiting thread. Bounded waiting means that the number of times a thread is bypassed by another thread after it has indicated its desire to enter the critical section is bounded by a function of the number of threads in the system. In Peterson's algorithm, a thread will never wait longer than one turn for entrance to the critical section.

The drawback of Peterson's algorithm is busy waiting[4]. So, it may be compared to a spin-lock. There are a number of other similar algorithms, including Dekker's and Lamport's, which also share this property. The newer Szymański's algorithm is designed to avoid busy waiting, but it requires access to the OS scheduling facilities to make the thread sleep while waiting for the wakeup call, making the algorithm similar to semaphores.

High-Level Synchronization

All the mentioned synchronization primitives don't solve the challenges of synchronization completely. Rather, they provide tools that enable reasonable solutions but still require advanced understanding and careful application. The complexity of multithreaded programs is a level up compared to their single-threaded counterparts, and thus a lot of effort continues to be spent on trying to come up with high-level ways to contain it, i.e. to remove it from the sight of a regular programmer by providing primitives that handle synchronization behind the scenes. A simple example of that is Java synchronized classes that employ an internal monitor to ensure atomic access to the slots of a synchronized object. The major problem with regular locks (like semaphores) is that working with them brings us into the realm of global state manipulation. Such locking can't be isolated within the boundaries of a single function — it leaks through the whole caller chain, and this makes the program much harder to reason about. In this regard, it is somewhat similar to the use of goto, albeit on a larger scale, and so the push for higher-level synchronization facilities resembles Dijkstra's famous appeal to introduce structured programming ("goto considered harmful"). Ironically, Dijkstra is one of the creators of the classic synchronization mechanisms that are now frowned upon. However, synchronization has intrinsic complexity that can't be fully contained, so no silver bullet exists (and hardly will ever be created) and every high-level solution will be effective only in a subset of cases. I have seen that very well on my own when teaching a course on system programming and witnessing how students solve the so-called classic synchronization problems. The task was to apply both classic synchronization techniques (semaphores et al.) and the new high-level ones (using Erlang, Haskell, Clojure or Go, which all provide some of those). The outcome, in terms of complexity, was not always in favor of the new approaches.

There are a number of these classic synchronization problems, and I was even collecting them to be able to provide more variants of the tasks to diminish cheating. :) But, in essence, they all boil down to just a few archetypal cases: producer-consumer, readers-writers, sleeping barber, and the dining philosophers. Each problem demonstrates a certain basic synchronization scenario and allows the researchers to see how their approach will handle it. I won't include them in the book but strongly encourage anyone interested in this topic to study them in more detail and also try to solve them using different synchronization mechanisms.

Now, let's talk about some of the prominent high-level approaches. Remember that they try to change the paradigm and avoid the need for explicit locking of critical sections altogether.

Lock-free Data Structures

My favorite among them is lock-free data structures. This is a simple and effective idea that can help deal with many common use cases and, indeed, avoid the necessity for explicit synchronization. Still, their use is limited and, obviously, can't cover all the possible scenarios.

The most important among them is arguably a lock-free queue. It can be implemented in different ways, and there's a simple and efficient implementation using cas provided by SBCL in the SB-CONCURRENCY contrib package. Here is the implementation of the main operations (taken from the SBCL source code and slightly simplified):

(defstruct lf-queue
  (head (error "No HEAD.") :type cons)
  (tail (error "No TAIL.") :type cons))

(defun lf-enqueue (value queue)
  (let ((new (cons value nil)))
    (loop (when (eq nil (sb-ext:compare-and-swap (cdr (lf-queue-tail queue))
                                                 nil new))
            (:= (lf-queue-tail queue) new)
            (return value)))))

(defun lf-dequeue (queue)
  (loop (with ((head (lf-queue-head queue))
               (next (cdr head)))
          (typecase next
            (null (return (values nil
                                  nil)))
            (cons (when (eq head (sb-ext:compare-and-swap (lf-queue-head queue)
                                                          head next))
                    (let ((value (car next)))
                      (setf (cdr head) +dead-end+
                            (car next) +dummy+)
                      (return (values value
                                      t)))))))))

The value of this structure lies in enabling the implementation of the master-worker pattern that is a backbone of many backend applications, as well as, in general, different forms of lock-free and wait-free coordination between the running threads. Basically, it's a lock-free solution to the producer-consumer problem. The items are put in the queue by some producer threads (masters) and consumed by the worker threads. Such an architecture allows the programmer to separate concerns between different layers of the application: for instance, one type of threads may be responsible for handling incoming connections and, in order to ensure high availability of the system, these threads shouldn't spend much time processing them. So, after some basic processing, the connection sockets are put into the queue, from which the heavy-lifting worker threads can consume them and process them in a more elaborate fashion. I.e. it's a job queue for a thread pool. Surely, a lock-based queue may also be utilized as an alternative in these scenarios, but the necessity to lock from the caller's side makes the code for all the involved threads more complicated: what if a thread that has just acquired the lock is abruptly terminated for some reason?
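
A short usage sketch of this pattern, using the public API of the SB-CONCURRENCY contrib (make-queue, enqueue, dequeue) rather than the simplified names above; the busy-polling back-off is only for illustration, a real system would rather block on a semaphore:

(defparameter *jobs* (sb-concurrency:make-queue :name "jobs"))

(defun worker-loop ()
  (loop
    (multiple-value-bind (job found) (sb-concurrency:dequeue *jobs*)
      (if found
          (funcall job)     ; execute the submitted closure
          (sleep 0.01)))))  ; nothing to do yet: back off briefly

;; start a small pool of workers and submit a job from the master thread
(dotimes (i 4)
  (sb-thread:make-thread #'worker-loop :name "worker"))
(sb-concurrency:enqueue (lambda () (print :processed-by-a-worker)) *jobs*)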

Data-Parallelism and Message Passing

Beyond thread pools, there's the whole concept of data parallelism, which, in essence, lies in submitting different computations to the pool and implementing synchronization as an orchestration of those tasks. In addition, node.js and Go use lock-free IO in conjunction with such thread pools (and a special syntax for its seamless integration) for an efficient implementation of user-space green threads to support this paradigm.
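
In Common Lisp, one way to express this style is via the lparallel library from Quicklisp (a hedged sketch: the kernel below is the worker thread pool, and pmap distributes the per-element computations across it):

(ql:quickload :lparallel)

(setf lparallel:*kernel* (lparallel:make-kernel 4))   ; a pool of 4 worker threads

;; each squaring is submitted to the pool; results are collected in order
(lparallel:pmap 'vector (lambda (x) (* x x)) #(1 2 3 4 5 6 7 8))
;; => #(1 4 9 16 25 36 49 64)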

Even further along this direction is Erlang, a whole language built around lock-free IO, efficient user-space threading, and a shared-nothing memory model. It is the language of message-passing concurrency that aims to solve all synchronization problems within this single approach. As discussed in the beginning, such a stance has its advantages and drawbacks, and so Erlang fits some problems (like coordination between a large number of simple agents) exceptionally well, while for others it imposes unaffordable costs in terms of both performance and complexity.

I won't go deeper into these topics as they are not directly related to the matter of this book.

STM

Another take on concurrency is a technology that has been used for quite a long time in database systems and was reimplemented in several languages, popularized by the author of Clojure — Software Transactional Memory (STM). The idea is to treat all data accesses in memory as parts of transactions — computations that possess the ACID properties: atomicity, consistency, and isolation (minus durability, which is only relevant to the database systems persisting data on disk). These transactions should still be initiated by the programmer, so the control over synchronization remains, to a large extent, in their hands with some portion of the associated complexity. The transactions may be implemented in different ways, but they will still use locking behind the scenes, and there are two main approaches to applying locking:

  • pessimistic — when the locks are acquired for the whole duration of the transaction, basically, making it analogous to a very conservative programming style that avoids the deadlocks but seriously hinders program performance: acquiring all locks at once and then entering the critical section; in the context of STM, each separate variable will have its own lock
  • optimistic — when the initial state of the transaction variables is remembered in the thread-local storage, and locking occurs only at the last (commit) phase, when all the changes are applied — but only when there were no external changes to the transaction variables; if at least one of them were changed, the whole transaction would need to be rolled back and retried

In both cases, the main issue is the same: contention. If the number of threads competing for the locks is small, an optimistic approach should perform better, while, in the opposite case, there will be too many rollbacks and even a possibility of a livelock.
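
To make the optimistic idea concrete, here is a toy sketch of an optimistic update of a single shared cell using CAS (names made up for illustration; real STM implementations track whole read/write sets and roll back complete transactions, not a single place):

(defparameter *cell* (cons 0 nil))   ; the shared value lives in the CAR

(defun tx-modify (fn)
  "Optimistically apply FN to the value in *CELL*, retrying on conflict."
  (loop
    (let* ((old (car *cell*))        ; remember the starting version
           (new (funcall fn old)))   ; compute the change without locking
      ;; commit only if nobody has changed the cell in the meantime
      (when (eql old (sb-ext:compare-and-swap (car *cell*) old new))
        (return new)))))

;; (tx-modify #'1+) increments the cell even under contention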

The optimistic transactions are, usually, implemented using the Multiversion Concurrency Control (MVCC) mechanism. MVCC ensures a transaction never has to wait to read an object by maintaining several versions of this object. Each version has both a Read Timestamp and a Write Timestamp, which lets a particular transaction read the most recent version of the object that precedes the transaction's own Read Timestamp.

STM is an interesting technology, which hasn't yet proven its case beyond the distinct area of data management systems, such as RDBMSs and their analogs.

Distributed Computations

So far, we have discussed synchronization, mainly, in the context of software running in a single address space on a single machine. Yet, the same issues, although magnified, are also relevant to distributed systems. Actually, the same models of computation are relevant: shared-memory and shared-nothing message passing. Although, for distributed computing, message passing becomes much more natural, while the significance of shared-memory is seriously diminished and the "memory" itself becomes some kind of a network storage system like a database or a network file system.

However, more challenges are imposed by the introduction of the unreliable network as a communication environment between the parts of a system. These challenges are reflected in the so-called "fallacies of distributed computing":

  • the network is reliable
  • latency is zero
  • bandwidth is infinite
  • the network is secure
  • topology doesn't change
  • there is a single administrator
  • transport cost is zero
  • the network is homogeneous
  • clocks on all nodes are synchronized

Another way to summarize those challenges, which is the currently prevailing look at it, is the famous Brewer's CAP Theorem, which states that any distributed system may have only two of the three desired properties at once: consistency, availability, and partition tolerance. And since partition tolerance is a required property of any network system, as it's the ability to function in the unreliable network environment (which is the norm), the only possible distributed systems are CP and AP, i.e. they either guarantee consistency but might be unavailable at times, or are constantly available but might sometimes be inconsistent.

Distributed Algorithms

Distributed computation requires distributed data structures and distributed algorithms. The domains that are in active development are distributed consensus, efficient distribution of computation, and efficient change propagation. Google pioneered the area of efficient network computation with the MapReduce framework that originated from the ideas of functional programming and Lisp, in particular. The next-generation systems such as Apache Spark develop these ideas even further.

Yet, the primary challenge for distributed systems is efficient consensus. The addition of the unreliable network makes the problem nontrivial compared to a single-machine variant where the consensus may be achieved easily in a shared-memory setting. The world has seen an evolution of distributed consensus algorithms implemented in different data management systems, from the 2-Phase Commit (2PC) to the currently popular RAFT protocol.

2PC is an algorithm for coordination of all the processes that participate in a distributed atomic transaction on whether to commit or rollback the transaction. The protocol achieves its goal even in many cases of temporary system failure. However, it is not resilient to all possible failure configurations, and in rare cases, manual intervention is needed. To accommodate recovery from failure, the participants of the transaction use logging of states, which may be implemented in different ways. Though usually intended to be used infrequently, recovery procedures compose a substantial portion of the protocol, due to many possible failure scenarios to be considered.

In a "normal execution" of any single distributed transaction, the 2PC consists of two phases:

  1. The "commit-request" or voting phase, in which a coordinator process attempts to prepare all the participating processes to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes" (commit) or "No" (abort).
  2. The "commit" phase, in which, based on the voting of the participants, the coordinator decides whether to commit (only if all have voted "Yes") or rollback the transaction and notifies the result to all the participants. The participants then follow with the needed actions (commit or rollback) with their local transactional resources.
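
As a schematic sketch of the coordinator side of these two phases (the vote-request, participant-commit, and participant-rollback functions are hypothetical RPC stubs, not part of any real library):

(defun two-phase-commit (participants transaction)
  ;; Phase 1: voting — ask every participant to prepare and vote
  (let ((votes (mapcar (lambda (p) (vote-request p transaction))
                       participants)))
    ;; Phase 2: decision — commit only on a unanimous "Yes"
    (cond ((every (lambda (v) (eq v :yes)) votes)
           (mapc (lambda (p) (participant-commit p transaction)) participants)
           :committed)
          (t
           (mapc (lambda (p) (participant-rollback p transaction)) participants)
           :aborted))))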

It is clear, from the description, that 2PC is a centralized algorithm that depends on the authority and high availability of the coordinator process. Centralized and peer-to-peer are the two opposite modes of the network algorithms, and each algorithm is distinguished by its level of centralization.

The 3PC is a refinement of the 2PC which is supposed to be more resilient to failures by introducing an intermediate stage called "prepared to commit". However, it doesn't solve the fundamental challenges of the approach that are due to its centralized nature, only making the procedure more complex to implement and thus having more failure modes.

The modern peer-to-peer coordination alternatives are Paxos and RAFT. RAFT is considered to be the simpler (and, thus, more reliable) approach. It is also, not surprisingly, based on voting. It adds a preliminary phase to each transaction, which is leader election. The election, as well as the other activities within a transaction, doesn't require unanimous agreement, only a simple majority. Besides, execution of all the stages on each machine is timeout-based, so if a network failure or a node failure occurs, the operations are aborted and retried with an updated view of the other peers. The details of the algorithm can be best understood from the RAFT website, which provides a link to the main paper, good visualizations, and other references.

Distributed Data Structures

We have already mentioned various distributed hash-tables and content-addressable storage as examples of these types of structures. Another exciting and rapidly developing direction is eventually-consistent data structures or CRDTs (Conflict-free Replicated Data Types). They are the small-scale representatives of the AP (or eventually-consistent) systems that favor high availability over constant consistency, as such systems become more and more the preferred mode of operation of distributed systems.

The issue that CRDTs address is conflict resolution when different versions of the structure appear due to network partitions and their eventual repair. For a general data structure, if there are two conflicting versions, the solution is either to choose one (according to some general rule, like taking a random one or the latest one, or application-specific logic) or to keep both versions and defer conflict resolution to the client code. CRDTs are conflict-free, i.e. the structures are devised so that any conflict is resolved automatically in a way that doesn't bring any data loss or corruption.

There are two ways to implement CRDTs: convergent structures rely on the replication of the whole state, while commutative ones use operation-based replication. Yet, both strategies result in CRDTs with equivalent properties.

The simplest CRDT is a G-Counter (where "G" stands for grow only). Its operation is based on the trivial fact that addition is commutative, i.e. the order of applying the addition operation doesn't matter: we'll get the same result as long as the number of operations is the same. Every convergent CRDT has a merge operation that combines the states of each node. On each node, the G-Counter stores an array that holds the per-node numbers of the local increments. And its merge operation takes the maximums of the elements of this array across all nodes while obtaining the value of the counter requires summing all of the cells:

(defstruct (g-counter (:conc-name nil))
  ccs)

(defun make-gcc (n)
  (make-g-counter :ccs (make-array n :initial-element 0)))

(defun gcc-val (gcc)
  (reduce '+ (ccs gcc)))

(defun gcc-merge (gcc1 gcc2)
  (make-g-counter :ccs (map* 'max (ccs gcc1) (ccs gcc2))))

The structure is eventually consistent as, at any point in time, asking any live node, we can get the current value of the counter from it (so there's constant availability). However, if not all changes have already been replicated to this node, the value may be smaller than the actual one (so consistency is only eventual once all the replications are over).

The next step is a PN-Counter (positive-negative). It uses a common strategy in CRDT creation: combining several simpler CRDTs. In this case, it is a combination of two G-Counters: one — for the number of increments, and another — decrements.
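
A minimal sketch of that combination, reusing the G-Counter functions defined above (the pn- names are made up; n is the number of nodes):

(defun make-pn (n)
  (cons (make-gcc n)    ; increments
        (make-gcc n)))  ; decrements

(defun pn-val (pn)
  (- (gcc-val (car pn)) (gcc-val (cdr pn))))

(defun pn-merge (pn1 pn2)
  (cons (gcc-merge (car pn1) (car pn2))
        (gcc-merge (cdr pn1) (cdr pn2))))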

A set is, in some sense, a more sophisticated analog of a counter (a counter may be considered a set of 1s). So, a G-Set functions similarly to a G-Counter: it allows each node to add items to the set, and they are stored in the relevant cell of the main array. The merging and value retrieval operations use union. Similarly, there's the 2P-Set (2-phase), which is similar in construction to the PN-Counter. The difference of a 2P-Set from a normal set is that once an element is put into the removal G-Set (called the "tombstone" set) it cannot be readded to the set. I.e. addition may be undone, but deletion is permanent. This misfeature is amended by the LWW-Set (last-write-wins) that adds timestamps to all the records. Thus, an item with a more recent timestamp prevails, i.e. if an object is present in both underlying G-Sets it is considered present in the set if its timestamp in the addition set is greater than the one in the removal set, and removed — in the opposite case.

There are also more complex CRDTs used to model sequences, including Treedoc, RGA, Woot, Logoot, and LSEQ. Their implementations differ, but the general idea is that each character (or chunk of characters) is assigned a key that can be ordered. When new text is added, it's given a key that is derived from the key of some adjacent text. As a result, the merge is the best-possible approximation of the intent of the edits.

The use cases for CRDTs are, as mentioned above, collaborative editing, maintaining such structures as shopping carts (e.g. with an LWW-Set), counters of page visits to a site or reactions in a social network, and so on and so forth.

Distributed Algorithms in Action: Collaborative Editing

In fact, CRDTs are a data-structure-centric answer to another technology that has been used, for quite some time, to support collaborative editing: Operational Transformation (OT). OT was employed in such products as Google Docs and its predecessors to implement lock-free simultaneous rich-text editing of the same document by many actors.

OT is an umbrella term that covers a whole family of algorithms sharing the same basic principles. Such systems use replicated document storage, i.e. each node in the system operates on its own copy in a non-blocking manner as if it was a single-user scenario. The changes from every node are constantly propagated to the rest of the nodes. When a node receives a batch of changes it transforms the changes before executing them to account for the local changes that were already made since the previous changeset. Thus the name "operational transformation".

The basic idea of OT can be illustrated with the following example. Let's say we have a text document with a string "bar" replicated by two nodes and two concurrent operations:

(insert 0 "f") # o1 on node1
(delete 2 "r") # o2 on node2

Suppose, on node 1, the operations are executed in the order o1, o2. After executing o1, the document changes to "fbar". Now, before executing o2 we must transform it against o1 according to the transformation rules. As a result, it will change to (delete 3 "r"). So, the basic idea of OT is to adjust (transform) the parameters of incoming editing operations to account for the effects of the previously executed concurrent operations locally (whether they were invoked locally or received from some other node) so that the transformed operation can achieve the correct effect and maintain document consistency. The word "concurrent" here means operations that happened since some state that was recorded on the node that has sent the new batch of changes. The transformation rules are operation-specific.
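
A toy sketch of one such transformation rule, adjusting a delete against a previously applied concurrent insert so that it reproduces the (delete 3 "r") result from the example above (the function name and the (position string) representation of operations are assumptions for illustration):

(defun xform-delete-against-insert (del ins)
  "Shift the offset of the delete DEL if the insert INS landed at or before it."
  (destructuring-bind (dpos dstr) del
    (destructuring-bind (ipos istr) ins
      (if (<= ipos dpos)
          (list (+ dpos (length istr)) dstr)   ; the insert pushed the target right
          (list dpos dstr)))))                 ; the insert was after: no change

;; (xform-delete-against-insert '(2 "r") '(0 "f")) => (3 "r")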

In theory, OT seems quite simple, but it has its share of implementation nuances and issues:

  • While the classic OT approach of defining operations through their offsets in the text seems to be simple and natural, real-world distributed systems raise serious issues: namely, that operations propagate with finite speed (remember one of the network fallacies), states of participants are often different, thus the resulting combinations of states and operations are extremely hard to foresee and understand.
  • For OT to work, every single change to the data needs to be captured: "Obtaining a snapshot of the state is usually trivial, but capturing edits is a different matter altogether. The richness of modern user interfaces can make this problematic, especially within a browser-based environment."
  • The notion of a "point in time" relevant to which the operations should be transformed is nontrivial to implement correctly (another network fallacy in play). Relying on global time synchronization is one of the approaches, but it requires tight control over the whole environment (which Google has demonstrated to be possible for its datacenters). So, in most cases, a distributed solution instead of simple timestamps is needed.

The most popular of these solutions is a vector clock (VC). The VC of a distributed system of n nodes is a vector of n logical clocks, one clock per process; a local "smallest possible values" copy of the global clock-array is kept in each process, with the following rules for clock updates:

  • Initially all clocks are zero.
  • Each time a process experiences an internal event, it increments its own logical clock in the vector by one.
  • Each time a process sends a message, it increments its own logical clock in the vector by one (as in the bullet above, but not twice for the same event) and then sends a copy of its own vector.
  • Each time a process receives a message, it increments its own logical clock in the vector by one and updates each element in its vector by taking the maximum of the value in its own vector clock and the value in the vector in the received message (for every element).

You might notice that the operation of vector clocks is similar to the CRDT G-counter.

VCs allow the partial causal ordering of events. A vector clock value for the event x is less than the value for y if and only if, for all indices, the items of x's clock are less than or equal to those of y's, and, for at least one element, they are strictly smaller.
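
A minimal vector-clock sketch following the update rules and the comparison above (assumed names; the arrays are indexed by process id):

(defun make-vc (n)
  (make-array n :initial-element 0))

(defun vc-tick (vc i)
  "Local or send event at process I."
  (incf (aref vc i))
  vc)

(defun vc-receive (vc i incoming)
  "Receive event at process I: tick, then take the element-wise maximum."
  (incf (aref vc i))
  (map-into vc #'max vc incoming))

(defun vc< (a b)
  "Partial causal order: A happened before B."
  (and (every #'<= a b)
       (some #'< a b)))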

Besides vector clocks, the other mechanisms to implement distributed partial ordering include Lamport Timestamps, Plausible Clocks, Interval Tree Clocks, Bloom Clocks, and others.

Persistent Data Structures

To conclude this chapter, I wanted to say a few words about the role of the functional paradigm in synchronization and distributed computing. It's no coincidence that it was mentioned several times in the description of different synchronization strategies: essentially, functional programming is about achieving good separation of concerns by splitting computations into independent referentially-transparent units that are easier to reason about. Such an approach supports concurrency more natively than the standard imperative paradigm, although it might not be optimal computationally (at least, in the small). Yet, the gains obtained from parallelism and utilizing the scale of distributed computing may greatly outweigh this low-level inefficiency. So, with the advent of concurrent and distributed paradigms, functional programming gains more traction and adoption. Such ideas as MapReduce, STM, and message-passing-based coordination originated in the functional programming world.

Another technology coming from the functional paradigm that is relevant to synchronization is Purely Functional Data Structures. Their principal property is that any modification doesn't cause a destructive effect on the previous version of the structure, i.e., with each change, a new version is created, while the old one may be preserved or discarded depending on the particular program requirements. This feature makes them very well suited for concurrent usage as the possibility of corruption due to incorrect operation order is removed, and such structures are also compatible with any kind of transactional behavior. The perceived inefficiency of constant copying, in many cases, may be mostly avoided by using structure sharing. So the actual cost of maintaining these data structures is not proportional to their size, but rather constant or, at worst, logarithmic in size. Another name for these structures is persistent data structures — in contrast to "ephemeral" ones which operate by destructive modification.

The simplest persistent functional structure, as we already mentioned in one of the preceding chapters, is a Lisp list[5] used as a stack. We have also seen the queue implemented with two stacks called a Real-Time Queue. It is a purely functional data structure, as well. The other examples are mostly either list- or tree-based, i.e. they also use the linked backbone structured in a certain way.

To illustrate once again how most persistent data structures operate, we can look at a Zipper that may be considered a generalization of a Real-Time Queue. It is a technique of representing a data structure so that it is convenient for writing programs that traverse the structure arbitrarily and update its contents in a purely functional manner, i.e. without destructive operations. A list-zipper represents the entire list from the perspective of a specific location within it. It is a pair consisting of a record of the reverse path from the list start to the current location and the tail of the list starting at the current location. In particular, the list-zipper of a list (1 2 3 4 5), when created, will look like this: (() . (1 2 3 4 5)). As we traverse the list, it will change in the following manner:

  • ((1) . (2 3 4 5))
  • ((2 1) . (3 4 5))
  • etc.

If we want to replace 3 with 0, the list-zipper will become ((2 1) . (0 4 5)) while the previous version will still persist. The new zipper will reuse the list (2 1) and create a new list by consing 0 to the front of the sublist (4 5). Consequently, after performing 2 movements and 1 update, the memory will hold both versions, which share the path list (2 1) and the common tail (4 5).

It is apparent that each operation on the zipper (movement or modification) adds at most a single additional element. So, its complexity is the same as for normal lists (although, with larger constants).
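
Here is a minimal list-zipper sketch matching the representation above (the function names are made up for illustration):

(defun make-zipper (list)
  (cons nil list))                     ; (reversed-path . tail)

(defun zipper-right (z)
  "Move the focus one element to the right, sharing both sublists."
  (cons (cons (car (cdr z)) (car z))
        (cdr (cdr z))))

(defun zipper-set (z val)
  "Replace the element at the focus, reusing the path and the rest of the tail."
  (cons (car z) (cons val (cdr (cdr z)))))

;; (zipper-right (zipper-right (make-zipper '(1 2 3 4 5)))) => ((2 1) 3 4 5)
;; (zipper-set * 0)                                         => ((2 1) 0 4 5)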

Zippers can operate on any linked structures. A very similar structure for trees is called a Finger Tree. To create it from a normal tree, we need to put "fingers" to the right and left ends of the tree and transform it like a zipper. A finger is simply a point at which you can access part of a data structure.

Let's consider the case of a 2-3 tree for which the finger approach was first developed. First, we restructure the entire tree and make the parents of the first and last children the two roots of our tree. The resulting finger tree is composed of several layers that sit along the spine of the tree. Each layer of the finger tree has a prefix (on the left) and a suffix (on the right), as well as a link further down the spine. The prefix and suffix contain values in the finger tree: on the first level, they contain values (2-3 trees of depth 0); on the second level, they contain 2-3 trees of depth 1; on the third level, they contain 2-3 trees of depth 2, and so on. This somewhat unusual property comes from the fact that the original 2-3 tree was of uniform depth. The edges of the original 2-3 tree are now at the top of the spine. The root of the 2-3 tree is now the very bottom element of the spine. As we go down the spine, we are traversing from the leaves to the root of the original 2-3 tree; as we get closer to the root, the prefixes and suffixes contain deeper and deeper subtrees of the original 2-3 tree.

Now, the principle of operation (traversal or modification) on the finger tree is the same as with the zipper: with each change, some elements are tossed from one side of the spine to the other, and the number of such elements remains within the O(log n) limits.

Finally, another data structure that is crucial for the efficient implementation of systems that rely solely on persistent data structures (like the Clojure language environment) is a Hash-Array Mapped Trie (HAMT). It may be used both in ephemeral and persistent mode to represent maps and sets with O(log n) access complexity[6]. HAMT is a special trie that uses the following two tricks:

  • as an array-mapped trie, instead of storing pointers to the children nodes in a key-value table indexed with their subkeys, it stores an array of pointers and a bitmap that is used to determine if a pointer is present and at what position in the array it resides. This feature requires limiting the number of possible subkeys (for example, individual characters, which are the dominant use case for tries) to the length of the bitmap. The default length is 32, which is enough to represent the English alphabet :)
  • however, the hash feature gives us a number of benefits, including lifting the limitations on the subkeys. Actually, in a HAMT, all values are stored at the leaves that have the same depth, while the subkeys are obtained by, first, hashing the key and then splitting the obtained hash into n-bit ranges (where n is usually 5, matching the 32-element bitmap)[7]. Each subkey is used as an index into the bitmap: if the element at it is 1 the key is present. To calculate the index of a pointer in the pointer array, we need to perform popcount on the preceding bits.
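
The bitmap-plus-popcount trick from the last bullet can be sketched in a couple of lines (an assumed helper, not the actual indexing code of any particular HAMT implementation):

(defun hamt-child-index (bitmap subkey)
  "Return the position of the child for the 5-bit SUBKEY in the packed
child array, or NIL if that child is absent."
  (when (logbitp subkey bitmap)
    ;; popcount of the bits below SUBKEY = number of preceding children
    (logcount (ldb (byte subkey 0) bitmap))))

;; (hamt-child-index #b101100 3) => 1
;; children exist at subkeys 2, 3 and 5; one of them (subkey 2) precedes subkey 3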

With such a structure, all major operations will have O(log n) complexity with base 32, i.e. effectively constant. However, hash collisions are possible, so the hash-table-related collision considerations also apply to a HAMT. In other words, HAMTs are pretty similar to hash-tables, with the keys being split into parts and put into a trie. However, due to their tree-based nature, the memory footprint and the runtime performance of iteration and equality checking of HAMTs lag behind their array-based counterparts:

  • Increased memory overhead as each internal node adds an overhead over a direct array-based encoding, so finding a small representation for internal nodes is crucial.
  • On the other hand, HAMTs do not need expensive table resizing and do not waste (much) space on null references.
  • Iteration is slower due to non-locality, while a hash-table uses a simple linear scan through a continuous array.
  • Delete can cause the HAMT to deviate from the most compact representation (leaving nodes with no children in the tree).
  • Equality checking can be expensive due to non-locality and the possibility of a degenerate structure due to deletes.

So, what's the value of this structure if it's just a slightly less efficient hash-table? The difference is that a HAMT can not only be implemented with destructive operations but, being a tree, can also be easily adapted to the persistent mode with the usual path-copying trick that we have already seen.

Complexity estimations for persistent data structures use amortized analysis to prove acceptable performance (O(log n)). Another trick at play here is called scheduling, and it lies in properly planning heavy structure-rebuilding operations and splitting them into chunks to avoid having to execute some of them at a time when optimal complexity can't be achieved. To learn more about these topics, read the seminal book by Chris Okasaki, "Purely Functional Data Structures"[8], which describes these methods in more detail and provides complexity analysis for various structures.

Besides, the immutability of persistent data structures enables additional optimizations that may be important in some scenarios:

  • native copy-on-write (COW) semantics that is required in some domains and algorithms
  • objects can be easily memoized
  • properties, such as hashes, sizes, etc., can be precomputed

The utility of persistent data structures is only gradually being realized and appreciated. Recently, some languages, including Clojure, were built around them as core structures. Moreover, some people even go as far as to claim that git is a purely functional data structure due to its principal reliance on structure-sharing persistent trees to store the data.

Take-aways

We have covered a lot of ground in this chapter at a pretty high level. Obviously, you can go much deeper: whole books have been written on the topics of concurrency and distributed computing.

Overall, concurrency can be approached from, at least, three different directions:

  1. There's a low-level view: the means that should be provided by the underlying platforms to support concurrent operation. It includes the threading/process APIs, and the atomic-operation, synchronization, and networking primitives.
  2. Then, there's an architecture viewpoint: what constraints our systems should satisfy and how to ensure that. At this level, the main distinctions are drawn: shared-memory vs shared-nothing, centralized vs peer-to-peer.
  3. And, last but not least, comes the algorithmic perspective. What data structures (as usual, they are, in fact, more important than the algorithms) can be used to satisfy the constraints in the most efficient way possible, or to simplify the architecture? We have seen several examples of special-purpose ones that cater to the needs of a particular problem: lock-free data structures, eventually-consistent, purely functional persistent ones. And then, there are some areas where special-purpose algorithms also play a major role. Their main purpose, there, is not so much computational efficiency (like we're used to), but, mostly, correctness coupled with good enough efficiency[9]. Mutual exclusion and distributed consensus algorithms are examples of such targeted algorithm families.

There's a lot of room for further research in the realms of synchronization and, especially, distributed computation. It is unclear whether the new breakthroughs will come from our current computing paradigms or we'll have to wait for a new tide and new approaches. Anyway, there's still a chance to make a serious and lasting contribution to the field by developing new algorithm-related stuff. And not only that. Unlike other chapters, we haven't talked much here about the tools that can help a developer of concurrent programs. The reason for that is, actually, an apparent lack of such tools, at least of widely adopted ones. Surely, the toolbox we have already studied in the previous chapters is applicable here, but an environment with multiple concurrent threads and, possibly, multiple address spaces adds new classes of issues and seriously complicates debugging. There are network service tools to collect metrics and execution traces, but none of them is tightly integrated into the development toolboxes, not to speak of their limited utility. So, substantial pieces are still missing from the picture and are waiting to be filled in.


[1] We will further use the term "thread" to denote a separate computation running as part of our application, as it is less ambiguous than "process" and also much more widespread than all the other terms.

[2] This internal "threading", usually, also relies on the OS threading API behind the scenes.

[3] In this context, they tend to be called "processes", but we'll still stick to the term "thread".

[4] The other apparent limitation of supporting only two threads can be lifted by a modification to the algorithm, which requires some hardware support.

[5] If we forbid the destructive rplaca/rplacd operations and their derivatives.

[6] and quite a high tree arity (usually 32), which means very shallow trees, resulting in just a handful of hops even for quite large structures

[7] Except for the length of the leftmost range that depends on the number of bits in a hash. For instance, for a 32-bit hash, it may be 7, and the depth of the whole HAMT would be 5.

[8] His thesis with the same title is freely available, but the book covers more and is more accessible.

[9] The reason for that might be relative immaturity of this space, as well as its complexity, so that our knowledge of it hasn't been developed enough to reach the stage when optimization becomes the main focus.

Didier VernaELS 2020 COVID update

· 5 days ago

Dear all,

as promised two weeks ago, here is an update on the current situation of ELS 2020. Thank you all for your patience.

  1. Given the evolution of the world-wide situation with regard to the COVID pandemic, it will come as no surprise that we have to cancel the Zurich event.
  2. I have consequently closed the registrations, and I will proceed with full refunding this afternoon. Note however that it will take some time for the money to get back to your accounts.
  3. In the meantime, the reviewing process has continued as usual, and I'm happy to announce that thanks to the authors, the PC, and under the supervision of Ioanna, we now have a preliminary programme online! At least, this will already give you a taste of what we have this year.
  4. A fallback solution is still under consideration, so stay tuned for updates. What I can tell you right now is this: a fully online event is very unlikely. More likely would be a semi-interactive online event, with the broadcasting of pre-recorded talks and a real-time channel for interaction. We have not completely ruled out the possibility of just postponing the physical event, but that solution is not very probable. Finally, if we go for an online event, I will make sure that it will be free and open for anyone to "attend".

So, this is what I can tell you right now. Finally, I also want to extend my warmest thanks to all the persons who showed their support (notably financial, which is critical), by staying optimistic and registering anyway, hoping for the best. This means a lot.

Stay safe!

Quicklisp newsMarch 2020 Quicklisp dist update now available

· 8 days ago
New projects:
  • cl-capstone — Raw Common Lisp FFI interface to the Capstone disassembler — MIT
  • cl-catmull-rom-spline — Catmull-Rom Spline — Public Domain
  • cl-fixtures — A simple library to create and use parameterized fixtures — MIT
  • cl-gserver — Erlang inspired GenServer library with Agent for easy access to state. — MIT
  • cl-sparql — SPARQL query builder for Common Lisp — MIT
  • cl-torrents — Search for torrents on popular trackers. Lisp library, CLI interface, terminal application, Tk GUI. — MIT
  • cl-transmission — A Common Lisp library to interface with transmission using its rpc — MIT
  • cl-utils — GrammaTech Common Lisp Utilities — MIT
  • enhanced-boolean — Provides a canonical way of converting generalized booleans to booleans. — Public Domain
  • fakenil — Provides a canonical stand-in for NIL for contexts where NIL means "no value". — Public Domain
  • fast-generic-functions — Seal your generic functions for an extra boost in performance. — MIT
  • functional-trees — Tree data structure supporting functional manipulation — MIT
  • glacier — lightweight mastodon bot framework — BSD 3-Clause
  • gtirb — Common Lisp library for GTIRB — MIT
  • jpeg-turbo — libjpeg-turbo wrapper for Common Lisp — 2-clause BSD
  • keystone — Raw Common Lisp FFI interface to the Keystone assembler — MIT
  • list-named-class — CLOS extension - name classes after lists of symbols — MIT
  • mutility — modula's utilities. — MIT
  • primecount — prime counting of sublinear complexity — MIT
  • select-file — Select File dialog for McCLIM — MIT
  • sxql-composer — Build and compose SXQL queries dynamically — MIT
  • textery — tracery lisp implementation — BSD 3-Clause
  • trucler — Library for managing lexical environments. — FreeBSD, see file LICENSE.text
  • x.let-star — value binder — BSD compatible
Updated projects: 3bz, agutil, april, arc-compat, bike, bp, bytecurry.mocks, cardiogram, cari3s, caveman, chanl, cl+ssl, cl-abnf, cl-argparse, cl-collider, cl-cxx, cl-dbi, cl-dot, cl-environments, cl-fad, cl-forms, cl-hamt, cl-kraken, cl-messagepack, cl-mpg123, cl-mustache, cl-protobufs, cl-sat, cl-scsu, cl-skkserv, cl-smt-lib, cl-steamworks, cl-str, cl-sxml, cl-webkit, cl4store, classimp, clml, closer-mop, clsql-local-time, clx, command-line-arguments, croatoan, cxml, cxml-stp, dartsclhashtree, dartsclmessagepack, dartscltools, dartscluuid, datafly, datamuse, datum-comments, deploy, easy-routes, eclector, epigraph, erudite, esrap, file-select, flare, flexi-streams, flexichain, font-discovery, for, fset, fxml, gendl, generic-cl, golden-utils, graph, hu.dwim.asdf, hu.dwim.computed-class, hu.dwim.reiterate, hu.dwim.util, hu.dwim.web-server, integral, kenzo, lack, linear-programming, literate-lisp, local-time, magic-ed, magicl, markup, maxpc, mcclim, mito, mmap, myway, numcl, origin, pango-markup, parachute, petalisp, pgloader, phoe-toolbox, piggyback-parameters, plump, polisher, portable-threads, postmodern, protest, pseudonyms, pzmq, query-fs, quilc, qvm, rate-monotonic, read-number, reader, replic, roan, rove, rpcq, safe-queue, sc-extensions, scalpl, sealable-metaobjects, sel, serapeum, sha1, simple-inferiors, sly, specialized-function, spinneret, st-json, staple, stmx, stumpwm, sucle, tfm, tooter, trace-db, trivia, trivial-cltl2, trivial-extensible-sequences, trivial-garbage, trivial-json-codec, trivial-monitored-thread, trivial-timer, trivialib.type-unify, uax-15, utilities.print-tree, vernacular, wild-package-inferred-system, woo, xml.location.

Removed projects: cl-batis, cl-dbi-connection-pool.

The removed projects no longer work due to changes in the cl-dbi API, and the author has not responded to bug reports or pull requests. If they build in the future, they will be added back to Quicklisp.

To get this update, use: (ql:update-dist "quicklisp")

Enjoy!

Lispers.deBerlin Lispers Meetup is indefinitely suspended.

· 11 days ago

No meetings in person for now.

Didier VernaTFM 1.1 "Carolingan Miniscules" is released

· 18 days ago

I'm happy to announce the release of TFM version 1.1 "Carolingan Miniscules". TFM is a TeX Font Metrics parser library for Common Lisp.

This release provides support for font scaling and freezing. Scaling is done, as in TeX, by overriding the font's design size, all other dimensions remaining relative to it. Freezing, on the other hand, converts all relative dimensions into absolute ones, potentially saving a lot of run-time arithmetic computation.

Get it at the usual place.

Marco AntoniottiWhy You Cannot (Yet) (Portably) Write an "Interval Arithmetic" Library in Common Lisp

· 23 days ago
Hi

it has been a while since I posted anything here.  Common Lisp is a nice "comfort zone" for me and I essentially play around with it, in ways that eventually make me wander in the wide ocean of knowledge out there.

Here is the takeaway: I spent some (too much?) time thinking about parts of Common Lisp related to numerical computations and the result is this arXiv paper with the same title of this blog post.

Everything started when I was looking for a cute exercise/project to give students. I had settled on "write an Interval Arithmetic library".  Interval Arithmetic (IA) is an interesting topic per se, with many not-so-trivial mathematical underpinnings.  Much work has been done to provide implementations and standards on current processor architectures and, especially, on libraries for the current IEEE-754 floating-point standard.

A very simple interval representation in Common Lisp is the following.

(defstruct (int (:constructor int (l h)))
  (l 0.0 :type real)
  (h 0.0 :type real))

Given this representation, a very simple interval sum operation could be written as

(defun int-sum (i1 i2)
  (int (+ (int-l i1) (int-l i2))
       (+ (int-h i1) (int-h i2))))

The other operations pretty much follow the same pattern until you get to division, which opens up a can of worms. But let's forget about division and see what other issues pop up.
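For reference, here is roughly what one of those other operations looks like; multiplication is a bit less mechanical than addition, since it needs a min/max over the four endpoint products (a sketch of mine, still ignoring the rounding issues discussed below):

(defun int-mul (i1 i2)
  (let ((products (list (* (int-l i1) (int-l i2))
                        (* (int-l i1) (int-h i2))
                        (* (int-h i1) (int-l i2))
                        (* (int-h i1) (int-h i2)))))
    (int (reduce #'min products)
         (reduce #'max products))))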

IEEE-754 infinities etc...


First of all, you have the issue of what happens in "extreme" cases. What should the following sums return?

(defvar i1 (int 0.0 1.0))

(defvar i2 (int 42.0 most-positive-double-float))

(int-sum i1 i2)

(int-sum i2 i2)

Common Lisp does not quite specify how to handle these cases; therefore, implementations may differ in what they return or how they handle overflows. Note also that having the feature :ieee-floating-point set does not mean that your implementation provides all the bells and whistles of IEEE-754. In particular, you do not have access to a portable representation of IEEE-754 infinities (and NaNs).

Control on Rounding Direction


Another issue with the code above is that the two + operations yielding the lower and upper bounds of the result interval should not behave in the same way; i.e., since we are talking about floating-point numbers, there are issues of rounding to be considered. Most IA specifications assume IEEE-754 or, more recently, the Language Independent Arithmetic ISO/IEC-10967 (LIA) standard, which defines rounding modes and operations. Common Lisp implementations usually round to nearest, while the IA specifications require more fine-grained control over rounding directions; the int-sum function above should actually be written as follows:

(defun int-sum (i1 i2)
  (int (+↓ (int-l i1) (int-l i2))
       (+↑ (int-h i1) (int-h i2))))

where +↓ and +↑ compute results rounding downward and upward.

Again, Common Lisp implementations vary quite a lot in providing access to IEEE rounding modes, if at all, leaving the programmer in a bind.
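One portable, if slow, way to emulate +↓ and +↑ without implementation-specific rounding control is to redo the arithmetic exactly on rationals and nudge the round-to-nearest float result in the required direction. This is only a sketch of mine, not what the paper proposes, and it ignores zeros, infinities, and overflow:

(defun next-down (f)
  "Return a float strictly below F (possibly skipping one representable value)."
  (multiple-value-bind (m e s) (integer-decode-float f)
    (float (* s (if (plusp s) (1- m) (1+ m)) (expt 2 e)) f)))

(defun sum-down (a b)
  "A + B rounded towards negative infinity."
  (let ((f (+ a b))
        (exact (+ (rational a) (rational b))))
    (if (<= (rational f) exact) f (next-down f))))

A symmetric sum-up gives the upper bound, and the pair yields a correct, if occasionally slightly over-wide, interval sum.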

So, What's Next?


This has been a learning experience for me. Once I found out about LIA, I took some time to try to grok it. The result is the arXiv paper I mentioned, where I advocate a community effort to come up with a specification for a LIA-compliant arithmetic library/API for Common Lisp. Note that I dare to claim that LIA is a superset of any facility provided in Common Lisp and this is the reason why I would like to suggest a different approach to "language evolution" in this case: specify first, implement later.

Maybe some of the Common Lisp regulars out there will like to join the fun. Will you?


(cheers)

Lispers.deHamburg Lispers Meetup, Monday, 2nd March 2020

· 32 days ago

Greetings, Lisplings!

Our monthly gathering in Hamburg takes place as usual on the first Monday of the month, on 2nd March 2020. We meet around 19:00 (7 p.m.) at Ristorante Opera, Dammtorstr. 7.

This is an informal event for all Lisp experience levels, and about any Lisp dialect.

Come as you are, bring lispy topics.

Nicolas HafnerMarch Kandria Status Update - Gamedev

· 32 days ago

Another month has flown by, and this one has turned out to be much more exciting than I ever anticipated! Primarily this is due to me making a surprise decision to try and apply for a grant to help get this project funded. This prompted a lot of changes, which I'll talk about today.

If you're new to Kandria (formerly named Leaf), you can find a brief description of the game in the trailer embedded below, and a longer description in the document linked after.

Now, before we get to talking about the development, I'll talk a bit about the grant and how I came to apply for it. It starts in the late summer of 2019, when I finally decided to push myself a bit outside my usual social boundaries and visited the Zürich Game Developers meetup. From there I learned about Zürindies, a weekly get-together, which I started to attend every Saturday. This has been absolutely fantastic for my morale and my interest in working seriously on the game. Fast forward a couple of months, and some people banded together to open a shared space for game developers in Zürich called the Swiss Game Hub. After hearing that most of my fellow Zürindies people decided to take up the offer and work there a couple of days every week, I made the jump as well. I've been there twice a week since January now, and it has been pretty good!

Sometimes small events are also held at the game hub, one of which was about the Pro Helvetia institution, a Swiss culture fund with a division for interactive media, or simply: games. They offer grants for studios and indies alike, with different categories for pre-production, production, and post-production. With a little bit of a push from the other Zürindies attendees, I made the decision to try for the pre-production grant as well. The past two weeks have then been a rush to get things ready and done for the grant application, which is due today.

In order to apply for the grant there's a bunch of restrictions, but also a bunch of things you need to hand in:

  • A game design document pitching the game
  • A budget showing your expected expenses and income
  • Market analysis to show what you're targeting and what sales you project
  • A short trailer to describe your game
  • A playable demo of the game
  • A CV for every person on the core development team
  • A website (Optional)

That's a lot of stuff to get going in two weeks, especially when you don't work full time on the project and there's a lot about the game that's still very brittle. I also had no experience with budgeting at all, so that part caused me a lot of confusion.

Nevertheless, the application was completed! Now it's all up to the jury to see whether they like my pitch well enough over the surely many others to fund it. According to what others have said, it'll take a few months for the jury to make the decision. I don't intend to rely on the grant for the development of the game though, so I won't wait with bated breath and will just proceed as I would have anyway.

One of the most important changes I made during the application preparation was to finally pick a proper name for the project. Some of you may still remember the game under its prototype name "Leaf", which was only picked because I had to use something to name the project. After lots of deliberation and indecision I finally settled on "Kandria". The name itself has no meaning; it was picked to sound good and avoid collisions with other games and projects. I think it does a pretty good job at that.

Now that the application is complete, I decided it would be best to publish all of that material I had to make for it too, and use it to do my first serious attempt at marketing Kandria a bit. So, without further ado, here's the pitch trailer for the game:

I'm quite happy with what I managed to put together in less than a week of time, though I'm especially happy that I decided to hire Alex Gray to act out the couple of lines in the video. It works so much better than my initial attempt of just having subtitles.

I'm much less happy with the gameplay scenes in the video, since those were literally only put together yesterday, and the short level I made for it is lacking in a lot of ways. I would have especially loved to include some combat or dialogue interaction, but unfortunately neither of those systems were ready in time. I'll focus on those very soon, though, so hopefully I'll be able to get a much better demo together before long.

The game design document with the pitch, budgeting, and market analysis stuff is also available publicly if you're interested in what I put together for that:

https://github.com/Shinmera/kandria/blob/master/doc/pitch/pitch.pdf

I'm sure that someone experienced in finance would tear their hair out over what I threw in for the budget, but hopefully it'll be good enough for the Pro Helvetia jury. Another important thing the document includes though is a rough schedule: I hope to finish pre-production and enter the full production phase in Summer 2020 (that's this year). This is not a lot of time, especially with me still having to attend lectures for my Master's, but I think it should be doable. There's not that much that still needs to be added for the engine, and most of it is bugfixes and reintegrating older systems that have grown stale.

Once production starts I gave the very optimistic estimate of a year until release, meaning Summer 2021, which I don't think is realistic. Depending on how things go, how much help I'll be able to afford, and so forth I think it might be possible, but a more reasonable target would be 2022.

Making such a large game is probably not the best of ideas for a first title of an indie — in fact I know it's a terrible idea — but it is just what I want to do, and damn it, I will try and fail to do what I want first instead of cutting my standards preemptively.

Another thing that came out of the grant application is a website! I don't have a dedicated domain for that yet, so it's currently just living on Github:

https://shinmera.github.io/kandria

I think that'll be fine enough until I enter production at least. On the site you can also get a copy of the prototype demo, and sign up for a mailing list. I'll definitely put these monthly updates onto the list, but I'll probably also share some exclusive content there.

Finally there's now also a Discord for Kandria in case you want to hang out and chat about the game and its development with other interested people, and myself, of course.

There wasn't a lot of time left besides the grant application to work on the game itself. What I did do was focus on improving the editor, fixing bugs, allowing Steam integration, making the gamepad library Lisp-native, testing deployment on Windows and Linux, and getting started on the animation editor I talked about last month. There's not a lot of visual stuff to show off for all of that either, I'm afraid.

Once I get some more collision and UI bugs fixed I'll continue with the animation editor, so I can finally start working on the combat system. If things go well I should have a very interesting update for you next month!

If you want to stay in touch with news about Kandria, you can use the mailing list, Discord, twitter, gamedev.net, or the Atom feed.

As always, thanks a lot to everyone who shows interest in this project and the other things I do, it really means the world to me to see that people care!

Charles ZhangBlock compilation - "Fresh" in SBCL 2.0.2

· 33 days ago
I've just managed to land a feature I've been missing in SBCL for quite some time now called block compilation, enabling whole-program-like optimizations for a dynamic language. There was at least one person back in 2003 who requested such a feature on the sbcl-help mailing list, so it's definitely been a long time coming. My suspicion is that although many of you old-timers have at least heard of this feature, this is one of those old state-of-the-art features in dynamic language implementations lost and forgotten for younger generations to the passage of time...

what is "block compilation" anyway?

Those of you using Lisp, or any dynamic language know one thing: Function calls to global, top-level functions are expensive. Much more expensive than in a statically compiled language. They're slow because of the late-bound nature of top-level defined functions, allowing arbitrary redefinition while the program is running and forcing runtime checks on whether the function is being called with the right number or types of arguments. This type of call is known as a "full call" in Python (the compiler used in CMUCL and SBCL, not to be confused with the programming language), and their calling convention permits the most runtime flexibility.

But there is another type of call available to us: the local call. A local call is the type of call you would see between local functions inside a top-level function, say, a call to a function introduced via anonymous LAMBDAs, LABELs or FLETs in Lisp, or internal defines in Scheme and Python. These calls are more 'static' in the sense that they are treated more like function calls in static languages, being compiled "together" and at the same time as the local functions they reference, allowing them to be optimized at compile-time. For example, argument checking can be done at compile time because the number of arguments of the callee is known at compile time, unlike in the full call case where the function, and hence the number of arguments it takes, can change dynamically at runtime at any point. Additionally, the local call calling convention can allow for passing unboxed values like floats around, as they are put into unboxed registers never used in the full call convention, which must use boxed argument  and return value registers.

Block compilation is simply the compilation mode that turns what would normally be full calls to top-level defined functions into local calls to said functions, by compiling all functions in a unit of code (e.g. a file) together in "block" or "batch" fashion, just as local functions are compiled together in a single top-level form. You can think of the effect of block compilation as transforming all the DEFUNs in a file into one large LABELs form. You can think of it as being a tunable knob that increases or decreases how dynamic or static the compiler should act with the respect to function definitions, by controlling whether function name resolution is early-bound or late-bound in a given block compiled unit of code.
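To make the LABELS analogy concrete, here is a purely conceptual sketch (mine, not what the compiler literally produces) of how a file with two DEFUNs behaves under block compilation:

;; the file
;;   (defun square (x) (* x x))
;;   (defun sum-of-squares (a b) (+ (square a) (square b)))
;; acts roughly as if it had been written as
(labels ((square (x) (* x x))
         (sum-of-squares (a b) (+ (square a) (square b))))
  ;; the entry points are still published under their global names,
  ;; but the calls between them are resolved at compile time
  (setf (fdefinition 'square) #'square
        (fdefinition 'sum-of-squares) #'sum-of-squares))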

We can achieve block compilation with a file-level granularity in CMUCL and SBCL specifically by specifying the :block-compile keyword to compile-file. Here's an example:

In foo.lisp:
(defun foo (x y)
  (print (bar x y))
  (bar x y))

(defun bar (x y)
  (+ x y))

(defun fact (n)
  (if (zerop n)
      1
      (* n (fact (1- n)))))

> (compile-file "foo.lisp" :block-compile t :entry-points nil)
> (load "foo.fasl")

> (sb-disassem:disassemble-code-component #'foo)

; Size: 210 bytes. Origin: #x52E63F90 (segment 1 of 4)        ; (XEP BAR)
; 3F90:       .ENTRY BAR(X Y)                           
     ; (SB-INT:SFUNCTION
                                                              ;  (T T) NUMBER)
; 3FA0:       8F4508           POP QWORD PTR [RBP+8]
; 3FA3:       4883F904         CMP RCX, 4
; 3FA7:       0F85B1000000     JNE L2
; 3FAD:       488D65D0         LEA RSP, [RBP-48]
; 3FB1:       4C8BC2           MOV R8, RDX
; 3FB4:       488BF7           MOV RSI, RDI
; 3FB7:       EB03             JMP L1
; 3FB9: L0:   8F4508           POP QWORD PTR [RBP+8]
; Origin #x52E63FBC (segment 2 of 4)                          ; BAR
; 3FBC: L1:   498B4510         MOV RAX, [R13+16]              ; thread.binding-stack-pointer
; 3FC0:       488945F8         MOV [RBP-8], RAX
; 3FC4:       4C8945D8         MOV [RBP-40], R8
; 3FC8:       488975D0         MOV [RBP-48], RSI
; 3FCC:       498BD0           MOV RDX, R8
; 3FCF:       488BFE           MOV RDI, RSI
; 3FD2:       E8B9CB29FF       CALL #x52100B90                ; GENERIC-+
; 3FD7:       488B75D0         MOV RSI, [RBP-48]
; 3FDB:       4C8B45D8         MOV R8, [RBP-40]
; 3FDF:       488BE5           MOV RSP, RBP
; 3FE2:       F8               CLC
; 3FE3:       5D               POP RBP
; 3FE4:       C3               RET
; Origin #x52E63FE5 (segment 3 of 4)                          ; (XEP FOO)
; 3FE5:       .SKIP 11
; 3FF0:       .ENTRY FOO(X Y)                           
     ; (SB-INT:SFUNCTION
                                                              ;  (T T) NUMBER)
; 4000:       8F4508           POP QWORD PTR [RBP+8]
; 4003:       4883F904         CMP RCX, 4
; 4007:       7557             JNE L3
; 4009:       488D65D0         LEA RSP, [RBP-48]
; 400D:       488955E8         MOV [RBP-24], RDX
; 4011:       48897DE0         MOV [RBP-32], RDI
; Origin #x52E64015 (segment 4 of 4)                          ; FOO
; 4015:       498B4510         MOV RAX, [R13+16]              ; thread.binding-stack-pointer
; 4019:       488945F0         MOV [RBP-16], RAX
; 401D:       4C8BCD           MOV R9, RBP
; 4020:       488D4424F0       LEA RAX, [RSP-16]
; 4025:       4883EC40         SUB RSP, 64
; 4029:       4C8B45E8         MOV R8, [RBP-24]
; 402D:       488B75E0         MOV RSI, [RBP-32]
; 4031:       4C8908           MOV [RAX], R9
; 4034:       488BE8           MOV RBP, RAX
; 4037:       E87DFFFFFF       CALL L0
; 403C:       4883EC10         SUB RSP, 16
; 4040:       B902000000       MOV ECX, 2
; 4045:       48892C24         MOV [RSP], RBP
; 4049:       488BEC           MOV RBP, RSP
; 404C:       E8F1E163FD       CALL #x504A2242                ; #<FDEFN PRINT>
; 4051:       4C8B45E8         MOV R8, [RBP-24]
; 4055:       488B75E0         MOV RSI, [RBP-32]
; 4059:       E95EFFFFFF       JMP L1
; 405E: L2:   CC10             INT3 16                        ; Invalid argument count trap
; 4060: L3:   CC10             INT3 16                        ; Invalid argument count trap

You can see that FOO and BAR are now compiled into the same component (with local calls), and both have valid external entry points. This improves locality of code quite a bit and still allows calling both FOO and BAR externally from the file (e.g. in the REPL). The only thing that has changed is that within the file foo.lisp, all calls to functions within that file shortcut going through the global fdefinition's external entry points which do all the slow argument checking and boxing. Even FACT is faster because the compiler can recognize the tail recursive local call and directly turn it into a loop. Without block-compilation, the user is licensed to, say, redefine FACT while it is running, which forces the compiler to make the self call into a normal full call to allow redefinition and full argument and return value processing.
But there is one more goody block compilation adds...

the :entry-points keyword

Notice we specified :entry-points nil above. That's telling the compiler to still create external entry points to every function in the file, since we'd like to be able to call them normally from outside the code component (i.e. the compiled compilation unit, here the entire file). Now, those of you who know C know there is a useful way to get the compiler to optimize file-local functions, for example automatically inlining them if they are once-use, and also enforce that the function is not visible externally from the file. This is the static keyword in C. The straightforward analogue when block compiling is the :entry-points keyword. Essentially, it makes the DEFUNs which are not entry points not have any external entry points, i.e. they are not visible to any functions outside the block compiled unit and so become subject to an assortment of optimizations. For example, if a function with no external entry point is never called in the block-compiled unit, it will just be deleted as dead code. Better yet, if a function is once-use, it will be removed and directly turned into a LET at the call site, which essentially acts as inlining with no code-size tradeoff and is always an optimization.
So, for example, we have
> (compile-file "test.lisp" :block-compile t :entry-points '(bar fact))
which removes FOO for being unused in the block compiled unit (the file). This is all documented very well in the CMUCL manual section Advanced Compiler Use and Efficiency Hints under "Block compilation". Unfortunately this section (among others) never made it over to SBCL's manual, though it is still 99% accurate for SBCL.

a brief history of block compilation

Now to explain why the word "Fresh" in the title of this post is in quotes. You may be surprised to hear that CMUCL, the progenitor of SBCL, has had this interface to block compilation described above since 1991. Indeed, the Python compiler, which was first started in 1985 by Rob MacLachlan, was designed with the explicit goal of being able to block compile arbitrary amounts of code at once in bulk in the manner described above as a way to close the gap between dynamic and static language implementations. Indeed, the top-level intermediate representation data structure in Python is the COMPONENT, which represents a connected component of the flow graph created by compiling multiple functions together. So, what happened? Why did SBCL not have this feature despite its compiler being designed around it?

the fork

When SBCL was first released and forked off from CMUCL in late 1999, the goal of the system was to make it sanely bootstrappable and more maintainable. Many CMUCL features became casualties of this fork for the sake of getting something working, such as the bytecode compiler, many extensions, hemlock, numerous backends, and block compilation. Many of these features, such as the numerous CMUCL backends, were eventually restored, but block compilation was one of those features that was never brought back into the fold, with bitrotted remnants of the interface lying around in the compiler for decades. The processing of DECLAIM and PROCLAIM forms, which was a crucial part of the more fine-grained form of block compilation, was revamped entirely to make things more ANSI compatible. In fact, the proclamation level of block compilation in CMUCL has not made it back into SBCL even now for this reason, and it is still unclear whether it would be worth adding this form of elegant block compilation back into SBCL and whether it can be done in a clean, ANSI manner. Perhaps once this feature becomes more well known, people will find the finer-granularity form of block compilation useful enough to request (or implement) it.

the revival

Reanimating this bitrotted zombie of a feature was surprisingly easy and difficult at the same time. Because the idea of block compilation was so deeply embedded into the structure of the compiler, most things internally worked right off the bat, sans a few assertions that had crept in assuming an invariant along the lines of one top-level function definition per code component, which is definitionally contrary to the idea of block compilation. The front-end interface to block compilation was, in contrast, completely blown away and destroyed, with fundamental macros like DEFUN having been rewritten and the intermediate representation namespace behaving more locally. It took some time to redesign the interface to block compilation to fit this new framework, and my first attempt last month to land the change ended in a Quicklisp library dropping the compiler into an infinite loop. The cause was a leakage in the intermediate representation which I patched up this month. Now things seem robust enough that it doesn't cause regressions for the normal compilation mode.


what's left?

Not all is perfect though, and there are still a few bugs lurking around. For example, block compiling and inlining currently does not interact very well, while the same is not true for CMUCL. There are also probably a few bugs lurking around with respect to ensuring the right policies are in effect and have consistent semantics with block compilation. In addition, as mentioned above, the form-by-form level granularity given by the CMUCL-style (declaim (start-block ...)) ... (declaim (end-block ..)) proclamations are still missing. In fact, the CMUCL compiler sprinkled a few of these block compilation declarations around, and it would be nice if SBCL could block compile some of its own code to ensure maximum efficiency. However, the basic apparatus works, and I hope that as more people rediscover this feature and try it on their own performance-oriented code bases, bug reports and feature requests around block compilation and whole program optimization will develop and things will start maturing very quickly!
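For reference, the finer-grained CMUCL-style interface mentioned above looks roughly like this (a sketch based on the CMUCL manual; as noted, this form is not available in SBCL at the time of writing):

(declaim (start-block fact))       ; FACT is the only entry point of this block

(defun fact-helper (n acc)
  (if (zerop n) acc (fact-helper (1- n) (* n acc))))

(defun fact (n)
  (fact-helper n 1))

(declaim (end-block))              ; everything in between gets block compiled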

Lispers.deBerlin Lispers Meetup, Monday, 24th February 2020

· 40 days ago

We meet again on Monday 8pm, 24th February.

Berlin Lispers is about all flavors of Lisp including Clojure, Common Lisp, and Scheme.

We have no talk announced for this time. We meet therefore for dinner at 3 Schwestern restaurant in Bethanien in Berlin-Kreuzberg.

But there are plenty of Lisp talks this week in Berlin!

Racketfest - https://racketfest.com

:clojureD - https://clojured.de

Bob2020 - https://bobkonf.de/2020/en/

And also Clojure data science [1] and the Clojure music gathering [2]. Most of the events will have official or unofficial pre and after dinners.

[1] https://ti.to/scicloj/international-clojure-data-science-meetup-in-berlin

[2] https://ti.to/audible-side-effects/audible-side-effects-2020

8pm

3 Schwestern, Mariannenplatz 2, Berlin-Kreuzberg, https://www.3schwestern.com/

It is located in 10min walking distance from U Görlitzer Bahnhof or U Kottbusser Tor or Ostbahnhof. In case of questions call Christian +49 1578 70 51 61 4.

Quicklisp newsFebruary 2020 Quicklisp dist update now available

· 41 days ago
New projects:
  • cl-isolated — A isolated environment for Common Lisp code evaluation — AGPLv3+
  • cl-maxminddb — CL MaxMind DB — GNU Lesser General Public License, v3
  • cl-semver — Semantic Version implementation — MIT
  • cl-tui — High-level library for making Text User Interfaces — MIT
  • cl-wavelets — Wavelet transform library — 2-clause BSD
  • cl-zyre — Zyre is a ZeroMQ-based network protocol for clusters and service discovery. — MIT
  • lisp-preprocessor — Common Lisp embedded template engine — MIT
  • lispcord — A client library for the discordapp bot api — MIT
  • magic-ed — Edit your code from REPL. — MIT
  • mbe — Scheme Macros for Common Lisp — LGPL 2.1
  • minilem — Minimal version of lem, emacs-like editor. A minimal self-modifying Common Lisp editor — MIT
  • ops5 — The Ops5 programming language for production systems — Public Domain
  • ratmath — Math utilities for working with rational numbers and intervals. — MIT
  • rs-colors — A color data type for Common Lisp. — Modified BSD License
  • s-graphviz — a s-expression presentation of GraphViz DOT language — MIT
  • srfi-1 — List Library — MIT
  • srfi-23 — SRFI 23: Error reporting mechanism — Unlicense
  • srfi-6 — SRFI-6: Basic String Ports — Unlicense
  • srfi-98 — SRFI 98: get-environment-variable — Unlicense
  • trivial-coverage — A simple Common Lisp library to print out the code coverage collected. Supports SBCL and CCL. — MIT
  • truetype-clx — Ripped out rendering from clx-truetype — MIT
  • uax-15 — Common lisp implementation of Unicode normalization functions :nfc, :nfd, :nfkc and :nfkd (Uax-15) — MIT
Updated projects: 3b-bmfont, 3d-vectors, adopt, alexandria, also-alsa, april, asdf-viz, assoc-utils, atomics, babel, bdef, beast, bike, binary-io, bobbin, bp, cardiogram, cerberus, cffi, chancery, chanl, cl-ana, cl-ansi-text, cl-argparse, cl-async, cl-autowrap, cl-charms, cl-collider, cl-colors2, cl-conllu, cl-containers, cl-db3, cl-dbi, cl-digraph, cl-ecma-48, cl-elastic, cl-emoji, cl-enumeration, cl-fad, cl-forms, cl-gamepad, cl-gobject-introspection, cl-hamt, cl-kraken, cl-las, cl-ledger, cl-libusb, cl-maxsat, cl-mount-info, cl-netpbm, cl-patterns, cl-pcg, cl-piglow, cl-pslib, cl-pslib-barcode, cl-random-forest, cl-rdkafka, cl-sat, cl-sdl2, cl-sdl2-ttf, cl-shlex, cl-simple-fsm, cl-skkserv, cl-store, cl-str, cl-unification, clad, clazy, clem, clml, closer-mop, clx, codata-recommended-values, common-lisp-jupyter, commonqt, conium, contextl, croatoan, curry-compose-reader-macros, defclass-std, defenum, definitions, deflate, deploy, dexador, easy-audio, easy-routes, eclector, esrap, fare-scripts, flow, fset, gendl, generic-cl, golden-utils, graph, helambdap, hu.dwim.stefil, hu.dwim.walker, ironclad, jsonrpc, jsown, kenzo, linear-programming, lisp-binary, lisp-critic, lisp-zmq, literate-lisp, lquery, maiden, mcclim, metabang-bind, metering, mito, named-readtables, new-op, nodgui, numcl, ook, origin, papyrus, parachute, parse-float, petalisp, phoe-toolbox, pjlink, plokami, pngload, policy-cond, portable-threads, postmodern, protest, protobuf, prove, qlot, qtools-ui, quilc, qvm, re, reader, regular-type-expression, sanity-clause, scalpl, sealable-metaobjects, sel, serapeum, sha1, shadow, simple-actors, simple-config, sly, spinneret, static-dispatch, stumpwm, sucle, swank-client, swank-crew, teepeedee2, tmpdir, tooter, trivia, trivial-features, trivial-package-local-nicknames, trivial-utilities, uiop, umbra, utils-kt, utm, vernacular, xhtmlambda, xml-emitter.

Removed projects: bodge-nanovg, cl-fixtures, cl-gambol, cl-grace, cl-torrents, cl-transmission, clon, clx-cursor, clx-truetype, dbd-oracle, m2cl, x.fdatatypes, x.let-star.

This month has an unusually high number of removed projects. There are a couple of causes. First, a number of projects have simply disappeared - they are gone from their source locations and there's no sign of a new home. Second, some projects stopped building (sometimes because of missing libraries in the first group, but not always) and weren't fixed in time for this release.

If you depend on one of these removed projects, I can try to help you get in touch with the author to get them back into Quicklisp. Otherwise, you may wish to stick with an older Quicklisp dist where they are still present.

To get this update, use (ql:update-dist "quicklisp").

Joe MarshallStupid pattern matching tricks

· 43 days ago
There are a few pattern matching constructs in Common Lisp. For instance, destructuring-bind matches list structure against a tree of variable names and binds the variables accordingly. Macros can destructure their argument list. Even functions have simple keyword matching. These constructs don't give access to their pattern matchers as first-class objects, but perhaps you want that. You can construct a simple pattern matcher by wrapping one of these constructs in the appropriate macro.

We'll want the result of our pattern match to be an alist mapping symbols to the objects they matched with. First, we'll need a function that takes a pattern and returns a list of the variables in the pattern. flatten will work nicely for destructuring-bind:
(defun flatten (pattern)
  (cond ((null pattern) '())
        ((symbolp pattern) (list pattern))
        ((consp pattern) (append (flatten (car pattern))
                                 (flatten (cdr pattern))))
        (t (error "Not a pattern"))))

CL-USER> (flatten '((a b . c) d e . f))
(A B C D E F)
Then we want to generate code that will make an alist:
CL-USER> `(list ,@(map 'list (lambda (var)
                               `(cons ',var ,var))
                       (flatten '((a b . c) d e . f))))
(LIST (CONS 'A A) (CONS 'B B) (CONS 'C C) (CONS 'D D) (CONS 'E E) (CONS 'F F))
Finally, we wrap a call to destructuring-bind with a macro:
CL-USER> (defmacro destructuring-pattern-matcher (pattern)
           `(lambda (form)
              (destructuring-bind ,pattern form
                (list ,@(map 'list (lambda (var)
                                     `(cons ',var ,var))
                             (flatten pattern))))))
DESTRUCTURING-PATTERN-MATCHER

CL-USER> (destructuring-pattern-matcher ((a b . c) d e . f))
#<FUNCTION (LAMBDA (FORM)) {10027B143B}>
destructuring-pattern-matcher returns a pattern matcher as a first-class procedure we can call on a pattern to get an alist of bindings:
CL-USER> (defvar *matcher* (destructuring-pattern-matcher ((a b . c) d e . f)))
*MATCHER*

CL-USER> (funcall *matcher* '((1 2 3 4) 5 6 7 8))
((A . 1) (B . 2) (C 3 4) (D . 5) (E . 6) (F 7 8))

We can use this trick to get at the destructuring pattern match done by defmacro. First, we need a function that takes a macro lambda list and returns a list of the variables it binds. I won't reproduce the function here, it is too large, but here's a sample call:
CL-USER> (macro-lambda-list-variables 
            '((foo bar &optional (baz 'default baz-supplied-p) . more) quux
              &rest rest
              &key ((:key key-variable) 'key-default key-supplied-p) key2
              &aux (auxvar 'auxvalue)))
(FOO BAR BAZ BAZ-SUPPLIED-P MORE QUUX REST KEY-VARIABLE KEY-SUPPLIED-P KEY2 AUXVAR)
If we were matching the list '(1 e) against the pattern (a b &optional c), we'd want to generate code something like this:
(MACROLET ((MACRO (A B &OPTIONAL C)
             (LIST 'LIST
                   (LIST 'CONS ''A (LIST 'QUOTE A))
                   (LIST 'CONS ''B (LIST 'QUOTE B))
                   (LIST 'CONS ''C (LIST 'QUOTE C)))))
  (MACRO 1 E))
We'll do this in stages:
(defun make-macro-pattern-matcher-body (pattern)
  `(list
    'list
    ,@(map 'list (lambda (var)
                   `(list 'cons '',var `',,var))
           (macro-lambda-list-variables pattern))))

(defun make-macro-pattern-matcher (pattern)
  (let ((body (make-macro-pattern-matcher-body pattern)))
    (lambda (form)
      `(macrolet ((macro ,pattern
                    ,body))
         (macro ,@form)))))

(defmacro macro-pattern-matcher (pattern)
  (let ((matcher  (make-macro-pattern-matcher pattern)))
    `(lambda (form)
       (eval (funcall ',matcher form)))))
Now we can make a pattern matcher that works like the macro destructuring facility:
CL-USER> (setq *matcher*
            (macro-pattern-matcher
              ((foo bar &optional (baz 'default baz-supplied-p) . more) quux
               &rest rest
               &key ((:key key-variable) 'key-default key-supplied-p) key2
               &aux (auxvar 'auxvalue))))
#<FUNCTION (LAMBDA (FORM)) {10027B1D3B}>

CL-USER> (funcall *matcher* '((1 2 3 4) 5 :key 6 :key2 7))
((FOO . 1)
 (BAR . 2)
 (BAZ . 3)
 (BAZ-SUPPLIED-P . T)
 (MORE 4)
 (QUUX . 5)
 (REST :KEY 6 :KEY2 7)
 (KEY-VARIABLE . 6)
 (KEY-SUPPLIED-P . T)
 (KEY2 . 7)
 (AUXVAR . AUXVALUE))
You can do a similar trick with regular lambda lists, but while they have keywords, they don't destructure.
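For instance (a small sketch of my own), keyword arguments are matched just fine, but a list argument stays a single value instead of being destructured:

CL-USER> (funcall (lambda (&rest args)
                    (apply (lambda (a &optional (b 1) &key c)
                             (list (cons 'a a) (cons 'b b) (cons 'c c)))
                           args))
                  '(1 2 3) 20 :c 30)
((A 1 2 3) (B . 20) (C . 30))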

You have to be careful when writing the expansion for the binding alist. Too much quoting and you end up with the names rather than their values in the output:
((foo . foo)
 (bar . bar)
 …etc…)
not enough, you end up with the values of the values in the output:
CL-USER> (defvar e 22)
E

CL-USER> (funcall *matcher* '((1 2 e) 5))
((FOO . 1)
 (BAR . 2)
 (BAZ . 22) ; Wrong! Should be 'E
 …etc…)

Vsevolod DyomkinProgramming Algorithms: Compression

· 43 days ago

Compression is one of the tools that every programmer should understand and wield confidently. Such situations when the size of the dataset is larger than the program can handle directly and it becomes a bottleneck are quite frequent and can be encountered in any domain. There are many forms of compression, yet the most general subdivision is between lossless one which preserves the original information intact and lossy compression which discards some information (assumed to be the most useless part or just noise). Lossless compression is applied to numeric or text data, whole files or directories — the data that will become partially or utterly useless if even a slight modification is made. Lossy compression, as a rule, is applied to data that originates in the "analog world": sound or video recordings, images, etc. We have touched the subject of lossy compression slightly in the previous chapter when talking about such formats as JPEG. In this chapter, we will discuss the lossless variants in more detail. Besides, we'll talk a bit about other, non-compressing, forms of encoding.

Encoding

Let's start with encoding. Lossless compression is, in fact, a form of encoding, but there are other, simpler forms. And it makes sense to understand them before moving to compression. Besides, encoding itself is a fairly common task. It is the mechanism that transforms the data from an internal representation of a particular program into some specific format that can be recognized and processed (decoded) by other programs. What we gain is that the encoded data may be serialized and transferred to other computers and decoded by other programs, possibly, independent of the program that performed the encoding.

Encoding may be applied to different semantic levels of the data. Character encoding operates on the level of individual characters or even bytes, while various serialization formats deal with structured data. There are two principal approaches to serialization: text-based and binary. Their pros and cons are mirror images of each other: text-based formats are easier for humans to handle but are usually more expensive to process, while binary variants are not transparent (and so much harder to deal with) but much faster to process. From the point of view of algorithms, binary formats are, obviously, better. But my programming experience is that reaching for them up front is a severe form of premature optimization. The rule of thumb should be to always start with text-based serialization and move to binary formats only as a last resort, when it has been proven that the impact on the program's performance is significant and important.

Base64

Encoding may have both a reduction and a magnification effect on the size of the data. For instance, there's a popular encoding scheme — Base64. It is a byte-level (lowest level) encoding that doesn't discriminate between different input data representations and formats. No, the encoder just takes a stream of bytes and produces another stream of bytes. Or, more precisely, bytes in the specific range of English ASCII letters, numbers, and three more characters (usually, +, /, and =). This encoding is often used for transferring data on the Web, in conjunction with SMTP (MIME), HTTP, and other popular protocols. The idea behind it is simple: split the data stream into sextets (6-bit parts — there are 64 different variants of those), and map each sextet to an ASCII character according to a fixed dictionary. As the last byte of the original data may not align with the last sextet, an additional padding character (=) is used to indicate 2 (=) or 4 (==) misaligned bits. As we see, Base64 encoding increases the size of the input data by a factor of about 1.33: every 3 input bytes become 4 output bytes.
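As a quick illustration of the regrouping (my own REPL snippet, not from the original text), the three bytes of "Man" hold 24 bits that split into four sextets, which the dictionary below maps to T, W, F, and u:

CL-USER> (let ((bits (+ (ash (char-code #\M) 16)
                        (ash (char-code #\a) 8)
                        (char-code #\n))))
           (loop :for pos :from 18 :downto 0 :by 6
                 :collect (ldb (byte 6 pos) bits)))
(19 22 5 46)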

Here is one of the ways to implement a Base64 serialization routine:

(defparameter *b64-dict*
  (coerce (append (loop :for ch :from (char-code #\A) :to (char-code #\Z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\a) :to (char-code #\z)
                        :collect (code-char ch))
                  (loop :for ch :from (char-code #\0) :to (char-code #\9)
                        :collect (code-char ch))
                  '(#\+ #\/ #\=))
          'simple-vector))

(defun b64-encode (in out)
  (let ((key 0)
        (limit 6))
    (flet ((fill-key (byte off beg limit)
             (:= (ldb (byte limit off) key)
                 (ldb (byte limit beg) byte))
             (:= off (- 6 beg)))
           (emit1 (k)
             (write-byte (char-code (svref *b64-dict* k)) out)))
      (loop :for byte := (read-byte in nil) :while byte :do
        (let ((beg (- 8 limit)))
          (fill-key byte 0 beg limit)
          (emit1 key)
          (fill-key byte (:= limit (- 6 beg)) 0 beg)
          (when (= 6 beg)
            (emit1 key)
            (:= limit 6))))
      (when (< limit 6)
        (:= (ldb (byte limit 0) key)
            (ldb (byte limit 0) 0))
        (emit1 key)
        (loop :repeat (ceiling limit 2) :do
          (emit1 64))))))

This is one of the most low-level pieces of Lisp code in this book. It could be written in a much more high-level manner: utilizing the generic sequence access operations, say, on bit-vectors, instead of the bit-manipulating ones on numbers. However, it would also be orders of magnitude slower due to the need to constantly "repackage" the bits, converting the data from integers to vectors and back. I also wanted to show a bit of bit fiddling in Lisp. The standard, in fact, defines a comprehensive vocabulary of bit-manipulation functions, and there's nothing stopping the programmer from writing performant code operating at the single-bit level.

One important choice made for Base64 encoding is the usage of streams as the input and output. This is a common approach to such problems based on the following considerations:

  • It is quite easy to wrap the code so that we could feed/extract strings as inputs and outputs. Doing the opposite, and wrapping string-based code for stream operation, is also possible, but it defeats the whole purpose of streams, which is...
  • Streams allow us to efficiently handle data of any size and not waste memory, as well as CPU, for storing intermediary copies of the strings we're processing. Encoding a huge file is a good illustration of why this matters: with streams, we do it in an obvious manner: (with-open-file (in ...) (with-open-file (out ...) (b64-encode in out))). With strings, however, it will mean, first, reading the file contents into memory — and we may not even have enough memory for that. And, after that, filling another big chunk of memory with the encoded data. Which we'll still, probably, need to either dump to a file or send over the network.

So, what happens in the code above? First, the bytes are read from the binary input stream in, then each one is slashed into 2 parts. The higher bits are set into the current base64 key, which is translated, using *b64-dict*, into an appropriate byte and emitted to the binary output stream out. The lower bits are deposited in the higher bits of the next key in order to use this leftover during the processing of the next byte. However, if the leftover from the previous byte was 4 bits, at the current iteration, we will have 2 base64 bytes available, as the first will use 2 bits from the incoming byte and the second will consume the remaining 6 bits. This is addressed in the code block (when (= 6 beg) ...). The function relies on the standard Lisp ldb operation which provides access to the individual bits of an integer. It uses the byte-spec (byte limit offset) to control the bits it wants to obtain.

Implementing a decoder procedure is left as an exercise to the reader...
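Still, to give an idea of its shape, here is one possible decoder (a sketch of mine, not the book's solution): it accumulates 6 bits per input character and emits a byte whenever at least 8 bits are available, skipping the padding.

(defun b64-decode (in out)
  (let ((acc 0)    ; bit accumulator
        (bits 0))  ; number of valid bits currently in ACC
    (loop :for byte := (read-byte in nil) :while byte :do
      (let ((k (position (code-char byte) *b64-dict*)))
        (when (and k (< k 64))  ; skip the padding character
          (:= acc (+ (ash acc 6) k)
              bits (+ bits 6))
          (when (>= bits 8)
            (:= bits (- bits 8))
            (write-byte (ldb (byte 8 bits) acc) out)
            (:= acc (ldb (byte bits 0) acc))))))))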

Taking the example from the Wikipedia article, we can see our encoding routine in action (here, we also rely on the FLEXI-STREAMS library to work with binary in-memory streams):

CL-USER> (with-input-from-string (str "Man i")
           (let ((in (flex:make-in-memory-input-stream
                      (map 'vector 'char-code
                           (loop :for ch := (read-char str nil) :while ch :collect ch))))
                 (out (flex:make-in-memory-output-stream)))
             (b64-encode in out)
             (map 'string 'code-char (? out 'vector))))
"TWFuIGk="

This function, although it's not big, is quite hard to debug due to the need for careful tracking and updating of the offsets into both the current base64 chunk (key) and the byte being processed. What really helps me tackle such situations is a piece of paper that serves for recording several iterations with all the relevant state changes. Something along these lines:

         M (77)   |    a (97)     |    n (110)
   0 1 0 0 1 1 0 1|0 1 1 0 0 0 0 1|0 1 1 0 1 1 1 0
0: 0 1 0 0 1 1                                        19 = T
               0 1|
1:             0 1|0 1 1 0                            22 = W
                  |        0 0 0 1|
2:                |        0 0 0 1|0 1                 5 = F

Iteration 0:

beg: 2
off: 0
limit: 6

beg: 0
off: (- 6 2) = 4
limit: 2


Iteration 1:

beg: 4
off: 0
limit: 4

beg: 0
off: (- 6 4) = 2
limit: 4

Another thing that is indispensable, when coding such procedures, is the availability of reference examples of the expected result, like the ones in Wikipedia. The Lisp REPL makes iterating on a solution and constantly rechecking the results, using such available data, very easy. However, sometimes, it makes sense to reject the transient nature of code in the REPL and record some of the test cases as unit tests. As the motto of my test library SHOULD-TEST declares: you should test even Lisp code sometimes :) The tests also help the programmer to remember and systematically address the various corner cases. In this example, one of the special cases is the padding at the end, which is handled in the code block (when (< limit 6) ...). Due to the availability of a clear spec and reference examples, this algorithm lends itself very well to automated testing. As a general rule, all code paths should be covered by the tests. If I were to write those tests, I'd start with the following simple version. They address all 3 variants of padding and also the corner case of an empty string.

(deftest b64-encode ()
  ;; B64STR would be the function wrapped over the REPL code presented above
  (should be blankp (b64str ""))
  (should be string= "TWFu" (b64str "Man"))
  (should be string= "TWFuIA==" (b64str "Man "))
  (should be string= "TWFuIGk=" (b64str "Man i")))

Surely, many more tests should be added to a production-level implementation: to validate operation on non-ASCII characters, handling of huge data, etc.

Lossless Compression

The idea behind lossless compression is straightforward: find an encoding that is tailored to our particular dataset and allows the encoding procedure to produce a shorter version than using a standard encoding. Not being general-purpose, the vocabulary for this encoding may use a more compact representation for those things that occur often, and a longer one for those that appear rarely, skipping altogether those that don't appear at all. Such an encoding scheme will be, probably, structure-agnostic and just convert sequences of bytes into other sequences of a smaller size, although custom structure-aware compression is also possible.

This approach can be explained with a simple example. The phrase "this is a test" uses 8-bit characters (extended ASCII) to represent each letter, and there are 256 different 8-bit characters in total. However, for this particular message, only 7 characters are used: t, h, i, s, the space, a, and e. 7 characters, in theory, need only about 2.81 bits each to be distinguished. Encoding them in just 3 bits instead of 8 will reduce the size of the message almost thrice. In other words, we could create the following vocabulary (where #*000 is a Lisp literal representation of a zero bit-vector of 3 bits):

#h(#\t #*000
   #\h #*001
   #\i #*010
   #\s #*011
   #\a #*100
   #\e #*101
   #\Space #*110)

Using this vocabulary, our message could be encoded as the following bit-vector: #*000001010011110010011110100110000101011000. The downside, compared to using some standard encoding, is that we now need to package the vocabulary alongside the message, which will make its total size larger than the original that used an 8-bit standard encoding with a known vocabulary. It's clear, though, that, as the message becomes longer, the fixed overhead of the vocabulary will quickly be exceeded by the gain from message size reduction. We do have to account for the fact that the vocabulary may also continue to grow and require more and more bits to represent each entry (for instance, if we use all Latin letters and numbers it will soon reach 6 or 7 bits, and our gains will diminish as well). Still, the difference may be pre-calculated and the decision made for each message or a batch of messages. For instance, in this case, the vocabulary size may be, say, 30 bytes, and the message size reduction is 62.5%, so a message of 50 or more characters will already be more compact if encoded with this vocabulary even when the vocabulary itself is sent with it. The case of only 7 characters is pretty artificial, but consider that DNA strings have only 4 characters.

However, this simplistic approach is just the beginning. Once again, if we use an example of the Latin alphabet, some letters, like q or x may end up used much less frequently, than, say, p or a. Our encoding scheme uses equal length vectors to represent them all. Yet, if we were to use shorter representations for more frequently used chars at the expense of longer ones for the characters occurring less often, additional compression could be gained. That's exactly the idea behind Huffman coding.

Huffman Coding

Huffman coding tailors an optimal "alphabet" for each message, sorting all letters based on their frequency and putting them in a binary tree, in which the most frequent ones are closer to the top and the less frequent ones — to the bottom. This tree allows calculating a unique encoding for each letter based on a sequence of left or right branches that need to be taken to reach it, from the top. The key trick of the algorithm is the usage of a heap to maintain the characters (both individual and groups of already processed ones) in sorted order. It builds the tree bottom-up by first extracting two least frequent letters and combining them: the least frequent on the left, the more frequent — on the right. Let's consider our test message. In it, the letters are sorted by frequency in the following order:

((#\a 1) (#\e 1) (#\h 1) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) 

Extracting the first two letters results in the following treelet:

      ((#\a #\e) 2)
       /         \
  (#\a 1)     (#\e 1)

Uniting the two letters creates a tree node with a total frequency of 2. To use this information further, we add it back to the queue in place of the original letters, and it continues to represent them, during the next steps of the algorithm:

((#\h 1) ((#\a #\e) 2) (#\i 2) (#\s 3) (#\t 3) (#\Space 3)) 

By continuing this process, we'll come to the following end result:

((#\s #\t #\Space #\i #\h #\a #\e) 14)
        /                \
((#\s #\t) 6)   ((#\Space #\i #\h #\a #\e) 8)
  /      \         /              \
(#\s 3) (#\t 3) (#\Space 3)  ((#\i #\h #\a #\e) 5)
                               /         \
                           (#\i 2)  ((#\h #\a #\e) 3)
                                      /        \
                                  (#\h 1)  ((#\a #\e) 2)
                                             /      \
                                         (#\a 1)  (#\e 1)

From this tree, we can construct the optimal encoding:

#h(#\s #*00
#\t #*01
#\Space #*10
#\i #*110
#\h #*1110
#\a #*11110
#\e #*11111)

Compared to the simple approach that used a constant 3 bits per character, it takes 1 bit less for the 3 most frequent letters, the same 3 bits for i, 1 bit more for h, and 2 bits more for the two least frequent ones. The encoded message becomes: #*01111011000101100010111101001111110001, and it has a length of 38 compared to 42 for our previous attempt.

To be clear, here are the encoding and decoding methods that use the pre-built vocabulary (for simplicity's sake, they operate on vectors and strings instead of streams):

(defun huffman-encode (envocab str)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dovec (char str)
      (dovec (bit (? envocab char))
        (vector-push-extend bit rez)))
    rez))

(defun huffman-decode (devocab vec)
  (let (rez)
    (dotimes (i (length vec))
      (dotimes (j (- (length vec) i))
        (when-it (? devocab (slice vec i (+ i j 1)))
          (push it rez)
          (:+ i j)
          (return))))
    (coerce (reverse rez) 'string)))

It is worth noting that vector-push-extend is implemented in a way that does not adjust the array by just 1 bit each time it is called. An efficient implementation "does the right thing", for whatever the right thing means in this particular case (maybe, adjusting by at least 1 machine word at a time). You can examine the situation in more detail by extending the array by hand (using adjust-array or providing the third optional argument to vector-push-extend) and comparing the time taken by the different variants.
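Such an experiment might look as follows (a sketch; the actual growth behavior and the timings are implementation-dependent, and the third argument is only a requested minimum extension):

(defun push-bits (n &optional extension)
  "Push N zero bits into a fresh adjustable bit-vector, optionally asking
   for a minimal EXTENSION each time the vector has to grow."
  (let ((vec (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dotimes (i n vec)
      (if extension
          (vector-push-extend 0 vec extension)
          (vector-push-extend 0 vec)))))

;; (time (push-bits 1000000))    ; default growth strategy
;; (time (push-bits 1000000 1))  ; request the smallest possible extension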

Finally, here is the most involved part of the Huffman algorithm, which builds the encoding and decoding vocabularies (with the help of a heap implementation we developed in the chapter on Trees):

(defun huffman-vocabs (str)
  ;; here we assume more than a single unique character in STR
  (let ((counts #h())
        (q (make-heap :op '< :key 'rt))
        (envocab #h())
        (devocab #h(equal)))  ; bit-vectors as keys require 'equal comparison
    ;; count character frequencies
    (dovec (char str)
      (:+ (get# char counts 0)))  ; here, we use the default third argument of gethash set to 0
    ;; heapsort the characters based on their frequency
    (dotable (char count counts)
      (heap-push (pair char count) q))
    ;; build the tree
    (dotimes (i (1- (heap-size q)))
      (with (((lt cl) (heap-pop q))
             ((rt cr) (heap-pop q)))
        (heap-push (pair (list lt rt) (+ cl cr))
                   q)))
    ;; traverse the tree in DFS manner
    ;; encoding the path to each leaf node as a bit-vector
    (labels ((dfs (node &optional (level 0) path)
               (if (listp node)
                   (progn
                     (dfs (lt node) (1+ level) (cons 0 path))
                     (dfs (rt node) (1+ level) (cons 1 path)))
                   (let ((vec (make-array level :element-type 'bit
                                          :initial-contents (reverse path))))
                     (:= (? envocab node) vec
                         (? devocab vec) node)))))
      (dfs (lt (heap-pop q))))
    (list envocab devocab)))
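As a quick sanity check, the whole pipeline can be exercised on the test phrase (a usage sketch, not from the chapter):

(destructuring-bind (envocab devocab) (huffman-vocabs "this is a test")
  (huffman-decode devocab
                  (huffman-encode envocab "this is a test")))
;; => should give back "this is a test"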

Huffman Coding in Action

Compression is one of the areas where it is especially interesting to directly compare the measured gain in space usage to the theoretically expected one. Yet, as we discussed in one of the previous chapters, such measurements are not as straightforward as execution speed measurements. If we compress a single sequence of bytes into another one, nothing is simpler than comparing their lengths, but, in many tasks, we want to see the cumulative effect of applying compression to a complex data structure. This is what we're going to do next.

Consider a problem I faced in my work on the text language identification tool WIKI-LANG-DETECT. This software relies on a number of big dictionaries that map strings (character trigrams and individual words) to floats. The obvious approach to storing these maps is a hash-table. However, due to the huge number of keys, such a table will generally have a sizeable overhead, which we would like to avoid. Besides, the keys are strings, so they have good potential for a reduction in occupied size when compressed. The data is also serialized into per-language files in the tab-separated format. This is a sample of the word-level file for the Danish language:

afrika    -8.866735
i -2.9428265
the -6.3879676
ngo -11.449115
of -6.971129
kanye -12.925021
e -8.365895
natal -12.171249

Our task is to load the data into memory so that key access takes constant time while the data occupies minimal space.

Let's begin with a simple hash-table-based approach. The following function will load two files from the default directory (*default-pathname-defaults*) and return a list of two hash-tables: for the word and trigram probabilities.

(defun load-data-into-hts (lang)
  (declare (optimize sb-c::instrument-consing))
  (mapcar (lambda (kind)
            (let ((rez (make-hash-table :test 'equal)))
              (dolines (line (fmt "~A-~A.csv" lang kind))
                (let ((space (position #\Space line)))
                  (set# (slice line 0 space) rez
                        (read-from-string (slice line (1+ space))))))
              rez))
          '("words" "3gs")))

To measure the space it will take, we'll use a new SBCL extension called allocation profiling from the sb-aprof package[1]. To enable the measurement, we have put a special declaration immediately after the defun header: (optimize sb-c::instrument-consing).

Now, prior to running the code, let's look at the output of room:

CL-USER> (room)
Dynamic space usage is: 60,365,216 bytes.
...

This is a freshly loaded image, so space usage is minimal. Usually, before proceeding with the experiment, I invoke garbage collection to ensure that we don't have some leftover data from the previous runs that may overlap with the current one. In SBCL, you run it with (sb-ext:gc :full t).

Now, let's load the files for the German language (the biggest ones) under aprof. The data can be obtained from the github repository of the project. The total size of 2 German-language files on disk (words and trigrams dictionaries) is around 4 MB.

CL-USER> (sb-aprof:aprof-run
(lambda () (defparameter *de* (load-data-into-hts "DE"))))
227 (of 50000 max) profile entries consumed

% Bytes Count Function
------- ----------- --------- --------
24.2 34773600 434670 SB-KERNEL:%MAKE-ARRAY - #:|unknown|
19.4 27818880 217335 SB-IMPL::%MAKE-STRING-INPUT-STREAM - SB-IMPL::STRING-INPUT-STREAM
19.4 27818880 434670 SLICE - LIST

17.3 24775088 SB-IMPL::HASH-TABLE-NEW-VECTORS
54.0 13369744 52 SIMPLE-VECTOR
46.0 11405344 156 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

14.9 21406176 SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER
99.4 21280192 225209 (SIMPLE-ARRAY CHARACTER (*))
0.6 125984 7874 LIST

4.8 6957184 217412 SB-KERNEL::INTEGER-/-INTEGER - RATIO

00.0 14160 SB-IMPL::%MAKE-PATHNAME
91.8 12992 812 LIST
8.2 1168 1 SIMPLE-VECTOR

00.0 4160 2 SB-IMPL::SET-FD-STREAM-ROUTINES - (SIMPLE-ARRAY CHARACTER (*))

00.0 3712 SB-IMPL::%MAKE-DEFAULT-STRING-OSTREAM
62.1 2304 8 (SIMPLE-ARRAY CHARACTER (*))
37.9 1408 8 SB-IMPL::CHARACTER-STRING-OSTREAM

00.0 1024 MAKE-HASH-TABLE
53.1 544 2 SIMPLE-VECTOR
46.9 480 6 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

00.0 832 SB-IMPL::%MAKE-FD-STREAM
73.1 608 2 SB-SYS:FD-STREAM
19.2 160 2 SB-VM::ARRAY-HEADER
7.7 64 2 (SIMPLE-ARRAY CHARACTER (*))

00.0 576 GET-OUTPUT-STREAM-STRING
55.6 320 8 SIMPLE-BASE-STRING
44.4 256 8 SB-KERNEL:CLOSURE

00.0 400 SB-KERNEL:VECTOR-SUBSEQ*
60.0 240 6 (SIMPLE-ARRAY CHARACTER (*))
40.0 160 5 SIMPLE-BASE-STRING

00.0 400 5 SB-IMPL::%%MAKE-PATHNAME - PATHNAME
00.0 384 2 SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE
00.0 288 4 SB-KERNEL:%CONCATENATE-TO-STRING - (SIMPLE-ARRAY CHARACTER (*))
00.0 192 12 SB-IMPL::UNPARSE-NATIVE-PHYSICAL-FILE - LIST
00.0 176 2 SB-IMPL::READ-FROM-C-STRING/UTF-8 - (SIMPLE-ARRAY CHARACTER (*))
00.0 128 4 SB-ALIEN-INTERNALS:%SAP-ALIEN - SB-ALIEN-INTERNALS:ALIEN-VALUE

00.0 96 SB-IMPL::QUERY-FILE-SYSTEM
66.7 64 2 SB-KERNEL:CLOSURE
33.3 32 2 SB-VM::VALUE-CELL

======= ===========
100.0 143576336

The profiling report is pretty cryptic at first sight and requires some knowledge of SBCL internals to understand. It contains all the allocations performed during the test run, so we should keep in mind that some of the used memory is garbage that will be collected at the next gc. We can confirm that by looking at the room output:

CL-USER> (room)
Dynamic space usage is: 209,222,464 bytes.
CL-USER> (sb-ext:gc :full t)
NIL
CL-USER> (room)
Dynamic space usage is: 107,199,296 bytes.

Let's study the report in detail. Around 47 MB were, in fact, used for the newly created data structures, which is more than 10 times what was needed to store the data on disk. Well, efficient access requires sacrificing a lot of space. From the report, we can make an educated guess where these 47 MB originate: 24.7 MB was used for the hash-table structures themselves (SB-IMPL::HASH-TABLE-NEW-VECTORS) and 21.4 MB for the keys (SB-IMPL::ANSI-STREAM-READ-LINE-FROM-FRC-BUFFER), plus a small amount of bookkeeping information. We can also infer that producing the floating-point values accounted for around 7 MB of allocation (SB-KERNEL::INTEGER-/-INTEGER - RATIO), but it seems like the values themselves were put inside the hash-table arrays without any indirection. To check the 7 MB figure, we can calculate the total number of keys in the hash-tables, which amounts to 216993, and multiply it by the 32 bytes reported per ratio allocation. Also, the first 3 lines, which in total accrued around 90 MB, or almost 2/3 of the memory used, are all related to reading and processing the data; this space was freed during gc.

So, this report, although it is not straightforward to understand, gives a lot of insight into how space is used during the run of the algorithm. And the ability to specify what to track on a per-code block basis makes it even more useful.

From the obtained breakdown, we can see the optimization potential of the current solution:

  • the use of a more space-efficient data structure instead of a hash-table might save us up to 17 MB of space (7 MB of float values will remain intact)
  • and another 20 MB may be saved if we compress the keys

Let's try the second option as it is exactly the focus of this chapter. We'll use the created hash-tables to make new ones with Huffman-encoded keys. Here are the contents of the word probabilities table:

CL-USER> (print-ht (first *de*))
#{EQUAL
"afrika" -9.825206
"i" -7.89809
"the" -7.0929685
"ngo" -12.696277
"noma" -14.284437
"of" -6.82038
"kanye" -14.233144
"e" -7.7334323
"natal" -11.476304
"c" -8.715089
...
}

And here is the function that will transform the tables:

(defun huffman-tables (hts envocab)
  (declare (optimize sb-c::instrument-consing))
  (mapcar (lambda (ht)
            (let ((rez #h(equal)))
              (dotable (str logprob ht)
                (:= (? rez (huffman-encode envocab str)) logprob))
              rez))
          hts))

;; the Huffman encoding vocabulary *DE-VOCAB* should be built
;; from all the keys of *DE* tables separately
CL-USER> (sb-aprof:aprof-run
(lambda () (defparameter *de2* (huffman-tables *de* *de-vocab*))))
1294 (of 50000 max) profile entries consumed
% Bytes Count Function
------- ----------- --------- --------
42.5 44047104 1376461 SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG - ARRAY

23.9 24775088 SB-IMPL::HASH-TABLE-NEW-VECTORS
54.0 13369744 52 SIMPLE-VECTOR
46.0 11405344 156 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

20.1 20864160 HUFFMAN-ENCODE
83.3 17386800 217335 SB-VM::ARRAY-HEADER
16.7 3477360 217335 SIMPLE-BIT-VECTOR

6.7 6955072 217335 SB-KERNEL:VECTOR-SUBSEQ* - SIMPLE-BIT-VECTOR
3.4 3477360 217335 (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF :AROUND (T T)) - LIST
3.4 3477360 217335 (SB-PCL::FAST-METHOD RUTILS.GENERIC::GENERIC-SETF (HASH-TABLE T)) - LIST
00.0 2464 77 SB-KERNEL::INTEGER-/-INTEGER - RATIO

00.0 1024 MAKE-HASH-TABLE
53.1 544 2 SIMPLE-VECTOR
46.9 480 6 (SIMPLE-ARRAY (UNSIGNED-BYTE 32) (*))

00.0 384 2 SB-IMPL::%MAKE-HASH-TABLE - HASH-TABLE

00.0 96 SB-C::%PROCLAIM
66.7 64 2 LIST
33.3 32 1 SB-KERNEL:CLOSURE

00.0 96 2 SB-INT:SET-INFO-VALUE - SIMPLE-VECTOR
00.0 64 2 SB-THREAD:MAKE-MUTEX - SB-THREAD:MUTEX
00.0 32 1 SB-IMPL::%COMPILER-DEFVAR - LIST
00.0 32 2 HUFFMAN-TABLES - LIST
00.0 16 1 SB-KERNEL:ASSERT-SYMBOL-HOME-PACKAGE-UNLOCKED - LIST
======= ===========
100.0 103600352
CL-USER> (sb-ext:gc :full t)
NIL
CL-USER> (room)
Dynamic space usage is: 139,922,208 bytes.

So, we have claimed 32 MB of additional space (instead of 47), and some of it seems to be used by other unrelated data (some functions I had redefined in the REPL during the experiment, etc.), as the compressed keys account for only 3.5 MB:

3477360     217335        SIMPLE-BIT-VECTOR 

That is more than a 5-fold reduction of the keys, or almost 40% compression of the whole data structure!

And what about performance? Huffman compression will be needed at every data access, so let's measure the time it will take for vanilla string keys and the bit-vector ones. We will use another file from the wiki-lang-detect repository for the smoke test — a snippet from Faust:

CL-USER> (defparameter *de-words*
(let ((words (list))
(dict (first *de*)))
(dolines (line "~/prj/lisp/wiki-lang-detect/data/smoke/de.txt")
(dolist (word (split #\Space line))
(push word words)))
words))
CL-USER> (length *de-words*)
562

CL-USER> (let ((vocab (first *de*)))
(time (loop :repeat 1000 :do
(dolist (word *de-words*)
(get# word vocab)))))
Evaluation took:
0.045 seconds of real time

CL-USER> (let ((vocab (first *de2*)))
(time (loop :repeat 1000 :do
(dolist (word *de-words*)
(get# (huffman-encode *de-vocab* word) vocab)))))
Evaluation took:
0.341 seconds of real time

Hmm, with Huffman coding, it's almost 10x slower :( Is there a way to speed it up somewhat? To answer that, we can utilize another profiler, this time a more conventional one that measures the time spent in each operation. SBCL provides access to two such profilers: a precise one and a statistical one. The statistical one doesn't seriously interfere with the flow of the program, as it uses sampling to capture the profiling data, and it is the one preferred by the developers. To use it, we need to perform (require 'sb-sprof) and then run the computation with profiling enabled (the lengthy output is redacted to show only the most important parts):

CL-USER> (let ((dict (first *de2*)))
(sb-sprof:with-profiling (:report :graph)
(loop :repeat 100 :do
(dolist (word *de-words*)
(get# (huffman-encode *de-vocab* word) dict)))))
Number of samples: 34
Sample interval: 0.01 seconds
Total sampling time: 0.34 seconds
Number of cycles: 0
Sampled threads:
#<SB-THREAD:THREAD "repl-thread" RUNNING {100FB19BC3}>

Callers
Total. Function
Count % Count % Callees
------------------------------------------------------------------------
24 70.6 "Unknown component: #x52CD6390" [41]
5 14.7 24 70.6 HUFFMAN-ENCODE [1]
1 2.9 SB-IMPL::GETHASH/EQL [17]
1 2.9 SB-IMPL::GETHASH3 [6]
1 2.9 LENGTH [14]
1 2.9 SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS [13]
2 5.9 (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]
13 38.2 VECTOR-PUSH-EXTEND [11]
------------------------------------------------------------------------
4 11.8 SB-VM::EXTEND-VECTOR [4]
4 11.8 4 11.8 SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]
------------------------------------------------------------------------
6 17.6 "Unknown component: #x52CD6390" [41]
3 8.8 6 17.6 SB-IMPL::GETHASH/EQUAL [3]
1 2.9 SXHASH [42]
2 5.9 SB-INT:BIT-VECTOR-= [10]
------------------------------------------------------------------------
8 23.5 VECTOR-PUSH-EXTEND [11]
2 5.9 8 23.5 SB-VM::EXTEND-VECTOR [4]
2 5.9 SB-VM::COPY-VECTOR-DATA [9]
4 11.8 SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG [2]
------------------------------------------------------------------------
2 5.9 HUFFMAN-ENCODE [1]
2 5.9 2 5.9 (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT) [5]
------------------------------------------------------------------------
...

Self Total Cumul
Nr Count % Count % Count % Calls Function
------------------------------------------------------------------------
1 5 14.7 24 70.6 5 14.7 - HUFFMAN-ENCODE
2 4 11.8 4 11.8 9 26.5 - SB-VM::ALLOCATE-VECTOR-WITH-WIDETAG
3 3 8.8 6 17.6 12 35.3 - SB-IMPL::GETHASH/EQUAL
4 2 5.9 8 23.5 14 41.2 - SB-VM::EXTEND-VECTOR
5 2 5.9 2 5.9 16 47.1 - (SB-VM::OPTIMIZED-DATA-VECTOR-REF BIT)
6 2 5.9 2 5.9 18 52.9 - SB-IMPL::GETHASH3
7 2 5.9 2 5.9 20 58.8 - GETHASH
8 2 5.9 2 5.9 22 64.7 - (SB-VM::OPTIMIZED-DATA-VECTOR-SET BIT)
9 2 5.9 2 5.9 24 70.6 - SB-VM::COPY-VECTOR-DATA
10 2 5.9 2 5.9 26 76.5 - SB-INT:BIT-VECTOR-=
11 1 2.9 13 38.2 27 79.4 - VECTOR-PUSH-EXTEND
12 1 2.9 1 2.9 28 82.4 - SB-VM::SLOW-HAIRY-DATA-VECTOR-SET
13 1 2.9 1 2.9 29 85.3 - SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
14 1 2.9 1 2.9 30 88.2 - LENGTH
15 1 2.9 1 2.9 31 91.2 - SB-KERNEL:HAIRY-DATA-VECTOR-SET
16 1 2.9 1 2.9 32 94.1 - SB-KERNEL:VECTOR-SUBSEQ*
17 1 2.9 1 2.9 33 97.1 - SB-IMPL::GETHASH/EQL
...

Unsurprisingly, most of the time is spent in huffman-encode, and within it the biggest chunks are vector-push-extend and the hash-table access (to get the Huffman code of a letter). Surely, instead of extending the vector at each iteration, it would be much nicer to just perform a bulk copy of the bits for each character directly into the vector. Let's try that and see the difference:

(defun huffman-encode2 (envocab str)
  (let ((vecs (map 'vector (lambda (ch) (get# ch envocab))
                   str))
        (total-size 0))
    (dovec (vec vecs)
      (:+ total-size (length vec)))
    (let ((rez (make-array total-size :element-type 'bit))
          (i 0))
      (dovec (vec vecs)
        (let ((size (length vec)))
          (:= (subseq rez i) vec)
          (:+ i size)))
      rez)))

CL-USER> (let ((vocab (first *de2*)))
(time (loop :repeat 1000 :do
(dolist (word *de-words*)
(get# (huffman-encode2 *de-vocab* word) vocab)))))
Evaluation took:
0.327 seconds of real time

Almost no difference. Well, it's a usual case with these micro-optimizations: you have a brilliant idea, try it under the profiler — and, bah, no difference... This doesn't have to stop us, though. Another idea could be to use a jump-table instead of a hash-table to store character-vector mappings. There are only around 500 characters that have a mapping in my data, although they span the whole Unicode range:

CL-USER> (reduce 'max (mapcar 'char-code (keys *de-vocab*)))
65533
CL-USER> (defparameter *jvocab* (make-array (1+ 65533)
                                            :element-type 'bit-vector
                                            :initial-element #*))
CL-USER> (dotable (k v *de-vocab*)
           (:= (? *jvocab* (char-code k)) v))

(defun huffman-encode3 (envocab str)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (dovec (char str)
      ;; here, we have changed the hash-table to a jump-table
      (dovec (bit (svref envocab (char-code char)))
        (vector-push-extend bit rez)))
    rez))

CL-USER> (let ((vocab (first *de2*)))
           (time (loop :repeat 1000 :do
             (dolist (word *de-words*)
               (get# (huffman-encode3 *jvocab* word) vocab)))))
Evaluation took:
0.308 seconds of real time

OK, we get an improvement of around 10%[2]. That's a start. But many more ideas and experiments are needed if we want to significantly optimize this implementation. Yet, for the sake of space conservation on the pages of this book, we won't continue with it.

Another tool we could use to analyze the performance and think about further improvements is flamegraphs, a way to visualize profiler output. [CL-FLAMEGRAPH](https://github.com/40ants/cl-flamegraph) is a wrapper around `sb-sprof` that generates output in the common format, which can be further processed by the Perl tool to generate the image itself. The basic output I got was rather rough and probably requires some fiddling with the Perl tool to obtain a prettier image.

To conclude, key compression alone gives a sizeable reduction in used space at the cost of deteriorated performance.

Another possible angle of attack is to move from a hash-table to a more space-efficient structure. We have explored this direction somewhat in the chapter on hash-tables already.

Arithmetic Coding

Why does Huffman coding work? The answer lies in Shannon's Source Coding Theorem and has to do with the notion of entropy. Entropy is one of the ways to quantify expectation and surprise in a message. The most random message has the maximal surprise, i.e. it's very hard to predict what symbol will appear at a certain position in it, while the least random one (for instance, containing only repetitions of a single char) is the least surprising. Obviously, any kind of useful data is not uniformly distributed, otherwise it would be indistinguishable from white noise. Most data representations use an "alphabet" (encoding) that is redundant for a particular message. Why? Because it is general-purpose and should allow expressing arbitrary messages. Yet, in practice, some passages appear much more often than others, some words are more frequent, some letters are, and even some patterns in images may be.
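To put a number on it, the entropy of our test phrase can be computed directly (a quick sketch in plain Common Lisp, not part of the chapter's code):

(defun message-entropy (str)
  "Shannon entropy of STR in bits per character."
  (let ((counts (make-hash-table))
        (n (length str)))
    (loop :for char :across str :do (incf (gethash char counts 0)))
    (loop :for count :being :the :hash-values :of counts
          :sum (let ((p (/ count n)))
                 (- (* p (log p 2)))))))

;; (message-entropy "this is a test") => roughly 2.65 bits per character,
;; while the Huffman code above spends 38/14 ~ 2.71 bits per character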

The idea of character-level compression algorithms is to tailor a custom vocabulary that uses fewer bits for low-entropy (frequent) characters and more bits for high-entropy ones. In general, the probability distribution of characters may be thought of as a [0,1) interval, in which each char occupies a slice proportionate to its frequency. If we rely on standard encoding, the interval for our test example will look like this:

|---+---+---+------+---------+---------+---------|
0 e   a   h    i        s         t       Space  1

Here, each subinterval for a character is its probability times the number of bits per character (8 for each). Huffman coding tries to equalize this distribution by assigning fewer bits to characters that occupy larger space. For the Huffman vocabulary we have constructed, the distribution will look like this:

|-----+-----+----+------+------+-------+-------|
0  e     a    h     i      s       t     Space 1

As you can see, it has become more even, but still not totally. This is due to the discrete nature of the encoding, which results in rounding the number of bits to the closest integer value. There's another approach to solving the same problem that aims at reducing this rounding error even further: arithmetic coding. It acts directly on our interval and encodes the whole message as a single number that represents a point in this interval. How is this point found and used? Let's consider a message with a single character i. In our example, the subinterval for it is [0.214285714, 0.357142857). So, if we use any number from this interval and know that the message contains a single character, we can unambiguously decode it back. Ideally, we'd use the number from the interval that has the fewest digits. Here is a simple example of how such a number can be found:

(defun find-shortest-bitvec (lo hi)
  (let ((rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (loop
      (with ((lod lof (floor (* lo 2)))
             (hid hif (floor (* hi 2))))
        (when (or (zerop lof)
                  (zerop hif)
                  (/= lod hid))
          (vector-push-extend hid rez)
          (return))
        (vector-push-extend lod rez)
        (:= lo lof
            hi hif)))
    rez))

CL-USER> (find-shortest-bitvec 0.214285714 0.357142857)
#*01

The result is a bit-vector that represents the fractional part of some floating-point number lying within the interval, which may also be used as an encoding of our one-character message. Obviously, we could use just a single bit to encode it with a custom vocabulary of one entry, but, here, for the purpose of illustration, I wanted to use an existing pre-calculated vocabulary that includes other characters as well. Also, if we compare this version with the Huffman coding, the message length is decreased by 1 bit.
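To double-check that the bits really denote a point inside the interval, they can be converted back into a fraction (a small sketch in plain Common Lisp):

(defun bitvec->fraction (bits)
  "Interpret BITS as the binary digits after the point and return the fraction."
  (loop :for bit :across bits
        :for weight := 1/2 :then (/ weight 2)
        :sum (* bit weight)))

;; (bitvec->fraction #*01) => 1/4, which indeed falls into
;; [0.214285714, 0.357142857), the subinterval of the character i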

Now, how can we process longer messages? In the same manner: by recursively dividing the currently selected part using the same original distribution. For the message "is":

  • on step 1 (for character i), the interval [0.214285714, 0.357142857) will be selected
  • on step 2 (for character s), we'll narrow it down to [0.26530612, 0.29591838) (using the subinterval [0.357142857, 0.5714286) for s)

For this interval, the shortest encoding will be 01001. In this case, it has the same size as the Huffman one.
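The narrowing is easy to reproduce by hand (a small sketch using the find-shortest-bitvec function defined above):

(let* ((lo 3/14) (hi 5/14)                  ; the interval of i
       (range (- hi lo))
       (lo2 (float (+ lo (* range 5/14))))  ; narrow it by s's subinterval
       (hi2 (float (+ lo (* range 8/14)))))
  (find-shortest-bitvec lo2 hi2))
;; => #*01001, the encoding mentioned above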

So, the naive arithmetic encoding implementation is quite simple:

(defun arithm-encode (envocab message)
  (let ((lo 0.0)
        (hi 1.0))
    (dovec (char message)
      (let ((coef (- hi lo)))
        (dotable (ch prob envocab)
          (let ((off (* prob coef)))
            (when (eql char ch)
              (:= hi (+ lo off))
              (return))
            (:+ lo off)))))
    (find-shortest-bitvec lo hi)))

CL-USER> (arithm-encode #h(#\e 1/14
#\a 1/14
#\h 1/14
#\i 2/14
#\s 3/14
#\t 3/14
#\Space 3/14)
"this is a test")
#*100110110100001110000001

However, this function has a hidden bug. The problem lies in the limited precision of floating-point numbers: it is exhausted quite soon in the process of narrowing the interval, as more and more digits are needed, until all the available bits are used up and we can't distinguish the intervals any further. If we try to faithfully decode even the short message encoded above, we'll already see this effect by getting the output "this ist sssst".

The implementation of this approach that works around the bug relies on the same idea but uses a clever bit arithmetic trick. Due to that, it becomes less clean and obvious, because it has to work not with the whole number but with a bounded window into that number (in this case, a 32-bit one) and also take care of the potential overflow that may happen when the range collapses around 0.5. It is shown here, for illustration purposes, without a detailed explanation[3]. This function is another showcase of the Lisp standard's support for handling bit-level values. Besides, read-eval (#.) is used here to provide literal values of the bitmasks.

(defun arithm-encode-correct (envocab message)
  (let ((lo 0)
        (hi (1- (expt 2 32)))
        (pending-bits 0)
        (rez (make-array 0 :element-type 'bit :adjustable t :fill-pointer t)))
    (flet ((emit-bit (bit)
             (vector-push-extend bit rez)
             (let ((pbit (if (zerop bit) 1 0)))
               (loop :repeat pending-bits :do (vector-push-extend pbit rez))
               (:= pending-bits 0))))
      (dovec (char message)
        (with ((range (- hi lo -1))
               ((plo phi) (? envocab char)))
          (:= lo (round (+ lo (* plo range)))
              hi (round (+ lo (* phi range) -1)))
          (loop
            (cond ((< hi #.(expt 2 31))
                   (emit-bit 0))
                  ((>= lo #.(expt 2 31))
                   (emit-bit 1)
                   (:- lo #.(expt 2 31))
                   (:- hi #.(expt 2 31)))
                  ((and (>= lo #.(expt 2 30))
                        (< hi (+ #.(expt 2 30) #.(expt 2 31))))
                   (:- lo #.(expt 2 30))
                   (:- hi #.(expt 2 30))
                   (:+ pending-bits))
                  (t (return)))
            (:= lo (mask32 (ash lo 1))
                hi (mask32 (1+ (ash hi 1)))))))
      (:+ pending-bits)
      (emit-bit (if (< lo #.(expt 2 30)) 0 1)))
    rez))

(defun mask32 (num)
  ;; this utility is used to confine the number in 32 bits
  (logand num #.(1- (expt 2 32))))
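The corrected encoder expects the vocabulary to map each character to its cumulative probability interval (plo phi). For reference, a small helper in plain Common Lisp that builds such a table from raw frequency counts could look like this (my sketch, not from the chapter; it produces the same intervals as the hand-written table in the call below):

(defun cumulative-vocab (freqs)
  "FREQS is a list of (char . count) pairs; return a hash-table mapping
   each char to a (plo phi) list of cumulative probability bounds."
  (let ((total (reduce #'+ freqs :key #'cdr))
        (acc 0)
        (vocab (make-hash-table)))
    (dolist (pair freqs vocab)
      (destructuring-bind (char . count) pair
        (setf (gethash char vocab) (list (/ acc total) (/ (+ acc count) total))
              acc (+ acc count))))))

;; (cumulative-vocab '((#\e . 1) (#\a . 1) (#\h . 1) (#\i . 2)
;;                     (#\s . 3) (#\t . 3) (#\Space . 3)))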

CL-USER> (arithm-encode-correct #h(#\e '(0 1/14)
#\a '(1/14 1/7)
#\h '(1/7 3/14)
#\i '(3/14 5/14)
#\s '(5/14 4/7)
#\t '(4/7 11/14)
#\Space '(11/14 1))
"this is a test")
#*10011011010000111000001101010110010101

Note that the length of the compressed message is 38 bits. The same as the Huffman version!

And here, for the sake of completeness and verification, is the decoding routine. It works in a similar fashion but backwards: we determine the interval into which our current number falls, emit the corresponding character, and narrow the search interval to the currently found one. We'll need to have access to the same vocabulary and know the length of the message.

(defun bitvec->int (bits)
  (reduce (lambda (bit1 bit2) (+ (ash bit1 1) bit2))
          bits))

(defun arithm-decode (dedict vec size)
  (with ((len (length vec))
         (lo 0)
         (hi (1- (expt 2 32)))
         (val (bitvec->int (subseq vec 0 (min 32 len))))
         (off 32)
         (rez (make-string size)))
    (dotimes (i size)
      (with ((range (- hi lo -1))
             (prob (/ (- val lo) range)))
        (dotable (char r dedict)
          (with (((plo phi) r))
            (when (>= phi prob)
              (:= (? rez i) char
                  lo (round (+ lo (* plo range)))
                  hi (round (+ lo (* phi range) -1)))
              (return))))
        ;; (print (list val lo hi))  ; uncomment to trace the interval narrowing
        (loop
          (cond ((< hi #.(expt 2 31))
                 ;; do nothing
                 )
                ((>= lo #.(expt 2 31))
                 (:- lo #.(expt 2 31))
                 (:- hi #.(expt 2 31))
                 (:- val #.(expt 2 31)))
                ((and (>= lo #.(expt 2 30))
                      (< hi #.(* 3 (expt 2 30))))
                 (:- lo #.(expt 2 30))
                 (:- hi #.(expt 2 30))
                 (:- val #.(expt 2 30)))
                (t
                 (return)))
          (:= lo (mask32 (ash lo 1))
              hi (mask32 (1+ (ash hi 1)))
              val (mask32 (+ (ash val 1)
                             (if (< off len)
                                 (? vec off)
                                 0)))
              off (1+ off)))))
    rez))

CL-USER> (let ((vocab #h(#\e '(0 1/14)
#\a '(1/14 1/7)
#\h '(1/7 3/14)
#\i '(3/14 5/14)
#\s '(5/14 4/7)
#\t '(4/7 11/14)
#\Space '(11/14 1))))
(arithm-decode vocab
(arithm-encode-correct vocab "this is a test")
14))
"this is a test"

DEFLATE

Entropy-based compression, or, as I would call it, character-level compression, can do only so much: it can't account for repetitions of larger-scale parts of the message. For instance, a message with a single word repeated twice, when compressed with Huffman or arithmetic coding, will have twice the length of the message with a single occurrence of that word. The reason is that the probability distribution, and thus the encoding of each character, will not change. Yet, there's an obvious possibility to reduce the compressed size here. This and other similar cases are much better treated by dictionary-based or block-level encoding approaches. The most well-known and widespread of them is the DEFLATE algorithm, which is a variant of LZ77. Surely, there are other approaches like LZW, LZ78 or the Burrows-Wheeler algorithm (used in bzip2), but they are based on the same basic principle, so studying DEFLATE will allow you to grasp the other algorithms if necessary.

But, before considering DEFLATE, let's first look at the simplest block-level scheme: Run-Length Encoding (RLE). Strictly speaking, it is not even a block-level algorithm, as it once again operates on single characters. The idea is to encode sequences of repeating characters as a single character followed by the number of repetitions. Of course, such an approach will hardly help with natural language texts, which have almost no long character repetitions; instead, it was used for images with limited palettes (like those encoded in the GIF format). It is common for such images to have large areas filled with the same color, so the GIF format, for instance, used RLE for each line of pixels. That was one of the reasons that an image with a horizontal pattern like this:

xxxxx

xxxxx

xxxxx

lent itself to stellar compression, while the same one rotated 90 degrees didn't :)

x x x
x x x
x x x
x x x
x x x

LZ77 is a generalization of the RLE approach that considers runs not just of single characters but of variable-length character sequences. Under such conditions, it becomes much better suited for text compression, especially when the text has some redundancies. For example, program code files tend to have some identifiers constantly repeated (like if, loop, or nil, in Lisp), each code file may have a lengthy identical copyright notice at the top, and so on and so forth. The algorithm operates by replacing repeated occurrences of data with references to a single copy of that data seen earlier in the uncompressed stream. Each repetition is encoded by a pair of numbers: the length of the sequence and the offset back into the stream to where the same sequence was originally encountered.
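To make the length/offset idea concrete, here is a toy tokenizer in plain Common Lisp (a simplified sketch of the principle, not the actual DEFLATE matching code):

(defun lz77-tokenize (str &key (min-match 3))
  "Toy LZ77: return a list of literal characters and (:ref offset length) tokens."
  (let ((i 0) (n (length str)) (tokens ()))
    (loop :while (< i n) :do
      (let ((best-len 0) (best-off 0))
        ;; find the longest match that starts somewhere in the already-seen prefix
        (loop :for start :from 0 :below i :do
          (let ((len (loop :for k :from 0
                           :while (and (< (+ i k) n)
                                       (char= (char str (+ start k))
                                              (char str (+ i k))))
                           :count t)))
            (when (> len best-len)
              (setf best-len len
                    best-off (- i start)))))
        (cond ((>= best-len min-match)
               (push (list :ref best-off best-len) tokens)
               (incf i best-len))
              (t
               (push (char str i) tokens)
               (incf i)))))
    (nreverse tokens)))

;; (lz77-tokenize "abcabcabcd") => (#\a #\b #\c (:REF 3 6) #\d)
;; i.e. "go back 3 characters and copy 6", an overlapping, RLE-like match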

The most popular LZ77-based compression method is DEFLATE. In the algorithm, literals, lengths, and a symbol to indicate the end of the current block of data are all placed together into one alphabet. Distances are placed into a separate alphabet as they occur just after lengths, so they cannot be mistaken for another kind of symbol or vice versa. A DEFLATE stream consists of a series of blocks. Each block is preceded by a 3-bit header indicating the position of the block (last or intermediate) and the type of character-level compression used: no compression, Huffman with a predefined tree, and Huffman with a custom tree. Most compressible data will end up being encoded using the dynamic Huffman encoding. The static Huffman option is used for short messages, where the fixed saving gained by omitting the tree outweighs the loss in compression due to using a non-optimal code.

The algorithm performs the following steps:

  1. Matching and replacement of duplicate strings with pointers: within a single block, if a duplicate series of bytes is spotted (a repeated string), then a back-reference is inserted, linking to the previous location of that identical string instead. An encoded match to an earlier string consists of an 8-bit length (the repeated block size is between 3 and 258 bytes) and a 15-bit distance (which specifies an offset of 1-32768 bytes inside the so-called "sliding window") to the beginning of the duplicate. If the distance is less than the length, the duplicate overlaps itself, indicating repetition. For example, a run of n identical bytes can be encoded as the single byte itself followed by a match of length (1- n) at a distance of 1.

  2. Huffman coding of the obtained block. Instructions to generate the necessary Huffman trees immediately follow the block header. There are, actually, 2 trees: the 288-symbol length/literal tree and the 32-symbol distance tree, which are themselves encoded as canonical Huffman codes by giving the bit length of the code for each symbol. The bit lengths are then run-length encoded to produce as compact a representation as possible.

An interesting fact is that DEFLATE compression is so efficient in terms of speed that it is faster to read a compressed file from an ATA hard drive and decompress it in memory than to read the original, longer version: disk access takes much longer than the CPU processing needed by this rather simple algorithm! The same applies, even more so, to network traffic. That's why compression is used (and enabled by default) in many popular network protocols, for instance, HTTP.

Take-aways

Unlike the previous one, this chapter dealt with basically a single approach instead of exploring many different ones, in order to dig deeper and to demonstrate the use of all the tools that can be applied in algorithmic programming: from a piece of paper to sophisticated profilers. Moreover, the case we have analyzed provides a great showcase not just of the tools but of the whole development process with all its setbacks, trial and error, and discoveries.

Bit fiddling was another topic that naturally emerged in this chapter. It may look cryptic to those who have never ventured into this territory, but mastering the technique is necessary to gain access to a number of important areas of the algorithms landscape.


[1] To make full use of this feature and be able to profile SBCL internal functions, you'll need to compile SBCL with --with-cons-profiling flag. Many thanks to Douglas Katzman for developing this feature and guiding me through its usage.

[2] It was verified by taking the average of multiple test runs.

[3] You can study the details in the [relevant article](https://www.drdobbs.com/cpp/data-compression-with-arithmetic-encodin/240169251).

Joe MarshallA polygot program puzzle

· 49 days ago
Can you come up with an expression that evaluates to the symbol 'scheme in a Scheme system, but evaluates to the symbol 'common-lisp in a Common Lisp system?

Joe MarshallAnaphoric if

· 50 days ago
An anaphoric if expression binds the identifier “it” to the value of the conditional in the scope of the consequent
(aif (find-broken-computer)
     (fix it))
I have two objections to anaphoric macros. The first is that the binding of “it” isn't obvious, the second is the inflexibility of the variable name. You are kind of stuck with “it”. What if you wanted to use “it” to mean something else? Maybe you have an IT department to fix your computers
(let ((it (find-department "IT")))
  (aif (find-broken-computer)
       (tell it (fix it))))   ; name conflict
or maybe you want to nest two anaphoric conditionals
(aif (find-broken-computer)
     (aif (find-broken-component it)
          (repair it)
          (replace it)))
In this case, I want to replace the computer if I cannot repair the broken component, but the inner binding of “it” shadows the outer binding and makes it inaccessible.

The solution is pretty obvious if you think about it (though sometimes it takes me a while to see the obvious). Just replace the conditional test form with a binding form
(aif (it (find-broken-computer))
     (fix it))
This makes it obvious we are binding the variable “it” to the value returned by (find-broken-computer), and it lets us choose the name we bind. If we want to nest these, it would look like this
(aif (computer (find-broken-computer))
     (aif (component (find-broken-component computer))
          (repair component)
          (replace computer)))
But I'm not sure if this is so much more concise and clear than simply using a let expression that it is worth adding this syntax to the language. It's one more thing the reader of the code has to be prepared to encounter.
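For concreteness, the binding-form variant could be implemented with a macro along these lines (a sketch; the definition isn't spelled out in the post itself):

(defmacro aif ((var test) consequent &optional alternative)
  ;; bind VAR to the value of TEST for the scope of both branches
  `(let ((,var ,test))
     (if ,var ,consequent ,alternative)))

;; (aif (computer (find-broken-computer))
;;      (fix computer))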

A slightly different approach would move the binding form closer to where it is used. Note that there is no point in binding a name in the alternative clause to the conditional because it will always have the value nil.
(aif (find-broken-computer)
     (it (fix it)))
and instead of using a let binding, I could use a lambda binding
(aif (find-broken-computer)
     (λ (it) (fix it)))
aif no longer needs to be a macro but can be an ordinary function, which might be handy if your language doesn't have macros
(defun aif (it consequent &optional alternative)
  (if it
      (funcall consequent it)
      (if alternative
          (funcall alternative)
          nil)))

(aif (find-broken-computer)
     (λ (computer)
        (aif (find-broken-component computer)
             (λ (component) (fix component))
             (λ () (replace computer)))))
The explicit lambdas make it obvious what is being bound and what the scope of the binding is, but they do add a bit of visual noise.

Instead of using anaphoric if, I just write the slightly more verbose
(let ((computer (find-broken-computer)))
  (if computer
      (let ((component (find-broken-component)))
        (if component
            (repair component)
            (replace computer)))))
The binding is obvious, and I get to choose the variable being bound; both problems solved. I don't see a compelling reason to use the anaphoric version.

Addendum

Hexstream suggests that “No discussion of anaphoric macros can be complete without at least mentioning anaphoric-variants: https://www.hexstreamsoft.com/libraries/anaphoric-variants/” I wouldn't want to be incomplete, so consider it mentioned.

Joe MarshallMacro pitfalls

· 51 days ago
Macros are a unique source of power in Common Lisp, but there are some pitfalls to watch out for.

A compiler macro is special macro that is expanded only by the compiler. The interpreter doesn't expand the macro and simply evaluates the form like a normal function call. If you aren't careful when writing a compiler macro, the interpreted and compiled forms may not evaluate the same and that's probably not what you want. Here we abuse this effect
(defun foo (x) 'interpreted)

(define-compiler-macro foo (x) ''compiled)

CL-USER> (foo)
INTERPRETED

CL-USER> ((lambda () (foo)))
COMPILED
That might be unexpected. It appears that in this implementation (SBCL) the compiler is called on lambda expressions when they are evaluated.
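Incidentally, the standard accessor compiler-macro-function can be used to check whether a compiler macro is installed for a name at all:

;; Returns the expansion function, or NIL when no compiler macro is defined.
(compiler-macro-function 'foo)  ; => a function object after the definitions above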

Like all macros, a compiler macro is given the unevaluated source code of the arguments, not the value. We can see that in this example
(defun foo (l r)
  (format t "~%Foo")
  (list r l))

(define-compiler-macro foo (l r) 
  `(progn 
     (format t "~%Foo")
     (list ,r ,l)))

CL-USER> (foo (progn (format t "~%First") 'l) (progn (format t "~%Second") 'r))

First
Second
Foo
(r l)

CL-USER> ((lambda () (foo (progn (format t "~%First") 'l) (progn (format t "~%Second") 'r))))

Foo
Second
First
(r l)
When interpreted, the arguments are evaluated left to right before the function is entered. When compiled, the arguments end up being evaluated right to left and after the function is entered.

Unless you really want this — and shame on you if you do — you have to be careful when writing your macro to preserve the left-to-right, call-by-value semantics that are probably expected. The easiest way to do this is to write the expansion so that it just substitutes the body of the function. Something like this
(define-compiler-macro foo (l r)
  `((lambda (l r)
      (format t "~%Foo")
      (list r l))
    ,l
    ,r))

CL-USER> (foo (progn (format t "~%First") 'l) (progn (format t "~%Second") 'r))

First
Second
Foo
(r l)
Or you could use a let expression with the same effect
(define-compiler-macro foo (l r)
  `(let ((l ,l)
         (r ,r))
     (format t "~%Foo")
     (list r l)))
The version with the lambda expression doesn't even require putting a block of let bindings at the front. You just plop down the original argument list and body after the lambda, but both forms are equivalent.

The problem with doing this is that you have probably disabled the ability of the compiler to optimize the expression. You are forcing the compiler to ensure that the arguments are evaluated in left-to-right order before the body. A Sufficiently Smart compiler might be able to provide some optimizations anyway. If your compiler is not Sufficiently Smart, you can take matters into your own hands and substitute the arguments at the point they are used. Just be aware that you might be surprising people by changing the semantics at the call site.

Funny semantics isn't just a problem with compiler macros. Regular macros have to be written with care as well or you may surprise users when they write code they think are normal function calls. Compiler macros just have the unique property that they can change the semantics between interpreted and compiled code.

You can see a related effect when using symbol macros. A symbol macro substitutes a piece of code that computes a value. If we write
CL-USER> (let ((l (progn (format t "~%First") 'l))
               (r (progn (format t "~%Second") 'r)))
           (format t "~%Let body")
           (list r l))

First
Second
Let body
(r l)
we get the standard left-to-right, call-by-value evaluation. But we can mimic normal-order reduction by substituting the code for l and r before evaluating the body of the let by use of symbol-macrolet*
CL-USER> (symbol-macrolet ((l (progn (format t "~%First") 'l))
                           (r (progn (format t "~%Second") 'r)))
           (format t "~%Symbol-macrolet body")
           (list r l))

Symbol-macrolet body
Second
First
(r l)
If one of the arguments to a macro is a block of code, for instance the &body argument, then you probably want to avoid accidental variable capture.
(defmacro capturing-macro (&body body)
  `(let ((temp 'captured))
     (format t "~%Macro body binds temp to ~S" temp)
     ,@body))

(let ((temp 'lexical))
  (capturing-macro
     (format t "~%Temp is ~s" temp)))

Macro body binds temp to CAPTURED
Temp is CAPTURED
NIL
The lexical binding of temp is shadowed by the binding introduced by capturing-macro. This is probably unintended (except in the case of anaphoric macros, where capture is intended). Instead, you can ensure lexical scoping is maintained by closing over the body before introducing any new bindings
(defmacro non-capturing-macro (&body body)
  `(let ((temp 'captured)
         (body (lambda () ,@body)))
     (format t "~%Macro body binds temp to ~S" temp)
     (funcall body)))

(let ((temp 'lexical))
  (non-capturing-macro
    (format t "~%Temp is ~s" temp)))

Macro body binds temp to CAPTURED
Temp is LEXICAL
NIL
In this case, even a fairly naive compiler ought to be able to inline the call to body because it is simply a lexically apparent code block.

Inadvertent capture can happen in other direction as well if the macro caller shadows a binding used by the macro.
(flet ((funcall (x) (format t "~%Unexpected")))
  (let ((temp 'lexical))
    (non-capturing-macro
      (list temp))))

Macro body binds temp to CAPTURED
Unexpected
NIL
Here the caller shadowed funcall and the code the macro introduced ends up inadvertently calling it. This doesn't happen often in practice because people rarely shadow the top-level functions a macro depends upon, and that is good because there isn't an easy way to solve this reverse capture problem (other than don't do that).

The “hygienic” macro system in Scheme solves both kinds of accidental capture by appropriately renaming variables. There is a price, however. You either have to forego direct code manipulation and use a special pattern matching language, or write code that explicitly keeps track of the environment where the variables are bound. For simple macros, the pattern matching language is adequate, but for more complex macros, neither option is appealing.

*Macrolet rhymes with Chevrolet, naturally.

Joe MarshallFour ways to use macros

· 53 days ago
The way I see it, there are about four five basic ways to use macros in Common Lisp.

First are macros that circumvent the regular call-by-value semantics. These might evaluate a subform at macro expansion time, treat a subform as a place (an l-value) rather than a value, or otherwise treat a subform as something other than a runtime function call. For example, if incf fully evaluated its argument, it could perform the increment on the value, but it couldn't put the value back where it got it. Another example is the check-type macro. You use it like this:
(defun foo (x)
  (check-type x (integer 1 *) "a positive integer")
  (bar (- x 1)))
The check-type macro has to be a macro because it treats x as a place (it will allow you to proceed by modifying x), and it treats its second argument as a type specifier.

Second are macros that introduce new syntax to the language. Examples are cond, case, do, dotimes, defun, defvar, etc. These treat their arguments specially or have special clauses that don't act like ordinary function calls.
CL-USER> (macroexpand-all '(do ((i 0 (+ i 1))
                                (j 1 (* j 2)))
                               ((> j 65536) nil)
                             (format t "~%~2d ~5d" i j)))

(BLOCK NIL
  (COMMON-LISP:LET ((I 0) (J 1))
    (TAGBODY
      (GO #:G748)
     #:G747
      (TAGBODY (FORMAT T "~%~2d ~5d" I J))
      (COMMON-LISP:LET* ((#:NEW1 (+ I 1)) (#:NEW1 (* J 2)))
        (SETQ I #:NEW1)
        (SETQ J #:NEW1)
        NIL)
     #:G748
      (IF (> J 65536)
          NIL
          (GO #:G747))
      (RETURN-FROM NIL (PROGN NIL))))

Third are macros that implement tiny languages within Common Lisp. The loop macro is a good example. It looks like this
(loop for i from 1 to (compute-top-value)
      while (not (unacceptable i))
      collect (square i)
      do (format t "Working on ~D now" i)
      when (evenp i)
        do (format t "~D is a non-odd number" i) 
      finally (format t "About to exit!"))
It works like a little compiler. It parses the loop clauses and generates a Lisp form that carries them out
(BLOCK NIL
  (LET ((I 1) (#:LOOP-LIMIT-744 (COMPUTE-TOP-VALUE)))
    (DECLARE (TYPE (AND NUMBER REAL) #:LOOP-LIMIT-744)
             (TYPE (AND REAL NUMBER) I))
    (SB-LOOP::WITH-LOOP-LIST-COLLECTION-HEAD (#:LOOP-LIST-HEAD-745
                                              #:LOOP-LIST-TAIL-746)
      (TAGBODY
       SB-LOOP::NEXT-LOOP
        (WHEN (> I #:LOOP-LIMIT-744) (GO SB-LOOP::END-LOOP))
        (UNLESS (NOT (UNACCEPTABLE I)) (GO SB-LOOP::END-LOOP))
        (SB-LOOP::LOOP-COLLECT-RPLACD
         (#:LOOP-LIST-HEAD-745 #:LOOP-LIST-TAIL-746) (LIST (SQUARE I)))
        (FORMAT T "Working on ~D now" I)
        (IF (EVENP I)
            (FORMAT T "~D is a non-odd number" I))
        (SB-LOOP::LOOP-DESETQ I (1+ I))
        (GO SB-LOOP::NEXT-LOOP)
       SB-LOOP::END-LOOP
        (FORMAT T "About to exit!")
        (RETURN-FROM NIL
          (SB-LOOP::LOOP-COLLECT-ANSWER #:LOOP-LIST-HEAD-745)))))

Fourth are macros that run code in a special context. The with-… macros and when and unless fall in this category. These macros take ordinary Lisp code and wrap it with other code that establishes a context or tests a conditional.
CL-USER> (macroexpand '(when (condition) (foo) (bar)))

(IF (CONDITION)
    (PROGN (FOO) (BAR)))

CL-USER> (macroexpand '(with-open-file (foo "~/.bashrc" :if-does-not-exist :create)
                         (print (read-line foo))
                         (bar)))

(LET ((FOO (OPEN "~/.bashrc" :IF-DOES-NOT-EXIST :CREATE)) (#:G751 T))
  (UNWIND-PROTECT
      (MULTIPLE-VALUE-PROG1 (PROGN (PRINT (READ-LINE FOO)) (BAR)) (SETQ #:G751 NIL))
    (WHEN FOO (CLOSE FOO :ABORT #:G751))))

These aren't hard and fast categories, many macros can be thought of as in more than one category. All macros work by syntactic transformation and most treat at least one of their subforms as something other than a call-by-value form, for instance. There are also the occasional macros that have the look and feel of a standard function calls. The series package appears to allow you to manipulate series through standard function calls, but works by clever macroexpansion into iterative code.

I find it useful to think of macros in these four different ways, but I'm sure that others have their own categorizations that they find useful.

Addendum

An anonymous reader asked, “What about code walking/analysis/optimization?”. I really overlooked that. I think Richard Waters's series package would be a good example. It takes ordinary functional programs that can operate on series of data (coinductively defined data or codata), and turns it into the equivalent iterative construct that operates on one element at a time. It does this by clever macros that walk the code, analyse it, and rewrite it to a more optimal form
CL-USER> (sb-cltl2:macroexpand-all '(let ((x (scan-range :from 0 :below 10)))
                               (collect (choose-if #'evenp x))))

(COMMON-LISP:LET* ((X (COERCE (- 0 1) 'NUMBER))
                   (#:LASTCONS-751 (LIST NIL))
                   (#:LST-752 #:LASTCONS-751))
  (DECLARE (TYPE NUMBER X)
           (TYPE CONS #:LASTCONS-751)
           (TYPE LIST #:LST-752))
  (TAGBODY
   #:LL-756
    (SETQ X (+ X (COERCE 1 'NUMBER)))
    (IF (NOT (< X 10))
        (GO SERIES::END))
    (IF (NOT (EVENP X))
        (GO #:LL-756))
    (SETQ #:LASTCONS-751 (SB-KERNEL:%RPLACD #:LASTCONS-751 (CONS X NIL)))
    (GO #:LL-756)
   SERIES::END)
  (CDR #:LST-752)))
As you can see, the series code does a major rewrite of the original lisp code. An astute reader will notice that the let form has to have been redefined to do dataflow analysis of it's bindings and body. Thanks to anonymous for the suggestion.

Addendum 2: Symbol macros

A comment by Paul F. Dietz got me thinking about symbols and it occurred to me that symbol macros deserve their own category as well. Symbol macros appear to be an ordinary symbolic references to variables, but they expand to some code that computes a value. For instance, if foo were a symbol macro that expanded to (car bar), then using it in a form such as (+ foo 7) would expand to (+ (car bar) 7). A symbol macro is a two-edged sword. It is a very useful abstraction for providing a name to a computed value, but they also can fool the user of such a macro into thinking that a simple variable reference is happening when some more complex computation could be happening.
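For instance, the standard define-symbol-macro could be used like this (an illustrative sketch; the variable *bar* is invented for the example):

(defvar *bar* (list 1 2 3))
(define-symbol-macro foo (car *bar*))

(+ foo 7)      ; => 8, because it expands to (+ (car *bar*) 7)
(setf foo 10)  ; also works: the expansion (car *bar*) is a place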

I think that makes seven ways and counting.

Joe MarshallDispatching

· 56 days ago
There are times when you are faced with a complex piece of control flow
try {
    if (condition())
    {
        ... block 1 ...
    }
    else 
    {
        switch (someValue())
        {
          case CASE_A:
            ... block 2 ...
            break;

          case CASE_B:
            ... block 3 ...
            break;

          default:
            ... block 4 ...
            break;
        }
    }
} catch (SomeException someException) {
    ... block 5 ...
}
and you want to abstract the control flow — all the conditionals, switches, and try/catches — away from the code that does the work — the various blocks. In fact, here I've abstracted it away by putting "... block n ..." in place of the blocks of code.

If I were writing this in Scheme or Common Lisp, I'd consider using continuation passing style. I'd write a function dispatch-on-foo that would perform the dispatch, but then invoke one of several first-class procedures passed in as arguments
(defun dispatch-on-foo (foo bar case1 case2 case3 case4 case5)
   (if (... complex conditional ...) 
       (funcall case1)
       (handler-case (case some-value
                       ((case-a) (funcall case2))
                       ((case-b) (funcall case3))
                       (t (funcall case4)))
         (error (condition) (funcall case5)))))
At the call site, I'd write
(dispatch-on-foo <arg1> <arg2>
  (lambda ()
     ... block 1 ...)
  (lambda ()
     ... block 2 ...)
  (lambda ()
     ... block 3 ...)
  (lambda ()
     ... block 4 ...)
  (lambda ()
     ... block 5 ...))
This is a win when the complexity of the dispatch is enough that you don't want to replicate it at every call site. Notice how the nested blocks of code have been pulled up to the same level and linearized. Granted, you've cluttered up the call site with lambda expressions, but as Steele pointed out, you can think of these as anonymous go tags: dispatch-on-foo in essence will end up doing a jump to one of these tags and execute the block there, skipping and ignoring the other blocks. Once you get used to thinking in this way, the lambdas disappear just like the parens do for a seasoned Lisp hacker. They just look like jump targets or case labels, and the call site looks a lot like a case expression. It is a bit more powerful than an ordinary case expression because you could arrange for dispatch-on-foo to funcall the appropriate closure on an argument (and have the lambda expression take an argument of course).

You could do something analagous with Java 8's lambdas, but on the rare occasion I've wanted to do something similar in Java 7. The problem is that Java 7 doesn't have lambda expressions. The solution is to change these anonymous lambdas into named callback methods. First we define a generic interface with our callbacks:
interface DispatchOnFooCases<T> {
    T caseOne ();
    T caseTwo ();
    T caseThree ();
    ... etc. ...
 }
then we define the dispatch method:
<T> T dispatchOnFoo (FooClass foo, BarClass bar, DispatchOnFooCases<T> dispatchOnFooCases)
{
    try {
        if (conditional())
            return dispatchOnFooCases.caseOne();
        else
            switch (someValue()) {
              case CASE_A:
                return dispatchOnFooCases.caseTwo();

              case CASE_B:
                return dispatchOnFooCases.caseThree();

              default:
                return dispatchOnFooCases.caseFour();
            }
    } catch (SomeException someException) {
        return dispatchOnFooCases.CaseFive();
    }
}
finally, at the call site, we write this:
{
    int value =
        dispatchOnFoo (foo, bar,
            new DispatchOnFooCases<Integer> ()
            {
                @Override
                public Integer caseOne ()
                {
                    ... block 1 ...
                    return aValue;
                }

                @Override
                public Integer caseTwo ()
                {
                    ... block 2 ...
                    return aDifferentValue;
                }

                ... etc. ...
            });
}
The good news is that we've accomplished our goal of abstracting the complex conditional dispatch from the code that does the real work — the method bodies at the call site.

There is, unfortunately, a fair amount of bad news. First, if you thought lambda expressions introduced clutter, then this is a serious amount of clutter. Between @Overrides, type declarations, interfaces, and methods, there is just a lot of extra stuff you have to type. It still might be worth the clutter if the dispatch conditions are complex enough. They just need to be that much more complex to justify all this machinery. (We've actually done the work the compiler would do to allocate and pass a “multi-closure”.) There are cases where this pays off, though.

The second piece of bad news is that Java is not (in general) tail recursive. This means that the call to dispatchOnFoo and the callback to one of the cases both introduce a new stack frame. So although the case methods run in the same lexical environment as where they are defined, they are running two stack frames deeper. This won't make much of a difference unless you try to loop by recursively calling the code. In that case, you need to be very careful to limit the amount of recursion or you will overflow the stack. It is best to avoid recursion as much as possible in the bodies of the cases.

You probably won't need to resort to doing this. It can be a case of the cure being worse than the disease. The complexity of introducing callbacks can exceed the complexity of the conditional you are trying to abstract. But this is an interesting way to abstract a very complex conditional and can come in handy when you can justify using it. I have actually used this technique in production code to separate some complex control flow from the code that did the actual work.

Lispers.deHamburg Lispers Meetup, Monday, 3rd February 2020

· 60 days ago

Greetings, Lisplings!

Our monthly gathering in Hamburg takes place as usual on the first Monday of the month, on 3rd February 2020. We meet around 19:00 (“7 p. m.”) at Ristorante Opera, Dammtorstr. 7.

This is an informal event for all Lisp experience levels, and about any Lisp dialect.

Come as you are, bring lispy topics.

Zach BeaneUltralisp search is awesome

· 63 days ago

Ultralisp.org has a great new search system. It indexes the docstrings of all the Ultralisp projects, and as a result it is very easy to search for specific functionality and get useful results. For example, I wanted to see what projects have base64-related functions, and now it’s just a simple search away. I highly recommend it.

Joe MarshallThe pros and cons of Agile

· 67 days ago
Agile methodology is currently the popular way of attempting to develop commodity software in a factory-like manner.  It's a little hard to define what exactly is Agile and what is not.  I've worked in several companies and with several groups in companies that all claim to be Agile, yet they did things in very different ways.  But they all seemed to agree on a few things, and this I suppose could be the spirit of Agile, if not the letter.

The basic characteristic of Agile is that you break the software down into smaller and smaller tasks until you reach tasks that a small (2-8 person) team of engineers can comfortably complete in a short (1-2 week) time frame.  Then at the end of the time frame, you supposedly have some software that "works" in some sense of the word.  It exhibits basic functionality and the general gist of what the customer wanted, if not satisfying all the requirements for the entire project. Then over the next several time periods of development, the software is iteratively improved by fleshing out remaining requirements, adding needed functionality, hooking components together, etc. During this time, the customer is involved to make sure that what is being delivered is actually what is needed.

Some projects I worked on did this formally by a book and followed strict guidelines for the methodology, others just winged it, but all had the basic characteristics above.

One of the primary advantages of Agile is its usefulness to management.  By having the large problem broken down into team-size, biweekly pieces, management can easily allocate and track resource usage and progress.  They can treat a team as a black box of software development and assign tasks as they arise.  Management can attempt to measure the performance of a team and see whether it is increasing, decreasing, or remaining steady.  Teams are what managers like to manage.

Another advantage is frequent feedback from the customer.  Since after each time period, a somewhat working version of some fragment of the product is available for demonstration, the customer can give feedback as to how and whether this seems to meet his needs.  He can offer suggestions about what might be improved, what features he needs to make the product at least minimally useful to him, and prevent development from getting off track.

But Agile is not a panacea.  There is still a significant amount of software produced with the traditional “waterfall” methodology, where the specification is completed before coding begins, integration is done as a final step, and only then is the product released to the customer.  There is also a fair amount of software written “artistically”. I would define artistic software as that written by a single developer working alone over a period of several months. Frequently, such a project never gets beyond the hobbyist stage, and as such it is a risky approach to writing production code. But on occasion, an artistic project can turn into something novel and useful. It more often exhibits a unity of vision and coherence that is harder to find in software written by groups of people. (Which isn't to say that small groups cannot write software with unity of vision and coherence, it's just harder. Or they'll have one particular person in the group that has more insight than the others.)

Managers aren't as useful to artistic developers. Artistic developers tend to manage themselves. And you cannot swap out one developer for another without swapping out the entire project with him. A manager can work with an artistic developer as a peer, and help manage the project, but cannot manage the developer.

Frequently involving customers has its pros and cons as well. Often customers have difficulty imagining anything beyond incremental improvements to the current ways of doing things. They'll want a UI widget that will make some task slightly easier, but not think of automating the task altogether. They'll want to automate a series of inefficient tasks when a different viewpoint of the end result might make those intermediate tasks irrelevant. Customers are not impressed with changes to the code that don't produce visible effects. You may have spent a week refactoring in order to make it trivial to add new commands and new outputs, but customers don't care. Customers don't care about potential use cases, they care about their specific use case to the exclusion of everything else. This can be discouraging to developers.

Because Agile is so useful to managers, big and intermediate sized companies will continue to use it to develop commodity software in a factory-like style. It isn't going to be replaced any time soon. But there is still ample room in the market for small companies and individuals with vision to carve out niches that Agile methodologies will overlook and find tricky to fill.

(But I'm a romantic at heart, and I like the image of the lone hacker crafting software on his home computer in his room late at night. If only it were easy to make a living that way.)

Lispers.deBerlin Lispers Meetup, Monday, 27th January 2020

· 71 days ago

We meet again on Monday 8pm, 27th January. Our host this time is James Anderson (www.dydra.com).

Berlin Lispers is about all flavors of Lisp including Common Lisp, Emacs Lisp and MacLisp.

Jörg Preisendörfer will talk about the Ion serialization format and Common Lisp. Ion is represented by either a human-readable text form or a compact binary form. The text form is a superset of JSON. [1]

We will also talk about the upcoming Lisp-related conferences in 2020.

[1] https://en.wikipedia.org/wiki/Ion_(serialization_format)

We meet in the Taut-Haus at Engeldamm 70 in Berlin-Mitte, the bell is "James Anderson". It is located in 10min walking distance from U Moritzplatz or U Kottbusser Tor or Ostbahnhof. In case of questions call Christian +49 1578 70 51 61 4.

Joe MarshallBut what if you really want to push a stack frame?

· 72 days ago
If you really don't want tail recursion, the solution is simple: don't put the call in “tail position”. We define a rather trivial function dont-tail-call and use it like this:
(dont-tail-call
  (foo x))
The semantics of Scheme are that the arguments are evaluated before the call to the function, so the call to foo is required to occur before the call to dont-tail-call, which necessitates allocation of a continuation.

But what about a really clever compiler that somehow “optimizes” away the call to dont-tail-call? Such an “optimization” is highly questionable because it changes the observable semantics of the program so it is no longer call-by-value. A compiler that performed that transformation wouldn't be a Scheme compiler because Scheme is a call-by-value language.

But what if we had a really clever compiler and we disregarded the change in semantics? We can easily thwart the compiler by deferring the definition of dont-tail-call until run time. Even the cleverest compiler cannot see into the future.

The definition of dont-tail-call is left to the reader, as is how to defer its definition until run time.
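One possible definition (a sketch only, since the post deliberately leaves this as an exercise) is just the identity function, built with eval at run time so that even a clairvoyant compiler can't see through it:

;; Assumes an R7RS-style eval and interaction-environment.
(define dont-tail-call
  (eval '(lambda (x) x) (interaction-environment)))

(dont-tail-call (foo x))   ; (foo x) is now an argument, hence not in tail position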

Joe MarshallAfraid of Tail Recursion

· 72 days ago
It's a well-known fact among proponents of tail recursion that some people just don't get it. They view tail recursion at best as a quirky compiler optimization that turns some recursive calls into loops. At worst, they see it as some sort of voodoo, or a debugging pessimization. They see little value in it. Some have outright disdain for it.

Tail recursion isn't just about turning recursive calls into loops. It's about changing how you look at function calling. Tail recursion just happens to fall out of this new viewpoint.

Most programmers, I think, view function calls as if they were akin to a short vacation. You pack up the arguments in your luggage, travel to the destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return home, unpack everything and resume life where you left off. Your souvenirs are the return value.

Should you need a vacation from your vacation, you do the same thing: pack up the arguments in your luggage, travel to your new destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return to your original vacation spot and resume your original vacation.

Tail recursion aficionados realize that the journey itself is the important part of the function call, and that a vacation includes two journeys. On the first journey you pack up the arguments, including the return ticket, in your luggage, use the outbound ticket to journey to the destination, unpack your luggage, and start doing stuff. When you run out of stuff to do, you make the second journey. You fetch the return ticket, repack your luggage, take the ticket to wherever it leads (presumably back home), unpack everything, and resume doing whatever you were doing there.

But perhaps you want to visit grandma instead of going directly home. Then we change the script slightly. When you run out of things to do on your vacation, you pack up your luggage with your souvenirs and the return ticket, then you journey to grandma's house, where you unpack and start doing stuff. Eventually you are done visiting grandma, so then you fetch the return ticket, repack your luggage, take the ticket to wherever it leads, unpack everything, and resume doing stuff there. It's a three-legged journey. You don't go from grandma's back to the vacation resort — there's nothing left for you to do there. You take the return ticket directly home.

Viewing things this way, a function call involves packaging the arguments in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the function, where we unpack the arguments and resume execution of the program. It is simply “a goto that passes arguments”.*

A function return is simply “a goto that passes a return value”. It involves packaging the return value in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the return address, where we resume execution of the program.

A tail recursive function call is simply “a goto that passes arguments”. It involves packaging the arguments in a suitable way, deallocating any temporary storage and then making an unconditional transfer to the function, where we resume execution of the program.

Do we really deallocate temporary storage before every control transfer? Certainly a return pops the topmost stack frame, and as often implemented, a tail recursive function call deallocates its stack frame or replaces it before transferring control, but a non-tail-recursive call? It does so as well; it's just that it also has to pack those values into a new continuation for the return trip. We use an implementation trick to avoid the absurdity of actually moving these values around: we move the base-of-frame pointer instead. Voila, we simultaneously deallocate the stack frame and allocate the continuation with the right values already in place.

Deallocating storage before each control transfer is an important part of the protocol. We're making an unconditional transfer to a destination with the hope, but no guarantee, that we'll come back, so we'd better tidy up before we leave. This ensures that we won't leave a trail of garbage behind us on each transfer, which would accumulate and adversely affect the space complexity of our program.

Once you view a function call and return as not being a single sequence, but each one a separate, and virtually identical sequence, then tail recursion becomes a natural consequence. Tail recursion isn't a special case of function call, it is the same thing as a function call, the only difference being whether a new continuation (the "return ticket") is allocated in order to come back. Even function returns are the same thing, the only difference being that the destination is (usually) determined dynamically rather than statically. Tail recursion isn't just another optimization, it's the result of treating inter-procedural control transfer symmetrically.

Another natural consequence is greatly increased options for control flow. Instead of a strict call/return pattern, you can make "three-legged" trips, or however many legs you need. You can make loops that incorporate one, two, or even a dynamically changing number of functions. You can replace compiler-generated returns with user-provided function calls (continuation-passing style) and implement arbitrarily complex control and data flow like multiple return values, exception handling, backtracking, coroutines, and patterns that don't even have names yet. And of course you can mix and match these patterns with the standard call and return pattern as well.
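Here is a small continuation-passing sketch in Scheme (my own toy example, not from the post): the caller hands divmod two explicit "return tickets", and because every transfer is a tail call, control can bounce between them without growing the stack:

(define (divmod n d on-ok on-zero-divisor)
  (if (zero? d)
      (on-zero-divisor n)                       ; goto the error continuation
      (on-ok (quotient n d) (remainder n d))))  ; goto the caller, passing two return values

(divmod 17 5
        (lambda (q r) (list q r))               ; => (3 2)
        (lambda (n) 'division-by-zero))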

The phrase "tail recursion" is shorthand for this symmetric view of interprocedural control flow and is meant to encompass all these consequences and control flow options that spring from it. It's not about simply turning recursive functions into loops.

People who are afraid of tail recursion seem unable to see any value in taking up the symmetric viewpoint despite the fact that it opens up a whole new set of control flow techniques (in particular continuation-passing style). They find the notion that a procedure call is “a goto that passes arguments” “nonsensical”. A lot of good research has come from taking this nonsense seriously.


*The view that a function call is simply a “a goto that passes arguments” was developed by Steele in his “Lambda papers”.

The important point of cleaning up before the control transfer was formalized by Clinger in “Proper Tail Recursion and Space Efficiency”.

Someone — it might have been Clinger, but I cannot find a reference — called tail recursion “garbage collection for the stack”. The stack, being so much more limited in size than the heap, needs it that much more. Indeed Clinger notes the tight connection between tail recursion and heap garbage collection and points out that heap garbage collection is hampered if the stack is retaining pointers to logically dead data structures. If the dead structures are large enough, garbage collection can be rendered useless. Yet many popular languages provide garbage collection but not tail recursion.

The only difference between a call and return is that typically the call is to a statically known location and the return address is dynamically passed as a "hidden" argument. But some compilers, like Siskind's Stalin compiler, statically determine the return address as well.

The only difference between a function call and a tail recursive function call is when you need to return to the caller to complete some work. In this case, the caller needs to allocate a new continuation so that control is eventually returned. If there is no further work to be done in the caller, it doesn't create a new continuation, but simply passes along the one that was passed to it.

Many compilers have been written that handle function calls, tail recursive function calls, and returns identically. They only change what code they emit for handling the continuation allocation. These compilers naturally produce tail recursive code.

Most machines provide special purpose support for a LIFO stack. It is tempting to use the stack for allocation of continuations because they are almost always allocated and deallocated in LIFO order, and a stack gives higher performance when this is the case. Many compilers do in fact use the stack for continuations and argument passing. Some, like Winkelmann's Chicken compiler, follow Baker's suggestion and treat the stack as a "nursery" for the heap. Others avoid using the stack altogether because of the extra complexity it entails. And others cannot use the stack because of constraints placed on stack usage by OS conventions or the underlying virtual machine model.

Paul KhuongLazy Linear Knapsack

· 73 days ago

The continuous knapsack problem may be the simplest non-trivial linear programming problem:

\[\max_{x \in [0, 1]^n} p'x\] subject to \[w'x \leq b.\]

It has a linear objective, one constraint, and each decision variable is bounded to ensure the optimum exists. Note the key difference from the binary knapsack problem: decision variables are allowed to take any value between 0 and 1. In other words, we can, e.g., stick half of a profitable but large item in the knapsack. That's why this knapsack problem can be solved in linear time.

Dual to primal is reasonable

Duality also lets us determine the shape of all optimal solutions to this problem. For each item \(i\) with weight \(w_i\) and profit \(p_i\), let its profit ratio be \(r_i = p_i / w_i,\) and let \(\lambda^\star\) be the optimal dual (Lagrange or linear) multiplier associated with the capacity constraint \(w'x \leq b.\) If \(\lambda^\star = 0,\) we simply take all items with a positive profit ratio (\(r_i > 0\)) and a non-negative weight \(w_i \geq 0.\) Otherwise, every item with a profit ratio \(r_i > \lambda^\star\) will be at its weight upper bound (1 if \(w_i \geq 0\), 0 otherwise), and items with \(r_i < \lambda^\star\) will instead be at their lower bound (0 if \(w_i \leq 0\), and 1 otherwise).

Critical items, items with \(r_i = \lambda^\star,\) will take any value that results in \(w'x = b.\) Given \(\lambda^\star,\) we can derive the sum of weights for non-critical items; divide the remaining capacity for critical items by the total weight of critical items, and let that be the value for every critical item (with the appropriate sign for the weight).

For example, if we have capacity \(b = 10,\) and the sum of weights for non-critical items in the knapsack is \(8,\) we're left with another two units of capacity to distribute however we want among critical items (they all have the same profit ratio \(r_i = \lambda^\star,\) so it doesn't matter where that capacity goes). Say critical items with a positive weight have a collective weight of 4; we could then assign a value of \(2 / 4 = 0.5\) to the corresponding decision variable (and 0 for critical items with a non-positive weight).

We could instead have \(b = 10,\) and the sum of weights for non-critical items in the knapsack \(12\): we must find two units of capacity among critical items (they all cost \(r_i = \lambda^\star\) per unit, so it doesn't matter which). If critical items with a negative weight have a collective weight of \(-3,\) we could assign a value of \(-2 / -3 = 0.6\overline{6}\) to the corresponding decision variables, and 0 for critical items with a non-negative weight.

The last case highlights something important about the knapsack: in general, we can't assume that the weights or profits are positive. We could have an item with a non-positive weight and non-negative profit (that's always worth taking), an item with positive weight and negative profit (never interesting), or weights and profits of the same sign. The last case is the only one that calls for actual decision making. Classically, items with negative weight and profit are rewritten away, by assuming they're taken in the knapsack, and replacing them with a decision variable for the complementary decision of removing that item from the knapsack (i.e., removing the additional capacity in order to improve the profit). I'll try to treat them directly as much as possible, because that reduction can be a significant fraction of solve times in practice.

The characterisation of optimal solutions above makes it easy to directly handle elements with a negative weight: just find the optimal multiplier, compute the contribution of non-critical elements (with decision variables at a bound) to the left-hand side of the capacity constraint, separately sum the negative and positive weights for critical elements, then do a final pass to distribute the remaining capacity to critical elements (and 0-weight / 0-value elements if one wishes).

Solving the dual looks like selection

Finding the optimal multiplier \(\lambda^\star\) is similar to a selection problem: the value is either 0 (the capacity constraint is redundant), or one of the profit ratios \(r_i,\) and, given a multiplier value \(\lambda,\) we can determine if it's too high or too low in linear time. If the non-critical elements yield a left-hand side such that critical elements can't add enough capacity (i.e., no solution with the optimal form can be feasible), \(\lambda\) is too low. If the maximum weight of potentially optimal solutions is too low, \(\lambda\) is too high.

We can thus sort the items by profit ratio \(r_i\), compute the total weight corresponding to each ratio with a prefix sum (with a pre-pass to sum all negative weights), and perform a linear (or binary) search to find the critical profit ratio. Moreover, the status of non-critical items is monotonic as \(\lambda\) grows: if an item with positive weight is taken at \(\lambda_0\), it is also taken for every \(\lambda \leq \lambda_0\), and a negative-weight item that's taken at \(\lambda_0\) is also taken for every \(\lambda \geq \lambda_0.\) This means we can adapt selection algorithms like Quickselect to solve the continuous knapsack problem in linear time.

I'm looking at large instances, so I would like to run these algorithms in parallel or even distributed on multiple machines, and ideally use GPUs or SIMD extensions. Unfortunately, selection doesn't parallelise very well: we can run a distributed quickselect where every processor partitions the data in its local RAM, but that still requires a logarithmic number of iterations.

Selection looks like quantile estimation; does the dual?

Lazy Select offers a completely different angle for the selection problem. Selecting the \(k\)th smallest element from a list of \(n\) elements is the same as finding the \(k / n\)th quantile[1] in that list of \(n\) elements. We can use concentration bounds[2] to estimate quantiles from a sample of, e.g., \(m = n^{3/4}\) elements: the population quantile value is very probably between the \(qm - \sqrt{m}\,\log m\)th and \(qm + \sqrt{m}\,\log m\)th values of the sample. Moreover, this range very probably includes at most \(\mathcal{O}(n^{3/4})\) elements,[3] so a second pass suffices to buffer all the elements around the quantile, and find the exact quantile. Even with a much smaller sample size \(m = \sqrt{n},\) we would only need four passes.
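To see where that \(\sqrt{m}\,\log m\) slack comes from (a sketch of the standard argument, using the same Hoeffding bound as in the footnotes): for any fixed threshold \(x\) with population CDF \(F(x)\), the empirical CDF \(\hat{F}(x)\) of an i.i.d. sample of size \(m\) satisfies

\[\Pr\left[|\hat{F}(x) - F(x)| \geq \varepsilon\right] \leq 2 e^{-2\varepsilon^2 m},\]

so taking \(\varepsilon = \frac{\log m}{\sqrt{m}}\) drives the failure probability down to \(2 e^{-2\log^2 m}\), and the rank \(m\hat{F}(x)\) of the population \(q\)-quantile within the sample lands in \(qm \pm m\varepsilon = qm \pm \sqrt{m}\,\log m\) with high probability.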

Unfortunately, we can't directly use that correspondence between selection and quantile estimation for the continuous knapsack.

I tried to apply a similar idea by sampling the knapsack elements equiprobably, and extrapolating from a solution to the sample. For every \(\lambda,\) we can derive a selection function \(f_\lambda (i) = I[r_i \geq \lambda]w_i\) (invert the condition if the weight is negative), and scale up \(\sum_i f_\lambda(i)\) from the sample to the population. As long as we sample independently of \(f\), we can reuse the same sample for all \(f_\lambda.\) The difficulty here is that, while the error for Lazy Select scales as a function of \(n,\) the equivalent bounds with variable weights are a function of \(n(|\max_i w_i| + |\min_i w_i|)^2.\) That doesn't seem necessarily practical; scaling with \(\sum_i |w_i|\) would be more reasonable.

Good news: we can hit that, thanks to linearity.

Let's assume weights are all integers. Any item with weight \(w_i\) is equivalent to \(w_i\) subitems with unit weight (or \(-w_i\) elements with negative unit weight), and the same profit ratio \(r_i\), i.e., profit \(p_i / |w_i|\). The range of subitem weights is now a constant.

We could sample uniformly from the subitems with a Bernoulli for each subitem, but that's clearly linear time in the sum of weights, rather than the number of elements. If we wish to sample roughly \(m\) elements from a total weight \(W = \sum_i |w_i|,\) we can instead determine how many subitems (units of weight) to skip before sampling with a Geometric of success probability \(m / W.\) This shows us how to lift the integrality constraint on weights: sample from an Exponential with the same parameter \(m / W\)!

That helps, but we could still end up spending much more than constant time on very heavy elements. The trick is to deterministically special-case these elements: stash any element with large weight \(w_i \geq W / m\) to the side, exactly once. By Markov's inequality,[4] we know there aren't too many heavy elements: at most \(m.\)

Let's test this out

The heart of the estimation problem can be formalised as follows: given a list of elements \(i \in [n]\) with weight \(w_i \geq 0\), generate a sample of \(m \leq n\) elements ahead of time. After the sample has been generated, we want to accept an arbitrary predicate \(p \in \{0,1\}^n\) and estimate \(\sum_{i\in [n]} p(i) w_i.\)

We just had a sketch of an algorithm for this problem. Let's see what it looks like in Python. The initial sample logic has to determine the total weight, and sample items with probability proportional to their weight. Items heavier than the cutoff are not considered in the sample and instead saved to an auxiliary list.

sample.py
import random


def total_weight(items):
    return sum(weight for (_, weight) in items)


def sample_by_weight(items, rate, cutoff):
    """Samples from a list of (index, weight), with weight >= 0.

    Items with weight >= cutoff are taken with probability one.
    Others are sampled with rate `rate` / unit of weight.
    """
    sample = []
    large = []
    next_sample = random.expovariate(rate)
    for item in items:
        index, weight = item
        if weight >= cutoff:
            large.append(item)
        else:
            next_sample -= weight
            while next_sample <= 0:
                sample.append(index)
                next_sample += random.expovariate(rate)
    return sample, large

We can assemble the resulting sample (and list of "large" elements) to compute a lower bound on the weight of items that satisfy any predicate that's independent of the sampling decisions. The value for large elements is trivial: we have a list of all large elements. We can subtract the weight of all large elements from the total item weight, and determine how much we have to extrapolate up.

extrapolate.py
import math


def hoeffding(n, alpha):
    """Determines how much we can expect a sample of n i.i.d. values
    sampled from a Bernoulli to differ, given an error rate of alpha.

    Given a sample X of n i.i.d. values from a Bernoulli distribution,
    let delta be \bar{X} - E[\bar{X}], the one-sided difference
    between the sample average value and the expected sample average.

    Hoeffding's upper bound (see below) is conservative when the
    empirical probability is close to 0 or 1 (trivially, it can yield
    confidence bounds that are outside [0, 1]!), but simple, and in
    general not much worse than tighter confidence intervals.

    P(delta >= eps) <= exp(-2 eps^2 n) = alpha
      -> -2 eps^2 n = ln alpha
     <->        eps = sqrt[-(ln alpha) / 2n ]

    """
    return math.sqrt(- math.log(alpha) / (2 * n))


def eval_weight(total_weight, sample, large, predicate, alpha):
    """Given a population's total weight, a memoryless sample (by weight)
    from the population's items, and large items that were
    deterministically picked, evaluates a lower bound for the sum of
    weights for items in the population that satisfy predicate.
    
    The lower bound is taken with error rate <= alpha.
    """
    large_sum = sum(weight for (index, weight) in large if predicate(index))
    # The remainder was up for sampling, unit of weight at a time.
    sampled_weight = total_weight - sum(weight for (_, weight) in large)
    if sampled_weight <= 0 or not sample:
        return large_sum
    # Empirical Bernoulli success rate over the weight-proportional sample.
    successes = sum(1 if predicate(x) else 0 for x in sample)
    empirical_rate = successes / len(sample)
    # One-sided Hoeffding correction gives a lower confidence bound on
    # the true rate; clamp at zero since the corrected rate can go negative.
    delta = hoeffding(len(sample), alpha)
    return large_sum + sampled_weight * max(0, empirical_rate - delta)

And finally, here's how we can sample from an arbitrary list of items, compute a lower bound on the weight of items that satisfy a predicate, and check that bound against the exact value.

lower_bound.py
def compare_bounds(items, rate, alpha, predicate):
    total = total_weight(items)
    # We expect a sample size of roughly rate * len(items), and
    # at most rate * len(items) large items.
    sample, large = sample_by_weight(items, rate, rate * total)
    lower_bound = eval_weight(total, sample, large, predicate, alpha)
    # Check if the lower bound is valid.
    actual = sum(weight for (index, weight) in items if predicate(index))
    return lower_bound <= actual + 1e-8, lower_bound, actual

How do we test that? Far too often, I see tests for randomised algorithms where the success rate is computed over randomly generated inputs. That's too weak! For example, this approach could lead us to accept that the identity function is a randomised sort function, with success probability \(\frac{1}{n!}.\)

The property we're looking for is that, for any input, the success rate (with the expectation over the pseudorandom sampling decisions) is as high as requested.

For a given input (list of items and predicate), we can use the Confidence sequence method (CSM) to confirm that the lower bound is valid at least \(1 - \alpha\) of the time.

csm_test.py
import sys

import csm  # Confidence sequence method driver, used as csm.csm_driver below.


def compare_bounds_generator(test_case, rate, alpha):
    items = [(i, w) for (i, (w, _)) in enumerate(test_case)]
    chosen = set(i for (i, (_, p)) in enumerate(test_case) if p)
    while True:
        yield compare_bounds(items, rate, alpha, lambda x: x in chosen)[0]


def check_bounds(test_case, rate, alpha):
    """Test case is a list of pairs of weight and predicate value
       rate is the sample rate
       alpha is the confidence parameter for the lower bound.
    """
    wanted = 1 - alpha  # The Hoeffding bound is conservative, so
                        # this should let csm_driver stop quickly.
    result = csm.csm_driver(compare_bounds_generator(test_case,
                                                     rate,
                                                     alpha),
                            wanted,
                            1e-6,  # Wrong conclusion with p < 1e-6.
                            file=sys.stderr
                            )
    stop, actual, *_ = result
    assert actual >= wanted, "Result: %s" % str(result)

With a false positive rate of at most one in a million,[5] we can run automated tests against check_bounds. I'll use Hypothesis to generate lists of pairs of weight and predicate value:

test_bounds.py
from hypothesis import given, settings, Verbosity
import hypothesis.strategies as st

@given(test_case=st.lists(st.tuples(st.floats(min_value=0, max_value=1),
                                    st.booleans())),
       rate=st.floats(min_value=1e-6, max_value=0.5),
       alpha=st.floats(min_value=0.05, max_value=0.25))
def test_bounds(test_case, rate, alpha):
    check_bounds(test_case, rate, alpha)

Bimodal inputs tend to be harder, so we can add a specialised test generator.

test_bimodal_bounds.py
@given(test_case=st.lists(st.tuples(st.one_of(st.just(0.1), st.just(1)),
                                    st.booleans())),
       rate=st.floats(min_value=1e-6, max_value=0.5),
       alpha=st.floats(min_value=0.05, max_value=0.25))
def test_bimodal_bounds(test_case, rate, alpha):
    check_bounds(test_case, rate, alpha)

Again, we use Hypothesis to generate inputs, and the Confidence sequence method (available in C, Common Lisp, and Python) to check that the lower bound is valid with probability at least \(1 - \alpha\). The CSM tests for this statistical property with power 1 and adjustable error rate (in our case, one in a million): we only provide a generator for success values, and the driver adaptively determines when it makes sense to make a call and stop generating more data, while accounting for multiple hypothesis testing.

TL;DR: the estimation algorithm for individual sampling passes works, and the combination of Hypothesis and Confidence Sequence Method lets us painlessly test for a statistical property.

We can iteratively use this sampling procedure to derive lower and (symmetrically) upper bounds for the optimal Lagrange multiplier \(\lambda^\star,\) and Hoeffding's inequality lets us control the probability that the lower and upper bounds are valid. Typically, we'd use a tolerance of \(\sqrt{\log(n) / n},\) for an error rate of \(1 / n^2.\) I prefer to simply use something like \(7 / \sqrt{n}:\) the error rate is then less than \(10^{-42},\) orders of magnitude smaller than the probability of hardware failure in any given nanosecond.[6] We can still check for failure of our Las Vegas algorithm, but if something went wrong, it's much more likely that we detected a hardware failure than anything else. It's like running SuperPi to stress test a computer, except the work is useful. 😉

Repeat as necessary to solve a knapsack

How many sampling passes do we need? Our bounds are in terms of the sum of item weight: if we let our sample size be in \(\Theta(\sqrt{n}),\) the sum of weights \(\sum_i |w_i|\) for unfathomed items (that may or may not be chosen depending on the exact optimal multiplier \(\lambda^\star\) in the current range) will very probably shrink by a factor of \(\Omega(n^{1/4}).\) The initial sum can, in the worst case, be exponentially larger than the bitlength of the input, so even a division by \(n^{1/4}\) isn't necessarily that great.

I intend to apply this Lazy Linear Knapsack algorithm on subproblems in a more interesting solver, and I know that the sum of weights is bounded by the size of the initial problem, so that's good enough for me! After a constant (\(\approx 4\)) number of passes, the difference in item weight between the lower and upper bound on \(\lambda^\star\) should also be at most 1. One or two additional passes will get me near optimality (e.g., within \(10^{-4}\)), and the lower bound on \(\lambda^\star\) should thus yield a super-optimal solution that's infeasible by at most \(10^{-4},\) which is, for my intended usage (again), good enough.

Given an optimal enough \(\lambda^\star,\) we can construct an explicit solution in one pass, plus a simple fixup for critical items. This Lazy Knapsack seems pretty reasonable for parallel or GPU computing: each sampling pass only needs to read the items (i.e., no partitioning-like shuffling) before writing a fraction of the data to a sample buffer, and we only need a constant number of passes (around 6 or 7) in the worst case.


  1. It's more like a fractional percentile, but you know what I mean: the value such that the distribution function at that point equals \(k / n\). 

  2. Binomial bounds offer even stronger confidence intervals when the estimate is close to 0 or 1 (where Hoeffding's bound would yield a confidence interval that juts outside \([0, 1]\)), but don't impact worst-case performance. 

  3. Thanks to Hoeffding's inequality, again. 

  4. That's a troll. I think any self-respecting computer person would rather see it as a sort of pigeonhole argument. 

  5. We're juggling a handful of error rates here. We're checking whether the success rate for the Lazy Knapsack sampling subroutine is at least as high as \(1 - \alpha,\) as requested in the test parameters, and we're doing so with another randomised procedure that will give an incorrect conclusion at most once every one million invocations. 

  6. This classic Google study found 8% of DIMMs hit at least one error per year; that's more than one single-bit error every \(10^9\) DIMM-second, and they're mostly hard errors. More recently, Facebook reported that uncorrectable errors affect 0.03% of servers each month; that's more than one uncorrectable error every \(10^{10}\) server-second. If we performed one statistical test every nanosecond, the probability of memory failure alone would still dominate statistical errors by \(10^{20}!\) 

Joe MarshallAfraid of Recursion

· 74 days ago
Here's a trick I learned several years ago for transferring time-series data. In the case in question, I needed to transfer a bunch of timestamped records, but the server had a quirk to it. If you asked for too many records at once, it would simply return an error code and give up on your request. There was no way to know beforehand how many records might exist in any given time span, so you could get an error code on nearly any request, unless it was for a very short time span. On the other hand, many requests for long time spans would succeed because they had few records in them. Despite this quirk, the code was really simple:
List<record> xfer (Timestamp start, Timestamp end) {
    try {
        return tryFetchRecords(start, end);
    } catch (TooManyRecordsException e) {
        Timestamp mid = (start + end)/2;
        List<record> firstHalf = xfer (start, mid);
        List<record> secondHalf = xfer (mid, end);
        firstHalf.addAll(secondHalf);
        return firstHalf;
    }
}
On any request, if the server returned the error code, we would simply bisect the time span, recursively ask for each half separately, and combine the two halves. Should the bisected time span still contain too many records, the time span would be bisected again. The recursion would continue until the time span was small enough that the server could honor the request and return some records. The recursion would then unwind, combining the returned records into larger and larger lists until we had all the records for our original time span. Since the time span would be cut in half on each recurrence, the depth of recursion would be proportional to the logarithm (base 2) of the total number of records, which would be a reasonably small number even with an enormous number of records.

It's certainly possible to avoid recursion and do this iteratively by some form of paging, but the code would be slightly more complex. The server is not very cooperative, so there is no easy way to determine an appropriate page size beforehand, and the server doesn't support a “paging token” to help keep track of progress. The recursive solution finds an appropriate transfer size by trial and error, and keeps track of progress more or less automatically. An iterative paging solution would have to do these things more explicitly and this would make the iterative code a bit more complex. And why add any complexity when it isn't really necessary?

I thought this solution was really cool when I first saw it. I've used this trick for transferring time series data many times. It makes the server very simple to write because the protocol requires so little of it. It simply has to refuse to answer requests that return too many results. The client code is just about the 10 lines above.

But when I first suggest this code to people I usually get “push back” (that is, until they see it work in action, then they usually get on board with it). People seem unsure about the use of recursion and want a more complex protocol where the client and server negotiate a page size or cooperatively pass a paging token back and forth on each request and response. Their eyes glaze over as soon as they see the recursive call. They want to avoid recursion just because it's recursive.

I've seen “aversion to recursion” happen in a lot of circumstances, not just this one. Recursion isn't the solution to everything. No tool solves all problems. But it is an important tool that often offers elegant solutions to many problems. Programmers shouldn't be afraid of using it when it is appropriate.

Joe MarshallUnsyndicated blog

· 75 days ago
I've noticed that my blog posts are replicated in Planet Lisp and Planet Scheme, and here I am spamming them with random math stuff. So I'm creating a new blog, Jrm's Random Blog, where I can feel free to post about math, science, computers in general, and whatever else bugs me, without spamming the Lisp and Scheme readers. I'll keep posting to Abstract Heresies, but try to keep it more Lisp and computer language focused.

Joe MarshallGroups, semigroups, monoids, and computers

· 77 days ago
The day after I rant about mathematicians, I make a math post. “Do I contradict myself? Very well, then, I contradict myself, I am large, I contain multitudes.” — Walt Whitman

A group is a mathematical concept. It's pretty simple. It consists of a set, G, and an operation, *, which can be used to combine any two elements of G. What the set contains is not that important. It is the * operation we're interested in, and we can usually swap out G for another set without causing too many problems other than having to change the type signature of *. There are four axioms that * must obey
  • Closure—combining any two elements of G using * just gives you another element in G.
    Note that this means you can build an arbitrary binary tree of combinations: e.g. (* (* a b) (* (* c d) e)). These trees will always be like a tree of cons cells. In some sense, the closure axiom is equivalent to saying that all the elements of G have the same type and that the * operator operates on values of that type and produces values of that type. The closure axiom along with the binary operation means that we can reduce any tree of combinations to a single value.
  • Associativity—(* (* a b) c) = (* a (* b c)) for any a, b, and c. This implies that you can take any arbitrary tree of combinations: e.g. (* (* a b) (* (* c d) e)) and simply flatten it into a list (* a b c d e), or given the flat sequence (* a b c d e) we can add parentheses anywhere we like: (* a (* b c) d e). If we stop here and only have the closure and associativity axioms, we have what is called a “semigroup”. You can use the * operation to “fold” a semigroup down to a single value, or to keep an accumulator and incrementally fold elements into the accumulator.
  • Identity element—There has to be an identity element id such that (* id x) = (* x id) = x for all x. It will be unique. If you see the identity object in a combination (* a b id c d), you can simply remove it: (* a b c d). The identity element also comes in handy as an initial value when you are folding a sequence. If you have some concept that would be a group except it doesn't have an identity element, then you can often just make one up and add it to the set G.
  • Inverse element—For every element in G there has to be another element that, when combined with the first, gives you the identity. So if a is an element in G, there has to be some other element, call it b, such that (* a b) = (* b a) = id. The inverse element is usually notated with a little -1: a⁻¹. If you have an element in a combination right next to its inverse: (* a x x⁻¹ c), you can combine the element and its inverse to get the identity: (* a id c), and then remove the identity: (* a c).
Frequently you run into something that obeys all the axioms but the inverse element axiom. This is called a monoid. A monoid is very much like a group except that you can get "stuck" when manipulating it if you run into one of the non-invertible elements because there's no inverse to "undo" it. There are certain things about monoids that are true only “if the appropriate inverses exist”. You run into that qualifier a lot when dealing with monoids. You don't need that qualifier if you are dealing with a group because they do exist by axiom. Or we could say that calling something a group is simply shorthand for adding “if the appropriate inverses exist” everywhere.

What does this have to do with computers? Consider the set of all subroutines with the operation of concatenation. It is closed — concatenating two subroutines gives you a third subroutine. It is associative — you just concatenate them linearly. There is an identity element, usually called no-op. And many, but not all, subroutines have inverses. So we have a monoid.

Consider the set of all strings with the operation of concatenation. It is closed, associative, the empty string is the identity element. It is a monoid.
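In Scheme, folding that monoid is one line; the empty string is the identity and therefore the natural initial value (fold-left as in R6RS):

(fold-left string-append "" '("con" "cat" "enate"))
;; => "concatenate"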

Consider the set of functions whose input type is the same as the result type with the operation of composition. It is closed, associative, the identity function is the identity element. It is a monoid. If we consider only the subset of functions that also have inverses, we have a group. This particular monoid or group comes in especially handy because composition of functions is so useful.

Consider the set of invertible 2x2 matrices with integer components, a determinant of 1 or -1, and the operation of matrix multiply. It is closed, associative, there is an identity matrix, and I already said just consider the invertible ones. It forms a group. This group comes in handy for implementing arbitrary precision arithmetic. (Thanks to Bradley Lucier for the correction of the condition on the determinant. This makes the matrix continue to have integer components upon inversion, keeping things closed.)

The permutations of a list form a group. The integers under addition form a group.

These things are everywhere. And it isn't a coincidence. The concepts of a group, monoid, and semigroup are meant to capture the essence of what it is to have a foldable sequence of elements. (Can I complain about mathematicians here? They make up so much terminology and abstraction that it is virtually impossible to get at what they really mean. We're just talking about sequences of elements and trying to find some minimal axioms that you need to have to fold them, but trying to find literature that actually says that's what we're doing is like trying to pull hen's teeth.)

So what good are groups, monoids, and semigroups? Aside from the obvious fact that foldable sequences are ubiquitous and really useful, that is. Not immediately apparent from the axioms is that in addition to folding a sequence, you can transform a sequence into a different, but equivalent one. If the appropriate inverses exist (there's that phrase), you can “unfold” some or all elements of a sequence. So by judicious folding and unfolding, you can transform a sequence.

Here's an unusual abstract example. Consider a pipeline which has a set of nodes and communicates values of the same type between the nodes. Values accumulate at the nodes until they are transmitted to the next node in the pipeline. We start with all the values in the initial node (on the right) and transmit them to the left:
(pipeline (node) (node) (node a b c))  ;; transmit the a
(pipeline (node) (node a) (node b c))  ;; transmit the b
(pipeline (node) (node a b) (node c))  ;; transmit the a
(pipeline (node a) (node b) (node c))  ;; transmit the c
(pipeline (node a) (node b c) (node))  ;; transmit the b
(pipeline (node a b) (node c) (node))  ;; transmit the c
(pipeline (node a b c) (node) (node))  ;; done
If the values we transmit are drawn from a group, we can replace each node with the group's * operator:
(* identity identity (* a b c))  ;; transmit the a
(* identity (* identity a) (* b c))  ;; transmit the b
(* identity (* a b) (* identity c))  ;; transmit the a
(* (* identity a) (* identity  b) (* identity c))  ;; transmit the c
(* (* identity a) (* b c) identity)  ;; transmit the b
(* (* a b) (* identity c) identity)  ;; transmit the c
(* (* a b c) identity identity)  ;; done
The astute reader will notice that all we're doing is making use of the associativity axiom and moving the parenthesis around so that the values seem to move between the different nodes. But we preserve the invariant that the “value” of the entire pipeline doesn't change as the values move. The * operator need not be concatenate, which would give simple queuing behavior, but can be any operator satisfying the axioms giving us much more interesting pipelines. One implementation of arbitrary precision arithmetic transmits Möbius transformations along just such a pipeline to refine the upper and lower limits of a computed approximation. In this implementation, the * operator is the composition of Möbius transformations.

Here's a more concrete example. If you have a series of nested functions: (f (g x)) and both f and g take and return the same type, rewrite it as ((compose f g) x) and use a little group theory on it.
(f (g x))
((compose f g) x)
;; or more explicitly
((fold-left compose identity (list f g)) x)
If the appropriate inverses exist, then there will be another function h such that (compose f g) is equal to (compose h f), essentially allowing you to "slide" g to the left "through" f. It is relatively easy to see that h must be equivalent to (compose f g f⁻¹). Mathematicians say that h is conjugate to g. Conjugates always have a form like aba⁻¹. By finding conjugates, you can take a sequence and slide the elements left and right through other elements. This also allows you to fold things out of order. (Or in the pipeline example, transmit items out of order.) If we were left folding into an accumulator, folding h before f is equivalent to folding g after f. Another way of looking at it is this. Suppose we're standing to the left of f and looking through the “lens” of f at g. h is what g "looks like" when viewed through f.

If we want, we can define slide such that (compose slide (compose f g)) is equivalent to (compose h f). slide is (compose h f g⁻¹ f⁻¹). (This isn't a generic slide sequence, it only works on (compose f g). It ought to be an identity because (compose f g) is equivalent to (compose h f).) I complained that mathematicians provided too few concrete examples, so here is a concrete example using list permutations:
> (reverse (rotate-left '(a b c d)))
(a d c b)

;; rewrite as explicit fold-left of compose
> ((fold-left compose identity (list reverse rotate-left)) '(a b c d))
(a d c b)

;; sliding rotate-left through reverse turns it into rotate-right
> ((fold-left compose identity (list rotate-right reverse)) '(a b c d))
(a d c b)

;; A sequence that when composed with (list reverse rotate-left) turns it into
;; (rotate-right reverse)
> (define slide 
    (fold-left compose identity (list rotate-right reverse rotate-right reverse)))
slide

> ((fold-left compose identity (list slide reverse rotate-left)) '(a b c d))
(a d c b)

;; rewrite back to direct procedure calls
> (rotate-right (reverse '(a b c d)))
(a d c b)

;; and slide ought to be an identity
> ((fold-left compose identity (list slide)) '(a b c d))
(a b c d)
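
(To reproduce the transcript, the helper procedures it assumes can be sketched as follows; the names come from the transcript, the particular definitions are mine, with fold-left as in R6RS.)

(define (identity x) x)
(define (compose f g) (lambda (x) (f (g x))))
(define (fold-left f init lst)
  (if (null? lst)
      init
      (fold-left f (f init (car lst)) (cdr lst))))
(define (rotate-left lst)            ; (a b c d) -> (b c d a)
  (append (cdr lst) (list (car lst))))
(define (rotate-right lst)           ; (a b c d) -> (d a b c)
  (let ((r (reverse lst)))
    (cons (car r) (reverse (cdr r)))))
(define (swap-first-two lst)         ; (a b c d) -> (b a c d)
  (cons (cadr lst) (cons (car lst) (cddr lst))))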

Or suppose you have (f (g x)), but for some reason you want (g (f x)) (which would, in general, be a different value unless f and g happen to commute). Again, rewrite (f (g x)) as ((compose f g) x) and apply a little group theory. If the appropriate inverses exist, there will be a function commute-fg such that (compose commute-fg (compose f g)) is equivalent to (compose g f). With a little thought, you can see that commute-fg is equivalent to (compose g f g⁻¹ f⁻¹). (Again, this isn't a generic commute, it only causes this specific f and g to commute.) commute-fg is called a commutator because it makes f and g commute. Commutators always have the form aba⁻¹b⁻¹. By finding commutators and inserting them in the right place, you can take a sequence and swap adjacent elements. Again, a concrete example with lists:
;; an illustration of what swap-first-two does
> (swap-first-two '(a b c d))
(b a c d)

;; we're given
> (reverse (swap-first-two '(a b c d)))
(d c a b)

;; but we want, for some reason to reverse first
> (swap-first-two (reverse '(a b c d)))
(c d b a)

;; rewrite as fold-left of compose
> ((fold-left compose identity (list reverse swap-first-two)) '(a b c d))
(d c a b)

;; define our commutator
;; note that swap-first-two and reverse are their own inverses
> (define commute-fg
    (fold-left compose identity (list swap-first-two reverse swap-first-two reverse)))

;; make f and g commute
;; observe that it returns the desired result
> ((fold-left compose identity (list commute-fg reverse swap-first-two)) '(a b c d))
(c d b a)

There are two interesting things here. First, notice that in both examples I convert (f (g x)) to ((fold-left compose identity (list f g)) x) and then proceed to ignore x and just consider (fold-left compose identity (list f g)) as if x didn't exist. I've abstracted away the x. (Of course I have to eventually supply the x if I want an answer, but it only comes back at the last moment.) Second, notice that although slide and commute-fg are foldable sequences, I use them as if they were higher order functions operating on the foldable sequence (compose f g) to transform it, first into (compose h f), second into (compose g f). This second thing is a neat trick. We're taking a function that operates on lists and treating it as if it were a higher-order function that operates on functions. This is called the “action” of slide and commute-fg because it appears as if elements of the set G of our group can “act” directly on other elements.

Every element in the underlying set G of a group has an action associated with it which operates directly on other elements in G. This is an important concept in group theory. Now earlier I said that the actual elements of G don't matter much, so the action must be more closely tied to the operator *. And if we swap out G for another set we'll still have the same actions, they'll just be associated with the elements of the new set (in an isomorphic way). The actions are pretty abstract.

There's a lot more one could say about the actions. They are a rich source of interesting math. My brain is getting fatigued with all this abstraction, so I'll leave the topic be for now.

If group theory is about the essence of what it means to have a foldable sequence, then category theory is about the essence of composition. They offer two somewhat different approaches to similar material. What do you do with sequences but compose them? What comes from composition but a sequence? Many concepts in group theory carry over into category theory. Naturally a completely different set of terminology is used, but the concepts are there.

But that's enough group theory for today and category theory can wait until later posts.


For older items, see the Planet Lisp Archives.


Last updated: 2020-04-02 09:56