Planet Lisp

Joe Marshall: "No 'x' in 'Nixon'"

· 2 hours ago
I just had a phone screen where they asked if I could write a palindrome detector. I wrote a recursive and an iterative version. I like the recursive solution because it is so simple. There are only three equations:

  • An empty string is a palindrome.
  • A string with one character is a palindrome.
  • If a string starts and ends with the same character, and the substring between them is a palindrome, then the entire string is a palindrome.
I wrote them the iterative version so I wouldn't look like an idiot.  The iterative version is generally faster with a standard compiler, but it is more complex.  Not much more complex, but still, you have to keep track of two indexes into the string and how close they've gotten to each other.

The recursive solution is just more elegant.
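
For reference, a minimal sketch of the recursive version described above might look like this (my reconstruction, not the author's code):

(defun palindrome-p (string &optional (start 0) (end (length string)))
  "Return T if STRING reads the same forwards and backwards between START and END."
  (or (>= start (1- end))                               ; empty or single-character string
      (and (char= (char string start) (char string (1- end)))
           (palindrome-p string (1+ start) (1- end))))) ; the substring between them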

Zach Beane: Working with letsencrypt's certbot for a Lisp webserver

· 24 hours ago

Every 90 days my letsencrypt certificate expires and I renew it manually. I have to cut and paste data from the certbot script into repl forms to serve the right data on the right path and it’s such a pain that sometimes I put it off until after it expires, and people email me about it, and I feel bad.

The manual process looks something like this (not my actual domain or challenges):

# certbot certonly --manual -d my.example.com

Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for my.example.com

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create a file containing just this data:

uaut.gj61B3f7oYQcWZSF4kxS4OFh8KQlsDVtPXrw60Tkj2JLW7RtZaVE0MIWwiEKRlxph7SaLwp1ETFjaGDUKN

And make it available on your web server at this URL:

http://my.example.com/.well-known/acme-challenge/SpZa7sf4QMEFoF7lKh7aT4EZjNWSVcur2jODPQkgExa

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue

There’s no way that letsencrypt’s certbot program will automatically get support for my custom hunchentoot “framework” and I don’t really want to look into adding it myself.

For a while I thought about writing some Expect functionality in SBCL - sb-ext:run-program already supports a :pty option so it wouldn’t be super-difficult. With a theoretical cl-expect you could spawn certbot with the right options, slurp out the verification secret and URL via the process’s standard output stream, and call whatever code you need to serve the secret at that location.

I realized today there’s a halfway solution that takes the worst cutting and pasting out of the loop. The unix command script --flush starts an interactive session where all input and output is saved to a file. My Lisp webserver can then read certbot program output from that file and configure the webserver automagically for me. I still have to manually start certbot and enter a few REPL commands, but it’s easier than before.

Here’s the core of the Lisp side of things for processing the script file:

(defun scrape-letsencrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels (...)
        (loop
          (skip-past "Create a file containing")
          (next-line)
          (setf secret (read-trimmed-line))
          (skip-past "make it available")
          (next-line)
          (setf path (substring-after ".well-known" (read-trimmed-line))))))))

This could be done just as well (perhaps better) with cl-ppcre, but I didn’t want to pull it in as a dependency.

Here are the functions from labels:

((next-line ()
   (let ((line (read-line stream nil)))
     (when (null line)
       (unless (and secret path)
         (error "Didn't find a secret and path anywhere in ~S"
                file))
       (return-from scrape-letsencrypt-interaction
         (values secret path)))
     line))
 (skip-past (string)
   (loop
     (let ((line (next-line)))
       (when (search string line)
         (return)))))
 (read-trimmed-line ()
   (string-trim '(#\Return)
                (next-line)))
 (substring-after (string target)
   (let ((pos (search string target)))
     (unless pos
       (error "Could not find ~S in ~S" string target))
     (subseq target pos))))

The goal here is to only look at the last secret and URL in the script output, so the main loop keeps track of what it’s seen so far and next-line returns those values at EOF. The output also has ugly trailing ^M noise so read-trimmed-line takes care of that.

Here’s the whole thing all together:

(defun scrape-letsencrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels ((next-line ()
                 (let ((line (read-line stream nil)))
                   (when (null line)
                     (unless (and secret path)
                       (error "Didn't find a secret and path anywhere in ~S"
                              file))
                     (return-from scrape-letsencrypt-interaction
                       (values secret path)))
                   line))
               (skip-past (string)
                 (loop
                   (let ((line (next-line)))
                     (when (search string line)
                       (return)))))
               (read-trimmed-line ()
                 (string-trim '(#\Return)
                              (next-line)))
               (substring-after (string target)
                 (let ((pos (search string target)))
                   (unless pos
                     (error "Could not find ~S in ~S" string target))
                   (subseq target pos))))
        (loop
          (skip-past "Create a file containing")
          (next-line)
          (setf secret (read-trimmed-line))
          (skip-past "make it available")
          (next-line)
          (setf path (substring-after ".well-known" (read-trimmed-line))))))))
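
For illustration, a hypothetical call on the captured session log (the file name is invented; the values correspond to the example challenge above) returns the secret and the path as two values:

(scrape-letsencrypt-interaction "/tmp/certbot-session.log")
;; => "uaut.gj61B3f7oYQcWZSF4kxS4OFh8KQlsDVtPXrw60Tkj2JLW7RtZaVE0MIWwiEKRlxph7SaLwp1ETFjaGDUKN"
;;    ".well-known/acme-challenge/SpZa7sf4QMEFoF7lKh7aT4EZjNWSVcur2jODPQkgExa"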

I don’t mind shaving only half a yak when it feels like useful progress!

Someday I’ll get around to a proper Expect-like thing…

Wimpie Nortje: HTTP Routing libraries for Hunchentoot

· 5 days ago

Hunchentoot is a web server and not a heavyweight web development framework. It provides methods to implement URL dispatch for static files, directory paths, and regexes, very much like the mainstream web servers. It does not provide an easy way to define routes for a REST API. When you use Hunchentoot without a web development framework, you will probably want something to make route definition easier.

There are a few options for building REST APIs available in frameworks, Hunchentoot derivatives or other web servers but I wanted to implement REST routes with the original Hunchentoot web server. I found three libraries that can do this: simple-routes, Froute and easy-routes.

Simple-routes is the simplest and easiest to use. Routes are defined in a list similar to Hunchentoot's *dispatch-table*. It supports variables in the URL segments but there is no support for middleware[1] type functionality.

Froute is the most powerful of the three. It is based on CLOS and is designed so it can be used with any web server although only a Hunchentoot connector is currently implemented. Routes are defined as CLOS classes and even though middleware is not a specific feature the documentation gives an example on how to use class inheritance to implement such functionality. The power of being CLOS based also makes this library the most complex to use.

Easy-routes has the concept of decorators which are functions that execute before the route body so it can be used to implement middleware functionality. Unlike Clack's middleware which are defined at a central place for all routes, decorators need to be applied to each route handler individually. It's not quite there, but close enough.

The lack of middleware options disqualified simple-routes for me and Froute looked like it provides everything I need, and more, but with much greater complexity than easy-routes. I decided to use easy-routes with the option to switch to Froute when I needed the extra capability.

Hunchentoot takes an "acceptor" argument at startup. Easy-routes provides two options: easy-routes-acceptor and routes-acceptor. Easy-routes-acceptor first tries the route handlers and, if no suitable handler is found, falls back to the normal Hunchentoot request handlers. The routes-acceptor executes only the route handlers and returns a 404 NOT FOUND error if no suitable handler is found.

I use routes-acceptor because it ensures that only requests with explicitly defined handlers are handled. With the easy-routes-acceptor it is too easy to create a security hole with some default Hunchentoot request handler that catches non-existent routes. It can be burdensome to use this approach for handling static files but I run Hunchentoot behind Nginx which also handles the static files.
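
Roughly, defining a route with a decorator and starting the server looks like the following (my sketch based on the easy-routes README, not code from this article):

;; A decorator runs before the route body and must call NEXT to continue.
(defun @json (next)
  (setf (hunchentoot:content-type*) "application/json")
  (funcall next))

;; :id in the path is bound as a variable in the route body.
(easy-routes:defroute get-user ("/users/:id" :method :get :decorators (@json)) ()
  (format nil "{\"id\": ~a}" id))

;; routes-acceptor serves only explicitly defined routes; anything else is a 404.
(hunchentoot:start (make-instance 'easy-routes:routes-acceptor :port 8080))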

The table summarises the main features I investigated:

                                simple-routes                        easy-routes                                  Froute
Web server                      Hunchentoot                          Hunchentoot                                  Hunchentoot (can be expanded to others)
REST routes                     Yes                                  Yes                                          Yes
Argument extraction from URL    Yes                                  Yes                                          Yes
Dispatch based on HTTP method   Yes                                  Yes                                          Yes
Middleware                      No                                   Decorators                                   CLOS inheritance
Fallback for undefined routes   Hunchentoot easy-handler framework   None or Hunchentoot easy-handler framework   None
Learning curve                  Negligible                           Minimal                                      Medium. Requires some CLOS knowledge.
  1. Middleware are functions that run before or after the main request handler. They are called as part of the request handling process and not by the handler. This makes them ideal to handle general functions like setting up a database connection, performing authorisation or any task which is not part of a particular request handler.

Paul Khuong: A Multiset of Observations With Constant-time Sample Mean and Variance

· 5 days ago

Fixed notation issues in the “Faster multiset updates” section. Thank you Joonas.

Let's say you have a multiset (bag) of "reals" (floats or rationals), where each value is a sampled observation. It's easy to augment any implementation of the multiset ADT to also return the sample mean of the values in the multiset in constant time: track the sum of values in the multiset, as they are individually added and removed. This requires one accumulator and a counter for the number of observations in the multiset (i.e., constant space), and adds a constant time overhead to each update.

It’s not as simple when you also need the sample variance of the multiset \(X\), i.e.,

\[\frac{1}{n - 1} \sum\sb{x \in X} (x - \hat{x})\sp{2},\]

where \(n = |X|\) is the sample size and \(\hat{x}\) is the sample mean \(\sum\sb{x\in X} x/n,\) ideally with constant query time, and constant space and update time overhead.

One could try to apply the textbook equality

\[s\sp{2} = \frac{1}{n(n-1)}\left[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2}\right].\]

However, as Knuth notes in TAoCP volume 2, this expression loses a lot of precision to round-off in floating point: in extreme cases, the difference might be negative (and we know the variance is never negative). More commonly, we’ll lose precision when the sampled values are clustered around a large mean. For example, the sample standard deviation of 1e8 and 1e8 - 1 is the same as that of 0 and 1 (roughly 0.71). However, the expression above would evaluate that to 0.0, even in double precision: while 1e8 is comfortably within range for double floats, its square 1e16 is outside the range where all integers are represented exactly.
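
To make the round-off concrete (my arithmetic, not from the original post): for those two observations, the exact value of the bracketed difference is

\[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2} = 2\left[10\sp{16} + (10\sp{8}-1)\sp{2}\right] - (2\cdot 10\sp{8} - 1)\sp{2} = 1,\]

which, divided by \(n(n-1) = 2\), gives the correct sample variance of 1/2. Both terms, however, are close to \(4\times 10\sp{16},\) where adjacent doubles are 8 apart, so each rounds to the same representable value and the computed difference is 0.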

Knuth refers to a better behaved recurrence by Welford, where a running sample mean is subtracted from each new observation before squaring. John Cook has a C++ implementation of the recurrence that adds observations to a sample variance in constant time. In Python, this streaming algorithm looks like this.

streaming_variance.py
class StreamingVariance:
    def __init__(self):
        self.n = 0
        self.mean = 0
        self.var_sum = 0  # centered 2nd moment (~variance of the sum of observations)

    def observe(self, v):
        self.n += 1
        if self.n == 1:
            self.mean = v
            return
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0

That’s all we need for insert-only multisets, but it does not handle removals; if only we had removals, we could always implement updates (replacements) as a removal and an insertion.

Luckily, StreamingVariance.observe looks invertible. It shouldn’t be hard to recover the previous sample mean, given v, and, given the current and previous sample means, we can re-evaluate (v - old_mean) * (v - self.mean) and subtract it from self.var_sum.

Let \(\hat{x}\sp{\prime}\) be the sample mean after observe(v). We can derive the previous sample mean \(\hat{x}\) from \(v\):

\[(n - 1)\hat{x} = n\hat{x}\sp{\prime} - v \Leftrightarrow \hat{x} = \hat{x}\sp{\prime} + \frac{\hat{x}\sp{\prime} - v}{n-1}.\]

This invertibility means that we can undo calls to observe in LIFO order. We can’t handle arbitrary multiset updates, only a stack of observations. That’s still better than nothing.

variance_stack.py
class VarianceStack:
    def __init__(self):
        self.n = 0
        self.mean = 0
        self.var_sum = 0  # variance of the sum

    def push(self, v):
        self.n += 1
        if self.n == 1:
            self.mean = v
            return
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        assert self.n > 0
        if self.n == 1:
            self.n = 0
            self.mean = 0
            self.var_sum = 0
            return
        next_n = self.n - 1
        old_mean = self.mean
        self.mean = old_mean + (old_mean - v) / next_n
        # var_sum should never be negative, clamp it so.
        self.var_sum = max(0, self.var_sum - (v - self.mean) * (v - old_mean))
        self.n -= 1

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0

Before going any further, let’s test this.

Testing the VarianceStack

The best way to test the VarianceStack is to execute a series of push and pop calls, and compare the results of get_mean and get_variance with batch reference implementations.

I could hardcode calls in unit tests. However, that quickly hits diminishing returns in terms of marginal coverage VS developer time. Instead, I’ll be lazy, completely skip unit tests, and rely on Hypothesis, its high level “stateful” testing API in particular.

We’ll keep track of the values pushed and popped off the observation stack in the driver: we must make sure they’re matched in LIFO order, and we need the stack’s contents to compute the reference mean and variance. We’ll also want to compare the results with reference implementations, modulo some numerical noise. Let’s try to be aggressive and bound the number of float values between the reference and the actual results.

variance_stack_driver.py
import struct
import unittest
import hypothesis.strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, precondition, rule


def float_bits(x: float) -> int:
    bits = struct.unpack('=q', struct.pack('=d', x))[0]
    significand = bits % (1 << 63)
    # ~significand = -1 - significand. We need that instead of just
    # -significand to handle signed zeros.
    return significand if bits >= 0 else ~significand


FLOAT_DISTANCE = 2**10


def assert_almost_equal(x, y, max_delta=FLOAT_DISTANCE):
    assert abs(float_bits(x) - float_bits(y)) <= max_delta


class VarianceStackDriver(RuleBasedStateMachine):
    def __init__(self):
        super(VarianceStackDriver, self).__init__()
        self.values = []

    @rule(v=st.floats(allow_nan=False, allow_infinity=False))
    def push(self, v):
        self.values.append(v)

    # Don't generate `pop()` calls when the stack is empty.
    @precondition(lambda self: self.values)
    @rule()
    def pop(self):
        self.values.pop()

    def reference_mean(self):
        if self.values:
            return sum(self.values) / len(self.values)
        return 0

    def reference_variance(self):
        n = len(self.values)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.values) / n

    @invariant()
    def mean_matches(self):
        assert_almost_equal(self.reference_mean(), self.reference_mean())

    @invariant()
    def variance_matches(self):
        assert_almost_equal(self.reference_variance(),
                            self.reference_variance())


StackTest = VarianceStackDriver.TestCase

if __name__ == '__main__':
    unittest.main()

This initial driver does not even use the VarianceStack yet. All it does is push values to the reference stack, pop values when the stack has something to pop, and check that the reference implementations match themselves after each call: I want to first shake out any bug in the test harness itself.

Not surprisingly, Hypothesis does find an issue in the reference implementation:

Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=2.6815615859885194e+154)
state.teardown()

We get a numerical OverflowError in reference_variance: 2.68...e154 / 2 is slightly greater than sqrt(sys.float_info.max) = 1.3407807929942596e+154, so taking the square of that value errors out instead of returning infinity.

Let’s start by clamping the range of the generated floats.

variance_stack_driver.py
import math
import sys
...


MAX_RANGE = math.sqrt(sys.float_info.max) / 2

FLOAT_STRATEGY = st.floats(min_value=-MAX_RANGE, max_value=MAX_RANGE)

class VarianceStackDriver(RuleBasedStateMachine):
    ...

    @rule(v=FLOAT_STRATEGY)
    def push(self, v):
        self.variance_stack.push(v)
        self.values.append(v)

    ...

Now that the test harness doesn’t find fault in itself, let’s hook in the VarianceStack, and see what happens when only push calls are generated (i.e., first test only the standard streaming variance algorithm).

variance_stack_driver.py
def assert_almost_equal(x, y, max_delta=FLOAT_DISTANCE):
    distance = abs(float_bits(x) - float_bits(y))
    # Print out some useful information on failure.
    assert distance <= max_delta, '%.18g != %.18g (%f)' % (x, y, math.log(distance, 2))

class VarianceStackDriver(RuleBasedStateMachine):
    def __init__(self):
        super(VarianceStackDriver, self).__init__()
        self.values = []
        self.variance_stack = VarianceStack()

    @rule(v=FLOAT_STRATEGY)
    def push(self, v):
        self.variance_stack.push(v)
        self.values.append(v)

    # Never generate `pop()`
    @precondition(lambda self: self.values and False)
    @rule()
    def pop(self):
        self.values.pop()

    def reference_mean(self):
        if self.values:
            return sum(self.values) / len(self.values)
        return 0

    def reference_variance(self):
        n = len(self.values)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.values) / n

    @invariant()
    def mean_matches(self):
        assert_almost_equal(self.reference_mean(),
                            self.variance_stack.get_mean())

    @invariant()
    def variance_matches(self):
        assert_almost_equal(self.reference_variance(),
                            self.variance_stack.get_variance())

This already fails horribly.

Falsifying example:
state = VarianceStackDriver()
state.push(v=1.0)
state.push(v=1.488565707357403e+138)
state.teardown()
F

The reference finds a variance of 5.54e275, which is very much not the streaming computation’s 1.108e276. We can manually check that the reference is wrong: it’s missing the n - 1 correction term in the denominator.

We should use this updated reference.

variance_stack_driver.py
class VarianceStackDriver(RuleBasedStateMachine):
    ...

    def reference_variance(self):
        n = len(self.values)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.values) / (n - 1)

Let’s now re-enable calls to pop().

variance_stack_driver.py
class VarianceStackDriver(RuleBasedStateMachine):
    ...

    @precondition(lambda self: self.values)
    @rule()
    def pop(self):
        self.variance_stack.pop(self.values[-1])
        self.values.pop()

And now things fail in new and excitingly numerical ways.

Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=0.00014142319560050964)
state.push(v=14188.9609375)
state.pop()
state.teardown()
F

This counter-example fails with the online variance returning 0.0 instead of 1e-8. That’s not unexpected: removing (the square of) a large value from a running sum spells catastrophic cancellation. It’s also not that bad for my use case, where I don’t expect to observe very large values.

Another problem for our test harness is that floats are very dense around 0.0, and I’m ok with small (around 1e-8) absolute error because the input and output will be single floats.

Let’s relax assert_almost_equal, and restrict generated observations to fall in \([-2\sp{12}, 2\sp{12}].\)

variance_stack_driver.py
# Let values be off by ~1 single float ULP
FLOAT_DISTANCE = 2**32

# or by 1e-8
ABSOLUTE_EPS = 1e-8


def assert_almost_equal(x, y, max_delta=FLOAT_DISTANCE, abs_eps=ABSOLUTE_EPS):
    delta = abs(x - y)
    distance = abs(float_bits(x) - float_bits(y))
    assert distance <= max_delta or delta <= abs_eps, '%.18g != %.18g (%f)' % (
        x, y, math.log(distance, 2))


# Avoid generating very large observations.
MAX_RANGE = 2**12

FLOAT_STRATEGY = st.floats(width=32, min_value=-MAX_RANGE, max_value=MAX_RANGE)

With all these tweaks to make sure we generate easy (i.e., interesting) test cases, Hypothesis fails to find a failure after its default time budget.

I’m willing to call that a victory.

From stack to full multiset

We have tested code to undo updates in Welford’s classic streaming variance algorithm. Unfortunately, inverting pushes away only works for LIFO edits, and we’re looking for arbitrary inserts and removals (and updates) to a multiset of observations.

However, both the mean \(\hat{x} = \sum\sb{x\in X} x/n\) and the centered second moment \(\sum\sb{x\in X}(x - \hat{x})\sp{2}\) are order-independent: they’re just sums over all observations. Disregarding round-off, we’ll find the same mean and second moment regardless of the order in which the observations were pushed in. Thus, whenever we wish to remove an observation from the multiset, we can assume it was the last one added to the estimates, and pop it off.

We think we know how to implement running mean and variance for a multiset of observations. How do we test that with Hypothesis?

The hardest part about testing dictionary (map)-like interfaces is making sure to generate valid identifiers when removing values. As it turns out, Hypothesis has built-in support for this important use case, with its Bundles. We’ll use that to test a dictionary from observation name to observation value, augmented to keep track of the current mean and variance of all values.

variance_multiset_driver.py
class VarianceBag(VarianceStack):
    def update(self, old, new):
        # Replace one instance of `old` with `new` by
        # removing `old` and inserting `new`.
        self.pop(old)
        self.push(new)


class VarianceBagDriver(RuleBasedStateMachine):
    keys = Bundle("keys")

    def __init__(self):
        super(VarianceBagDriver, self).__init__()
        self.entries = dict()
        self.variance_bag = VarianceBag()

    @rule(target=keys, k=st.binary(), v=FLOAT_STRATEGY)
    def add_entry(self, k, v):
        if k in self.entries:
            self.update_entry(k, v)
            return multiple()

        self.entries[k] = v
        self.variance_bag.push(v)
        return k

    @rule(k=consumes(keys))
    def del_entry(self, k):
        self.variance_bag.pop(self.entries[k])
        del self.entries[k]

    @rule(k=keys, v=FLOAT_STRATEGY)
    def update_entry(self, k, v):
        self.variance_bag.update(self.entries[k], v)
        self.entries[k] = v

    def reference_mean(self):
        if self.entries:
            return sum(self.entries.values()) / len(self.entries)
        return 0

    def reference_variance(self):
        n = len(self.entries)
        if n <= 1:
            return 0
        mean = self.reference_mean()
        return sum(pow(x - mean, 2) for x in self.entries.values()) / (n - 1)

    @invariant()
    def mean_matches(self):
        assert_almost_equal(self.reference_mean(),
                            self.variance_bag.get_mean())

    @invariant()
    def variance_matches(self):
        assert_almost_equal(self.reference_variance(),
                            self.variance_bag.get_variance())


BagTest = VarianceBagDriver.TestCase

Each call to add_entry will either go to update_entry if the key already exists, or add an observation to the dictionary and streaming estimator. If we have a new key, it is added to the keys Bundle; calls to del_entry and update_entry draw keys from this Bundle. When we remove an entry, it’s also consumed from the keys Bundle.

Hypothesis finds no fault with our new implementation of dictionary-with-variance, but update seems like it could be much faster and more numerically stable, and I intend to mostly use this data structure for calls to update.

Faster multiset updates

The key operation for my use-case is to update one observation by replacing its old value with a new one. We can maintain the estimator by popping old away and pushing new in, but this business with updating the number of observations n and rescaling everything seems like a lot of numerical trouble.

We should be able to do better.

We’re replacing the multiset of sampled observations \(X\) with \(X\sp{\prime} = X \setminus \{\textrm{old}\} \cup \{\textrm{new}\}.\) It’s easy to maintain the mean after this update: \(\hat{x}\sp{\prime} = \hat{x} + (\textrm{new} - \textrm{old})/n.\)

The update to self.var_sum, the sum of squared differences from the mean, is trickier. We start with \(v = \sum\sb{x\in X} (x - \hat{x})\sp{2},\) and we wish to find \(v\sp{\prime} = \sum\sb{x\sp{\prime}\in X\sp{\prime}} (x\sp{\prime} - \hat{x}\sp{\prime})\sp{2}.\)

Let \(\delta = \textrm{new} - \textrm{old}\) and \(\delta\sb{\hat{x}} = \delta/n.\) We have \[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} [(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2},\] and \[[(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2} = (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} (x - \hat{x}) + \delta\sb{\hat{x}}\sp{2}.\]

We can reassociate the sum, and find

\[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} \left(\sum\sb{x \in X} x - \hat{x}\right) + n \delta\sb{\hat{x}}\sp{2}\]

Once we notice that \(\hat{x} = \sum\sb{x\in X} x/n,\) it’s clear that the middle term sums to zero, and we find the very reasonable

\[v\sb{\hat{x}\sp{\prime}} = \sum\sb{x\in X} (x - \hat{x})\sp{2} + n \delta\sb{\hat{x}}\sp{2} = v + \delta \delta\sb{\hat{x}}.\]

This new accumulator \(v\sb{\hat{x}\sp{\prime}}\) corresponds to the sum of the squared differences between the old observations \(X\) and the new mean \(\hat{x}\sp{\prime}\). We still have to update one observation from old to new. The remaining adjustment to \(v\) (self.var_sum) corresponds to going from \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) to \((\textrm{new} - \hat{x}\sp{\prime})\sp{2},\) where \(\textrm{new} = \textrm{old} + \delta.\)

After a bit of algebra, we get \[(\textrm{new} - \hat{x}\sp{\prime})\sp{2} = [(\textrm{old} - \hat{x}\sp{\prime}) + \delta]\sp{2} = (\textrm{old} - \hat{x}\sp{\prime})\sp{2} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}).\]

The adjusted \(v\sb{\hat{x}\sp{\prime}}\) already includes \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) in its sum, so we only have to add the last term to obtain the final updated self.var_sum

\[v\sp{\prime} = v\sb{\hat{x}\sp{\prime}} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}) = v + \delta [2 (\textrm{old} - \hat{x}) + \delta - \delta\sb{\hat{x}}].\]

That’s our final implementation for VarianceBag.update, for which Hypothesis also fails to find failures.

VarianceBag.py
class VarianceBag(VarianceStack):
    def update(self, old, new):
        assert self.n > 0
        if self.n == 1:
            self.mean = new
            self.var_sum = 0
            return
        delta = new - old
        old_mean = self.mean
        delta_mean = delta / self.n
        self.mean += delta_mean

        adjustment = delta * (2 * (old - old_mean) + (delta - delta_mean))
        self.var_sum = max(0, self.var_sum + adjustment)

How much do you trust testing?

We have automated property-based tests and some human-checked proofs. Ship it?

I was initially going to ask a CAS to check my reformulations, but the implicit \(\forall\) looked messy. Instead, I decided to check the induction hypothesis implicit in VarianceBag.update, and enumerate all cases up to a certain number of values with Z3 in IPython.

In [1]: from z3 import *
In [2]: x, y, z, new_x = Reals("x y z new_x")
In [3]: mean = (x + y + z) / 3
In [4]: var_sum = sum((v - mean) * (v - mean) for v in (x, y, z))
In [5]: delta = new_x - x
In [6]: new_mean = mean + delta / 3
In [7]: delta_mean = delta / 3
In [8]: adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
In [9]: new_var_sum = var_sum + adjustment

# We have our expressions. Let's check equivalence for mean, then var_sum
In [10]: s = Solver() 
In [11]: s.push()
In [12]: s.add(new_mean != (new_x + y + z) / 3)
In [13]: s.check()
Out[13]: unsat  # No counter example of size 3 for the updated mean
In [14]: s.pop()

In [15]: s.push()
In [16]: s.add(new_mean == (new_x + y + z) / 3)  # We know the mean matches
In [17]: s.add(new_var_sum != sum((v - new_mean) * (v - new_mean) for v in (new_x, y, z)))
In [18]: s.check()
Out[18]: unsat  # No counter example of size 3 for the updated variance

Given this script, it’s a small matter of programming to generalise from 3 values (x, y, and z) to any fixed number of values, and generate all small cases up to, e.g., 10 values.

z3-check.py
def updated_expressions(vars, new_x):
    x = vars[0]
    num_var = len(vars)
    mean = sum(vars) / num_var
    var_sum = sum((v - mean) * (v - mean) for v in vars)
    delta = new_x - x
    delta_mean = delta / num_var
    new_mean = mean + delta_mean
    adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
    new_var_sum = var_sum + adjustment
    return new_mean, new_var_sum


def test_num_var(num_var):
    assert num_var > 0
    vars = [Real('x_%i' % i) for i in range(0, num_var)]
    new_x = Real('new_x')

    new_mean, new_var_sum = updated_expressions(vars, new_x)
    new_vars = [new_x] + vars[1:]
    s = Solver()
    s.push()
    s.add(new_mean != sum(new_vars) / num_var)
    result = s.check()
    print('updated mean %s' % result)
    if result != unsat:
        print(s.model())
        return False
    s.pop()

    s.push()
    s.add(new_mean == sum(new_vars) / num_var)
    s.add(new_var_sum != sum(
        (v - new_mean) * (v - new_mean) for v in new_vars))
    result = s.check()
    print('updated variance %s' % result)
    if result != unsat:
        print(s.model())
        return False
    return True


for i in range(1, 11):
    print('testing n=%i' % i)
    if test_num_var(i):
        print('OK')
    else:
        print('FAIL %i' % i)
        break

I find the most important thing when it comes to using automated proofs is to insert errors and confirm we can find the bugs we’re looking for.

I did that by manually mutating the expressions for new_mean and new_var_sum in updated_expressions. This let me find a simple bug in the initial implementation of test_num_var: I used if not result instead of result != unsat, and both sat and unsat are truthy. The code initially failed to flag a failure when z3 found a counter-example for our correctness condition!

And now I’m satisfied

I have code to augment an arbitrary multiset or dictionary with a running estimate of the mean and variance; that code is based on a classic recurrence, with some new math checked by hand, with automated tests, and with some exhaustive checking of small inputs (to which I claim most bugs can be reduced).

I’m now pretty sure the code works, but there’s another more obviously correct way to solve that update problem. This 2008 report by Philippe Pébay[1] presents formulas to compute the mean, variance, and arbitrary moments in one pass, and shows how to combine accumulators, a useful operation in parallel computing.

We could use these formulas to augment an arbitrary \(k\)-ary tree and re-combine the merged accumulator as we go back up the (search) tree from the modified leaf to the root. The update would be much more stable (we only add and merge observations), and incur logarithmic time overhead (with linear space overhead). However, given the same time budget, and a logarithmic space overhead, we could also implement the constant-time update with arbitrary precision software floats, and probably guarantee even better precision.

The constant-time update I described in this post demanded more effort to convince myself of its correctness, but I think it’s always a better option than an augmented tree for serial code, especially if initial values are available to populate the accumulators with batch-computed mean and variance. I’m pretty sure the code works, and it’s up in this gist. I’ll be re-implementing it in C++ because that’s the language used by the project that led me to this problem; feel free to steal that gist.


  1. There’s also a 2016 journal article by Pébay and others with numerical experiments, but I failed to implement their simpler-looking scalar update...

Quicklisp news: November 2019 Quicklisp dist update now available

· 6 days ago
New projects:
  • cacau — Test Runner in Common Lisp. — GPLv3
  • cl-elastic — Elasticsearch client for Common Lisp — MIT
  • cl-libiio — Common Lisp bindings for libiio. — GPLv3
  • cl-transmission — A Common Lisp library to interface with transmission using its rpc — MIT
  • dartscltools — More or less useful utilities — MIT
  • plain-odbc — a lisp wrapper around the ODBC library — BSD
  • sanity-clause — Sanity clause is a data contract and validation library. — LGPLv3
  • tfm — A TeX Font Metrics library for Common Lisp — BSD
  • tmpdir — Simple library to create temporary directories — MIT license
  • trivial-method-combinations — Portability library for accessing method combination objects — Unlicense
  • winlock — File locking using the Windows API. — MIT
  • with-user-abort — provides an easy way to catch ctrl+c. useful for making binaries. — BSD 3-Clause
Updated projects: april, assert-p, assertion-error, async-process, babel, bordeaux-threads, bp, ceramic, chameleon, chanl, cl+ssl, cl-async, cl-bnf, cl-collider, cl-dct, cl-editdistance, cl-environments, cl-fond, cl-i18n, cl-kraken, cl-kyoto-cabinet, cl-lzma, cl-naive-store, cl-opengl, cl-patterns, cl-prevalence, cl-pslib, cl-rdkafka, cl-redis, cl-sdl2, cl-smtp, cl-store, cl-str, cl-tiled, cl-toml, cl-torrents, closer-mop, clsql-local-time, clss, clx, clx-xembed, coleslaw, common-lisp-actors, conf, croatoan, curry-compose-reader-macros, dartsclhashtree, declt, deploy, drakma, dufy, eager-future2, eventbus, femlisp, fiasco, gendl, generic-cl, graph, horner, hunchentoot-multi-acceptor, jsown, lake, let-plus, lichat-protocol, linear-programming, literate-lisp, lyrics, magicl, maiden, markup, mcclim, metabang-bind, mmap, nodgui, num-utils, overlord, parachute, paren6, petalisp, phoe-toolbox, polisher, postmodern, py4cl, qbase64, qt-libs, qtools, quilc, quri, qvm, replic, roan, rutils, sc-extensions, scalpl, sel, serapeum, shadow, simple-parallel-tasks, simplet, skeleton-creator, sly, staple, static-vectors, studio-client, stumpwm, swap-bytes, sxql, the-cost-of-nothing, trivia, trivial-benchmark, trivial-jumptables, trivial-package-local-nicknames, trivial-ssh, type-r, umbra, vernacular, woo, wookie, xml-emitter, youtube.

To get this update, use: (ql:update-dist "quicklisp")

ABCL Dev: Stuffing an @Armedbear

· 8 days ago
Notes on stuffing the Bear into a single jar with ABCL-AIO after T-day for Black Friday.

Happy Thanksgiving!

Lispers.de: Hamburg Lispers Meetup, Monday, 2nd December 2019

· 10 days ago

Hamburg will see another gathering of Lispers on 2nd December 2019. We meet around 19:00 (“7 p. m.”) at Ristorante Opera, Dammtorstr. 7.

This is an informal event for all Lisp experience levels, and about any Lisp dialect.

This time, we'll likely talk about Advent of Code and some Lisp opportunities.

ABCL Dev: Unleashing the Bear past Java 11

· 14 days ago
Against the falling canvas of 2019, the Bear has freely thrust ABCL
1.6.0 onto the world stage.

This release fully supports building and running on Java 6, 7, 8, and
11 via the openjdk runtimes.  By building from our
source, it is rumored that the User may utilize additional
hosting platforms such as Java 13.

Support for Microsoft Windows platforms has not been thoroughly tested
at this point, but we intend to add it to our automated cloud-based build
in the process of releasing a stabilizing ABCL 1.6.1 build due by the
end of this year.

Thanks to all past, present, and future contributors to the Bear;
here's to hackin' our way out of the prison of the Java
language for over a decade.


Didier Verna: Quickref 3.0 "The Alchemist" is released

· 16 days ago

Following up with the latest release of Declt, I'm also happy to announce the release of Quickref 3.0 "The Alchemist", our reference manuals aggregator for Quicklisp libraries. The official website has been updated yesterday with both the latest Quicklisp version, and the new Declt, resulting in the documentation of 1792 Common Lisp libraries.

Vsevolod Dyomkin: Programming Algorithms: Strings

· 16 days ago

It may not be immediately obvious why the whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to these challenges:

  • indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays
  • there's a lot of string-specific algorithms that deserve their own chapter
  • finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework

In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to reiterate, strings should store their length. Netstrings are a notable take on the idea of length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netstring for the string foo.

More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed or variable-length byte sequences. The latter kind of character encoding raises a challenging question: what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve the character by index. Different language ecosystems address this issue differently, and the majority are, unfortunately, broken in one aspect or another. Overall, there are two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such a representation will be 32-bit UTF-32, resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach is to utilize a more advanced data structure. The naive variant is a list, which implies an unacceptable slowdown of the character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions is to utilize the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12.5% overhead). Calculating the character position may be performed in a small number of steps with the help of an operation that is infamous in close circles — Population count aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via the logcount Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke count 1 on them. At least, this is the case for SBCL:

CL-USER> (disassemble (lambda (x)
                        (declare (type (simple-array bit) x))
                        (count 1 x)))

; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2:       F3480FB8FA       POPCNT RDI, RDX

The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current that will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped less than n characters. If the difference is sufficiently small (say, below 10) we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2 we can jump ahead n characters again (this will repeat at most 3 times as the maximum byte-length of a character is 4), and if it's below n/2 we can jump n/2 characters. And if we overshoot we can reverse the direction of the next jump or search. You can see where it's heading: if at each step (or, at least, at each 4th step) we are constantly half dividing our numbers this means O(log n) complexity. That's the worst performance for this function we can get, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte — just 2 operations, or 4 bytes — 8 ops.

Here is the prototype of the char-index operation implemented according to the described algorithm (without the implementation of the mb-linear-char-index that performs the final linear scan):

(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
      (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i)))
             (diff (- i cnt)))
        (cond
          ((= cnt i) (return (+ off i)))
          ((< diff *mb-threshold*) (return (mb-linear-char-index
                                            string diff off)))
          ((< cnt (floor i 2)) (:+ off i)
                               (:- i cnt))
          (t (:+ off (floor i 2))
             (:- i cnt)))))))

The length of such a string may be calculated by performing the popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))

It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode code-points.

Basic String-Related Optimizations

Strings are often subject to subsequencing, so an efficient implementation may use structure sharing. As we remember, in Lisp, this is accessible via the displaced arrays mechanism (and a convenience RUTILS function slice that we have already used in the code above). Yet, structure sharing should be utilized with care as it opens a possibility for action-at-a-distance bugs if the derived string is modified, which results in parallel modification of the original. Though, strings are rarely modified in-place so, even in its basic form (without mandatory immutability), the approach works well. Moreover, some programming language environments make strings immutable by default. In such cases, to perform on-the-fly string modification (or rather, creation) such patterns as the Java StringBuilder are used, which creates the string from parts by first accumulating them in a list and then, when necessary, concatenating the list's contents into a single final string. An alternative approach is string formatting (the format function in Lisp) that is a higher-level interface, which still needs to utilize some underlying mutation/combination mechanism.
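
As a quick illustration of the displaced-array mechanism mentioned above (my example, not the book's code):

(let* ((str (copy-seq "hello world"))
       (sub (make-array 5 :element-type (array-element-type str)
                          :displaced-to str
                          :displaced-index-offset 6)))
  (setf (char sub 0) #\W)  ; modifying the shared "slice"...
  (values sub str))        ; ...also modifies the original
;; => "World", "hello World"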

Another important string-related technology is interning. It is a space-saving measure to avoid duplicating the same strings over and over again, which operates by putting a string in a table and using its index afterwards. This approach also enables efficient equality comparison. Interning is performed by the compiler implicitly for all constant strings (in the special segment of the program's memory called "string table"/sstab), and may also be used explicitly. In Lisp, there's a standard function, intern, for this. Lisp symbols use interned strings as their names. Another variant of interning is string pooling. The difference is that interning uses a global string table while the pools may be local.
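
For instance (a trivial illustration, not from the book):

(eq (intern "FOO") (intern "FOO"))      ; => T, both calls return the same symbol
(eq (copy-seq "FOO") (copy-seq "FOO"))  ; => NIL, two distinct string objects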

Strings in the Editor

Now, let's consider situations in which representing strings as arrays doesn't work. The primary one is in the editor, i.e. when constant random modification is the norm. There's another not so obvious requirement related to editing: handling potentially arbitrarily long strings that still need to be dynamically modified. Have you tried opening a hundred-megabyte text document in your favorite editor? You'd better not, unless you're a Vim user :) Finally, an additional limitation on handling strings in the editor is posed when we allow concurrent modification. This we'll discuss in the chapter on concurrent algorithms.

So, why doesn't an array work well as a string backend in the editor? Because of the content relocation required by all edit operations. O(n) editing is, obviously, not acceptable. What to do? There are several more advanced approaches:

  1. The simplest change will be, once again, to use an array of arrays. For example, one per line. This will not change the general complexity of O(n) but, at least, will reduce n significantly. The issue is that it will still depend on the length of the line, so, for the not-so-rare degenerate case when there are few or no linebreaks, the performance will seriously deteriorate. And, moreover, having observable performance differences between editing different paragraphs of the text is not user-friendly at all.
  2. A more advanced approach would be to use trees, reducing access time to O(log n). There are many different kinds of trees and, in fact, only a few may work as efficient string representations. Among them, a popular data structure for representing strings is the Rope. It's a binary tree where each leaf holds a substring and its length, and each intermediate node further holds the sum of the lengths of all the leaves in its left subtree. It's a more-or-less classic application of binary trees to a storage problem, so we won't spend more time on it here. Suffice it to say that it has the expected binary-tree performance of O(log n) for all operations, provided that we keep it balanced. It's an ok alternative to a simple array, but, for such a specialized problem, we can do better with a custom solution.
  3. And the custom solution is to return to arrays. There's one clever way to use them that works very well for dynamic strings. It is called a Gap buffer. This structure is an array (buffer) with a gap in the middle. I.e., let's imagine that we have a text of n characters. The gap buffer will have a length of n + k where k is the gap size — some value, derived from practice, that may fluctuate in the process of string modification. You can recognize this gap as the position of the cursor in the text. Insertion in the editor is performed exactly at this place, so it's O(1). Afterwards, the gap will shrink by 1 character, so, at some point, we'll have to resize the array if there are too many insertions and the gap shrinks below some minimum size (maybe below 1). The deletion operation acts in exactly the opposite way, growing the gap at one of its sides. The gap buffer is an approach that is especially suited for normal editing — a process that has its own pace. It also allows the system to represent multiple cursors by maintaining several gaps. Also, it may be a good idea to represent each paragraph as a gap buffer and use an array of them for the whole text. The gap buffer is a special case of the Zipper pattern that we'll discuss in the chapter on functional data structures (a minimal sketch of insertion follows below).
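
Here is that sketch of the O(1) insertion path (my illustration, not the book's code):

(defstruct (gap-buffer (:conc-name gb-))
  (buf (make-array 16 :element-type 'character :initial-element #\Space))
  (gap-start 0)   ; the cursor: the text before it occupies buf[0, gap-start)
  (gap-end 16))   ; the text after the cursor occupies buf[gap-end, end)

(defun gb-insert (gb char)
  "Insert CHAR at the cursor in O(1); grow the buffer only when the gap is exhausted."
  (when (= (gb-gap-start gb) (gb-gap-end gb))
    (let* ((old (gb-buf gb))
           (new (make-array (* 2 (length old)) :element-type 'character
                            :initial-element #\Space))
           (start (gb-gap-start gb)))
      ;; copy the text before the gap, then the text after it,
      ;; leaving a fresh gap in the middle
      (replace new old :end2 start)
      (replace new old :start1 (+ start (length old)) :start2 (gb-gap-end gb))
      (setf (gb-buf gb) new
            (gb-gap-end gb) (+ start (length old)))))
  (setf (char (gb-buf gb) (gb-gap-start gb)) char)
  (incf (gb-gap-start gb)))

Deletion would do the opposite: simply widen the gap by moving gap-start back (or gap-end forward), again in O(1).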

Substring Search

One of the most common string operations is substring search. For ordinary sequences we, usually, search for a single element, but strings, on the contrary, more often need subsequence search, which is more complex. A naive approach will start by looking for the first character, then trying to match the next character and the next, until either something ends or there's a mismatch. Unlike with hash-tables, the Lisp standard library has good support for string processing, including such operations as search (which, actually, operates on any sequence type) and mismatch that compares two strings from a chosen side and returns the position at which they start to diverge.

If we were to implement our own string-specific search, the most basic version would, probably, look like this:

(defun naive-match (pat str)
  (dotimes (i (- (1+ (length str)) (length pat)))
    ;; MISMATCH returns NIL on a full match of equal-length strings
    (when (= (or (mismatch pat (slice str i)) (length pat))
             (length pat))
      (return-from naive-match i))))

If the strings had been random, the probability that we are correctly matching each subsequent character would have dropped to 0 very fast. Even if we consider just the English alphabet, the probability of the first character being the same in 2 random strings is 1/26, the first and second — 1/676, and so on. And if we assume that the whole charset may be used, we'll have to substitute 26 with 256 or a greater value. So, in theory, such a naive approach has almost O(n) complexity, where n is the length of the string. Yet, the worst case has O(n * m), where m is the length of the pattern. Why? If we try to match a pattern a..ab against a string aa.....ab, at each position, we'll have to check the whole pattern until the last character mismatches. This may seem like an artificial example and, indeed, it rarely occurs. But, still, real-world strings are not so random and are much closer to this uniform corner case than to the random one. So, researchers have come up with a number of ways to improve subsequence matching performance. Those include the four well-known inventor-glorifying substring search algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp, and Aho-Corasick. Let's discuss each one of them and try to determine their interesting properties.

KMP

Knuth-Morris-Pratt is the most basic of these algorithms. Prior to performing the search, it examines the pattern to find repeated subsequences in it and creates a table containing, for each character of the pattern, the length of the prefix of the pattern that can be skipped if we have reached this character and failed the search at it. This table is also called the "failure function". The number in the table is calculated as the length of the proper suffix[1] of the pattern substring ending before the current character that matches the start of the pattern.

I'll repeat here the example provided in Wikipedia that explains the details of the table-building algorithm, as it's somewhat tricky.

Let's build the table for the pattern abcdabd. We set the table entry for the first char a to -1. To find the entry for b, we must discover a proper suffix of a which is also a prefix of the pattern. But there are no proper suffixes of a, so we set this entry to 0. To find the entry with index 2, we see that the substring ab has a proper suffix b. However b is not a prefix of the pattern. Therefore, we also set this entry to 0.

For the next entry, we first check the proper suffix of length 1, and it fails like in the previous case. Should we also check longer suffixes? No. We can formulate a shortcut rule: at each stage, we need to consider checking suffixes of a given size (1+ n) only if a valid suffix of size n was found at the previous stage and should not bother to check longer lengths. So we set the table entry for d to 0 also.

We pass to the subsequent character a. The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since d is not a prefix. But instead of setting the table entry to 0, we can do better by noting that a is also the first character of the pattern, and also that the corresponding character of the string can't be a (as we're calculating for the mismatch case). Thus there is no point in trying to match the pattern for this character again — we should begin 1 character ahead. This means that we may shift the pattern by match length plus one character, so we set the table entry to -1.

Considering now the next character b: though by inspection the longest substring would appear to be a, we still set the table entry to 0. The reasoning is similar to the previous case. b itself extends the prefix match begun with a, and we can assume that the corresponding character in the string is not b. So backtracking before it is pointless, but that character may still be a, hence we set the entry not to -1, but to 0, which means shifting the pattern by 1 character to the left and trying to match again.

Finally, for the last character d, the rule of the proper suffix matching the prefix applies, so we set the table entry to 2.

The resulting table is:

     a   b   c   d   a   b   d
    -1   0   0   0  -1   0   2

Here's the implementation of the table-building routine:

(defun kmp-table (pat)
  (let ((rez (make-array (length pat)))
        (i 0))  ; prefix length
    (:= (? rez 0) -1)
    (loop :for j :from 1 :below (length pat) :do
      (if (char= (char pat i) (char pat j))
          (:= (? rez j) (? rez i))
          (progn (:= (? rez j) i
                     i (? rez i))
                 (loop :while (and (>= i 0)
                                   (not (char= (char pat i) (char pat j))))
                   :do (:= i (? rez i)))))
      (:+ i))
    rez))

It can be proven that it runs in O(m). We won't show it here, so coming up with proper calculations is left as an exercise to the reader.

Now, the question is, how shall we use this table? Let's look at the code:

(defun kmp-match (pat str)
  (let ((s 0)
        (p 0)
        (ff (kmp-table pat)))
    (loop :while (< s (length str)) :do
      (if (char= (char pat p) (char str s))
          ;; if the current characters of the pattern and string match
          (if (= (1+ p) (length pat))
              ;; if we reached the end of the pattern - success
              (return (- s p))
              ;; otherwise, match the subsequent characters
              (:= p (1+ p)
                  s (1+ s)))
          ;; if the characters don't match
          (if (= -1 (? ff p))
              ;; shift the pattern for the whole length
              (:= p 0
                  ;; and skip to the next char in the string
                  s (1+ s))
              ;; try matching the current char again,
              ;; shifting the pattern to align the prefix
              ;; with the already matched part
              (:= p (? ff p)))))))

As we see, the index in the string (s) is incremented at each iteration except when the entry in the table is positive. In the latter case, we may examine the same character more than once but not more than we have advanced in the pattern. And the advancement in the pattern meant the same advancement in the string (as the match is required for the advancement). In other words, we can backtrack not more than n times over the whole algorithm runtime, so the worst-case number of operations in kmp-match is 2n, while the best case is just n. Thus, the total complexity is O(n + m).

And what will happen in our aa..ab example? The failure function for it will look like the following: -1 -1 -1 -1 (- m 2). Once we reach the first mismatch, we'll need to backtrack by 1 character, perform the comparison, which will mismatch, advance by 1 character (to b), mismatch again, backtrack by 1 character again, and so on until the end of the string. So this case will have almost the above-mentioned 2n runtime.

To conclude, the optimization of KMP lies in excluding unnecessary repetition of the same operations by memoizing the results of partial computations — both in table-building and matching parts. The next chapter of the book will be almost exclusively dedicated to studying this approach in algorithm design.

BM

The Boyer-Moore algorithm is conceptually similar to KMP, but it matches from the end of the pattern. It also builds a table, or rather three tables, but using a different set of rules, which also involve the characters of the string we search in. More precisely, there are two basic rules instead of one for KMP. Besides, there's another rule, called the Galil rule, that is required to ensure the linear complexity of the algorithm. Overall, BM is pretty complex in the implementation details and also requires more preprocessing than KMP, so these costs pay off only when the search is repeated multiple times for the same pattern.

Overall, BM may be faster on normal text (and the longer the pattern, the faster), while KMP will work best on strings with a small alphabet (like DNA). However, I would choose KMP as the default due to its relative simplicity and much better space utilization.

RK

Now, let's talk about alternative approaches that rely on techniques other than pattern preprocessing. They are usually used to find matches of multiple patterns in one go, as, in the single-pattern case, their performance will be worse than that of the previous algorithms.

The Rabin-Karp algorithm uses the idea of a rolling hash, i.e. a hash function that can be calculated incrementally. The RK hash is calculated for each substring of the length of the pattern. If we were to calculate a normal hash function like fnv-1, we'd need to use every character of the substring for the calculation, resulting in O(n * m) complexity of the whole procedure. The rolling hash is different: it requires just 2 operations at each step of the algorithm. As the "sliding window" moves over the string, we subtract the part of the hash corresponding to the character that is no longer part of the substring and add the value for the character that has just become part of it.

Here is the skeleton of the RK algorithm:

(defun rk-match (pat str)
  (let ((len (length pat))
        (phash (rk-hash pat)))
    (loop :for i :from len :to (length str)
          :for beg := (- i len)
          :for shash := (rk-hash (slice str 0 len))
            :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))
          :when (and (= phash shash)
                     (string= pat (slice str beg i)))
          :collect beg)))

A trivial rk-hash function would be just:

(defun rk-hash (str)
  (loop :for ch :across str :sum (char-code ch)))

But it is, obviously, not a good hash function as it doesn't ensure an even distribution of the hashes. Besides, in this case, we need the hash to be "reversible", i.e. to allow subtracting the contribution of the outgoing character. Usually, such hashes add position information into the mix. The original hash function for the RK algorithm is the Rabin fingerprint that uses random irreducible polynomials over Galois fields of order 2. The mathematical background needed to explain it is somewhat beyond the scope of this book. However, there are simpler alternatives such as the following:

(defun rk-hash (str)
  (assert (> (length str) 0))
  (let ((rez (char-code (char str 0))))
    (loop :for ch :across (slice str 1) :do
      (:= rez (+ (rem (* rez 256) 101)
                 (char-code ch))))
    (rem rez 101)))

Its basic idea is to treat the characters of the string as the coefficients of a polynomial in some base (here, 256), with the arithmetic performed modulo a prime (here, 101).

The implementation of rk-rehash for this function will look like this:

(defun rk-rehash (hash len ch1 ch2)
  (rem (+ (* (+ hash 101
                (- (rem (* (char-code ch1)
                           (expt 256 (1- len)))
                        101)))
             256)
          (char-code ch2))
       101))
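
A quick sanity check of the rolling property (using the polynomial rk-hash defined just above): rehashing the first window of the string "abc" by dropping #\a and adding #\c should produce the same value as hashing "bc" directly.

CL-USER> (rk-hash "ab")
84
CL-USER> (rk-hash "bc")
38
CL-USER> (rk-rehash (rk-hash "ab") 2 #\a #\c)
38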

Our rk-match could be used to find many matches of a single pattern. To adapt it for operating on multiple patterns at once, we'll just need to pre-calculate the hashes for all the patterns and look up the current rk-hash value in this set. Additional optimization of this lookup may be performed with the help of a Bloom filter — a stochastic data structure we'll discuss in more detail later.
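
For illustration, here is a possible sketch of such an adaptation for the simplest setup where all the patterns have the same length (that restriction is an assumption of this sketch, not of the approach): the pattern hashes are pre-collected into a hash-table mapping each hash to the patterns that produce it.

(defun rk-match-multiple (pats str)
  "Find the positions of all occurrences of the same-length patterns PATS in STR."
  (let ((len (length (first pats)))
        (phashes (make-hash-table)))
    (dolist (pat pats)
      (push pat (gethash (rk-hash pat) phashes)))
    (loop :for i :from len :to (length str)
          :for beg := (- i len)
          :for shash := (rk-hash (slice str 0 len))
            :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))
          :when (member (slice str beg i) (gethash shash phashes)
                        :test 'string=)
          :collect beg)))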

Finally, it's worth noting that there are other similar approaches to the rolling hash concept that trade some of the uniqueness properties of the hash function for the ability to produce hashes incrementally or have similar hashes for similar sequences. For instance, the Perceptual hash (phash) is used to find near-match images.

AC

Aho-Corasick is another algorithm that allows matching multiple strings at once. The preprocessing step of the algorithm constructs a Finite-State Machine (FSM) that resembles a trie with additional links between the various internal nodes. The FSM is a graph data structure that encodes possible states of the system and actions needed to transfer it from one state to the other.

The AC FSM is constructed in the following manner:

  1. Build a trie of all the words in the set of search patterns (the search dictionary). This trie represents the possible flows of the program when there's a successful character match at the current position. Add a loop edge for the root node.
  2. Add backlinks transforming the trie into a graph. The backlinks are used when a failed match occurs. They point either to the root of the trie or, if some prefix of a dictionary word coincides with a suffix of the currently matched path, to the end of the longest such prefix. The longest prefix is found using BFS over the trie. This is, basically, the same idea that KMP and BM use to avoid reexamining the already matched parts. Backlinks to earlier parts of the same word are also possible.

Here is the example FSM for the search dictionary '("the" "this" "that" "it" "his"):

Basically, it's just a trie with some backlinks to account for already processed prefixes. One more detail missing for this graph to be a complete FSM is an implicit backlink to the root node from all the nodes that don't have an explicit backlink.

The main loop of the algorithm is rather straightforward: examine each character and then

  • either follow one of the transitions (direct edge) if the character of the edge matches
  • or follow the backlink if it exists
  • or reset the FSM state — go to root
  • if the transition leads us to a terminal node, record the match(es) and also return to the root

As we see from the description, the complexity of the main loop is linear in the length of the string: at most 2 matches are performed for each character. The FSM construction is also linear in the total length of all the words in the search dictionary.
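
To make this description more concrete, here is a minimal sketch of just the matching loop (the trie and backlink construction is omitted). The ac-node structure and its slots are assumptions of this sketch: children is a char-to-node hash-table, back is the backlink, and word holds the dictionary word for terminal nodes.

(defstruct (ac-node (:conc-name ac-))
  (children (make-hash-table))  ; char -> ac-node
  back                          ; the backlink, i.e. failure link
  word)                         ; the dictionary word, for terminal nodes

(defun ac-match (root str)
  "Return a list of (end-position . word) pairs for all dictionary matches in STR."
  (let ((state root)
        (matches (list)))
    (dotimes (i (length str) (reverse matches))
      (let ((char (char str i)))
        ;; follow the backlinks until a transition for CHAR is found
        ;; or we give up and reset to the root
        (loop :until (or (gethash char (ac-children state))
                         (eql state root))
              :do (:= state (or (ac-back state) root)))
        (:= state (or (gethash char (ac-children state)) root))
        ;; record a match for every terminal node reachable via backlinks
        (loop :for node := state :then (ac-back node)
              :while node
              :when (ac-word node)
              :do (push (cons i (ac-word node)) matches))))))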

The algorithm is often used in antivirus software to perform an efficient search for code signatures against a database of known viruses. It also formed the basis of the original Unix command fgrep. And, from my point of view, it's the simplest to understand yet pretty powerful and versatile substring search algorithm that may be a default choice if you ever have to implement one yourself.

Regular Expressions

Searching is, probably, the most important advanced string operation. Moreover, it is not limited to mere substring search: matching of more complex patterns is in even higher demand. These patterns, which are called "regular expressions" or, simply, regexes, may include optional characters, repetition, alternatives, backreferences, etc. Regexes play an important role in the history of the Unix command line, being the principal technology of the famous grep utility and, later, the cornerstone of Perl. All modern programming languages support them either in the standard library or, as Lisp does, with high-quality third-party add-ons (cl-ppcre).

One of my favorite programming books, "Beautiful Code", has a chapter by Brian Kernighan on implementing simple regex matching, with code written by Rob Pike. It shows how easy it is to perform basic matching of the following patterns:

c    matches any literal character c
.    matches any single character
^    matches the beginning of the input string
$    matches the end of the input string
*    matches zero or more occurrences of the previous character

Below the C code from the book is translated into an equivalent Lisp version:

(defun match (regex text)
  "Search for REGEX anywhere in TEXT."
  (if (starts-with "^" regex)  ; STARTS-WITH is from RUTILS
      (match-here (slice regex 1) text)
      (dotimes (i (1+ (length text)))  ; also try the end-of-text position, as in the C original
        (when (match-here regex (slice text i))
          (return t)))))

(defun match-here (regex text)
  "Search for REGEX at beginning of TEXT."
  (cond ((= 0 (length regex))
         t)
        ((and (> (length regex) 1)
              (char= #\* (char regex 1)))
         (match-star (char regex 0) (slice regex 2) text))
        ((string= "$" regex)
         (= 0 (length text)))
        ((and (> (length text) 0)
              (member (char regex 0) (list #\. (char text 0)))
              (match-here (slice regex 1) (slice text 1))))))

(defun match-star (c regex text)
  "Search for C*REGEX at beginning of TEXT."
  (loop
    (when (match-here regex text) (return t))
    ;; check the current character before consuming it
    (unless (and (> (length text) 0)
                 (member c (list #\. (char text 0))))
      (return))
    (:= text (slice text 1))))

This is a greedy linear algorithm. However, modern regexes are much more advanced than this naive version. They include such features as register (capture) groups to record the spans of text that match a particular subpattern, backreferences, non-greedy repetition, and so on and so forth. Implementing those will require changing the simple linear algorithm into a backtracking one. And incorporating all of them would quickly transform the code above into a horrible unmaintainable mess: not even due to the number of cases that have to be supported, but due to the need to account for the complex interdependencies between them.

What's worse, sooner or later a need to resort to backtracking will arise anyway. Yet, the backtracking approach has a critical performance flaw: potential exponential runtime for certain input patterns. For instance, the Perl regex engine (PCRE) requires over sixty seconds to match a 30-character string aa..a against the pattern a?{15}a{15} (on standard hardware). While the alternative approach, which we'll discuss next, requires just twenty microseconds — a million times faster. And it handles a 100-character string of a similar kind in under 200 microseconds, while Perl would require over 10^15 years.[2]

This issue is quite severe and has even prompted Google to release their own regex library with strict linear performance guarantees — RE2. The goal of the library is not to be faster than all other engines under all circumstances. Although RE2 guarantees linear-time performance, the linear-time constant varies depending on the overhead entailed by its way of handling the regular expression. In a sense, RE2 behaves pessimistically whereas backtracking engines behave optimistically, so it can be outperformed in various situations. Also, its goal is not to implement all of the features offered by PCRE and other engines. As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and look-around assertions are not supported.

The figures above are taken from a seminal article by Russ Cox. He goes on to add:

Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.

The linear-time approach to regex matching relies on a technique similar to the one used in the Aho-Corasick algorithm — the FSM. Actually, if by regular expressions we mean the set of languages that abide by the rules of the regular grammars in the Chomsky hierarchy of languages, the FSM is their exact theoretical computation model. Here is how an FSM for the simple regex a*b$ might look:

Such an FSM is called an NFA (Nondeterministic Finite Automaton) as some of its states have more than one alternative successor. Another type of automaton is the DFA (Deterministic Finite Automaton), which permits at most one transition from each state for a given input. The method of transforming a regex into an NFA is called Thompson's construction. An NFA can be turned into a DFA by the powerset construction and then minimized to get an optimal automaton. DFAs are more efficient to execute than NFAs, because DFAs are only ever in one state at a time: they never have a choice of multiple next states. But the construction takes additional time. Anyway, both NFAs and DFAs guarantee linear-time execution.

Thompson's algorithm builds the NFA up from partial NFAs for each subexpression, with a different construction for each operator. The partial NFAs have no matching states: instead, they have one or more dangling arrows, pointing to nothing. The construction process will finish by connecting these arrows to a matching state.

  • The NFA for matching a single character e is a single node with a slot for an incoming arrow and a pending outgoing arrow labeled with e.
  • The NFA for the concatenation e1e2 connects the outgoing arrow of the e1 machine to the incoming arrow of the e2 machine.
  • The NFA for the alternation e1|e2 adds a new start state with a choice of either the e1 machine or the e2 machine.
  • The NFA for e? alternates the e machine with an empty path.
  • The NFA for e* uses the same alternation but loops a matching e machine back to the start.
  • The NFA for e+ also creates a loop, but one that requires passing through e at least once.

Counting the states in the above constructions, we can see that this technique creates exactly one state per character or metacharacter in the regular expression. The only exception is the constructs c{n} or c{n,m}, which require duplicating the single-character automaton n or m times respectively, but that is still a constant number. Therefore the number of states in the final NFA is at most equal to the length of the original regular expression plus some constant.

Implementation of Thompson's Construction

The core of the algorithm could be implemented very transparently with the help of the Lisp generic functions. However, to enable their application, we'd first need to transform the raw expression into a sexp (tree-based) form. Such representation is supported, for example, in the cl-ppcre library:

PPCRE> (parse-string "ab[0-9]+c$")
(:SEQUENCE "ab" (:GREEDY-REPETITION 1 NIL (:RANGE #\0 #\9)) #\c :END-ANCHOR)

Parsing is a whole separate topic that will be discussed next. But once we have performed it, we gain the possibility to straightforwardly implement Thompson's construction by traversing the parse tree and emitting, for each of its nodes, the corresponding part of the automaton. Lisp generic functions are a great tool for implementing such a transformation as they allow us to define methods that are selected based on either the type or the identity of the arguments. And those methods can be added independently, so the implementation is clear and extensible. We will define 2 generic functions: one to emit the automaton fragment (th-part) and another to help in transition selection (th-match).

First, let's define the state node of the FSM. We will use a linked graph representation for the automaton. So, a variable for the FSM in the code will point to its start node, and it will, in turn, reference the other nodes. There will also be a special node that will be responsible for recording the matches (*matched-state*).

(defstruct th-state
  transitions)

(defparameter *initial-state* nil)
(defparameter *matched-state* (make-th-state))

(defun th-state (&rest transitions)
  "A small convenience function to construct TH-STATE structs."
  (make-th-state :transitions (loop :for (cond state) :on transitions :by 'cddr
                                    :collect (pair cond state))))

And now, we can define the generic function that will emit the nodes:

(define-condition check-start-anchor () ())

(defgeneric th-part (next-state kind &rest args)
  (:documentation
   "Emit the TH-STATE structure of a certain KIND
    (which may be a keyword or a raw string) using the other ARGS
    and pointing to NEXT-STATE struct.")
  (:method (next-state (kind (eql :sequence)) &rest args)
    (apply 'th-part (if (rest args)
                        (apply 'th-part next-state :sequence (rest args))
                        next-state)
           (mklist (first args))))
  (:method (next-state (kind (eql :greedy-repetition)) &rest args)
    ;; this method can handle *, +, {n}, and {n,m} regex modifiers
    ;; in any case, there's a prefix sequence of fixed nonnegative length
    ;; of identical elements that should unconditionally match,
    ;; followed by a bounded or unbounded sequence that,
    ;; in case of a failed match, transitions to the next state
    (apply 'th-part
           (let ((*initial-state* next-state))
             (apply 'th-part next-state :sequence
                    (loop :repeat (or (second args) 1)
                          :collect (mklist (third args)))))
           :sequence (loop :repeat (first args)
                           :collect (mklist (third args)))))
  (:method (next-state (kind character) &rest args)
    (th-state kind next-state
              ;; Usually, *initial-state* will be nil, i.e. further computations
              ;; along this path will be aborted, but for some variants (? or *)
              ;; they will just continue normally to the next state.
              ;; The special variable allows controlling this as you can see in
              ;; the method for :greedy-repetition
              t *initial-state*))
  (:method (next-state (kind (eql :end-anchor)) &rest args)
    (th-state nil *matched-state*
              t *initial-state*))
  (:method (next-state (kind (eql :start-anchor)) &rest args)
    ;; This part is unique in that all the other parts consume the next character
    ;; (we're not implementing lookahead here), but this one shouldn't.
    ;; To implement such behavior without the additional complexity created by passing
    ;; the string being searched to this function (which we'll still, probably, need to do
    ;; later on, but were able to avoid so far), we can resort to a cool Lisp technique
    ;; of signaling a condition that can be handled specially in the top-level code
    (signal 'check-start-anchor)
    next-state))

Here, we have defined some of the methods of th-part that specialize on the basic :sequence of expressions, :greedy-repetition (regex * and +), a single character, and the symbols :start-anchor/:end-anchor (regexes ^ and $). As you can see, most of them dispatch (are chosen) based on the identity of the kind argument (using eql specializers), while the character-related method specializes on its class. As we develop this facility, we could add more methods with defmethod. Running th-part on the whole parse tree will produce the complete automaton; we don't need to do anything else!

To use the constructed FSM, we run it with the string as input. NFAs are endowed with the ability to guess perfectly when faced with a choice of next state: to run the NFA on a real computer, we must find a way to simulate this guessing. One way to do that is to guess one option, and if that doesn't work, try the other. A more efficient way to simulate perfect guessing is to follow all admissible paths simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter. In the worst case, the NFA might be in every state at each step, but this results in at worst a constant amount of work independent of the length of the string, so arbitrarily large input strings can be processed in linear time. The efficiency comes from tracking the set of reachable states but not which paths were used to reach them. In an NFA with n nodes, there can only be n reachable states at any step.

(defun run-nfa (nfa str)
  (let ((i 0)
        (start 0)
        (matches (list))
        (states (list nfa)))
    ;; this is the counterpart for the start-anchor signal
    (handler-bind ((check-start-anchor
                     ;; there's no sense to proceed matching a ^... regex
                     ;; if the string is not at its start
                     (lambda (c)
                       (declare (ignore c))
                       (when (> i 0) (return-from run-nfa)))))
      (dovec (char (concatenate 'vector str
                                #(nil)))  ; for handling end-anchor
        (let ((new-states (list)))
          (dolist (state states)
            (dolist (tr (? state 'transitions))
              (when (th-match tr char)
                (cond ((eql (rt tr) *matched-state*) (push start matches))
                      ((null (rt tr)))  ; ignore this transition
                      (t (pushnew (rt tr) new-states)))
                (return))))
          (if new-states
              (:= states new-states)
              (:= states (list nfa)
                  start nil)))
        (:+ i)
        (unless start (:= start i))))
    matches))

The th-match function may have methods to match a single char and a character range, as well as a particular predicate. Its implementation is trivial and left as an exercise to the reader.
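
For completeness, here is one possible sketch. Since th-state stores each transition as a pair of a condition and a target state, th-match can extract the condition and dispatch on it. The exact set of condition types (characters, the t/nil markers used above, and, say, predicates for ranges) is an assumption here; it depends on how the remaining th-part methods encode them.

(defun th-match (tr char)
  "Decide whether the transition TR (a pair of a condition and a target state)
accepts CHAR, which may be nil for the artificial end-of-string marker."
  (th-match-cond (lt tr) char))

(defgeneric th-match-cond (cond char)
  (:method ((cond character) char)
    (eql cond char))
  (:method ((cond (eql t)) char)
    (declare (ignore char))
    t)
  (:method ((cond null) char)
    ;; the nil condition is used by the end-anchor transition:
    ;; it should match only the artificial nil appended by RUN-NFA
    (null char))
  (:method ((cond function) char)
    ;; a possible encoding for ranges and other predicates
    (and char (funcall cond char))))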

Overall, interpreting an automaton is a simple and robust approach, yet if we want to squeeze out all the possible performance, we can compile it directly to machine code. This is much easier to do with a DFA, as it never faces a choice between several possible transitions, so the automaton can be compiled into a multi-level conditional or even a jump-table.
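
For illustration only, here is what such compiled code might look like for a toy DFA recognizing the regex ab*c anchored at both ends. This matcher is written by hand; a regex compiler would emit code of this shape (or a jump table) from the automaton automatically.

(defun match-ab*c (str)
  "Return T if STR matches the regex ab*c anchored at both ends."
  (let ((state :start))
    (loop :for char :across str :do
      (:= state (case state
                  (:start  (case char
                             (#\a :got-a)
                             (otherwise (return-from match-ab*c nil))))
                  (:got-a  (case char
                             (#\b :got-a)
                             (#\c :accept)
                             (otherwise (return-from match-ab*c nil))))
                  (:accept (return-from match-ab*c nil)))))
    (eql state :accept)))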

Grammars

Regexes are called "regular" for a reason: there's a corresponding mathematical formalism, "regular languages", that originates from the hierarchy of grammars compiled by Noam Chomsky. This hierarchy has 4 levels, each one allowing strictly more complex languages to be expressed with it. And for each level, there's an equivalent computation model:

  • Type-0: recursively-enumerable (or universal) grammars — Turing machine
  • Type-1: context-dependent (or context-sensitive) grammars — a linear bounded automaton
  • Type-2: context-free grammars — pushdown automaton
  • Type-3: regular grammars — FSM

We have already discussed the bottom level of the hierarchy. Regular languages are the most limited (and thus the simplest to implement): for example, you can write a regex a{15}b{15}, but you won't be able to express a{n}b{n} for an arbitrary n, i.e. ensure that b is repeated the same number of times as a. The top level corresponds to arbitrary programs, so all of programming science and lore, in general, is applicable to it. Now, let's talk about context-free grammars, which are another type that is heavily used in practice and even has a dedicated set of algorithms. Such grammars can be used not only for matching but also for parsing and generation. Parsing, as we have seen above, is the process of transforming a text that is assumed to follow the rules of a certain grammar into the structured form that corresponds to the rules applied to this text. And generation is the reverse process: applying the rules to obtain the text. This topic is huge and there's a lot of literature on it, including the famous Dragon Book.

Parsing is used for processing both artificial (including programming) and natural languages. And, although different sets of rules may be used, as well as different approaches for selecting a particular rule, the resulting structure will be a tree. In fact, formally, each grammar consists of 4 items:

  • The set of terminals (leaves of the parse tree) or tokens of the text: these could be words or characters for the natural language; keywords, identifiers, and literals for the programming language; etc.
  • The set of nonterminals — symbols used to name different items in the rules and in the resulting parse tree — the non-leaf nodes of the tree. These symbols are abstract and not encountered in the actual text. The examples of nonterminals could be VB (verb) or NP (noun phrase) in natural language parsing, and if-section or template-argument in parsing of C++ code.
  • The root symbol (which should be one of the nonterminals).
  • The set of production rules, each of which has two sides: a left-hand side (lhs) and a right-hand side (rhs). The left-hand side should contain at least one nonterminal, which is substituted with a number of terminals or nonterminals from the right-hand side. During generation, a rule allows the algorithm to select a particular surface form for an abstract nonterminal (for example, turn a nonterminal VB into a word do). During parsing, which is the reverse process, it allows the program, when it's looking at a particular substring, to replace it with a nonterminal and expand the tree structure. When the parsing process reaches the root symbol by performing such substitutions and expansions, it is considered complete.

Each compiler has to use parsing as a step in transforming the source into executable code. Also, parsing may be applied for any data format (for instance, JSON) to transform it into machine data. In natural language processing, parsing is used to build the various tree representations of the sentence, which encode linguistic rules and structure.

There are many different types of parsers that differ in the additional constraints they impose on the structure of the production rules of the grammar. The generic context-free constraint is that, in each production rule, the left-hand side may only be a single nonterminal. The most widespread varieties of context-free grammars are LL(k) (in particular, LL(1)) and LR (LR(1), SLR, LALR, GLR, etc). For example, an LL(1) parser (one of the easiest to build) parses the input from left to right, performing a leftmost derivation of the sentence, and is allowed to look ahead at most 1 token. Not all combinations of derivation rules allow the algorithm to build a parser that will be able to perform unambiguous rule selection under such constraints. But, as LL(1) parsing is simple and efficient, some authors of grammars specifically target their language to be LL(1)-parseable. For example, Pascal and other programming languages created by Niklaus Wirth fall into this category.

There are also two principal approaches to implementing the parser: a top-down and a bottom-up one. In a top-down approach, the parser tries to build the tree from the root, while, in a bottom-up one, it tries to find the rules that apply to groups of terminal symbols and then combine those until the root symbol is reached. Obviously, we can't enumerate all parsing algorithms here, so we'll study only a single approach, which is one of the most wide-spread, efficient, and flexible ones — Shift-Reduce Parsing. It's a bottom-up linear algorithm that can be considered one of the instances of the pushdown automaton approach — a theoretical computational model for context-free grammars.

A shift-reduce parser operates on a queue of tokens of the original sentence. It also has access to a stack. At each step, the algorithm can perform:

  • either a shift operation: take the token from the queue and push it onto the stack
  • or a reduce operation: take the top items from the stack, select a matching rule from the grammar, and add the corresponding subtree to the partial parse tree, in the process, removing the items from the stack

Thus, for each token, it will perform exactly 2 "movement" operations: push it onto the stack and pop from the stack. Plus, it will perform rule lookup, which requires a constant number of operations (maximum length of the rhs of any rule) if an efficient structure is used for storing the rules. A hash-table indexed by the rhs's or a trie are good choices for that.

Here's a small example from the domain of NLP syntactic parsing. Let's consider a toy grammar:

S -> NP VP .
NP -> DET ADJ NOUN
NP -> PRP$ NOUN ; PRP$ is a possessive pronoun
VP -> VERB VP
VP -> VERB NP

and the following vocabulary:

DET -> a|an|the
NOUN -> elephant|pyjamas
ADJ -> large|small
VERB -> is|wearing
PRP$ -> my

Now, let's parse the sentence (already tokenized): A large elephant is wearing my pyjamas . First, we'll need to perform part-of-speech tagging, which, in this example, is a matter of looking up the appropriate nonterminals in the vocabulary grammar. This will result in the following:

DET   ADJ     NOUN    VERB   VERB    PRP$   NOUN    .
 |     |        |       |      |       |      |     |
 A   large  elephant   is   wearing    my  pyjamas  .

These POS tags will serve as the terminals for our parsing grammar. Now, the shift-reduce process itself begins:

1. Initial queue: (DET ADJ NOUN VERB VERB PRP$ NOUN .)
Initial stack: ()
Operation: shift


2. Queue: (ADJ NOUN VERB VERB PRP$ NOUN .)
Stack: (DET)
Operation: shift (as there are no rules with the rhs DET)

3. Queue: (NOUN VERB VERB PRP$ NOUN .)
Stack: (ADJ DET)
Operation: shift

4. Queue: (VERB VERB PRP$ NOUN .)
Stack: (NOUN ADJ DET)
Operation: reduce (rule NP -> DET ADJ NOUN)
; we match the rules in reverse to the stack

5. Queue: (VERB VERB PRP$ NOUN .)
Stack: (NP)
Operation: shift

6. Queue: (VERB PRP$ NOUN .)
Stack: (VERB NP)
Operation: shift

7. Queue: (PRP$ NOUN .)
Stack: (VERB VERB NP)
Operation: shift

8. Queue: (NOUN .)
Stack: (PRP$ VERB VERB NP)
Operation: shift

9. Queue: (.)
Stack: (NOUN PRP$ VERB VERB NP)
Operation: reduce (rule: NP -> PRP$ NOUN)

10. Queue: (.)
Stack: (NP VERB VERB NP)
Operation: reduce (rule: VP -> VERB NP)

11. Queue: (.)
Stack: (VP VERB NP)
Operation: reduce (rule: VP -> VERB VP)

12. Queue: (.)
Stack: (VP NP)
Operation: shift

13. Queue: ()
Stack: (. VP NP)
Operation: reduce (rule: S -> NP VP .)

14. Reached root symbol - end.

The resulting parse tree is:

            __________S___________
           /            \          \
          /            __VP__       \
         /            /      \       \
        /            /     __VP_      \
       /            /     /     \      \
   ___NP_____      /     /      _NP_    \
  /    |     \    /     /      /    \    \
 DET  ADJ  NOUN VERB  VERB  PRP$  NOUN    .
  |    |     |    |     |     |     |     |
  A  large elephant is wearing my pyjamas .

The implementation of the basic algorithm is very simple:

(defstruct grammar
  rules
  max-length)

(defmacro grammar (&rest rules)
  `(make-grammar
    :rules (pairs->ht (mapcar (lambda (rule)
                                (pair (nthcdr 2 rule) (first rule)))
                              ',rules))
    :max-length
    (let ((max 0))
      (dolist (rule ',rules)
        (when (> #1=(length (nthcdr 2 rule)) max)
          (:= max #1#)))
      max)))  ; #1= & #1# are reader labels that let us refer to the same expression twice

(defun parse (grammar queue)
  (let ((stack (list)))
    (loop :while queue :do
      (print stack)  ; diagnostic output
      (if-it (find-rule stack grammar)
             ;; reduce
             (dotimes (i (length (cdr it))
                         (push it stack))
               (pop stack))
             ;; shift
             (push (pop queue) stack))
      :finally (return (find-rule stack grammar)))))

(defun find-rule (stack grammar)
  (let (prefix)
    (loop :for item :in stack
          :repeat (? grammar 'max-length) :do
      (push (car (mklist item)) prefix)
      (when-it (? grammar 'rules prefix)
        ;; otherwise parsing will fail with a stack
        ;; containing a number of partial subtrees
        (return (cons it (reverse (subseq stack 0 (length prefix)))))))))

CL-USER> (parse (print (grammar (S -> NP VP |.|)
                                (NP -> DET ADJ NOUN)
                                (NP -> PRP$ NOUN)
                                (VP -> VERB VP)
                                (VP -> VERB NP)))
                '(DET ADJ NOUN VERB VERB PRP$ NOUN |.|))
#S(GRAMMAR
   :RULES #{
            '(NP VP |.|) S
            '(DET ADJ NOUN) NP
            '(PRP$ NOUN) NP
            '(VERB VP) VP
            '(VERB NP) VP
           }
   :MAX-LENGTH 3)
NIL
(DET)
(ADJ DET)
(NOUN ADJ DET)
((NP DET ADJ NOUN))
(VERB (NP DET ADJ NOUN))
(VERB VERB (NP DET ADJ NOUN))
(PRP$ VERB VERB (NP DET ADJ NOUN))
(NOUN PRP$ VERB VERB (NP DET ADJ NOUN))
((NP PRP$ NOUN) VERB VERB (NP DET ADJ NOUN))
((VP VERB (NP PRP$ NOUN)) VERB (NP DET ADJ NOUN))
((VP VERB (VP VERB (NP PRP$ NOUN))) (NP DET ADJ NOUN))
(S (NP DET ADJ NOUN) (VP VERB (VP VERB (NP PRP$ NOUN))) |.|)

However, an additional level of complexity arises when the grammar becomes ambiguous, i.e. when there are situations where several rules apply. Shift-reduce is a greedy algorithm, so, in its basic form, it will select some rule (for instance, the one with the shortest rhs or just the first match), and it cannot backtrack. This may result in a parsing failure. If some form of rule weights is added, the greedy selection may produce a suboptimal parse. Either way, there's no option of backtracking to correct a parsing error. In the NLP domain, the peculiarity of applying shift-reduce parsing is that the number of rules is quite significant (it can reach thousands) and, certainly, there's ambiguity. In this setting, shift-reduce parsing is paired with machine learning techniques, which perform a "soft" selection of the action to take at each step, since reduce is applicable almost always, so a naive greedy technique becomes pointless.

Actually, shift-reduce would better be called something like stack-queue parsing, as different parsers may not limit the implementation to just the shift and reduce operations. For example, an NLP parser that allows the construction of non-projective trees (those, where the arrows may cross, i.e. subsequent words may not always belong to a single or subsequent upper-level categories), adds a swap operation. A more advanced NLP parser that produces a graph structure called an AMR (abstract meaning representation) has 9 different operations.

Shift-reduce parsing is implemented in many of the parser generator tools, which generate a parser program from a set of production rules. For instance, the popular Unix tool yacc is a LALR parser generator that uses shift-reduce. Another popular tool ANTLR is a parser generator for LL(k) languages that uses a non-shift-reduce direct pushdown automaton-based implementation.

Besides shift-reduce and similar automata-based parsers, there are many other parsing techniques used in practice. For example, CYK probabilistic parsing was popular in NLP for some time, but it's an O(n^3) algorithm, so it gradually fell from grace and lost to machine-learning enhanced shift-reduce variants. Another approach is packrat parsing (based on PEG — parsing expression grammars), which has a great Lisp parser-generator library: esrap. Packrat is a more powerful top-down parsing approach with backtracking and unlimited lookahead that nevertheless guarantees linear parse time. Any language defined by an LL(k) or LR(k) grammar can be recognized by a packrat parser, in addition to many languages that conventional linear-time algorithms do not support. This additional power simplifies the handling of common syntactic idioms such as the widespread but troublesome longest-match rule, enables the use of sophisticated disambiguation strategies such as syntactic and semantic predicates, provides better grammar composition properties, and allows lexical analysis to be integrated seamlessly into parsing. The last feature makes packrat very appealing to programmers as they don't have to use separate tools for lexical analysis (tokenization and token categorization) and parsing. Moreover, the rules for tokens use the same syntax, which is also quite similar to regular expression syntax. For example, here's a portion of the esrap rules for parsing tables in Markdown documents. A Markdown table may look something like this:

| Left-Aligned  | Center Aligned  | Right Aligned |
| :------------ |:---------------:| -------------:|
| col 3 is      | some wordy text |         $1600 |
| col 2 is      | centered        |           $12 |
| zebra stripes | are neat        |            $1 |

You can see that the code below is quite self-explanatory: each defrule form consists of a rule name (lhs), its rhs, and a transformation of the rhs into a data structure. For instance, in the rule table-row the rhs is (and (& #\|) (+ table-cell) #\| sp newline). The row should start with a | char, be followed by 1 or more table-cells (a separate rule), and end with a |, some space characters, and a newline. And the transformation (:destructure (_ cells &rest __) ... only cares about the content, i.e. the table cells.

(defrule sp (* space-char)
  (:text t))

(defrule table-cell (and #\|
                         sp
                         (* (and (! (or (and sp #\|) endline)) inline))
                         sp
                         (& #\|))
  (:destructure (_ __ content &rest ___)
    (mapcar 'second content)))

(defrule table-row (and (& #\|) (+ table-cell) #\| sp newline)
  (:destructure (_ cells &rest __)
    (mapcar (lambda (a) (cons :plain a))
            cells)))

(defrule table-align-cell (and sp (? #\:) (+ #\-) (? #\:) sp #\|)
  (:destructure (_ left __ right &rest ___)
    (if right (if left 'center 'right) (when left 'left))))

(defrule table-align-row (and #\| (+ table-align-cell) sp newline)
  (:destructure (_ aligns &rest __)
    aligns))

(defrule table-head (and table-row table-align-row))
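
Assuming the auxiliary rules referenced above (space-char, endline, inline, and newline) from the same grammar are also loaded, a single row could then be parsed with esrap's parse function, roughly like this (the exact representation of each cell's content depends on the inline rule):

(esrap:parse 'table-row (format nil "| col 3 is | some wordy text | $1600 |~%"))

The result should be a list of three (:plain . cell-content) conses, one per table cell, as produced by the table-row transformation.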

To conclude the topic of parsing, I wanted to pose a question: can it be used to match regular expressions? And the answer, of course, is that it can, as we are operating in a more powerful paradigm that includes regexes as a subdomain. However, the critical showstopper for applying parsing to this problem is the need to define a grammar instead of writing a compact and more or less intuitive regex...

String Search in Action: Plagiarism Detection

Plagiarism detection is a very challenging problem that doesn't have an exact solution. The reason is that there's no exact definition of what can be considered plagiarism and what can't; the boundary is rather blurry. Obviously, if the text or a part of it is just copy-pasted, everything is clear. But, usually (and especially when they know that plagiarism detection is at play), people will apply their creativity to alter the text in some slight or even significant ways. However, over the years, researchers have come up with numerous plagiarism detection algorithms, with quality good enough for them to be used in our educational institutions. The problem is very popular and there are even shared task challenges dedicated to improving plagiarism catchers. It's somewhat of an arms race between the plagiarists and the detection systems.

One of the earliest but, still, quite effective ways of implementing plagiarism detection is the Shingle algorithm. It is also based on the idea of using hashes and some basic statistical sampling techniques. The algorithm operates in the following stages:

  1. Text normalization (this may include case normalization, reduction of the words to basic forms, error correction, cleanup of punctuation, stopwords, etc.)
  2. Selection of the shingles and calculation of their hashes.
  3. Sampling the shingles from the text in question.
  4. Comparison of the hashes of the original shingles to the sampled hashes and evaluation.

A single shingle is a contiguous sequence of words from the normalized text (another name for this object, in NLP, is an ngram). The original text of n words will give us (- n (1- l)) shingles of length l, for instance, (1- n) shingles of length 2. The hashes of the shingles are normal string hashes (like fnv-1).

The text which is analyzed for plagiarism is also split into shingles, but not all of them are used: just a random sample of m of them. Basic sampling theory can give a good estimate of the number m that can be trusted with a high degree of confidence. For efficient comparison, all the original hashes can be stored in a hash-set. If the number of overlapping shingles exceeds some threshold, the text can be considered plagiarised. Alternatively, the algorithm may return the plagiarism degree, i.e. the percentage of overlapping shingles. The complexity of the algorithm is O(n + m).
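
Here is a minimal sketch of the core of the algorithm under heavy simplifications (these are assumptions of the sketch, not of the real algorithm): the normalization step is skipped, both texts are passed in as lists of words, the shingles are word 2-grams, no sampling is performed, and the shingles themselves are stored in a hash-set instead of their hashes.

(defun shingle-overlap (original-words suspect-words &key (l 2))
  "Return the share of the suspect text's shingles that also occur in the original."
  (flet ((shingles (words)
           (loop :for tail :on words
                 :while (nthcdr (1- l) tail)
                 :collect (format nil "~{~A~^ ~}" (subseq tail 0 l)))))
    (let ((table (make-hash-table :test 'equal)))
      (dolist (sh (shingles original-words))
        (:= (gethash sh table) t))
      (let ((susp (shingles suspect-words)))
        (if susp
            (/ (count-if (lambda (sh) (gethash sh table)) susp)
               (length susp))
            0)))))

For example, (shingle-overlap '("the" "cat" "sat" "on" "the" "mat") '("a" "cat" "sat" "on" "a" "mat")) yields 2/5, as only the shingles "cat sat" and "sat on" are shared.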

In a sense, the Shingle algorithm may be viewed as an instance of massive string search, where the outcome we're interested in is not so much the positions of the patterns in the text (although, those may also be used to indicate the parts of the text that are plagiarism-suspicious) as the fact that they are present in it.

Take-aways

Strings are peculiar objects: initially, it may seem that they are just arrays. But, beyond this simple understanding, due to the main usage patterns, a much more interesting picture can be seen. Advanced string representations and algorithms are examples of special-purpose optimization applied to general-purpose data structures. This is another reason why strings are presented at the end of the part on derived data structures: string algorithms make heavy use of the material we have covered previously, such as trees and graphs.

We have also discussed the FSMs — a powerful data-structure that can be used to reliably implement complex workflows. FSMs may be used not only for string matching but also for implementing protocol handling (for example, in the HTTP server), complex user interactions, and so on. The Erlang programming language even has a standard library behavior gen_fsm (replaced by the newer gen_statem) that is a framework for easy implementation of FSMs — as many Erlang applications are mass service systems that have state machine-like operation.

P.S. Originally, I expected this chapter to be one of the smallest in the book, but it turned out to be the longest one. Strings are not so simple as they might seem... ;)


[1] A proper suffix is a suffix that is at least one character shorter than the string itself. For example, in the string abc the proper suffixes are bc and c.

[2] Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the same applies to Python, or PHP, or Ruby, or many other languages.

Didier VernaDeclt 3.0 "Montgomery Scott" is released

· 18 days ago

I'm pleased to announce the release of Declt 3.0 "Montgomery Scott", my Texinfo reference manual generator for Common Lisp libraries.

This is a new major release of the library, although the changes may not be so visible from the outside. The main concern in this release has been to increase the robustness of the output, from three different angles.

  1. Many places from where Declt attempts to extract meaningful information are underspecified (ASDF system slots notably).
  2. The pretty-printing of Lisp items can be difficult, given the liberalism and extensibility of the language.
  3. Finally, several restrictions in the syntax of Texinfo itself (anchor names notably) get in the way.

These issues were described in a paper presented at the TeX Users Group conference, this summer in Palo Alto. This release goes a long way toward fixing them.

The new generated reference manuals look mostly the same as before, but some important things have changed under the hood, notably the names of hyperlinks and cross-references (backward-incompatible, hence the increase in the major version number). The output is now also much more robust with respect to the final output format: the generation of HTML, DVI, Postscript, PDF, and even plain Info should work fine and has been tested on all Quicklisp libraries (in fact, Quickref will also be upgraded in the near future to provide all those formats at once).

Those improvements do come at a cost. Unicode is now required (even for Info readers). To be honest, many Lisp libraries already had that implicit requirement. Also, this release depends on improvements and bug fixes only available in Texinfo 6.7, now a requirement as well. I have to thank Gavin Smith, from the Texinfo team, for his collaboration.

Apart from that, this release also has a number of bug fixes.

  • Some method specializers were not handled properly.
  • The manuals were missing documentation for some non-Lisp source files.
  • There were some glitches in the pretty-printing of unreadable objects.

Get it at the usual place, and enjoy!

Lispers.deBerlin Lispers Meetup, Monday, 25th November 2019

· 18 days ago

We meet again on Monday 8pm, 25th November. Our host this time will be Max-Gerd Retzlaff (www.retzlaff-wenger.com).

Berlin Lispers is about all flavors of Lisp including Common Lisp, Clojure and Scheme.

Max-Gerd Retzlaff will talk a bit about his current programming for his PhD research[1] in Computer Graphics. Topics include preprocessing of noisy point clouds from a custom-built laser scanner, parsing binary data, coordinate transformations, drift detection using the bisection method, interpolation of scan lines (with Alpha Catmull Rom splines), and his matrix library struct matrices[2].

[1] http://cg.ivd.kit.edu/retzlaff

[2] http://matroid.org/flux/struct-matrices.lisp / .asd

We meet in Max's office at Wichertstraße 2, 10439 Berlin-Prenzlauer Berg, it is the storefront office at street level, the bell reads "Max-Gerd Retzlaff". It is located right around the corner of S+U Schönhauser Allee in 2 min walking distance. In case of questions call Christian +49 1578 70 51 61 4.

Wimpie NortjeHTTP Server: Why Hunchentoot instead of Clack?

· 19 days ago

When you do not use a framework for web application development, one of the first choices to make is to select an HTTP server.

Many web frameworks are built on Clack/Lack. It provides a unified interface to the major CL web server implementations. If you need the ability to easily switch between different web servers without having to change application code, for example because you are worried that your chosen HTTP server may become unsupported, then using such a unified interface is a sensible idea.

On the downside, to achieve such a unified interface, Clack has to limit itself to the lowest common denominator. All the available web servers have some particular strong feature which is usually the reason why one would pick a specific library. Clack removes this distinction between them and makes them all appear the same. It also forces Clack to then replace many of the features already provided by some libraries with its own implementation, like for example sending large files or handling static files.

Clack does have some features not provided by any web server which may be a good reason to consider its use. One such feature is custom middleware. Middleware are functions that automatically run before or after every request handler without the handler having to make the function calls. Things like cookie handling or CSRF prevention. While it may be possible to extend the web servers with middleware-like functionality it is a core feature of Clack.

If Clack-specific features are not relevant and you are not worried that your chosen HTTP server will vanish into thin air then there is not much reason to use it, especially if you need a specific web server's distinguishing feature.

I have used Clack before and it did not properly support some Hunchentoot features that I needed. While I like the middleware concept a lot I decided to use the web server directly to avoid running into similar issues again.

I considered some of the more mature webservers and settled on Hunchentoot. My reasons are:

  1. The learning curve is smaller because I have used it before.
  2. It is probably the most used and most mature web server project so I expect it to have the least bugs.
  3. It is slow compared to other web servers but there are enough other places to optimise for speed to make up for it. If it ever becomes a problem.
  4. I don't need any of the defining features of the other web servers.

When selecting an HTTP server Clack should be considered just as seriously as any of the web servers because it does provide some useful features. However, it should not be taken for granted that Clack is the best choice in all situations. There are times and reasons why a specific web server may be a better choice.

Wimpie NortjePicking libraries for web development

· 20 days ago

I started a project in Common Lisp again. Selective Share is a hosted secrets management service for server-side software. The project has a web app and a command line client, both of which are written in Common Lisp for the same reasons I previously stated in this post.

For the server software I decided to do the web development without a framework. There are full-blown web frameworks available, like Lucerne, Ningle and Caveman, but they have far fewer users than the likes of Rails or Django, which means they are not nearly as mature and their development progresses much more slowly.

When working without a framework one spends a lot of time selecting the libraries to use for each task. Common Lisp being what it is there is usually a choice of libraries for every task. This is great when you know what you need but it makes initial progress very slow when you are new to a particular field. For example, there are at least seven libraries for dealing with JSON, each with its own specialities and shortcomings.

This series of posts describe some of my design decisions and library selections. It should provide a springboard for frameworkless web development but may also be more generally helpful.

The topics covered are listed below, links will be added as the posts become available:

  • Project framework
  • Database
  • DB access
  • ORM / DAO
  • DB migrations
  • HTTP server
  • Request routing and middleware
  • JSON
  • Templating
  • Test framework
  • Logging
  • Session management
  • Password hashing
  • Storing configuration
  • Building binaries
  • CLI parameters

ABCL DevTurning the Bear past Java 11

· 22 days ago
As the leaves fall in the Northern hemisphere, thoughts of the Bear
turn to a rumored permanent slumber, and a complete absence of the
possibility of new updates.  The Bear reviews all submitted
patches, so if you don't like the current implementation,
please get out and push.  Or pull?  Or fast forward? The Bear is
seemingly always confused with Git...

As previously rumored, patches for getting the Armed Bear
Common Lisp implementation running on Java 11 have landed in
the trunk.

Unless anyone objects strongly, we will release ABCL 1.6.0 as soon
as possible with the intention to support the openjdk{6,7,8,11} runtimes.

After stabilizing the 1.6 branch with the inevitable necessary and
unknown updates, we can turn our attention to the ABCL 2.0 branch which
will introduce support for additional runtimes but drop support for emitting
openjdk6 or openjdk7 compatible byte-code.

Stay tuned, Bear fans...

Lispers.deLisp-Meetup in Hamburg on Monday, 4th November 2019

· 36 days ago

We meet at Ristorante Opera, Dammtorstraße 7, Hamburg, starting around 19:00 CEST on 4th November 2019.

This is an informal gathering of Lispers of all experience levels.

Topics might include Advent of Code, Lisp jobs, cryptography, event sourcing.

Eugene ZaikonnikovYocto recipe for CLISP

· 40 days ago

Some time ago I put together a simple recipe for building CLISP with Yocto. It is pinned to 2.49.93 and should work both for target and -native builds. Tested with Sumo only.

Vsevolod DyomkinProgrammatically Drawing Simple Graphviz Graphs

· 42 days ago

For my book, I had to draw a number of graphs. Obviously, I wanted to have a programmatic way to do that, and Graphviz is the go-to library for that. In Lisp, the interface to Graphviz is cl-dot, but, for me, it wasn't easy to figure out from the manual the "simple and easy" way to use it. I.e. I couldn't find a stupid beginner-level interface, so I had to code it myself. Here's the implementation that allows anyone with a REPL to send lists of edges to Graphviz and obtain graph images.

Generated images:

Vsevolod DyomkinProgramming Algorithms: Graphs

· 43 days ago

Graphs have already been mentioned several times, in the book, in quite diverse contexts. Actually, if you are familiar with graphs you can spot opportunities to use them in quite different areas for problems that aren't explicitly formulated with graphs in mind. So, in this chapter, we'll discuss how to handle graphs to develop such intuition to some degree.

But first, let's list the most prominent examples of the direct graph applications, some of which we'll see here in action:

  • pathfinding
  • network analysis
  • dependency analysis in planning, compilers, etc.
  • various optimization problems
  • distributing and optimizing computations
  • knowledge representation and reasoning with it
  • meaning representation in natural language processing

Graphs may be thought of as a generalization of trees: indeed, trees are, as we said earlier, connected directed acyclic graphs. But there's an important distinction in the patterns of the usage of graphs and trees. Graphs, much more frequently than trees, have weights associated with the edges, which adds a whole new dimension both to algorithms for processing them and to possible data that can be represented in the graph form. So, while the main application of trees is reflecting some hierarchy, for graphs, it is often more about determining connectedness and its magnitude, based on the weights.

Graph Representations

A graph is, basically, a set of nodes (called "vertices", V) and an enumeration of their pairs ("edges", E). The edges may be directed or undirected (i.e. bidirectional), and also weighted or unweighted. There are many ways that may be used to represent these sets, which have varied utility for different situations. Here are the most common ones:

  • as a linked structure: (defstruct node data links) where links may be either a list of other nodes, possibly paired with weights, or a list of edge structures represented as (defstruct edge source destination weight). For directed graphs, this representation is similar to a singly-linked list, but for undirected ones, to a heavier doubly-linked one
  • as an adjacency matrix (V x V). This matrix is indexed by vertices and has zeroes when there's no connection between them and some nonzero number for the weight (1 — in case of unweighted graphs) when there is a connection. Undirected graphs have a symmetric adjacency matrix and so need to store only the above-diagonal half of it
  • as an adjacency list that enumerates for each vertex the other vertices it's connected to and the weights of connections
  • as an incidence matrix (V x E). This matrix is similar to the previous representation, but with much more wasted space. The adjacency list may be thought of as a sparse representation of the incidence matrix. The matrix representation may be more useful for hypergraphs (that have more than 2 vertices for each edge), though
  • just as a list of edges

Topological Sort

Graphs may be divided into several kinds according to the different properties they have and specific algorithms, which work on them:

  • disjoint (with several unconnected subgraphs), connected, and fully-connected (every vertex is linked to all the others)
  • cyclic and acyclic, including directed acyclic (DAG)
  • bipartite: when there are 2 groups of vertices and each vertex from one group is connected only to the vertices from the other

In practice, Directed Acyclic Graphs are quite important. These are directed graphs, in which there's no vertex that you can start a path from and return back to. They find applications in optimizing scheduling and computation, determining historical and other types of dependencies (for example, in dataflow programming and even spreadsheets), etc. In particular, every compiler uses one, and even make does when building its plan of operations. The basic algorithm on DAGs is topological sort. It creates a partial ordering of the graph's vertices which ensures that every child vertex always precedes all of its ancestors.

Here is an example. This is a DAG:

And these are the variants of its topological ordering:

6 4 5 3 2 1 8 7
6 4 5 2 3 1 8 7
8 7 6 4 5 3 2 1
8 7 6 4 5 2 3 1

There are several variants as the graph is disjoint, and also the order in which the vertices are traversed is not fully deterministic.

There are two common approaches to topological sort: the Kahn's algorithm and the DFS-based one. Here is the DFS version:

  1. Choose an arbitrary vertex and perform the DFS from it until a vertex is found without children that weren't visited during the DFS.
  2. While performing the DFS, add each vertex to the set of visited ones. Also check that the vertex hasn't been visited already, or else the graph is not acyclic.
  3. Then, add the vertex we have found to the resulting sorted array.
  4. Return to the previous vertex and repeat searching for the next descendant that doesn't have children and add it.
  5. Finally, when all of the current vertex's children are visited add it to the result array.
  6. Repeat this for the next unvisited vertex until no unvisited ones remain.

Why does the algorithm satisfy the desired constraints? First of all, it is obvious that it will visit all the vertices. Next, when we add the vertex we have already added all of its descendants — satisfying the main requirement. Finally, there's a consistency check during the execution of the algorithm that ensures there are no cycles.

Before proceeding to the implementation, as with other graph algorithms, it makes sense to ponder what representation will work the best for this problem. The default one — a linked structure — suits it quite well as we'll have to iterate all the outgoing edges of each node. If we had to traverse by incoming edges then it wouldn't have worked, but a matrix one would have.

(defstruct node
  id edges)

(defstruct edge
  src dst label)

(defstruct (graph (:conc-name nil) (:print-object pprint-graph))
  (nodes (make-hash-table)))  ; mapping of node ids to nodes

As usual, we'll need a more visual way to display the graph than the default print-function. But that is pretty tricky considering that graphs may have an arbitrary structure with possibly intersecting edges. The simplest approach for small graphs would be to just draw the adjacency matrix. We'll utilize it for our examples (relying on the fact that we have control over the set of node ids):

(defun pprint-graph (graph stream)
  (let ((ids (sort (keys (nodes graph)) '<)))
    (format stream "~{ ~A~}~%" ids)  ; here, Tab is used for space
    (dolist (id1 ids)
      (let ((node (? graph 'nodes id1)))
        (format stream "~A" id1)
        (dolist (id2 ids)
          (format stream " ~:[~;x~]"  ; here, Tab as well
                  (find id2 (? node 'edges) :key 'edge-dst)))
        (terpri stream)))))

Also, let's create a function to simplify graph initialization:

(defun init-graph (edges)
  (with ((rez (make-graph))
         (nodes (nodes rez)))
    (loop :for (src dst) :in edges :do
      (let ((src-node (getset# src nodes (make-node :id src))))
        (getset# dst nodes (make-node :id dst))
        (push (make-edge :src src :dst dst)
              (? src-node 'edges))))
    rez))

CL-USER> (init-graph '((7 8)
                       (1 3)
                       (1 2)
                       (3 4)
                       (3 5)
                       (2 4)
                       (2 5)
                       (5 4)
                       (5 6)
                       (4 6)))

  1 2 3 4 5 6 7 8
1   x x
2       x x
3       x x
4           x
5       x   x
6
7               x
8

So, we have already seen in action three different ways of graph representation: linked, matrix, and edge list.

Now, we can implement and test topological sort:

(defun topo-sort (graph)
  (let ((nodes (nodes graph))
        (visited (make-hash-table))
        (rez (vec)))
    (dokv (id node nodes)
      (unless (? visited id)
        (visit node nodes visited rez)))
    rez))

(defun visit (node nodes visited rez)
  (dolist (edge (? node 'edges))
    (with ((id (? edge 'dst))
           (child (? nodes id)))
      (unless (find id rez)
        (assert (not (? visited id)) nil
                "The graph isn't acyclic for vertex: ~A" id)
        (:= (? visited id) t)
        (visit child nodes visited rez))))
  (vector-push-extend (? node 'id) rez)
  rez)

CL-USER> (topo-sort (init-graph '((7 8)
                                  (1 3)
                                  (1 2)
                                  (3 4)
                                  (3 5)
                                  (2 4)
                                  (2 5)
                                  (5 4)
                                  (5 6)
                                  (4 6))))
#(8 7 6 4 5 2 3 1)

This technique of tracking the visited nodes is used in almost every graph algorithm. As noted previously, it can either be implemented using an additional hash-table (like in the example) or by adding a boolean flag to the vertex/edge structure itself.
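For illustration, the flag-based alternative could look like the following sketch (VISITED here is an extra slot invented for the example, not part of the node struct used in this chapter's code):

(defstruct (flagged-node (:include node))
  visited)  ; set to T once the algorithm has processed the node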

MST

Now, we can move to the algorithms that work with weighted graphs. They represent the majority of the interesting graph-based solutions. One of the most basic of them is determining the Minimum Spanning Tree (MST). Its purpose is to select only those graph edges that form a tree with the lowest total sum of weights. Spanning trees play an important role in network routing, where a number of protocols use them directly: STP (Spanning Tree Protocol), RSTP (Rapid STP), MSTP (Multiple STP), etc.

If we consider the graph from the previous picture, its MST will include the edges 1-2, 1-3, 3-4, 3-5, 5-6, and 7-8. Its total weight will be 24.

Although there are quite a few MST algorithms, the most well-known are Prim's and Kruskal's. Both of them rely on some interesting solutions and are worth studying.

Prim's Algorithm

Prim's algorithm grows the tree one edge at a time, starting from an arbitrary vertex. At each step, the least-weight edge that has one of the vertices already in the MST and the other one outside is added to the tree. This algorithm always has an MST of the already processed subgraph, and when all the vertices are visited, the MST of the whole graph is completed. The most interesting property of Prim's algorithm is that its time complexity depends on the choice of the data structure for ordering the edges by weight. The straightforward approach that searches for the shortest edge will have O(V^2) complexity, but if we use a priority queue it can be reduced to O(E logV) with a binary heap or even O(E + V logV) with a Fibonacci heap. Obviously, V logV is significantly smaller than E logV for the majority of graphs: up to E = V^2 for fully-connected graphs.

Here's an implementation of Prim's algorithm with an abstract heap:

(defun prim-mst (graph)
  (let ((initial-weights (list))
        (mst (list))
        (total 0)
        weights
        edges
        cur)
    (dokv (id node (nodes graph))
      (if cur
          (push (pair id (or (? edges id)
                             ;; a standard constant that is
                             ;; a good enough substitute for infinity
                             most-positive-fixnum))
                initial-weights)
          (:= cur id
              edges (? node 'edges))))
    (:= weights (heapify initial-weights))
    (loop
      (with (((id weight) (heap-pop weights)))
        (unless id (return))
        (when (? edges id)
          ;; if not, we have moved to the new connected component
          ;; so there's no edge connecting it to the previous one
          (push (pair cur id) mst)
          (:+ total weight))
        (dokv (id w edges)
          (when (< w weight)
            (heap-decrease-key weights id w)))
        (:= cur id
            edges (? graph 'nodes id 'edges))))
    (values mst
            total)))

To make it work, we need to perform several modifications:

  • first of all, the list of all node edges should be changed to a hash-table to ensure O(1) access by child id
  • the heap should store not only the keys but also the values (a trivial change)
  • we need to implement another fundamental heap operation, heap-decrease-key, which we didn't mention in the previous chapter

For the binary heap, it's actually just a matter of performing heap-up. But the tricky part is that it requires an initial search for the key. To ensure constant-time search and, consequently, O(log n) total complexity, we need to store pointers to the heap elements in a separate hash-table.
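Here is a minimal sketch of such a heap-decrease-key, assuming a vector-backed binary min-heap of (key . weight) conses plus a separate index hash-table; it only illustrates the idea and is not the heap implementation used in the book:

(defun heap-decrease-key (heap index key new-weight)
  "HEAP is a vector of (key . weight) conses ordered as a binary min-heap;
INDEX is a hash-table mapping keys to their current positions in HEAP."
  (let ((i (gethash key index)))
    (when (and i (< new-weight (cdr (aref heap i))))
      (setf (cdr (aref heap i)) new-weight)
      ;; sift the updated element up while it is smaller than its parent
      (loop :for parent := (floor (1- i) 2)
            :while (and (plusp i)
                        (< new-weight (cdr (aref heap parent))))
            :do (rotatef (aref heap i) (aref heap parent))
                (setf (gethash (car (aref heap i)) index) i
                      (gethash (car (aref heap parent)) index) parent
                      i parent)))))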

Let's confirm the stated complexity of this implementation. First, the outer loop runs once per vertex, so it has V iterations. Each iteration involves a heap-pop (O(log V)) and a heap-decrease-key (also O(log V)) for a number of vertices, plus a small number of constant-time operations. heap-pop will be invoked exactly once per vertex, so it needs O(V logV) operations in total, and heap-decrease-key will be called at most once for each edge (O(E logV)). Considering that E is usually greater than V, this is how we arrive at the final complexity estimate.

The Fibonacci heap improves on the binary heap, in this context, as its decrease-key operation is O(1) instead of O(log V), so we are left with just O(V logV) for the heap-pops plus E constant-time heap-decrease-keys. Unlike the binary heap, the Fibonacci one is not a single tree but a set of trees. And this is what decrease-key exploits: instead of sifting an item up the heap and rearranging it in the process, a new tree rooted at this element is cut from the current one. This is not always possible in constant time as there are some invariants that might be violated, which will in turn trigger some updates to the two newly created trees. Yet, the amortized cost of the operation is still O(1).

Here's a brief description of the principle behind the Fibonacci heap adapted from Wikipedia:

A Fibonacci heap is a collection of heap-ordered trees. The trees do not have a prescribed shape and, in the extreme case, every element may be its own separate tree. This flexibility allows some operations to be executed in a lazy manner, postponing the work for later operations. For example, merging heaps is done simply by concatenating the two lists of trees, and the decrease-key operation sometimes cuts a node from its parent and forms a new tree. However, at some point, order needs to be introduced to the heap to achieve the desired running time. In particular, every node can have at most O(log n) children, and the size of a subtree rooted in a node with k children is at least F(k+2), where F(k) is the k-th Fibonacci number. This is achieved by the rule that we can cut at most one child of each non-root node. When a second child is cut, the node itself needs to be cut from its parent and becomes the root of a new tree. The number of trees is decreased in the delete-minimum operation, where trees are linked together. Here's an example Fibonacci heap that consists of 3 trees:

6   2        1  <- minimum
    |      / | \
    5     3  4  7
          |
          8
          |
          9

Kruskal's Algorithm

Kruskal's algorithm operates from the point of view of edges rather than vertices. At each step, it adds to the tree the current smallest edge, unless adding it would produce a cycle. Obviously, the biggest challenge here is to detect the cycle efficiently. The good news is that, as with Prim's algorithm, we already have access to an efficient solution for this problem — Union-Find. Isn't it great that we have already built a library of techniques that may be reused in creating more advanced algorithms? Actually, this is the essence of developing as an algorithms programmer — being able to see a way to reduce the problem, at least partially, to some already known and proven solution.

Like Prim's algorithm, Kruskal's approach also has O(E logV) complexity: for each edge (processed in order of increasing weight), it needs to check whether adding it forms a cycle with the already built partial MST. With Union-Find, this check requires O(logE), but, as E is at most V^2, logE is at most logV^2, which is equal to 2 logV. Unlike Prim's algorithm, the partial MST built by Kruskal's algorithm isn't necessarily a tree for the already processed part of the graph.

The implementation of the algorithm, using the existing code for Union-Find is trivial and left as an exercise to the reader.
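For reference, one possible self-contained sketch of Kruskal's algorithm is shown below; it inlines a simplified Union-Find (with path compression only) instead of reusing the library version from the earlier chapter, and it assumes the edges come as (weight src dst) triples over vertices numbered 0..N-1:

(defun kruskal-mst (edges n)
  "EDGES is a list of (weight src dst) triples; N is the number of vertices."
  (let ((parents (make-array n))
        (mst (list))
        (total 0))
    (dotimes (i n) (setf (aref parents i) i))
    (labels ((uf-find (i)
               (if (= i (aref parents i))
                   i
                   (setf (aref parents i) (uf-find (aref parents i)))))  ; path compression
             (uf-unite (i j)
               (setf (aref parents (uf-find i)) (uf-find j))))
      (dolist (edge (sort (copy-list edges) '< :key 'first))
        (destructuring-bind (w src dst) edge
          ;; adding the edge creates a cycle only when both of its ends
          ;; are already in the same connected component
          (unless (= (uf-find src) (uf-find dst))
            (uf-unite src dst)
            (push edge mst)
            (incf total w)))))
    (values (reverse mst) total)))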

Pathfinding

Pathfinding is another class of problems on weighted graphs, and it is a huge topic that is crucial in many domains: maps, games, networks, etc. The goal, usually, is to find the shortest path between two nodes in a directed weighted graph. Yet, there may be variations like finding the shortest paths from a selected node to all other nodes, finding the shortest path in a maze (which may be represented as a grid graph with all edges of weight 1), etc.

There are, once again, two classic pathfinding algorithms, each with a certain feature that makes it interesting and notable. Dijkstra's algorithm is a classic example of a greedy algorithm, as its alternative name suggests — shortest path first (SPF). A* builds upon it by adding the notion of a heuristic. Dijkstra's approach is the basis of many computer network routing algorithms, such as IS-IS and OSPF, while A* and its modifications are often used in games, as well as in pathfinding on maps.

Dijkstra's Algorithm

The idea of Dijkstra's pathfinding is to perform a limited BFS on the graph, only looking at the edges that don't lead us "away" from the target. Dijkstra's approach is very similar to Prim's MST algorithm: it also uses a heap (binary or Fibonacci) to store the shortest paths from the origin to each node with their weights (lengths). At each step, it selects the minimum from the heap, expands it to the neighbor nodes, and updates the weights of the neighbors if they become smaller (the weights start from infinity).

For our SPF implementation we'll need to use the same trick that was shown in the Union-Find implementation — extend the node structure to hold its weight and the path leading to it:

(defstruct (spf-node (:include node))
  (weight most-positive-fixnum)
  (path (list)))

Here is the main algorithm:

(defun spf (graph src dst)
  (with ((nodes (? graph 'nodes))
         (spf (list))
         ;; the following code should initialize the heap
         ;; with a single node of weight 0 and all other nodes
         ;; of weight MOST-POSITIVE-FIXNUM
         ;; (instead of running a O(n*log n) HEAPIFY)
         (weights (init-weights-heap nodes src)))
    (loop
      (with (((id weight) (heap-pop weights)))
        (cond ((eql id dst) (let ((dst (? nodes dst)))
                              ;; we return 2 values: the path and its length
                              (return (values (cons dst (? dst 'path))
                                              (? dst 'weight)))))
              ((= most-positive-fixnum weight) (return)))  ; no path exists
        (dolist (edge (? nodes id 'edges))
          (with ((cur (? edge 'dst))
                 (node (? nodes cur))
                 (w (+ weight (? edge 'weight))))
            (when (< w (? node 'weight))
              (heap-decrease-key weights cur w)
              (:= (? node 'weight) w
                  (? node 'path) (cons (? nodes id)
                                       (? nodes id 'path))))))))))

A* Algorithm

There are many ways to improve the vanilla SPF. One of them is to move in parallel from both sides: the source and the destination.

The A* algorithm (also called Best-First Search) improves upon Dijkstra's method by changing how the weight of the path is estimated. Initially, it was just the distance we've already traveled in the search, which is known exactly. But we don't know for sure the length of the remaining part. However, in Euclidean and similar spaces, where the triangle inequality holds (the direct distance between 2 points is not greater than the distance between them through any other point), it's not an unreasonable assumption that the direct path will be shorter than the circuitous ones. This premise does not always hold as there may be obstacles, but quite often it does. So, we add a second term to the weight: the direct distance between the current node and the destination. This simple idea underpins the A* search and allows it to perform much faster in many real-world scenarios, although its theoretical complexity is the same as for simple SPF. The estimate of the remaining distance is called the algorithm's heuristic and should be specified for each domain separately: for maps, it is the linear distance, but there are clever ways to invent similar estimates where distances can't be calculated directly.

Overall, this algorithm is one of the simplest examples of the heuristic approach. The idea of heuristics, basically, lies in finding patterns that may significantly improve the performance of the algorithm for the common cases, although their efficiency can't be proven for the general case. Isn't that the same approach as, for example, hash-tables or splay trees, which also don't guarantee the same optimal performance for each operation? The difference is that, although those techniques may behave suboptimally in particular local cases, they provide global probabilistic guarantees. For heuristic algorithms, usually, even such estimates are not available, although they may be derived for some of them. For instance, the performance of the A* algorithm will suffer if there is an "obstacle" on the direct path to the destination, and it's not possible to predict, in the general case, what the configuration of the graph will be and where the obstacles will be. Yet, even in the worst case, A* will still be at least as fast as the basic SPF.

The changes to the SPF algorithm needed for A* are the following:

  • init-weights-heap will use the value of the heuristic instead of most-positive-fixnum as the initial weight. This will also require us to change the loop termination criterion from (= most-positive-fixnum weight) by adding some notion of visited nodes
  • an additional term will be added to the node weight formula: (+ weight (? edge 'weight) (heuristic node)) — see the sketch of a possible heuristic right after this list
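As an illustration, a heuristic for the grid example below might look like the following sketch; the GRID-NODE struct and its COORDS slot are assumptions made for this example and are not part of the chapter's node definition:

(defstruct (grid-node (:include node))
  coords)  ; an assumed (x y) position of the node on the grid

(defun manhattan-heuristic (node dst)
  "Manhattan distance between NODE and the destination DST on a grid
without diagonal connections."
  (destructuring-bind (x1 y1) (grid-node-coords node)
    (destructuring-bind (x2 y2) (grid-node-coords dst)
      (+ (abs (- x1 x2)) (abs (- y1 y2))))))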

A good comparison of the benefits A* brings over simple SPF may be shown with this picture of pathfinding on a rectangular grid without diagonal connections, where each node is labeled with its 2d-coordinates. To find the path from node (0 0) to (2 2) (length 4) using Dijkstra's algorithm, we'll need to visit all of the points in the grid:

  0 1 2
0 + .
1 .
2

0 1 2
0 + . .
1 . .
2 .

0 1 2
0 + . .
1 . . .
2 . .

0 1 2
0 + > v
1 . . v
2 . . +

With A*, however, we'll move straight to the point:

  0 1 2
0 + .
1 .
2

0 1 2
0 + .
1 . .
2

0 1 2
0 + .
1 . . .
2 .

0 1 2
0 + v
1 . > v
2 . +

The final path, in these pictures, is selected by the rule to always open the left neighbor first.

Maximum Flow

Weighted directed graphs are often used to represent different kinds of networks. And one of the main tasks on such networks is efficient capacity planning. The main algorithm for that is Maximum Flow calculation. It works on so-called transport networks, which contain three kinds of vertices: a source, a sink, and intermediate nodes. The source has only outgoing edges, the sink has only incoming ones, and all the other nodes obey the balance condition: the total weights (flow) of all incoming and outgoing edges are equal. The task of determining maximum flow is to estimate the largest amount that can flow through the whole network from the source to the sink. Besides establishing the actual capacity of the network, this also allows finding the bottlenecks and the edges that are not fully utilized. From that point of view, the problem is called Minimum Cut estimation.

There are many approaches to solving this problem. The most direct and intuitive of them is the Ford-Fulkerson method. It is a greedy algorithm, once again, that computes the maximum flow by trying paths from source to sink as long as there is some residual capacity available. These paths are called "augmented paths" as they augment the network flow. And, to track the residual capacity, a copy of the initial weight graph called the "residual graph" is maintained. With each new path added to the total flow, its flow is subtracted from the weights of all of its edges in the residual graph. Besides — and this is the key point in the algorithm that allows it to be optimal despite its greediness — the same amount is added to the backward edges in the residual graph. The backward edges don't exist in the original graph, and they are added to the residual graph in order to let the subsequent iterations reduce the flow along some edge, but not below zero. Why might this be necessary? Each graph node has a maximum input and output capacity. It is possible to saturate the output capacity by different input edges, and the optimal edge to use depends on the whole graph, so, in a single greedy step, it's not possible to determine over which edges more incoming flow should be directed. The backward edges virtually increase the output capacity by the value of the seized input capacity, thus allowing the algorithm to redistribute the flow later on if necessary.

We'll implement the FFA using the matrix graph representation: first of all, to show it in action, and also because it's easy to deal with backward edges in a matrix — they are already present, just with zero initial capacity. However, as this matrix will be sparse in the majority of cases, to achieve optimal efficiency, just like with most other graph algorithms, we'd need a better way to store the edges, for instance, an edge list. With it, we could implement the addition of backward edges directly but lazily, during the processing of each augmented path.

(defstruct mf-edge
  beg end capacity)

(defun max-flow (g)
  (let ((rg (copy-array g))  ; residual graph
        (rez 0))
    (loop :for path := (aug-path rg) :while path :do
      (let ((flow most-positive-fixnum))
        ;; the flow along the path is the residual capacity
        ;; of the least wide edge
        (dolist (edge path)
          (let ((cap (? edge 'capacity)))
            (when (< (abs cap) flow)
              (:= flow (abs cap)))))
        (dolist (edge path)
          (with (((beg end) ? edge))
            (:- (aref rg beg end) flow)
            (:+ (aref rg end beg) flow)))
        (:+ rez flow)))
    rez))

(defun aug-path (g)
  (with ((sink (1- (array-dimension g 0)))
         (visited (make-array (1+ sink) :initial-element nil)))
    (labels ((dfs (g i)
               (if (zerop (aref g i sink))
                   (dotimes (j sink)
                     (unless (or (zerop (aref g i j))
                                 (? visited j))
                       (when-it (dfs g j)
                         (:= (? visited j) t)
                         (return (cons (make-mf-edge :beg i :end j
                                                     :capacity (aref g i j))
                                       it)))))
                   (list (make-mf-edge :beg i :end sink
                                       :capacity (aref g i sink))))))
      (dfs g 0))))

CL-USER> (max-flow #2A((0 4 4 0 0 0)
                       (0 0 0 4 2 0)
                       (0 0 0 1 2 0)
                       (0 0 0 0 0 3)
                       (0 0 0 0 0 5)
                       (0 0 0 0 0 0)))
7

So, as you can see from the code, to find an augmented path, we need to perform DFS on the graph from the source, sequentially examining the edges with some residual capacity to find a path to the sink.

A peculiarity of this algorithm is that there is no certainty that we'll eventually reach the state when there will be no augmented paths left. The FFA works correctly for integer and rational weights, but when they are irrational it is not guaranteed to terminate. When the capacities are integers, the runtime of Ford-Fulkerson is bounded by O(E f) where f is the maximum flow in the graph. This is because each augmented path can be found in O(E) time and it increases the flow by an integer amount of at least 1. A variation of the Ford-Fulkerson algorithm with guaranteed termination and a runtime independent of the maximum flow value is the Edmonds-Karp algorithm, which runs in O(V E^2).
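For illustration, here is a possible sketch of the only piece that changes in Edmonds-Karp compared to the code above: the augmenting path is searched for with BFS, so it is always the shortest one in the number of edges. Plain Lisp loops are used here instead of the book's utilities; make-mf-edge is the constructor defined above.

(defun aug-path-bfs (g)
  (let* ((n (array-dimension g 0))
         (sink (1- n))
         (prev (make-array n :initial-element nil))
         (queue (list 0)))
    (setf (aref prev 0) 0)  ; mark the source as discovered
    ;; plain BFS over edges with nonzero residual capacity
    (loop :while queue :do
      (let ((i (pop queue)))
        (dotimes (j n)
          (when (and (null (aref prev j))
                     (not (zerop (aref g i j))))
            (setf (aref prev j) i)
            (setf queue (nconc queue (list j)))))))
    (when (aref prev sink)
      ;; walk PREV back from the sink to rebuild the path
      (let ((path (list)))
        (do ((to sink from)
             (from (aref prev sink) (aref prev from)))
            (nil)
          (push (make-mf-edge :beg from :end to :capacity (aref g from to))
                path)
          (when (zerop from) (return path)))))))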

Graphs in Action: PageRank

Another important set of problems from the field of network analysis is determining "centers of influence", densely and sparsely populated parts, and "cliques". PageRank is the well-known algorithm for ranking the nodes in terms of influence (i.e. the number and weight of incoming connections they have), which was the secret sauce behind Google's initial success as a search engine. It will be the last of the graph algorithms we'll discuss in this chapter, so many more will remain untouched. We'll be returning to some of them in the following chapters, and you'll be seeing them in many problems once you develop an eye for spotting the graphs hidden in many domains.

The PageRank algorithm outputs a probability distribution representing the likelihood that a person randomly clicking on links will arrive at any particular page. This distribution ranks the relative importance of all pages. The probability is expressed as a numeric value between 0 and 1, but Google used to multiply it by 10 and round up, so a PR of 10 corresponded to a probability of 0.9 and above, and PR=1 — to the interval from 0 to 0.1. In the context of PageRank, all web pages are the nodes of the so-called webgraph, and the links between them are the edges, originally weighted equally.

PageRank is an iterative algorithm that may be considered an instance of the Expectation Maximization (EM) approach, which is very popular in unsupervised optimization and machine learning. The general idea of EM is to randomly initialize the quantities that we want to estimate, and then iteratively recalculate each quantity, using the information from the neighbors, to "move" it closer to the value that ensures optimality of the target function. Epochs (iterations that span the whole data set, using each node at most once) of such recalculation should continue either until an entire epoch doesn't produce a significant change in the loss function we're optimizing (i.e. we have reached a stationary point), or until a satisfactory number of iterations has been performed. Sometimes a stationary point either can't be reached or will take too long to reach, but, according to Pareto's principle, 20% of the effort might have moved us 80% toward the goal.

In each epoch, we recalculate the PageRank of all nodes by transferring weights from a node equally to all of its neighbors. The neighbors with more inbound connections will thus receive more weight. However, the PageRank concept adds a condition that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability that the transfer continues is called the damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor for the webgraph should be set around 0.85. The damping factor is subtracted from 1 (and, in some variations of the algorithm, the result is divided by the number of documents (N) in the collection), and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. So the PageRank of a page is mostly derived from the PageRanks of other pages, with the damping factor adjusting the derived value downward.

Implementation

Actually, PageRank can be computed both iteratively and algebraically. In algebraic form, each PageRank iteration may be expressed simply as:

(:= pr (+ (* d (mat* g pr))
          (/ (- 1 d) n)))

where g is the graph's link matrix (the adjacency matrix with each column normalized by the node's number of outgoing links) and pr is the vector of PageRank for each node.

However, the defining property of PageRank is that it is estimated for huge graphs, i.e. directly representing them as matrices isn't possible, and neither is performing matrix operations on them. The iterative algorithm allows more control, as well as distribution of the computation, so it is usually preferred in practice, not only for PageRank but also for most other optimization techniques. So PageRank should be viewed primarily as a distributed algorithm. The need to implement it on a large cluster triggered the development by Google of the influential MapReduce distributed computation framework.

Here is a simplified PageRank implementation of the iterative method:

(defun pagerank (g &key (d 0.85) (repeat 100))
  (with ((nodes (nodes g))
         (n (tally nodes))
         (pr (make-array n :initial-element (/ 1 n))))
    (loop :repeat repeat :do
      (let ((pr2 (map 'vector (lambda (x)
                                (declare (ignore x))
                                (/ (- 1 d) n))
                      pr)))
        (dokv (i node nodes)
          (let ((p (? pr i))
                (m (tally (? node 'children))))
            (dokv (j child (? node 'children))
              (:+ (? pr2 j) (* d (/ p m))))))
        (:= pr pr2)))
    pr))

We use the same graph representation as previously and perform the update "backwards": not by gathering all incoming edges, which would require us to add another layer of data that is both unnecessary and hard to maintain, but by transferring the PR value over outgoing edges one by one. Such an approach also makes the computation trivial to distribute: we can split the whole graph into arbitrary sets of nodes and perform the computation for each set in parallel — we'll just need to maintain a local copy of the pr2 vector and merge the copies at the end of each iteration by simple summation. This method naturally fits the map-reduce framework: the map step is the inner node loop, while the reduce step is the merging of the pr2 vectors.

Take-aways

  1. The more we progress into advanced topics of this book, the more apparent will be the tendency to reuse the approaches, tools, and technologies we have developed previously. Graph algorithms are good demonstrations of new features and qualities that can be obtained by a smart combination and reuse of existing data structures.
  2. Many graph algorithms are greedy. This means that they use the locally optimal solution while trying to arrive at a global one. This is conditioned by the structure — or rather the lack of structure — of graphs, which don't have a specific hierarchy to guide the search for the optimal solution. Greediness, however, doesn't have to mean suboptimality. In many greedy algorithms, like FFA, there is a way to roll back a wrong solution. Others provide a way to trade off execution speed and optimality. A good example of the latter approach is Beam search that we'll discuss in the next chapters.
  3. In A*, we had a first glimpse of heuristic algorithms — an area that may be quite appealing to many programmers who are used to solving a problem by primarily optimizing for its main scenarios. This approach may lack some mathematical rigor, but it also has its place, and we'll see other heuristic algorithms in the following chapters that are, like A*, the best practical solutions in their domains: for instance, Monte Carlo Tree Search (MCTS).
  4. Another thing that becomes more apparent as this book progresses is how small a percentage of each domain we can cover in detail in a single chapter. This is true for graphs: we have just scratched the surface and outlined the main approaches to handling them. We'll see more graph-related material in the following chapters, as well. Graph algorithms may be quite useful in a great variety of areas that don't necessarily have a direct formulation as graph problems (like maps or networks do), so developing an intuition for recognizing the hidden graph structure may help the programmer reuse existing elegant techniques instead of having to deal with their own cumbersome ad-hoc solutions.

Lispers.deBerlin Lispers Meetup, Monday, 28th October 2019

· 43 days ago

We meet again on Monday 8pm, 28th October. Our host this time is James Anderson (www.dydra.com).

Berlin Lispers is about all flavors of Lisp including Emacs Lisp, Scheme, and Clojure.

This time we talk about the Advent of Code and look at examples in Common Lisp.

[1] https://twitter.com/ericwastl/status/1178596028662112256

[2] https://adventofcode.com/

[3] https://old.reddit.com/r/adventofcode/

We meet in the Taut-Haus at Engeldamm 70 in Berlin-Mitte, the bell is "James Anderson". It is located in 10min walking distance from U Moritzplatz or U Kottbusser Tor or Ostbahnhof. In case of questions call Christian +49 1578 70 51 61 4.

Lispers.deBerlin Lispers - Talk Announcement - Thursday, 24th October 2019

· 46 days ago

Georg Kinnemann will give a talk about LISP and AI with the title "Die Entwicklung eines Spezial-Computers für Künstliche Intelligenz in der DDR" ("The development of a special-purpose computer for artificial intelligence in the GDR") at the Deutsches Technikmuseum Berlin.

5.30pm, 24th October

Deutsches Technikmuseum, Trebbiner Straße 9, lecture hall (Vortragssaal), 4th floor (4. OG)

[1] https://sdtb.de/technikmuseum/veranstaltungen/ein-spezial-computer-fuer-kuenstliche-intelligenz-in-der-ddr/

[2] https://lispers.de/docs/ki-spezialcomputer-dtb-20191024.txt

Quicklisp newsOctober 2019 Quicklisp dist update now available

· 53 days ago
New projects:
  • 3bz — deflate decompressor — MIT
  • bp — Bitcoin Protocol components in Common Lisp — MIT
  • cardiogram — Simple test framework — MIT
  • cesdi — Provides a compute-effective-slot-definition-initargs generic function that allows for more ergonomic initialization of effective slot definition objects. — Unlicense
  • chameleon — Configuration management facilities for Common Lisp with multiple profile support. — MIT
  • ci-utils — A set of tools for using CI platforms — MIT
  • cl-clsparse — Common Lisp bindings for clSPARSE — Apache License, Version 2.0
  • cl-ecma-48 — This package exports a macro for defining ECMA-48 control functions and the 162 functions defined by this. — AGPLv3
  • cl-flat-tree — A flat-tree implementation in Common Lisp. — MIT
  • cl-kraken — A Common Lisp API client for the Kraken exchange — MIT
  • cl-naive-store — This is a naive, persisted, in memory (lazy loading) data store for Common Lisp. — MIT
  • cl-shlex — Lexical analyzer for simple shell-like syntax. — MIT
  • cl-smt-lib — SMT object supporting SMT-LIB communication over input and output streams — BSD-3-Clause
  • cl-wadler-pprint — An implementation of A Prettier Printer in Common Lisp. — Apache-2.0/MIT
  • classowary — An implementation of the Cassowary linear constraint solver toolkit — zlib
  • clsql-local-time — Allows the use of local-time:timestamp objects in CLSQL models and queries — MIT license
  • datamuse — Common Lisp library for accessing the Datamuse word-finding API — MIT
  • date-calc — Package for simple date calculation — GPL or Artistic
  • font-discovery — Find system font files matching a font spec. — zlib
  • horse-html — Parenscript/HTML done better — MIT
  • hunchentoot-multi-acceptor — Multiple domain support under single hunchentoot acceptor — Apache License, Version 2.0
  • lila — a cleaner language based on Common Lisp — MIT
  • linear-programming — A library for solving linear programming problems — MIT
  • lsx — Embeddable HTML templating engine with JSX-like syntax — BSD 2-Clause
  • markup — markup provides a reader-macro to read HTML/XML tags inside of Common Lisp code — Apache License, Version 2.0
  • num-utils — Numerical utilities for Common Lisp — Boost Software License - Version 1.0
  • orizuru-orm — An ORM for Common Lisp and PostgreSQL. — GPLv3
  • paren6 — Paren6 is a set of ES6 macros for Parenscript — Apache License, version 2.0
  • pngload-fast — A reader for the PNG image format. — MIT
  • polisher — Infix notation to S-expression translator — MIT
  • select — DSL for array slices. — Boost
  • simple-parallel-tasks — Evaluate forms in parallel — GPL-3
  • stripe — A client for the Stripe payment API. — MIT
  • trivial-extensible-sequences — Portability library for the extensible sequences protocol. — zlib
  • trivial-package-local-nicknames — Portability library for package-local nicknames — Public domain
  • uax-14 — Implementation of the Unicode Standards Annex #14's line breaking algorithm — zlib
  • uax-9 — Implementation of the Unicode Standards Annex #9's bidirectional text algorithm — zlib
  • with-output-to-stream — Provides a simple way of directing output to a stream according to the concise and intuitive semantics of FORMAT's stream argument. — Public Domain
  • ziz — An ad hoc Quicklisp distribution. — MIT
Updated projects: 3d-matricesalso-alsaanaphoraantikaprilarchitecture.service-providerasdf-encodingsasteroidsatomicsbikebinary-iobinfixbknr-datastoreblack-tiebodge-chipmunkbodge-glfwbodge-nanovgbodge-nuklearbodge-odebodge-openalbodge-sndfilecavemanceplcl+sslcl-algebraic-data-typecl-amqpcl-change-casecl-collidercl-cookiecl-coverallscl-cudacl-dbicl-digikar-utilitiescl-fadcl-fondcl-freetype2cl-geocodecl-hamcrestcl-ipfs-api2cl-kanrencl-ledgercl-lexercl-lzlibcl-mangocl-marklesscl-mssqlcl-openstack-clientcl-patternscl-pdfcl-permutationcl-pythoncl-qrencodecl-rdkafkacl-readlinecl-satcl-sat.glucosecl-sat.minisatcl-sdl2cl-sqlitecl-strcl-tiledcl-yesqlclackclack-errorscloser-mopclxcoleslawcommand-line-argumentscommon-lisp-jupyterconcrete-syntax-treecroatoandata-lensdatum-commentsdefinitionsdeploydexadordrakmadufyeasy-routeseclectorecoenvyeruditeesrapesrap-pegfare-scriptsfast-httpfast-websocketfemlispfiascofloat-featuresflowfolio2fxmlgendlglsl-specglsl-toolkitgolden-utilsgraphhelambdaphermetichttp-bodyironcladjsonrpcjsownkenzolacklastfmlisp-binaryliterate-lisplog4cllucernemagiclmaidenmatlispmcclimmitoninevehninglenodguioriginoverlordparachuteparseparser.common-rulespatchworkpetalisppiggyback-parameterspjlinkpngloadportableaservepostmodernproc-parseprometheus.clpy4clqlotquilcquriqvmrandomratifyrereplicrestasroverpcqrtg-mathrutilssc-extensionsscalplselserapeumshadowshould-testsimplified-typesslyspinneretstaplestumpwmswank-clienttriviatrivial-continuationtrivial-hashtable-serializetrivial-indenttrivial-json-codectrivial-left-padtrivial-monitored-threadtrivial-object-locktrivial-pooled-databasetrivial-timertrivial-utilitiestrivial-variable-bindingstype-iumbrauri-templateutilities.print-itemsvarjoverbosevernacularwoozs3.

To get this update, use (ql:update-dist "quicklisp").

If you get a "badly sized local archive" error during the update, you can also safely use the DELETE-AND-RETRY restart to proceed. This error was introduced by a metadata problem on my end. Sorry about that!

Didier VernaTFM 1.0 "Artificial Uncial" is released

· 59 days ago

I'm happy to announce the first stable version of TFM, a TeX Font Metrics parser for Common Lisp. TFM is the standard font description format used by TeX. The TFM library parses and decodes TFM files into an abstract data structure, providing easy access to the corresponding font information. Get it here.

Lispers.deLisp-Meetup in Hamburg on Monday, 7th October 2019

· 61 days ago

We meet at Ristorante Opera, Dammtorstraße 7, Hamburg, starting around 19:00 CEST on 7th October 2019.

This is an informal gathering of Lispers of all experience levels.

Paul KhuongA Couple of (Probabilistic) Worst-case Bounds for Robin Hood Linear Probing

· 68 days ago

I like to think of Robin Hood hash tables with linear probing as arrays sorted on uniformly distributed keys, with gaps. That makes it clearer that we can use these tables to implement algorithms based on merging sorted streams in bulk, as well as ones that rely on fast point lookups. A key question with randomised data structures is how badly we can expect them to perform.

A bound on the length of contiguous runs of entries

There’s a lot of work on the expected time complexity of operations on linear probing Robin Hood hash tables. Alfredo Viola, along with a few collaborators, has long been exploring the distribution of displacements (i.e., search times) for random elements. The packed memory array angle has also been around for a while.1

I’m a bit wary of the “random element” aspect of the linear probing bounds: while I’m comfortable with an expectation over the hash function (i.e., over the uniformly distributed hash values), a program could repeatedly ask for the same key, and consistently experience worse-than-expected performance. I’m more interested in bounding the worst-case displacement (the distance between the ideal location for an element, and where it is actually located) across all values in a randomly generated2 Robin Hood table, with high enough probability. That probability doesn’t have to be extremely high: \(p = 0.9\) or even \(p = 0.5\) is good enough, as long as we can either rehash with an independent hash function, or the probability of failure drops exponentially enough as the displacement leeway grows.

The people who study hashing with buckets, or hashing as load balancing, seem more interested in these probable worst-case bounds: as soon as one bucket overflows, it’s game over for that hash table! In that context, we wish to determine how much headroom we must reserve in each bucket, on top of the expected occupancy, in order to make sure failures are rare enough. That’s a balls into bins problem, where the \(m\) balls are entries in the hash table, and the \(n\) bins its hash buckets.

Raab and Steger’s Simple and Tight Analysis of the “balls into bins” problem shows that the case where the average occupancy grows with \((\log m)\sp{k}\) and \(k > 1\) has potential, when it comes to worst-case bounds that shrink quickly enough: we only need headroom that grows with \(\sqrt{\log\sp{k+1} m} = \log\sp{(k+1)/2} m\), slightly more than the square root of the average occupancy.

The only issue is that the balls-into-bins analysis is asymptotic, and, more importantly, doesn’t apply at all to linear probing!

One could propose a form of packed memory array, where the sorted set is subdivided in chunks such that the expected load per chunk is in \(\Theta(\log\sp{k} m)\), and the size of each chunk multiplicatively larger (more than \(\log\sp{(k+1)/2}m\))…

Can we instead derive similar bounds with regular linear probing? It turns out that Raab and Steger’s bounds are indeed simple: they find the probability of overflow for one bucket, and derive a union bound for the probability of overflow for any bucket by multiplying the single-bucket failure probability by the number of buckets. Moreover, the single-bucket case itself is a Binomial confidence interval.

We can use the same approach for linear probing; I don’t expect a tight result, but it might be useful.

Let’s say we want to determine how unlikely it is to observe a clump of \(\log\sp{k}n\) entries, where \(n\) is the capacity of the hash table. We can bound the probability of observing such a clump starting at index 0 in the backing array, and multiply by the size of the array for our union bound (the clump could start anywhere in the array).

Given density \(d = m / n\), where \(m\) is the number of entries and \(n\) is the size of the array that backs the hash table, the probability that any given element falls in a range of size \(\log\sp{k}n\) is \(p = \log\sp{k}n / n\). The number of entries in such a range follows a Binomial distribution \(B(dn, p)\), with expected value \(d \log\sp{k}n\). We want to determine the maximum density \(d\) such that \(\mathrm{Pr}[B(dn, p) > \log\sp{k}n] < \frac{\alpha}{dn}\), where \(\alpha\) is our overall failure rate. If rehashing is acceptable, we can let \(\alpha = 0.5\), and expect to find a suitably uniform hash function after half a rehash on average.

We know we want \(k > 1\) for the tail to shrink rapidly enough as \(n\) grows, but even \(\log\sp{2}n\) doesn’t shrink very rapidly. After some trial and error, I settled on a chunk size \(s(n) = 5 \log\sb{2}\sp{3/2} n\). That’s not great for small or medium sized tables (e.g., \(s(1024) = 158.1\)), but grows slowly, and reflects extreme worst cases; in practice, we can expect the worst case for any table to be more reasonable.

Assuming we have a quantile function for the Binomial distribution, we can find the occupancy of our chunk, at \(q = 1 - \frac{\alpha}{n}\). The occupancy is a monotonic function of the density, so we can use, e.g., bisection search to find the maximum density such that the probability that we saturate our chunk is \(\frac{\alpha}{n}\), and thus the probability that any continuous run of entries has size at least \(s(n) = 5 \log\sb{2}\sp{3/2} n\) is less than \(\alpha\).

For \(\alpha = 0.5\), the plot of densities looks like the following.

This curve roughly matches the shape of some of my older purely numerical experiments with Robin Hood hashing. When the table is small (less than \(\approx 1000\)), \(\log n\) is a large fraction of \(n\), so the probability of finding a run of size \(s(n) = 5 \log\sb{2}\sp{3/2} n\) is low. When the table is much larger, the asymptotic result kicks in, and the probability slowly shrinks. However, even around the worst case \(n \approx 4500\), we can exceed \(77\%\) density and only observe a run of length \(s(n)\) half the time.

If we really don’t want to rehash, we can let \(\alpha = 10\sp{-10}\), which compresses the curve and shifts it down: the minimum value is now slightly above \(50\%\) density, and we can clearly see the growth in permissible density as the size of the table grows.

In practice, we can dynamically compute the worst-case displacement, which is always less than the longest run (i.e., less than \(s(n) = 5 \log\sb{2}\sp{3/2} n\)). However, having non-asymptotic bounds lets us write size-specialised code and know that its assumptions are likely to be satisfied in real life.

Bounding buffer sizes for operations on sorted hashed streams

I mentioned at the beginning of this post that we can also manipulate Robin Hood hash tables as sorted sets, where the sort keys are uniformly distributed hash values.

Let’s say we wished to merge the immutable source table S into the larger destination table D in-place, without copying all of D. For example, from

S = [2, 3, 4]
D = [1, 6, 7, 9, 10];

we want the merged result

D' = [1, 2, 3, 4, 6, 7, 9, 10].

The issue here is that, even with gaps, we might have to overwrite elements of D, and buffer them in some scratch space until we get to their final position. In this case, all three elements of S must be inserted between the first and second elements of D, so we could need to buffer D[1:4].

How large of a merge buffer should we reasonably plan for?

In general, we might have to buffer as many elements as there are in the smaller table of S and D. However, we’re working with hash values, so we can expect them to be distributed uniformly. That should give us some grip on the problem.

We can do even better and only assume that both sorted sets were sampled from the same underlying distribution. The key idea is that the rank of an element in S is equal to the value of S’s empirical distribution function for that element, multiplied by the size of S (similarly for D).

The amount of buffering we might need is simply a measure of the worst-case difference between the two empirical DFs: the more S get ahead of D, the more we need to buffer values of D before overwriting them (if we’re very unlucky, we might need a buffer the same size as S). That’s the two-sample Kolmogorov-Smirnov statistic, and we have simple bounds for that distance.

With probability \(1 - \alpha\), we’ll consume from S and D at the same rate \(\pm \sqrt{-\frac{(|S| + |D|) \ln \alpha}{2 |S| |D|}}\). We can let \(\alpha = 10\sp{-10}\) and pre-allocate a buffer of size

\[|S| \sqrt{-\frac{(|S| + |D|) \ln \alpha}{2 |S| |D|}} < \sqrt{\frac{23.03 |S| (|S| + |D|)}{2 |D|}}.\]

In the worst case, \(|S| = |D|\), and we can preallocate a buffer of size \(\sqrt{23.03 |D|} < 4.8 \sqrt{|D|}\) and only need to grow the buffer every ten billion (\(\alpha\sp{-1}\)) merges.
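For concreteness, here is a small sketch (mine, not from the original post) that computes that preallocation bound directly:

;; merge-buffer-size is a hypothetical helper implementing
;; |S| * sqrt(-((|S| + |D|) ln alpha) / (2 |S| |D|))
(defun merge-buffer-size (s d &key (alpha 1d-10))
  (ceiling (* s (sqrt (- (/ (* (+ s d) (log alpha))
                            (* 2 s d)))))))

;; (merge-buffer-size 1000000 1000000) => 4799, i.e. just under 4.8 * sqrt(|D|)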

The same bound applies in a stream processing setting; I assume this is closer to what Frank had in mind when he brought up this question.

Let’s assume a “push” dataflow model, where we still work on sorted sets of uniformly distributed hash values (and the data tuples associated with them), but now in streams that generate values every tick. The buffer size problem now sounds as follows. We wish to implement a sorted merge operator for two input streams that generate one value every tick, and we can’t tell our sources to cease producing values; how much buffer space might we need in order to merge them correctly?

Again, we can go back to the Kolmogorov-Smirnov statistic. In this case however, we could buffer each stream independently, so we’re looking for critical values for the one-sided one-sample Kolmogorov-Smirnov test (how much one stream might get ahead of the hypothetical exactly uniform stream). We have recent (1990) simple and tight bounds for this case as well.

The critical values for the one-sided case are stronger than the two-sided two-sample critical values we used earlier: given an overflow probability of \(1 - \alpha\), we need to buffer at most \(\sqrt{-\frac{n \ln \alpha}{2}},\) elements. For \(\alpha = 10\sp{-20}\) that’s less than \(4.8 \sqrt{n}\).3 This square root scaling is pretty good news in practice: shrinking \(n\) to \(\sqrt{n}\) tends to correspond to going down a rung or two in the storage hierarchy. For example, \(10\sp{15}\) elements is clearly in the range of distributed storage; however, such a humongous stream calls for a buffer of fewer than \(1.5 \cdot 10\sp{8}\) elements, which, at a couple gigabytes, should fit in RAM on one large machine. Similarly, \(10\sp{10}\) elements might fill the RAM on one machine, but the corresponding buffer of less than half a million elements could fit in L3 cache, while one million elements could fill the L3, and 4800 elements fit in L1 or L2.

What I find neat about this (probabilistic) bound on the buffer size is its independence from the size of the other inputs to the merge operator. We can have a shared \(\Theta(\sqrt{n})\)-size buffer in front of each stream, and do all our operations without worrying about getting stuck (unless we’re extremely unlucky, in which case we can grow the buffer a bit and resume or restart the computation).

Probably of more theoretical interest is the fact that these bounds do not assume a uniform distribution, only that all the input streams are identically and independently sampled from the same underlying distribution. That’s the beauty of working in terms of the (inverse) distribution functions.

I don’t think there’s anything deeper

That’s it. Two cute tricks that use well-understood statistical distributions in hashed data structure and algorithm design. I doubt there’s anything to generalise from either bounding approach.

However, I definitely believe they’re useful in practice. I like knowing that I can expect the maximum displacement for a table of \(n\) elements with Robin Hood linear probing to be less than \(5 \log\sb{2}^{3/2} n\), because that lets me select an appropriate option for each table, as a function of that table’s maximum displacement, while knowing the range of displacements I might have to handle. Having a strong bound on how much I might have to buffer for stream join operators feels even more useful: I can pre-allocate a single buffer per stream and not think about efficiently growing the buffer, or signaling that a consumer is falling behind. The probability that I’ll need a larger buffer is so low that I just need to handle it, however inefficiently. In a replicated system, where each node picks an independent hash function, I would even consider crashing when the buffer is too small!


  1. I swear the Waterloo theme isn’t some CanCon thing.

  2. With consistent tie-breaking, the layout of entries in a Robin hood hash table is a function of the set of entries, independently of the sequence of add and remove operations.

  3. Assuming we consume ten streams per nanosecond, we expect to experience underbuffering once every 316 years.

Vsevolod DyomkinProgramming Algorithms: Trees

· 69 days ago


Couldn't resist adding this xkcd

Balancing a binary tree is the infamous interview problem that has all that folklore and debate associated with it. To tell you the truth, like the other 99% of programmers, I never had to perform this task for some work-related project. And not even due to the existence of ready-made libraries, but because self-balancing binary trees are, actually, pretty rarely used. But trees, in general, are ubiquitous even if you may not recognize their presence. The source code we operate with, at some stage of its life, is represented as a tree (a popular term here is Abstract Syntax Tree or AST, but the abstract variant is not the only one the compilers process). The directory structure of the file system is a tree. The object-oriented class hierarchy is likewise. And so on. So, returning to interview questions, trees indeed are a good area as they allow us to cover a number of basic points: linked data structures, recursion, complexity. But there's a much better task, which I have encountered a lot in practice and also used quite successfully in the interview process: breadth-first tree traversal. We'll talk about it a bit later.

Similar to how hash-tables can be thought of as more sophisticated arrays (they are sometimes even called "associative arrays"), trees may be considered an expansion of linked lists. Although technically, a few specific trees are implemented not as a linked data structure but are based on arrays, the majority of trees are linked. Like hash-tables, some trees also allow for efficient access to the element by key, representing an alternative key-value implementation option.

Basically, a tree is a recursive data structure that consists of nodes. Each node may have zero or more children. If the node doesn't have a parent, it is called the root of the tree. And the constraint on trees is that the root is always single. Graphs may be considered a generalization of trees that doesn't impose this constraint, and we'll discuss them in a separate chapter. In graph terms, a tree is an acyclic directed single-component graph. Directed means that there's a one-way parent-child relation. And acyclic means that a child can't have a connection to the parent, either directly or through some other nodes (otherwise, which one would be the parent and which the child?) The recursive nature of trees manifests in the fact that if we extract an arbitrary node of the tree with all of its descendants, the resulting part will remain a tree. We can call it a subtree. Besides parent-child or, more generally, ancestor-descendant "vertical" relationships that apply to all the nodes in the tree, we can also talk about horizontal siblings — the set of nodes that have the same parent/ancestor.

Another important tree concept is the distinction between terminal (leaf) and nonterminal (branch) nodes. Leaf nodes don't have any children. In some trees, the data is stored only in the leaves with branch nodes serving to structure the tree in a certain manner. In other trees, the data is stored in all nodes without any distinction.

Implementation Variants

As we said, the default tree implementation is a linked structure. A linked list may be considered a degenerate tree with all nodes having a single child. A tree node may have more than one child, and so, in a linked representation, each tree root or subroot is the origin of a number of linked lists (sometimes, they are called "paths").

Tree:  a
      / \
     b   c
    / \   \
   d   e   f

Lists:
a -> b -> d
a -> b -> e
a -> c -> f
b -> d
b -> e
c -> f

So, a simple linked tree implementation will look a lot like a linked list one:

(defstruct (tree-node (:conc-name nil))
  key
  children)  ; instead of linked list's next

CL-USER> (with ((f (make-tree-node :key "f"))
                (e (make-tree-node :key "e"))
                (d (make-tree-node :key "d"))
                (c (make-tree-node :key "c" :children (list f)))
                (b (make-tree-node :key "b" :children (list d e))))
           (make-tree-node :key "a"
                           :children (list b c)))
#S(TREE-NODE
   :KEY "a"
   :CHILDREN (#S(TREE-NODE
                 :KEY "b"
                 :CHILDREN (#S(TREE-NODE :KEY "d" :CHILDREN NIL)
                            #S(TREE-NODE :KEY "e" :CHILDREN NIL)))
              #S(TREE-NODE
                 :KEY "c"
                 :CHILDREN (#S(TREE-NODE :KEY "f" :CHILDREN NIL)))))

Similar to lists that had to be constructed from tail to head, we had to populate the tree in reverse order: from leaves to root. With lists, we could, as an alternative, use push and reverse the result, in the end. But, for trees, there's no such operation as reverse.

Obviously, not only lists can be used as a data structure to hold the children. When the number of children is fixed (for example, in a binary tree), they may be defined as separate slots: e.g. left and right. Another option is to use a key-value collection, which allows assigning labels to tree edges (as the keys of the kv), but the downside is that the ordering isn't defined (unless we use an ordered kv like a linked hash-table). We may also want to assign weights or other properties to the edges, and, in this case, either an additional collection (say child-weights) or a separate edge struct should be defined to store all those properties. In the latter case, the node structure will contain edges instead of children. In fact, the tree can also be represented as a list of such edge structures, although this approach is quite inefficient for most use cases.
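To illustrate those alternatives, here is a small sketch (the struct names are made up for the example): fixed slots for a binary tree, and an explicit edge struct for the case when edges carry weights or labels.

(defstruct binary-tree-node
  key left right)

(defstruct tree-edge
  child
  (weight 1)
  label)

(defstruct weighted-tree-node
  key
  edges)  ; a list of TREE-EDGE structs instead of bare children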

Another tree representation utilizes the available linked list implementation directly instead of reimplementing it. Let's consider the following simple Lisp form:

(defun foo (bar)
  "Foo function."
  (baz bar))

It is a tree with the root containing the symbol defun and 4 children:

  • the terminal symbol foo
  • the tree containing the function arguments ((bar))
  • the terminal string (the docstring "Foo function.")
  • and the tree containing the form to evaluate ((baz bar))

By default, in the list-based tree, the first element is the head and the rest are its children. This representation is very compact and convenient for humans, so it is used not only for source code. For example, you can see a similar representation for constituency trees in linguistics:

(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))) (. .)))
;; if we'd like to use the above form as Lisp code,
;; we'd have to shield the symbol "." with ||: (|.| |.|) instead of (. .)

It is equivalent to the following parse tree:

         TOP
          |
          S
    ______|______
   |      |      |
   NP     VP     .
   |     /  \    |
   DT  VBZ   NP  .
   |    |   /  \
  This  is DT   NN
           |    |
           a   test

Another, more specific, alternative is when we are interested only in the terminal nodes. In that case, there will be no explicit root and each list item will be a subtree. The following trees are equivalent:

(((a b (c d)) e) (f (g h)))

            <root>
          /        \
        /   \      /  \
      / | \  e    f   / \
     a  b  /\        g   h
           c d

A tree that has all terminals at the same depth and all nonterminal nodes present — a complete tree — with a specified number of children may be stored in a vector. This is a very efficient implementation that we'll glance at when we talk about heaps.
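As a preview of that vector representation, here is a tiny sketch: for a complete binary tree stored in a vector, the children of the node at index i live at indices 2i+1 and 2i+2.

(defparameter *complete-tree* #(a b c d e f g))  ; a complete binary tree rooted at A

(defun vec-tree-children (tree i)
  "Return the children of the node at index I, if they exist."
  (remove nil
          (list (when (< (+ (* 2 i) 1) (length tree))
                  (aref tree (+ (* 2 i) 1)))
                (when (< (+ (* 2 i) 2) (length tree))
                  (aref tree (+ (* 2 i) 2))))))

;; CL-USER> (vec-tree-children *complete-tree* 1)
;; (D E)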

Finally, a tree may be also represented, although quite inefficiently, with a matrix (only one half is necessary).

Tree Traversal

It should be noted that, unlike with other structures, basic operations, such as tree construction, modification, element search and retrieval, work differently for different tree variants. Thus we'll discuss them further when describing those variants.

Yet, one tree-specific operation is common to all tree representations: traversal. Traversing a tree means iterating over its subtrees or nodes in a certain order. The most direct traversal is called depth-first search or DFS. It is the recursive traversal from parent to child and then to the next child after we return from the recursion. The simplest DFS for our tree-node-based tree may be coded in the following manner:

(defun dfs-node (fn root)
  (call fn (key root))
  (dolist (child (children root))
    (dfs-node fn child)))

CL-USER> (dfs-node 'print *tree*)  ; where *tree* is taken from the previous example
"a"
"b"
"d"
"e"
"c"
"f"

In the spirit of Lisp, we could also define a convenience macro:

(defmacro dotree-dfs ((value root) &body body)
  (let ((node (gensym)))  ; this code is needed to prevent possible symbol collisions for NODE
    `(dfs-node (lambda (,node)
                 (let ((,value (key ,node)))
                   ,@body))
               ,root)))

And if we'd like to traverse a tree represented as a list, the changes are minor:

(defun dfs-list (fn tree)
  ;; we need to handle both subtrees (lists) and leaves (atoms)
  ;; so, we'll just convert everything to a list
  (let ((tree (mklist tree)))
    (call fn (first tree))
    (dolist (child (rest tree))
      (dfs-list fn child))))

CL-USER> (dfs-list 'print '(defun foo (bar)
                             "Foo function."
                             (baz bar)))
DEFUN
FOO
BAR
"Foo function."
BAZ
BAR

Recursion is very natural in tree traversal: we could even say that trees are recursion realized in a data structure. And the good news here is that there's rarely a chance to hit recursion limits: the majority of trees are not infinite, and the height of the tree, which conditions the depth of recursion, grows proportionally to the logarithm of the tree size[1], i.e. quite slowly.

These simple DFS implementations apply the function before descending down the tree. This style is called preorder traversal. There are alternative styles: inorder and postorder. With postorder, the call is executed after the recursion returns, i.e. on the recursive ascent:

(defun post-dfs (fn node)
  (dolist (child (children node))
    (post-dfs fn child))
  (call fn (key node)))

CL-USER> (post-dfs 'print *tree*)
"d"
"e"
"b"
"f"
"c"
"a"

Inorder traversal is applicable only to binary trees: first traverse the left side, then call fn and then descend into the right side.
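
For instance, a minimal sketch of the inorder variant, assuming a binary node with key, lt, and rt accessors (like the bst-node struct defined further in this chapter), might look like this:

(defun in-dfs (fn node)
  (when node
    (in-dfs fn (lt node))   ; first, the whole left subtree
    (call fn (key node))    ; then the node itself
    (in-dfs fn (rt node)))) ; and, finally, the right subtree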

An alternative traversal approach is Breadth-first search (BFS). It isn't as natural as DFS since it traverses the tree layer by layer, i.e. it has to, first, accumulate all the nodes that have the same depth and then iterate over them. In the general case, this isn't justified, but there are a number of algorithms where exactly such ordering is required.

Here is an implementation of BFS (preorder) for our tree-nodes:

(defun bfs (fn nodes)
  (let ((next-level (list)))
    (dolist (node (mklist nodes))
      (call fn (key node))
      (dolist (child (children node))
        (push child next-level)))
    (when next-level
      (bfs fn (reverse next-level)))))

CL-USER> (bfs 'print *tree*)
"a"
"b"
"c"
"d"
"e"
"f"

An advantage of BFS traversal is that it can handle potentially unbounded trees, i.e. it is suitable for processing trees in a streamed manner, layer-by-layer.

In object-oriented languages, tree traversal is usually accomplished by means of the so-called Visitor pattern. Basically, it's the same approach of passing a function to the traversal procedure but in the disguise of additional (and excessive) OO-related machinery. Here is a Visitor pattern example in Java:

interface ITreeVisitor {
    void visit(ITreeNode node);
}

interface ITreeNode {
    List<ITreeNode> children();
    void accept(ITreeVisitor visitor);
}

interface IFn {
    void call(ITreeNode node);
}

class TreeNode implements ITreeNode {
    private final List<ITreeNode> children = new ArrayList<>();

    public List<ITreeNode> children() {
        return children;
    }

    public void accept(ITreeVisitor visitor) {
        visitor.visit(this);
    }
}

class PreOrderTreeVisitor implements ITreeVisitor {
    private IFn fn;

    public PreOrderTreeVisitor(IFn fn) {
        this.fn = fn;
    }

    public void visit(ITreeNode node) {
        fn.call(node);
        for (ITreeNode child : node.children())
            child.accept(this);
    }
}

The gist of this example is the implementation of the method visit that calls the function with the current node and iterates over its children by recursively applying the same visitor. You can see that it's exactly the same as our dfs-node.

One of the interesting tree-traversal tasks is tree printing. There are many ways in which trees can be displayed. The simplest one is directory-style (like the one used by the Unix tree utility):

$ tree /etc/acpi
/etc/acpi
├── asus-wireless.sh
├── events
│   ├── asus-keyboard-backlight-down
│   ├── asus-keyboard-backlight-up
│   ├── asus-wireless-off
│   └── asus-wireless-on
└── undock.sh

It may be implemented with DFS and only requires tracking of the current level in the tree:

(defun pprint-tree-dfs (node &optional (level 0) (skip-levels (make-hash-table)))
  (when (= 0 level)
    (format t "~A~%" (key node)))
  (let ((last-index (1- (length (children node)))))
    (doindex (i child (children node))
      (let ((last-child-p (= i last-index)))
        (dotimes (j level)
          (format t "~C " (if (? skip-levels j) #\Space #\│)))
        (format t "~C── ~A~%"
                (if last-child-p #\└ #\├)
                (key child))
        (:= (? skip-levels level) last-child-p)
        (pprint-tree-dfs child
                         (1+ level)
                         skip-levels)))))

CL-USER> (pprint-tree-dfs *tree*)
a
├── b
│ ├── d
│ └── e
└── c
  └── f

1+ and 1- are standard Lisp shortcuts for adding/subtracting 1 from a number. The skip-levels argument is used to suppress the printing of excess │ characters below the last child of each node.

A more complicated variant is top-to-bottom printing:

;; example from CL-NLP library
CL-USER> (nlp:pprint-tree
          '(TOP (S (NP (NN "This"))
                   (VP (VBZ "is")
                       (NP (DT "a")
                           (JJ "simple")
                           (NN "test")))
                   (|.| "."))))
             TOP
              :
              S
  .-----------:------------.
  :           VP           :
  :   .----:----.          :
 NP   :         NP         :
  :   :    .----:-----.    :
 NN  VBZ  DT    JJ    NN   :
  :   :    :    :     :    :
This  is   a simple test   .

This style, most probably, will need a BFS and a careful calculation of spans of each node to properly align everything. Implementing such a function is left as an exercise to the reader, and a very enlightening one, I should say.

Binary Search Trees

Now, we can return to the topic of basic operations on tree elements. The advantage of trees is that, when built properly, they guarantee O(log n) for all the main operations: search, insertion, modification, and deletion.

This quality is achieved by keeping the leaves sorted and the trees in a balanced state. "Balanced" means that any pair of paths from the root to the leaves have lengths that may differ by at most some predefined quantity: ideally, just 1 (AVL trees), or, as in the case of Red-Black trees, the longest path can be at most twice as long as the shortest. Yet, such situations when all the elements align along a single path, effectively, turning the tree into a list, should be completely ruled out. We have already seen, with Binary search and Quicksort (remember the justification for the 3-medians rule), why this constraint guarantees logarithmic complexity.

The classic example of balanced trees are Binary Search Trees (BSTs), of which AVL and Red-Black trees are the most popular variants. All the properties of BSTs may be trivially extended to n-ary trees, so we'll discuss the topic using the binary trees examples.

Just to reiterate the general intuition for the logarithmic complexity of tree operations, let's examine a complete binary tree: a tree that has all levels completely filled with elements, except maybe for the last one. In it, we have n elements, and each level contains twice as many nodes as the previous. This property means that n is not greater than (+ 1 2 4 ... (/ k 2) k), where k is the capacity of the last level. This formula is nothing but the sum of a geometric progression with the number of items equal to h, which is, by the textbook:

(/ (* 1 (- 1 (expt 2 h)))
   (- 1 2))

In turn, this expression may be reduced to: (- (expt 2 h) 1). So (+ n 1) equals (expt 2 h), i.e. the height of the tree (h) equals (log (+ n 1) 2).
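
To put this formula into perspective, even a million-element complete binary tree remains very shallow:

(log (+ 1000000 1) 2)  ; => ~19.93, i.e. a height of just 20 levels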

BSTs have the ordering property: if some element is to the right of another in the tree, it should consistently be greater (or smaller — depending on the ordering direction). This constraint means that after the tree is built, just extracting its elements by performing an inorder DFS produces a sorted array. The Treesort algorithm utilizes this approach directly to achieve the same O(n * log n) complexity as other efficient sorting algorithms. This n * log n is the complexity of each insertion (O(log n)) multiplied by the number of times it should be performed (n). So, Treesort operates by taking an array and adding its elements to the BST, then traversing the tree and putting the encountered elements into the resulting array, in a proper order.
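
Here is a minimal Treesort sketch. For brevity, it uses plain lists of the form (key left right) for the tree nodes and a naive unbalanced insert written just for this illustration; to really guarantee O(n * log n), a self-balancing insert, like the ones discussed below, should be substituted for it.

(defun bst-insert (key tree)
  (if (null tree)
      (list key nil nil)
      (destructuring-bind (k lt rt) tree
        (if (< key k)
            (list k (bst-insert key lt) rt)
            (list k lt (bst-insert key rt))))))

(defun treesort (vec)
  (let ((tree nil)
        (rez (make-array (length vec) :fill-pointer 0)))
    ;; add all the elements to the BST
    (dovec (item vec)
      (:= tree (bst-insert item tree)))
    ;; an inorder traversal yields them in sorted order
    (labels ((inorder (node)
               (when node
                 (destructuring-bind (key lt rt) node
                   (inorder lt)
                   (vector-push key rez)
                   (inorder rt)))))
      (inorder tree))
    rez))

CL-USER> (treesort #(3 1 2 5 4))
#(1 2 3 4 5)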

Besides, the ordering property also means that, after adding a new element to the tree, in the general case, it should be rebalanced as the newly added element may not be placed in an arbitrary spot, but has just two admissible locations, and choosing any of those may violate the balance constraint. The specific balance invariants and approaches to tree rebalancing are the distinctive properties of each variant of BSTs that we will see below.

Splay Trees

A Splay tree represents a kind of BST that is one of the simplest to understand and to implement. It is also quite useful in practice. It has the least strict constraints and a nice property that recently accessed elements occur near the root. Thus, a Splay tree can naturally act as an LRU-cache. However, there are degraded scenarios that result in O(n) access performance, although, the average complexity of Splay tree operations is O(log n) due to amortization (we'll talk about it in a bit).

The approach to balancing a Splay tree is to move the element we have accessed/inserted into the root position. The movement is performed by a series of operations that are called tree rotations. A certain pattern of rotations forms a step of the algorithm. For all BSTs, there are just two possible tree rotations, and they serve as the basic block, in all balancing algorithms. A rotation may be either a left or a right one. Their purpose is to put the left or the right child into the position of its parent, preserving the order of all the other child elements. The rotations can be illustrated by the following diagrams in which x is the parent node, y is the target child node that will become the new parent, and A,B,C are subtrees. It is said that the rotation is performed around the edge x -> y.

Left rotation:

    x              y
   / \            / \
  y   C    ->    A   x
 / \                / \
A   B              B   C

Right rotation:

  x                y
 / \              / \
A   y      ->    x   C
   / \          / \
  B   C        A   B

As you see, the left and right rotations are complementary operations, i.e. performing one after the other will return the tree to the original state. During the rotation, the inner subtree (B) has its parent changed from y to x.

Here's an implementation of rotations:

(defstruct (bst-node (:conc-name nil)
                     (:print-object (lambda (node out)
                                      (format out "[~a-~@[~a~]-~@[~a~]]"
                                              (key node)
                                              (lt node)
                                              (rt node)))))
  key
  val  ; we won't use this slot in the examples,
       ; but without it, in real-world use cases,
       ; the tree doesn't make any sense :)
  lt   ; left child
  rt)  ; right child

(defun tree-rotate (node parent grandparent)
  (cond
    ((eql node (lt parent)) (:= (lt parent) (rt node)
                                (rt node) parent))
    ((eql node (rt parent)) (:= (rt parent) (lt node)
                                (lt node) parent))
    (t (error "NODE (~A) is not the child of PARENT (~A)"
              node parent)))
  (cond
    ((null grandparent) (return-from tree-rotate node))
    ((eql parent (lt grandparent)) (:= (lt grandparent) node))
    ((eql parent (rt grandparent)) (:= (rt grandparent) node))
    (t (error "PARENT (~A) is not the child of GRANDPARENT (~A)"
              parent grandparent))))

You have probably noticed that we need to pass to this function not only the nodes on the edge around which the rotation is executed but also the grandparent node of the target, to link the changes into the tree. If grandparent is not supplied, it is assumed that parent is the root and, after the rotation, we need to separately reassign the variable holding the reference to the tree to the node that was rotated up (the return value of tree-rotate).
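
For example, rotating around the root edge of a tree held in a hypothetical *tree* variable would look like this:

;; *tree* is an illustrative variable holding the root of some bst-node tree;
;; the node that was rotated up has to be captured by the caller
(:= *tree* (tree-rotate (lt *tree*) *tree* nil))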

Splay trees combine rotations into three possible actions:

  • The Zig step is used to make the node the new root when it's already the direct child of the root. It is accomplished by a single left/right rotation (depending on whether the target is to the left or to the right of the root) followed by an assignment.
  • The Zig-zig step is a combination of two zig steps that is performed when both the target node and its parent are left/right nodes. The first rotation is around the edge between the target node and its parent, and the second — around the target and its former grandparent that has become its new parent, after the first rotation.
  • The Zig-zag step is performed when the target and its parent are not in the same direction: either one is left while the other is right or vice versa. In this case, correspondingly, first a left rotation around the parent is needed, and then a right one around its former grandparent (that has now become the new parent of the target). Or vice versa.

However, with our implementation of tree rotations, we don't have to distinguish the 3 different steps and the implementation of the operation splay becomes really trivial:

(defun splay (node &rest chain)
  (loop :for (parent grandparent) :on chain :do
    (tree-rotate node parent grandparent))
  node)

The key point here and in the implementation of Splay tree operations is the use of reverse chains of nodes from the child to the root which will allow us to perform chains of splay operations in an end-to-end manner and also custom modifications of the tree structure.

From the code, it is clear that splaying requires at maximum the same number of steps as the height of the tree because each rotation brings the target element 1 level up. Now, let's discuss why all Splay tree operations are O(log n). Element access requires binary search for the element in the tree, which is O(log n) provided the tree is balanced, and then splaying it to root — also O(log n). Deletion requires search, then swapping the element either with the rightmost child of its left subtree or the leftmost child of its right subtree (direct predecessor/successor) — to make it childless, removing it, and, finally, splaying the parent of the removed node. And update is, at worst, deletion followed by insertion.

Here is the implementation of the Splay tree built of bst-nodes and restricted to only arithmetic comparison operations. All of the high-level functions, such as st-search, st-insert or st-delete, return the new tree root obtained after the operation, which should substitute the previous one in the caller code.

(defun node-chain (item root &optional chain)
  "Return as the values the node equal to ITEM or the closest one to it
   and the chain of nodes leading to it, in the splay tree based in ROOT."
  (if root
      (with (((key lt rt) ? root)
             (chain (cons root chain)))
        (cond ((= item key) (values root
                                    chain))
              ((< item key) (node-chain item lt chain))
              ((> item key) (node-chain item rt chain))))
      (values nil
              chain)))

(defun st-search (item root)
  (with ((node chain (node-chain item root)))
    (when node
      (apply 'splay chain))))

(defun st-insert (item root)
  (assert root nil "Can't insert item into a null tree")
  (with ((node chain (node-chain item root)))
    (unless node
      (let ((parent (first chain)))
        ;; here, we use the property of the := expression
        ;; that it returns the item being set
        (push (:= (? parent (if (> (key parent) item) 'lt 'rt))
                  (make-bst-node :key item))
              chain)))
    (apply 'splay chain)))

(defun idir (dir)
  (case dir
    (lt 'rt)
    (rt 'lt)))

(defun closest-child (node)
  (dolist (dir '(lt rt))
    (let ((parent nil)
          (current nil))
      (do ((child (call dir node) (call (idir dir) child)))
          ((null child) (when current
                          (return-from closest-child
                            (values dir
                                    current
                                    parent))))
        (:= parent current
            current child)))))

(defun st-delete (item root)
  (with ((node chain (node-chain item root))
         (parent (second chain)))
    (if (null node)
        root ; ITEM was not found
        (with ((dir child child-parent (closest-child node))
               (idir (idir dir)))
          (when parent
            (:= (? parent (if (eql (lt parent) node) 'lt 'rt))
                child))
          (when child
            (:= (? child idir) (? node idir))
            (when child-parent
              (:= (? child-parent idir) (? child dir))
              ;; CHILD also takes over the remaining subtree of NODE
              (:= (? child dir) (? node dir))))
          (if parent
              (apply 'splay (rest chain))
              child)))))

(defun st-update (old new root)
  (st-insert new (st-delete old root)))

The deletion is somewhat tricky due to the need to account for different cases: when removing the root, the direct child of the root, or the other node.

Let's test the Splay tree operation in the REPL (coding pprint-bst as a slight modification of pprint-tree-dfs is left as an exercise to the reader):

CL-USER> (defparameter *st* (make-bst-node :key 5))
CL-USER> *st*
[5--]
CL-USER> (pprint-bst (:= *st* (st-insert 1 *st*)))
1
├── .
└── 5
CL-USER> (pprint-bst (:= *st* (st-insert 10 *st*)))
10
├── 1
│ ├── .
│ └── 5
└── .
CL-USER> (pprint-bst (:= *st* (st-insert 3 *st*)))
3
├── 1
└── 10
  ├── 5
  └── .
CL-USER> (pprint-bst (:= *st* (st-insert 7 *st*)))
7
├── 3
│ ├── 1
│ └── 5
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 8 *st*)))
8
├── 7
│ ├── 3
│ │ ├── 1
│ │ └── 5
│ └── .
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 2 *st*)))
2
├── 1
└── 8
  ├── 7
  │ ├── 3
  │ │ ├── .
  │ │ └── 5
  │ └── .
  └── 10
CL-USER> (pprint-bst (:= *st* (st-insert 4 *st*)))
4
├── 2
│ ├── 1
│ └── 3
└── 8
  ├── 7
  │ ├── 5
  │ └── .
  └── 10
CL-USER> *st*
[4-[2-[1--]-[3--]]-[8-[7-[5--]-]-[10--]]]

As you can see, the tree gets constantly rearranged at every insertion.

Accessing an element, when it's found in the tree, also triggers tree restructuring:

CL-USER> (pprint-bst (st-search 5 *st*))
5
├── 4
│ ├── 2
│ │ ├── 1
│ │ └── 3
│ └── .
└── 8
  ├── 7
  └── 10

The insertion and deletion operations, for the Splay tree, also may have an alternative implementation: first, split the tree in two at the place of the element to be added/removed and then combine them. For insertion, the combination is performed by making the new element the root and linking the previously split subtrees to its left and right. As for deletion, splitting the Splay tree requires splaying the target element and then breaking the two subtrees apart (removing the target that has become the root). The combination is also O(log n) and it is performed by splaying the rightmost node of the left subtree (the largest element) so that it doesn't have the right child. Then the right subtree can be linked to this vacant slot.
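
Here is a rough sketch of that alternative built on top of node-chain and splay from above (st-split and st-merge are names made up for this illustration; both assume non-empty trees with distinct keys):

(defun st-split (item root)
  "Return two trees: one with keys smaller than ITEM and one with the rest."
  (let ((root (apply 'splay (nth-value 1 (node-chain item root)))))
    (if (< (key root) item)
        (let ((rt (rt root)))
          (:= (rt root) nil)
          (values root rt))
        (let ((lt (lt root)))
          (:= (lt root) nil)
          (values lt root)))))

(defun st-merge (lt rt)
  "Merge two splay trees, provided every key in LT is smaller than any key in RT."
  (cond ((null lt) rt)
        ((null rt) lt)
        ;; splaying the largest key of LT leaves its root without a right
        ;; child, so RT can be linked into that vacant slot
        (t (let ((lt (apply 'splay (nth-value 1 (node-chain (key rt) lt)))))
             (:= (rt lt) rt)
             lt))))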

Although regular access to the Splay tree requires splaying of the element we have touched, tree traversal should be implemented without splaying. Or rather, just the normal DFS/BFS procedures should be used. First of all, this approach will keep the complexity of the operation at O(n) without the unnecessary log n multiplier added by the splaying operations. Besides, accessing all the elements inorder will trigger the edge-case scenario and turn the Splay tree into a list — exactly the situation we want to avoid.

Complexity Analysis

All of those considerations apply under the assumption that all the tree operations are O(log n). But we haven't proven it yet. Turns out that, for Splay trees, it isn't a trivial task and requires amortized analysis. Basically, this approach averages the cost of all operations over all tree elements. Amortized analysis allows us to confidently use many advanced data structures for which it isn't possible to prove the required time bounds for individual operations, but the general performance over the lifetime of the data structure is in those bounds.

The principal tool of the amortized analysis is the potential method. Its idea is to combine, for each operation, not only its direct cost but also the change to the potential cost of other operations that it brings. For Splay trees, we can observe that only zig-zig and zig-zag steps are important, for the analysis, as zig step happens only once for each splay operation and changes the height of the tree by at most 1. Also, both zig-zig and zig-zag have the same potential.

Rigorously calculating the exact potential requires a number of mathematical proofs that we don't have space to show here, so let's just list the main results.

  1. The potential of the whole Splay tree is the sum of the ranks of all nodes, where rank is the logarithm of the number of elements in the subtree rooted at node:

    (defun rank (node)
      (let ((size 0))
        (dotree-dfs (_ node)
          (:+ size))
        (log size 2)))
  2. The change of potential produced by a single zig-zig/zig-zag step can be calculated in the following manner:

    (+ (- (rank grandparent-new) (rank grandparent-old))
       (- (rank parent-new) (rank parent-old))
       (- (rank node-new) (rank node-old)))

    Since (= (rank node-new) (rank grandparent-old)) it can be reduced to:

    (- (+ (rank grandparent-new) (rank parent-new))
       (+ (rank parent-old) (rank node-old)))

    Which is not larger than:

    (- (+ (rank grandparent-new) (rank node-new))
       (* 2 (rank node-old)))

    Which, in turn, due to the concavity of the log function, may be reduced to:

    (- (* 3 (- (rank node-new) (rank node-old))) 2)

    The amortized cost of any step is 2 operations larger than the change in potential as we need to perform 2 tree rotations, so it's not larger than:

    (* 3 (- (rank node-new) (rank node-old)))
  3. When summed over the entire splay operation, this expression "telescopes" to (* 3 (- (rank root) (rank node))) which is O(log n). Telescoping means that when we calculate the sum of the cost of all zig-zag/zig-zig steps, the inner terms cancel each other and only the boundary ones remain. The difference in ranks is, in the worst case, log n as the rank of the root is (log n 2) and the rank of the arbitrary node is between that value and (log 1 2) (0).

  4. Finally, the total cost for m splay operations is O(m log n + n log n), where m log n term represents the total amortized cost of a sequence of m operations and n log n is the change in potential that it brings.

As mentioned, the above exposition is just a cursory look at the application of the potential method that skips some important details. If you want to learn more you can start with this discussion on CS Theory StackExchange.

To conclude, similar to hash-tables, the performance of Splay tree operations for a concrete element depends on the order of the insertion/removal of all the elements of the tree, i.e. it has an unpredictable (random) nature. This property is a disadvantage compared to some other BST variants that provide precise performance guarantees. Another disadvantage, in some situations, is that the tree is constantly restructured, which makes it mostly unfit for usage as a persistent data structure and also may not play well with many storage options. Yet, Splay trees are simple and, in many situations, due to their LRU-property, may be preferable over other BSTs.

Red-Black and AVL Trees

Another BST that also has similar complexity characteristics to Splay trees and, in general, a somewhat similar approach to rebalancing is the Scapegoat tree. Both of these BSTs don't require storing any additional information about the current state of the tree, which results in the random aspect of their operation. And although it is smoothed over all the tree accesses, it may not be acceptable in some usage scenarios.

An alternative approach, if we want to exclude the random factor, is to track the tree state. Tracking may be achieved by adding just 1 bit to each tree node (as with Red-Black trees) or 2 bits, the so-called balance factors (AVL trees)[2]. However, for most of the high-level languages, including Lisp, we'll need to go to great lengths or even perform low-level non-portable hacking to, actually, ensure that exactly 1 or 2 bits is spent for this data, as the standard structure implementation will allocate a whole word even for a bit-sized slot. Moreover, in C likewise, due to cache alignment, the structure will also have the size aligned to memory word boundaries. So, by and large, usually we don't really care whether the data we'll need to track is a single bit flag or a full integer counter.

The balance guarantee of an RB tree is that, for each node, the height of the left and right subtrees may differ by at most a factor of 2. Such a boundary condition occurs when the longer path contains alternating red and black nodes, and the shorter — only black nodes. Balancing is ensured by the requirement to satisfy the following invariants:

  1. Each tree node is assigned a label: red or black (basically, a 1-bit flag: 0 or 1).
  2. The root should be black (0).
  3. All the leaves are also black (0). And the leaves don't hold any data. A good implementation strategy to satisfy this property is to have a constant singleton terminal node that all preterminals will link to: (defparameter *rb-leaf* (make-rb-node)).
  4. If a parent node is red (1) then both its children should be black (0). Due to mock leaves, each node has exactly 2 children.
  5. Every path from a given node to any of its descendant leaf nodes should contain the same number of black nodes.

So, to keep the tree in a balanced state, the insert/update/delete operations should perform rebalancing when the constraints are violated. Robert Sedgewick has proposed the simplest version of the red-black tree called the Left-Leaning Red-Black Tree (LLRB). The LLRB maintains an additional invariant that all red links must lean left except during inserts and deletes, which makes for the simplest implementation of the operations. Below, we can see the outline of the insert operation:

(defstruct (rb-node (:include bst-node) (:conc-name nil))
  (red nil :type boolean))

(defun rb-insert (item root &optional parent)
  (let ((node (make-rb-node :key item)))
    (when (null root)
      (return-from rb-insert node))
    ;; on the way down, split a node with 2 red children by flipping the colors
    (when (and (lt root) (red (lt root))
               (rt root) (red (rt root)))
      (:= (red root) (not (red root))
          (red (lt root)) nil
          (red (rt root)) nil))
    ;; descend to the proper position
    (cond ((< (key root) item)
           (:= (rt root) (rb-insert item (rt root) root)))
          ((> (key root) item)
           (:= (lt root) (rb-insert item (lt root) root))))
    ;; a red link leaning to the right: rotate it to the left,
    ;; the node moving up takes the color of ROOT, while ROOT becomes red
    (when (and (rt root) (red (rt root))
               (not (and (lt root) (red (lt root)))))
      (:= (red (rt root)) (red root)
          root (tree-rotate (rt root) root parent)
          (red (lt root)) t))
    ;; two red links in a row on the left: rotate to the right
    (when (and (lt root) (red (lt root))
               (lt (lt root)) (red (lt (lt root))))
      (:= (red (lt root)) (red root)
          root (tree-rotate (lt root) root parent)
          (red (rt root)) t))
    root))

This code is more of an outline. You can easily find the complete implementation of the RB-tree on the internet. The key here is to understand the principle of their operation. Also, we won't discuss AVL trees, in detail. Suffice to say that they are based on the same principles but use a different set of balancing operations.

Both Red-Black and AVL trees may be used when worst-case performance guarantees are required, for example, in real-time systems. Besides, they serve as a basis for implementing persistent data structures that we'll talk about later. The Java TreeMap and similar data structures from the standard libraries of many languages are implemented with one of these BSTs. And the implementations of both are present in the Linux kernel and are used as data structures for various queues.

OK, now you know how to balance a binary tree :D

B-Trees

B-tree is a generalization of a BST that allows for more than two children. The number of children is not unbounded and should be in a predefined range. For instance, the simplest B-tree — 2-3 tree — allows for 2 or 3 children. Such trees combine the main advantage of self-balanced trees — logarithmic access time — with the benefit of arrays — locality — the property which allows for faster cache access or retrieval from the storage. That's why B-trees are mainly used in data storage systems. Overall, B-tree implementations perform the same trick as we saw in prod-sort: switching to sequential search when the sequence becomes small enough to fit into the cache line of the CPU.

Each internal node of a B-tree contains a number of keys. For a 2-3 tree, the number is either 1 or 2. The keys act as separation values which divide the subtrees. For example, if the keys are x and y, all the values in the leftmost subtree will be less than x, all values in the middle subtree will be between x and y, and all values in the rightmost subtree will be greater than y. Here is an example:

               [ 7 . 18 ]
          /        |         \
 [ 1 . 3 ]    [ 10 . 15 ]    [ 20 . _ ]

This tree has 4 nodes. Each node has 2 key slots and may have 0 (in the case of the leaf nodes), 2 or 3 children. The node structure for it might look like this:

(defstruct 23-node
  key1
  key2
  val1
  val2
  lt
  md
  rt)

Yet, a more general B-tree node would, probably, contain arrays for keys/values and children links:

(defstruct bt-node
  (keys (make-array *max-keys*))
  (vals (make-array *max-keys*))
  (children (make-array (1+ *max-keys*))))

The element search in a B-tree is very similar to that of a BST. Just, there will be up to *max-keys* comparisons instead of 1, in each node. Insertion is more tricky as it may require rearranging the tree items to satisfy its invariants. A B-tree is kept balanced after insertion by the procedure of splitting a would-be overfilled node, of (1+ n) keys, into two (/ n 2)-key siblings and inserting the mid-value key into the parent. That's why, usually, the range of the number of keys in the node, in the B-tree is chosen to be between k and (* 2 k). Also, in practice, k will be pretty large: an order of 10s or even 100. Depth only increases when the root is split, maintaining balance. Similarly, a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the minimum number of keys for non-root nodes. A merger reduces the number of keys in the parent potentially forcing it to merge or redistribute keys with its siblings, and so on. The depth of the tree will increase slowly as elements are added to it, but an increase in the overall depth is infrequent and results in all leaf nodes being one more node farther away from the root.
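
For illustration, here is a sketch of such a search over the generalized bt-node from above. It assumes that unused key slots and absent children are nil and that the values are stored in the vals array in parallel with the keys:

(defun bt-search (item node)
  (when node
    (dotimes (i *max-keys*
                ;; ITEM is greater than all the keys of NODE -
                ;; descend into the rightmost child
                (bt-search item (aref (bt-node-children node) *max-keys*)))
      (let ((key (aref (bt-node-keys node) i)))
        (cond ((or (null key) (< item key))
               (return (bt-search item (aref (bt-node-children node) i))))
              ((= item key)
               (return (aref (bt-node-vals node) i))))))))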

A version of B-trees that is particularly developed for storage systems and is used in a number of filesystems, such as NTFS and ext4, and databases, such as Oracle and SQLite, is B+ trees. A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an additional level is added at the bottom with linked leaves. The leaves of the B+ tree are linked to one another in a linked list, making range queries or an (ordered) iteration through the blocks simpler and more efficient. Such a property could not be achieved in a B-tree, since not all keys are present in the leaves: some are stored in the root or intermediate nodes.

However, a newer Linux file-system, developed specifically for use on SSDs and called btrfs, uses plain B-trees instead of B+ trees because the former allow implementing copy-on-write, which is needed for efficient snapshots. The issue with B+ trees is that their leaf nodes are interlinked, so if a leaf were copied on write, its siblings and parents would have to be copied as well, as would their siblings and parents and so on until the entire tree was copied. We can recall the same situation pertaining to doubly-linked lists compared to singly-linked ones. So, a modified B-tree without leaf linkage is used in btrfs, with a refcount associated with each tree node but stored in an ad-hoc free map structure.

Overall, B-trees are a very natural continuation of BSTs, so we won't spend more time with them here. I believe, it should be clear how to deal with them, overall. Surely, there are a lot of B-tree variants that have their nuances, but those should be studied in the context of a particular problem they are considered for.

Heaps

A different variant of a binary tree is a Binary Heap. Heaps are used in many different algorithms, such as pathfinding, encoding, minimum spanning tree, etc. They even have their own O(n * log n) sorting algorithm — the elegant Heapsort. In a heap, each element is either the smallest (min-heap) or the largest (max-heap) element of its subtree. It is also a complete tree, and the last layer should be filled left-to-right. This invariant makes the heap well suited for keeping track of element priorities. So Priority Queues are, usually, based on heaps. Thus, it's beneficial to be aware of the existence of this peculiar data structure.

The constraints on the heap allow representing it in a compact and efficient manner — as a simple vector. Its first element is the heap root, the second and third are its left and right child (if present) and so on, by recursion. This arrangement permits access to the parent and children of any element using the simple offset-based formulas (in which the element is identified by its index):

(defun hparent (i)
  "Calculate the index of the parent of the heap element with index I."
  (floor (- i 1) 2))

(defun hrt (i)
  "Calculate the index of the right child of the heap element with index I."
  (* (+ i 1) 2))

(defun hlt (i)
  "Calculate the index of the left child of the heap element with index I."
  (- (hrt i) 1))

So, to implement a heap, we don't need to define a custom node structure, and besides, can get to any element in O(1)! Here is the utility to rearrange an arbitrary array in a min-heap formation (in other words, we can consider a binary heap to be a special arrangement of array elements). It works by iteratively placing each element in its proper place by swapping with children until it's larger than both of the children.

(defun heapify (vec)
  (let ((mid (floor (length vec) 2)))
    (dotimes (i mid)
      (heap-down vec (- mid i 1))))
  vec)

(defun heap-down (vec beg &optional (end (length vec)))
  (let ((l (hlt beg))
        (r (hrt beg)))
    (when (< l end)
      (let ((child (if (or (>= r end)
                           (> (? vec l)
                              (? vec r)))
                       l r)))
        (when (> (? vec child)
                 (? vec beg))
          ;; rotatef swaps the elements of the sequence
          (rotatef (? vec beg)
                   (? vec child))
          (heap-down vec child end)))))
  vec)

And here is the reverse operation to pop the item up the heap:

(defun heap-up (vec i)
  ;; the root element (index 0) has no parent, so we stop there
  (when (and (> i 0)
             (> (? vec i)
                (? vec (hparent i))))
    (rotatef (? vec i)
             (? vec (hparent i)))
    (heap-up vec (hparent i)))
  vec)

Also, as with other data structures, it's essential to be able to visualize the content of the heap in a convenient form, as well as to check the invariants. These tasks may be accomplished with the help of the following functions:

(defun draw-heap (vec)
  (format t "~%")
  (with ((size (length vec))
         (h (+ 1 (floor (log size 2)))))
    (dotimes (i h)
      (let ((spaces (loop :repeat (- (expt 2 (- h i)) 1) :collect #\Space)))
        (dotimes (j (expt 2 i))
          (let ((k (+ (expt 2 i) j -1)))
            (when (= k size) (return))
            (format t "~{~C~}~2D~{~C~}" spaces (? vec k) spaces)))
        (format t "~%"))))
  (format t "~%")
  vec)

(defun check-heap (vec)
  (dotimes (i (floor (length vec) 2))
    (when (= (hlt i) (length vec)) (return))
    (assert (not (> (? vec (hlt i)) (? vec i)))
            () "Left child (~A) is > parent at position ~A (~A)."
            (? vec (hlt i)) i (? vec i))
    (when (= (hrt i) (length vec)) (return))
    (assert (not (> (? vec (hrt i)) (? vec i)))
            () "Right child (~A) is > than parent at position ~A (~A)."
            (? vec (hrt i)) i (? vec i)))
  vec)

CL-USER> (check-heap #(10 5 8 2 3 7 1 9))
Left child (9) is > parent at position 3 (2).
[Condition of type SIMPLE-ERROR]

CL-USER> (check-heap (draw-heap (heapify #(1 22 10 5 3 7 8 9 7 13))))

               22
       13              10
    9       3       7       8
  5   7   1

#(22 13 10 9 3 7 8 5 7 1)

Due to the regular nature of the heap, drawing it with BFS is much simpler than for most other trees. As with ordered trees, heap element insertion and deletion require repositioning of some of the elements.

(defun heap-push (node vec)
  (vector-push-extend node vec)
  (heap-up vec (1- (length vec))))

(defun heap-pop (vec)
  (rotatef (? vec 0) (? vec (- (length vec) 1)))
  ;; PROG1 is used to return the result of the first form
  ;; instead of the last, like it happens with PROGN
  (prog1 (vector-pop vec)
    (heap-down vec 0)))

Now, we can implement Heapsort. The idea is to iteratively arrange the array in heap order element by element. Each arrangement will take log n time as we're pushing the item down a complete binary tree the height of which is log n. And we'll need to perform n such iterations.

(defun heapsort (vec)
  (heapify vec)
  (dotimes (i (length vec))
    (let ((last (- (length vec) i 1)))
      (rotatef (? vec 0)
               (? vec last))
      (heap-down vec 0 last)))
  vec)

CL-USER> (heapsort #(1 22 10 5 3 7 8 9 7 13))
#(1 3 5 7 7 8 9 10 13 22)

There are so many sorting algorithms, so why invent another one? That's a totally valid point, but the advantage of heaps is that they keep the maximum/minimum element constantly at the top so you don't have to perform a full sort or even descend into the tree if you need just the top element. This simplification is especially relevant if we constantly need to access such elements as with priority queues.
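
For example, here is how the heap functions above can serve as a crude priority queue (the backing vector has to be adjustable and have a fill pointer, so that vector-push-extend and vector-pop work):

CL-USER> (defparameter *pq* (make-array 0 :adjustable t :fill-pointer t))
CL-USER> (dolist (prio '(3 1 4 1 5)) (heap-push prio *pq*))
CL-USER> (heap-pop *pq*)  ; the largest (highest-priority) item comes out first
5
CL-USER> (heap-pop *pq*)
4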

Actually, a heap should not necessarily be a tree. Besides the Binary Heap, there are also Binomial, Fibonacci and other kinds of heaps that may not even be single trees, but, rather, collections of trees (forests). We'll discuss some of them in more detail in the next chapters, in the context of the algorithms for which their use makes a notable difference in performance.

Tries

If I were to answer the question, what's the most underappreciated data structure, I'd probably say, a trie. For me, tries are a gift that keeps on giving, and they have already saved my programs' performance in a couple of situations that seemed hopeless. Besides, they are very simple to understand and implement.

A trie is also called a prefix tree. It is, usually, used to optimize dictionary storage and lookup when the dictionary has a lot of entries and there is some overlap between them. The most obvious example is a normal English language dictionary. A lot of words have common stems ("work", "word", "worry" all share the same beginning "wor"), there are many wordforms of the same word ("word", "words", "wording", "worded").

There are many approaches to trie implementation. Let's discuss the most straightforward and, so to say, primitive one. Here is a trie for representing a string dictionary that is character-based and uses an alist to store children pointers:

(defstruct (tr-node (:conc-name nil))
  val
  (children (list)))

(defun tr-lookup (key root)
  (dovec (ch key
             ;; when iteration terminates normally
             ;; we have found the node we were looking for
             (val root))
    (if-it (assoc1 ch (children root))
           (:= root it)
           (return))))

(defun tr-add (key val root)
  (let ((i 0))
    (dovec (ch key)
      (if-it (assoc1 ch (children root))
             (:= root it
                 i (1+ i))
             (return)))
    (if (= i (length key))
        ;; there was already something at key -
        ;; several actions might be taken:
        ;; signal an error (continuable), overwrite, abort
        (cerror "Assign a new value"
                "There was already a value at key: ~A" (val root))
        (dovec (ch (slice key i))
          (let ((child (make-tr-node)))
            (push (cons ch child) (children root))
            (:= root child))))
    (:= (val root) val)))

CL-USER> (defparameter *trie* (make-tr-node))
*TRIE*
CL-USER> *trie*
#S(TR-NODE :VAL NIL :CHILDREN NIL)

For the sake of brevity, we won't define a special print-function for our trie and will use a default one. In a real setting, though, it is highly advisable.

CL-USER> (tr-lookup "word" *trie*)
NIL
CL-USER> (tr-add "word" 42 *trie*)
42
CL-USER> *trie*
#S(TR-NODE
:VAL NIL
:CHILDREN ((#\w
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\o
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\r
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\d
. #S(TR-NODE
:VAL 42
:CHILDREN NIL)))))))))))))
CL-USER> (tr-lookup "word" *trie*)
42
CL-USER> (tr-add "word" :foo *trie*)

There was already a value at key: 42
[Condition of type SIMPLE-ERROR]

Restarts:
0: [CONTINUE] Assign a new value
1: [RETRY] Retry SLIME REPL evaluation request.
2: [*ABORT] Return to SLIME's top level.
3: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {100F6297C3}>)

Backtrace:
0: (TR-ADD "word" :FOO #S(TR-NODE :VAL 42 :CHILDREN NIL))
1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (TR-ADD "word" :FOO *TRIE*) #<NULL-LEXENV>)
2: (EVAL (TR-ADD "word" :FOO *TRIE*))
--more--

;;; Take the restart 0

:FOO
CL-USER> (tr-add "work" :bar *trie*)
:BAR
CL-USER> (tr-add "we" :baz *trie*)
:BAZ
CL-USER> *trie*
#S(TR-NODE
:VAL NIL
:CHILDREN ((#\w
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))
(#\o
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\r
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\k
. #S(TR-NODE
:VAL :BAR
:CHILDREN NIL))
(#\d
. #S(TR-NODE
:VAL :FOO
:CHILDREN NIL)))))))))))))

There are many ways to optimize this trie implementation. First of all, you can see that some space is wasted on intermediate nodes with no values. This is mended by Radix Trees (also known as Patricia Trees) that merge all intermediate nodes. I.e., our trie would change into the following more compact structure:

#S(TR-NODE
:VAL NIL
:CHILDREN ((#\w
. #S(TR-NODE
:VAL NIL
:CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))
("or" . #S(TR-NODE
:VAL NIL
:CHILDREN ((#\k
. #S(TR-NODE
:VAL :BAR
:CHILDREN NIL))
(#\d
. #S(TR-NODE
:VAL :FOO
:CHILDREN NIL)))))))))))))

Besides, there are ways to utilize an array to store the trie offsets (similar to heaps), instead of using a linked backbone for it. Such a variant is called a succinct trie. Also, there are compressed (C-tries), hash-array mapped (HAMTs), and other kinds of tries.

The main advantage of tries is efficient space usage thanks to the elimination of repetition in keys storage. In many scenarios, usage of tries also improves the speed of access. Consider the task of matching against a dictionary of phrases, for example, biological or medical terms, names of companies or works of art, etc. These are usually 2-3 words long phrases, but, occasionally, there may be an outlier of 10 or more words. The straightforward approach would be to put the dictionary into a hash-table, then iterate over the input string trying to find the phrases in the table, starting from each word. The only question is: where do we put an end of the phrase? As we said, the phrase may be from 1 to, say, 10 words in length. With a hash-table, we have to check every variant: a single-word phrase, a two-word one, and so on up to the maximum length. Moreover, if there are phrases with the same beginning, which is often the case, we'd do duplicate work of hashing that beginning, for each variant (unless we use an additive hash, but this isn't advised for hash-tables). With a trie, all the duplication is not necessary: we can iteratively match each word until we either find the match in the tree or discover that there is no continuation of the current subphrase.
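
To make the idea more concrete, here is a rough sketch of such matching. It assumes a trie built with the same tr-node struct but keyed by whole words instead of characters (words is the list of input tokens starting at the current position); the value of the longest matching phrase and the number of matched words are returned:

(defun longest-phrase-match (words root)
  (let ((node root)
        (rez nil)
        (matched 0))
    (loop :for word :in words
          :for i :from 1 :do
      (let ((child (assoc word (children node) :test 'string=)))
        (if child
            (progn (:= node (cdr child))
                   ;; remember the longest match seen so far
                   (when (val node)
                     (:= rez (val node)
                         matched i)))
            (return))))
    (values rez matched)))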

Trees in Action: Efficient Mapping

Finally, the last family of tree data structures I had to mention is trees for representing spatial relations. Overall, mapping and pathfinding is an area that prompted the creation of a wide range of useful algorithms and data structures. There are two fundamental operations for processing spatial data: nearest neighbor search and range queries. Given the points on the plane, how do we determine the closest points to a particular one? How do we retrieve all points inside a rectangle or a circle? A primitive approach is to loop through all the points and collect the relevant information, which results in at least O(n) complexity — prohibitively expensive if the number of points is beyond several tens or hundreds. And such problems, by the way, arise not only in the field of processing geospatial data (they are at the core of such systems as PostGIS, mapping libraries, etc.) but also in Machine Learning (for instance, the k-NN algorithm directly requires such calculations) and other areas.

A more efficient solution has an O(log n) complexity and is, as you might expect, based on indexing the data in a special-purpose tree. The changes to the tree will also have O(log n) complexity, while the initial indexing is O(n log n). However, in most of the applications that use this technique, changes are much less frequent than read operations, so the upfront cost pays off.

There is a number of trees that allow efficient storage of spatial data: segment trees, interval trees, k-d trees, R-trees, etc. The most common spatial data structure is an R-tree (rectangle-tree). It distributes all the points in an n-dimensional space (usually, n will be 2 or 3) among the leaves of the tree by recursively dividing the space into k rectangles holding roughly the same number of points until each tree node has at most k points. Let's say we have started from 1000 points on the plane and chosen k to be 10. In this case, the first level of the tree (i.e. children of the root) will contain 10 nodes, each one having as the value the dimensions of the rectangle that bounds approximately 100 points. Every node like that will have 10 more children, each one having around 10 points. Maybe, some will have more, and, in this case, we'll give those nodes 10 children each with, probably, 1 or 2 points in the rectangles they will command. Now, we can perform a range search with the obtained tree by selecting only the nodes that intersect the query rectangle. For a small query box, this approach will result in the discarding of the majority of the nodes at each level of the tree. So, a range search over an R-tree has O(k log n) where k is the number of intersecting rectangles.
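
To give a feel for the mechanics, here is a simplified sketch of such a range query. The node structure and the rectangle representation (a list of x1 y1 x2 y2) are assumptions made up for this illustration and not part of this chapter's code:

(defstruct rt-node
  rect      ; the bounding rectangle: a list (x1 y1 x2 y2)
  children  ; a list of child rt-nodes (nil for leaf nodes)
  points)   ; the points (x y) stored in a leaf node

(defun rect-intersect-p (r1 r2)
  (destructuring-bind (ax1 ay1 ax2 ay2) r1
    (destructuring-bind (bx1 by1 bx2 by2) r2
      (and (<= ax1 bx2) (<= bx1 ax2)
           (<= ay1 by2) (<= by1 ay2)))))

(defun point-in-rect-p (point rect)
  (destructuring-bind (x y) point
    (destructuring-bind (x1 y1 x2 y2) rect
      (and (<= x1 x x2) (<= y1 y y2)))))

(defun rt-range-query (query node)
  "Collect all the points of NODE's subtree that lie inside the QUERY rectangle."
  (when (rect-intersect-p query (rt-node-rect node))
    (if (rt-node-children node)
        ;; only the intersecting children are descended into
        (loop :for child :in (rt-node-children node)
              :append (rt-range-query query child))
        (remove-if-not (lambda (point) (point-in-rect-p point query))
                       (rt-node-points node)))))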

Now, let's consider neighbor search. Obviously, the closest points to a particular one we are examining lie either in the same rectangle as the point or in the closest ones to it. So, we need to, first, find the smallest bounding rectangle, which contains our point, perform the search in it, then, if we haven't got enough points yet, process the siblings of the current tree node in the order of their proximity to it.

There are many other spatial problems that may be efficiently solved with this approach. One thing to note is that the described procedures require the tree to store, in the leaf nodes, references to every point contained in their rectangles.

Take-aways

So, balancing a tree isn't such a unique and interesting task. On the contrary, it's quite simple yet boring due to the number of edge cases you have to account for. Yet, we have just scratched the surface of the general topic of trees. It is vast: the Wikipedia section for tree data structures contains almost 100 of them and it's, definitely, not complete. Moreover, new tree variants will surely be invented in the future. But you will hardly deal with more than just a few variants during the course of your career, spending the majority of time with the simple "unconstrained" trees. And we have seen, in action, the basic principles of tree operation that will be helpful, in the process.

There's a couple of other general observations about programming algorithms we can draw from this chapter:

  1. Trees are very versatile data structures that are a default choice when you need to represent some hierarchy. They are also one of a few data structures for which recursive processing is not only admissible but also natural and efficient.
  2. Visualization is key to efficient debugging of complex data structures. Unfortunately, it's hard to show in the book, but I have spent several hours on the code for the splay tree, and, without an efficient way to display the trees coupled with dynamic tracing, I would probably have spent twice as much. Both the print-function for the individual nodes and pprint-bst were helpful here.

[1] This statement is strictly true for balanced trees, but, even for imbalanced trees, such an estimation is usually correct.

[2] Although it was shown that this value may also be reduced to a single bit using a clever implementation trick.

Lispers.deBerlin Lispers Meetup, Monday, 30th September 2019

· 70 days ago

We meet again on Monday 8pm, 30th September. Our host this time is James Anderson (www.dydra.com).

Berlin Lispers is about all flavors of Lisp including Common Lisp, Clojure and Scheme.

Max-Gerd Retzlaff will talk about his art installation "Other Dimension" [1] as part of the "Gewebtes Licht" exhibition in 2011 and how he used Embeddable Common Lisp for it.

[1] http://other-dimension.net/ in collaboration with Alex Wenger

We meet in the Taut-Haus at Engeldamm 70 in Berlin-Mitte, the bell is "James Anderson". It is located in 10min walking distance from U Moritzplatz or U Kottbusser Tor or Ostbahnhof. In case of questions call Christian +49 1578 70 51 61 4.

Vsevolod DyomkinProgramming Algorithms: Hash-Tables

· 86 days ago

Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast the list of candidate structures for implementing key-values may be, hash-tables remain the default choice.

Also, hash-sets, in general, serve as the main representation for medium and large-sized sets as they ensure O(1) membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with t for all values.

Implementation

The basic properties of hash-tables are average O(1) access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudo-random way — hashes them. Technically, the keys should pertain to the domain that allows hashing, but, in practice, it is always possible to ensure either directly or by using an intermediate transformation. The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:

  1. As the backing array has a limited number of cells (n), the function should produce values in the interval [0, n). This limitation can be respected by a 2-step process, sketched right after this list: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by n.
  2. Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close, into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".
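
A minimal sketch of that 2-step process, using the standard sxhash as the "arbitrary range" hash function:

(defun hash-index (key array-size)
  ;; first, obtain some integer hash of the key,
  ;; then bring it into the interval [0, array-size)
  (rem (sxhash key) array-size))

The same key always lands at the same index, e.g. (hash-index "foo" 16) yields some fixed number below 16.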

Dealing with Collisions

Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations that we have found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.

For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us (/ (* 20 19) 2), i.e. 190 pairs. We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is (/ (- 365 1) 365): there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in (* (/ 364 365) (/ (- 365 2) 365)). The value (- 365 2) refers to the third person not having a birthday intersection with neither the first nor the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can count the number for 20 persons:

(defun birthday-collision-prob (n)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- 365 i) 365)))
    ;; don't forget that we want the complement
    ;; of the probability of no collisions,
    ;; hence (- 1.0 ...)
    (- 1.0 rez)))

CL-USER> (birthday-collision-prob 20)
0.4114384

So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!
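
These numbers are easy to verify with the function we have just defined (the listed values are approximate):

(birthday-collision-prob 23)  ; => ~0.507
(birthday-collision-prob 30)  ; => ~0.706
(birthday-collision-prob 70)  ; => ~0.999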

But why, on Earth, you could ask, have we started to discuss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula of the probability of at least one collision among the hashed keys in an array, provided the hash function produces perfectly uniform output. (It will be even higher if the distribution is non-uniform).

(defun hash-collision-prob (n size)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- size i) size)))
    (- 1.0 rez)))

Let's say, we have 10 keys. What should be the array size to be safe against collisions?

CL-USER> (hash-collision-prob 10 10)
0.9996371

99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?

CL-USER> (hash-collision-prob 10 20)
0.9345271

93%. Still, pretty high.

CL-USER> (hash-collision-prob 10 100)
0.37184352
CL-USER> (hash-collision-prob 10 10000)
0.004491329

So, if we were to use a 10k-element array to store 10 items the chance of a collision would fall below 1%. Not practical...

Note that the number depends on both arguments, so (hash-collision-prob 10 100) (0.37) is not the same as (hash-collision-prob 20 200) (0.63).

We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke-test of our novel algorithmic ideas: before we go full-speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.

Now, let's discuss what difference the presence of these collisions makes to our hash-table idea and how to deal with this issue. The obvious solution is to have a fallback option: when two keys hash to the same index, store both of the items in a list. The retrieval operation, in this case, will require a sequential scan to find the requested key and return the corresponding value. Such an approach is called "chaining" and it is used by some implementations. Yet, it has a number of drawbacks:

  • It complicates the implementation: we now have to deal with both a static array and a dynamic list/array/tree. This change opens a possibility for some hard-to-catch bugs, especially, in the concurrent settings.
  • It requires more memory than the hash-table backing array, so we will be in a situation when some of the slots of the array are empty while others chain several elements.
  • It will have poor performance due to the necessity of dealing with a linked structure and, what's worse, not respecting cache locality: the chain will not fit in the original array so at least one additional RAM round-trip will be required.

One upside of this approach is that it can store more elements than the size of the backing array. And, in the extreme case, it degrades to bucketing: when a small number of buckets point to long chains of randomly shuffled elements.

The more widely-used alternative to chaining is called "open addressing" or "closed hashing". With it, the chains are, basically, stored in the same backing array. The algorithm is simple: when the calculated hash is pointing at an already occupied slot in the array, find the next vacant slot by cycling over the array. If the table isn't full, we're guaranteed to find one. If it is full, we need to resize it, first. Now, when the element is retrieved by key, we need to perform the same procedure: calculate the hash, then compare the key of the item at the returned index. If the keys are the same, we've found the desired element, otherwise — we need to cycle over the array comparing keys until we encounter the item we need.

Here's an implementation of the simple open addressing hash-table using eql for keys comparison:

(defstruct ht
  array
  (count 0))

(defun ht (&rest kvs)
  (let ((rez (make-ht :array (make-array 16 :initial-element nil))))
    (loop :for (k v) :in kvs :do
      (ht-add k v rez))
    rez))

(defun ht-get (key ht)
  (with ((size (length (ht-array ht)))
         ;; HASH is assumed to be some function
         ;; that maps arbitrary keys to integers
         (start (rem (hash key) size)))
    (do* ((count 0 (1+ count))
          (i start (rem (1+ i) size))
          (item (? ht 'array start)
                (? ht 'array i)))
         ((or (null item)
              (= count size)))
      (when (eql key (car item))
        (return
          (values (cdr item)
                  ;; the second value is an index, at which
                  ;; the item was found (also used to distinguish
                  ;; the value nil from not found, which is also
                  ;; represented by nil but without the second value)
                  i))))))

(defun ht-add (key val ht)
  (with ((array (ht-array ht))
         (size (length array)))
    ;; flet defines a local function that has access
    ;; to the local variables defined in HT-ADD
    (flet ((add-item (k v)
             (do ((i (rem (hash k) size)
                     (rem (1+ i) size)))
                 ((null (? ht 'array i))
                  (:= (? ht 'array i) (cons k v)))
               ;; this do-loop doesn't have a body
               )))
      ;; the COUNT slot tracks the number of stored items
      (when (= (ht-count ht) size)
        ;; when the backing array is full
        ;; expand it to have the length equal to the next power of 2
        (:= size (expt 2 (ceiling (log (1+ size) 2)))
            (ht-array ht) (make-array size :initial-element nil))
        ;; and re-add its contents
        (dovec (item array)
          (when item
            (add-item (car item) (cdr item)))))
      ;; finally, add the new item
      (add-item key val)
      (:+ (ht-count ht)))))

(defun ht-rem (key ht)
  ;; here, we use the index of the item
  ;; returned as the 2nd value of HT-GET
  ;; (NTH-VALUE is zero-indexed, so the 2nd value is number 1;
  ;; when-it is a so-called anaphoric macro, from RUTILS,
  ;; that assigns the value of its first argument
  ;; to an implicitly created variable IT
  ;; and evaluates the body when IT isn't null)
  (when-it (nth-value 1 (ht-get key ht))
    (void (? ht 'array it))
    (:= (ht-count ht) (1- (ht-count ht)))
    ;; return the index to indicate that the item was found
    it))

To avoid constant resizing of the hash-table, just as with dynamic arrays, the backing array is usually allocated to have a size equal to a power of 2: 16 elements, to begin with. When it is filled up to a certain capacity, it is resized to the next power of 2: 32, in this case. Usually, around 70-80% is considered peak occupancy, as too many collisions happen beyond that and table access performance severely degrades. In practice, this means that normal open-addressing hash-tables also waste from 20 to 50 percent of allocated space. This inefficiency becomes a serious problem with large tables, so other implementation strategies become preferable when the size of data reaches tens and hundreds of megabytes. Note that, in our trivial implementation above, we have, effectively, used the threshold of 100% to simplify the code. Adding a configurable threshold is just a matter of introducing a parameter and initiating resizing not when (= (ht-count ht) size) but upon (>= (ht-count ht) (floor (* size threshold))). As we've seen, resizing the hash-table requires calculating the new indices for all stored elements and adding them anew into the resized array.
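A minimal sketch of such a parametrized check (the *resize-threshold* parameter is hypothetical and not part of the implementation above):

(defparameter *resize-threshold* 0.75)

(defun ht-needs-resize-p (ht)
  ;; resize once the count reaches the given fraction of the backing array
  (>= (ht-count ht)
      (floor (* (length (ht-array ht)) *resize-threshold*))))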

Analyzing the complexity of the access function of the hash-table and proving that it is amortized O(1) isn't trivial. It depends on the properties of the hash-function, which should ensure good uniformity. Besides, the resizing threshold also matters: the more elements are in the table, the higher the chance of collisions. Also, you should keep in mind that if the keys possess some strange qualities that prevent them from being hashed uniformly, the theoretical results will not hold.

In short, if we consider a hash-table with 60% occupancy (which should be the average number, for a common table) we end up with the following probabilities:

  • probability that we'll need just 1 operation to access the item (i.e. the initially indexed slot is empty): 0.4
  • probability that we'll need 2 operations (the current slot is occupied, the next one is empty): (* 0.6 0.4) — 0.24
  • probability that we'll need 3 operations: (* (expt 0.6 2) 0.4) — 0.14
  • probability that we'll need 4 operations: (* (expt 0.6 3) 0.4) — 0.09

Actually, these calculations are slightly off and the correct probability of finding an empty slot should be somewhat lower, although the larger the table is, the smaller the deviation in the numbers. Finding out why is left as an exercise for the reader :)

As you see, there's a progression here. With probability around 0.87, we'll need no more than 4 operations. Without continuing with the arithmetic, I think, it should be obvious that we'll need, on average, around 3 operations to access each item and the probability that we'll need twice as many (6) is quite low (below 5%). So, we can say that the number of access operations is constant (i.e. independent of the number of elements in the table) and is determined only by the occupancy percent. So, if we keep the occupancy in the reasonable bounds, named earlier, on average, 1 hash code calculation/lookup and a couple of retrievals and equality comparisons will be needed to access an item in our hash-table.
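These numbers are easy to recompute for other occupancy levels. Here is a small helper (a standalone sketch using the same simplified model as the calculation above, not part of the hash-table implementation):

(defun expected-probes (occupancy &key (max-probes 100))
  ;; probability that exactly K probes are needed:
  ;; (K-1) occupied slots followed by one empty slot
  (loop :for k :from 1 :to max-probes
        :sum (* k (expt occupancy (1- k)) (- 1 occupancy))))

;; (expected-probes 0.6) => ~2.5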

Hash-Code

So, we can conclude that a hash-table is primarily parametrized by two things: the hash-function and the equality predicate. In Lisp, in particular, there's a choice of just the four standard equality predicates: eq, eql, equal, and equalp. It's somewhat of a legacy that you can't use other comparison functions, so some implementations, as an extension, allow the programmer to specify other predicates. However, in practice, the following approach is sufficient for the majority of the hash-table use cases (see the short example after the list):

  • use the eql predicate if the keys are numbers, characters, or symbols
  • use equal if the keys are strings or lists of the mentioned items
  • use equalp if the keys are vectors, structs, CLOS objects or anything else containing one of those
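For instance, with the standard make-hash-table, the test is chosen at construction time:

;; keys are symbols or numbers: EQL (the default) is enough
(defparameter *counts* (make-hash-table))

;; keys are strings: EQUAL is needed for structural comparison
(defparameter *index* (make-hash-table :test 'equal))

;; keys are vectors or structs: EQUALP (with the caveats discussed below)
(defparameter *cache* (make-hash-table :test 'equalp))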

But I'd recommend trying your best to avoid using complex keys that require equalp. Besides the performance penalty of using the heaviest equality predicate, which performs deep structural comparison, structs and vectors, in particular, will most likely hash to the same index. Here is a quote from one of the implementors describing why this happens:

Structs have no extra space to store a unique hash code within them. The decision was made to implement this because automatic inclusion of a hashing slot in all structure objects would have made all structs an average of one word longer. For small structs this is unacceptable. Instead, the user may define a struct with an extra slot, and the constructor for that struct type could store a unique value into that slot (either a random value or a value gotten by incrementing a counter each time the constructor is run). Also, create a hash generating function which accesses this hash-slot to generate its value. If the structs to be hashed are buried inside a list, then this hash function would need to know how to traverse these keys to obtain a unique value. Finally, then, build your hash-table using the :hash-function argument to make-hash-table (still using the equal test argument), to create a hash-table which will be well-distributed. Alternatively, and if you can guarantee that none of the slots in your structures will be changed after they are used as keys in the hash-table, you can use the equalp test function in your make-hash-table call, rather than equal. If you do, however, make sure that these struct objects don't change, because then they may not be found in the hash-table.

But what if you still need to use a struct or a CLOS object as a hash key (for instance, if you want to put them in a set)? There are three possible workarounds:

  • Choose one of their slots as a key (if you can guarantee its uniqueness).
  • Add a special slot to hold a unique value that will serve as a key.
  • Use the literal representation obtained by calling the print-function of the object. Still, you'll need to ensure that it will be unique and constant. Using an item that changes while being the hash key is a source of very nasty bugs, so avoid it at all cost.

These considerations are also applicable to the question of why Java requires defining both equals and hashCode methods for objects that are used as keys in the hash-table or hash-set.
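As an illustration of the second workaround, here is a sketch (the counter-based id scheme and all the names are mine, chosen just for the example):

(defvar *point-id-counter* 0)

(defstruct point
  x y
  ;; a unique, never-changing value assigned at construction time
  (id (incf *point-id-counter*)))

;; the table is keyed by the integer ids, so the default EQL test suffices
(defparameter *point-data* (make-hash-table))

(defun add-point-data (point data)
  (:= (? *point-data* (point-id point)) data))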

Advanced Hashing Techniques

Beyond the direct implementation of open addressing, called "linear probing" (for it tries to resolve collisions by performing a linear scan for an empty slot), a number of approaches were proposed to improve hash distribution and reduce the collision rate. However, for the general case, their superiority remains questionable, so the utility of a particular approach has to be tested in the context of situations where linear probing demonstrates suboptimal behavior. One such situation occurs when the hash-codes become clustered around some locations due to deficiencies of either the hash-function or the keyset.

The simplest modification of linear probing is called "quadratic probing". It operates by performing the search for the next vacant slot using offsets that are the squares of the linear-probing ones (or of some other sequence). I.e. if, with linear probing, the offset sequence is 1, 2, 3, etc., with the quadratic one it is 1, 4, 9, ... "Double hashing" is another simple alternative, which, instead of a linear sequence of offsets, calculates the offsets using another hash-function. This approach makes the sequence specific to each key, so the keys that map to the same location will have different possible variants of collision resolution. "2-choice hashing" also uses 2 hash-functions but selects the particular one for each key based on the distance the item has to be moved from its original index for collision resolution.
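A sketch of how the probe index could be computed for these variants (hash and hash2 stand for two assumed hash-functions, attempt is the number of the current probing attempt, and size is the length of the backing array):

(defun linear-probe (key attempt size)
  (rem (+ (hash key) attempt) size))

(defun quadratic-probe (key attempt size)
  (rem (+ (hash key) (* attempt attempt)) size))

(defun double-hash-probe (key attempt size)
  ;; the step between probes is derived from a second hash-function,
  ;; so it differs from key to key
  (rem (+ (hash key) (* attempt (1+ (rem (hash2 key) (1- size)))))
       size))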

More elaborate changes to the original idea are proposed in Cuckoo, Hopscotch, and Robin Hood hashing, to name some of the popular alternatives. We won't discuss them now, but if the need arises to implement a non-standard hash-table, it's worth studying all of those before proceeding with an idea of your own. Although, who knows, someday you might come up with a viable alternative technique, as well...

Hash-Functions

The class of possible hash-functions is very diverse: any function that sufficiently randomizes the key hashes will do. But what does "good enough" mean? One of the ways to find out is to look at the pictures of the distribution of hashes. Yet, there are other factors that may condition the choice: speed, complexity of implementation, and collision resistance (important for cryptographic hashes that we won't discuss in this book).

The good news is that, for most practical purposes, there's a single function that is both fast and easy to implement and understand. It is called FNV-1a.

(defparameter *fnv-primes*
  '((32 . 16777619)
    (64 . 1099511628211)
    (128 . 309485009821345068724781371)
    (256 . 374144419156711147060143317175368453031918731002211)))

(defparameter *fnv-offsets*
  '((32 . 2166136261)
    (64 . 14695981039346656037)
    (128 . 144066263297769815596495629667062367629)
    (256 . 100029257958052580907070968620625704837092796014241193945225284501741471925557)))

(defun fnv-1a (x &key (bits 32))
  (assert (member bits '(32 64 128 256)))
  (let ((rez (assoc1 bits *fnv-offsets*))
        (prime (assoc1 bits *fnv-primes*)))
    (dotimes (i (/ bits 8))
      (:= rez (ldb (byte bits 0)
                   (* (logxor rez (ldb (byte 8 (* i 8)) x))
                      prime))))
    rez))

The constants *fnv-primes* and *fnv-offsets* are precalculated up to 1024 bits (here, I used just a portion of the tables).

Note that, in this implementation, we use normal Lisp multiplication (*), which is not limited to fixed-size numbers (32-bit, 64-bit, etc.), so we need to mask off all but the lower bits with ldb.

Also note that if you were to calculate FNV-1a with some online hash calculator you'd, probably, get a different result. Experimenting with it, I noticed that it is the same if we use only the non-zero bytes from the input number. This observation aligns well with calculating the hash for simple strings when each character is a single byte. For them the hash-function would look like the following:

(defun fnv-1a-str (str)
  (let ((rez (assoc1 32 *fnv-offsets*))
        (prime (assoc1 32 *fnv-primes*)))
    (dovec (char str)
      ;; each character is treated as a single byte here
      (:= rez (ldb (byte 32 0)
                   (* (logxor rez (ldb (byte 8 0) (char-code char)))
                      prime))))
    rez))

So, even such a simple hash-function has nuances in its implementation and it should be meticulously checked against some reference implementation or a set of expected results.

Alongside FNV-1a, there's also FNV-1, a slightly worse variation that may be used if we need to apply 2 different hash-functions at once (like in 2-choice or double hashing).
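The only difference between the two is the order of operations: FNV-1 multiplies first and xors afterward, while FNV-1a xors first. A sketch of the string version, mirroring fnv-1a-str above:

(defun fnv-1-str (str)
  ;; same constants as FNV-1a, but multiply-then-xor instead of xor-then-multiply
  (let ((rez (assoc1 32 *fnv-offsets*))
        (prime (assoc1 32 *fnv-primes*)))
    (dovec (char str)
      (:= rez (logxor (ldb (byte 32 0) (* rez prime))
                      (ldb (byte 8 0) (char-code char)))))
    rez))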

What is the source of the hashing property of FNV-1a? Xors and multiplications (modulo a power of 2). Combining these simple and efficient operations is enough to create the desired level of randomization. Most of the other hash-functions use the same building blocks as FNV-1a. They all perform arithmetic (usually, addition and multiplication, as division is slow) and xor'ing, adding into the mix some prime numbers. For instance, here's what the code for another popular hash-function "djb2" approximately looks like:

(defun djb2-str (str)
  (let ((rez 5381))  ; a DJB2 prime number
    (loop :for char :across str :do
      (:= rez (ldb (byte 32 0)
                   (+ (char-code char)
                      (ldb (byte 32 0)
                           (+ (ash rez 5)
                              rez))))))
    rez))

Operations

The generic key-value operations we have discussed in the previous chapter obviously also apply to hash-tables. There are also specific low-level ones, defined by the Lisp standard. And it's worth mentioning that, with regard to hash-tables, I find the standard quite lacking, so a lot of utilities were added as part of RUTILS. The reason for the deficiency in the standard is, I believe, that when hash-tables were added to Lisp, they were still a pretty novel technology not widely adopted in the programming-languages community. So there was neither significant experience using them nor a good understanding of the important role they would come to play. Languages such as Python or Clojure, as well as the ones designed even later, were developed with this knowledge already in mind. Yet, this situation doesn't pose an insurmountable difficulty for Lisp users, as the language provides advanced extension tools such as macros and reader macros, so the necessary parts can be added and, in fact, exist as 3rd-party extensions. Using them becomes just a question of changing your habits and adapting to more efficient approaches. The situation is different for the users of many other languages, such as Java, who had to wait for a new major version of the language to get access to such things as literal hash-table initialization, a feature I consider crucially important for code clarity in the declarative paradigm.

Initialization

Normally, the hash-table can be created with make-hash-table, which has a number of configuration options, including :test (default: eql). Most of the implementations allow the programmer to make synchronized (thread-safe) hash-tables via another configuration parameter, but the variants of concurrency control will differ.

Yet, it is important to have a way to define hash-tables already pre-initialized with a number of key-value pairs, and make-hash-table can't handle this. Pre-initialized hash tables represent a common necessity for tables serving as dictionaries, and such pre-initialization greatly simplifies many code patterns. Thus RUTILS provides such a syntax (in fact, in 2 flavors) with the help of reader macros:

#{equal "foo" :bar "baz" 42}
#h(equal "foo" :bar "baz" 42)

Both of these expressions will expand into a call to make-hash-table with the equal test and two calls to a set operation to populate the table with the kv-pairs "foo" :bar and "baz" 42 (a sketch of the expansion is shown below). For this to work, you need to switch to the appropriate readtable by executing: (named-readtables:in-readtable rutils-readtable).
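Based on the reader-macro implementation shown below, the expansion looks approximately like this (the variable ht is actually a gensym in the real expansion):

(let ((ht (make-hash-table :test 'equal)))
  (:= (? ht "foo") :bar)
  (:= (? ht "baz") 42)
  ht)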

The reader-macro to parse #h()-style hash-table literals isn't very complicated. As all reader-macros, it operates on the character stream of the program text, processing one character at a time. Here is its implementation:

(defun |#h-reader| (stream char arg)
  (declare (ignore char arg))
  (read-char stream)  ; skip the open paren
  ;; we can also add a sanity check to ensure that this character
  ;; is indeed a #\(
  (with (;; read-delimited-list is a standard library function
         ;; that reads items until a delimiter is encountered
         ;; and then returns them as a list of parsed Lisp objects
         (sexp (read-delimited-list #\) stream t))
         ;; the idea is that the first element may be a hash-table
         ;; test function; in this case, the number of items in the
         ;; definition will be odd as each key-value pair should have
         ;; an even number of elements
         (test (when (oddp (length sexp))
                 (first sexp)))
         ;; the rest of the values, after the possible test function,
         ;; are key-value pairs
         (kvs (group 2 (if test (rest sexp) sexp)))
         (ht (gensym)))
    `(let ((,ht (make-hash-table :test ',(or test 'eql))))
       ;; for each key-value pair, generate an expression
       ;; to add the value for the key in the resulting hash-table
       ,@(mapcar (lambda (kv)
                   `(:= (? ,ht ,(first kv)) ,(second kv)))
                 kvs)
       ,ht)))

After such a function is defined, it can be plugged into the standard readtable:

(set-dispatch-macro-character #\# #\h '|#h-reader|)

Or it may be used in a named-readtable (you can learn how to do that, from the docs).

print-hash-table is the utility to perform the reverse operation: display hash-tables in a similar manner:

RUTILS> (print-hash-table #h(equal "foo" :bar "baz" 42))
#{EQUAL
"foo" :BAR
"baz" 42
}
#<HASH-TABLE :TEST EQUAL :COUNT 2 {10127C0003}>

The last line of the output is the default Lisp printed representation of the hash-table. As you see, it is opaque and doesn't display the elements of the table. RUTILS also allows switching to printing the literal representation instead of the standard one with the help of toggle-print-hash-table. However, this extension is intended only for debugging purposes as it is not fully standard-conforming.

Access

Accessing the hash-table elements is performed with gethash, which returns two values: the value at the key and t when the key was found in the table, or two nils otherwise. By using (:= (gethash key ht) val) (or (:= (? ht key) val)) we can modify the stored value. Notice the reverse order of arguments of gethash compared to the usual order in most accessor functions, where the structure is placed first and the key second. Also, gethash differs from the generic ? in that it accepts an optional argument that is used as the default value if the requested key is not present in the table. In some languages, like Python, there's a notion of "default hash-tables" that may be initialized with a common default element. In Lisp, a different approach is taken. However, it's possible to easily implement default hash-tables and plug them into the generic-elt mechanism:

(defstruct default-hash-table
  table
  default-value)

(defun gethash-default (key ht)
  (gethash key (? ht 'table) (? ht 'default-value)))

(defmethod generic-elt ((kv default-hash-table) key &rest keys)
  (gethash-default key kv))
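Usage might look like this (a sketch; the 0 default emulates a counting table in the style of Python's defaultdict):

(defparameter *word-counts*
  (make-default-hash-table :table (make-hash-table :test 'equal)
                           :default-value 0))

(defun count-word (word)
  ;; a missing key reads as 0, so the first occurrence yields 1
  (:= (? (default-hash-table-table *word-counts*) word)
      (1+ (gethash-default word *word-counts*))))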

RUTILS also defines a number of aliases/shorthands for hash-table operations. As the # symbol is etymologically associated with hashes, it is used in the names of all these functions:

  • get# is a shorthand and a more distinctive alias for gethash
  • set# is an alias for (:= (gethash ...
  • getset# is an implementation of the common pattern: it either retrieves the value, if the key is found in the table, or evaluates its third argument, sets it for the given key for future retrieval, and returns it (see the sketch after this list)
  • rem# is an alias for remhash (remove the element from the table)
  • take# both returns the value and removes it from the table (unlike rem#, which only removes)
  • in# tests for the presence of the key in the table
  • also, p# is an abbreviated version of print-hash-table
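As a sketch of the pattern that getset# captures (an illustration with plain gethash, not the RUTILS source):

(defmacro get-or-set (key table default-form)
  ;; return the cached value if the key is present,
  ;; otherwise evaluate DEFAULT-FORM, cache it, and return it
  (let ((k (gensym)) (tab (gensym)) (val (gensym)) (foundp (gensym)))
    `(let ((,k ,key)
           (,tab ,table))
       (multiple-value-bind (,val ,foundp) (gethash ,k ,tab)
         (if ,foundp
             ,val
             (:= (gethash ,k ,tab) ,default-form))))))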

Iteration

Hash-tables are, in principle, unordered collections. But, still, there is always a way to iterate over them in some (unspecified) order. The standard utilities for that are maphash, which, unlike map, doesn't populate a resulting collection and is called just for the side effects, and the special loop syntax (both are shown in the snippet after the list below). Both are suboptimal, from several points of view, so RUTILS defines a couple of alternative options:

  • dotable functions in the same manner as dolist except that it uses two variables: for the key and the value
  • mapkv, mentioned in the previous chapter, works just like mapcar by creating a new result table with the same configuration as the hash-table it iterates over and assigns the results of invoking the first argument — the function of two elements — with each of the kv-pairs
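For reference, here are the two standard options (assuming ht is bound to some hash-table):

;; iterating a hash-table with MAPHASH, for side effects only
(maphash (lambda (key val)
           (format t "~A -> ~A~%" key val))
         ht)

;; the equivalent LOOP syntax
(loop :for key :being :the :hash-keys :of ht :using (hash-value val)
      :do (format t "~A -> ~A~%" key val))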

Despite the absence of a predefined ordering, there are ways in which some order may be introduced. For example, in SBCL, the order in which the elements were added is preserved by using additional vectors called index-vector and next-vector that store this information. Another option, which allows forcing an arbitrary ordering, is to use a so-called Linked Hash-Table. It is a combination of a hash-table and a linked list: each key-value pair also has a next pointer, which links it to some other item in the table. This way, it is possible to have ordered key-values without resorting to tree-based structures. A poor man's linked hash-table can be created on top of a normal one with the following trick: substitute the values by pairs containing the value plus a pointer to the next pair, and keep track of the pointer to the first pair in a special slot.

(defstruct linked-hash-table-item
  key
  val
  next)

(defstruct linked-hash-table
  table
  head
  tail)

(defun gethash-linked (key ht)
  (? ht 'table key 'val))

(defun sethash-linked (key ht val)
  ;; The initial order of items is the order of addition.
  ;; If we'd like to impose a different order,
  ;; we'll have to perform reordering after each addition
  ;; or implement a custom sethash function.
  (with (((table head tail) ? ht)
         (cur (gethash key table)))
    (if cur
        (:= (? cur 'val) val)
        (let ((new (make-linked-hash-table-item
                    :key key :val val)))
          ;; the new item has to be registered in the backing table
          (:= (gethash key table) new)
          ;; and linked into the list
          (when (null head)
            (:= (? ht 'head) new))
          (:= (? ht 'tail)
              (if tail
                  (:= (? ht 'tail 'next) new)
                  new))))))

(defmethod mapkv (fn (ht linked-hash-table))
  (with ((rez (make-linked-hash-table
               :table (make-hash-table
                       :test (hash-table-test (? ht 'table))))))
    ;; traverse the items in their insertion order
    (do ((item (? ht 'head) (? item 'next)))
        ((null item))
      (sethash-linked (? item 'key) rez
                      (call fn (? item 'key) (? item 'val))))
    rez))

The issue with this approach, as you can see from the code, is that we also need to store the key, and it duplicates the data also stored in the backing hash-table itself. So, an efficient linked hash-table has to be implemented from scratch using an array as a base instead of a hash-table.

Perfect Hashing

In the previous exposition, we have concluded that using hash-tables implies a significant level of reserved unused space (up to 30%) and inevitable collisions. Yet, if the keyset is static and known beforehand, we can do better: find a hash-function that will exclude collisions (simple perfect hashing) and even totally get rid of the reserved space (minimal perfect hashing, MPH). Although the latter variant will still need extra space to store additional information about the hash-functions, it may be much smaller: in some methods, down to ~3-4 bits per key, i.e. just 5-10% overhead. Statistically speaking, constructing such a hash-function is possible, but the search for its parameters may require some trial and error.

Implementation

The general idea is simple, but how do we find the appropriate hash-function? There are several approaches described in sometimes hard-to-follow scientific papers and a number of cryptic programs in low-level C libraries. At a certain point in time, I needed to implement some variant of an MPH, so I read those papers and studied the libraries to some extent. Not the most pleasant process, I should confess. One of my twitter pals once wrote: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And, although he put a negative connotation on it, I recognized the statement as a very precise description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and, as a possible byproduct, into an explanation ("blog post") that other engineers will understand and be able to reproduce. That's not a skill every software developer can be expected to have. Not all papers can even be reproduced, because the experiment was not set up correctly, some parts of the description are missing, the data is not available, etc. Of those that can be, in principle, only some are presented in a form that is clear enough to be reliably programmed.

Here is one of the variants of minimal perfect hashing that possesses such qualities. It works for datasets of any size as a 3-step process:

  1. At the first stage, by the use of a common hash-function (in particular, the Jenkins hash), all keys are near-uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. It can be achieved with very high probability if the hash divisor is set to (ceiling (length keyset) 200). This allows the algorithm to work for data sets of arbitrary size, thereby reducing the problem to a simpler one that already has a known solution.
  2. Next, for each bucket, the perfect hash function is constructed. This function is a table (and it's an important mathematical fact that each discrete function is equivalent to a table, albeit, potentially, of unlimited length). The table contains byte-sized offsets for each hash code, calculated by another application of the Jenkins hash, which produces two values in one go (actually, three, but one of them is not used). The divisor of the hash-function, this time, equals double the number of elements in the bucket. And the uniqueness requirement is that the sum of the offsets corresponding, in the table, to the two values produced by the Jenkins hash is unique, for each key. To check if the constraint is satisfied, the hashes are treated as vertices of a graph, and if it happens to be acyclic (the probability of this event is quite high if the parameters are chosen properly), the requirement can be satisfied, and it is possible to construct the perfect hash function by the process described in the next step. Otherwise, we change the seed of the Jenkins hash and try again until the resulting graph is acyclic. In practice, just a couple of tries are needed.
  3. Finally, the hash-function for the current bucket may be constructed from the graph by the CHM92 algorithm (named after the authors and the year of the paper), which is another version of perfect hashing but suitable only for limited keysets. Here, you can see the CHM92 formula implemented in code:
(defstruct mpht
  (data nil :type simple-vector)
  (offsets nil :type (simple-array octet))
  (meta nil :type (simple-array quad))
  (div nil))

;; div is the divisor of the top-level hash, which is calculated as:
;; (/ (1- (length meta)) 2)

(defun mpht-index (item mpht)
  (with (((offsets meta div) ? mpht)
         (bucket-id (* (mod (jenkins-hash item) div) 2))
         (bucket-offset (? meta bucket-id))
         (bucket-seed (? meta (+ 1 bucket-id)))
         ;; the number of items in the bucket is calculated
         ;; by subtracting the offset of the current bucket
         ;; from the offset of the next one
         (bucket-count (- (? meta (+ 2 bucket-id))
                          bucket-offset))
         ;; the per-bucket hash divisor is double the number
         ;; of elements in the bucket (see step 2 above)
         (bucket-div (* 2 bucket-count))
         (hash1 hash2 (jenkins-hash2 item bucket-seed bucket-div))
         (base (* bucket-offset 2)))
    (+ bucket-offset (mod (+ (? offsets (+ base hash1))
                             (? offsets (+ base hash2)))
                          bucket-count))))

This algorithm guarantees exactly O(1) hash-table access and uses 2 bytes per key, i.e. it will result in a constant 25% overhead on the table's size (in a 64-bit system): 2 byte-sized offsets for the hashes plus negligible 8 bytes per bucket (each bucket contains ~200 elements) for meta information. Better space-utilization solutions (up to 4 times more efficient) exist, but they are harder to implement and explain.

The Jenkins hash-function was chosen for two reasons:

  • Primarily, because, being a relatively good-quality hash, it has a configurable parameter seed that is used for probabilistic probing (searching for an acyclic graph). On the contrary, FNV-1a doesn't work well with an arbitrary prime hence the usage of a pre-calculated one that isn't subject to change.
  • Also, it produces 3 pseudo-random numbers right away, and we need 2 for the second stage of the algorithm.

The CHM92 Algorithm

The CHM92 algorithm operates by performing a depth-first search (DFS) on the graph, in the process, labeling the edges with unique numbers and calculating the corresponding offset for each of the Jenkins hash values. In the picture, you can see one of the possible labelings: each vertex is the value of one of the two hash-codes returned by jenkins-hash2 for each key, and every edge, connecting them, corresponds to a key that produced the hashes. The unique indices of the edges were obtained during DFS. Now, each hash-code is mapped iteratively to the number that is (- edge-index other-vertex-index). So, some codes will map to the same number, but it is guaranteed that, for each key, the sum of two corresponding numbers will be unique (as the edge indices are unique).

CHM92 is an example of the probabilistic algorithms we will discuss in more detail near the end of the book.

Let's say we have implemented the described scheme like I did in the const-table library. Now, we need to perform the measurements to validate that we have, in fact, achieved the desired improvement over the standard hash-table implementation. In this case, we are interested not only in speed measurements, which we already know how to perform but also in calculating the space occupied.

The latter goal is harder to achieve. Usually, most programming languages provide an analog of the sizeof function that returns the space occupied by an array, a structure, or an object. Here, we're interested not in a "shallow" sizeof but in a "deep" one that will descend into the structure's slots and add their sizes recursively.

First, let's create functions to populate the tables with a significant number of random string key-value pairs.

(defun random-string (size)
  (coerce (loop :repeat size :collect (code-char (+ 32 (random 100))))
          'string))

(defun random-hash-table (&key (n 100000))
  (let ((rez (make-hash-table :test 'equal)))
    (loop :repeat n :do
      (:= (? rez (random-string (+ 3 (random 4))))
          (random-string (+ 3 (random 4)))))
    rez))

(defun random-const-table (&key (n 100000))
  (let ((rez (make-const-table :test 'equal)))
    (loop :repeat n :do
      (:= (? rez (random-string (+ 3 (random 4))))
          (random-string (+ 3 (random 4)))))
    rez))

A very approximate space measurement may be performed using the standard operator room. But it doesn't provide detailed per-object statistics. Here's a result of the room measurement, in SBCL (the format of the report will be somewhat different, for each implementation):

CL-USER> (room)
Dynamic space usage is: 45,076,224 bytes.
Immobile space usage is: 18,998,832 bytes (64,672 bytes overhead).
Read-only space usage is: 0 bytes.
Static space usage is: 1,264 bytes.
Control stack usage is: 9,048 bytes.
Binding stack usage is: 640 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.

Breakdown for dynamic space:
11,369,232 bytes for 76,040 simple-vector objects
9,095,952 bytes for 160,669 instance objects
8,289,568 bytes for 518,098 cons objects
3,105,920 bytes for 54,655 simple-array-unsigned-byte-8 objects
2,789,168 bytes for 54,537 simple-base-string objects
2,344,672 bytes for 9,217 simple-character-string objects
6,973,472 bytes for 115,152 other objects

43,967,984 bytes for 988,368 dynamic objects (space total)

Breakdown for immobile space:
16,197,840 bytes for 24,269 code objects
1,286,496 bytes for 26,789 symbol objects
1,041,936 bytes for 27,922 other objects

18,526,272 bytes for 78,980 immobile objects (space total)


CL-USER> (defparameter *ht* (random-hash-table))
*HT*
CL-USER> (room)
...
Breakdown for dynamic space:
13,349,920 bytes for 77,984 simple-vector objects
11,127,008 bytes for 208,576 simple-character-string objects
9,147,824 bytes for 161,469 instance objects
8,419,360 bytes for 526,210 cons objects
3,517,792 bytes for 2,997 simple-array-unsigned-byte-32 objects
3,106,288 bytes for 54,661 simple-array-unsigned-byte-8 objects
7,671,168 bytes for 166,882 other objects

56,339,360 bytes for 1,198,779 dynamic objects (space total)

So, it seems like we added roughly 10 megabytes by creating a hash-table with 100,000 random 5-9 character keys and values. Almost all of that space went into the keys and values themselves — 9 Mb ("11,127,008 bytes for 208,576 simple-character-string objects" versus "2,344,672 bytes for 9,217 simple-character-string objects" — a bit less than 200,000 new strings were added).

Also, if we examine the hash-table, we can see that its occupancy is rather high: around 90%! (The number of keys, 99706 instead of 100000, tells us that there was a small portion of duplicate keys among the randomly generated ones.)

CL-USER> (describe *ht*)
#<HASH-TABLE :TEST EQUAL :COUNT 99706 {1002162EF3}>
[hash-table]

Occupancy: 0.9
Rehash-threshold: 1.0
Rehash-size: 1.5
Size: 111411

And now, a simple time measurement:

CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(gethash k *ht*)))))
Evaluation took:
0.029 seconds of real time
0.032000 seconds of total run time (0.032000 user, 0.000000 system)
110.34% CPU
72,079,880 processor cycles
0 bytes consed

Now, let's try the const-tables that are the MPHT implementation:

CL-USER> (time (defparameter *ct* (cstab:build-const-table *ht*)))
...................................................................................................
Evaluation took:
0.864 seconds of real time
...
CL-USER> (room)
...
Breakdown for dynamic space:
14,179,584 bytes for 78,624 simple-vector objects
11,128,464 bytes for 208,582 simple-character-string objects
9,169,120 bytes for 161,815 instance objects
8,481,536 bytes for 530,096 cons objects
3,521,808 bytes for 2,998 simple-array-unsigned-byte-32 objects
3,305,984 bytes for 54,668 simple-array-unsigned-byte-8 objects
7,678,064 bytes for 166,992 other objects

57,464,560 bytes for 1,203,775 dynamic objects (space total)

Another megabyte was added for the metadata of the new table, so, overall, the result doesn't differ significantly from the hash-table version. Surely, we'd often like to be much more precise in our space measurements. For that, SBCL recently added an allocation profiler, sb-aprof, but we won't go into the details of its usage in this chapter.

And now, time measurement:

CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(cstab:csget k *ct*)))))
Evaluation took:
3.561 seconds of real time

Oops, a two-orders-of-magnitude slowdown! Probably, it has to do with many factors: the lack of optimization in my implementation compared to the one in SBCL, the need to calculate more hashes and with a slower hash-function, etc. I'm sure that the implementation may be sped up at least an order of magnitude, but, even then, what's the benefit of using it over the default hash-tables? Especially, considering that MPHTs have a lot of moving parts and rely on a number of "low-level" algorithms like graph traversal or efficient membership testing, most of which need a custom efficient implementation...

Still, there's one dimension in which MPHTs may provide an advantage: they can significantly reduce space usage by not storing the keys. Though, it becomes problematic if we need to distinguish the keys that are in the table from unknown ones, as those will also hash to some index, i.e. overlap with an existing key. So, either the keyspace should be known beforehand and exhaustively covered in the table, or some precursory membership test is necessary when we anticipate the possibility of unseen keys. Yet, there are ways to perform the test efficiently (exactly or probabilistically) that require much less storage space than would be needed to store the keys themselves. We'll see some of them in the following chapters.

If the keys are omitted, the whole table may be reduced to a Jump-table. Jump-tables are a low-level trick possible when all the keys are integers in the interval [0, n). It removes the necessity to perform sequential equality comparisons for every possible branch until one of the conditions matches: instead, the numbers are used directly as an offset. I.e. the table is represented by a vector, each hash-code being the index in that vector.

A jump-table for the MPHT will simply be a data array, but sometimes evaluation of different code is required for different keys. Such more complex behavior may be implemented in Lisp using the lowest-level operators tagbody and go (and a bit of macrology if we need to generate a huge table). This implementation will be a complete analog of the C switch statement. The skeleton for such an "executable" table will look like this, where 0, 1, ... are goto labels:

(block nil
  (tagbody (go key)  ; in real code, a macro would generate a literal dispatch on KEY
    0 (return (do-something0))
    1 (return (do-something1))
    ...))

Distributed Hash-Tables

Another active area of hash-table-related research is algorithms for distributing them over the network. This is a natural way to represent a lot of datasets, and thus there are numerous storage systems (both general- and special-purpose) which are built as distributed hash-tables. Among them are, for instance, Amazon DynamoDB or an influential open-source project Kademlia. We will discuss in more detail, in the chapter on Distributed Algorithms, some of the technologies developed for this use case, and here I wanted to mention just one concept.

Consistent Hashing addresses the problem of distributing the hash-codes among k storage nodes under the real-world constraints that some of them may become temporarily unavailable or new peers may be added to the system; both kinds of change alter the value of k. The straightforward approach would just divide the space of all codes into k equal portions and select the node into whose portion the particular key maps. Yet, if k changes, all the keys need to be rehashed, which we'd like to avoid at all cost: rehashing the whole database and moving the majority of the keys between the nodes at once will saturate the network and bring the system to a complete halt.

The idea, or rather the tweak, behind Consistent Hashing is simple: we also hash the node ids and store each key on the node whose hash-code is the next one larger than the hash of the key (wrapping around 0 at the end of the hash range). Now, when a new node is added, it is placed on this so-called "hash ring" between two other peers, so only a part of the keys of a single node (the next one on the ring) needs to be redistributed to it. Likewise, when a node is removed, only its keys need to be reassigned to the next peer on the ring (it is assumed that the data is stored in multiple copies on different nodes, so when one of the nodes disappears the data isn't totally lost).

The only problem with applying this approach directly is the uneven distribution of keys that originates from the uneven placement of the node hash-codes on the hash ring. This problem can be solved with another simple tweak: give each node multiple ids that hash to different locations, effectively emulating a larger number of virtual nodes, each storing a smaller portion of the keys. Due to the randomization property of hashes, not that many virtual nodes are needed to obtain a nearly uniform distribution of keys over the nodes. A minimal sketch of such a hash ring is shown below.
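Here is a minimal sketch of such a hash ring (it reuses fnv-1a-str from earlier; the number of virtual ids per node and the string-based node names are arbitrary choices for the illustration):

;; node ids are hashed (with several virtual ids each), and a key is
;; stored on the node whose hash is the next one after the key's hash,
;; wrapping around 0
(defun make-ring (node-ids &key (virtual 8))
  ;; an alist of (hash . node-id) entries, sorted by hash
  (sort (loop :for node :in node-ids
              :append (loop :for i :from 0 :below virtual
                            :collect (cons (fnv-1a-str
                                            (format nil "~A-~A" node i))
                                           node)))
        '< :key 'car))

(defun node-for-key (key ring)
  (let ((hash (fnv-1a-str key)))
    ;; the first ring entry clockwise from the key's hash,
    ;; or the very first entry if we wrapped around
    (cdr (or (find-if (lambda (entry) (> (car entry) hash)) ring)
             (first ring)))))

;; (node-for-key "some-user-id" (make-ring '("node1" "node2" "node3")))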

A more general version of this approach is called Rendezvous Hashing. In it, the key for the item is combined with the node id for each node and then hashed. The largest value of the hash determines the designated node to store the item.

Hashing in Action: Content Addressing

Hash-tables are so ubiquitous that it's, actually, difficult to single out some peculiar use case. Instead, let's talk about hash-functions. They can find numerous uses beyond determining the positions of the items in the hash-table, and one of them is called "content addressing": globally identify a piece of data by its fingerprint instead of using external meta information like name or path. This is one of the suggested building blocks for large-scale distributed storage systems, but it works locally, as well: your git SCM system silently uses it behind the scenes to identify the changesets it operates upon.
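As a toy sketch of the idea (real systems use cryptographic hashes such as SHA-1 rather than the 32-bit fnv-1a-str used here, precisely to make collisions practically impossible):

;; a toy content-addressed store: the key is derived from the content
;; itself, so identical blobs are stored only once
(defparameter *blob-store* (make-hash-table))

(defun put-blob (content)
  ;; returns the content's address (here: its FNV-1a hash)
  (let ((address (fnv-1a-str content)))
    (:= (? *blob-store* address) content)
    address))

(defun get-blob (address)
  (? *blob-store* address))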

The advantages of Content Addressing are:

  • Potential for space economy: if the system has a chance of operating on repeated items (as git does, although that's not the only reason for choosing such a naming scheme for blobs, the other being the lack of a better variant), content addressing makes it possible to avoid storing them multiple times.
  • It guarantees that the links will always return the same content, regardless of where it is retrieved from, who added it to the network, how and when. This enables such distributed protocols as BitTorrent that split the original file into multiple pieces, each one identified by its hash. These pieces can be distributed in an untrusted network.
  • As mentioned above, content addressing also results in a conflict-free naming scheme (provided that the hash has enough bits; usually, cryptographic hashes such as SHA-1 are used for this purpose, although, in many cases, such powerful hash-functions are overkill).

Take-aways

This chapter presented a number of complex approaches that require a lot of attention to detail to be implemented efficiently. On the surface, the hash-table concept may seem rather simple, but, as we have seen, production-grade implementations are not that straightforward. What general conclusions can we make?

  1. In such mathematically loaded areas as hash-function and hash-table implementation, rigorous testing is critically important, for there are a number of unexpected sources of errors: incorrect implementation, integer overflow, concurrency issues, etc. A good testing strategy is to use an already existing trusted implementation and perform large-scale comparison testing against it with a lot of random inputs. We haven't discussed the testing code here but will return to the practical implementation of such testing frameworks in the following chapters.
  2. Besides, a correct implementation doesn't necessarily mean a fast one. Low-level optimization techniques play a crucial role here.
  3. In the implementation of MPHT, we have seen in action another important approach to solving algorithmic and, more generally, mathematical problems: reducing them to a problem that has a known solution.
  4. Space measurement is another important area of algorithms evaluation that is somewhat harder to accomplish than runtime profiling. We'll also see more usage of both of these tools throughout the book.

Nicolas HafnerLibrary Policy Update - Confession 88

· 92 days ago

Recently there's been a few changes to the libraries I maintain, as well as to the new libraries I publish. I thought that these changes would be of general interest, even if you might not use my libraries at all.

License Switch from Artistic 2 to Zlib

Previously all of my libraries were published under the Artistic 2 license. However, a problem was brought to my attention in that license, specifically pertaining to Lisp distributions. Namely, §8 in the license text:

8) You are permitted to link Modified and Standard Versions with other works, to embed the Package in a larger work of your own, or to build stand-alone binary or bytecode versions of applications that include the Package, and Distribute the result without restriction, provided the result does not expose a direct interface to the Package.

The problem being this, in specific:

provided the result does not expose a direct interface to the Package.

This seems to prohibit distributing an application that would expose something like a REPL or a scripting interface that allows the user to interface with the library. I'm not sure what "direct" means in this context, nor what "linking" really means for Lisp - the GPL licenses have a similar issue. Either way, the implications of this restriction are severe enough that I was convinced to abandon the license.

I have since changed the license of all of my libraries and projects to Zlib. Everyone is also hereby permitted to use any prior versions of my projects that were licensed under Artistic 2 under the zlib license terms.

Why Zlib and not MIT or BSD? Simply because I like the clauses that prohibit claiming credit for the software. In any case, I hope that alleviates potential concerns people had about using my software due to the license terms.

Fully Qualified Domain Names for Packages

Since the package local nicknames extension to Common Lisp is now supported widely enough for my tastes, I have decided to drop the short nickname I used to include for packages so far. All of my packages have always included a FQDN name, but also included a shorter nickname for convenience. Newly published libraries will not do this anymore.

You are now encouraged to use the packages by adding local nicknames to your own package definitions. For instance, to use the new Classowary library, you would do something like this:

(defpackage #:org.my.stuff.package
  (:use #:cl)
  (:local-nicknames
    (#:cass #:org.shirakumo.classowary)))

If you use an implementation - like Allegro or LispWorks - that does not support package local nicknames yet, please contact support and request the addition of this feature. It should not be a difficult feature to add, and there is a comprehensive test suite available to aid the process.

If you are a user of my packages and have so far been using the short name, please update your own packages to use a nickname. You won't have to change any other code as long as you do that. You should do this because I would like to alleviate the package namespace pollution problem by removing the short, global aliases from my libraries in the future. Thus, consider the short names to be officially deprecated.

Closing Thoughts

I hope that with these changes I can help push the ecosystem in a positive direction. I would also like to remind everyone of the portability effort. I think the variety of implementations on offer is a big asset to Common Lisp, and I would very much like it to be possible for people to write high-quality libraries that work on a variety of implementations without having to expend a huge amount of time repeating the porting work. As such, it would be amazing if people could help to improve existing portability libraries and implementations to extend support coverage.

More articles and Lispy things coming soon, I hope!

Lispers.deLisp-Meetup in Hamburg on Monday, 2nd September 2019

· 98 days ago

We meet at Ristorante Opera, Dammtorstraße 7, Hamburg, starting around 19:00 CEST on 2nd September 2019.

This is an informal gathering of Lispers of all experience levels.


For older items, see the Planet Lisp Archives.


Last updated: 2019-12-06 21:44