Joe Marshall — "No 'x' in 'Nixon'"
@2019-12-06 21:44 · 2 hours ago
An empty string is a palindrome.
A string with one character is a palindrome.
If a string starts and ends with the same character, and the substring between them is a palindrome, then the entire string is a palindrome.
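The three clauses translate directly into code; a small Python sketch (mine, not the post's):

```python
def palindromep(s: str) -> bool:
    """Recursive definition: empty and one-character strings are
    palindromes; otherwise the ends must match and the middle must
    itself be a palindrome."""
    if len(s) <= 1:
        return True
    return s[0] == s[-1] and palindromep(s[1:-1])
```

As the title hints, "nixon" fails: the outer n's match, but "ixo" does not.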
Zach Beane — Working with letsencrypt's certbot for a Lisp webserver
@2019-12-05 23:35 · 24 hours ago
Every 90 days my letsencrypt certificate expires and I renew it manually. I have to cut and paste data from the certbot script into repl forms to serve the right data on the right path and it’s such a pain that sometimes I put it off until after it expires, and people email me about it, and I feel bad.
The manual process looks something like this (not my actual domain or challenges):
# certbot certonly --manual -d my.example.com
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, autorenewing...
Renewing an existing certificate
Performing the following challenges:
http-01 challenge for my.example.com
                                       
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.
Are you OK with your IP being logged?
                                       
(Y)es/(N)o: y
                                       
Create a file containing just this data:
uaut.gj61B3f7oYQcWZSF4kxS4OFh8KQlsDVtPXrw60Tkj2JLW7RtZaVE0MIWwiEKRlxph7SaLwp1ETFjaGDUKN
And make it available on your web server at this URL:
http://my.example.com/.well-known/acme-challenge/SpZa7sf4QMEFoF7lKh7aT4EZjNWSVcur2jODPQkgExa
                                       
Press Enter to Continue
There’s no way that letsencrypt’s certbot program will automatically get support for my custom hunchentoot “framework” and I don’t really want to look into adding it myself.
For a while I thought about writing some Expect functionality in SBCL: sb-ext:run-program already supports a :pty option, so it wouldn’t be super-difficult. With a theoretical cl-expect you could spawn certbot with the right options, slurp out the verification secret and URL via the process’s standard output stream, and call whatever code you need to serve the secret at that location.
I realized today there’s a halfway solution that takes the worst cutting and pasting out of the loop. The unix command script --flush starts an interactive session where all input and output is saved to a file. My Lisp webserver can then read certbot program output from that file and configure the webserver automagically for me. I still have to manually start certbot and enter a few REPL commands, but it’s easier than before.
Here’s the core of the Lisp side of things for processing the script file:
(defun scrape-letsencrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels (...)
        (loop
         (skip-past "Create a file containing")
         (next-line)
         (setf secret (read-trimmed-line))
         (skip-past "make it available")
         (next-line)
         (setf path (substring-after ".well-known" (read-trimmed-line))))))))
This could be done just as well (perhaps better) with cl-ppcre, but I didn’t want to pull it in as a dependency.
Here are the functions from labels:
((next-line ()
   (let ((line (read-line stream nil)))
     (when (null line)
       (unless (and secret path)
         (error "Didn't find a secret and path anywhere in ~S"
                file))
       (return-from scrape-letsencrypt-interaction
         (values secret path)))
     line))
 (skip-past (string)
   (loop
    (let ((line (next-line)))
      (when (search string line)
        (return)))))
 (read-trimmed-line ()
   (string-trim '(#\Return)
                (next-line)))
 (substring-after (string target)
   (let ((pos (search string target)))
     (unless pos
       (error "Could not find ~S in ~S" string target))
     (subseq target pos))))
The goal here is to only look at the last secret and URL in the script output, so the main loop keeps track of what it’s seen so far and next-line returns those values at EOF. The output also has ugly trailing ^M noise so read-trimmed-line takes care of that.
Here’s the whole thing all together:
(defun scrape-letsencrypt-interaction (file)
  (let (secret path)
    (with-open-file (stream file)
      (labels ((next-line ()
                 (let ((line (read-line stream nil)))
                   (when (null line)
                     (unless (and secret path)
                       (error "Didn't find a secret and path anywhere in ~S"
                              file))
                     (return-from scrape-letsencrypt-interaction
                       (values secret path)))
                   line))
               (skip-past (string)
                 (loop
                  (let ((line (next-line)))
                    (when (search string line)
                      (return)))))
               (read-trimmed-line ()
                 (string-trim '(#\Return)
                              (next-line)))
               (substring-after (string target)
                 (let ((pos (search string target)))
                   (unless pos
                     (error "Could not find ~S in ~S" string target))
                   (subseq target pos))))
        (loop
         (skip-past "Create a file containing")
         (next-line)
         (setf secret (read-trimmed-line))
         (skip-past "make it available")
         (next-line)
         (setf path (substring-after ".well-known" (read-trimmed-line))))))))
I don’t mind shaving only half a yak when it feels like useful progress!
Someday I’ll get around to a proper Expect-like thing…
Wimpie Nortje — HTTP Routing libraries for Hunchentoot
@2019-12-02 00:00 · 5 days ago
Hunchentoot is a web server and not a heavyweight web development framework. It provides methods to implement URL paths for static files, directory paths and with regexes, very much like the mainstream web servers. It does not provide an easy way to define routes for a REST API. When you use Hunchentoot without a web development framework you will probably want something to make route definition easier.
There are a few options for building REST APIs available in frameworks, Hunchentoot derivatives or other web servers, but I wanted to implement REST routes with the original Hunchentoot web server. I found three libraries that can do this: simple-routes, Froute and easy-routes.
Simple-routes is the simplest and easiest to use. Routes are defined in a list similar to Hunchentoot's *dispatch-table*. It supports variables in the URL segments but there is no support for middleware^{1} type functionality.
Froute is the most powerful of the three. It is based on CLOS and is designed so it can be used with any web server, although only a Hunchentoot connector is currently implemented. Routes are defined as CLOS classes and, even though middleware is not a specific feature, the documentation gives an example of how to use class inheritance to implement such functionality. The power of being CLOS-based also makes this library the most complex to use.
Easy-routes has the concept of decorators, which are functions that execute before the route body, so it can be used to implement middleware functionality. Unlike Clack's middleware, which is defined at a central place for all routes, decorators need to be applied to each route handler individually. It's not quite there, but close enough.
The lack of middleware options disqualified simple-routes for me. Froute looked like it provides everything I need, and more, but with much greater complexity than easy-routes. I decided to use easy-routes, with the option to switch to Froute when I needed the extra capability.
Hunchentoot takes an "acceptor" argument at startup. Easy-routes provides two options: easy-routes-acceptor and routes-acceptor. Easy-routes-acceptor first executes all the route handlers and, if no suitable handler is found, it executes the normal Hunchentoot request handlers. The routes-acceptor executes only the route handlers and returns a 404 NOT FOUND error if no suitable handler is found.
I use routes-acceptor because it ensures that only requests with explicitly defined handlers are handled. With the easy-routes-acceptor it is too easy to create a security hole with some default Hunchentoot request handler that catches nonexistent routes. It can be burdensome to use this approach for handling static files, but I run Hunchentoot behind Nginx, which also handles the static files.
The table summarises the main features I investigated:

| | simple-routes | easy-routes | Froute |
| --- | --- | --- | --- |
| Web server | Hunchentoot | Hunchentoot | Hunchentoot (can be expanded to others) |
| REST routes | Yes | Yes | Yes |
| Argument extraction from URL | Yes | Yes | Yes |
| Dispatch based on HTTP method | Yes | Yes | Yes |
| Middleware | No | Decorators | CLOS inheritance |
| Fallback for undefined routes | Hunchentoot easy-handler framework | None, or Hunchentoot easy-handler framework | None |
| Learning curve | Negligible | Minimal | Medium. Requires some CLOS knowledge. |
Middleware are functions that run before or after the main request handler. They are called as part of the request handling process and not by the handler. This makes them ideal to handle general functions like setting up a database connection, performing authorisation or any task which is not part of a particular request handler. ↩
Paul Khuong — A Multiset of Observations With Constant-time Sample Mean and Variance
@2019-12-01 04:51 · 5 days ago
Fixed notation issues in the “Faster multiset updates” section. Thank you Joonas.
Let’s say you have a multiset (bag) of “reals” (floats or rationals), where each value is a sampled observation. It’s easy to augment any implementation of the multiset ADT to also return the sample mean of the values in the multiset in constant time: track the sum of values in the multiset, as they are individually added and removed. This requires one accumulator and a counter for the number of observations in the multiset (i.e., constant space), and adds a constant time overhead to each update.
It’s not as simple when you also need the sample variance of the multiset \(X\), i.e.,
\[\frac{1}{n - 1} \sum\sb{x \in X} (x - \hat{x})\sp{2},\]
where \(n = |X|\) is the sample size and \(\hat{x}\) is the sample mean \(\sum\sb{x\in X} x/n,\) ideally with constant query time, and constant space and update time overhead.
One could try to apply the textbook equality
\[s\sp{2} = \frac{1}{n(n-1)}\left[n\sum\sb{x\in X} x\sp{2} - \left(\sum\sb{x\in X} x\right)\sp{2}\right].\]
However, as Knuth notes in TAoCP volume 2,
this expression loses a lot of precision to roundoff in floating point:
in extreme cases, the difference might be negative
(and we know the variance is never negative).
More commonly, we’ll lose precision
when the sampled values are clustered around a large mean.
For example, the sample standard deviation of 1e8 and 1e8 - 1 is the same as for 0 and 1. However, the expression above would evaluate that to 0.0, even in double precision: while 1e8 is comfortably within range for double floats, its square 1e16 is outside the range where all integers are represented exactly.
Knuth refers to a better behaved recurrence by Welford, where
a running sample mean is subtracted from each new observation
before squaring.
John Cook has a C++
implementation
of the recurrence that adds observations to a sample variance in constant time.
In Python, this streaming algorithm looks like this.
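A minimal sketch of Welford's accumulator (a reconstruction, not the post's exact listing; the names observe, mean, and var_sum follow the surrounding text):

```python
class StreamingVariance:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0  # running sum of squared differences from the mean

    def observe(self, v):
        # Welford's update: fold v into the running mean, then accumulate
        # (v - old_mean) * (v - new_mean) into var_sum.
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def get_mean(self):
        return self.mean

    def get_variance(self):
        # sample variance, with the n - 1 correction
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```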
That’s all we need for insert-only multisets, but it does not handle removals; if only we had removals, we could always implement updates (replacement) as a removal and an insertion.
Luckily, StreamingVariance.observe looks invertible. It shouldn’t be hard to recover the previous sample mean, given v, and, given the current and previous sample means, we can re-evaluate (v - old_mean) * (v - self.mean) and subtract it from self.var_sum.
Let \(\hat{x}\sp{\prime}\) be the sample mean after observe(v). We can derive the previous sample mean \(\hat{x}\) from \(v\):
\[(n - 1)\hat{x} = n\hat{x}\sp{\prime} - v \Leftrightarrow \hat{x} = \hat{x}\sp{\prime} + \frac{\hat{x}\sp{\prime} - v}{n-1}.\]
This invertibility means that we can undo calls to observe in LIFO order. We can’t handle arbitrary multiset updates, only a stack of observations. That’s still better than nothing.
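A sketch consistent with the derivation above (reconstructed, not the post's listing; here pop takes the value being removed, since the caller tracks the stack's contents):

```python
class VarianceStack:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0

    def push(self, v):
        # forward Welford step
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def pop(self, v):
        # undo push(v); only valid in LIFO order
        if self.n == 1:
            self.n, self.mean, self.var_sum = 0, 0.0, 0.0
            return
        # recover the previous mean from the inverted recurrence
        old_mean = self.mean + (self.mean - v) / (self.n - 1)
        self.var_sum -= (v - old_mean) * (v - self.mean)
        self.mean = old_mean
        self.n -= 1

    def get_mean(self):
        return self.mean

    def get_variance(self):
        return self.var_sum / (self.n - 1) if self.n > 1 else 0.0
```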
Before going any further, let’s test this.
Testing the VarianceStack
The best way to test the VarianceStack is to execute a series of push and pop calls, and compare the results of get_mean and get_variance with batch reference implementations.
I could hardcode calls in unit tests. However, that quickly hits diminishing returns in terms of marginal coverage vs. developer time. Instead, I’ll be lazy, completely skip unit tests, and rely on Hypothesis, its high-level “stateful” testing API in particular.
We’ll keep track of the values pushed and popped off the observation stack in the driver: we must make sure they’re matched in LIFO order, and we need the stack’s contents to compute the reference mean and variance. We’ll also want to compare the results with reference implementations, modulo some numerical noise. Let’s try to be aggressive and bound the number of float values between the reference and the actual results.
This initial driver does not even use the VarianceStack yet. All it does is push values to the reference stack, pop values when the stack has something to pop, and check that the reference implementations match themselves after each call: I want to first shake out any bug in the test harness itself.
Not surprisingly, Hypothesis does find an issue in the reference implementation:
Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=2.6815615859885194e+154)
state.teardown()
We get a numerical OverflowError in reference_variance: 2.68...e154 / 2 is slightly greater than sqrt(sys.float_info.max) = 1.3407807929942596e+154, so taking the square of that value errors out instead of returning infinity.
Let’s start by clamping the range of the generated floats.
Now that the test harness doesn’t find fault in itself, let’s hook in the VarianceStack, and see what happens when only push calls are generated (i.e., first test only the standard streaming variance algorithm).
This already fails horribly.
Falsifying example:
state = VarianceStackDriver()
state.push(v=1.0)
state.push(v=1.488565707357403e+138)
state.teardown()
F
The reference finds a variance of 5.54e275, which is very much not the streaming computation’s 1.108e276. We can manually check that the reference is wrong: it’s missing the n - 1 correction term in the denominator. We should use this updated reference.
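With the n - 1 correction, the batch reference might look like this (a sketch; the names reference_mean and reference_variance follow the failure output above):

```python
def reference_mean(values):
    return sum(values) / len(values)

def reference_variance(values):
    # sample variance with the n - 1 (Bessel) correction in the denominator
    m = reference_mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)
```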
Let’s now re-enable calls to pop().
And now things fail in new and excitingly numerical ways.
Falsifying example:
state = VarianceStackDriver()
state.push(v=0.0)
state.push(v=0.00014142319560050964)
state.push(v=14188.9609375)
state.pop()
state.teardown()
F
This counterexample fails with the online variance returning 0.0 instead of 1e-8. That’s not unexpected: removing (the square of) a large value from a running sum spells catastrophic cancellation.
It’s also not that bad for my use case, where I don’t expect to observe very large values. Another problem for our test harness is that floats are very dense around 0.0, and I’m OK with small (around 1e-8) absolute error because the input and output will be single floats. Let’s relax assert_almost_equal, and restrict generated observations to fall in \([-2\sp{12}, 2\sp{12}].\)
With all these tweaks to make sure we generate easy (i.e., interesting) test cases, Hypothesis fails to find a failure after its default time budget.
I’m willing to call that a victory.
From stack to full multiset
We have tested code to undo updates in Welford’s classic streaming variance algorithm. Unfortunately, inverting pushes away only works for LIFO edits, and we’re looking for arbitrary inserts and removals (and updates) to a multiset of observations.
However, both the mean \(\hat{x} = \sum\sb{x\in X} x/n\) and the centered second moment \(\sum\sb{x\in X}(x - \hat{x})\sp{2}\) are order-independent: they’re just sums over all observations. Disregarding round-off, we’ll find the same mean and second moment regardless of the order in which the observations were pushed in. Thus, whenever we wish to remove an observation from the multiset, we can assume it was the last one added to the estimates, and pop it off.
We think we know how to implement running mean and variance for a multiset of observations. How do we test that with Hypothesis?
The hardest part about testing dictionary (map)-like interfaces is making sure to generate valid identifiers when removing values. As it turns out, Hypothesis has built-in support for this important use case, with its Bundles. We’ll use that to test a dictionary from observation name to observation value, augmented to keep track of the current mean and variance of all values.
Each call to add_entry will either go to update_entry if the key already exists, or add an observation to the dictionary and streaming estimator. If we have a new key, it is added to the keys Bundle; calls to del_entry and update_entry draw keys from this Bundle. When we remove an entry, it’s also consumed from the keys Bundle.
Hypothesis finds no fault with our new implementation of dictionary-with-variance, but update seems like it could be much faster and more numerically stable, and I intend to mostly use this data structure for calls to update.
Faster multiset updates
The key operation for my use case is to update one observation by replacing its old value with a new one. We can maintain the estimator by popping old away and pushing new in, but this business with updating the number of observations n and rescaling everything seems like a lot of numerical trouble. We should be able to do better.
We’re replacing the multiset of sampled observations \(X\) with \(X\sp{\prime} = X \setminus \{\textrm{old}\} \cup \{\textrm{new}\}.\) It’s easy to maintain the mean after this update: \(\hat{x}\sp{\prime} = \hat{x} + (\textrm{new} - \textrm{old})/n.\)
The update to self.var_sum, the sum of squared differences from the mean, is trickier. We start with \(v = \sum\sb{x\in X} (x - \hat{x})\sp{2},\) and we wish to find \(v\sp{\prime} = \sum\sb{x\sp{\prime}\in X\sp{\prime}} (x\sp{\prime} - \hat{x}\sp{\prime})\sp{2}.\)
Let \(\delta = \textrm{new} - \textrm{old}\) and \(\delta\sb{\hat{x}} = \delta/n.\) We have \[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} [(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2},\] and \[[(x - \hat{x}) - \delta\sb{\hat{x}}]\sp{2} = (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} (x - \hat{x}) + \delta\sb{\hat{x}}\sp{2}.\]
We can reassociate the sum, and find
\[\sum\sb{x\in X} (x - \hat{x}\sp{\prime})\sp{2} = \sum\sb{x\in X} (x - \hat{x})\sp{2} - 2\delta\sb{\hat{x}} \left(\sum\sb{x \in X} x - n\hat{x}\right) + n \delta\sb{\hat{x}}\sp{2}\]
Once we notice that \(\hat{x} = \sum\sb{x\in X} x/n,\) it’s clear that the middle term sums to zero, and we find the very reasonable
\[v\sb{\hat{x}\sp{\prime}} = \sum\sb{x\in X} (x - \hat{x})\sp{2} + n \delta\sb{\hat{x}}\sp{2} = v + \delta \delta\sb{\hat{x}}.\]
This new accumulator \(v\sb{\hat{x}\sp{\prime}}\) corresponds to the sum of the squared differences between the old observations \(X\) and the new mean \(\hat{x}\sp{\prime}\). We still have to update one observation from old to new. The remaining adjustment to \(v\) (self.var_sum) corresponds to going from \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) to \((\textrm{new} - \hat{x}\sp{\prime})\sp{2},\) where \(\textrm{new} = \textrm{old} + \delta.\)
After a bit of algebra, we get \[(\textrm{new} - \hat{x}\sp{\prime})\sp{2} = [(\textrm{old} - \hat{x}\sp{\prime}) + \delta]\sp{2} = (\textrm{old} - \hat{x}\sp{\prime})\sp{2} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}).\]
The adjusted \(v\sb{\hat{x}\sp{\prime}}\) already includes \((\textrm{old} - \hat{x}\sp{\prime})\sp{2}\) in its sum, so we only have to add the last term to obtain the final updated self.var_sum:
\[v\sp{\prime} = v\sb{\hat{x}\sp{\prime}} + \delta (\textrm{old} - \hat{x}\sp{\prime} + \textrm{new} - \hat{x}\sp{\prime}) = v + \delta [2 (\textrm{old} - \hat{x}) + \delta - \delta\sb{\hat{x}}].\]
That’s our final implementation for VarianceBag.update, for which Hypothesis also fails to find failures.
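Following the final formula, a sketch of such an update method (VarianceBag and update are named in the text; the insert helper is the standard Welford step, added here so the example is self-contained):

```python
class VarianceBag:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.var_sum = 0.0  # v: sum of squared differences from the mean

    def insert(self, v):
        # standard Welford step
        self.n += 1
        old_mean = self.mean
        self.mean += (v - old_mean) / self.n
        self.var_sum += (v - old_mean) * (v - self.mean)

    def update(self, old, new):
        # replace observation `old` with `new` in constant time:
        # v' = v + delta * (2 * (old - mean) + (delta - delta_mean))
        delta = new - old
        delta_mean = delta / self.n
        old_mean = self.mean
        self.mean += delta_mean
        self.var_sum += delta * (2 * (old - old_mean) + (delta - delta_mean))
```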
How much do you trust testing?
We have automated property-based tests and some human-checked proofs. Ship it?
I was initially going to ask a CAS to check my reformulations, but the implicit \(\forall\) looked messy. Instead, I decided to check the induction hypothesis implicit in VarianceBag.update, and enumerate all cases up to a certain number of values with Z3 in IPython.
In [1]: from z3 import *
In [2]: x, y, z, new_x = Reals("x y z new_x")
In [3]: mean = (x + y + z) / 3
In [4]: var_sum = sum((v - mean) * (v - mean) for v in (x, y, z))
In [5]: delta = new_x - x
In [6]: new_mean = mean + delta / 3
In [7]: delta_mean = delta / 3
In [8]: adjustment = delta * (2 * (x - mean) + (delta - delta_mean))
In [9]: new_var_sum = var_sum + adjustment
# We have our expressions. Let's check equivalence for mean, then var_sum
In [10]: s = Solver()
In [11]: s.push()
In [12]: s.add(new_mean != (new_x + y + z) / 3)
In [13]: s.check()
Out[13]: unsat # No counter example of size 3 for the updated mean
In [14]: s.pop()
In [15]: s.push()
In [16]: s.add(new_mean == (new_x + y + z) / 3) # We know the mean matches
In [17]: s.add(new_var_sum != sum((v - new_mean) * (v - new_mean) for v in (new_x, y, z)))
In [18]: s.check()
Out[18]: unsat # No counter example of size 3 for the updated variance
Given this script, it’s a small matter of programming to generalise from 3 values (x, y, and z) to any fixed number of values, and generate all small cases up to, e.g., 10 values.
I find the most important thing when it comes to using automated proofs is to insert errors and confirm we can find the bugs we’re looking for.
I did that by manually mutating the expressions for new_mean and new_var_sum in updated_expressions. This let me find a simple bug in the initial implementation of test_num_var: I used if not result instead of result != unsat, and both sat and unsat are truthy. The code initially failed to flag a failure when z3 found a counterexample for our correctness condition!
And now I’m satisfied
I have code to augment an arbitrary multiset or dictionary with a running estimate of the mean and variance; that code is based on a classic recurrence, with some new math checked by hand, with automated tests, and with some exhaustive checking of small inputs (to which I claim most bugs can be reduced).
I’m now pretty sure the code works, but there’s another more obviously correct way to solve that update problem. This 2008 report by Philippe Pébay^{1} presents formulas to compute the mean, variance, and arbitrary moments in one pass, and shows how to combine accumulators, a useful operation in parallel computing.
We could use these formulas to augment an arbitrary \(k\)-ary tree and recombine the merged accumulator as we go back up the (search) tree from the modified leaf to the root. The update would be much more stable (we only add and merge observations), and incur logarithmic time overhead (with linear space overhead). However, given the same time budget, and a logarithmic space overhead, we could also implement the constant-time update with arbitrary precision software floats, and probably guarantee even better precision.
The constant-time update I described in this post demanded more effort to convince myself of its correctness, but I think it’s always a better option than an augmented tree for serial code, especially if initial values are available to populate the accumulators with batch-computed mean and variance. I’m pretty sure the code works, and it’s up in this gist. I’ll be reimplementing it in C++ because that’s the language used by the project that led me to this problem; feel free to steal that gist.

There’s also a 2016 journal article by Pébay and others with numerical experiments, but I failed to implement their simpler-looking scalar update...↩
Quicklisp news — November 2019 Quicklisp dist update now available
@2019-11-30 21:09 · 6 days ago
 cacau — Test Runner in Common Lisp. — GPLv3
 cl-elastic — Elasticsearch client for Common Lisp — MIT
 cl-libiio — Common Lisp bindings for libiio. — GPLv3
 cl-transmission — A Common Lisp library to interface with transmission using its rpc — MIT
 dartscltools — More or less useful utilities — MIT
 plain-odbc — a lisp wrapper around the ODBC library — BSD
 sanity-clause — Sanity clause is a data contract and validation library. — LGPLv3
 tfm — A TeX Font Metrics library for Common Lisp — BSD
 tmpdir — Simple library to create temporary directories — MIT license
 trivial-method-combinations — Portability library for accessing method combination objects — Unlicense
 winlock — File locking using the Windows API. — MIT
 with-user-abort — provides an easy way to catch ctrl+c. useful for making binaries. — BSD 3-Clause
To get this update, use: (ql:update-dist "quicklisp")
ABCL Dev — Stuffing an @Armedbear
@2019-11-28 14:07 · 8 days ago
Happy Thanksgiving!
Lispers.de — Hamburg Lispers Meetup, Monday, 2nd December 2019
@2019-11-27 00:00 · 10 days ago
Hamburg will see another gathering of Lispers on 2nd December 2019. We meet around 19:00 (“7 p.m.”) at Ristorante Opera, Dammtorstr. 7.
This is an informal event for all Lisp experience levels, and about any Lisp dialect.
This time, we'll likely talk about Advent of Code and some Lisp opportunities.
ABCL Dev — Unleashing the Bear past Java 11
@2019-11-22 14:31 · 14 days ago
1.6.0 onto the world stage.
This release fully supports building and running on Java 6, 7, 8, and 11 via the openjdk runtimes. By building from our source, it is rumored that the User may utilize additional hosting platforms such as Java 13.
Support for Microsoft Windows platforms has not been thoroughly tested at this point, but we intend to add it to our automated cloud-based build in the process of releasing a stabilizing ABCL 1.6.1 build due by the end of this year.
Thanks to all past, present, and future contributors to the Bear; here's to hackin' our way out of the prison of the Java language for over a decade.
Didier Verna — Quickref 3.0 "The Alchemist" is released
@2019-11-21 00:00 · 16 days ago
Following up with the latest release of Declt, I'm also happy to announce the release of Quickref 3.0 "The Alchemist", our reference manuals aggregator for Quicklisp libraries. The official website was updated yesterday with both the latest Quicklisp version and the new Declt, resulting in the documentation of 1792 Common Lisp libraries.
Vsevolod Dyomkin — Programming Algorithms: Strings
@2019-11-20 10:47 · 16 days ago
It may not be immediately obvious why the whole chapter is dedicated to strings. Aren't they just glorified arrays? There are several answers to this challenge:
 indeed, strings are not just arrays, or rather, not only arrays: in different contexts, other representations, such as trees or complex combinations of arrays, may be used. And, besides, there are additional properties that are important for strings even when they are represented as arrays
 there's a lot of stringspecific algorithms that deserve their own chapter
 finally, strings play a significant role in almost every program, so they have specific handling: in the OS, standard library, and even, sometimes, your application framework
In the base case, a string is, indeed, an array. As we already know, this array may either store its length or be a 0-terminated security catastrophe, like in C (see buffer overflow). So, to iterate, strings should store their length. Netstrings are a notable take on the idea of length-aware strings: it's a simple external format that serializes a string as a tuple of length and contents, separated by a colon and ending with a comma: 3:foo, is the netstring for the string foo.
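The netstring format is simple enough to sketch (the length counts bytes, not characters):

```python
def to_netstring(s: str) -> bytes:
    # serialize as <byte-length>:<contents>,
    payload = s.encode("utf-8")
    return str(len(payload)).encode("ascii") + b":" + payload + b","
```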
More generally, a string is a sequence of characters. The characters themselves may be single bytes as well as fixed- or variable-length byte sequences. The latter kind of character encoding raises a challenging question: what to prefer, correctness or speed? With variable-length Unicode code points, the simplest and fastest string variant, a byte array, breaks, for it will incorrectly report its length (in bytes, not in characters) and fail to retrieve the character by index. Different language ecosystems address this issue differently, and the majority is, unfortunately, broken in one aspect or another. Overall, there may be two possible solution paths. The first one is to use a fixed-length representation and pad shorter characters to full length. Generally, such representation will be 32-bit UTF-32, resulting in up to 75% storage space waste for the most common 1-byte ASCII characters. The alternative approach is to utilize a more advanced data structure. The naive variant is a list, which implies an unacceptable slowdown of the character access operation to O(n). Yet, a balanced approach may combine minimal additional space requirements with acceptable speed. One of the solutions may be to utilize the classic bitmap trick: use a bit array indicating, for each byte, whether it's the start of a character (only a 12.5% overhead). Calculating the character position may be performed in a small number of steps with the help of an operation that is infamous in close circles: population count, aka Hamming weight. This hardware instruction calculates the number of 1-bits in an integer and is accessible via the logcount Lisp standard library routine. Behind the scenes, it is also called for bit arrays if you invoke (count 1 ...) on them. At least this is the case for SBCL:
CL-USER> (disassemble (lambda (x)
                        (declare (type (simple-array bit) x))
                        (count 1 x)))
; disassembly for (LAMBDA (X))
; Size: 267 bytes. Origin: #x100FC9FD1A
...
; DA2: F3480FB8FA POPCNT RDI, RDX
The indexing function implementation may be quite tricky, but the general idea is to try to jump ahead n characters and calculate the popcount of the substring from the previous position to the current one, which will tell us the number of characters we have skipped. For the base case of a 1-byte string, we will get exactly where we wanted in just 1 jump and 1 popcount. However, if there were multibyte characters in the string, the first jump would have skipped less than n characters. If the difference is sufficiently small (say, below 10) we can just perform a quick linear scan of the remainder and find the position of the desired character. If it's larger than n/2 we can jump ahead n characters again (this will repeat at most 3 times, as the maximum byte-length of a character is 4), and if it's below n/2 we can jump n/2 characters. And if we overshoot, we can reverse the direction of the next jump or search. You can see where it's heading: if at each step (or, at least, at each 4th step) we are constantly halving our numbers, this means O(log n) complexity. That's the worst performance we can get for this function, and it will very efficiently handle the cases when the character length doesn't vary: be it 1 byte (just 2 operations) or 4 bytes (8 ops).
Here is the prototype of the char-index operation implemented according to the described algorithm (without the implementation of mb-linear-char-index that performs the final linear scan):
(defstruct (mb-string (:conc-name mbs-))
  bytes
  bitmap)

(defparameter *mb-threshold* 10)

(defun mb-char-index (string i)
  (let ((off 0))
    (loop
     (with ((cnt (count 1 (mbs-bitmap string) :start off :end (+ off i)))
            (diff (- i cnt)))
       (cond
        ((= cnt i)
         (return (+ off i)))
        ((< diff *mb-threshold*)
         (return (mb-linear-char-index string diff off)))
        ((< cnt (floor i 2))
         (:+ off i)
         (:- i cnt))
        (t
         (:+ off (floor i 2))
         (:- i cnt)))))))
The length of such a string may be calculated by performing the popcount on the whole bitmap:

(defun mb-length (string)
  (count 1 (mbs-bitmap string)))
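The jump-and-popcount idea is language-agnostic, so here is a small Python sketch of it to make the scheme concrete (the name mb_char_index, the threshold value, and the inlined linear scan are all illustrative; the bitmap flags character-leading bytes with 1, and the index is assumed to be valid):

```python
MB_THRESHOLD = 10  # below this, fall back to a linear scan

def mb_char_index(bitmap, i):
    """Byte offset of character number I (0-based) in a byte string whose
    character-leading bytes are flagged with 1 in BITMAP: jump ahead,
    popcount the skipped span, repeat; finish with a short linear scan."""
    off = 0
    need = i + 1  # how many leading bytes we still have to pass
    while need > MB_THRESHOLD:
        jump = need
        need -= sum(bitmap[off:off + jump])  # popcount of the skipped span
        off += jump
    if need == 0:
        # the jumps landed just past the target: back up to its leading byte
        off -= 1
        while not bitmap[off]:
            off -= 1
        return off
    # linear scan for the remaining few characters
    for pos in range(off, len(bitmap)):
        if bitmap[pos]:
            need -= 1
            if need == 0:
                return pos
    return None
```

For a pure 1-byte string the bitmap is all 1s and the answer is found immediately; for the UTF-8 bytes of "héllo" (bitmap 1 1 0 1 1 1) the first "l" (character 2) is correctly located at byte offset 3.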
It's also worth taking into account that there exists a set of rules assembled under the umbrella of the Unicode collation algorithm that specifies how to order strings containing Unicode codepoints.
Basic String-Related Optimizations
Strings are often subject to subsequencing, so an efficient implementation may use structure sharing. As we remember, in Lisp, this is accessible via the displaced arrays mechanism (and a convenience RUTILS function slice that we have already used in the code above). Yet, structure sharing should be utilized with care as it opens a possibility for action-at-a-distance bugs if the derived string is modified, which results in parallel modification of the original. Though, strings are rarely modified in-place so, even in its basic form (without mandatory immutability), the approach works well. Moreover, some programming language environments make strings immutable by default. In such cases, to perform on-the-fly string modification (or rather, creation) such patterns as the Java StringBuilder are used, which creates the string from parts by first accumulating them in a list and then, when necessary, concatenating the list's contents into a single final string. An alternative approach is string formatting (the format function in Lisp) that is a higher-level interface, which still needs to utilize some underlying mutation/combination mechanism.
Another important string-related technology is interning. It is a space-saving measure to avoid duplicating the same strings over and over again, which operates by putting a string in a table and using its index afterwards. This approach also enables efficient equality comparison. Interning is performed by the compiler implicitly for all constant strings (in the special segment of the program's memory called "string table"/sstab), and also may be used explicitly. In Lisp, there's a standard function intern for this. Lisp symbols use interned strings as their names. Another variant of interning is string pooling. The difference is that interning uses a global string table while the pools may be local.
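The pooling mechanism can be sketched in a few lines of Python (the StringPool class and its fields are purely illustrative, not a model of any particular runtime's string table):

```python
class StringPool:
    """A toy string pool: each distinct string is stored exactly once and
    is afterwards referred to by its index, so equality comparison of two
    interned strings becomes a single integer comparison."""

    def __init__(self):
        self.table = {}    # string -> index
        self.strings = []  # index -> string

    def intern(self, s):
        idx = self.table.get(s)
        if idx is None:
            idx = len(self.strings)
            self.table[s] = idx
            self.strings.append(s)
        return idx
```

Interning "foo" twice yields the same index both times, and the pool stores the string only once.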
Strings in the Editor
Now, let's consider situations in which representing strings as arrays doesn't work. The primary one is in the editor, i.e. when constant random modification is the norm. There's another not so obvious requirement related to editing: handling potentially arbitrarily long strings that still need to be dynamically modified. Have you tried opening a hundred-megabyte text document in your favorite editor? You'd better not, unless you're a Vim user :) Finally, an additional limitation of handling the strings in the editor arises when we allow concurrent modification. This we'll discuss in the chapter on concurrent algorithms.
So, why doesn't an array work well as a string backend in the editor? Because of the content relocation required by all edit operations. O(n) editing is, obviously, not acceptable. What to do? There are several more advanced approaches:
- The simplest change will be, once again, to use an array of arrays. For example, for each line. This will not change the general complexity of O(n) but, at least, will reduce n significantly. The issue is that it will still depend on the length of the line so, for the not so rare degraded case when there are few or no linebreaks, the performance will seriously deteriorate. And, moreover, having observable performance differences between editing different paragraphs of the text is not user-friendly at all.
- A more advanced approach would be to use trees, reducing access time to O(log n). There are many different kinds of trees and, in fact, only a few may work as efficient string representations. Among them, a popular data structure for representing strings is a Rope. It's a binary tree where each leaf holds a substring and its length, and each intermediate node further holds the sum of the lengths of all the leaves in its left subtree. It's a more-or-less classic application of binary trees to a storage problem so we won't spend more time on it here. Suffice to say that it has the expected binary-tree performance of O(log n) for all operations, provided that we keep it balanced. It's an ok alternative to a simple array, but, for such a specialized problem, we can do better with a custom solution.
- And the custom solution is to return to arrays. There's one clever way to use them that works very well for dynamic strings. It is called a Gap buffer. This structure is an array (buffer) with a gap in the middle. I.e., let's imagine that we have a text of n characters. The gap buffer will have a length of n + k where k is the gap size — some value, derived from practice, that may fluctuate in the process of string modification. You can recognize this gap as the position of the cursor in the text. Insertion operation in the editor is performed exactly at this place, so it's O(1). Just, afterwards, the gap will shrink by 1 character, so we'll have to resize the array, at some point, if there are too many insertions and the gap shrinks below some minimum size (maybe, below 1). The deletion operation will act exactly the opposite by growing the gap at one of the sides. The gap buffer is an approach that is especially suited for normal editing — a process that has its own pace. It also allows the system to represent multiple cursors by maintaining several gaps. Also, it may be a good idea to represent each paragraph as a gap buffer and use an array of them for the whole text. The gap buffer is a special case of the Zipper pattern that we'll discuss in the chapter on functional data structures.
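The gap buffer mechanics can be made concrete with a short Python sketch (the class name, the initial gap size, and the resizing policy are all illustrative choices, not a canonical implementation):

```python
class GapBuffer:
    """A gap buffer: the text lives in one array with a gap at the cursor,
    so insertion and deletion at the cursor cost O(1) (amortized)."""

    def __init__(self, text="", gap=16):
        self.buf = list(text) + [None] * gap
        self.gap_start = len(text)   # the cursor position
        self.gap_end = len(self.buf)

    def move_cursor(self, pos):
        # moving the cursor shifts characters across the gap: O(distance)
        while self.gap_start > pos:
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, ch):
        if self.gap_start == self.gap_end:   # the gap has shrunk to nothing
            grow = max(16, len(self.buf))    # resize, like a dynamic array
            self.buf[self.gap_end:self.gap_end] = [None] * grow
            self.gap_end += grow
        self.buf[self.gap_start] = ch
        self.gap_start += 1

    def delete(self):
        # deleting before the cursor just grows the gap: O(1)
        if self.gap_start > 0:
            self.gap_start -= 1

    def text(self):
        return "".join(self.buf[:self.gap_start] + self.buf[self.gap_end:])
```

For example, moving the cursor of a buffer holding "hello world" to position 5 and inserting "," yields "hello, world" without relocating the tail on every subsequent edit at that spot.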
Substring Search
One of the most common string operations is substring search. For ordinary sequences we, usually, search for a single element, but strings, on the contrary, more often require subsequence search, which is more complex. A naive approach will start by looking for the first character, then try to match the next character and the next, until either something ends or there's a mismatch. Unlike with hash-tables, the Lisp standard library has good support for string processing, including such operations as search (which, actually, operates on any sequence type) and mismatch that compares two strings from a chosen side and returns the position at which they start to diverge.
If we were to implement our own string-specific search, the most basic version would, probably, look like this:
(defun naive-match (pat str)
  (dotimes (i (- (1+ (length str)) (length pat)))
    (when (= (mismatch pat (slice str i))
             (length pat))
      (return-from naive-match i))))
If the strings had been random, the probability that we are correctly matching each subsequent character would have dropped to 0 very fast. Even if we consider just the English alphabet, the probability of the first character being the same in 2 random strings is 1/26, of the first and second — 1/676, and so on. And if we assume that the whole charset may be used, we'll have to substitute 26 with 256 or a greater value. So, in theory, such a naive approach has almost O(n) complexity, where n is the length of the string. Yet, the worst case is O(n * m), where m is the length of the pattern. Why? If we try to match a pattern a..ab against a string aa.....ab, at each position, we'll have to check the whole pattern until the last character mismatches. This may seem like an artificial example and, indeed, it rarely occurs. But, still, real-world strings are not so random and are much closer to the uniform corner case than to the random one. So, researchers have come up with a number of ways to improve subsequence matching performance. Those include the four well-known inventor-glorifying substring search algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp, and Aho-Corasick. Let's discuss each one of them and try to determine their interesting properties.
KMP
Knuth-Morris-Pratt is the most basic of these algorithms. Prior to performing the search, it examines the pattern to find repeated subsequences in it and creates a table containing, for each character of the pattern, the length of the prefix of the pattern that can be skipped if we have reached this character and failed the search at it. This table is also called the "failure function". The number in the table is calculated as the length of the proper suffix[1] of the pattern substring ending before the current character that matches the start of the pattern.
I'll repeat here the example provided in Wikipedia that explains the details of the table-building algorithm, as it's somewhat tricky.
Let's build the table for the pattern abcdabd. We set the table entry for the first char a to -1. To find the entry for b, we must discover a proper suffix of a which is also a prefix of the pattern. But there are no proper suffixes of a, so we set this entry to 0. To find the entry with index 2, we see that the substring ab has a proper suffix b. However, b is not a prefix of the pattern. Therefore, we also set this entry to 0.
For the next entry, we first check the proper suffix of length 1, and it fails like in the previous case. Should we also check longer suffixes? No. We can formulate a shortcut rule: at each stage, we need to consider checking suffixes of a given size (1+ n)
only if a valid suffix of size n
was found at the previous stage and should not bother to check longer lengths. So we set the table entry for c
to 0 also.
We pass to the subsequent character a. The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since d is not a prefix. But instead of setting the table entry to 0, we can do better by noting that a is also the first character of the pattern, and also that the corresponding character of the string can't be a (as we're calculating for the mismatch case). Thus there is no point in trying to match the pattern for this character again — we should begin 1 character ahead. This means that we may shift the pattern by match length plus one character, so we set the table entry to -1.
Considering now the next character b
: though by inspection the longest substring would appear to be a
, we still set the table entry to 0. The reasoning is similar to the previous case. b
itself extends the prefix match begun with a
, and we can assume that the corresponding character in the string is not b
. So backtracking before it is pointless, but that character may still be a
, hence we set the entry not to -1, but to 0, which means shifting the pattern by 1 character to the left and trying to match again.
Finally, for the last character d
, the rule of the proper suffix matching the prefix applies, so we set the table entry to 2.
The resulting table is:
 a  b  c  d  a  b  d
-1  0  0  0 -1  0  2
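The table-building rule described above can be cross-checked with a short Python sketch of the same procedure (the function name kmp_table is an illustrative choice):

```python
def kmp_table(pat):
    """KMP failure function in the convention used in the example above:
    -1 means "restart matching one character ahead in the string"."""
    table = [0] * len(pat)
    table[0] = -1
    cnd = 0  # index in the pattern of the prefix candidate being tracked
    for pos in range(1, len(pat)):
        if pat[pos] == pat[cnd]:
            # the same mismatch would recur at CND, so reuse its entry
            table[pos] = table[cnd]
        else:
            table[pos] = cnd
            while cnd >= 0 and pat[pos] != pat[cnd]:
                cnd = table[cnd]
        cnd += 1
    return table

print(kmp_table("abcdabd"))  # → [-1, 0, 0, 0, -1, 0, 2]
```

Running it on the worked pattern reproduces the table row for row.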
Here's the implementation of the table-building routine:
(defun kmp-table (pat)
  (let ((rez (make-array (length pat)))
        (i 0))  ; prefix length
    (:= (? rez 0) -1)
    (loop :for j :from 1 :below (length pat) :do
      (if (char= (char pat i) (char pat j))
          (:= (? rez j) (? rez i))
          (progn (:= (? rez j) i
                     i (? rez i))
                 (loop :while (and (>= i 0)
                                   (not (char= (char pat i) (char pat j))))
                   :do (:= i (? rez i)))))
      (:+ i))
    rez))
It can be proven that it runs in O(m)
. We won't show it here, so coming up with proper calculations is left as an exercise to the reader.
Now, the question is, how shall we use this table? Let's look at the code:
(defun kmp-match (pat str)
  (let ((s 0)
        (p 0)
        (ff (kmp-table pat)))
    (loop :while (< s (length str)) :do
      (if (char= (char pat p) (char str s))
          ;; if the current characters of the pattern and string match
          (if (= (1+ p) (length pat))
              ;; if we reached the end of the pattern - success
              (return (- s p))
              ;; otherwise, match the subsequent characters
              (:= p (1+ p)
                  s (1+ s)))
          ;; if the characters don't match
          (if (= -1 (? ff p))
              ;; shift the pattern for the whole length
              (:= p 0
                  ;; and skip to the next char in the string
                  s (1+ s))
              ;; try matching the current char again,
              ;; shifting the pattern to align the prefix
              ;; with the already matched part
              (:= p (? ff p)))))))
As we see, the index in the string (s) is incremented at each iteration except when the entry in the table is positive. In the latter case, we may examine the same character more than once but not more times than we have advanced in the pattern. And the advancement in the pattern meant the same advancement in the string (as the match is required for the advancement). In other words, we can backtrack not more than n times over the whole algorithm runtime, so the worst-case number of operations in kmp-match is 2n, while the best case is just n. Thus, the total complexity is O(n + m).
And what will happen in our aa..ab example? The failure function for it will look like the following: -1 -1 -1 -1 (- m 2). Once we reach the first mismatch, we'll need to backtrack by 1 character, perform the comparison, which will mismatch, advance by 1 character (to b), mismatch again, again backtrack by 1 character, and so on until the end of the string. So, this case will have almost the above-mentioned 2n runtime.
To conclude, the optimization of KMP lies in excluding unnecessary repetition of the same operations by memoizing the results of partial computations — both in the table-building and matching parts. The next chapter of the book will be almost exclusively dedicated to studying this approach in algorithm design.
BM
The Boyer-Moore algorithm is conceptually similar to KMP, but it matches from the end of the pattern. It also builds a table, or rather three tables, but using a different set of rules, which also involve the characters in the string we search. More precisely, there are two basic rules instead of one for KMP. Besides, there's another rule, called the Galil rule, that is required to ensure the linear complexity of the algorithm. Overall, BM is pretty complex in the implementation details and also requires more preprocessing than KMP, so its utility outweighs these factors only when the search is repeated multiple times for the same pattern.
Overall, BM may be faster with normal text (and the longer the pattern, the faster), while KMP will work the best with strings that have a short alphabet (like DNA). However, I would choose KMP as the default due to its relative simplicity and much better space utilization.
RK
Now, let's talk about alternative approaches that rely on techniques other than pattern preprocessing. They are usually used to find matches of multiple patterns in one go as, for the base case, their performance will be worse than that of the previous algorithms.
The Rabin-Karp algorithm uses the idea of a Rolling hash. It is a hash function that can be calculated incrementally. The RK hash is calculated for each substring of the length of the pattern. If we were to calculate a normal hash function like FNV-1, we'd need to use each character for the calculation — resulting in O(n * m) complexity of the whole procedure. The rolling hash is different as it requires, at each step of the algorithm, performing just 2 operations: as the "sliding window" moves over the string, subtract the part of the hash corresponding to the character that is no longer part of the substring and add the new value for the character that has just become part of the substring.
Here is the skeleton of the RK algorithm:
(defun rk-match (pat str)
  (let ((len (length pat))
        (phash (rk-hash pat)))
    (loop :for i :from len :to (length str)
          :for beg := (- i len)
          :for shash := (rk-hash (slice str 0 len))
            :then (rk-rehash shash len (char str (1- beg)) (char str (1- i)))
          :when (and (= phash shash)
                     (string= pat (slice str beg i)))
            :collect beg)))
A trivial rk-hash function would be just:

(defun rk-hash (str)
  (loop :for ch :across str :sum (char-code ch)))
But it is, obviously, not a good hash function as it doesn't ensure an equal distribution of hashes. Still, in this case, we need a reversible hash function. Usually, such hashes add position information into the mix. The original hash function for the RK algorithm is the Rabin fingerprint that uses random irreducible polynomials over Galois fields of order 2. The mathematical background needed to explain it is somewhat beyond the scope of this book. However, there are simpler alternatives such as the following:
(defun rk-hash (str)
  (assert (> (length str) 0))
  (let ((rez (char-code (char str 0))))
    (loop :for ch :across (slice str 1) :do
      (:= rez (+ (rem (* rez 256) 101)
                 (char-code ch))))
    (rem rez 101)))
Its basic idea is to treat the partial values of the hash as the coefficients of some polynomial.
The implementation of rk-rehash for this function will look like this:

(defun rk-rehash (hash len ch1 ch2)
  (rem (+ (* (+ hash 101
                (- (rem (* (char-code ch1)
                           (expt 256 (1- len)))
                        101)))
             256)
          (char-code ch2))
       101))
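The key property of such a hash is that the incremental update always agrees with a full recomputation. Here is a Python sketch of the same polynomial hash (base 256, modulus 101, as above; the names rk_hash/rk_rehash mirror the Lisp functions) that verifies this on a sample string:

```python
B, M = 256, 101  # the base and modulus used by rk-hash above

def rk_hash(s):
    """Polynomial hash of the whole string, mod M."""
    h = 0
    for ch in s:
        h = (h * B + ord(ch)) % M
    return h

def rk_rehash(h, length, out_ch, in_ch):
    """Slide the window: drop OUT_CH's contribution, append IN_CH."""
    return ((h - ord(out_ch) * pow(B, length - 1, M)) * B + ord(in_ch)) % M

s, n = "abracadabra", 4
h = rk_hash(s[:n])
for i in range(n, len(s)):
    h = rk_rehash(h, n, s[i - n], s[i])
    assert h == rk_hash(s[i - n + 1:i + 1])  # rolling result == recompute
```

Each window shift costs just the 2 operations described above, independent of the window length.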
Our rk-match could be used to find many matches of a single pattern. To adapt it for operating on multiple patterns at once, we'll just need to precalculate the hashes for all patterns and look up the current rk-hash value in this set. Additional optimization of this lookup may be performed with the help of a Bloom filter — a stochastic data structure we'll discuss in more detail later.
Finally, it's worth noting that there are other similar approaches to the rolling hash concept that trade some of the uniqueness properties of the hash function for the ability to produce hashes incrementally or to have similar hashes for similar sequences. For instance, the Perceptual hash (phash) is used to find near-match images.
AC
Aho-Corasick is another algorithm that allows matching multiple strings at once. The preprocessing step of the algorithm constructs a Finite-State Machine (FSM) that resembles a trie with additional links between the various internal nodes. An FSM is a graph data structure that encodes possible states of the system and the actions needed to transfer it from one state to another.
The AC FSM is constructed in the following manner:
- Build a trie of all the words in the set of search patterns (the search dictionary). This trie represents the possible flows of the program when there's a successful character match at the current position. Add a loop edge for the root node.
- Add backlinks transforming the trie into a graph. The backlinks are used when a failed match occurs. These backlinks point either to the root of the trie or, if there are some prefixes that correspond to the part of the currently matched path, to the end of the longest such prefix. The longest prefix is found using BFS over the trie. This approach is, basically, the same idea used in KMP and BM to avoid re-examining the already matched parts. So backlinks to the previous parts of the same word are also possible.
Here is the example FSM for the search dictionary '("the" "this" "that" "it" "his")
:
Basically, it's just a trie with some backlinks to account for already processed prefixes. One more detail missing for this graph to be a complete FSM is an implicit backlink to the root node from all the nodes that don't have an explicit one.
The main loop of the algorithm is rather straightforward: examine each character and then
- either follow one of the transitions (direct edges) if the character of the edge matches
- or follow the backlink if it exists
- or reset the FSM state — go to root
- if the transition leads us to a terminal node, record the match(es) and return to root as well
As we see from the description, the complexity of the main loop is linear in the length of the string: at most 2 matches are performed for each character. The FSM construction is also linear in the total length of all the words in the search dictionary.
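The whole construction can be condensed into a compact Python sketch (the trie-as-list-of-dicts representation and the names build_ac/ac_match are illustrative; "fail" holds the backlink and "out" the matches a node emits):

```python
from collections import deque

def build_ac(patterns):
    """Build an Aho-Corasick automaton: a trie whose nodes get
    BFS-computed backlinks and the patterns they emit on arrival."""
    trie = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in trie[node]["next"]:
                trie[node]["next"][ch] = len(trie)
                trie.append({"next": {}, "fail": 0, "out": []})
            node = trie[node]["next"][ch]
        trie[node]["out"].append(pat)
    # BFS: a node's backlink points to the longest proper suffix of its
    # path that is also a path in the trie (the root, id 0, by default)
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            fail = trie[node]["fail"]
            while fail and ch not in trie[fail]["next"]:
                fail = trie[fail]["fail"]
            trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
            queue.append(child)
    return trie

def ac_match(trie, text):
    """Single pass over TEXT, following edges or backlinks per char."""
    node, matches = 0, []
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]             # follow the backlink
        node = trie[node]["next"].get(ch, 0)      # direct edge, or root
        for pat in trie[node]["out"]:             # record the match(es)
            matches.append((i - len(pat) + 1, pat))
    return matches
```

With the example dictionary from above, searching "this is his hat" finds "this" at 0 and "his" at 1 (as a suffix of "this") and at 8 — all in one pass.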
The algorithm is often used in antivirus software to perform an efficient search for code signatures against a database of known viruses. It also formed the basis of the original Unix command fgrep
. And, from my point of view, it's the simplest to understand yet pretty powerful and versatile substring search algorithm that may be a default choice if you ever have to implement one yourself.
Regular Expressions
Searching is, probably, the most important advanced string operation. Besides, it is not limited to mere substring search — matching of more complex patterns is in even higher demand. These patterns, which are called "regular expressions" or, simply, regexes, may include optional characters, repetition, alternatives, backreferences, etc. Regexes play an important role in the history of the Unix command-line, being the principal technology of the infamous grep utility, and then the cornerstone of Perl. All modern programming languages support them either in the standard library or, as Lisp, with high-quality third-party addons (cl-ppcre).
One of my favorite programming books, "Beautiful Code", has a chapter by Brian Kernighan on implementing simple regex matching, with code written by Rob Pike. It shows how easy it is to perform basic matching of the following patterns:
c matches any literal character c
. matches any single character
^ matches the beginning of the input string
$ matches the end of the input string
* matches zero or more occurrences of the previous character
Below, the C code from the book is translated into an equivalent Lisp version:
(defun match (regex text)
  "Search for REGEX anywhere in TEXT."
  (if (starts-with "^" regex)  ; STARTS-WITH is from RUTILS
      (match-here (slice regex 1) text)
      (dotimes (i (length text))
        (when (match-here regex (slice text i))
          (return t)))))
(defun match-here (regex text)
  "Search for REGEX at beginning of TEXT."
  (cond ((= 0 (length regex))
         t)
        ((and (> (length regex) 1)
              (char= #\* (char regex 1)))
         (match-star (char regex 0) (slice regex 2) text))
        ((string= "$" regex)
         (= 0 (length text)))
        ((and (> (length text) 0)
              (member (char regex 0) (list #\. (char text 0)))
              (match-here (slice regex 1) (slice text 1))))))
(defun match-star (c regex text)
  "Search for C*REGEX at beginning of TEXT."
  (loop
    (when (match-here regex text) (return t))
    (unless (and (> (length text) 0)
                 (member c (list #\. (char text 0))))
      (return))
    (:= text (slice text 1))))
This is a greedy linear algorithm. However, modern regexes are much more advanced than this naive version. They include such features as register groups (to record the spans of text that match a particular subpattern), backreferences, non-greedy repetition, and so on and so forth. Implementing those will require changing the simple linear algorithm to a backtracking one. And incorporating all of them would quickly transform the code above into a horrible unmaintainable mess: not even due to the number of cases that have to be supported but due to the need to account for the complex interdependencies between them.
And, what's worse, soon there will arise a need to resort to backtracking. Yet, a backtracking approach has a critical performance flaw: potential exponential runtime for certain input patterns. For instance, the Perl regex engine (PCRE) requires over sixty seconds to match a 30-character string aa..a against the pattern a?{15}a{15} (on standard hardware). While the alternative approach, which we'll discuss next, requires just twenty microseconds — a million times faster. And it handles a 100-character string of a similar kind in under 200 microseconds, while Perl would require over 10^15 years.[2]
This issue is quite severe and has even prompted Google to release their own regex library with strict linear performance guarantees — RE2. The goal of the library is not to be faster than all other engines under all circumstances. Although RE2 guarantees linear-time performance, the linear-time constant varies depending on the overhead entailed by its way of handling the regular expression. In a sense, RE2 behaves pessimistically whereas backtracking engines behave optimistically, so it can be outperformed in various situations. Also, its goal is not to implement all of the features offered by PCRE and other engines. As a matter of principle, RE2 does not support constructs for which only backtracking solutions are known to exist. Thus, backreferences and lookaround assertions are not supported.
The figures above are taken from a seminal article by Russ Cox. He goes on to add:
Historically, regular expressions are one of computer science's shining examples of how using good theory leads to good programs. They were originally developed by theorists as a simple computational model, but Ken Thompson introduced them to programmers in his implementation of the text editor QED for CTSS. Dennis Ritchie followed suit in his own implementation of QED, for GE-TSS. Thompson and Ritchie would go on to create Unix, and they brought regular expressions with them. By the late 1970s, regular expressions were a key feature of the Unix landscape, in tools such as ed, sed, grep, egrep, awk, and lex. Today, regular expressions have also become a shining example of how ignoring good theory leads to bad programs. The regular expression implementations used by today's popular tools are significantly slower than the ones used in many of those thirty-year-old Unix tools.
The linear-time approach to regex matching relies on a technique similar to the one in the Aho-Corasick algorithm — the FSM. Actually, if by regular expressions we mean the set of languages that abide by the rules of the regular grammars in the Chomsky hierarchy of languages, the FSM is their exact theoretical computation model. Here is how an FSM for a simple regex a*b$ might look:
Such an FSM is called an NFA (Nondeterministic Finite Automaton) as some states have more than one alternative successor. Another type of automata are DFAs (Deterministic Finite Automata) that permit transitions to at most one state, for each state. The method to transform a regex into an NFA is called Thompson's construction. And an NFA can be made into a DFA by the powerset construction and then minimized to get an optimal automaton. DFAs are more efficient to execute than NFAs, because DFAs are only ever in one state at a time: they never have a choice of multiple next states. But the construction takes additional time. Anyway, both NFAs and DFAs guarantee linear-time execution.
Thompson's algorithm builds the NFA up from partial NFAs for each subexpression, with a different construction for each operator. The partial NFAs have no matching states: instead, they have one or more dangling arrows, pointing at nothing. The construction process finishes by connecting these arrows to a matching state.
- The NFA for matching a single character e is a single node with a slot for an incoming arrow and a pending outgoing arrow labeled with e.
- The NFA for the concatenation e1e2 connects the outgoing arrow of the e1 machine to the incoming arrow of the e2 machine.
- The NFA for the alternation e1|e2 adds a new start state with a choice of either the e1 machine or the e2 machine.
- The NFA for e? alternates the e machine with an empty path.
- The NFA for e* uses the same alternation but loops a matching e machine back to the start.
- The NFA for e+ also creates a loop, but one that requires passing through e at least once.
Counting the states in the above constructions, we can see that this technique creates exactly one state per character or metacharacter in the regular expression. The only exceptions are the constructs c{n} and c{n,m}, which require duplicating the single-character automaton n or m times respectively, but that is still a constant number. Therefore, the number of states in the final NFA is at most equal to the length of the original regular expression plus some constant.
Implementation of Thompson's Construction
The core of the algorithm can be implemented very transparently with the help of Lisp generic functions. However, to enable their application, we'd first need to transform the raw expression into a sexp (tree-based) form. Such representation is supported, for example, by the cl-ppcre library:

PPCRE> (parse-string "ab[0-9]+c$")
(:SEQUENCE "ab" (:GREEDY-REPETITION 1 NIL (:RANGE #\0 #\9)) #\c :END-ANCHOR)
Parsing is a whole separate topic that will be discussed next. But once we have performed it, we gain the possibility to straightforwardly implement Thompson's construction by traversing the parse tree and emitting, for each state, the corresponding part of the automaton. Lisp generic functions are a great tool for implementing such a transformation as they allow us to define methods that are selected based on either the type or the identity of the arguments. And those methods can be added independently, so the implementation is clear and extensible. We will define 2 generic functions: one to emit the automaton fragment (th-part) and another to help in transition selection (th-match).
First, let's define the state node of the FSM. We will use a linked graph representation for the automaton. So, a variable for the FSM in the code will point to its start node, and it will, in turn, reference the other nodes. There will also be a special node that will be responsible for recording the matches (*matched-state*).
(defstruct th-state
  transitions)

(defparameter *initial-state* nil)
(defparameter *matched-state* (make-th-state))

(defun th-state (&rest transitions)
  "A small convenience function to construct TH-STATE structs."
  (make-th-state :transitions (loop :for (cond state) :on transitions :by 'cddr
                                    :collect (pair cond state))))
And now, we can define the generic function that will emit the nodes:
(define-condition check-start-anchor () ())

(defgeneric th-part (next-state kind &rest args)
  (:documentation
   "Emit the TH-STATE structure of a certain KIND
    (which may be a keyword or a raw string) using the other ARGS
    and pointing to NEXT-STATE struct.")
  (:method (next-state (kind (eql :sequence)) &rest args)
    (apply 'th-part (if (rest args)
                        (apply 'th-part next-state :sequence (rest args))
                        next-state)
           (first args)))
  (:method (next-state (kind (eql :greedy-repetition)) &rest args)
    ;; this method can handle *, +, {n}, and {n,m} regex modifiers
    ;; in any case, there's a prefix sequence of fixed nonnegative length
    ;; of identical elements that should unconditionally match,
    ;; followed by a bounded or unbounded sequence that,
    ;; in case of a failed match, transitions to the next state
    (apply 'th-part
           (let ((*initial-state* next-state))
             (apply 'th-part next-state :sequence
                    (loop :repeat (or (second args) 1)
                          :collect (mklist (third args)))))
           :sequence (loop :repeat (first args)
                           :collect (mklist (third args)))))
  (:method (next-state (kind character) &rest args)
    (th-state kind next-state
              ;; Usually, *initial-state* will be nil, i.e. further computations
              ;; along this path will be aborted, but, for some variants (? or *),
              ;; they will just continue normally to the next state.
              ;; The special variable allows to control this as you can see in
              ;; the method for :greedy-repetition
              t *initial-state*))
  (:method (next-state (kind (eql :end-anchor)) &rest args)
    (th-state nil *matched-state*
              t *initial-state*))
  (:method (next-state (kind (eql :start-anchor)) &rest args)
    ;; This part is unique in that all the other parts consume the next character
    ;; (we're not implementing lookahead here), but this one shouldn't.
    ;; To implement such behavior without the additional complexity created by passing
    ;; the string being searched to this function (which we'll still, probably, need to do
    ;; later on, but were able to avoid so far), we can resort to a cool Lisp technique
    ;; of signaling a condition that can be handled specially in the top-level code
    (signal 'check-start-anchor)
    next-state))
Here, we have defined some of the methods of th-part that specialize on the basic :sequence of expressions, :greedy-repetition (regex * and +), a single character, and the singleton symbols :start-anchor/:end-anchor (regexes ^ and $). As you can see, some of them dispatch (are chosen) based on the identity of the first argument (using eql specializers), while the character-related method specializes on the class of the arg. As we develop this facility, we could add more methods with defmethod. Running th-part on the whole parse tree will produce the complete automaton — we don't need to do anything else!
To use the constructed FSM, we run it with the string as input. NFAs are endowed with the ability to guess perfectly when faced with a choice of next state: to run the NFA on a real computer, we must find a way to simulate this guessing. One way to do that is to guess one option and, if that doesn't work, try the other. A more efficient way to simulate perfect guessing is to follow all admissible paths simultaneously. In this approach, the simulation allows the machine to be in multiple states at once. To process each letter, it advances all the states along all the arrows that match the letter. In the worst case, the NFA might be in every state at each step, but this results in at worst a constant amount of work per character, independent of the length of the string, so arbitrarily long input strings can be processed in linear time. The efficiency comes from tracking the set of reachable states, but not which paths were used to reach them. In an NFA with n nodes, there can be only n reachable states at any step.
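Before diving into the Lisp implementation below, the set-of-states simulation can be made concrete with a minimal sketch in Python (the representation and the example NFA here are toy data invented for illustration, not the automaton produced by th-part):

```python
def run_nfa(transitions, start, accepting, string):
    """transitions: dict mapping (state, char) -> set of next states."""
    states = {start}                   # all simultaneously active states
    for char in string:
        # advance every active state along every matching arrow at once
        states = set().union(*(transitions.get((s, char), set())
                               for s in states))
        if not states:
            return False               # no admissible path survived
    return bool(states & accepting)    # did some path end in an accepting state?

# toy NFA for the regex "ab*a": 0 -a-> 1, 1 -b-> 1, 1 -a-> 2 (2 is accepting)
nfa = {(0, 'a'): {1}, (1, 'b'): {1}, (1, 'a'): {2}}
```

The states set can never hold more members than the machine has nodes, which is exactly the linear-time bound described above.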
(defun run-nfa (nfa str)
  (let ((i 0)
        (start 0)
        (matches (list))
        (states (list nfa)))
    ;; this is the counterpart for the start-anchor signal
    (handler-bind ((check-start-anchor
                     ;; there's no sense in proceeding to match a ^... regex
                     ;; if we're not at the start of the string
                     (lambda (c) (when (> i 0) (return-from run-nfa)))))
      (dovec (char (concatenate 'vector str
                                #(nil)))  ; for handling end-anchor
        (let ((new-states (list)))
          (dolist (state states)
            (dolist (tr (? state 'transitions))
              (when (th-match tr char)
                (case (rt tr)
                  (*matched-state* (push start matches))
                  (nil )  ; ignore it
                  (t (pushnew (rt tr) new-states)))
                (return))))
          (if new-states
              (:= states new-states)
              (:= states (list nfa)
                  start nil)))
        (:+ i)
        (unless start (:= start i))))
    matches))
The th-match function may have methods to match a single char and a character range, as well as a particular predicate. Its implementation is trivial and left as an exercise to the reader.
Overall, interpreting an automaton is a simple and robust approach, yet if we want to squeeze out all the possible performance, we can compile it directly to machine code. This is much easier to do with a DFA, as it has at most 2 possible transitions from each state, so the automaton can be compiled into a multilevel conditional or even a jump table.
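As a hedged sketch of the jump-table idea (in Python rather than machine code, and with a made-up two-state DFA for "an even number of a's"), the compiled automaton boils down to one table lookup per character:

```python
# transition table of a toy DFA: (state, char) -> next state
table = {('even', 'a'): 'odd',  ('even', 'b'): 'even',
         ('odd',  'a'): 'even', ('odd',  'b'): 'odd'}

def run_dfa(table, start, accepting, string):
    state = start
    for char in string:
        state = table.get((state, char))  # the "jump": one lookup per character
        if state is None:
            return False                  # no transition defined: reject
    return state in accepting
```

A real compiler would turn each row of this table into a branch or a computed jump, but the per-character cost stays constant either way.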
Grammars
Regexes are called "regular" for a reason: there's a corresponding mathematical formalism, "regular languages", that originates from the hierarchy of grammars compiled by Noam Chomsky. This hierarchy has 4 levels, each one allowing strictly more complex languages to be expressed with it. And for each level, there's an equivalent computation model:
- Type 0: recursively enumerable (or universal) grammars — Turing machine
- Type 1: context-dependent (or context-sensitive) grammars — a linear bounded automaton
- Type 2: context-free grammars — pushdown automaton
- Type 3: regular grammars — FSM
We have already discussed the bottom layer of the hierarchy. Regular languages are the most limited (and thus the simplest to implement): for example, you can write a regex a{15}b{15}, but you won't be able to express a{n}b{n} for an arbitrary n, i.e. ensure that b is repeated the same number of times as a. The top layer corresponds to all programs, and so all the programming science and lore, in general, is applicable to it. Now, let's talk about context-free grammars, which are another type that is heavily used in practice and even has a dedicated set of algorithms. Such grammars can be used not only for simple matching but also for parsing and generation. Parsing, as we have seen above, is the process of transforming a text that is assumed to follow the rules of a certain grammar into the structured form that corresponds to the particular rules that can be applied to this text. And generation is the reverse process: applying the rules to obtain the text. This topic is huge and there's a lot of literature on it, including the famous Dragon Book.
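To make the a{n}b{n} example concrete: a single counter, which is the simplest possible use of a pushdown automaton's stack, is enough to recognize this language, while no finite set of states can. A sketch (the function name is my own):

```python
def match_anbn(string):
    depth = 0
    seen_b = False
    for char in string:
        if char == 'a':
            if seen_b:
                return False   # an 'a' after a 'b' breaks the a...b shape
            depth += 1         # the counter plays the role of the stack: "push"
        elif char == 'b':
            seen_b = True
            depth -= 1         # "pop"
            if depth < 0:
                return False   # more b's than a's so far
        else:
            return False
    return depth == 0          # every 'a' must be matched by a 'b'
```

Replacing the counter with an actual stack of symbols generalizes this trick to arbitrary context-free languages.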
Parsing is used for processing both artificial (including programming) and natural languages. And, although different sets of rules may be used, as well as different approaches for selecting a particular rule, the resulting structure will be a tree. In fact, formally, each grammar consists of 4 items:
- The set of terminals (leaves of the parse tree) or tokens of the text: these could be words or characters for a natural language; keywords, identifiers, and literals for a programming language; etc.
- The set of nonterminals — symbols used to name different items in the rules and in the resulting parse tree — the non-leaf nodes of the tree. These symbols are abstract and not encountered in the actual text. Examples of nonterminals could be VB (verb) or NP (noun phrase) in natural language parsing, and if-section or template-argument in the parsing of C++ code.
- The root symbol (which should be one of the nonterminals).
- The set of production rules that have two sides: a left-hand (lhs) and a right-hand (rhs) one. In the left-hand side, there should be at least one nonterminal, which is substituted with a number of other terminals or nonterminals in the right-hand side. During generation, the rule allows the algorithm to select a particular surface form for an abstract nonterminal (for example, turn a nonterminal VB into the word do). During parsing, which is the reverse process, it allows the program, when it's looking at a particular substring, to replace it with a nonterminal and expand the tree structure. When the parsing process reaches the root symbol by performing such substitution and expansion, it is considered terminated.
Each compiler has to use parsing as a step in transforming the source into executable code. Also, parsing may be applied to any data format (for instance, JSON) to transform it into machine data. In natural language processing, parsing is used to build the various tree representations of the sentence, which encode linguistic rules and structure.
There are many different types of parsers that differ in the additional constraints they impose on the structure of the production rules of the grammar. The generic context-free constraint is that, in each production rule, the left-hand side may only be a single nonterminal. The most widespread of context-free grammars are LL(k) (in particular, LL(1)) and LR (LR(1), SLR, LALR, GLR, etc). For example, LL(1) parsers (one of the easiest to build) parse the input from left to right, performing leftmost derivation of the sentence, and are allowed to look ahead at most 1 symbol. Not all combinations of derivation rules allow the algorithm to build a parser that will be able to perform unambiguous rule selection under such constraints. But, as LL(1) parsing is simple and efficient, some authors of grammars specifically target their language to be LL(1)-parseable. For example, Pascal and other programming languages created by Niklaus Wirth fall into this category.
There are also two principal approaches to implementing the parser: a top-down and a bottom-up one. In a top-down approach, the parser tries to build the tree from the root, while, in a bottom-up one, it tries to find the rules that apply to groups of terminal symbols and then combine those until the root symbol is reached. Obviously, we can't enumerate all parsing algorithms here, so we'll study only a single approach, which is one of the most widespread, efficient, and flexible ones — Shift-Reduce Parsing. It's a bottom-up linear algorithm that can be considered one of the instances of the pushdown automaton approach — a theoretical computational model for context-free grammars.
A shift-reduce parser operates on a queue of tokens of the original sentence. It also has access to a stack. At each step, the algorithm can perform:
- either a shift operation: take the token from the queue and push it onto the stack
- or a reduce operation: take the top items from the stack, select a matching rule from the grammar, and add the corresponding subtree to the partial parse tree, in the process removing the items from the stack
Thus, for each token, it will perform exactly 2 "movement" operations: push it onto the stack and pop it from the stack. Plus, it will perform rule lookup, which requires a constant number of operations (bounded by the maximum length of the rhs of any rule) if an efficient structure is used for storing the rules. A hash-table indexed by the rhs's or a trie are good choices for that.
Here's a small example from the domain of NLP syntactic parsing. Let's consider a toy grammar:
S -> NP VP .
NP -> DET ADJ NOUN
NP -> PRP$ NOUN  ; PRP$ is a possessive pronoun
VP -> VERB VP
VP -> VERB NP
and the following vocabulary:
DET -> a|an|the
NOUN -> elephant|pyjamas
ADJ -> large|small
VERB -> is|wearing
PRP$ -> my
Now, let's parse the sentence (already tokenized): A large elephant is wearing my pyjamas .
First, we'll need to perform part-of-speech tagging, which, in this example, is a matter of looking up the appropriate nonterminals from the vocabulary grammar. This will result in the following:
DET ADJ   NOUN    VERB  VERB   PRP$ NOUN    .
 |   |     |       |     |      |     |     |
 A large elephant  is  wearing  my pyjamas  .
These POS tags will serve the role of terminals for our parsing grammar. Now, the shift-reduce process itself begins:
1. Initial queue: (DET ADJ NOUN VERB VERB PRP$ NOUN .)
   Initial stack: ()
   Operation: shift
2. Queue: (ADJ NOUN VERB VERB PRP$ NOUN .)
   Stack: (DET)
   Operation: shift (as there are no rules with the rhs DET)
3. Queue: (NOUN VERB VERB PRP$ NOUN .)
   Stack: (ADJ DET)
   Operation: shift
4. Queue: (VERB VERB PRP$ NOUN .)
   Stack: (NOUN ADJ DET)
   Operation: reduce (rule NP -> DET ADJ NOUN)
   ; we match the rules in reverse to the stack
5. Queue: (VERB VERB PRP$ NOUN .)
   Stack: (NP)
   Operation: shift
6. Queue: (VERB PRP$ NOUN .)
   Stack: (VERB NP)
   Operation: shift
7. Queue: (PRP$ NOUN .)
   Stack: (VERB VERB NP)
   Operation: shift
8. Queue: (NOUN .)
   Stack: (PRP$ VERB VERB NP)
   Operation: shift
9. Queue: (.)
   Stack: (NOUN PRP$ VERB VERB NP)
   Operation: reduce (rule: NP -> PRP$ NOUN)
10. Queue: (.)
    Stack: (NP VERB VERB NP)
    Operation: reduce (rule: VP -> VERB NP)
11. Queue: (.)
    Stack: (VP VERB NP)
    Operation: reduce (rule: VP -> VERB VP)
12. Queue: (.)
    Stack: (VP NP)
    Operation: shift
13. Queue: ()
    Stack: (. VP NP)
    Operation: reduce (rule: S -> NP VP .)
14. Reached the root symbol — end.
The resulting parse tree is:
        __________S___________
       /           \          \
      /           __VP__       \
     /           /      \       \
    /           /     __VP_      \
   /           /     /     \      \
 __NP_____    /     /     _NP_     \
/   |     \  /     /     /    \     \
DET ADJ NOUN VERB VERB PRP$ NOUN     .
 |   |    |    |    |    |    |      |
 A large elephant is wearing my pyjamas .
The implementation of the basic algorithm is very simple:
(defstruct grammar
  rules
  max-length)

(defmacro grammar (&rest rules)
  `(make-grammar
    :rules (pairs->ht (mapcar (lambda (rule)
                                (pair (nthcdr 2 rule) (first rule)))
                              ',rules))
    :max-length
    (let ((max 0))
      (dolist (rule ',rules)
        (when (> #1=(length (nthcdr 2 rule)) max)
          (:= max #1#)))
      max)))  ; #1= & #1# are reader-macros for anonymous variables
(defun parse (grammar queue)
  (let ((stack (list)))
    (loop :while queue :do
      (print stack)  ; diagnostic output
      (if-it (find-rule stack grammar)
             ;; reduce
             (dotimes (i (length (cdr it))
                         (push it stack))
               (pop stack))
             ;; shift
             (push (pop queue) stack))
      :finally (return (find-rule stack grammar)))))
(defun find-rule (stack grammar)
  (let (prefix)
    (loop :for item :in stack
          :repeat (? grammar 'max-length) :do
      (push (car (mklist item)) prefix)
      (when-it (? grammar 'rules prefix)
        ;; otherwise parsing will fail with a stack
        ;; containing a number of partial subtrees
        (return (cons it (reverse (subseq stack 0 (length prefix)))))))))
CL-USER> (parse (print (grammar (S -> NP VP .)
                                (NP -> DET ADJ NOUN)
                                (NP -> PRP$ NOUN)
                                (VP -> VERB VP)
                                (VP -> VERB NP)))
                '(DET ADJ NOUN VERB VERB PRP$ NOUN .))
#S(GRAMMAR
   :RULES #{
            '(NP VP .) S
            '(DET ADJ NOUN) NP
            '(PRP$ NOUN) NP
            '(VERB VP) VP
            '(VERB NP) VP
           }
   :MAX-LENGTH 3)
NIL
(DET)
(ADJ DET)
(NOUN ADJ DET)
((NP DET ADJ NOUN))
(VERB (NP DET ADJ NOUN))
(VERB VERB (NP DET ADJ NOUN))
(PRP$ VERB VERB (NP DET ADJ NOUN))
(NOUN PRP$ VERB VERB (NP DET ADJ NOUN))
((NP PRP$ NOUN) VERB VERB (NP DET ADJ NOUN))
((VP VERB (NP PRP$ NOUN)) VERB (NP DET ADJ NOUN))
((VP VERB (VP VERB (NP PRP$ NOUN))) (NP DET ADJ NOUN))
(S (NP DET ADJ NOUN) (VP VERB (VP VERB (NP PRP$ NOUN))) .)
However, an additional level of complexity arises when the grammar becomes ambiguous, i.e. there may be situations when several rules apply. Shift-reduce is a greedy algorithm, so, in its basic form, it will select some rule (for instance, the one with the shortest rhs or just the first match), and it cannot backtrack. This may result in a parsing failure. If some form of rule weights is added, the greedy selection may produce a suboptimal parse. Either way, there's no option of backtracking to correct a parsing error. In the NLP domain, the peculiarity of shift-reduce parsing is that the number of rules is quite significant (it can reach thousands) and, certainly, there's ambiguity. In this setting, shift-reduce parsing is paired with machine learning techniques, which perform a "soft" selection of the action to take at each step, as reduce is applicable almost always, so a naive greedy technique becomes pointless.
Actually, shift-reduce would better be called something like stack-queue parsing, as different parsers may not limit the implementation to just the shift and reduce operations. For example, an NLP parser that allows the construction of non-projective trees (those where the arrows may cross, i.e. subsequent words may not always belong to a single or subsequent upper-level categories) adds a swap operation. A more advanced NLP parser that produces a graph structure called an AMR (abstract meaning representation) has 9 different operations.
Shift-reduce parsing is implemented in many of the parser generator tools, which generate a parser program from a set of production rules. For instance, the popular Unix tool yacc is an LALR parser generator that uses shift-reduce. Another popular tool, ANTLR, is a parser generator for LL(k) languages that uses a direct pushdown-automaton-based implementation rather than shift-reduce.
Besides shift-reduce and similar automata-based parsers, there are many other parsing techniques used in practice. For example, CYK probabilistic parsing was popular in NLP for some time, but it's an O(n^3) algorithm, so it gradually fell from grace and lost to machine-learning-enhanced shift-reduce variants. Another approach is packrat parsing (based on PEG — parsing expression grammars), which has a great Lisp parser-generator library, esrap. Packrat is a more powerful top-down parsing approach with backtracking and unlimited lookahead that nevertheless guarantees linear parse time. Any language defined by an LL(k) or LR(k) grammar can be recognized by a packrat parser, in addition to many languages that conventional linear-time algorithms do not support. This additional power simplifies the handling of common syntactic idioms such as the widespread but troublesome longest-match rule, enables the use of sophisticated disambiguation strategies such as syntactic and semantic predicates, provides better grammar composition properties, and allows lexical analysis to be integrated seamlessly into parsing. The last feature makes packrat very appealing to programmers, as they don't have to use separate tools for lexical analysis (tokenization and token categorization) and parsing. Moreover, the rules for tokens use the same syntax, which is also quite similar to regular expression syntax. For example, here's a portion of the esrap rules for parsing tables in Markdown documents. The Markdown table may look something like this:
| Left-Aligned  | Center Aligned  | Right Aligned |
| :------------ | :-------------: | ------------: |
| col 3 is      | some wordy text | $1600         |
| col 2 is      | centered        | $12           |
| zebra stripes | are neat        | $1            |
You can see that the code below is quite self-explanatory: each defrule form consists of a rule name (lhs), its rhs, and a transformation of the rhs into a data structure. For instance, in the rule table-row, the rhs is (and (& #\|) (+ table-cell) #\| sp newline). The row should start with a | char, followed by 1 or more table-cells (a separate rule), and be ended by a | with some space characters and a newline. And the transformation (:destructure (_ cells &rest __) ...) only cares about the content, i.e. the table cells.
(defrule sp (* space-char)
  (:text t))

(defrule table-cell (and #\|
                         sp
                         (* (and (! (or (and sp #\|) endline)) inline))
                         sp
                         (& #\|))
  (:destructure (_ __ content &rest ___)
    (mapcar 'second content)))

(defrule table-row (and (& #\|) (+ table-cell) #\| sp newline)
  (:destructure (_ cells &rest __)
    (mapcar (lambda (a) (cons :plain a))
            cells)))

(defrule table-align-cell (and sp (? #\:) (+ #\-) (? #\:) sp #\|)
  (:destructure (_ left __ right &rest ___)
    (if right (if left 'center 'right) (when left 'left))))

(defrule table-align-row (and #\| (+ table-align-cell) sp newline)
  (:destructure (_ aligns &rest __)
    aligns))

(defrule table-head (and table-row table-align-row))
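The linear-time guarantee of packrat parsing mentioned above comes from memoizing the result of each rule at each input position. Here is a hedged miniature of that idea in Python, for a toy PEG of balanced parentheses; it only illustrates the memoization trick and PEG's ordered choice, not esrap's actual implementation:

```python
from functools import lru_cache

def make_parser(text):
    # toy PEG: S <- '(' S ')' S / ''   (the choice is ordered, as in any PEG)
    @lru_cache(maxsize=None)       # the packrat memo: position -> parse result
    def S(pos):
        # first alternative: '(' S ')' S
        if pos < len(text) and text[pos] == '(':
            mid = S(pos + 1)
            if mid < len(text) and text[mid] == ')':
                return S(mid + 1)
        # second alternative: the empty match always succeeds
        return pos
    return S

def balanced(text):
    # the whole string belongs to the language iff S consumes it completely
    return make_parser(text)(0) == len(text)
```

Thanks to the memo, each position is parsed by each rule at most once, which is where the linear bound comes from.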
To conclude the topic of parsing, I wanted to pose a question: can it be used to match regular expressions? And the answer, of course, is that it can, as we are operating in a more powerful paradigm that includes regexes as a subdomain. However, the critical showstopper to applying parsing to this problem is the need to define a grammar instead of writing a compact and more or less intuitive regex...
String Search in Action: Plagiarism Detection
Plagiarism detection is a very challenging problem that doesn't have an exact solution. The reason is that there's no exact definition of what can be considered plagiarism and what can't; the boundary is rather blurry. Obviously, if the text or its part is just copy-pasted, everything is clear. But, usually (and especially when they know that plagiarism detection is at play), people will apply their creativity to alter the text in some slight or even significant ways. However, over the years, researchers have come up with numerous plagiarism detection algorithms, with quality good enough to be used in our educational institutions. The problem is very popular and there are even shared-task challenges dedicated to improving plagiarism catchers. It's something of an arms race between the plagiarists and the detection systems.
One of the earliest but, still, quite effective ways of implementing plagiarism detection is the Shingle algorithm. It is also based on the idea of using hashes and some basic statistical sampling techniques. The algorithm operates in the following stages:
- Text normalization (this may include case normalization, reduction of the words to basic forms, error correction, cleanup of punctuation, stopwords, etc.)
- Selection of the shingles and calculation of their hashes.
- Sampling the shingles from the text in question.
- Comparison of the hashes of the original shingles to the sampled hashes and evaluation.
A single shingle is a continuous sequence of words from the normalized text (another name for this object, in NLP, is an n-gram). The original text will give us (1- n) shingles, where n is the number of words. The hashes of the shingles are normal string hashes (like fnv-1).
The text that is analyzed for plagiarism is also split into shingles, but not all of them are used: just a random sample of m. The sampling theorem can give a good estimate of the number that can be trusted with a high degree of confidence. For efficient comparison, all the original hashes can be stored in a hash-set. If the number of overlapping shingles exceeds some threshold, the text can be considered plagiarised. Another take on the result of the algorithm may be to return the plagiarism degree, which will be the percentage of overlapping shingles. The complexity of the algorithm is O(n + m).
In a sense, the Shingle algorithm may be viewed as an instance of massive string search, where the outcome we're interested in is not so much the positions of the patterns in the text (although those may also be used to indicate the parts of the text that are plagiarism-suspicious) as the fact that they are present in it.
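The stages above can be sketched compactly; in this hedged Python illustration, normalization is reduced to lowercasing, FNV-1 is replaced by a stdlib hash, and, for brevity, all of the suspect text's shingles are compared instead of a random sample of m:

```python
import hashlib

def shingle_hashes(text, length=3):
    # normalization, shingle selection, and hashing in one pass
    words = text.lower().split()
    return {hashlib.md5(' '.join(words[i:i + length]).encode()).hexdigest()
            for i in range(len(words) - length + 1)}

def plagiarism_degree(original, suspect, length=3):
    """Share of the suspect's shingles that also occur in the original."""
    orig = shingle_hashes(original, length)   # stored in a hash-set
    susp = shingle_hashes(suspect, length)
    return len(orig & susp) / len(susp) if susp else 0.0
```

A production system would sample m shingles from the suspect text (e.g. with random.sample) and compare the result to a threshold, exactly as described above.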
Takeaways
Strings are peculiar objects: initially, it may seem that they are just arrays. But, beyond this simple understanding, due to the main usage patterns, a much more interesting picture can be seen. Advanced string representations and algorithms are examples of special-purpose optimization applied to general-purpose data structures. This is another reason why strings are presented at the end of the part on derived data structures: string algorithms make heavy use of the material we have covered previously, such as trees and graphs.
We have also discussed FSMs — a powerful tool that can be used to reliably implement complex workflows. FSMs may be used not only for string matching but also for implementing protocol handling (for example, in an HTTP server), complex user interactions, and so on. The Erlang programming language even has a standard library behavior gen_fsm (replaced by the newer gen_statem) that is a framework for easy implementation of FSMs — as many Erlang applications are mass service systems that have state-machine-like operation.
P.S. Originally, I expected this chapter to be one of the smallest in the book, but it turned out to be the longest one. Strings are not so simple as they might seem... ;)
[1] A proper suffix is a suffix that is at least one character shorter than the string itself. For example, in the string abc, the proper suffixes are bc and c.
[2] Perl is only the most conspicuous example of a large number of popular programs that use the same algorithm; the same applies to Python, or PHP, or Ruby, or many other languages.
Didier Verna — Declt 3.0 "Montgomery Scott" is released
@20191119 00:00 · 18 days ago
I'm pleased to announce the release of Declt 3.0 "Montgomery Scott", my Texinfo reference manual generator for Common Lisp libraries.
This is a new major release of the library, although the changes may not be so visible from the outside. The main concern in this release has been to increase the robustness of the output, from three different angles.
- Many places from which Declt attempts to extract meaningful information are underspecified (ASDF system slots notably).
- The pretty-printing of Lisp items can be difficult, given the liberalism and extensibility of the language.
- Finally, several restrictions in the syntax of Texinfo itself (anchor names notably) get in the way.
These issues were described in a paper presented at the TeX Users Group conference, this summer in Palo Alto. This release goes a long way toward fixing them.
The new generated reference manuals look mostly the same as before, but some important things have changed under the hood, notably the names of hyperlinks and cross-references (backward-incompatible, hence the increase in the major version number). The output is now also much more robust with respect to the final output format: the generation of HTML, DVI, PostScript, PDF, and even plain Info should work fine and has been tested on all Quicklisp libraries (in fact, Quickref will also be upgraded in the near future to provide all those formats at once).
Those improvements do come at a cost. Unicode is now required (even for Info readers). To be honest, many Lisp libraries already had that implicit requirement. Also, this release depends on improvements and bug fixes only available in Texinfo 6.7, now a requirement as well. I have to thank Gavin Smith, from the Texinfo team, for his collaboration.
Apart from that, this release also has a number of bug fixes.
- Some method specializers were not handled properly.
- The manuals were missing documentation for some non-Lisp source files.
- There were some glitches in the pretty-printing of unreadable objects.
Get it at the usual place, and enjoy!
Lispers.de — Berlin Lispers Meetup, Monday, 25th November 2019
@20191119 00:00 · 18 days ago
We meet again on Monday 8pm, 25th November. Our host this time will be Max-Gerd Retzlaff (www.retzlaffwenger.com).
Berlin Lispers is about all flavors of Lisp including Common Lisp, Clojure and Scheme.
Max-Gerd Retzlaff will talk a bit about his current programming for his PhD research[1] in Computer Graphics. Topics include preprocessing of noisy point clouds from a custom-built laser scanner, parsing binary data, coordinate transformations, drift detection using the bisection method, interpolation of scan lines (with Alpha Catmull-Rom splines), and his matrix library struct-matrices[2].
[1] http://cg.ivd.kit.edu/retzlaff
[2] http://matroid.org/flux/structmatrices.lisp / .asd
We meet in Max's office at Wichertstraße 2, 10439 Berlin-Prenzlauer Berg; it is the storefront office at street level, the bell reads "Max-Gerd Retzlaff". It is located right around the corner from S+U Schönhauser Allee, within 2 min walking distance. In case of questions call Christian +49 1578 70 51 61 4.
Wimpie Nortje — HTTP Server: Why Hunchentoot instead of Clack?
@20191118 00:00 · 19 days ago
When you do not use a framework for web application development, one of the first choices to make is to select an HTTP server.
Many web frameworks are built on Clack/Lack. It provides a unified interface to the major CL web server implementations. If you need the ability to easily switch between different web servers without having to change application code, for example you are worried that your chosen HTTP server becomes unsupported, then using such a unified interface is a sensible idea.
On the downside, to achieve such a unified interface, Clack has to limit itself to the lowest common denominator. All the available web servers have some particular strong feature which is usually the reason why one would pick a specific library. Clack removes this distinction between them and makes them all appear the same. It also forces Clack to then replace many of the features already provided by some libraries with its own implementation, like for example sending large files or handling static files.
Clack does have some features not provided by any web server which may be a good reason to consider its use. One such feature is custom middleware. Middleware are functions that automatically run before or after every request handler without the handler having to make the function calls. Things like cookie handling or CSRF prevention. While it may be possible to extend the web servers with middlewarelike functionality it is a core feature of Clack.
If Clackspecific features are not relevant and you are not worried that your chosen HTTP server will vanish into thin air then there is not much reason to use it, especially if you need a specific web server's distinguishing feature.
I have used Clack before and it did not properly support some Hunchentoot features that I needed. While I like the middleware concept a lot I decided to use the web server directly to avoid running into similar issues again.
I considered some of the more mature web servers and settled on Hunchentoot. My reasons are:
- The learning curve is smaller because I have used it before.
- It is probably the most used and most mature web server project, so I expect it to have the fewest bugs.
- It is slow compared to other web servers, but there are enough other places to optimise for speed to make up for it, if it ever becomes a problem.
- I don't need any of the defining features of the other web servers.
When selecting an HTTP server Clack should be considered just as seriously as any of the web servers because it does provide some useful features. However, it should not be taken for granted that Clack is the best choice in all situations. There are times and reasons why a specific web server may be a better choice.
Wimpie Nortje — Picking libraries for web development
@20191117 00:00 · 20 days ago
I started a project in Common Lisp again. Selective Share is a hosted secrets management service for server-side software. The project has a web app and a command line client, both of which are written in Common Lisp for the same reasons I previously stated in this post.
For the server software, I decided to do the web development without a framework. There are full-blown web frameworks available, like Lucerne, Ningle and Caveman, but they have far fewer users than the likes of Rails or Django, which means they are not nearly as mature and their development progresses much slower.
When working without a framework, one spends a lot of time selecting the libraries to use for each task. Common Lisp being what it is, there is usually a choice of libraries for every task. This is great when you know what you need, but it makes initial progress very slow when you are new to a particular field. For example, there are at least seven libraries for dealing with JSON, each with its own specialities and shortcomings.
This series of posts describes some of my design decisions and library selections. It should provide a springboard for frameworkless web development but may also be more generally helpful.
The topics covered are listed below; links will be added as the posts become available:
- Project framework
- Database
- DB access
- ORM / DAO
- DB migrations
- HTTP server
- Request routing and middleware
- JSON
- Templating
- Test framework
- Logging
- Session management
- Password hashing
- Storing configuration
- Building binaries
- CLI parameters
ABCL Dev — Turning the Bear past Java 11
@20191114 14:29 · 22 days ago
turn to a rumored permanent slumber, and a complete absence of the
possibility of new updates. The Bear reviews all submitted
patches, so if you don't like the current implementation,
please get out and push. Or pull? Or fast forward? The Bear is
seemingly always confused with Git...
As previously rumored, patches for getting the Armed Bear
Common Lisp implementation running on Java 11 have landed in
the trunk.
Unless anyone objects strongly, we will release ABCL 1.6.0 as soon
as possible with the intention to support the openjdk{6,7,8,11} runtimes.
After stabilizing the 1.6 branch with the inevitable necessary and
unknown updates, we can turn our attention to the ABCL 2.0 branch which
will introduce support for additional runtimes but drop support for emitting
openjdk6 or openjdk7 compatible bytecode.
Stay tuned, Bear fans...
Lispers.de — LispMeetup in Hamburg on Monday, 4th November 2019
@20191101 00:00 · 36 days ago
We meet at Ristorante Opera, Dammtorstraße 7, Hamburg, starting around 19:00 CEST on 4th November 2019.
This is an informal gathering of Lispers of all experience levels.
Topics might include Advent of Code, Lisp jobs, cryptography, event sourcing.
Eugene Zaikonnikov — Yocto recipe for CLISP
@20191027 15:00 · 40 days ago
Some time ago I put together a simple recipe for building CLISP with Yocto. It is pinned to 2.49.93 and should work both for target and native builds. Tested with Sumo only.
Vsevolod Dyomkin — Programmatically Drawing Simple Graphviz Graphs
@20191025 10:48 · 42 days ago
For my book, I had to draw a number of graphs. Obviously, I wanted to have a programmatic way to do that, and Graphviz is the go-to library for it. In Lisp, the interface to Graphviz is cl-dot, but, for me, it wasn't easy to figure out from the manual the "simple and easy" way to use it. I.e. I couldn't find a stupid beginner-level interface, so I had to code it myself. Here's the implementation that allows anyone with a REPL to send lists of edges to Graphviz and obtain graph images.
Generated images:
Vsevolod Dyomkin — Programming Algorithms: Graphs
@20191024 13:09 · 43 days ago
Graphs have already been mentioned several times in the book, in quite diverse contexts. Actually, if you are familiar with graphs, you can spot opportunities to use them in quite different areas, for problems that aren't explicitly formulated with graphs in mind. So, in this chapter, we'll discuss how to handle graphs, to develop such intuition to some degree.
But first, let's list the most prominent examples of the direct graph applications, some of which we'll see here in action:
- pathfinding
- network analysis
- dependency analysis in planning, compilers, etc.
- various optimization problems
- distributing and optimizing computations
- knowledge representation and reasoning with it
- meaning representation in natural language processing
Graphs may be thought of as a generalization of trees: indeed, trees are, as we said earlier, connected directed acyclic graphs. But there's an important distinction in the usage patterns of graphs and trees. Graphs, much more frequently than trees, have weights associated with their edges, which adds a whole new dimension both to the algorithms for processing them and to the possible data that can be represented in graph form. So, while the main application of trees is reflecting some hierarchy, for graphs, it is often more about determining connectedness and its magnitude, based on the weights.
Graph Representations
A graph is, basically, a set of nodes (called "vertices", V) and an enumeration of their pairs ("edges", E). The edges may be directed or undirected (i.e. bidirectional), and also weighted or unweighted. There are many ways that may be used to represent these sets, which have varied utility for different situations. Here are the most common ones:
- as a linked structure: (defstruct node data links) where links may be either a list of other nodes, possibly paired with weights, or a list of edge structures represented as (defstruct edge source destination weight). For directed graphs, this representation is similar to a singly-linked list, but for undirected — to a heavier doubly-linked one
- as an adjacency matrix (V x V). This matrix is indexed by vertices and has zeroes when there's no connection between them and some nonzero number for the weight (1 — in the case of unweighted graphs) when there is a connection. Undirected graphs have a symmetric adjacency matrix and so need to store only the above-diagonal half of it
- as an adjacency list that enumerates, for each vertex, the other vertices it's connected to and the weights of the connections
- as an incidence matrix (V x E). This matrix is similar to the previous representation, but with much more wasted space. The adjacency list may be thought of as a sparse representation of the incidence matrix. The matrix representation may be more useful for hypergraphs (that have more than 2 vertices for each edge), though
- just as a list of edges
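To make the trade-offs concrete, here is a small illustrative sketch (in Python rather than the book's Lisp; all names are ours, not from the book) that builds an adjacency list and an adjacency matrix from the same tiny edge list:

```python
# Illustrative only: hypothetical names, not from the book.
# The same directed graph, with edges 1->2, 1->3, 2->3,
# in three of the representations listed above.
edge_list = [(1, 2), (1, 3), (2, 3)]

# Adjacency list: map each vertex to its outgoing neighbors.
adjacency = {}
for src, dst in edge_list:
    adjacency.setdefault(src, []).append(dst)
    adjacency.setdefault(dst, [])

# Adjacency matrix (V x V): 1 marks an edge, 0 its absence.
vertices = sorted(adjacency)
index = {v: i for i, v in enumerate(vertices)}
matrix = [[0] * len(vertices) for _ in vertices]
for src, dst in edge_list:
    matrix[index[src]][index[dst]] = 1
```

For an undirected graph, the matrix would come out symmetric, so only the above-diagonal half would need storing.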
Topological Sort
Graphs may be divided into several kinds according to the different properties they have and the specific algorithms which work on them:
- disjoint (with several unconnected subgraphs), connected, and fully-connected (every vertex is linked to all the others)
- cyclic and acyclic, including directed acyclic (DAG)
- bipartite: when there are 2 groups of vertices and each vertex from one group is connected only to the vertices from the other
In practice, Directed Acyclic Graphs are quite important. These are directed graphs, in which there's no vertex that you can start a path from and return back to. They find applications in optimizing scheduling and computation, determining historical and other types of dependencies (for example, in dataflow programming and even spreadsheets), etc. In particular, every compiler would use one, and even make will when building its operational plan. The basic algorithm on DAGs is topological sort. It creates a partial ordering of the graph's vertices which ensures that every child vertex always precedes all of its ancestors.
Here is an example. This is a DAG:
And these are the variants of its topological ordering:
6 4 5 3 2 1 8 7
6 4 5 2 3 1 8 7
8 7 6 4 5 3 2 1
8 7 6 4 5 2 3 1
There are several variants because the graph is disjoint, and also because the order in which the vertices are traversed is not fully deterministic.
There are two common approaches to topological sort: Kahn's algorithm and the DFS-based one. Here is the DFS version:
- Choose an arbitrary vertex and perform the DFS from it until a vertex is found without children that weren't visited during the DFS.
- While performing the DFS, add each vertex to the set of visited ones. Also check that the vertex hasn't been visited already, or else the graph is not acyclic.
- Then, add the vertex we have found to the resulting sorted array.
- Return to the previous vertex and repeat searching for the next descendant that doesn't have children and add it.
- Finally, when all of the current vertex's children are visited, add it to the result array.
- Repeat this for the next unvisited vertex until no unvisited ones remain.
Why does the algorithm satisfy the desired constraints? First of all, it is obvious that it will visit all the vertices. Next, by the time we add a vertex, all of its descendants have already been added — satisfying the main requirement. Finally, there's a consistency check during the execution of the algorithm that ensures there are no cycles.
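Kahn's algorithm, the other approach mentioned above, isn't shown in this chapter; here is an illustrative sketch of it in Python (names ours, not the book's). It repeatedly removes vertices with zero in-degree, so, unlike the DFS version, it emits every vertex before its descendants; reversing its output yields a children-first ordering like the one used here:

```python
from collections import deque

def kahn_toposort(edges, vertices):
    """Topological sort by repeatedly removing zero-in-degree vertices."""
    indeg = {v: 0 for v in vertices}
    adj = {v: [] for v in vertices}
    for src, dst in edges:
        adj[src].append(dst)
        indeg[dst] += 1
    queue = deque(v for v in indeg if indeg[v] == 0)  # current "sources"
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for child in adj[v]:
            indeg[child] -= 1          # detach v from its children
            if indeg[child] == 0:
                queue.append(child)
    if len(order) != len(indeg):       # leftovers mean a cycle
        raise ValueError("the graph is not acyclic")
    return order
```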
Before proceeding to the implementation, as with other graph algorithms, it makes sense to ponder what representation will work the best for this problem. The default one — a linked structure — suits it quite well as we'll have to iterate all the outgoing edges of each node. If we had to traverse by incoming edges then it wouldn't have worked, but a matrix one would have.
(defstruct node
  id edges)

(defstruct edge
  src dst label)

(defstruct (graph (:conc-name nil) (:print-object pprint-graph))
  (nodes (make-hash-table)))  ; mapping of node ids to nodes

As usual, we'll need a more visual way to display the graph than the default print function. But that is pretty tricky considering that graphs may have an arbitrary structure with possibly intersecting edges. The simplest approach for small graphs would be to just draw the adjacency matrix. We'll utilize it for our examples (relying on the fact that we have control over the set of node ids):

(defun pprint-graph (graph stream)
  (let ((ids (sort (keys (nodes graph)) '<)))
    (format stream "~{ ~A~}~%" ids)  ; here, Tab is used for space
    (dolist (id1 ids)
      (let ((node (? graph 'nodes id1)))
        (format stream "~A" id1)
        (dolist (id2 ids)
          (format stream " ~:[~;x~]"  ; here, Tab as well
                  (find id2 (? node 'edges) :key 'edge-dst)))
        (terpri stream)))))
Also, let's create a function to simplify graph initialization:
(defun init-graph (edges)
  (with ((rez (make-graph))
         (nodes (nodes rez)))
    (loop :for (src dst) :in edges :do
      (let ((src-node (getset# src nodes (make-node :id src))))
        (getset# dst nodes (make-node :id dst))
        (push (make-edge :src src :dst dst)
              (? src-node 'edges))))
    rez))

CL-USER> (init-graph '((7 8)
                       (1 3)
                       (1 2)
                       (3 4)
                       (3 5)
                       (2 4)
                       (2 5)
                       (5 4)
                       (5 6)
                       (4 6)))
1 2 3 4 5 6 7 8
1 x x
2 x x
3 x x
4 x
5 x x
6
7 x
8
So, we have already seen in action 3 different ways of representing graphs: linked structures, matrices, and edge lists.
Now, we can implement and test topological sort:
(defun topo-sort (graph)
  (let ((nodes (nodes graph))
        (visited (make-hash-table))
        (rez (vec)))
    (dokv (id node nodes)
      (unless (? visited id)
        (visit node nodes visited rez)))
    rez))

(defun visit (node nodes visited rez)
  (dolist (edge (? node 'edges))
    (with ((id (? edge 'dst))
           (child (? nodes id)))
      (unless (find id rez)
        (assert (not (? visited id)) nil
                "The graph isn't acyclic for vertex: ~A" id)
        (:= (? visited id) t)
        (visit child nodes visited rez))))
  (vector-push-extend (? node 'id) rez)
  rez)

CL-USER> (topo-sort (init-graph '((7 8)
                                  (1 3)
                                  (1 2)
                                  (3 4)
                                  (3 5)
                                  (2 4)
                                  (2 5)
                                  (5 4)
                                  (5 6)
                                  (4 6))))
#(8 7 6 4 5 2 3 1)
This technique of tracking the visited nodes is used in almost every graph algorithm. As noted previously, it can either be implemented using an additional hashtable (like in the example) or by adding a boolean flag to the vertex/edge structure itself.
MST
Now, we can move to algorithms that work with weighted graphs. They represent the majority of the interesting graphbased solutions. One of the most basic of them is determining the Minimum Spanning Tree. Its purpose is to select only those graph edges that form a tree with the lowest total sum of weights. Spanning trees play an important role in network routing where there is a number of protocols that directly use them: STP (Spanning Tree Protocol), RSTP (Rapid STP), MSTP (Multiple STP), etc.
If we consider the graph from the previous picture, its MST will include the edges 1-2, 1-3, 3-4, 3-5, 5-6, and 7-8. Its total weight will be 24.
Although there are quite a few MST algorithms, the most wellknown are Prim's and Kruskal's. Both of them rely on some interesting solutions and are worth studying.
Prim's Algorithm
Prim's algorithm grows the tree one edge at a time, starting from an arbitrary vertex. At each step, the least-weight edge that has one of its vertices already in the MST and the other one outside is added to the tree. This algorithm always maintains an MST of the already processed subgraph, and when all the vertices are visited, the MST of the whole graph is completed. The most interesting property of Prim's algorithm is that its time complexity depends on the choice of the data structure for ordering the edges by weight. The straightforward approach that searches for the shortest edge has O(V^2) complexity, but if we use a priority queue it can be reduced to O(E log V) with a binary heap or even O(E + V log V) with a Fibonacci heap. Obviously, V log V is significantly smaller than E log V for the majority of graphs: up to E = V^2 for fully-connected graphs.
Here's an implementation of Prim's algorithm with an abstract heap:
(defun prim-mst (graph)
  (let ((initial-weights (list))
        (mst (list))
        (total 0)
        weights
        edges
        cur)
    (dokv (id node (nodes graph))
      (if cur
          (push (pair id (or (? edges id)
                             ;; a standard constant that is
                             ;; a good enough substitute for infinity
                             most-positive-fixnum))
                initial-weights)
          (:= cur id
              edges (? node 'edges))))
    (:= weights (heapify initial-weights))
    (loop
      (with (((id weight) (heap-pop weights)))
        (unless id (return))
        (when (? edges id)
          ;; if not, we have moved to a new connected component,
          ;; so there's no edge connecting it to the previous one
          (push (pair cur id) mst)
          (:+ total weight))
        (dokv (id w edges)
          (when (< w weight)
            (heap-decrease-key weights id w)))
        (:= cur id
            edges (? graph 'nodes id 'edges))))
    (values mst
            total)))
To make it work, we need to perform several modifications:
- first of all, the list of each node's edges should be changed to a hash-table to ensure O(1) access by child id
- the heap should store not only the keys but also values (a trivial change)
- we need to implement another fundamental heap operation, heap-decrease-key, which we haven't mentioned in the previous chapter
For the binary heap, it's, actually, just a matter of performing heap-up. But the tricky part is that it requires an initial search for the key. To ensure constant-time search and, subsequently, O(log n) total complexity, we need to store the pointers to the heap elements in a separate hash-table.
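Python's standard heapq, for instance, offers no decrease-key at all; a common workaround — a sketch of an alternative, not the book's approach — is "lazy deletion": push a duplicate entry with the smaller weight and skip stale entries when they surface at the top. Here is a Prim's variant along these lines (all names ours):

```python
import heapq

def prim_mst_weight(adj, start):
    """adj: {vertex: [(neighbor, weight), ...]} for an undirected graph.
    Returns the total weight of the MST of start's connected component."""
    total = 0
    visited = set()
    heap = [(0, start)]               # (edge weight, vertex)
    while heap:
        w, v = heapq.heappop(heap)
        if v in visited:              # stale duplicate: a lighter edge won
            continue
        visited.add(v)
        total += w
        for u, wu in adj[v]:
            if u not in visited:      # push a candidate frontier edge
                heapq.heappush(heap, (wu, u))
    return total
```

The price of this trick is up to E entries in the heap instead of V, which still keeps the O(E log V) bound.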
Let's confirm the stated complexity of this implementation. First, the outer loop operates once per vertex, so it has V iterations. Each iteration has an inner loop that involves a heap-pop (O(log V)) and a heap-decrease-key (also O(log V)) for a number of vertices, plus a small number of constant-time operations. heap-pop will be invoked exactly once per vertex, so it will need O(V log V) total operations, and heap-decrease-key will be called at most once for each edge (O(E log V)). Considering that E is usually greater than V, this is how we arrive at the final complexity estimate of O(E log V).
The Fibonacci heap improves on the binary heap, in this context, as its decrease-key operation is O(1) instead of O(log V), so we are left with just O(V log V) for the heap-pops and O(E) for the heap-decrease-keys. Unlike the binary heap, the Fibonacci heap is not a single tree but a set of trees. And this is used in decrease-key: instead of moving an item up the heap and rearranging the heap in the process, a new tree rooted at this element is cut from the current one. This is not always possible in constant time as there are some invariants that might be violated, which will, in turn, trigger some updates to the two newly created trees. Yet, the amortized cost of the operation is still O(1).
Here's a brief description of the principle behind the Fibonacci heap adapted from Wikipedia:
A Fibonacci heap is a collection of trees. The trees do not have a prescribed shape and, in the extreme case, every element may be its own separate tree. This flexibility allows some operations to be executed in a lazy manner, postponing the work for later operations. For example, merging heaps is done simply by concatenating the two lists of trees, and the operation decrease-key sometimes cuts a node from its parent and forms a new tree. However, at some point, order needs to be introduced to the heap to achieve the desired running time. In particular, every node can have at most O(log n) children, and the size of a subtree rooted in a node with k children is at least F(k+2), where F(k) is the k-th Fibonacci number. This is achieved by the rule that we can cut at most one child of each non-root node. When a second child is cut, the node itself needs to be cut from its parent and becomes the root of a new tree. The number of trees is decreased in the operation delete-minimum, where trees are linked together. Here's an example Fibonacci heap that consists of 3 trees:
6   2   1  <- minimum
    |  /|\
    5 3 4 7
    |
    8
    |
    9
Kruskal's Algorithm
Kruskal's algorithm operates from the point of view of edges rather than vertices. At each step, it adds to the tree the current smallest edge unless it would produce a cycle. Obviously, the biggest challenge here is to detect the cycle efficiently. Yet, the good news is that, like with Prim's algorithm, we already have access to an efficient solution for this problem — Union-Find. Isn't it great that we have already built a library of techniques that may be reused in creating more advanced algorithms? Actually, this is the goal of developing as an algorithms programmer — to be able to see a way to reduce the problem, at least partially, to some already known and proven solution.
Like Prim's algorithm, Kruskal's approach also has O(E log V) complexity: for each edge, it needs to check whether it forms a cycle with the already built partial MST. With Union-Find, this check requires O(log E), but, as E is at most V^2, log E is at most log V^2, which is equal to 2 log V. Unlike Prim's algorithm, the partial MST built by Kruskal's algorithm isn't necessarily a tree for the already processed part of the graph.
The implementation of the algorithm, using the existing code for Union-Find, is trivial and left as an exercise for the reader.
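For readers who want something to check their solution against, here is a hedged sketch of the idea in Python, with a minimal Union-Find (path halving, no ranks) standing in for the book's Lisp version; all names are ours:

```python
def kruskal_mst(edges, vertices):
    """edges: [(weight, u, v), ...] for an undirected graph.
    Returns (mst_edges, total_weight)."""
    parent = {v: v for v in vertices}

    def find(v):                      # union-find root with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    mst, total = [], 0
    for w, u, v in sorted(edges):     # lightest edges first
        ru, rv = find(u), find(v)
        if ru != rv:                  # different components: no cycle formed
            parent[ru] = rv           # union the two components
            mst.append((u, v))
            total += w
    return mst, total
```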
Pathfinding
Pathfinding in weighted graphs is a huge topic that is crucial in many domains: maps, games, networks, etc. The goal, usually, is to find the shortest path between two nodes in a directed weighted graph. Yet, there may be variations, like finding the shortest paths from a selected node to all other nodes, finding the shortest path in a maze (which may be represented as a grid graph with all edges of weight 1), etc.
There are, once again, two classic pathfinding algorithms, each one with a certain feature that makes it interesting and notable. Dijkstra's algorithm is a classic example of a greedy algorithm, as its alternative name — shortest path first (SPF) — suggests. A* builds upon it by adding the notion of a heuristic. Dijkstra's approach is the basis of many computer network routing protocols, such as IS-IS and OSPF, while A* and its modifications are often used in games, as well as in pathfinding on maps.
Dijkstra's Algorithm
The idea of Dijkstra's pathfinding is to perform a limited BFS on the graph, only looking at the edges that don't lead us "away" from the target. Dijkstra's approach is very similar to Prim's MST algorithm: it also uses a heap (binary or Fibonacci) to store the shortest paths from the origin to each node with their weights (lengths). At each step, it selects the minimum from the heap, expands it to the neighbor nodes, and updates the weights of the neighbors if they become smaller (the weights start from infinity).
For our SPF implementation, we'll need to use the same trick that was shown in the Union-Find implementation — extend the node structure to hold its weight and the path leading to it:
(defstruct (spf-node (:include node))
  (weight most-positive-fixnum)
  (path (list)))

Here is the main algorithm:

(defun spf (graph src dst)
  (with ((nodes (? graph 'nodes))
         (spf (list))
         ;; the following function should initialize the heap
         ;; with the source node of weight 0
         ;; and all the other nodes of weight MOST-POSITIVE-FIXNUM
         ;; (instead of running an O(n * log n) HEAPIFY)
         (weights (init-weights-heap nodes src)))
    (loop
      (with (((id weight) (heap-pop weights)))
        (cond ((eql id dst) (let ((dst (? nodes dst)))
                              ;; we return 2 values: the path and its length
                              (return (values (cons dst (? dst 'path))
                                              (? dst 'weight)))))
              ((= most-positive-fixnum weight) (return)))  ; no path exists
        (dolist (edge (? nodes id 'edges))
          (with ((cur (? edge 'dst))
                 (node (? nodes cur))
                 (w (+ weight (? edge 'weight))))
            (when (< w (? node 'weight))
              (heap-decrease-key weights cur w)
              (:= (? node 'weight) w
                  (? node 'path) (cons (? nodes id)
                                       (? nodes id 'path))))))))))
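As a cross-check, the same algorithm can be sketched in a few lines of Python on top of heapq, again substituting lazy deletion for heap-decrease-key (an illustrative sketch with our own names, not the book's code):

```python
import heapq

def shortest_path(adj, src, dst):
    """adj: {vertex: [(neighbor, weight), ...]}.
    Returns (path, length), or (None, None) when dst is unreachable."""
    heap = [(0, src, [src])]          # (distance so far, vertex, path)
    visited = set()
    while heap:
        dist, v, path = heapq.heappop(heap)
        if v == dst:
            return path, dist
        if v in visited:              # stale duplicate entry
            continue
        visited.add(v)
        for u, w in adj.get(v, []):
            if u not in visited:      # relax the edge v -> u
                heapq.heappush(heap, (dist + w, u, path + [u]))
    return None, None
```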
A* Algorithm
There are many ways to improve the vanilla SPF. One of them is to move in parallel from both sides: the source and the destination.
The A* algorithm (also called Best-First Search) improves upon Dijkstra's method by changing how the weight of the path is estimated. Initially, it was just the distance we've already traveled in the search, which is known exactly. But we don't know for sure the length of the remaining part. However, in Euclidean and similar spaces, where the triangle inequality holds (the direct distance between 2 points is not greater than the distance between them through any other point), it's not an unreasonable assumption that the direct path will be shorter than the circuitous ones. This premise does not always hold as there may be obstacles, but quite often it does. So, we add a second term to the weight, which is the direct distance between the current node and the destination. This simple idea underpins the A* search and allows it to perform much faster in many real-world scenarios, although its theoretical complexity is the same as for simple SPF. The exact guesstimate of the remaining distance is called the algorithm's heuristic and should be specified for each domain separately: for maps, it is the linear distance, but there are clever ways to invent similar estimates where distances can't be calculated directly.
Overall, this algorithm is one of the simplest examples of the heuristic approach. The idea of heuristics, basically, lies in finding patterns that may significantly improve the performance of the algorithm for the common cases, although their efficiency can't be proven for the general case. Isn't it the same approach as, for example, hash-tables or splay trees, which also don't guarantee the same optimal performance for each operation? The difference is that, although those techniques have possible local cases of suboptimality, they provide global probabilistic guarantees. For heuristic algorithms, usually, even such estimations are not available, although they may be performed for some of them. For instance, the performance of the A* algorithm will suffer if there is an "obstacle" on the direct path to the destination, and it's not possible to predict, for the general case, what the configuration of the graph will be and where the obstacles will be. Yet, even in the worst case, A* will still have at least the same speed as the basic SPF.
The changes to the SPF algorithm needed for A* are the following:
- init-weights-heap will use the value of the heuristic instead of most-positive-fixnum as the initial weight. This will also require us to change the loop termination criterion from (= most-positive-fixnum weight) by adding some notion of visited nodes
- there will be an additional term added to the weight-of-the-node formula: (+ weight (? edge 'weight) (heuristic node))
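These changes can be sketched in Python for the grid example that follows; Manhattan distance serves as the heuristic on a grid without diagonal moves (an illustrative sketch with our own names, not the book's code):

```python
import heapq

def astar_grid(start, goal, width, height):
    """A* on a rectangular grid without diagonal moves; every step costs 1.
    Returns (path, cost) or (None, None) if the goal is unreachable."""
    def heuristic(p):                 # Manhattan distance to the goal
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    heap = [(heuristic(start), 0, start, [start])]
    visited = set()
    while heap:
        _, cost, node, path = heapq.heappop(heap)
        if node == goal:
            return path, cost
        if node in visited:
            continue
        visited.add(node)
        x, y = node
        for nxt in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in visited):
                # the key is cost so far + heuristic estimate of the rest
                heapq.heappush(heap, (cost + 1 + heuristic(nxt), cost + 1,
                                      nxt, path + [nxt]))
    return None, None
```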
A good comparison of the benefits A* brings over simple SPF may be shown with this picture of pathfinding on a rectangular grid without diagonal connections, where each node is labeled with its 2d coordinates. To find the path from node (0 0) to (2 2) (length 4) using Dijkstra's algorithm, we'll need to visit all of the points in the grid:
0 1 2
0 + .
1 .
2
0 1 2
0 + . .
1 . .
2 .
0 1 2
0 + . .
1 . . .
2 . .
0 1 2
0 + > v
1 . . v
2 . . +
With A*, however, we'll move straight to the point:
0 1 2
0 + .
1 .
2
0 1 2
0 + .
1 . .
2
0 1 2
0 + .
1 . . .
2 .
0 1 2
0 + v
1 . > v
2 . +
The final path, in these pictures, is selected by the rule to always open the left neighbor first.
Maximum Flow
Weighted directed graphs are often used to represent different kinds of networks. And one of the main tasks on such networks is efficient capacity planning. The main algorithm for that is Maximum Flow calculation. It works on the so-called transport networks, which contain three kinds of vertices: a source, a sink, and intermediate nodes. The source has only outgoing edges, the sink has only incoming ones, and all the other nodes obey the balance condition: the total weights (flow) of all the incoming and outgoing edges are equal. The task of determining maximum flow is to estimate the largest amount that can flow through the whole network from the source to the sink. Besides establishing the actual capacity of the network, this also allows finding the bottlenecks and the edges that are not fully utilized. From this point of view, the problem is called Minimum Cut estimation.
There are many approaches to solving this problem. The most direct and intuitive of them is the Ford-Fulkerson method. It is, once again, a greedy algorithm that computes the maximum flow by trying all the paths from source to sink while there is some residual capacity available. These paths are called "augmented paths" as they augment the network flow. And, to track the residual capacity, a copy of the initial weight graph called the "residual graph" is maintained. With each new path added to the total flow, its flow is subtracted from the weights of all of its edges in the residual graph. Besides — and this is the key point in the algorithm that allows it to be optimal despite its greediness — the same amount is added to the backward edges in the residual graph. The backward edges don't exist in the original graph, and they are added to the residual graph in order to let the subsequent iterations reduce the flow along some edge, but not below zero. Why may this be necessary? Each graph node has a maximum input and output capacity. It is possible to saturate the output capacity by different input edges, and the optimal edge to use depends on the whole graph, so, in a single greedy step, it's not possible to determine over which edges more incoming flow should be directed. The backward edges virtually increase the output capacity by the value of the seized input capacity, thus allowing the algorithm to redistribute the flow later on, if necessary.
We'll implement the FFA using the matrix graph representation: first of all, to show it in action, and also because it's easy to deal with backward edges in a matrix as they are already present, just with zero initial capacity. However, as this matrix will be sparse in the majority of cases, to achieve optimal efficiency, just like with most other graph algorithms, we'll need to use a better way to store the edges, for instance, an edge list. With it, we could implement the addition of backward edges directly but lazily, during the processing of each augmented path.
(defstruct mf-edge
  beg end capacity)

(defun max-flow (g)
  (let ((rg (copy-array g))  ; residual graph
        (rez 0))
    (loop :for path := (aug-path rg) :while path :do
      (let ((flow most-positive-fixnum))
        ;; the flow along the path is the residual capacity
        ;; of the least wide edge
        (dolist (edge path)
          (let ((cap (? edge 'capacity)))
            (when (< (abs cap) flow)
              (:= flow (abs cap)))))
        (dolist (edge path)
          (with (((beg end) ? edge))
            (:- (aref rg beg end) flow)
            (:+ (aref rg end beg) flow)))
        (:+ rez flow)))
    rez))

(defun aug-path (g)
  (with ((sink (1- (array-dimension g 0)))
         (visited (make-array (1+ sink) :initial-element nil)))
    (labels ((dfs (g i)
               (if (zerop (aref g i sink))
                   (dotimes (j sink)
                     (unless (or (zerop (aref g i j))
                                 (? visited j))
                       (when-it (dfs g j)
                         (:= (? visited j) t)
                         (return (cons (make-mf-edge :beg i :end j
                                                     :capacity (aref g i j))
                                       it)))))
                   (list (make-mf-edge :beg i :end sink
                                       :capacity (aref g i sink))))))
      (dfs g 0))))

CL-USER> (max-flow #2A((0 4 4 0 0 0)
                       (0 0 0 4 2 0)
                       (0 0 0 1 2 0)
                       (0 0 0 0 0 3)
                       (0 0 0 0 0 5)
                       (0 0 0 0 0 0)))
7
So, as you can see from the code, to find an augmented path, we perform a DFS on the graph from the source, sequentially examining the edges with some remaining residual capacity, until we find a path to the sink.
A peculiarity of this algorithm is that there is no certainty that we'll eventually reach the state when there are no augmented paths left. The FFA works correctly for integer and rational weights, but when they are irrational, it is not guaranteed to terminate. When the capacities are integers, the runtime of Ford-Fulkerson is bounded by O(E f), where f is the maximum flow in the graph. This is because each augmented path can be found in O(E) time, and it increases the flow by an integer amount of at least 1. A variation of the Ford-Fulkerson algorithm with guaranteed termination and a runtime independent of the maximum flow value is the Edmonds-Karp algorithm, which runs in O(V E^2).
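The Edmonds-Karp variation differs only in how the augmented path is chosen: BFS instead of DFS, so the path with the fewest edges is always taken first. Here is a hedged Python sketch on the same matrix representation (names ours, not the book's code):

```python
from collections import deque

def edmonds_karp(capacity):
    """capacity: square matrix; node 0 is the source, the last node the sink.
    BFS always finds the shortest augmenting path, bounding the runtime."""
    n = len(capacity)
    residual = [row[:] for row in capacity]  # residual graph
    source, sink = 0, n - 1
    flow = 0
    while True:
        parent = [None] * n                  # BFS for an augmenting path
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] is None:
            v = queue.popleft()
            for u in range(n):
                if parent[u] is None and residual[v][u] > 0:
                    parent[u] = v
                    queue.append(u)
        if parent[sink] is None:
            return flow                      # no augmenting path remains
        # the flow along the path is the capacity of its narrowest edge
        bottleneck, v = float("inf"), sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # update the forward and backward residual edges
        v = sink
        while v != source:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck
```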
Graphs in Action: PageRank
Another important set of problems from the field of network analysis is determining "centers of influence", densely and sparsely populated parts, and "cliques". PageRank is the well-known algorithm for ranking the nodes in terms of influence (i.e. the number and weight of incoming connections they have), which was the secret sauce behind Google's initial success as a search engine. It will be the last of the graph algorithms we'll discuss in this chapter, so many more will remain untouched. We'll be returning to some of them in the following chapters, and you'll be seeing them in many problems once you develop an eye for spotting the graphs hidden in many domains.
The PageRank algorithm outputs a probability distribution of the likelihood that a person randomly clicking on links will arrive at any particular page. This distribution ranks the relative importance of all the pages. The probability is expressed as a numeric value between 0 and 1, but Google used to multiply it by 10 and round up, so a PR of 10 corresponded to a probability of 0.9 and above, and PR=1 — to the interval from 0 to 0.1. In the context of PageRank, all web pages are the nodes in the so-called webgraph, and the links between them are the edges, originally weighted equally.
PageRank is an iterative algorithm that may be considered an instance of the Expectation-Maximization (EM) approach that is very popular in unsupervised optimization and machine learning. The general idea of EM is to randomly initialize the quantities that we want to estimate, and then iteratively recalculate each quantity, using the information from the neighbors, to "move" it closer to the value that ensures optimality of the target function. Epochs (an iteration that spans the whole dataset, using each node at most once) of such recalculation should continue either until a whole epoch doesn't produce a significant change in the loss function we're optimizing, i.e. we have reached the stationary point, or until a satisfactory number of iterations has been performed. Sometimes a stationary point either can't be reached or will take too long to reach, but, according to Pareto's principle, 20% of the effort might have moved us 80% of the way to the goal.
In each epoch, we recalculate the PageRank of all the nodes by transferring weights from a node equally to all of its neighbors. The neighbors with more inbound connections will thus receive more weight. However, the PageRank concept adds a condition that an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability that the transfer will continue is called a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor for the webgraph should be set around 0.85. The damping factor is subtracted from 1 (and, in some variations of the algorithm, the result is divided by the number of documents (N) in the collection), and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. So the PageRank of a page is mostly derived from the PageRanks of the other pages. The damping factor adjusts the derived value downward.
Implementation
Actually, PageRank can be computed both iteratively and algebraically. In algebraic form, each PageRank iteration may be expressed simply as:
(:= pr (+ (* d (mat* g pr))
          (/ (- 1 d) n)))
where g is the graph's link matrix (the normalized adjacency matrix of the webgraph) and pr is the vector of the PageRank values for each node.
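The algebraic iteration can be sketched in plain Python (a hedged illustration with our own names; we assume g is column-stochastic, i.e. g[j][i] is the probability of following a link from page i to page j, and we use the divided-by-N variant of the damping term, matching the expression above):

```python
def pagerank_algebraic(g, d=0.85, repeat=100):
    """g: link matrix as a list of rows, where g[j][i] is the probability
    of following a link from page i to page j (each column sums to 1)."""
    n = len(g)
    pr = [1 / n] * n
    for _ in range(repeat):
        # pr := d * (g x pr) + (1 - d) / n, element-wise
        pr = [d * sum(g[j][i] * pr[i] for i in range(n)) + (1 - d) / n
              for j in range(n)]
    return pr
```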
However, the defining property of PageRank is that it is estimated for huge graphs, i.e. directly representing them as matrices isn't possible, nor is performing the matrix operations on them. The iterative algorithm allows more control, as well as distribution of the computation, so, in practice, it is usually preferred, not only for PageRank but also for most other optimization techniques. So PageRank should be viewed primarily as a distributed algorithm. The need to implement it on a large cluster triggered the development by Google of the influential MapReduce distributed computation framework.
Here is a simplified PageRank implementation of the iterative method:
(defun pagerank (g &key (d 0.85) (repeat 100))
  (with ((n (tally (nodes g)))
         (pr (make-array n :initial-element (/ 1 n))))
    (loop :repeat repeat :do
      (let ((pr2 (map 'vector (lambda (x)
                                (declare (ignore x))
                                (/ (- 1 d) n))
                      pr)))
        (dokv (i node (nodes g))
          (let ((p (? pr i))
                (m (tally (? node 'children))))
            (dokv (j child (? node 'children))
              (:+ (? pr2 j) (* d (/ p m))))))
        (:= pr pr2)))
    pr))
We use the same graph representation as previously and perform the update "backwards": not by gathering all the incoming edges, which would require us to add another layer of data that is both unnecessary and hard to maintain, but by transferring the PR value over the outgoing edges one by one. Such an approach also makes the computation trivial to distribute, as we can split the whole graph into arbitrary sets of nodes and the computation for each set can be performed in parallel: we'll just need to maintain a local copy of the pr2 vector and merge it at the end of each iteration by simple summation. This method naturally fits the map-reduce framework: the map step is the inner node loop, while the reduce step is the merging of the pr2 vectors.
Takeaways
- The more we progress into the advanced topics of this book, the more apparent becomes the tendency to reuse the approaches, tools, and technologies we have developed previously. Graph algorithms are good demonstrations of the new features and qualities that can be obtained by a smart combination and reuse of existing data structures.
- Many graph algorithms are greedy. This means that they use the locally optimal solution while trying to arrive at a global one. This is conditioned by the structure — or rather the lack of structure — of graphs, which don't have a specific hierarchy to guide the search for the optimal solution. The greediness, however, shouldn't mean suboptimality. In many greedy algorithms, like FFA, there is a way to play back a wrong solution. Others provide a way to trade off execution speed and optimality. A good example of the latter approach is Beam search, which we'll discuss in the next chapters.
- In A*, we had a first glimpse of heuristic algorithms — an area that may be quite appealing to many programmers who are used to solving problems by primarily optimizing for their main scenarios. This approach may lack some mathematical rigor, but it also has its place, and we'll see other heuristic algorithms in the following chapters that are, like A*, the best practical solutions in their domains: for instance, the Monte Carlo Tree Search (MCTS).
- Another thing that becomes more apparent as this book progresses is how small a percentage of each domain we can cover in detail in each chapter. This is true for graphs: we have just scratched the surface and outlined the main approaches to handling them. We'll see more graph-related stuff in the following chapters, as well. Graph algorithms may be quite useful in a great variety of areas that don't necessarily have a direct formulation as graph problems (like maps or networks do), and so developing an intuition to recognize the hidden graph structure may help the programmer reuse the existing elegant techniques instead of having to deal with their own cumbersome ad-hoc solutions.
Lispers.de — Berlin Lispers Meetup, Monday, 28th October 2019
@2019-10-24 09:30 · 43 days ago
We meet again on Monday 8pm, 28th October. Our host this time is James Anderson (www.dydra.com).
Berlin Lispers is about all flavors of Lisp including Emacs Lisp, Scheme, and Clojure.
This time we talk about the Advent of Code and look at examples in Common Lisp.
[1] https://twitter.com/ericwastl/status/1178596028662112256
[2] https://adventofcode.com/
[3] https://old.reddit.com/r/adventofcode/
We meet in the Taut-Haus at Engeldamm 70 in Berlin-Mitte; the bell is "James Anderson". It is located within 10 minutes' walking distance of U Moritzplatz, U Kottbusser Tor, or Ostbahnhof. In case of questions, call Christian at +49 1578 70 51 61 4.
Lispers.de — Berlin Lispers Talk Announcement - Thursday, 24th October 2019
@2019-10-21 15:10 · 46 days ago
Georg Kinnemann will give a talk about LISP and AI with the title "Die Entwicklung eines Spezial-Computers für Künstliche Intelligenz in der DDR" ("The Development of a Special-Purpose Computer for Artificial Intelligence in the GDR") at the Deutsches Technikmuseum Berlin.
5.30pm, 24th October
Deutsches Technikmuseum, Trebbiner Straße 9, lecture hall (Vortragssaal), 4th floor
[1] https://sdtb.de/technikmuseum/veranstaltungen/einspezialcomputerfuerkuenstlicheintelligenzinderddr/
[2] https://lispers.de/docs/kispezialcomputerdtb20191024.txt
Quicklisp news — October 2019 Quicklisp dist update now available
@2019-10-14 13:14 · 53 days ago
- 3bz — deflate decompressor — MIT
- bp — Bitcoin Protocol components in Common Lisp — MIT
- cardiogram — Simple test framework — MIT
- cesdi — Provides a compute-effective-slot-definition-initargs generic function that allows for more ergonomic initialization of effective slot definition objects. — Unlicense
- chameleon — Configuration management facilities for Common Lisp with multiple profile support. — MIT
- ci-utils — A set of tools for using CI platforms — MIT
- cl-clsparse — Common Lisp bindings for clSPARSE — Apache License, Version 2.0
- cl-ecma-48 — This package exports a macro for defining ECMA-48 control functions and the 162 functions defined by this. — AGPLv3
- cl-flat-tree — A flat-tree implementation in Common Lisp. — MIT
- cl-kraken — A Common Lisp API client for the Kraken exchange — MIT
- cl-naive-store — This is a naive, persisted, in-memory (lazy loading) data store for Common Lisp. — MIT
- cl-shlex — Lexical analyzer for simple shell-like syntax. — MIT
- cl-smt-lib — SMT object supporting SMT-LIB communication over input and output streams — BSD-3-Clause
- cl-wadler-pprint — An implementation of A Prettier Printer in Common Lisp. — Apache-2.0/MIT
- classowary — An implementation of the Cassowary linear constraint solver toolkit — zlib
- clsql-local-time — Allows the use of local-time:timestamp objects in CLSQL models and queries — MIT license
- datamuse — Common Lisp library for accessing the Datamuse word-finding API — MIT
- date-calc — Package for simple date calculation — GPL or Artistic
- font-discovery — Find system font files matching a font spec. — zlib
- horse-html — Parenscript/HTML done better — MIT
- hunchentoot-multi-acceptor — Multiple domain support under single hunchentoot acceptor — Apache License, Version 2.0
- lila — a cleaner language based on Common Lisp — MIT
- linear-programming — A library for solving linear programming problems — MIT
- lsx — Embeddable HTML templating engine with JSX-like syntax — BSD 2-Clause
- markup — markup provides a reader-macro to read HTML/XML tags inside of Common Lisp code — Apache License, Version 2.0
- num-utils — Numerical utilities for Common Lisp — Boost Software License - Version 1.0
- orizuru-orm — An ORM for Common Lisp and PostgreSQL. — GPLv3
- paren6 — Paren6 is a set of ES6 macros for Parenscript — Apache License, version 2.0
- pngload-fast — A reader for the PNG image format. — MIT
- polisher — Infix notation to S-expression translator — MIT
- select — DSL for array slices. — Boost
- simple-parallel-tasks — Evaluate forms in parallel — GPL3
- stripe — A client for the Stripe payment API. — MIT
- trivial-extensible-sequences — Portability library for the extensible sequences protocol. — zlib
- trivial-package-local-nicknames — Portability library for package-local nicknames — Public domain
- uax-14 — Implementation of the Unicode Standards Annex #14's line breaking algorithm — zlib
- uax-9 — Implementation of the Unicode Standards Annex #9's bidirectional text algorithm — zlib
- with-output-to-stream — Provides a simple way of directing output to a stream according to the concise and intuitive semantics of FORMAT's stream argument. — Public Domain
- ziz — An ad hoc Quicklisp distribution. — MIT
To get this update, use (ql:update-dist "quicklisp").
If you get a "badly sized local archive" error during the update, you can also safely use the DELETE-AND-RETRY restart to proceed. This error was introduced by a metadata problem on my end. Sorry about that!
Didier Verna — TFM 1.0 "Artificial Uncial" is released
@2019-10-09 00:00 · 59 days ago
I'm happy to announce the first stable version of TFM, a TeX Font Metrics parser for Common Lisp. TFM is the standard font description format used by TeX. The TFM library parses and decodes TFM files into an abstract data structure, providing easy access to the corresponding font information. Get it here.
Lispers.de — Lisp-Meetup in Hamburg on Monday, 7th October 2019
@2019-10-07 00:00 · 61 days ago
We meet at Ristorante Opera, Dammtorstraße 7, Hamburg, starting around 19:00 CEST on 7th October 2019.
This is an informal gathering of Lispers of all experience levels.
Paul Khuong — A Couple of (Probabilistic) Worst-case Bounds for Robin Hood Linear Probing
@2019-09-29 19:44 · 68 days ago
I like to think of Robin Hood hash tables with linear probing as arrays sorted on uniformly distributed keys, with gaps. That makes it clearer that we can use these tables to implement algorithms based on merging sorted streams in bulk, as well as ones that rely on fast point lookups. A key question with randomised data structures is how badly we can expect them to perform.
A bound on the length of contiguous runs of entries
There’s a lot of work on the expected time complexity of operations on linear probing Robin Hood hash tables. Alfredo Viola, along with a few collaborators, has long been exploring the distribution of displacements (i.e., search times) for random elements. The packed memory array angle has also been around for a while.^{1}
I’m a bit wary of the “random element” aspect of the linear probing bounds: while I’m comfortable with an expectation over the hash function (i.e., over the uniformly distributed hash values), a program could repeatedly ask for the same key, and consistently experience worse-than-expected performance. I’m more interested in bounding the worst-case displacement (the distance between the ideal location for an element, and where it is actually located) across all values in a randomly generated^{2} Robin Hood table, with high enough probability. That probability doesn’t have to be extremely high: \(p = 0.9\) or even \(p = 0.5\) is good enough, as long as we can either rehash with an independent hash function, or the probability of failure drops exponentially enough as the displacement leeway grows.
The people who study hashing with buckets, or hashing as load balancing, seem more interested in these probable worstcase bounds: as soon as one bucket overflows, it’s game over for that hash table! In that context, we wish to determine how much headroom we must reserve in each bucket, on top of the expected occupancy, in order to make sure failures are rare enough. That’s a balls into bins problem, where the \(m\) balls are entries in the hash table, and the \(n\) bins its hash buckets.
Raab and Steger’s Simple and Tight Analysis of the “balls into bins” problem shows that the case where the average occupancy grows with \((\log m)\sp{k}\) and \(k > 1\) has potential, when it comes to worstcase bounds that shrink quickly enough: we only need headroom that grows with \(\sqrt{\log\sp{k+1} m} = \log\sp{(k+1)/2} m\), slightly more than the square root of the average occupancy.
The only issue is that the balls-into-bins analysis is asymptotic, and, more importantly, doesn’t apply at all to linear probing!
One could propose a form of packed memory array, where the sorted set is subdivided in chunks such that the expected load per chunk is in \(\Theta(\log\sp{k} m)\), and the size of each chunk multiplicatively larger (more than \(\log\sp{(k+1)/2}m\))…
Can we instead derive similar bounds with regular linear probing? It turns out that Raab and Steger’s bounds are indeed simple: they find the probability of overflow for one bucket, and derive a union bound for the probability of overflow for any bucket by multiplying the single-bucket failure probability by the number of buckets. Moreover, the single-bucket case itself is a Binomial confidence interval.
We can use the same approach for linear probing; I don’t expect a tight result, but it might be useful.
Let’s say we want to determine how unlikely it is to observe a clump of \(\log\sp{k}n\) entries, where \(n\) is the capacity of the hash table. We can bound the probability of observing such a clump starting at index 0 in the backing array, and multiply by the size of the array for our union bound (the clump could start anywhere in the array).
Given density \(d = m / n\), where \(m\) is the number of entries and \(n\) is the size of the array that backs the hash table, the probability that any given element falls in a range of size \(\log\sp{k}n\) is \(p = \log\sp{k}n / n\). The number of entries in such a range follows a Binomial distribution \(B(dn, p)\), with expected value \(d \log\sp{k}n\). We want to determine the maximum density \(d\) such that \(\mathrm{Pr}[B(dn, p) > \log\sp{k}n] < \frac{\alpha}{dn}\), where \(\alpha\) is our overall failure rate. If rehashing is acceptable, we can let \(\alpha = 0.5\), and expect to find a suitably uniform hash function after half a rehash on average.
We know we want \(k > 1\) for the tail to shrink rapidly enough as \(n\) grows, but even \(\log\sp{2}n\) doesn’t shrink very rapidly. After some trial and error, I settled on a chunk size \(s(n) = 5 \log\sb{2}\sp{3/2} n\). That’s not great for small or medium sized tables (e.g., \(s(1024) = 158.1\)), but grows slowly, and reflects extreme worst cases; in practice, we can expect the worst case for any table to be more reasonable.
Assuming we have a quantile function for the Binomial distribution, we can find the occupancy of our chunk at \(q = 1 - \frac{\alpha}{n}\). The occupancy is a monotonic function of the density, so we can use, e.g., bisection search to find the maximum density such that the probability that we saturate our chunk is \(\frac{\alpha}{n}\), and thus the probability that any continuous run of entries has size at least \(s(n) = 5 \log\sb{2}\sp{3/2} n\) is less than \(\alpha\).
For \(\alpha = 0.5\), the plot of densities looks like the following.
This curve roughly matches the shape of some of my older purely numerical experiments with Robin Hood hashing. When the table is small (less than \(\approx 1000\)), \(\log n\) is a large fraction of \(n\), so the probability of finding a run of size \(s(n) = 5 \log\sb{2}\sp{3/2} n\) is low. When the table is much larger, the asymptotic result kicks in, and the probability slowly shrinks. However, even around the worst case \(n \approx 4500\), we can exceed \(77\%\) density and only observe a run of length \(s(n)\) half the time.
If we really don’t want to rehash, we can let \(\alpha = 10\sp{-10}\), which compresses the curve and shifts it down: the minimum value is now slightly above \(50\%\) density, and we can clearly see the growth in permissible density as the size of the table grows.
In practice, we can dynamically compute the worst-case displacement, which is always less than the longest run (i.e., less than \(s(n) = 5 \log\sb{2}\sp{3/2} n\)). However, having non-asymptotic bounds lets us write size-specialised code and know that its assumptions are likely to be satisfied in real life.
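To get a feel for how slowly this run-length bound grows, here is a quick Python sketch of the \(s(n) = 5 \log\sb{2}\sp{3/2} n\) chunk size from the post (the function name is mine):

```python
import math

def run_length_bound(n):
    """Worst-case contiguous-run bound s(n) = 5 * (log2 n)^(3/2)
    from the union-bound argument in the post."""
    return 5 * math.log2(n) ** 1.5

# the bound grows far more slowly than the table itself;
# e.g. the post notes s(1024) = 158.1
for n in (1024, 10**6, 10**9):
    print(n, round(run_length_bound(n), 1))
```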
Bounding buffer sizes for operations on sorted hashed streams
I mentioned at the beginning of this post that we can also manipulate Robin Hood hash tables as sorted sets, where the sort keys are uniformly distributed hash values.
Let’s say we wished to merge the immutable source table S into the larger destination table D in place, without copying all of D.
For example, from
S = [2, 3, 4]
D = [1, 6, 7, 9, 10];
we want the merged result
D' = [1, 2, 3, 4, 6, 7, 9, 10].
The issue here is that, even with gaps, we might have to overwrite elements of D, and buffer them in some scratch space until we get to their final position. In this case, all three elements of S must be inserted between the first and second elements of D, so we could need to buffer D[1:4].
How large of a merge buffer should we reasonably plan for?
In general, we might have to buffer as many elements as there are in the smaller of S and D. However, we’re working with hash values, so we can expect them to be distributed uniformly. That should give us some grip on the problem.
We can do even better and only assume that both sorted sets were sampled from the same underlying distribution. The key idea is that the rank of an element in S is equal to the value of S’s empirical distribution function for that element, multiplied by the size of S (similarly for D).
The amount of buffering we might need is simply a measure of the worst-case difference between the two empirical DFs: the more S gets ahead of D, the more we need to buffer values of D before overwriting them (if we’re very unlucky, we might need a buffer the same size as S). That’s the two-sample Kolmogorov-Smirnov statistic, and we have simple bounds for that distance.
With probability \(1 - \alpha\), we’ll consume from S and D at the same rate \(\pm \sqrt{-\frac{(S + D) \ln \alpha}{2 S D}}\). We can let \(\alpha = 10\sp{-10}\) and preallocate a buffer of size
\[S \sqrt{-\frac{(S + D) \ln \alpha}{2 S D}} < \sqrt{\frac{23.03 S (S + D)}{2 D}}.\]
In the worst case, \(S = D\), and we can preallocate a buffer of size \(\sqrt{23.03 D} < 4.8 \sqrt{D}\) and only need to grow the buffer every ten billion (\(\alpha\sp{-1}\)) merges.
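This preallocation rule is plain arithmetic; a quick Python sketch (the function name is mine), using \(-\ln 10\sp{-10} \approx 23.03\):

```python
import math

def merge_buffer_bound(s, d, alpha=1e-10):
    """Buffer size for merging a sorted hashed set of size s into one of
    size d: s * sqrt(-(s + d) * ln(alpha) / (2 * s * d)), with failure
    probability at most alpha (two-sample Kolmogorov-Smirnov bound)."""
    return s * math.sqrt(-(s + d) * math.log(alpha) / (2 * s * d))
```

In the worst case s == d, this simplifies to sqrt(-s * ln(alpha)), i.e. sqrt(23.03 s) < 4.8 * sqrt(s) for alpha = 1e-10, matching the text.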
The same bound applies in a stream processing setting; I assume this is closer to what Frank had in mind when he brought up this question.
Let’s assume a “push” dataflow model, where we still work on sorted sets of uniformly distributed hash values (and the data tuples associated with them), but now in streams that generate values every tick. The buffer size problem is now as follows. We wish to implement a sorted merge operator for two input streams that generate one value every tick, and we can’t tell our sources to cease producing values; how much buffer space might we need in order to merge them correctly?
Again, we can go back to the Kolmogorov-Smirnov statistic. In this case, however, we could buffer each stream independently, so we’re looking for critical values for the one-sided one-sample Kolmogorov-Smirnov test (how much one stream might get ahead of the hypothetical exactly uniform stream). We have recent (1990) simple and tight bounds for this case as well.
The critical values for the one-sided case are stronger than the two-sided two-sample critical values we used earlier: given an overflow probability of \(1 - \alpha\), we need to buffer at most \(\sqrt{-\frac{n \ln \alpha}{2}}\) elements. For \(\alpha = 10\sp{-20}\) that’s less than \(4.8 \sqrt{n}\).^{3} This square root scaling is pretty good news in practice: shrinking \(n\) to \(\sqrt{n}\) tends to correspond to going down a rung or two in the storage hierarchy. For example, \(10\sp{15}\) elements is clearly in the range of distributed storage; however, such a humongous stream calls for a buffer of fewer than \(1.5 \cdot 10\sp{8}\) elements, which, at a couple gigabytes, should fit in RAM on one large machine. Similarly, \(10\sp{10}\) elements might fill the RAM on one machine, but the corresponding buffer of less than half a million elements could fit in L3 cache, while one million elements could fill the L3, and 4800 elements fit in L1 or L2.
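The numbers above are easy to check with the one-sided bound itself; a short Python sketch (function name is mine):

```python
import math

def stream_buffer_bound(n, alpha=1e-20):
    """One-sided one-sample Kolmogorov-Smirnov bound on how far one
    stream of n hashed values can get ahead of the uniform ideal:
    sqrt(-n * ln(alpha) / 2) elements, with failure probability alpha."""
    return math.sqrt(-n * math.log(alpha) / 2)

# 10^15 elements (distributed storage) need a buffer of ~1.5e8 (RAM);
# 10^10 elements (RAM) need a buffer of under half a million (L3 cache)
for n in (10**15, 10**10):
    print(n, stream_buffer_bound(n))
```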
What I find neat about this (probabilistic) bound on the buffer size is its independence from the size of the other inputs to the merge operator. We can have a shared \(\Theta(\sqrt{n})\)-size buffer in front of each stream, and do all our operations without worrying about getting stuck (unless we’re extremely unlucky, in which case we can grow the buffer a bit and resume or restart the computation).
Probably of more theoretical interest is the fact that these bounds do not assume a uniform distribution, only that all the input streams are identically and independently sampled from the same underlying distribution. That’s the beauty of working in terms of the (inverse) distribution functions.
I don’t think there’s anything deeper
That’s it. Two cute tricks that use wellunderstood statistical distributions in hashed data structure and algorithm design. I doubt there’s anything to generalise from either bounding approach.
However, I definitely believe they’re useful in practice. I like knowing that I can expect the maximum displacement for a table of \(n\) elements with Robin Hood linear probing to be less than \(5 \log\sb{2}^{3/2} n\), because that lets me select an appropriate option for each table, as a function of that table’s maximum displacement, while knowing the range of displacements I might have to handle. Having a strong bound on how much I might have to buffer for stream join operators feels even more useful: I can preallocate a single buffer per stream and not think about efficiently growing the buffer, or signaling that a consumer is falling behind. The probability that I’ll need a larger buffer is so low that I just need to handle it, however inefficiently. In a replicated system, where each node picks an independent hash function, I would even consider crashing when the buffer is too small!
Vsevolod Dyomkin — Programming Algorithms: Trees
@2019-09-28 08:51 · 69 days ago
Couldn't resist adding this xkcd
Balancing a binary tree is the infamous interview problem that has all that folklore and debate associated with it. To tell you the truth, like the other 99% of programmers, I never had to perform this task for some work-related project. And not even due to the existence of ready-made libraries, but because self-balancing binary trees are, actually, pretty rarely used. But trees, in general, are ubiquitous even if you may not recognize their presence. The source code we operate with, at some stage of its life, is represented as a tree (a popular term here is Abstract Syntax Tree or AST, but the abstract variant is not the only one the compilers process). The directory structure of the file system is a tree. The object-oriented class hierarchy is likewise. And so on. So, returning to interview questions, trees indeed are a good area as they allow covering a number of basic points: linked data structures, recursion, complexity. But there's a much better task, which I have encountered a lot in practice and also used quite successfully in the interview process: breadth-first tree traversal. We'll talk about it a bit later.
Similar to how hash-tables can be thought of as more sophisticated arrays (they are sometimes even called "associative arrays"), trees may be considered an expansion of linked lists. Although, technically, a few specific trees are implemented not as a linked data structure but are based on arrays, the majority of trees are linked. Like hash-tables, some trees also allow for efficient access to an element by key, representing an alternative key-value implementation option.
Basically, a tree is a recursive data structure that consists of nodes. Each node may have zero or more children. If a node doesn't have a parent, it is called the root of the tree. And the constraint on trees is that the root is always single. Graphs may be considered a generalization of trees that don't impose this constraint, and we'll discuss them in a separate chapter. In graph terms, a tree is an acyclic directed single-component graph. Directed means that there's a one-way parent-child relation. And acyclic means that a child can't have a connection to the parent, neither directly nor through some other nodes (in the opposite case, which one would be the parent and which the child?). The recursive nature of trees manifests in the fact that if we extract an arbitrary node of the tree with all of its descendants, the resulting part will remain a tree. We can call it a subtree. Besides parent-child or, more generally, ancestor-descendant "vertical" relationships that apply to all the nodes in the tree, we can also talk about horizontal siblings — the set of nodes that have the same parent/ancestor.
Another important tree concept is the distinction between terminal (leaf) and nonterminal (branch) nodes. Leaf nodes don't have any children. In some trees, the data is stored only in the leaves with branch nodes serving to structure the tree in a certain manner. In other trees, the data is stored in all nodes without any distinction.
Implementation Variants
As we said, the default tree implementation is a linked structure. A linked list may be considered a degenerate tree with all nodes having a single child. A tree node may have more than one child, and so, in a linked representation, each tree root or subroot is the origin of a number of linked lists (sometimes, they are called "paths").
Tree:    a
        / \
       b   c
      / \   \
     d   e   f

Lists:
a -> b -> d
a -> b -> e
a -> c -> f
b -> d
b -> e
c -> f
So, a simple linked tree implementation will look a lot like a linked list one:
(defstruct (tree-node (:conc-name nil))
  key
  children)  ; instead of linked list's next

CL-USER> (with ((f (make-tree-node :key "f"))
                (e (make-tree-node :key "e"))
                (d (make-tree-node :key "d"))
                (c (make-tree-node :key "c" :children (list f)))
                (b (make-tree-node :key "b" :children (list d e))))
           (make-tree-node :key "a"
                           :children (list b c)))
#S(TREE-NODE
   :KEY "a"
   :CHILDREN (#S(TREE-NODE
                 :KEY "b"
                 :CHILDREN (#S(TREE-NODE :KEY "d" :CHILDREN NIL)
                            #S(TREE-NODE :KEY "e" :CHILDREN NIL)))
              #S(TREE-NODE
                 :KEY "c"
                 :CHILDREN (#S(TREE-NODE :KEY "f" :CHILDREN NIL)))))
Similar to lists that had to be constructed from tail to head, we had to populate the tree in reverse order: from leaves to root. With lists, we could, as an alternative, use push and reverse the result in the end. But, for trees, there's no such operation as reverse.
Obviously, not only lists can be used as a data structure to hold the children. When the number of children is fixed (for example, in a binary tree), they may be defined as separate slots: e.g. left and right. Another option is to use a key-value, which allows assigning labels to tree edges (as the keys of the kv), but the downside is that the ordering isn't defined (unless we use an ordered kv like a linked hash-table). We may also want to assign weights or other properties to the edges, and, in this case, either an additional collection (say child-weights) or a separate edge struct should be defined to store all those properties. In the latter case, the node structure will contain edges instead of children. In fact, the tree can also be represented as a list of such edge structs, although this approach is quite inefficient for most of the use cases.
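The key-value and edge-struct variants can be sketched concisely; here is a minimal Python sketch (the Edge/Node names and the label-to-edge dict are my own for this example, not the book's API). An ordinary dict in modern Python preserves insertion order, so it plays the role of the "ordered kv" mentioned above:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    # the edge struct carries per-edge properties (here, just a weight)
    weight: float
    child: "Node"

@dataclass
class Node:
    key: str
    edges: dict = field(default_factory=dict)  # edge label -> Edge

    def add_child(self, label, child, weight=1.0):
        self.edges[label] = Edge(weight, child)

# labels on the edges, weights in the edge structs
root = Node("a")
root.add_child("left", Node("b"), weight=0.5)
root.add_child("right", Node("c"))
```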
Another tree representation utilizes the available linked list implementation directly instead of reimplementing it. Let's consider the following simple Lisp form:
(defun foo (bar)
"Foo function."
(baz bar))
It is a tree with the root containing the symbol defun and 4 children:

- the terminal symbol foo
- the tree containing the function arguments ((bar))
- the terminal string (the docstring "Foo function.")
- and the tree containing the form to evaluate ((baz bar))
By default, in the listbased tree, the first element is the head and the rest are the leaves. This representation is very compact and convenient for humans, so it is used not only for source code. For example, you can see a similar representation for the constituency trees, in linguistics:
(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN test))) (. .)))
;; if we'd like to use the above form as Lisp code,
;; we'd have to shield the symbol "." with vertical bars: (|.| |.|) instead of (. .)
It is equivalent to the following parse tree:
              TOP
               |
               S
      .--------+--------.
      |        VP       |
      |     .--+---.    |
      NP    |      NP   |
      |     |    .-+--. |
      DT   VBZ   DT  NN .
      |     |    |   |  |
     This   is   a  test .
Another, more specific, alternative is when we are interested only in the terminal nodes. In that case, there will be no explicit root and each list item will be a subtree. The following trees are equivalent:
(((a b (c d)) e) (f (g h)))

           <root>
          /      \
         .        .
        / \      / \
       .   e    f   .
      /|\          / \
     a b .        g   h
        / \
       c   d
A tree that has all terminals at the same depth and all nonterminal nodes present — a complete tree — with a specified number of children may be stored in a vector. This is a very efficient implementation that we'll have a glance at when we talk about heaps.
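In the vector representation, finding a child or parent is pure index arithmetic; a small Python sketch of the idea (names are mine), with the standard level-by-level layout:

```python
def children_indices(i, arity=2):
    """In a complete tree stored in a vector, the children of the node
    at index i live at arity*i + 1 .. arity*i + arity."""
    return [arity * i + k for k in range(1, arity + 1)]

def parent_index(i, arity=2):
    """Inverse of the mapping above."""
    return (i - 1) // arity

# the complete binary tree below, laid out level by level:
#         a(0)
#        /    \
#     b(1)    c(2)
#     /  \    /
#  d(3) e(4) f(5)
tree = ["a", "b", "c", "d", "e", "f"]
```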
Finally, a tree may be also represented, although quite inefficiently, with a matrix (only one half is necessary).
Tree Traversal
It should be noted that, unlike with other structures, basic operations, such as tree construction, modification, element search and retrieval, work differently for different tree variants. Thus we'll discuss them further when describing those variants.
Yet, one tree-specific operation is common to all tree representations: traversal. Traversing a tree means iterating over its subtrees or nodes in a certain order. The most direct traversal is called depth-first search or DFS. It is the recursive traversal from parent to child and then to the next child after we return from the recursion. The simplest DFS for our tree-node-based tree may be coded in the following manner:
(defun dfs-node (fn root)
  (call fn (key root))
  (dolist (child (children root))
    (dfs-node fn child)))

CL-USER> (dfs-node 'print *tree*)  ; where *tree* is taken from the previous example
"a"
"b"
"d"
"e"
"c"
"f"
In the spirit of Lisp, we could also define a convenience macro:
(defmacro dotree-dfs ((value root) &body body)
  (let ((node (gensym)))  ; this code is needed to prevent possible symbol collisions for NODE
    `(dfs-node (lambda (,node)
                 (let ((,value (key ,node)))
                   ,@body))
               ,root)))
And if we'd like to traverse a tree represented as a list, the changes are minor:
(defun dfs-list (fn tree)
  ;; we need to handle both subtrees (lists) and leaves (atoms)
  ;; so, we'll just convert everything to a list
  (let ((tree (mklist tree)))
    (call fn (first tree))
    (dolist (child (rest tree))
      (dfs-list fn child))))

CL-USER> (dfs-list 'print '(defun foo (bar)
                             "Foo function."
                             (baz bar)))
DEFUN
FOO
BAR
"Foo function."
BAZ
BAR
Recursion is very natural in tree traversal: we could even say that trees are recursion realized in a data structure. And the good news here is that there's rarely a chance to hit recursion limits: the majority of trees are not infinite, and the height of the tree, which conditions the depth of recursion, grows proportionally to the logarithm of the tree size[1], and that's pretty slow.
These simple DFS implementations apply the function before descending down the tree. This style is called preorder traversal. There are alternative styles: inorder and postorder. With postorder, the call is executed after the recursion returns, i.e. on the recursive ascent:
(defun post-dfs (fn node)
  (dolist (child (children node))
    (post-dfs fn child))
  (call fn (key node)))

CL-USER> (post-dfs 'print *tree*)
"d"
"e"
"b"
"f"
"c"
"a"
In-order traversal is applicable only to binary trees: first, traverse the left side, then call fn, and then descend into the right side.
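Since the tree-node struct above keeps a list of children rather than left/right slots, here is a minimal in-order sketch in Python with hypothetical left/right slots (names are mine):

```python
class BinNode:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right

def in_order(fn, node):
    """Traverse the left subtree, visit the node, then the right subtree."""
    if node is None:
        return
    in_order(fn, node.left)
    fn(node.key)
    in_order(fn, node.right)

# for a binary search tree, in-order traversal yields the keys in sorted order
tree = BinNode(2, BinNode(1), BinNode(3))
```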
An alternative traversal approach is breadth-first search (BFS). It isn't as natural as DFS, as it traverses the tree layer by layer, i.e. it has to first accumulate all the nodes that have the same depth and then iterate over them. In the general case, it isn't justified, but there are a number of algorithms where exactly such ordering is required.
Here is an implementation of BFS (preorder) for our tree-nodes:
(defun bfs (fn nodes)
  (let ((next-level (list)))
    (dolist (node (mklist nodes))
      (call fn (key node))
      (dolist (child (children node))
        (push child next-level)))
    (when next-level
      (bfs fn (reverse next-level)))))

CL-USER> (bfs 'print *tree*)
"a"
"b"
"c"
"d"
"e"
"f"
An advantage of BFS traversal is that it can handle potentially unbounded trees, i.e. it is suitable for processing trees in a streamed manner, layer by layer.
In object-orientation, tree traversal is usually accomplished by means of the so-called Visitor pattern. Basically, it's the same approach of passing a function to the traversal procedure, but in the disguise of additional (and excessive) OO-related machinery. Here is a Visitor pattern example in Java:
interface ITreeNode {
    List<ITreeNode> children();
    void accept(ITreeVisitor visitor);
}

interface ITreeVisitor {
    void visit(ITreeNode node);
}

interface IFn {
    void call(ITreeNode node);
}

class TreeNode implements ITreeNode {
    private List<ITreeNode> children = new ArrayList<>();

    public List<ITreeNode> children() {
        return children;
    }

    public void accept(ITreeVisitor visitor) {
        visitor.visit(this);
    }
}

class PreOrderTreeVisitor implements ITreeVisitor {
    private IFn fn;

    public PreOrderTreeVisitor(IFn fn) {
        this.fn = fn;
    }

    public void visit(ITreeNode node) {
        fn.call(node);
        for (ITreeNode child : node.children())
            child.accept(this);
    }
}
The zest of this example is the implementation of the method visit that calls the function with the current node and iterates over its children by recursively applying the same visitor. You can see that it's exactly the same as our dfs-node.
One of the interesting tree-traversal tasks is tree printing. There are many ways in which trees can be displayed. The simplest one is directory-style (like the one used by the Unix tree utility):
$ tree /etc/acpi
/etc/acpi
├── asus-wireless.sh
├── events
│   ├── asus-keyboard-backlight-down
│   ├── asus-keyboard-backlight-up
│   ├── asus-wireless-off
│   └── asus-wireless-on
└── undock.sh
It may be implemented with DFS and only requires tracking of the current level in the tree:
(defun pprint-tree-dfs (node &optional (level 0) (skip-levels (make-hash-table)))
  (when (= 0 level)
    (format t "~A~%" (key node)))
  (let ((last-index (1- (length (children node)))))
    (doindex (i child (children node))
      (let ((last-child-p (= i last-index)))
        (dotimes (j level)
          (format t "~C   " (if (? skip-levels j) #\Space #\│)))
        (format t "~C── ~A~%"
                (if last-child-p #\└ #\├)
                (key child))
        (:= (? skip-levels level) last-child-p)
        (pprint-tree-dfs child
                         (1+ level)
                         skip-levels)))))

CL-USER> (pprint-tree-dfs *tree*)
a
├── b
│   ├── d
│   └── e
└── c
    └── f
1+ and 1- are standard Lisp shortcuts for adding/subtracting 1 from a number. The skip-levels argument is used for the last elements to not print the excess │.
A more complicated variant is top-to-bottom printing:

;; example from CL-NLP library
CL-USER> (nlp:pprint-tree
           '(TOP (S (NP (NN "This"))
                    (VP (VBZ "is")
                        (NP (DT "a")
                            (JJ "simple")
                            (NN "test")))
                    (. "."))))
          TOP
           :
           S
   .-------:-----------.
   :       VP          :
   :   .---:------.    :
   NP  :          NP   :
   :   :    .-----:--. :
   NN VBZ   DT   JJ NN :
   :   :    :    :   : :
 This  is   a simple test .
This style, most probably, will need a BFS and a careful calculation of the spans of each node to properly align everything. Implementing such a function is left as an exercise to the reader, and a very enlightening one, I should say.
Binary Search Trees
Now, we can return to the topic of basic operations on tree elements. The advantage of trees is that, when built properly, they guarantee O(log n)
for all the main operations: search, insertion, modification, and deletion.
This quality is achieved by keeping the leaves sorted and the trees in a balanced state. "Balanced" means that any pair of paths from the root to the leaves have lengths that may differ by at most some predefined quantity: ideally, just 1 (AVL trees), or, as in the case of Red-Black trees, the longest path can be at most twice as long as the shortest. Situations when all the elements align along a single path, effectively turning the tree into a list, should be completely ruled out. We have already seen, with Binary search and Quicksort (remember the justification for the 3-medians rule), why this constraint guarantees logarithmic complexity.
The classic example of balanced trees are Binary Search Trees (BSTs), of which AVL and Red-Black trees are the most popular variants. All the properties of BSTs may be trivially extended to n-ary trees, so we'll discuss the topic using binary tree examples.
Just to reiterate the general intuition for the logarithmic complexity of tree operations, let's examine a complete binary tree: a tree that has all levels completely filled with elements, except maybe for the last one. In it, we have n elements, and each level contains twice as many nodes as the previous. This property means that n is not greater than (+ 1 2 4 ... (/ k 2) k), where k is the capacity of the last level. This formula is nothing but the sum of a geometric progression with the number of items equal to h, which is, by the textbook:
(/ (* 1 (- 1 (expt 2 h)))
   (- 1 2))
In turn, this expression may be reduced to: (- (expt 2 h) 1). So (+ n 1) equals (expt 2 h), i.e. the height of the tree (h) equals (log (+ n 1) 2).
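To make the relationship concrete, here is a small Python sketch (an illustration of the arithmetic above, not part of the book's Lisp code) that checks the height formula against the node counts of full binary trees:

```python
import math

def full_tree_height(n):
    """Height of a binary tree with all levels filled,
    which contains n = 2^h - 1 nodes."""
    return int(math.log2(n + 1))

# each level doubles the node count: 1 + 2 + 4 + ... + 2^(h-1) = 2^h - 1
assert full_tree_height(1) == 1
assert full_tree_height(7) == 3
assert full_tree_height(1023) == 10
```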
BSTs have the ordering property: if some element is to the right of another in the tree, it should consistently be greater (or smaller, depending on the ordering direction). This constraint means that, after the tree is built, just extracting its elements by performing an in-order DFS produces a sorted array. The Treesort algorithm utilizes this approach directly to achieve the same O(n * log n)
complexity as other efficient sorting algorithms. This n * log n
is the complexity of each insertion (O(log n)
) multiplied by the number of times it should be performed (n
). So, Treesort operates by taking an array and adding its elements to the BST, then traversing the tree and putting the encountered elements into the resulting array, in a proper order.
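As a cross-check of the idea, here is a Python sketch (not the book's Lisp, and using a plain unbalanced BST, so the O(n * log n) bound only holds when insertions don't degenerate the tree into a list):

```python
def tree_insert(node, key):
    # a node is a list [key, left, right]; an empty tree is None
    if node is None:
        return [key, None, None]
    if key < node[0]:
        node[1] = tree_insert(node[1], key)
    else:
        node[2] = tree_insert(node[2], key)
    return node

def tree_sort(items):
    root = None
    for x in items:          # n insertions, O(log n) each on average
        root = tree_insert(root, x)
    out = []
    def inorder(node):       # in-order DFS yields the elements sorted
        if node:
            inorder(node[1])
            out.append(node[0])
            inorder(node[2])
    inorder(root)
    return out

assert tree_sort([5, 1, 10, 3, 7]) == [1, 3, 5, 7, 10]
```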
Besides, the ordering property also means that, after adding a new element to the tree, in the general case, it should be rebalanced, as the newly added element may not be placed in an arbitrary spot but has just two admissible locations, and choosing either of those may violate the balance constraint. The specific balance invariants and approaches to tree rebalancing are the distinctive properties of each variant of BSTs that we will see below.
Splay Trees
A Splay tree represents a kind of BST that is one of the simplest to understand and to implement. It is also quite useful in practice. It has the least strict constraints and a nice property that recently accessed elements occur near the root. Thus, a Splay tree can naturally act as an LRU-cache. However, there are degraded scenarios that result in O(n) access performance, although the average complexity of Splay tree operations is O(log n) due to amortization (we'll talk about it in a bit).
The approach to balancing a Splay tree is to move the element we have accessed/inserted into the root position. The movement is performed by a series of operations that are called tree rotations. A certain pattern of rotations forms a step of the algorithm. For all BSTs, there are just two possible tree rotations, and they serve as the basic building blocks of all balancing algorithms. A rotation may be either a left or a right one. Their purpose is to put the left or the right child into the position of its parent, preserving the order of all the other child elements. The rotations can be illustrated by the following diagrams, in which x is the parent node, y is the target child node that will become the new parent, and A, B, C are subtrees. It is said that the rotation is performed around the edge x -> y.
Left rotation:
    x                y
   / \              / \
  y   C    ->      A   x
 / \                  / \
A   B                B   C
Right rotation:
    x                y
   / \              / \
  A   y    ->      x   C
     / \          / \
    B   C        A   B
As you see, the left and right rotations are complementary operations, i.e. performing one after the other will return the tree to the original state. During the rotation, the inner subtree (B) has its parent changed from y to x.
Here's an implementation of rotations:
(defstruct (bst-node (:conc-name nil)
                     (:print-object (lambda (node out)
                                      (format out "[~a~@[~a~]~@[~a~]]"
                                              (key node)
                                              (lt node)
                                              (rt node)))))
  key
  val  ; we won't use this slot in the examples,
       ; but without it, in real-world use cases,
       ; the tree doesn't make any sense :)
  lt   ; left child
  rt)  ; right child
(defun tree-rotate (node parent grandparent)
  (cond
    ((eql node (lt parent)) (:= (lt parent) (rt node)
                                (rt node) parent))
    ((eql node (rt parent)) (:= (rt parent) (lt node)
                                (lt node) parent))
    (t (error "NODE (~A) is not the child of PARENT (~A)"
              node parent)))
  (cond
    ((null grandparent) (return-from tree-rotate node))
    ((eql parent (lt grandparent)) (:= (lt grandparent) node))
    ((eql parent (rt grandparent)) (:= (rt grandparent) node))
    (t (error "PARENT (~A) is not the child of GRANDPARENT (~A)"
              parent grandparent))))
You have probably noticed that we need to pass to this function not only the nodes on the edge around which the rotation is executed but also the grandparent node of the target to link the changes into the tree. If grandparent is not supplied, it is assumed that parent is the root, and we need to separately reassign the variable holding the reference to the tree to the child, after the rotation.
Splay trees combine rotations into three possible actions:
- The Zig step is used to make the node the new root when it's already the direct child of the root. It is accomplished by a single left/right rotation (depending on whether the target is to the left or to the right of the root) followed by an assignment.
- The Zig-zig step is a combination of two zig steps that is performed when both the target node and its parent are left/right nodes. The first rotation is around the edge between the target node and its parent, and the second around the target and its former grandparent that has become its new parent after the first rotation.
- The Zig-zag step is performed when the target and its parent are not in the same direction: either one is left while the other is right, or vice versa. In this case, correspondingly, first a left rotation around the parent is needed, and then a right one around its former grandparent (that has now become the new parent of the target). Or vice versa.
However, with our implementation of tree rotations, we don't have to distinguish the 3 different steps, and the implementation of the splay operation becomes really trivial:
(defun splay (node &rest chain)
  (loop :for (parent grandparent) :on chain :do
    (tree-rotate node parent grandparent))
  node)
The key point here, and in the implementation of the other Splay tree operations, is the use of reversed chains of nodes from the child to the root, which allows us to perform chains of splay operations in an end-to-end manner, as well as custom modifications of the tree structure.
From the code, it is clear that splaying requires, at maximum, the same number of steps as the height of the tree because each rotation brings the target element 1 level up. Now, let's discuss why all Splay tree operations are O(log n). Element access requires a binary search for the element in the tree, which is O(log n) provided the tree is balanced, and then splaying it to the root, which is also O(log n). Deletion requires search, then swapping the element either with the rightmost child of its left subtree or the leftmost child of its right subtree (its direct predecessor/successor) to make it childless, removing it, and, finally, splaying the parent of the removed node. An update is, at worst, a deletion followed by an insertion.
Here is the implementation of the Splay tree built of bst-nodes and restricted to arithmetic comparison operations only. All of the high-level functions, such as st-search, st-insert, or st-delete, return the new tree root obtained after the operation, which should substitute the previous one in the caller code.
(defun node-chain (item root &optional chain)
  "Return as the values the node equal to ITEM or the closest one to it
   and the chain of nodes leading to it, in the splay tree based in ROOT."
  (if root
      (with (((key lt rt) ? root)
             (chain (cons root chain)))
        (cond ((= item key) (values root
                                    chain))
              ((< item key) (node-chain item lt chain))
              ((> item key) (node-chain item rt chain))))
      (values nil
              chain)))

(defun st-search (item root)
  (with ((node chain (node-chain item root)))
    (when node
      (apply 'splay chain))))
(defun st-insert (item root)
  (assert root nil "Can't insert item into a null tree")
  (with ((node chain (node-chain item root)))
    (unless node
      (let ((parent (first chain)))
        ;; here, we use the property of the := expression
        ;; that it returns the item being set
        (push (:= (? parent (if (> (key parent) item) 'lt 'rt))
                  (make-bst-node :key item))
              chain)))
    (apply 'splay chain)))
(defun idir (dir)
  (case dir
    (lt 'rt)
    (rt 'lt)))

(defun closest-child (node)
  (dolist (dir '(lt rt))
    (let ((parent nil)
          (current nil))
      (do ((child (call dir node) (call (idir dir) child)))
          ((null child) (when current
                          (return-from closest-child
                            (values dir
                                    current
                                    parent))))
        (:= parent current
            current child)))))
(defun st-delete (item root)
  (with ((node chain (node-chain item root))
         (parent (second chain)))
    (if (null node)
        root  ; ITEM was not found
        (with ((dir child child-parent (closest-child node))
               (idir (idir dir)))
          (when parent
            (:= (? parent (if (eql (lt parent) node) 'lt 'rt))
                child))
          (when child
            (:= (? child idir) (? node idir))
            (when child-parent
              (:= (? child-parent idir) (? child dir))))
          (if parent
              (apply 'splay (rest chain))
              child)))))

(defun st-update (old new root)
  (st-insert new (st-delete old root)))
The deletion is somewhat tricky due to the need to account for different cases: removing the root, a direct child of the root, or any other node.
Let's test the Splay tree operation in the REPL (coding pprint-bst as a slight modification of pprint-tree-dfs is left as an exercise to the reader):
CL-USER> (defparameter *st* (make-bst-node :key 5))
CL-USER> *st*
[5]
CL-USER> (pprint-bst (:= *st* (st-insert 1 *st*)))
1
├── .
└── 5
CL-USER> (pprint-bst (:= *st* (st-insert 10 *st*)))
10
├── 1
│ ├── .
│ └── 5
└── .
CL-USER> (pprint-bst (:= *st* (st-insert 3 *st*)))
3
├── 1
└── 10
  ├── .
  └── 5
CL-USER> (pprint-bst (:= *st* (st-insert 7 *st*)))
7
├── 3
│ ├── 1
│ └── 5
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 8 *st*)))
8
├── 7
│ ├── 3
│ │ ├── 1
│ │ └── 5
│ └── .
└── 10
CL-USER> (pprint-bst (:= *st* (st-insert 2 *st*)))
2
├── 1
└── 8
  ├── 7
  │ ├── 3
  │ │ ├── .
  │ │ └── 5
  │ └── .
  └── 10
CL-USER> (pprint-bst (:= *st* (st-insert 4 *st*)))
4
├── 2
│ ├── 1
│ └── 3
└── 8
  ├── 7
  │ ├── 5
  │ └── .
  └── 10
CL-USER> *st*
[4[2[1][3]][8[7[5]][10]]]
As you can see, the tree gets constantly rearranged at every insertion.
Accessing an element, when it's found in the tree, also triggers tree restructuring:
CL-USER> (pprint-bst (st-search 5 *st*))
5
├── 4
│ ├── 2
│ │ ├── 1
│ │ └── 3
│ └── .
└── 8
  ├── 7
  └── 10
The insertion and deletion operations for the Splay tree may also have an alternative implementation: first, split the tree in two at the place of the element to be added/removed, and then combine the parts. For insertion, the combination is performed by making the new element the root and linking the previously split subtrees to its left and right. As for deletion, splitting the Splay tree requires splaying the target element and then breaking the two subtrees apart (removing the target that has become the root). The combination is also O(log n), and it is performed by splaying the rightmost node of the left subtree (the largest element) so that it doesn't have a right child. Then the right subtree can be linked to this vacant slot.
Although regular access to the Splay tree requires splaying of the element we have touched, tree traversal should be implemented without splaying; rather, just the normal DFS/BFS procedures should be used. First of all, this approach will keep the complexity of the operation at O(n) without the unnecessary log n multiplier added by the splaying operations. Besides, accessing all the elements in order would trigger the edge-case scenario and turn the Splay tree into a list, which is exactly the situation we want to avoid.
Complexity Analysis
All of those considerations apply under the assumption that all the tree operations are O(log n). But we haven't proven it yet. It turns out that, for Splay trees, this isn't a trivial task and requires amortized analysis. Basically, this approach averages the cost of all operations over all tree elements. Amortized analysis allows us to confidently use many advanced data structures for which it isn't possible to prove the required time bounds for individual operations, but the general performance over the lifetime of the data structure stays within those bounds.
The principal tool of the amortized analysis is the potential method. Its idea is to combine, for each operation, not only its direct cost but also the change to the potential cost of other operations that it brings. For Splay trees, we can observe that only zig-zig and zig-zag steps are important for the analysis, as the zig step happens only once for each splay operation and changes the height of the tree by at most 1. Also, both zig-zig and zig-zag have the same potential.
Rigorously calculating the exact potential requires a number of mathematical proofs that we don't have space to show here, so let's just list the main results.
The potential of the whole Splay tree is the sum of the ranks of all nodes, where rank is the logarithm of the number of elements in the subtree rooted at node:
(defun rank (node)
  (let ((size 0))
    (dotree-dfs (_ node)
      (:+ size))
    (log size 2)))
The change of potential produced by a single zig-zig/zig-zag step can be calculated in the following manner:
(+ (- (rank grandparent-new) (rank grandparent-old))
   (- (rank parent-new) (rank parent-old))
   (- (rank node-new) (rank node-old)))
Since (= (rank node-new) (rank grandparent-old)), it can be reduced to:
(- (+ (rank grandparent-new) (rank parent-new))
   (+ (rank parent-old) (rank node-old)))
Which is not larger than:
(- (+ (rank grandparent-new) (rank node-new))
   (* 2 (rank node-old)))
Which, in turn, due to the concavity of the log function, may be reduced to:
(- (* 3 (- (rank node-new) (rank node-old))) 2)
The amortized cost of any step is 2 operations larger than the change in potential as we need to perform 2 tree rotations, so it's not larger than:
(* 3 (- (rank node-new) (rank node-old)))
When summed over the entire splay operation, this expression "telescopes" to (* 3 (- (rank root) (rank node))), which is O(log n). Telescoping means that when we calculate the sum of the cost of all zig-zag/zig-zig steps, the inner terms cancel each other and only the boundary ones remain. The difference in ranks is, in the worst case, log n, as the rank of the root is (log n 2) and the rank of an arbitrary node is between that value and (log 1 2) (0).
Finally, the total cost for m splay operations is O(m log n + n log n), where the m log n term represents the total amortized cost of a sequence of m operations and n log n is the change in potential that it brings.
As mentioned, the above exposition is just a cursory look at the application of the potential method that skips some important details. If you want to learn more you can start with this discussion on CS Theory StackExchange.
To conclude: similar to hash-tables, the performance of Splay tree operations on a concrete element depends on the order of the insertion/removal of all the elements of the tree, i.e. it has an unpredictable (random) nature. This property is a disadvantage compared to some other BST variants that provide precise performance guarantees. Another disadvantage, in some situations, is that the tree is constantly restructured, which makes it mostly unfit for usage as a persistent data structure and also may not play well with many storage options. Yet, Splay trees are simple and, in many situations, due to their LRU-property, may be preferable over other BSTs.
Red-Black and AVL Trees
Another BST that has complexity characteristics similar to Splay trees and, in general, a somewhat similar approach to rebalancing, is the Scapegoat tree. Neither of these BSTs requires storing any additional information about the current state of the tree, which results in the random aspect of their operation. And although it is smoothed over all the tree accesses, it may not be acceptable in some usage scenarios.
An alternative approach, if we want to exclude the random factor, is to track the tree state. Tracking may be achieved by adding just 1 bit to each tree node (as in Red-Black trees) or 2 bits, the so-called balance factors (AVL trees)[2]. However, in most of the high-level languages, including Lisp, we'll need to go to great lengths or even perform low-level non-portable hacking to, actually, ensure that exactly 1 or 2 bits are spent for this data, as the standard structure implementation will allocate a whole word even for a bit-sized slot. Moreover, in C, likewise, due to cache alignment, the structure will have its size aligned to memory word boundaries. So, by and large, we usually don't really care whether the data we need to track is a single bit flag or a full integer counter.
The balance guarantee of an RB tree is that, for each node, the heights of the left and right subtrees may differ by at most a factor of 2. This boundary condition occurs when the longer path contains alternating red and black nodes, and the shorter one only black nodes. Balancing is ensured by the requirement to satisfy the following invariants:
- Each tree node is assigned a label: red or black (basically, a 1-bit flag: 0 or 1).
- The root should be black (0).
- All the leaves are also black (0). And the leaves don't hold any data. A good implementation strategy to satisfy this property is to have a constant singleton terminal node that all pre-terminals will link to ((defparameter *rb-leaf* (make-rb-node))).
- If a parent node is red (1) then both its children should be black (0). Due to the mock leaves, each node has exactly 2 children.
- Every path from a given node to any of its descendant leaf nodes should contain the same number of black nodes.
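These invariants are easy to machine-check. Here is a hypothetical Python sketch (my own illustration, not from the book's code) that validates the red-red and black-height rules on a tuple-encoded tree, with None playing the role of the black mock leaf:

```python
# A node is (key, red, left, right); None is the black mock leaf.
def check_rb(node, parent_red=False):
    """Return the black-height of the subtree if it satisfies the
    invariants; raise ValueError otherwise."""
    if node is None:
        return 1                      # mock leaves count as black
    key, red, left, right = node
    if red and parent_red:
        raise ValueError("red node with a red parent")
    lh = check_rb(left, red)
    rh = check_rb(right, red)
    if lh != rh:
        raise ValueError("unequal black-heights")
    return lh + (0 if red else 1)

# a tiny valid tree: a black root with two red children
check_rb((2, False, (1, True, None, None), (3, True, None, None)))
```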
So, to keep the tree in a balanced state, the insert/update/delete operations should perform rebalancing when the constraints are violated. Robert Sedgewick has proposed the simplest version of the red-black tree, called the Left-Leaning Red-Black Tree (LLRB). The LLRB maintains an additional invariant that all red links must lean left, except during inserts and deletes, which makes for the simplest implementation of the operations. Below, we can see an outline of the insert operation:
(defstruct (rb-node (:include bst-node) (:conc-name nil))
  (red nil :type boolean))

(defun rb-insert (item root &optional parent)
  (let ((node (make-rb-node :key item)))
    (when (null root)
      (return-from rb-insert node))
    (when (and (red (lt root))
               (red (rt root)))
      (:= (red root) (not (red root))
          (red (lt root)) nil
          (red (rt root)) nil))
    (cond ((< item (key root))
           (:= (lt root) (rb-insert item (lt root) root)))
          ((> item (key root))
           (:= (rt root) (rb-insert item (rt root) root))))
    (when (and (red (rt root))
               (not (red (lt root))))
      (:= (red (rt root)) (red root)
          (red root) t
          root (tree-rotate (rt root) root parent)))
    (when (and (red (lt root))
               (red (lt (lt root))))
      (:= (red (lt root)) (red root)
          (red root) t
          root (tree-rotate (lt root) root parent)))
    root))
This code is more of an outline. You can easily find a complete implementation of the RB-tree on the internet. The key here is to understand the principle of its operation. Also, we won't discuss AVL trees in detail. Suffice it to say that they are based on the same principles but use a different set of balancing operations.
Both Red-Black and AVL trees may be used when worst-case performance guarantees are required, for example, in real-time systems. Besides, they serve as a basis for implementing persistent data structures that we'll talk about later. The Java TreeMap and similar data structures from the standard libraries of many languages are implemented with one of these BSTs. And implementations of both are present in the Linux kernel, where they are used as data structures for various queues.
OK, now you know how to balance a binary tree :D
B-Trees
A B-tree is a generalization of a BST that allows for more than two children. The number of children is not unbounded and should be in a predefined range. For instance, the simplest B-tree, the 2-3 tree, allows for 2 or 3 children. Such trees combine the main advantage of self-balanced trees (logarithmic access time) with the benefit of arrays (locality), the property which allows for faster cache access or retrieval from storage. That's why B-trees are mainly used in data storage systems. Overall, B-tree implementations perform the same trick as we saw in prod-sort: switching to sequential search when the sequence becomes small enough to fit into the cache line of the CPU.
Each internal node of a B-tree contains a number of keys. For a 2-3 tree, the number is either 1 or 2. The keys act as separation values which divide the subtrees. For example, if the keys are x and y, all the values in the leftmost subtree will be less than x, all values in the middle subtree will be between x and y, and all values in the rightmost subtree will be greater than y. Here is an example:
          [ 7 . 18 ]
        /     |      \
[ 1 . 3 ] [ 10 . 15 ] [ 20 . _ ]
This tree has 4 nodes. Each node has 2 key slots and may have 0 (in the case of the leaf nodes), 2 or 3 children. The node structure for it might look like this:
(defstruct 2-3-node
  key1
  key2
  val1
  val2
  lt
  md
  rt)
Yet, a more general B-tree node would, probably, contain arrays for keys/values and children links:
(defstruct bt-node
  (keys (make-array *max-keys*))
  (vals (make-array *max-keys*))
  (children (make-array (1+ *max-keys*))))
The element search in a B-tree is very similar to that of a BST. Just, there will be up to *max-keys* comparisons instead of 1 in each node. Insertion is more tricky, as it may require rearranging the tree items to satisfy its invariants. A B-tree is kept balanced after insertion by the procedure of splitting a would-be overfilled node of (1+ n) keys into two (/ n 2)-key siblings and inserting the mid-value key into the parent. That's why, usually, the range of the number of keys in a node of a B-tree is chosen to be between k and (* 2 k). Also, in practice, k will be pretty large: on the order of 10s or even 100s. Depth only increases when the root is split, maintaining balance. Similarly, a B-tree is kept balanced after deletion by merging or redistributing keys among siblings to maintain the minimum number of keys for non-root nodes. A merger reduces the number of keys in the parent, potentially forcing it to merge or redistribute keys with its siblings, and so on. The depth of the tree will increase slowly as elements are added to it, but an increase in the overall depth is infrequent and results in all leaf nodes being one more node farther away from the root.
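The split step can be shown in a Python toy (the flat key-list representation and the function names are my own illustration, not the book's code):

```python
def insert_sorted(keys, key):
    """Insert a key into a node's sorted key list (simple sequential search)."""
    i = 0
    while i < len(keys) and keys[i] < key:
        i += 1
    return keys[:i] + [key] + keys[i:]

def split(keys):
    """Split an overfilled node into: left sibling, separator key that goes
    up to the parent, and right sibling."""
    mid = len(keys) // 2
    return keys[:mid], keys[mid], keys[mid + 1:]

# a 2-3 node [7 18] overflows when 10 is added and splits,
# sending the middle key up to the parent
left, sep, right = split(insert_sorted([7, 18], 10))
assert (left, sep, right) == ([7], 10, [18])
```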
A version of B-trees that was developed particularly for storage systems, and is used in a number of filesystems, such as NTFS and ext4, and databases, such as Oracle and SQLite, is the B+ tree. A B+ tree can be viewed as a B-tree in which each internal node contains only keys (not key-value pairs), and to which an additional level of linked leaves is added at the bottom. The leaves of the B+ tree are linked to one another in a linked list, making range queries or an (ordered) iteration through the blocks simpler and more efficient. Such a property could not be achieved in a B-tree, since not all keys are present in the leaves: some are stored in the root or intermediate nodes.
However, a newer Linux filesystem, developed specifically for use on SSDs and called btrfs, uses plain B-trees instead of B+ trees because the former allow implementing copy-on-write, which is needed for efficient snapshots. The issue with B+ trees is that their leaf nodes are interlinked, so if a leaf were copy-on-write, its siblings and parents would have to be as well, as would their siblings and parents, and so on, until the entire tree was copied. We can recall the same situation pertaining to doubly-linked lists compared to singly-linked ones. So, a modified B-tree without leaf linkage is used in btrfs, with a refcount associated with each tree node but stored in an ad-hoc free map structure.
Overall, B-trees are a very natural continuation of BSTs, so we won't spend more time with them here. I believe it should be clear how to deal with them, overall. Surely, there are a lot of B-tree variants that have their nuances, but those should be studied in the context of a particular problem they are considered for.
Heaps
A different variant of a binary tree is a Binary Heap. Heaps are used in many different algorithms, such as pathfinding, encoding, minimum spanning tree, etc. They even have their own O(n * log n) sorting algorithm: the elegant Heapsort. In a heap, each element is either the smallest (min-heap) or the largest (max-heap) element of its subtree. It is also a complete tree, and the last layer should be filled left-to-right. This invariant makes the heap well suited for keeping track of element priorities. So Priority Queues are, usually, based on heaps. Thus, it's beneficial to be aware of the existence of this peculiar data structure.
The constraints on the heap allow representing it in a compact and efficient manner: as a simple vector. Its first element is the heap root, the second and third are its left and right children (if present), and so on, by recursion. This arrangement permits access to the parent and children of any element using simple offset-based formulas (in which the element is identified by its index):
(defun hparent (i)
  "Calculate the index of the parent of the heap element with index I."
  (floor (- i 1) 2))

(defun hrt (i)
  "Calculate the index of the right child of the heap element with index I."
  (* (+ i 1) 2))

(defun hlt (i)
  "Calculate the index of the left child of the heap element with index I."
  (- (hrt i) 1))
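The same formulas can be written in Python, with a quick consistency check (0-based indexing, as in the Lisp version; an illustrative sketch only):

```python
def h_parent(i):
    return (i - 1) // 2   # (floor (- i 1) 2)

def h_rt(i):
    return (i + 1) * 2    # (* (+ i 1) 2)

def h_lt(i):
    return h_rt(i) - 1    # (- (hrt i) 1)

# the parent and child formulas invert each other for any index
for i in range(100):
    assert h_parent(h_lt(i)) == i
    assert h_parent(h_rt(i)) == i
```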
So, to implement a heap, we don't need to define a custom node structure and, besides, can get to any element in O(1)! Here is the utility to rearrange an arbitrary array in a max-heap formation (in other words, we can consider a binary heap to be a special arrangement of array elements). It works by iteratively placing each element in its proper place by swapping it with its children until it's larger than both of them.
(defun heapify (vec)
  (let ((mid (floor (length vec) 2)))
    (dotimes (i mid)
      (heap-down vec (- mid i 1))))
  vec)

(defun heap-down (vec beg &optional (end (length vec)))
  (let ((l (hlt beg))
        (r (hrt beg)))
    (when (< l end)
      (let ((child (if (or (>= r end)
                           (> (? vec l)
                              (? vec r)))
                       l r)))
        (when (> (? vec child)
                 (? vec beg))
          ;; rotatef swaps the elements of the sequence
          (rotatef (? vec beg)
                   (? vec child))
          (heap-down vec child end)))))
  vec)
And here is the reverse operation that moves an item up the heap:
(defun heap-up (vec i)
  (when (and (plusp i)
             (> (? vec i)
                (? vec (hparent i))))
    (rotatef (? vec i)
             (? vec (hparent i)))
    (heap-up vec (hparent i)))
  vec)
Also, as with other data structures, it's essential to be able to visualize the content of the heap in a convenient form, as well as to check the invariants. These tasks may be accomplished with the help of the following functions:
(defun draw-heap (vec)
  (format t "~%")
  (with ((size (length vec))
         (h (+ 1 (floor (log size 2)))))
    (dotimes (i h)
      (let ((spaces (loop :repeat (- (expt 2 (- h i)) 1) :collect #\Space)))
        (dotimes (j (expt 2 i))
          (let ((k (+ (expt 2 i) j -1)))
            (when (= k size) (return))
            (format t "~{~C~}~2D~{~C~}" spaces (? vec k) spaces)))
        (format t "~%"))))
  (format t "~%")
  vec)
(defun check-heap (vec)
  (dotimes (i (floor (length vec) 2))
    (when (= (hlt i) (length vec)) (return))
    (assert (not (> (? vec (hlt i)) (? vec i)))
            () "Left child (~A) is > parent at position ~A (~A)."
            (? vec (hlt i)) i (? vec i))
    (when (= (hrt i) (length vec)) (return))
    (assert (not (> (? vec (hrt i)) (? vec i)))
            () "Right child (~A) is > than parent at position ~A (~A)."
            (? vec (hrt i)) i (? vec i)))
  vec)
CL-USER> (check-heap #(10 5 8 2 3 7 1 9))
Left child (9) is > parent at position 3 (2).
[Condition of type SIMPLE-ERROR]
CL-USER> (check-heap (draw-heap (heapify #(1 22 10 5 3 7 8 9 7 13))))

               22
       13              10
   9       3       7       8
 5   7   1

#(22 13 10 9 3 7 8 5 7 1)
Due to the regular nature of the heap, drawing it with BFS is much simpler than for most other trees. As with ordered trees, heap element insertion and deletion require repositioning of some of the elements:
(defun heap-push (node vec)
  (vector-push-extend node vec)
  (heap-up vec (1- (length vec))))
(defun heap-pop (vec)
  (rotatef (? vec 0) (? vec (- (length vec) 1)))
  ;; PROG1 is used to return the result of the first form
  ;; instead of the last, as happens with PROGN
  (prog1 (vector-pop vec)
    (heap-down vec 0)))
Now, we can implement Heapsort. The idea is to iteratively arrange the array in heap order element by element. Each arrangement will take log n time, as we're pushing the item down a complete binary tree the height of which is log n. And we'll need to perform n such iterations.
(defun heapsort (vec)
  (heapify vec)
  (dotimes (i (length vec))
    (let ((last (- (length vec) i 1)))
      (rotatef (? vec 0)
               (? vec last))
      (heap-down vec 0 last)))
  vec)
CL-USER> (heapsort #(1 22 10 5 3 7 8 9 7 13))
#(1 3 5 7 7 8 9 10 13 22)
There are so many sorting algorithms, so why invent another one? That's a totally valid point, but the advantage of heaps is that they keep the maximum/minimum element constantly at the top, so you don't have to perform a full sort or even descend into the tree if you need just the top element. This simplification is especially relevant if we constantly need to access such elements, as with priority queues.
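This is exactly how Python's standard heapq module is commonly used for priority queues: it maintains the min-heap invariant on a plain list, so the top-priority item is always at index 0:

```python
import heapq

# heapq keeps the min-heap invariant on a plain list:
# the smallest element is always at index 0
tasks = []
heapq.heappush(tasks, (2, "write docs"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "refactor"))

assert tasks[0] == (1, "fix bug")             # O(1) peek at the top priority
assert heapq.heappop(tasks) == (1, "fix bug")
assert heapq.heappop(tasks) == (2, "write docs")
```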
Actually, a heap does not necessarily have to be a tree. Besides the Binary Heap, there are also Binomial, Fibonacci, and other kinds of heaps that may be not single trees but collections of trees (forests). We'll discuss some of them in more detail in the next chapters, in the context of the algorithms for which their use makes a notable difference in performance.
Tries
If I were to answer the question of what's the most underappreciated data structure, I'd probably say: a trie. For me, tries are a gift that keeps on giving, and they have already saved my programs' performance in a couple of situations that seemed hopeless. Besides, they are very simple to understand and implement.
A trie is also called a prefix tree. It is usually used to optimize dictionary storage and lookup when the dictionary has a lot of entries and there is some overlap between them. The most obvious example is a normal English language dictionary. A lot of words have common stems ("work", "word", and "worry" all share the same beginning "wor"), and there are many word forms of the same word ("word", "words", "wording", "worded").
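The prefix sharing is easy to see in a quick Python sketch (a dict-based toy of my own, not the alist implementation we're about to write; the "$val" sentinel key is an arbitrary convention):

```python
def trie_add(root, key, val):
    """root is a dict of char -> child node; the value lives under a sentinel."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node["$val"] = val

def trie_lookup(root, key):
    node = root
    for ch in key:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$val")

d = {}
for word in ["work", "word", "words"]:
    trie_add(d, word, len(word))
assert trie_lookup(d, "word") == 4
assert trie_lookup(d, "wo") is None     # a shared prefix, but not a stored key
```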
There are many approaches to trie implementation. Let's discuss the most straightforward and, so to say, primitive one. Here is a trie for representing a string dictionary that is character-based and uses an alist to store the children pointers:
(defstruct (tr-node (:conc-name nil))
  val
  (children (list)))
(defun tr-lookup (key root)
  (dovec (ch key
             ;; when the iteration terminates normally,
             ;; we have found the node we were looking for
             (val root))
    (if-it (assoc1 ch (children root))
           (:= root it)
           (return))))
(defun tr-add (key val root)
  (let ((i 0))
    (dovec (ch key)
      (if-it (assoc1 ch (children root))
             (:= root it
                 i (1+ i))
             (return)))
    (if (= i (length key))
        ;; there was already something at key -
        ;; several actions might be taken:
        ;; signal an error (continuable), overwrite, abort
        (cerror "Assign a new value"
                "There was already a value at key: ~A" (val root))
        (dovec (ch (slice key i))
          (let ((child (make-tr-node)))
            (push (cons ch child) (children root))
            (:= root child))))
    (:= (val root) val)))
CL-USER> (defparameter *trie* (make-tr-node))
*TRIE*
CL-USER> *trie*
#S(TR-NODE :VAL NIL :CHILDREN NIL)
For the sake of brevity, we won't define a special print-function for our trie and will use the default one. In a real setting, though, it is highly advisable.
CL-USER> (tr-lookup "word" *trie*)
NIL
CL-USER> (tr-add "word" 42 *trie*)
42
CL-USER> *trie*
#S(TR-NODE
   :VAL NIL
   :CHILDREN ((#\w
               . #S(TR-NODE
                    :VAL NIL
                    :CHILDREN ((#\o
                                . #S(TR-NODE
                                     :VAL NIL
                                     :CHILDREN ((#\r
                                                 . #S(TR-NODE
                                                      :VAL NIL
                                                      :CHILDREN ((#\d
                                                                  . #S(TR-NODE
                                                                       :VAL 42
                                                                       :CHILDREN NIL)))))))))))))
CL-USER> (tr-lookup "word" *trie*)
42
CL-USER> (tr-add "word" :foo *trie*)
There was already a value at key: 42
   [Condition of type SIMPLE-ERROR]
Restarts:
 0: [CONTINUE] Assign a new value
 1: [RETRY] Retry SLIME REPL evaluation request.
 2: [*ABORT] Return to SLIME's top level.
 3: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {100F6297C3}>)
Backtrace:
  0: (TR-ADD "word" :FOO #S(TR-NODE :VAL 42 :CHILDREN NIL))
  1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (TR-ADD "word" :FOO *TRIE*) #<NULL-LEXENV>)
  2: (EVAL (TR-ADD "word" :FOO *TRIE*))
 --more--
;;; Take the restart 0
:FOO
CL-USER> (tr-add "work" :bar *trie*)
:BAR
CL-USER> (tr-add "we" :baz *trie*)
:BAZ
CL-USER> *trie*
#S(TR-NODE
   :VAL NIL
   :CHILDREN ((#\w
               . #S(TR-NODE
                    :VAL NIL
                    :CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))
                               (#\o
                                . #S(TR-NODE
                                     :VAL NIL
                                     :CHILDREN ((#\r
                                                 . #S(TR-NODE
                                                      :VAL NIL
                                                      :CHILDREN ((#\k
                                                                  . #S(TR-NODE
                                                                       :VAL :BAR
                                                                       :CHILDREN NIL))
                                                                 (#\d
                                                                  . #S(TR-NODE
                                                                       :VAL :FOO
                                                                       :CHILDREN NIL)))))))))))))
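The same structure is easy to mirror in Python for experimentation (a dict of child links instead of an alist; all names here are mine, and, unlike the Lisp version, this sketch silently overwrites an existing value instead of signaling a continuable error):

```python
def make_node():
    # each node holds a value slot and a dict of children keyed by character
    return {"val": None, "children": {}}

def tr_lookup(key, root):
    for ch in key:
        if ch not in root["children"]:
            return None
        root = root["children"][ch]
    return root["val"]

def tr_add(key, val, root):
    for ch in key:
        root = root["children"].setdefault(ch, make_node())
    root["val"] = val

trie = make_node()
tr_add("word", 42, trie)
tr_add("work", "bar", trie)
tr_add("we", "baz", trie)
```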
There are many ways to optimize this trie implementation. First of all, you can see that some space is wasted on intermediate nodes with no values. This is mended by Radix Trees (also known as Patricia Trees) that merge all intermediate nodes. I.e., our trie would change into the following more compact structure:
#S(TR-NODE
   :VAL NIL
   :CHILDREN ((#\w
               . #S(TR-NODE
                    :VAL NIL
                    :CHILDREN ((#\e . #S(TR-NODE :VAL :BAZ :CHILDREN NIL))
                               ("or"
                                . #S(TR-NODE
                                     :VAL NIL
                                     :CHILDREN ((#\k
                                                 . #S(TR-NODE
                                                      :VAL :BAR
                                                      :CHILDREN NIL))
                                                (#\d
                                                 . #S(TR-NODE
                                                      :VAL :FOO
                                                      :CHILDREN NIL))))))))))
Besides, there are ways to utilize an array to store trie offsets (similar to heaps), instead of using a linked backbone for it. Such a variant is called a succinct trie. Also, there are compressed (C-tries), hash-array mapped (HAMTs), and other kinds of tries.
The main advantage of tries is efficient space usage thanks to the elimination of repetition in key storage. In many scenarios, usage of tries also improves the speed of access. Consider the task of matching against a dictionary of phrases, for example, biological or medical terms, names of companies or works of art, etc. These are usually 2-3 word long phrases, but, occasionally, there may be an outlier of 10 or more words. The straightforward approach would be to put the dictionary into a hash-table, then iterate over the input string trying to find the phrases in the table, starting from each word. The only question is: where do we put the end of the phrase? As we said, the phrase may be from 1 to, say, 10 words in length. With a hash-table, we have to check every variant: a single-word phrase, a two-word one, and so on up to the maximum length. Moreover, if there are phrases with the same beginning, which is often the case, we'd do duplicate work of hashing that beginning, for each variant (unless we use an additive hash, but this isn't advised for hash-tables). With a trie, all that duplication is not necessary: we can iteratively match each word until we either find the match in the tree or discover that there is no continuation of the current subphrase.
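Here is a sketch of that matching procedure in Python, with a word-level trie built from a toy dictionary (the function names and the "$end" marker are my own conventions, not from the book):

```python
def build_trie(phrases):
    # a word-level trie: nested dicts, with "$end" marking a complete phrase
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node["$end"] = phrase
    return root

def match_at(words, i, trie):
    # walk the trie word by word, remembering the longest finished phrase;
    # a single pass replaces hashing every candidate phrase length separately
    best, node = None, trie
    for word in words[i:]:
        if word not in node:
            break
        node = node[word]
        best = node.get("$end", best)
    return best

phrases = ["heart attack", "heart attack risk factor", "risk factor"]
trie = build_trie(phrases)
words = "a heart attack risk factor study".split()
```

Starting from position 1, the walk matches the 4-word phrase in one pass instead of probing a hash-table with 1-, 2-, 3-, and 4-word prefixes.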
Trees in Action: Efficient Mapping
Finally, the last family of tree data structures I have to mention is trees for representing spatial relations. Overall, mapping and pathfinding is an area that prompted the creation of a wide range of useful algorithms and data structures. There are two fundamental operations for processing spatial data: nearest neighbor search and range queries. Given the points on the plane, how do we determine the closest points to a particular one? How do we retrieve all points inside a rectangle or a circle? A primitive approach is to loop through all the points and collect the relevant information, which results in at least O(n) complexity — prohibitively expensive if the number of points is beyond several tens or hundreds. And such problems, by the way, arise not only in the field of processing geospatial data (they are at the core of such systems as PostGIS, mapping libraries, etc.) but also in Machine Learning (for instance, the kNN algorithm directly requires such calculations) and other areas.
A more efficient solution has an O(log n) complexity and is, as you might expect, based on indexing the data in a special-purpose tree. The changes to the tree will also have O(log n) complexity, while the initial indexing is O(n log n). However, in most of the applications that use this technique, changes are much less frequent than read operations, so the upfront cost pays off.
There is a number of trees that allow efficient storage of spatial data: segment trees, interval trees, k-d trees, R-trees, etc. The most common spatial data structure is an R-tree (rectangle-tree). It distributes all the points in an n-dimensional space (usually, n will be 2 or 3) among the leaves of the tree by recursively dividing the space into k rectangles holding roughly the same number of points until each tree node has at most k points. Let's say we have started from 1000 points on the plane and chosen k to be 10. In this case, the first level of the tree (i.e. the children of the root) will contain 10 nodes, each one having as its value the dimensions of the rectangle that bounds approximately 100 points. Every node like that will have 10 more children, each one with around 10 points. Maybe, some will have more, and, in this case, we'll give those nodes 10 children each with, probably, 1 or 2 points in the rectangles they will command. Now, we can perform a range search with the obtained tree by selecting only the nodes that intersect the query rectangle. For a small query box, this approach will result in the discarding of the majority of the nodes at each level of the tree. So, a range search over an R-tree has O(k log n) complexity, where k is the number of intersecting rectangles.
Now, let's consider neighbor search. Obviously, the closest points to the one we are examining lie either in the same rectangle as the point or in the rectangles closest to it. So, we need to, first, find the smallest bounding rectangle containing our point and perform the search in it; then, if we haven't got enough points yet, process the siblings of the current tree node in the order of their proximity to it.
There are many other spatial problems that may be efficiently solved with this approach. One thing to note is that the described procedures require the tree to store, in the leaf nodes, references to every point contained in their rectangles.
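The pruning step of the range search can be sketched in a few lines (Python, with a single level of leaf rectangles standing in for the full tree; names and data are mine, for illustration only):

```python
def intersects(a, b):
    # rectangles are (xmin, ymin, xmax, ymax)
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def range_query(leaves, box):
    # leaves: a list of (bounding-rectangle, points-inside-it) pairs;
    # rectangles that miss the query box are discarded wholesale,
    # which is where the R-tree saves its work
    rez = []
    for rect, points in leaves:
        if intersects(rect, box):
            rez.extend(p for p in points
                       if box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3])
    return rez

leaves = [((0, 0, 5, 5), [(1, 1), (4, 4)]),
          ((10, 10, 20, 20), [(12, 12)])]
```

In the real tree, the same intersection test is applied recursively at every level, so whole subtrees are skipped at once.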
Takeaways
So, balancing a tree isn't such a unique and interesting task. On the contrary, it's quite simple yet boring due to the number of edge cases you have to account for. Yet, we have just scratched the surface of the general topic of trees. It is vast: the Wikipedia section for tree data structures contains almost 100 of them and it's, definitely, not complete. Moreover, new tree variants will surely be invented in the future. But you will hardly deal with more than just a few variants during the course of your career, spending the majority of time with the simple "unconstrained" trees. And we have seen, in action, the basic principles of tree operation that will be helpful, in the process.
There are a couple of other general observations about programming algorithms we can draw from this chapter:
- Trees are very versatile data structures that are a default choice when you need to represent some hierarchy. They are also one of a few data structures for which recursive processing is not only admissible but also natural and efficient.
- Visualization is key to efficient debugging of complex data structures. It's hard to show in the book, but I have spent several hours on the code for the splay tree, and without an efficient way to display the trees, coupled with dynamic tracing, I would probably have spent twice as much. Both the print-function for the individual node and pprint-bst were helpful here.
[1] This statement is strictly true for balanced trees but, even for imbalanced trees, such estimation is usually correct.
[2] Although it was shown that this value may also be reduced to a single bit using a clever implementation trick.
Lispers.de — Berlin Lispers Meetup, Monday, 30th September 2019
@2019-09-27 09:30 · 70 days ago
We meet again on Monday 8pm, 30th September. Our host this time is James Anderson (www.dydra.com).
Berlin Lispers is about all flavors of Lisp including Common Lisp, Clojure and Scheme.
Max-Gerd Retzlaff will talk about his art installation "Other Dimension" [1] as part of the "Gewebtes Licht" exhibition in 2011 and how he used Embeddable Common Lisp for it.
[1] http://otherdimension.net/ in collaboration with Alex Wenger
We meet in the Taut-Haus at Engeldamm 70 in Berlin-Mitte, the bell is "James Anderson". It is located in 10 min walking distance from U Moritzplatz or U Kottbusser Tor or Ostbahnhof. In case of questions call Christian +49 1578 70 51 61 4.
Vsevolod Dyomkin — Programming Algorithms: Hash-Tables
@2019-09-11 19:25 · 86 days ago
Now, we can move on to studying advanced data structures which are built on top of the basic ones such as arrays and lists, but may exhibit distinct properties, have different use cases, and special algorithms. Many of them will combine the basic data structures to obtain new properties not accessible to the underlying structures. The first and most important of these advanced structures is, undoubtedly, the hash-table. However vast the list of candidate structures for representing key-values may be, hash-tables are the default choice for implementing them.
Also, hash-sets, in general, serve as the main representation for medium and large-sized sets, as they ensure O(1) membership test, as well as optimal set-theoretic operations complexity. A simple version of a hash-set can be created using a normal hash-table with t for all values.
Implementation
The basic properties of hash-tables are average O(1) access and support for arbitrary keys. These features can be realized by storing the items in an array at indices determined by a specialized function that maps the keys in a pseudorandom way — hashes them. Technically, the keys should pertain to a domain that allows hashing, but, in practice, this is always possible to ensure, either directly or by using an intermediate transformation. The choice of variants for the hash-function is rather big, but there are some limitations to keep in mind:
- As the backing array has a limited number of cells (n), the function should produce values in the interval [0, n). This limitation can be respected by a 2-step process: first, produce a number in an arbitrary range (for instance, a 32-bit integer) and then take the remainder of its division by n.
- Ideally, the distribution of indices should be uniform, but similar keys should map to quite distinct indices. I.e. hashing should turn things which are close into things which are distant. This way, even very small changes to the input will yield sweeping changes in the value of the hash. This property is called the "avalanche effect".
Dealing with Collisions
Even better would be if there were no collisions — situations when two or more keys are mapped to the same index. Is that, at all, possible? Theoretically, yes, but all the practical implementations found so far are too slow and not feasible for a hash-table that is dynamically updated. However, such approaches may be used if the keyset is static and known beforehand. They will be covered in the discussion of perfect hash-tables.
For dynamic hash-tables, we have to accept that collisions are inevitable. The probability of collisions is governed by an interesting phenomenon called "The Birthday Paradox". Let's say, we have a group of people of some size, for instance, 20. What is the probability that two of them have birthdays on the same date? It may seem quite improbable, considering that there are 365 days in a year and we are talking just about a handful of people. But if you take into account that we need to examine each pair of people to learn about their possible birthday collision that will give us (/ (* 20 19) 2), i.e. 190 pairs. We can calculate the exact probability by taking the complement to the probability that no one has a birthday collision, which is easier to reason about. The probability that two people don't share their birthday is (/ (- 365 1) 365): there's only 1 chance in 365 that they do. For three people, we can use the chain rule and state that the probability that they don't have a birthday collision is a product of the probability that any two of them don't have it and that the third person also doesn't share a birthday with any of them. This results in (* (/ 364 365) (/ (- 365 2) 365)). The value (- 365 2) refers to the third person not having a birthday intersection with either the first or the second individually, and those are distinct, as we have already asserted in the first term. Continuing in such fashion, we can count the number for 20 persons:
(defun birthday-collision-prob (n)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- 365 i) 365)))
    ;; don't forget that we want the complement
    ;; of the probability of no collisions,
    ;; hence (- 1.0 ...)
    (- 1.0 rez)))
CL-USER> (birthday-collision-prob 20)
0.4114384
So, among 20 people, there's already a 40% chance of observing a coinciding birthday. And this number grows quickly: it will become 50% at 23, 70% at 30, and 99.9% at just 70!
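A direct translation of the calculation into Python confirms these numbers (a cross-check sketch, not part of the book's code):

```python
def birthday_collision_prob(n):
    rez = 1.0
    for i in range(n):
        rez *= (365 - i) / 365
    # the complement of the probability of no collisions
    return 1.0 - rez
```

birthday_collision_prob(20) gives 0.411..., and the probability indeed crosses 50% at 23 people.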
But why, on Earth, you could ask, have we started to discuss birthdays? Well, if you substitute keys for persons and the array size for the number of days in a year, you'll get the formula for the probability of at least one collision among the hashed keys in an array, provided the hash-function produces perfectly uniform output. (It will be even higher if the distribution is non-uniform.)
(defun hash-collision-prob (n size)
  (let ((rez 1))
    (dotimes (i n)
      (:* rez (/ (- size i) size)))
    (- 1.0 rez)))
Let's say, we have 10 keys. What should be the array size to be safe against collisions?
CL-USER> (hash-collision-prob 10 10)
0.9996371
99.9%. OK, we don't stand a chance to accidentally get a perfect layout. :( What if we double the array size?
CL-USER> (hash-collision-prob 10 20)
0.9345271
93%. Still, pretty high.
CL-USER> (hash-collision-prob 10 100)
0.37184352
CL-USER> (hash-collision-prob 10 10000)
0.004491329
So, only if we were to use a 10k-element array to store 10 items would the chance of a collision fall below 1%. Not practical...
Note that the number depends on both arguments, so (hash-collision-prob 10 100) (0.37) is not the same as (hash-collision-prob 20 200) (0.63).
We did this exercise to completely abandon any hope of avoiding collisions and accept that they are inevitable. Such mind/coding experiments may be an effective smoke test of our novel algorithmic ideas: before we go full speed and implement them, it makes sense to perform some back-of-the-envelope feasibility calculations.
Now, let's discuss what difference the presence of these collisions makes to our hashtable idea and how to deal with this issue. The obvious solution is to have a fallback option: when two keys hash to the same index, store both of the items in a list. The retrieval operation, in this case, will require a sequential scan to find the requested key and return the corresponding value. Such an approach is called "chaining" and it is used by some implementations. Yet, it has a number of drawbacks:
- It complicates the implementation: we now have to deal with both a static array and a dynamic list/array/tree. This change opens a possibility for some hard-to-catch bugs, especially in concurrent settings.
- It requires more memory than just the hash-table backing array, and we will be in a situation when some of the slots of the array are empty while others chain several elements.
- It will have poor performance due to the necessity of dealing with a linked structure and, what's worse, not respecting cache locality: the chain will not fit in the original array, so at least one additional RAM round-trip will be required.
One upside of this approach is that it can store more elements than the size of the backing array. And, in the extreme case, it degrades to bucketing: when a small number of buckets point to long chains of randomly shuffled elements.
The more widely-used alternative to chaining is called "open addressing" or "closed hashing". With it, the chains are, basically, stored in the same backing array. The algorithm is simple: when the calculated hash is pointing at an already occupied slot in the array, find the next vacant slot by cycling over the array. If the table isn't full, we're guaranteed to find one. If it is full, we need to resize it, first. Now, when the element is retrieved by key, we need to perform the same procedure: calculate the hash, then compare the key of the item at the returned index. If the keys are the same, we've found the desired element; otherwise, we need to cycle over the array comparing keys until we encounter the item we need.
Here's an implementation of a simple open addressing hash-table using eql for key comparison:
(defstruct ht
  array
  (count 0))

(defun ht (&rest kvs)
  (let ((rez (make-ht :array (make-array 16))))
    (loop :for (k v) :in kvs :do
      (ht-add k v rez))
    rez))
(defun ht-get (key ht)
  (with ((size (length @ht.array))
         (start (rem (hash key) size)))
    (do ((count 0 (1+ count))
         (i start (rem (1+ i) size))
         (item (? ht 'array start)
               (? ht 'array i)))
        ((or (null item)
             (= count size)))
      (when (eql key (car item))
        (return
         (values (cdr item)
                 ;; the second value is the index at which
                 ;; the item was found (also used to distinguish
                 ;; the value nil from not found, which is also
                 ;; represented by nil but without the second value)
                 i))))))
(defun ht-add (key val ht)
  (with ((array (ht-array ht))
         (size (length array)))
    ;; flet defines a local function that has access
    ;; to the local variables defined in HT-ADD
    (flet ((add-item (k v)
             (do ((i (rem (hash k) size)
                     (rem (1+ i) size)))
                 ((null @ht.array#i)
                  (:= @ht.array#i (cons k v)))
               ;; this do-loop doesn't have a body
               )))
      ;; TALLY is a generic function for size retrieval, from RUTILS
      (when (= (tally ht) size)
        ;; when the backing array is full,
        ;; expand it to have the length equal to the next power of 2
        (:= size (expt 2 (ceiling (log (1+ (tally ht)) 2)))
            @ht.array (make-array size))
        ;; and re-add its contents
        (dovec (item array)
          (add-item (car item) (cdr item))))
      ;; finally, add the new item
      (add-item key val))))
(defun ht-rem (key ht)
  ;; here, we use the index of the item
  ;; returned as the 2nd value of HT-GET
  ;; (when-it is a so-called anaphoric macro, from RUTILS,
  ;; that assigns the value of its first argument
  ;; to an implicitly created variable IT
  ;; and evaluates the body when IT isn't null)
  (when-it (nth-value 1 (ht-get key ht))
    (void (? ht 'array it))
    ;; return the index to indicate that the item was found
    it))
To avoid constant resizing of the hash-table, just as with dynamic arrays, the backing array is, usually, allocated to have the size equal to a power of 2: 16 elements, to begin with. When it is filled up to a certain capacity, it is resized to the next power of 2: 32, in this case. Usually, around 70-80% is considered peak occupancy as too many collisions happen afterward and the table access performance severely degrades. In practice, this means that normal open-addressing hash-tables also waste from 20 to 50 percent of allocated space. This inefficiency becomes a serious problem with large tables, so other implementation strategies become preferable when the size of data reaches tens and hundreds of megabytes. Note that, in our trivial implementation above, we have, effectively, used the threshold of 100% to simplify the code. Adding a configurable threshold is just a matter of introducing a parameter and initiating resizing not when (= (ht-count ht) size) but upon (= (ht-count ht) (floor (* size threshold))). As we've seen, resizing the hash-table requires calculating the new indices for all stored elements and adding them anew into the resized array.
Analyzing the complexity of the access function of the hash-table and proving that it is amortized O(1) isn't trivial. It depends on the properties of the hash-function, which should ensure good uniformity. Besides, the resizing threshold also matters: the more elements are in the table, the higher the chance of collisions. Also, you should keep in mind that if the keys possess some strange qualities that prevent them from being hashed uniformly, the theoretical results will not hold.
In short, if we consider a hash-table with 60% occupancy (which should be the average number, for a common table) we end up with the following probabilities:
- probability that we'll need just 1 operation to access the item (i.e. the initially indexed slot is empty): 0.4
- probability that we'll need 2 operations (the current slot is occupied, the next one is empty): (* 0.6 0.4) — 0.24
- probability that we'll need 3 operations: (* (expt 0.6 2) 0.4) — 0.14
- probability that we'll need 4 operations: (* (expt 0.6 3) 0.4) — 0.09
Actually, these calculations are slightly off and the correct probability of finding an empty slot should be somewhat lower, although the larger the table is, the smaller the deviation in the numbers. Finding out why is left as an exercise for the reader :)
As you see, there's a progression here. With probability around 0.87, we'll need no more than 4 operations. Without continuing with the arithmetic, I think, it should be obvious that we'll need, on average, around 3 operations to access each item, and the probability that we'll need twice as many (6) is quite low (below 5%). So, we can say that the number of access operations is constant (i.e. independent of the number of elements in the table) and is determined only by the occupancy percent. So, if we keep the occupancy within the reasonable bounds named earlier, on average, 1 hash code calculation/lookup and a couple of retrievals and equality comparisons will be needed to access an item in our hash-table.
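These probabilities form a geometric progression, so the figures are easy to verify numerically. Here is a back-of-the-envelope check in Python, using the same simplified model as above, where each probed slot is independently occupied with probability 0.6 (my sketch, not from the book):

```python
a = 0.6                                          # occupancy
ks = range(1, 100)
probs = [a ** (k - 1) * (1 - a) for k in ks]     # P(exactly k probes)
at_most_4 = sum(probs[:4])                       # ~0.87, as stated above
expected = sum(k * p for k, p in zip(ks, probs)) # sums to 1/(1-a) = 2.5
```

The simplified model's expectation, 1/(1-0.6) = 2.5 probes, is of the same order as the "around 3 operations" estimate in the text.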
Hash-Code
So, we can conclude that a hash-table is primarily parametrized by two things: the hash-function and the equality predicate. In Lisp, in particular, there's a choice of just the four standard equality predicates: eq, eql, equal, and equalp. It's somewhat of a legacy that you can't use other comparison functions, so some implementations, as an extension, allow the programmer to specify other predicates. However, in practice, the following approach is sufficient for the majority of hash-table use cases:
- use the eql predicate if the keys are numbers, characters, or symbols
- use equal if the keys are strings or lists of the mentioned items
- use equalp if the keys are vectors, structs, CLOS objects or anything else containing one of those
But I'd recommend trying your best to avoid using complex keys requiring equalp. Besides the performance penalty of using the heaviest equality predicate that performs deep structural comparison, structs and vectors, in particular, will most likely hash to the same index. Here is a quote from one of the implementors describing why this happens:
Structs have no extra space to store a unique hash code within them. The decision was made to implement this because automatic inclusion of a hashing slot in all structure objects would have made all structs an average of one word longer. For small structs this is unacceptable. Instead, the user may define a struct with an extra slot, and the constructor for that struct type could store a unique value into that slot (either a random value or a value gotten by incrementing a counter each time the constructor is run). Also, create a hash generating function which accesses this hash-slot to generate its value. If the structs to be hashed are buried inside a list, then this hash function would need to know how to traverse these keys to obtain a unique value. Finally, then, build your hash-table using the :hash-function argument to make-hash-table (still using the equal test argument), to create a hash-table which will be well-distributed. Alternatively, and if you can guarantee that none of the slots in your structures will be changed after they are used as keys in the hash-table, you can use the equalp test function in your make-hash-table call, rather than equal. If you do, however, make sure that these struct objects don't change, because then they may not be found in the hash-table.
But what if you still need to use a struct or a CLOS object as a hash key (for instance, if you want to put them in a set)? There are three possible workarounds:
- Choose one of their slots as a key (if you can guarantee its uniqueness).
- Add a special slot to hold a unique value that will serve as a key.
- Use the literal representation obtained by calling the print-function of the object. Still, you'll need to ensure that it will be unique and constant. Using an item that changes while being a hash key is a source of very nasty bugs, so avoid it at all costs.
These considerations are also applicable to the question of why Java requires defining both equals and hashCode methods for objects that are used as keys in a hash-table or hash-set.
Advanced Hashing Techniques
Beyond the direct implementation of open addressing, called "linear probing" (for it tries to resolve collisions by performing a linear scan for an empty slot), a number of approaches were proposed to improve hash distribution and reduce the collision rate. However, for the general case, their superiority remains questionable, and so the utility of a particular approach has to be tested in the context of the situations when linear probing demonstrates suboptimal behavior. One type of such situation occurs when the hash-codes become clustered near some locations due to deficiencies of either the hash-function or the key set.
The simplest modification of linear probing is called "quadratic probing". It operates by performing the search for the next vacant slot using the linear probing offsets (or some other sequence of offsets) raised to the power of 2. I.e. if, with linear probing, the offset sequence was 1, 2, 3, etc., with the quadratic one, it is 1, 4, 9, ... "Double hashing" is another simple alternative, which, instead of a linear sequence of offsets, calculates the offsets using another hash-function. This approach makes the sequence specific to each key, so the keys that map to the same location will have different possible variants of collision resolution. "2-choice hashing" also uses 2 hash-functions but selects the particular one for each key based on the distance from the original index it has to be moved for collision resolution.
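The only difference between these strategies is the probe sequence itself, which can be made explicit with a small sketch (Python; function names are mine, for illustration):

```python
def linear_probes(h, size, n):
    # linear probing: offsets 0, 1, 2, 3, ...
    return [(h + i) % size for i in range(n)]

def quadratic_probes(h, size, n):
    # quadratic probing: offsets 0, 1, 4, 9, ...
    return [(h + i * i) % size for i in range(n)]

def double_hash_probes(h1, h2, size, n):
    # double hashing: the step comes from a second hash, so it varies per key
    return [(h1 + i * h2) % size for i in range(n)]
```

Two keys that collide at the same initial index follow identical linear and quadratic sequences but, with double hashing, usually diverge after the first probe.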
More elaborate changes to the original idea are proposed in Cuckoo, Hopscotch, and Robin Hood hashing, to name some of the popular alternatives. We won't discuss them now, but if the need arises to implement a non-standard hash-table, it's worth studying all of those before proceeding with an idea of your own. Although, who knows, someday you might come up with a viable alternative technique, as well...
Hash-Functions
The class of possible hash-functions is very diverse: any function that sufficiently randomizes the key hashes will do. But what does "good enough" mean? One of the ways to find out is to look at the pictures of the distribution of hashes. Yet, there are other factors that may condition the choice: speed, complexity of implementation, and collision resistance (important for cryptographic hashes that we won't discuss in this book).
The good news is that, for most practical purposes, there's a single function that is both fast and easy to implement and understand. It is called FNV-1a.
(defparameter *fnv-primes*
  '((32 . 16777619)
    (64 . 1099511628211)
    (128 . 309485009821345068724781371)
    (256 . 374144419156711147060143317175368453031918731002211)))

(defparameter *fnv-offsets*
  '((32 . 2166136261)
    (64 . 14695981039346656037)
    (128 . 144066263297769815596495629667062367629)
    (256 . 100029257958052580907070968620625704837092796014241193945225284501741471925557)))
(defun fnv-1a (x &key (bits 32))
  (assert (member bits '(32 64 128 256)))
  (let ((rez (assoc1 bits *fnv-offsets*))
        (prime (assoc1 bits *fnv-primes*)))
    (dotimes (i (/ bits 8))
      (:= rez (ldb (byte bits 0)
                   (* (logxor rez (ldb (byte 8 (* i 8)) x))
                      prime))))
    rez))
The constants *fnv-primes* and *fnv-offsets* are precalculated up to 1024 bits (here, I used just a portion of the tables).
Note that, in this implementation, we use normal Lisp multiplication (*) that is not limited to fixed-size numbers (32-bit, 64-bit, ...), so we need to extract only the first bits with ldb.
Also note that if you were to calculate FNV-1a with some online hash calculator you'd, probably, get a different result. Experimenting with it, I noticed that the results match if we use only the non-zero bytes of the input number. This observation aligns well with calculating the hash for simple strings, where each character is a single byte. For them, the hash-function would look like the following:
(defun fnv-1a-str (str)
  (let ((rez (assoc1 32 *fnv-offsets*))
        (prime (assoc1 32 *fnv-primes*)))
    (dovec (char str)
      (:= rez (ldb (byte 32 0)
                   (* (logxor rez (char-code char))
                      prime))))
    rez))
So, even such a simple hash-function has nuances in its implementation, and it should be meticulously checked against some reference implementation or a set of expected results.
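For instance, the 32-bit FNV-1a over raw bytes is small enough to redo in Python as a cross-check against the published test vectors (hashing "a" should yield #xE40C292C):

```python
def fnv_1a_32(data: bytes) -> int:
    rez = 2166136261                # the 32-bit FNV offset basis
    for b in data:
        # xor in a byte, multiply by the 32-bit FNV prime, keep the low 32 bits
        rez = ((rez ^ b) * 16777619) & 0xFFFFFFFF
    return rez
```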
Alongside FNV-1a, there's also FNV-1, which is a slightly worse variation, but it may be used if we need to apply 2 different hash-functions at once (as in 2-way or double hashing).
What is the source of the hashing property of FNV-1a? Xors and modulos. Combining these simple and efficient operations is enough to create the desired level of randomization. Most of the other hash-functions use the same building blocks as FNV-1a. They all perform arithmetic (usually, addition and multiplication, as division is slow) and xor'ing, adding into the mix some prime numbers. For instance, here's what the code for another popular hash-function, djb2, approximately looks like:
(defun djb2-str (str)
  (let ((rez 5381)) ; a DJB2 prime number
    (loop :for char :across str :do
      (:= rez (ldb (byte 32 0)
                   (+ (char-code char)
                      (ldb (byte 32 0)
                           (+ (ash rez 5)
                              rez))))))
    rez))
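The same function is a short loop in Python, again clamped to 32 bits like the Lisp version above (a sketch for cross-checking):

```python
def djb2_str(s: str) -> int:
    rez = 5381                      # the DJB2 starting constant
    for char in s:
        # rez * 33 + char, computed as (rez << 5) + rez, modulo 2^32
        rez = (ord(char) + ((rez << 5) + rez)) & 0xFFFFFFFF
    return rez
```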
Operations
The generic key-value operations we have discussed in the previous chapter, obviously, also apply to hash-tables. There are also specific low-level ones, defined by the Lisp standard. And it's worth mentioning that, in regards to hash-tables, I find the standard quite lacking, so a lot of utilities were added as part of RUTILS. The reason for the deficiency in the standard is, I believe, that when hash-tables were added to Lisp they were still a pretty novel technology, not widely adopted in the programming-language community. So there was neither any significant experience using them nor a good understanding of the important role they would play. Languages such as Python or Clojure, as well as the ones that were designed even later, were developed with this knowledge already in mind. Yet, this situation doesn't pose an insurmountable difficulty for Lisp users, as the language provides advanced extension tools such as macros and reader macros, so the necessary parts can be added and, in fact, exist as 3rd-party extensions. Using them becomes just a question of changing your habits and adapting to more efficient approaches. The situation is different for the users of many other languages, such as Java, who had to wait for a new major version of the language to get access to such things as literal hash-table initialization — a feature I consider crucially important to improving code clarity, in the declarative paradigm.
Initialization
Normally, the hash-table can be created with make-hash-table, which has a number of configuration options, including :test (default: eql). Most of the implementations allow the programmer to make synchronized (thread-safe) hash-tables via another configuration parameter, but the variants of concurrency control will differ.
Yet, it is important to have a way to define hash-tables already pre-initialized with a number of key-value pairs, and make-hash-table can't handle this. Pre-initialized hash-tables are a common necessity for tables serving as dictionaries, and such pre-initialization greatly simplifies many code patterns. Thus RUTILS provides such a syntax (in fact, in 2 flavors) with the help of reader macros:
#{equal "foo" :bar "baz" 42}
#h(equal "foo" :bar "baz" 42)
Both of these expressions will expand into a call to make-hash-table with the equal test and two calls to a set operation to populate the table with the kv-pairs "foo" :bar and "baz" 42. For this to work, you need to switch to the appropriate readtable by executing: (named-readtables:in-readtable rutils-readtable).
The reader-macro to parse #h()-style literal hash-tables isn't very complicated. As all reader-macros, it operates on the character stream of the program text, processing one character at a time. Here is its implementation:
(defun |#h-reader| (stream char arg)
  (declare (ignore char arg))
  (read-char stream)  ; skip the open paren
  ;; we can also add a sanity check to ensure that this character
  ;; is indeed a #\(
  (with (;; read-delimited-list is a standard library function
         ;; that reads items until a delimiter is encountered
         ;; and then returns them as a list of parsed Lisp objects
         (sexp (read-delimited-list #\) stream t))
         ;; the idea is that the first element may be a hash-table
         ;; test function; in this case, the number of items in the
         ;; definition will be odd as each key-value pair adds
         ;; an even number of elements
         (test (when (oddp (length sexp))
                 (first sexp)))
         ;; the rest of the values, after the possible test function,
         ;; are key-value pairs
         (kvs (group 2 (if test (rest sexp) sexp)))
         (ht (gensym)))
    `(let ((,ht (make-hash-table :test ',(or test 'eql))))
       ;; for each key-value pair, generate an expression
       ;; that adds the value for the key to the resulting hash-table
       ,@(mapcar (lambda (kv)
                   `(:= (? ,ht ,(first kv)) ,(second kv)))
                 kvs)
       ,ht)))
After such a function is defined, it can be plugged into the standard readtable:
(set-dispatch-macro-character #\# #\h '|#h-reader|)
Or it may be used in a named-readtable (you can learn how to do that from the docs).
print-hash-table
is the utility to perform the reverse operation: display hash-tables in a similar manner:
RUTILS> (print-hash-table #h(equal "foo" :bar "baz" 42))
#{EQUAL
  "foo" :BAR
  "baz" 42
}
#<HASH-TABLE :TEST EQUAL :COUNT 2 {10127C0003}>
The last line of the output is the default Lisp printed representation of the hash-table. As you can see, it is opaque and doesn't display the elements of the table. RUTILS also allows switching to printing the literal representation instead of the standard one with the help of toggle-print-hash-table. However, this extension is intended only for debugging purposes as it is not fully standard-conforming.
Access
Accessing hash-table elements is performed with gethash, which returns two values: the value at the key and t when the key was found in the table, or two nils otherwise. By using (:= (gethash key ht) val) (or (:= (? ht key) val)) we can modify the stored value. Notice the reverse order of arguments of gethash compared to the usual order in most accessor functions, where the structure is placed first and the key second. However, gethash differs from the generic ? in that it accepts an optional argument that is used as the default value if the requested key is not present in the table. In some languages, like Python, there's a notion of "default hash-tables" that may be initialized with a common default element. In Lisp, a different approach is taken. However, it's possible to easily implement default hash-tables and plug them into the generic-elt mechanism:
(defstruct default-hash-table
  table
  default-value)

(defun gethash-default (key ht)
  (gethash key (? ht 'table) (? ht 'default-value)))

(defmethod generic-elt ((kv default-hash-table) key &rest keys)
  (declare (ignore keys))
  (gethash-default key kv))
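For comparison, the Python "default hash-tables" mentioned above look like this (a minimal illustration using only the standard library):

```python
from collections import defaultdict

# A default hash-table: a missing key is initialized on first access
# with the value produced by the factory function (here int, i.e. 0).
counts = defaultdict(int)
for word in ["a", "b", "a"]:
    counts[word] += 1  # no KeyError for absent keys

# A plain dict achieves the same per call with an explicit default,
# which is the closest analog of gethash's optional argument.
value = {"a": 2}.get("c", 0)  # -> 0, since "c" is absent
```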
RUTILS also defines a number of aliases/shorthands for hash-table operations. As the # symbol is etymologically associated with hashes, it is used in the names of all these functions:

- get# is a shorthand and a more distinctive alias for gethash
- set# is an alias for (:= (gethash ...
- getset# is an implementation of the common pattern: this operation either retrieves the value, if the key is found in the table, or calculates its third argument, returns it, and also sets it for the given key for future retrieval
- rem# is an alias for remhash (remove the element from the table)
- take# both returns the value for the key and removes it (unlike rem#, which only removes)
- in# tests for the presence of the key in the table

Also, p# is an abbreviated version of print-hash-table.
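The getset# pattern (retrieve-or-compute-and-store) deserves spelling out; here is a hedged sketch of the same idea in Python (the helper name get_set is mine, not from any library):

```python
def get_set(table, key, compute):
    """Return table[key] if present; otherwise call compute(),
    store the result under key for future retrieval, and return it."""
    if key in table:
        return table[key]
    value = compute()
    table[key] = value
    return value

cache = {}
v1 = get_set(cache, "answer", lambda: 42)  # computed and stored
v2 = get_set(cache, "answer", lambda: 0)   # served from the table
```

This is the standard memoization building block: the second call never invokes the compute function.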
Iteration
Hash-tables are, in principle, unordered collections. But, still, there is always a way to iterate over them in some (unspecified) order. The standard utility for that is either maphash, which unlike map doesn't populate a resulting collection and is called just for the side effects, or the special loop syntax. Both are suboptimal, from several points of view, so RUTILS defines a couple of alternative options:

- dotable functions in the same manner as dolist, except that it uses two variables: one for the key and one for the value
- mapkv, mentioned in the previous chapter, works just like mapcar: it creates a new result table with the same configuration as the hash-table it iterates over and assigns to it the results of invoking the first argument (a function of two elements) on each of the kv-pairs
Despite the absence of a predefined ordering, there are ways in which some order may be introduced. For example, in SBCL, the order in which the elements were added is preserved by using additional vectors called index-vector and next-vector that store this information. Another option, which allows forcing an arbitrary ordering, is to use the so-called Linked Hash-Table. It is a combination of a hash-table and a linked list: each key-value pair also has a next pointer, which links it to some other item in the table. This way, it is possible to have ordered key-values without resorting to tree-based structures. A poor man's linked hash-table can be created on top of a normal one with the following trick: substitute the values by pairs containing a value plus a pointer to the next pair and keep track of the pointer to the first pair in a special slot.
(defstruct linked-hash-table-item
  key
  val
  next)

(defstruct linked-hash-table
  table
  head
  tail)

(defun gethash-linked (key ht)
  (? ht 'table key 'val))
(defun sethash-linked (key ht val)
  ;; The initial order of items is the order of addition.
  ;; If we'd like to impose a different order,
  ;; we'll have to perform reordering after each addition
  ;; or implement a custom sethash function.
  (with (((table head tail) ? ht)
         (cur (gethash key table)))
    (if cur
        (:= (? cur 'val) val)
        (let ((new (make-linked-hash-table-item
                    :key key :val val)))
          ;; store the new item in the backing table
          (:= (gethash key table) new)
          (when (null head)
            (:= (? ht 'head) new))
          (:= (? ht 'tail)
              (if tail
                  (:= (? ht 'tail 'next) new)
                  new))))))
(defmethod mapkv (fn (ht linked-hash-table))
  (with ((rez (make-linked-hash-table
               :table (make-hash-table
                       :test (hash-table-test (? ht 'table))))))
    (do ((item (? ht 'head) (? item 'next)))
        ((null item))
      (sethash-linked (? item 'key) rez
                      (call fn (? item 'key) (? item 'val))))
    rez))
The issue with this approach, as you can see from the code, is that we also need to store the key, which duplicates the data already stored in the backing hash-table itself. So, an efficient linked hash-table has to be implemented from scratch, using an array as a base instead of a hash-table.
Perfect Hashing
In the previous exposition, we have concluded that using hash-tables implies a significant level of reserved unused space (up to 30%) and inevitable collisions. Yet, if the keyset is static and known beforehand, we can do better: find a hash-function that will exclude collisions (simple perfect hashing) and even totally get rid of reserved space (minimal perfect hashing, MPH). Although the latter variant will still need extra space to store additional information about the hash-functions, it may be much smaller: in some methods, down to ~3-4 bits per key, so just 5-10% overhead. Statistically speaking, constructing such a hash-function is possible. But the search for its parameters may require some trial and error.
Implementation
The general idea is simple, but how to find the appropriate hash-function? There are several approaches described in sometimes hard-to-follow scientific papers and a number of cryptic programs in low-level C libraries. At a certain point in time, I needed to implement some variant of an MPH, so I read those papers and studied the libraries to some extent. Not the most pleasant process, I should confess. One of my twitter pals once wrote: "Looks like it's easier for people to read 40 blog posts than a single whitepaper." And, although he was putting a negative connotation on it, I recognized the statement as a very precise description of what a research engineer does: read a whitepaper (or a dozen, for what it's worth) and transform it into working code and, as a possible byproduct, into an explanation ("blog post") that other engineers will understand and be able to reproduce. And that is not a skill every software developer can be expected to easily exercise. Not all papers can even be reproduced, because the experiment was not set up correctly, some parts of the description are missing, the data is not available, etc. Of those that, in principle, can be, only some are presented in a form that is clear enough to be reliably programmed.
Here is one of the variants of minimal perfect hashing that possesses such qualities. It works for datasets of any size as a 3-step process:

1. At the first stage, by the use of a common hash-function (in particular, the Jenkins hash), all keys are near-uniformly distributed into buckets, so that the number of keys in each bucket doesn't exceed 256. This can be achieved with very high probability if the hash divisor is set to (ceiling (length keyset) 200). This allows the algorithm to work for datasets of arbitrary size, thereby reducing the problem to a simpler one that already has a known solution.
2. Next, for each bucket, a perfect hash function is constructed. This function is a table (and it's an important mathematical fact that each discrete function is equivalent to a table, albeit, potentially, of unlimited length). The table contains byte-sized offsets for each hash-code, calculated by another application of the Jenkins hash, which produces two values in one go (actually, three, but one of them is not used). The divisor of the hash-function, this time, equals double the number of elements in the bucket. And the uniqueness requirement is that, for each key, the sum of the offsets corresponding, in the table, to the two values produced by the Jenkins hash is unique. To check whether the constraint is satisfied, the hashes are treated as vertices of a graph, and if it happens to be acyclic (the probability of this event is quite high if the parameters are chosen properly), the requirement can be satisfied, and it is possible to construct the perfect hash function by the process described as the next step. Otherwise, we change the seed of the Jenkins hash and try again until the resulting graph is acyclic. In practice, just a couple of tries are needed.
3. Finally, the hash-function for the current bucket may be constructed from the graph by the CHM92 algorithm (named after the authors and the year of the paper), which is another version of perfect hashing, but suitable only for limited keysets. Here, you can see the CHM92 formula implemented in code:
(defstruct mpht
  (data nil :type simple-vector)
  (offsets nil :type (simple-array octet))
  (meta nil :type (simple-array quad))
  (div nil))

;; div is the divisor of the top-level hash, which is calculated as:
;; (/ (1- (length meta)) 2)

(defun mpht-index (item mpht)
  (with (((offsets meta div) ? mpht)
         (bucket-id (* (mod (jenkins-hash item) div) 2))
         (bucket-offset (? meta bucket-id))
         (bucket-seed (? meta (+ 1 bucket-id)))
         ;; the number of items in the bucket is calculated
         ;; by subtracting the offset of the next bucket
         ;; from the offset of the current one
         (bucket-count (- (? meta (+ 2 bucket-id))
                          bucket-offset))
         ;; the divisor of the per-bucket hash equals double
         ;; the number of elements in the bucket
         (bucket-div (* 2 bucket-count))
         ((hash1 hash2) (jenkins-hash2 item bucket-seed bucket-div))
         (base (* bucket-offset 2)))
    (+ bucket-offset (mod (+ (? offsets (+ base hash1))
                             (? offsets (+ base hash2)))
                          bucket-count))))
This algorithm guarantees exactly O(1) hash-table access and uses 2 bytes per key, i.e. it will result in a constant 25% overhead on the table's size (in a 64-bit system): 2 byte-sized offsets for the hashes plus a negligible 8 bytes per bucket (each bucket contains ~200 elements) of meta information. Better space-utilization solutions (up to 4 times more efficient) exist, but they are harder to implement and explain.
The Jenkins hash-function was chosen for two reasons:

- Primarily, because, being a relatively good-quality hash, it has a configurable parameter seed that is used for probabilistic probing (searching for an acyclic graph). FNV-1a, on the contrary, doesn't work well with an arbitrary prime, hence the usage of a precalculated one that isn't subject to change.
- Also, it produces 3 pseudo-random numbers right away, and we need 2 for the second stage of the algorithm.
The CHM92 Algorithm
The CHM92 algorithm operates by performing a depth-first search (DFS) on the graph, in the process labeling the edges with unique numbers and calculating the corresponding offset for each of the Jenkins hash values. In the picture, you can see one of the possible labelings: each vertex is the value of one of the two hash-codes returned by jenkins-hash2 for each key, and every edge connecting them corresponds to the key that produced the hashes. The unique indices of the edges were obtained during DFS. Now, each hash-code is mapped iteratively to the number (- edge-index other-vertex-index). So, some codes will map to the same number, but it is guaranteed that, for each key, the sum of the two corresponding numbers will be unique (as the edge indices are unique).
CHM92 is an example of the probabilistic algorithms we will discuss in more detail near the end of the book.
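To make the labeling step concrete, here is a small sketch of the CHM92 vertex-numbering idea in Python (an illustration under my own simplifications: plain integer hash values, no modular reduction, and a simple iterative DFS):

```python
def chm92_assign(edges, n_vertices):
    """Assign a number g[v] to every vertex so that, for the edge with
    index i connecting u and v, g[u] + g[v] == i. Each edge (u, v)
    stands for a key whose two hash values are u and v. Returns None
    if the graph is cyclic (the full algorithm would then re-seed the
    hash and retry)."""
    adj = [[] for _ in range(n_vertices)]
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    g = [None] * n_vertices
    for start in range(n_vertices):
        if g[start] is not None:
            continue
        g[start] = 0
        stack = [(start, -1)]
        while stack:
            u, parent_edge = stack.pop()
            for v, i in adj[u]:
                if i == parent_edge:
                    continue  # don't walk back over the same edge
                if g[v] is not None:
                    return None  # a cycle: labeling is impossible
                g[v] = i - g[u]  # the (- edge-index other-vertex) rule
                stack.append((v, i))
    return g

g = chm92_assign([(0, 1), (1, 2), (3, 4)], 5)
# every edge's endpoints now sum to its unique index
```

On a cyclic graph, such as a triangle of three keys, the function signals failure, which corresponds to re-seeding the Jenkins hash in the algorithm above.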
Let's say we have implemented the described scheme, like I did in the const-table library. Now, we need to perform measurements to validate that we have, in fact, achieved the desired improvement over the standard hash-table implementation. In this case, we are interested not only in speed measurements, which we already know how to perform, but also in calculating the space occupied.
The latter goal is harder to achieve. Usually, most programming languages will provide an analog of a sizeof function that returns the space occupied by an array, a structure, or an object. Here, we're interested not in a "shallow" sizeof but in a "deep" one that will descend into the structure's slots and add their sizes recursively.
First, let's create functions to populate the tables with a significant number of random string key-value pairs.
(defun random-string (size)
  (coerce (loop :repeat size :collect (code-char (+ 32 (random 100))))
          'string))

(defun random-hash-table (&key (n 100000))
  (let ((rez (make-hash-table :test 'equal)))
    (loop :repeat n :do
      (:= (? rez (random-string (+ 3 (random 4))))
          (random-string (+ 3 (random 4)))))
    rez))

(defun random-const-table (&key (n 100000))
  (let ((rez (make-const-table :test 'equal)))
    (loop :repeat n :do
      (:= (? rez (random-string (+ 3 (random 4))))
          (random-string (+ 3 (random 4)))))
    rez))
A very approximate space measurement may be performed using the standard operator room. But it doesn't provide detailed per-object statistics. Here's the result of a room measurement in SBCL (the format of the report will be somewhat different for each implementation):
CL-USER> (room)
Dynamic space usage is: 45,076,224 bytes.
Immobile space usage is: 18,998,832 bytes (64,672 bytes overhead).
Read-only space usage is: 0 bytes.
Static space usage is: 1,264 bytes.
Control stack usage is: 9,048 bytes.
Binding stack usage is: 640 bytes.
Control and binding stack usage is for the current thread only.
Garbage collection is currently enabled.
Breakdown for dynamic space:
11,369,232 bytes for 76,040 simple-vector objects
9,095,952 bytes for 160,669 instance objects
8,289,568 bytes for 518,098 cons objects
3,105,920 bytes for 54,655 simple-array-unsigned-byte-8 objects
2,789,168 bytes for 54,537 simple-base-string objects
2,344,672 bytes for 9,217 simple-character-string objects
6,973,472 bytes for 115,152 other objects
43,967,984 bytes for 988,368 dynamic objects (space total)
Breakdown for immobile space:
16,197,840 bytes for 24,269 code objects
1,286,496 bytes for 26,789 symbol objects
1,041,936 bytes for 27,922 other objects
18,526,272 bytes for 78,980 immobile objects (space total)
CL-USER> (defparameter *ht* (random-hash-table))
*HT*
CL-USER> (room)
...
Breakdown for dynamic space:
13,349,920 bytes for 77,984 simple-vector objects
11,127,008 bytes for 208,576 simple-character-string objects
9,147,824 bytes for 161,469 instance objects
8,419,360 bytes for 526,210 cons objects
3,517,792 bytes for 2,997 simple-array-unsigned-byte-32 objects
3,106,288 bytes for 54,661 simple-array-unsigned-byte-8 objects
7,671,168 bytes for 166,882 other objects
56,339,360 bytes for 1,198,779 dynamic objects (space total)
So, it seems like we added roughly 10 megabytes by creating a hash-table with 100,000 random 5-9 character keys and values. Almost all of that space went into the keys and values themselves: 9 MB ("11,127,008 bytes for 208,576 simple-character-string objects" versus "2,344,672 bytes for 9,217 simple-character-string objects" before, i.e. a bit less than 200,000 new strings were added).
Also, if we examine the hash-table, we can see that its occupancy is rather high: around 90%! (The number of keys, 99706 instead of 100000, tells us that there was a small portion of duplicate keys among the randomly generated ones.)
CL-USER> (describe *ht*)
#<HASH-TABLE :TEST EQUAL :COUNT 99706 {1002162EF3}>
[hash-table]
Occupancy: 0.9
Rehash-threshold: 1.0
Rehash-size: 1.5
Size: 111411
And now, a simple time measurement:
CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(gethash k *ht*)))))
Evaluation took:
0.029 seconds of real time
0.032000 seconds of total run time (0.032000 user, 0.000000 system)
110.34% CPU
72,079,880 processor cycles
0 bytes consed
Now, let's try the const-tables that are the MPHT implementation:
CL-USER> (time (defparameter *ct* (cstab:build-const-table *ht*)))
...................................................................................................
Evaluation took:
0.864 seconds of real time
...
CL-USER> (room)
...
Breakdown for dynamic space:
14,179,584 bytes for 78,624 simple-vector objects
11,128,464 bytes for 208,582 simple-character-string objects
9,169,120 bytes for 161,815 instance objects
8,481,536 bytes for 530,096 cons objects
3,521,808 bytes for 2,998 simple-array-unsigned-byte-32 objects
3,305,984 bytes for 54,668 simple-array-unsigned-byte-8 objects
7,678,064 bytes for 166,992 other objects
57,464,560 bytes for 1,203,775 dynamic objects (space total)
Another megabyte was added for the metadata of the new table, which doesn't seem significantly different from the hash-table version. Surely, we'd often like to be much more precise in space measurements. For this purpose, SBCL recently added an allocation profiler, sb-aprof, but we won't go into the details of its usage in this chapter.
And now, time measurement:
CL-USER> (let ((keys (keys *ht*)))
(time (loop :repeat 100 :do
(dolist (k keys)
(cstab:csget k *ct*)))))
Evaluation took:
3.561 seconds of real time
Oops, a two-orders-of-magnitude slowdown! It probably has to do with many factors: the lack of optimization in my implementation compared to the one in SBCL, the need to calculate more hashes and with a slower hash-function, etc. I'm sure that the implementation may be sped up by at least an order of magnitude but, even then, what's the benefit of using it over the default hash-tables? Especially considering that MPHTs have a lot of moving parts and rely on a number of "low-level" algorithms, like graph traversal or efficient membership testing, most of which need a custom efficient implementation...
Still, there's one dimension in which MPHTs may provide an advantage: significantly reduced space usage from not storing the keys. Though, it becomes problematic if we need to distinguish the keys that are in the table from unknown ones, as those will also hash to some index, i.e. overlap with an existing key. So, either the keyspace should be known beforehand and exhaustively covered in the table, or some precursory membership test is necessary when we anticipate the possibility of unseen keys. Yet, there are ways to perform the test efficiently (exactly or probabilistically) that require much less storage space than would be needed to store the keys themselves. Some of them we'll see in the following chapters.
If the keys are omitted, the whole table may be reduced to a jump-table. Jump-tables are a low-level trick possible when all the keys are integers in the interval [0, n). It removes the necessity to perform sequential equality comparisons for every possible branch until one of the conditions matches: instead, the numbers are used directly as an offset, i.e. the table is represented by a vector, each hash-code being an index into that vector.
A jump-table for the MPHT will be simply a data array, but sometimes evaluation of different code is required for different keys. Such more complex behavior may be implemented in Lisp using the lowest-level operators tagbody and go (and a bit of macrology if we need to generate a huge table). This implementation will be a complete analog of the C switch statement. The skeleton for such an "executable" table will look like this, where 0, 1, ... are go labels:
(block nil
  (tagbody (go key)
    0 (return (do-something-0))
    1 (return (do-something-1))
    ...))
Distributed Hash-Tables
Another active area of hash-table-related research is algorithms for distributing them over the network. This is a natural way to represent a lot of datasets, and thus there are numerous storage systems (both general- and special-purpose) which are built as distributed hash-tables. Among them are, for instance, Amazon DynamoDB or the influential open-source project Kademlia. We will discuss some of the technologies developed for this use case in more detail in the chapter on Distributed Algorithms, and here I wanted to mention just one concept.
Consistent Hashing addresses the problem of distributing the hash-codes among k storage nodes under the real-world limitations that some of them may become temporarily unavailable or new peers may be added to the system. Such changes result in changes to the value of k. The straightforward approach would just divide the space of all codes into k equal portions and select the node into whose portion the particular key maps. Yet, if k changes, all the keys need to be rehashed, which we'd like to avoid at all costs: rehashing the whole database and moving the majority of the keys between the nodes at once will saturate the network and bring the system to a complete halt.
The idea, or rather the tweak, behind Consistent Hashing is simple: we also hash the node ids and store each key on the node that has the next hash-code larger than the hash of the key (modulo n, i.e. wrapping around 0). Now, when a new node is added, it is placed on this so-called "hash ring" between two other peers, so only part of the keys of a single node (the next one on the ring) requires being redistributed to it. Likewise, when a node is removed, only its keys need to be reassigned to the next peer on the ring (it is supposed that the data is stored in multiple copies on different nodes, so when one of the nodes disappears the data isn't totally lost).
The only problem with applying this approach directly is the uneven distribution of keys originating from the uneven placement of the hash-codes of the nodes on the hash ring. This problem can be solved with another simple tweak: have multiple ids for each node that will be hashed to different locations, effectively emulating a larger number of virtual nodes, each storing a smaller portion of the keys. Due to the randomization property of hashes, not so many virtual nodes will be needed to obtain a nearly uniform distribution of keys over the nodes.
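A minimal sketch of such a hash ring with virtual nodes, in Python (an illustration only; SHA-1 is used merely to get a hash that is stable across runs, and the vnodes count is an arbitrary choice):

```python
import bisect
import hashlib

def ring_hash(s):
    # a stable position on a 2^32 ring (Python's built-in hash()
    # is randomized between runs, hence the digest)
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # each physical node is placed on the ring at `vnodes`
        # pseudo-random positions to even out the key distribution
        self.ring = sorted((ring_hash("%s#%d" % (node, i)), node)
                           for node in nodes for i in range(vnodes))

    def node_for(self, key):
        # the first virtual node clockwise from the key's hash,
        # wrapping around zero
        i = bisect.bisect(self.ring, (ring_hash(key), ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# removing a node relocates only the keys that lived on it:
# the clockwise successor of every other key stays the same
```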
A more general version of this approach is called Rendezvous Hashing. In it, the key for the item is combined with the node id for each node and then hashed. The largest value of the hash determines the designated node to store the item.
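Rendezvous Hashing needs no ring at all; a sketch of the idea (again an illustration, with SHA-1 standing in for whatever hash a real system would use):

```python
import hashlib

def rendezvous_node(key, nodes):
    # combine the key with each node id, hash the combination,
    # and pick the node that yields the largest hash value
    def score(node):
        return hashlib.sha1(("%s:%s" % (key, node)).encode()).digest()
    return max(nodes, key=score)

owner = rendezvous_node("some-item", ["node-a", "node-b", "node-c"])
# removing any node other than the owner cannot change the result,
# since the maximum over the remaining scores stays the same
```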
Hashing in Action: Content Addressing
Hash-tables are so ubiquitous that it's, actually, difficult to single out some peculiar use case. Instead, let's talk about hash-functions. They can find numerous uses beyond determining the positions of items in a hash-table, and one of them is called "content addressing": globally identifying a piece of data by its fingerprint instead of using external meta-information like a name or path. This is one of the suggested building blocks for large-scale distributed storage systems, but it works locally as well: your git SCM system silently uses it behind the scenes to identify the changesets it operates upon.
The advantages of Content Addressing are:
- Potential for space economy: if the system has a chance of operating on repeated items (as git does, although that's not the only reason for choosing such a naming scheme for blobs, the other being the lack of a better variant), content addressing will make it possible to avoid storing them multiple times.
- It guarantees that the links will always return the same content, regardless of where it is retrieved from, who added it to the network, how, and when. This enables such distributed protocols as BitTorrent, which splits the original file into multiple pieces, each one identified by its hash. These pieces can be distributed in an untrusted network.
- As mentioned above, content addressing also results in a conflict-free naming scheme (provided that the hash has enough bits; usually, cryptographic hashes such as SHA-1 are used for this purpose, although, in many cases, such powerful hash-functions are overkill).
Takeaways
This chapter presented a number of complex approaches that require a lot of attention to detail to be implemented efficiently. On the surface, the hash-table concept may seem rather simple but, as we have seen, the production-grade implementations are not that straightforward. What general conclusions can we make?
- In such mathematically loaded areas as hash-function and hash-table implementation, rigorous testing is critically important, for there are a number of unexpected sources of errors: incorrect implementation, integer overflow, concurrency issues, etc. A good testing strategy is to take an already existing trusted implementation and perform large-scale comparison testing against it with a lot of random inputs. We haven't discussed the testing code here but will return to the practical implementation of such testing frameworks in the following chapters.
- Besides, a correct implementation doesn't necessarily mean a fast one. Low-level optimization techniques play a crucial role here.
- In the implementation of MPHT, we have seen in action another important approach to solving algorithmic and, more generally, mathematical problems: reducing them to a problem that has a known solution.
- Space measurement is another important area of algorithm evaluation that is somewhat harder to accomplish than runtime profiling. We'll also see more usage of both of these tools throughout the book.
Nicolas Hafner — Library Policy Update  Confession 88
@20190905 10:19 · 92 days ago
Recently there's been a few changes to the libraries I maintain, as well as to the new libraries I publish. I thought that these changes would be of general interest, even if you might not use my libraries at all.
License Switch from Artistic 2 to Zlib
Previously all of my libraries were published under the Artistic 2 license. However, a problem was brought to my attention in that license, specifically pertaining to Lisp distributions. Namely, §8 in the license text:
8) You are permitted to link Modified and Standard Versions with other works, to embed the Package in a larger work of your own, or to build standalone binary or bytecode versions of applications that include the Package, and Distribute the result without restriction, provided the result does not expose a direct interface to the Package.
The problem being this, in specific:
provided the result does not expose a direct interface to the Package.
This seems to prohibit distributing an application that would expose something like a REPL or a scripting interface that allows the user to interface with the library. I'm not sure what "direct" means in this context, nor what "linking" really means for Lisp; the GPL licenses have a similar issue. Either way, the implications of this restriction are severe enough that I was convinced to abandon the license.
I have since changed the license of all of my libraries and projects to Zlib. Everyone is also hereby permitted to use any prior versions of my projects that were licensed under Artistic 2 under the zlib license terms.
Why Zlib and not MIT or BSD? Simply because I like the clauses that prohibit claiming credit for the software. In any case, I hope that alleviates potential concerns people had about using my software due to the license terms.
Fully Qualified Domain Names for Packages
Since the package-local nicknames extension to Common Lisp is now supported widely enough for my tastes, I have decided to drop the short nickname I used to include for packages so far. All of my packages have always included an FQDN name, but also included a shorter nickname for convenience. Newly published libraries will not do this anymore.
You are now encouraged to use the packages by adding local nicknames to your own package definitions. For instance, to use the new Classowary library, you would do something like this:
(defpackage #:org.my.stuff.package
  (:use #:cl)
  (:local-nicknames
   (#:cass #:org.shirakumo.classowary)))
If you use an implementation, like Allegro or LispWorks, that does not support package-local nicknames yet, please contact support and request the addition of this feature. It should not be a difficult feature to add, and there is a comprehensive test suite available to aid the process.
If you are a user of my packages and have so far been using the short name, please update your own packages to use a nickname. You won't have to change any other code as long as you do that. You should do this because I would like to alleviate the package namespace pollution problem by removing the short, global aliases from my libraries in the future. Thus, consider the short names to be officially deprecated.
Closing Thoughts
I hope that with these changes I can help push the ecosystem in a positive direction. I would also like to remind everyone of the portability effort. I think the variety of implementations on offer is a big asset to Common Lisp, and I would very much like it to be possible for people to write high-quality libraries that work on a variety of implementations without having to expend a huge amount of time repeating the porting work. As such, it would be amazing if people could help improve existing portability libraries and implementations to extend support coverage.
More articles and Lispy things coming soon, I hope!
Lispers.de — LispMeetup in Hamburg on Monday, 2nd September 2019
@20190831 00:00 · 98 days ago We meet at Ristorante Opera, Dammtorstraße 7, Hamburg, starting around 19:00 CEST on 2nd September 2019.
This is an informal gathering of Lispers of all experience levels.
For older items, see the Planet Lisp Archives.
Last updated: 20191206 21:44