Planet Lisp

Joe Marshall: But what if you really want to push a stack frame?

· 16 hours ago
If you really don't want tail recursion, the solution is simple: don't put the call in “tail position”. We define a rather trivial function dont-tail-call and use it like this:
(dont-tail-call
  (foo x))
The semantics of Scheme are that the arguments are evaluated before the call to the function, so the call to foo is required to occur before the call to dont-tail-call, which necessitates the allocation of a continuation.

But what about a really clever compiler that somehow “optimizes” away the call to dont-tail-call? Such an “optimization” is highly questionable because it changes the observable semantics of the program so it is no longer call-by-value. A compiler that performed that transformation wouldn't be a Scheme compiler because Scheme is a call-by-value language.

But what if we had a really clever compiler and we disregard the change in semantics? We can easily thwart the compiler by deferring the definition of dont-tail-call until run time. Even the cleverest compiler cannot see into the future.

The definition of dont-tail-call is left to the reader, as is how to defer its definition until run time.
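
If you want to check your answer, here is one possible sketch (mine, not the author's): the identity function, with its real definition deferred until run time via set!.

;; One possible definition: the identity function.  Because Scheme is
;; call-by-value, (foo x) must be evaluated before dont-tail-call is
;; entered, so the call to foo cannot be in tail position.
(define dont-tail-call #f)

;; Defer the actual definition until run time, where no static
;; analysis can see it.
(set! dont-tail-call (lambda (x) x))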

Joe Marshall: Afraid of Tail Recursion

· 33 hours ago
It's a well-known fact among proponents of tail recursion that some people just don't get it. They view tail recursion at best as a quirky compiler optimization that turns some recursive calls into loops. At worst, they see it as some sort of voodoo, or a debugging pessimization. They see little value in it. Some have outright disdain for it.

Tail recursion isn't just about turning recursive calls into loops. It's about changing how you look at function calling. Tail recursion just happens to fall out of this new viewpoint.

Most programmers, I think, view function calls as if they were akin to a short vacation. You pack up the arguments in your luggage, travel to the destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return home, unpack everything and resume life where you left off. Your souvenirs are the return value.

Should you need a vacation from your vacation, you do the same thing: pack up the arguments in your luggage, travel to your new destination, unpack your luggage, do some stuff, repack your luggage with some souvenirs, return to your original vacation spot and resume your original vacation.

Tail recursion aficionados realize that the journey itself is the important part of the function call, and that a vacation includes two journeys. On the first journey you pack up the arguments, including the return ticket, in your luggage, use the outbound ticket to journey to the destination, unpack your luggage, and start doing stuff. When you run out of stuff to do, you make the second journey. You fetch the return ticket, repack your luggage, take the ticket to wherever it leads (presumably back home), unpack everything, and resume doing whatever you were doing there.

But perhaps you want to visit grandma instead of going directly home. Then we change the script slightly. When you run out of things to do on your vacation, you pack up your luggage with your souvenirs and the return ticket, then you journey to grandma's house, where you unpack and start doing stuff. Eventually you are done visiting grandma, so then you fetch the return ticket, repack your luggage, take the ticket to wherever it leads, unpack everything, and resume doing stuff there. It's a three-legged journey. You don't go from grandma's back to the vacation resort — there's nothing left for you to do there. You take the return ticket directly home.

Viewing things this way, a function call involves packaging the arguments in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the function, where we unpack the arguments and resume execution of the program. It is simply “a goto that passes arguments”.*

A function return is simply “a goto that passes a return value”. It involves packaging the return value in a suitable way, deallocating any temporary storage, and then making an unconditional transfer to the return address, where we resume execution of the program.

A tail recursive function call is simply “a goto that passes arguments”. It involves packaging the arguments in a suitable way, deallocating any temporary storage and then making an unconditional transfer to the function, where we resume execution of the program.
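
To make the symmetry concrete, here is the standard contrast in Scheme (my example, not from the post):

;; Not a tail call: a multiply remains after fact returns, so each
;; call must allocate a continuation (a "return ticket") to come back.
(define (fact n)
  (if (zero? n)
      1
      (* n (fact (- n 1)))))

;; A tail call: nothing remains to be done in the caller, so fact-iter
;; passes its caller's continuation along -- a goto that passes arguments.
(define (fact-iter n acc)
  (if (zero? n)
      acc
      (fact-iter (- n 1) (* n acc))))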

Do we really deallocate temporary storage before every control transfer? Certainly a return pops the topmost stack frame, and, as often implemented, a tail recursive function call deallocates its stack frame or replaces it before transferring control. But a non-tail-recursive call? It does so as well; it's just that it also has to pack those values into a new continuation for the return trip. We use an implementation trick to avoid the absurdity of actually moving these values around: we move the base-of-frame pointer instead. Voila, we simultaneously deallocate the stack frame and allocate the continuation with the right values already in place.

Deallocating storage before each control transfer is an important part of the protocol. We're making an unconditional transfer to a destination with the hope, but no guarantee, that we'll come back, so we'd better tidy up before we leave. This ensures that we won't leave a trail of garbage behind us on each transfer, which would accumulate and adversely affect the space complexity of our program.

Once you view a function call and return as not being a single sequence, but each one a separate, and virtually identical, sequence, then tail recursion becomes a natural consequence. Tail recursion isn't a special case of function call; it is the same thing as a function call, the only difference being whether a new continuation (the "return ticket") is allocated in order to come back. Even function returns are the same thing, the only difference being that the destination is (usually) determined dynamically rather than statically. Tail recursion isn't just another optimization; it's the result of treating inter-procedural control transfer symmetrically.

Another natural consequence is greatly increased options for control flow. Instead of a strict call/return pattern, you can make "three-legged" trips, or however many legs you need. You can make loops that incorporate one, two, or even a dynamically changing number of functions. You can replace compiler-generated returns with user-provided function calls (continuation-passing style) and implement arbitrarily complex control and data flow like multiple return values, exception handling, backtracking, coroutines, and patterns that don't even have names yet. And of course you can mix and match these patterns with the standard call and return pattern as well.
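
As a small illustration of continuation-passing style (my sketch, not the post's), the "return ticket" becomes an explicit argument, so alternative return paths are just alternative arguments:

;; Division in continuation-passing style: "returning" means calling
;; whichever continuation applies, so the error path needs no special
;; mechanism.
(define (div-cps n d on-ok on-zero)
  (if (zero? d)
      (on-zero)                 ; "return" along the failure path
      (on-ok (quotient n d))))  ; "return" along the success path

(div-cps 10 3 (lambda (q) q) (lambda () 'division-by-zero))  ; => 3
(div-cps 10 0 (lambda (q) q) (lambda () 'division-by-zero))  ; => division-by-zero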

The phrase "tail recursion" is shorthand for this symmetric view of interprocedural control flow and is meant to encompass all these consequences and control flow options that spring from it. It's not about simply turning recursive functions into loops.

People who are afraid of tail recursion seem unable to see any value in taking up the symmetric viewpoint despite the fact that it opens up a whole new set of control flow techniques (in particular continuation-passing style). They find the notion that a procedure call is “a goto that passes arguments” “nonsensical”. A lot of good research has come from taking this nonsense seriously.


*The view that a function call is simply “a goto that passes arguments” was developed by Steele in his “Lambda papers”.

The important point of cleaning up before the control transfer was formalized by Clinger in “Proper Tail Recursion and Space Efficiency”.

Someone — it might have been Clinger, but I cannot find a reference — called tail recursion “garbage collection for the stack”. The stack, being so much more limited in size than the heap, needs it that much more. Indeed Clinger notes the tight connection between tail recursion and heap garbage collection and points out that heap garbage collection is hampered if the stack is retaining pointers to logically dead data structures. If the dead structures are large enough, garbage collection can be rendered useless. Yet many popular languages provide garbage collection but not tail recursion.

The only difference between a call and return is that typically the call is to a statically known location and the return address is dynamically passed as a "hidden" argument. But some compilers, like Siskind's Stalin compiler, statically determine the return address as well.

The only difference between a function call and a tail recursive function call is when you need to return to the caller to complete some work. In this case, the caller needs to allocate a new continuation so that control is eventually returned. If there is no further work to be done in the caller, it doesn't create a new continuation, but simply passes along the one that was passed to it.

Many compilers have been written that handle function calls, tail recursive function calls, and returns identically. They only change what code they emit for handling the continuation allocation. These compilers naturally produce tail recursive code.

Most machines provide special-purpose support for a LIFO stack. It is tempting to use the stack for allocation of continuations because they are almost always allocated and deallocated in LIFO order, and a stack gives higher performance when this is the case. Many compilers do in fact use the stack for continuations and argument passing. Some, like Winkelmann's Chicken compiler, follow Baker's suggestion and treat the stack as a "nursery" for the heap. Others avoid using the stack altogether because of the extra complexity it entails. And others cannot use the stack because of constraints placed on stack usage by OS conventions or the underlying virtual machine model.

Paul Khuong: Lazy Linear Knapsack

· 40 hours ago

The continuous knapsack problem may be the simplest non-trivial linear programming problem:

\[\max_{x \in [0, 1]^n} p'x\] subject to \[w'x \leq b.\]

It has a linear objective, one constraint, and each decision variable is bounded to ensure the optimum exists. Note the key difference from the binary knapsack problem: decision variables are allowed to take any value between 0 and 1. In other words, we can, e.g., stick half of a profitable but large item in the knapsack. That's why this knapsack problem can be solved in linear time.

Dual to primal is reasonable

Duality also lets us determine the shape of all optimal solutions to this problem. For each item \(i\) with weight \(w_i\) and profit \(p_i\), let its profit ratio be \(r_i = p_i / w_i,\) and let \(\lambda^\star\) be the optimal dual (Lagrange or linear) multiplier associated with the capacity constraint \(w'x \leq b.\) If \(\lambda^\star = 0,\) we simply take all items with a positive profit ratio (\(r_i > 0\)) and a non-negative weight \(w_i \geq 0.\) Otherwise, every item with a profit ratio \(r_i > \lambda^\star\) will be at its weight upper bound (1 if \(w_i \geq 0\), 0 otherwise), and items with \(r_i < \lambda^\star\) will instead be at their lower bound (0 if \(w_i \geq 0\), and 1 otherwise).

Critical items, items with \(r_i = \lambda^\star,\) will take any value that results in \(w'x = b.\) Given \(\lambda^\star,\) we can derive the sum of weights for non-critical items; divide the remaining capacity for critical items by the total weight of critical items, and let that be the value for every critical item (with the appropriate sign for the weight).

For example, if we have capacity \(b = 10,\) and the sum of weights for non-critical items in the knapsack is \(8,\) we're left with another two units of capacity to distribute however we want among critical items (they all have the same profit ratio \(r_i = \lambda^\star,\) so it doesn't matter where that capacity goes). Say critical items with a positive weight have a collective weight of 4; we could then assign a value of \(2 / 4 = 0.5\) to the corresponding decision variable (and 0 for critical items with a non-positive weight).

We could instead have \(b = 10,\) and the sum of weights for non-critical items in the knapsack \(12\): we must find two units of capacity among critical items (they all cost \(r_i = \lambda^\star\) per unit, so it doesn't matter which). If critical items with a negative weight have a collective weight of \(-3,\) we could assign a value of \(-2 / -3 = 0.6\overline{6}\) to the corresponding decision variables, and 0 for critical items with a non-negative weight.

The last case highlights something important about the knapsack: in general, we can't assume that the weights or profits are positive. We could have an item with a non-positive weight and non-negative profit (that's always worth taking), an item with positive weight and negative profit (never interesting), or weights and profits of the same sign. The last case is the only one that calls for actual decision making. Classically, items with negative weight and profit are rewritten away, by assuming they're taken in the knapsack, and replacing them with a decision variable for the complementary decision of removing that item from the knapsack (i.e., removing the additional capacity in order to improve the profit). I'll try to treat them directly as much as possible, because that reduction can be a significant fraction of solve times in practice.

The characterisation of optimal solutions above makes it easy to directly handle elements with a negative weight: just find the optimal multiplier, compute the contribution of non-critical elements (with decision variables at a bound) to the left-hand side of the capacity constraint, separately sum the negative and positive weights for critical elements, then do a final pass to distribute the remaining capacity to critical elements (and 0-weight / 0-value elements if one wishes).

Solving the dual looks like selection

Finding the optimal multiplier \(\lambda^\star\) is similar to a selection problem: the value is either 0 (the capacity constraint is redundant), or one of the profit ratios \(r_i,\) and, given a multiplier value \(\lambda,\) we can determine if it's too high or too low in linear time. If the non-critical elements yield a left-hand side such that critical elements can't add enough capacity (i.e., no solution with the optimal form can be feasible), \(\lambda\) is too low. If the maximum weight of potentially optimal solutions is too low, \(\lambda\) is too high.

We can thus sort the items by profit ratio \(r_i\), compute the total weight corresponding to each ratio with a prefix sum (with a pre-pass to sum all negative weights), and perform a linear (or binary) search to find the critical profit ratio. Moreover, the status of non-critical items is monotonic as \(\lambda\) grows: if an item with positive weight is taken at \(\lambda_0\), it is also taken for every \(\lambda \leq \lambda_0\), and a negative-weight item that's taken at \(\lambda_0\) is also taken for every \(\lambda \geq \lambda_0.\) This means we can adapt selection algorithms like Quickselect to solve the continuous knapsack problem in linear time.
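
To make that concrete, here is a sketch (mine, not from the post) of the sort-based approach for the easy case where all weights and profits are positive; swapping the sort for a Quickselect-style partition is what brings this down to linear time.

# A sketch of the sort-based continuous knapsack, positive weights and
# profits only (the signed cases above need the reductions described
# earlier).  The sort makes this O(n log n) instead of linear.
def continuous_knapsack(items, b):
    """items: list of (profit, weight) pairs with weight > 0; b: capacity.
    Returns {index: fraction taken}."""
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / items[i][1],
                   reverse=True)
    x = {}
    remaining = b
    for i in order:
        profit, weight = items[i]
        if weight <= remaining:   # fits entirely: take all of it
            x[i] = 1.0
            remaining -= weight
        else:                     # the critical item absorbs the leftover capacity
            x[i] = remaining / weight
            break
    return x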

I'm looking at large instances, so I would like to run these algorithms in parallel or even distributed on multiple machines, and ideally use GPUs or SIMD extensions. Unfortunately, selection doesn't parallelise very well: we can run a distributed quickselect where every processor partitions the data in its local RAM, but that still requires a logarithmic number of iterations.

Selection looks like quantile estimation; does the dual?

Lazy Select offers a completely different angle for the selection problem. Selecting the \(k\)th smallest element from a list of \(n\) elements is the same as finding the \(k / n\)th quantile[1] in that list of \(n\) elements. We can use concentration bounds[2] to estimate quantiles from a sample of, e.g., \(m = n^{3/4}\) elements: the population quantile value is very probably between the \(qm - \frac{\log m}{\sqrt{m}}\)th and \(qm + \frac{\log m}{\sqrt{m}}\)th values of the sample. Moreover, this range very probably includes at most \(\mathcal{O}(n^{3/4})\) elements[3], so a second pass suffices to buffer all the elements around the quantile, and find the exact quantile. Even with a much smaller sample size \(m = \sqrt{n},\) we would only need four passes.

Unfortunately, we can't directly use that correspondence between selection and quantile estimation for the continuous knapsack.

I tried to apply a similar idea by sampling the knapsack elements equiprobably, and extrapolating from a solution to the sample. For every \(\lambda,\) we can derive a selection function \(f_\lambda (i) = I[r_i \geq \lambda]w_i\) (invert the condition if the weight is negative), and scale up \(\sum_i f_\lambda(i)\) from the sample to the population. As long as we sample independently of \(f_\lambda\), we can reuse the same sample for all \(f_\lambda.\) The difficulty here is that, while the error for Lazy Select scales as a function of \(n,\) the equivalent bounds with variable weights are a function of \(n(|\max_i w_i| + |\min_i w_i|)^2.\) That doesn't seem necessarily practical; scaling with \(\sum_i |w_i|\) would be more reasonable.

Good news: we can hit that, thanks to linearity.

Let's assume weights are all integers. Any item with weight \(w_i\) is equivalent to \(w_i\) subitems with unit weight (or \(-w_i\) elements with negative unit weight), and the same profit ratio \(r_i\), i.e., profit \(p_i / |w_i|\). The range of subitem weights is now a constant.

We could sample uniformly from the subitems with a Bernoulli for each subitem, but that's clearly linear time in the sum of weights, rather than the number of elements. If we wish to sample roughly \(m\) elements from a total weight \(W = \sum_i |w_i|,\) we can instead determine how many subitems (units of weight) to skip before sampling with a Geometric of success probability \(m / W.\) This shows us how to lift the integrality constraint on weights: sample from an Exponential with the same parameter \(m / W\)!

That helps, but we could still end up spending much more than constant time on very heavy elements. The trick is to deterministically special-case these elements: stash any element with large weight \(w_i \geq W / m\) to the side, exactly once. By Markov's inequality,[4] we know there aren't too many heavy elements: at most \(m.\)

Let's test this out

The heart of the estimation problem can be formalised as follows: given a list of elements \(i \in [n]\) with weight \(w_i \geq 0\), generate a sample of \(m \leq n\) elements ahead of time. After the sample has been generated, we want to accept an arbitrary predicate \(p \in \{0,1\}^n\) and estimate \(\sum_{i\in [n]} p(i) w_i.\)

We just sketched an algorithm for this problem. Let's see what it looks like in Python. The initial sampling logic has to determine the total weight, and sample items with probability proportional to their weight. Items heavier than the cutoff are not considered in the sample and are instead saved to an auxiliary list.

sample.py
import random


def total_weight(items):
    return sum(weight for (_, weight) in items)


def sample_by_weight(items, rate, cutoff):
    """Samples from a list of (index, weight), with weight >= 0.

    Items with weight >= cutoff are taken with probability one.
    Others are sampled with rate `rate` / unit of weight.
    """
    sample = []
    large = []
    next_sample = random.expovariate(rate)
    for item in items:
        index, weight = item
        if weight >= cutoff:
            large.append(item)
        else:
            next_sample -= weight
            while next_sample <= 0:
                sample.append(index)
                next_sample += random.expovariate(rate)
    return sample, large

We can assemble the resulting sample (and list of "large" elements) to compute a lower bound on the weight of items that satisfy any predicate that's independent of the sampling decisions. The value for large elements is trivial: we have a list of all large elements. We can subtract the weight of all large elements from the total item weight, and determine how much we have to extrapolate up.

extrapolate.py
import math


def hoeffding(n, alpha):
    """Determines how much we can expect a sample of n i.i.d. values
    sampled from a Bernoulli to differ, given an error rate of alpha.

    Given a sample X of n i.i.d. values from a Bernoulli distribution,
    let delta be \bar{X} - E[\bar{X}], the one-sided difference
    between the sample average value and the expected sample average.

    Hoeffding's upper bound (see below) is conservative when the
    empirical probability is close to 0 or 1 (trivially, it can yield
    confidence bounds that are outside [0, 1]!), but simple, and in
    general not much worse than tighter confidence intervals.

    P(delta >= eps) <= exp(-2 eps^2 n) = alpha
      -> -2 eps^2 n = ln alpha
     <->        eps = sqrt[-(ln alpha) / 2n ]

    """
    return math.sqrt(- math.log(alpha) / (2 * n))


def eval_weight(total_weight, sample, large, predicate, alpha):
    """Given a population's total weight, a memoryless sample (by weight)
    from the population's items, and large items that were
    deterministically picked, evaluates a lower bound for the sum of
    weights for items in the population that satisfy predicate.
    
    The lower bound is taken with error rate <= alpha.
    """
    large_sum = sum(weight for (index, weight) in large if predicate(index))
    # The remainder was up for sampling, unit of weight at a time.
    sampled_weight = total_weight - sum(weight for (_, weight) in large)
    if sampled_weight <= 0 or not sample:
        return large_sum
    # Each sample point is a Bernoulli trial: does the item the point
    # landed in satisfy the predicate?
    successes = sum(1 if predicate(x) else 0 for x in sample)
    empirical_rate = successes / len(sample)
    # We want a lower bound, so subtract the one-sided Hoeffding
    # slack from the empirical success rate, clamping at zero.
    delta = hoeffding(len(sample), alpha)
    return large_sum + sampled_weight * max(0, empirical_rate - delta)

And finally, here's how we can sample from an arbitrary list of items, compute a lower bound on the weight of items that satisfy a predicate, and compare that bound with the exact value.

lower_bound.py
def compare_bounds(items, rate, alpha, predicate):
    total = total_weight(items)
    # We expect a sample size of roughly rate * len(items), and
    # at most rate * len(items) large items.
    sample, large = sample_by_weight(items, rate, rate * total)
    lower_bound = eval_weight(total, sample, large, predicate, alpha)
    # Check if the lower bound is valid.
    actual = sum(weight for (index, weight) in items if predicate(index))
    return lower_bound <= actual + 1e-8, lower_bound, actual

How do we test that? Far too often, I see tests for randomised algorithms where the success rate is computed over randomly generated inputs. That's too weak! For example, this approach could lead us to accept that the identity function is a randomised sort function, with success probability \(\frac{1}{n!}.\)

The property we're looking for is that, for any input, the success rate (with the expectation over the pseudorandom sampling decisions) is as high as requested.

For a given input (list of items and predicate), we can use the Confidence sequence method (CSM) to confirm that the lower bound is valid at least \(1 - \alpha\) of the time.

csm_test.py
import sys

import csm  # the Confidence Sequence Method package discussed below


def compare_bounds_generator(test_case, rate, alpha):
    items = [(i, w) for (i, (w, _)) in enumerate(test_case)]
    chosen = set(i for (i, (_, p)) in enumerate(test_case) if p)
    while True:
        yield compare_bounds(items, rate, alpha, lambda x: x in chosen)[0]


def check_bounds(test_case, rate, alpha):
    """Test case is a list of pairs of weight and predicate value
       rate is the sample rate
       alpha is the confidence parameter for the lower bound.
    """
    wanted = 1 - alpha  # The Hoeffding bound is conservative, so
                        # this should let csm_driver stop quickly.
    result = csm.csm_driver(compare_bounds_generator(test_case,
                                                     rate,
                                                     alpha),
                            wanted,
                            1e-6,  # Wrong conclusion with p < 1e-6.
                            file=sys.stderr
                            )
    stop, actual, *_ = result
    assert actual >= wanted, "Result: %s" % str(result)

With a false positive rate of at most one in a million,[5] we can run automated tests against check_bounds. I'll use Hypothesis to generate lists of pairs of weight and predicate value:

test_bounds.py
from hypothesis import given, settings, Verbosity
import hypothesis.strategies as st

@given(test_case=st.lists(st.tuples(st.floats(min_value=0, max_value=1),
                                    st.booleans())),
       rate=st.floats(min_value=1e-6, max_value=0.5),
       alpha=st.floats(min_value=0.05, max_value=0.25))
def test_bounds(test_case, rate, alpha):
    check_bounds(test_case, rate, alpha)

Bimodal inputs tend to be harder, so we can add a specialised test generator.

test_bimodal_bounds.py
@given(test_case=st.lists(st.tuples(st.one_of(st.just(0.1), st.just(1)),
                                    st.booleans())),
       rate=st.floats(min_value=1e-6, max_value=0.5),
       alpha=st.floats(min_value=0.05, max_value=0.25))
def test_bimodal_bounds(test_case, rate, alpha):
    check_bounds(test_case, rate, alpha)

Again, we use Hypothesis to generate inputs, and the Confidence sequence method (available in C, Common Lisp, and Python) to check that the lower bound is valid with probability at least \(1 - \alpha\). The CSM tests for this statistical property with power 1 and adjustable error rate (in our case, one in a million): we only provide a generator for success values, and the driver adaptively determines when it makes sense to make a call and stop generating more data, while accounting for multiple hypothesis testing.

TL;DR: the estimation algorithm for individual sampling passes works, and the combination of Hypothesis and Confidence Sequence Method lets us painlessly test for a statistical property.

We can iteratively use this sampling procedure to derive lower and (symmetrically) upper bounds for the optimal Lagrange multiplier \(\lambda^\star,\) and Hoeffding's inequality lets us control the probability that the lower and upper bounds are valid. Typically, we'd use a tolerance of \(\sqrt{\log(n) / n},\) for an error rate of \(1 / n^2.\) I prefer to simply use something like \(7 / \sqrt{n}:\) the error rate is then less than \(10^{-42},\) orders of magnitude smaller than the probability of hardware failure in any given nanosecond.[6] We can still check for failure of our Las Vegas algorithm, but if something went wrong, it's much more likely that we detected a hardware failure than anything else. It's like running SuperPi to stress test a computer, except the work is useful. 😉

Repeat as necessary to solve a knapsack

How many sampling passes do we need? Our bounds are in terms of the sum of item weight: if we let our sample size be in \(\Theta(\sqrt{n}),\) the sum of weights \(\sum_i |w_i|\) for unfathomed items (that may or may not be chosen depending on the exact optimal multiplier \(\lambda^\star\) in the current range) will very probably shrink by a factor of \(\Omega(n^{1/4}).\) The initial sum can, in the worst case, be exponentially larger than the bitlength of the input, so even a division by \(n^{1/4}\) isn't necessarily that great.

I intend to apply this Lazy Linear Knapsack algorithm on subproblems in a more interesting solver, and I know that the sum of weights is bounded by the size of the initial problem, so that's good enough for me! After a constant (\(\approx 4\)) number of passes, the difference in item weight between the lower and upper bound on \(\lambda^\star\) should also be at most 1. One or two additional passes will get me near optimality (e.g., within \(10^{-4}\)), and the lower bound on \(\lambda^\star\) should thus yield a super-optimal solution that's infeasible by at most \(10^{-4},\) which is, for my intended usage (again), good enough.

Given an optimal enough \(\lambda^\star,\) we can construct an explicit solution in one pass, plus a simple fixup for critical items. This Lazy Knapsack seems pretty reasonable for parallel or GPU computing: each sampling pass only needs to read the items (i.e., no partitioning-like shuffling) before writing a fraction of the data to a sample buffer, and we only need a constant number of passes (around 6 or 7) in the worst case.


  1. It's more like a fractional percentile, but you know what I mean: the value such that the distribution function at that point equals \(k / n\). 

  2. Binomial bounds offer even stronger confidence intervals when the estimate is close to 0 or 1 (where Hoeffding's bound would yield a confidence interval that juts outside \([0, 1]\)), but don't impact worst-case performance. 

  3. Thanks to Hoeffding's inequality, again. 

  4. That's a troll. I think any self-respecting computer person would rather see it as a sort of pigeonhole argument. 

  5. We're juggling a handful of error rates here. We're checking whether the success rate for the Lazy Knapsack sampling subroutine is at least as high as \(1 - \alpha,\) as requested in the test parameters, and we're doing so with another randomised procedure that will give an incorrect conclusion at most once every one million invocations. 

  6. This classic Google study found 8% of DIMMs hit at least one error per year; that's more than one single-bit error every \(10^9\) DIMM-second, and they're mostly hard errors. More recently, Facebook reported that uncorrectable errors affect 0.03% of servers each month; that's more than one uncorrectable error every \(10^{10}\) server-second. If we performed one statistical test every nanosecond, the probability of memory failure alone would still dominate statistical errors by \(10^{20}\)! 

Joe Marshall: Afraid of Recursion

· 2 days ago
Here's a trick I learned several years ago for transferring time-series data. In the case in question, I needed to transfer a bunch of timestamped records, but the server had a quirk to it. If you asked for too many records at once, it would simply return an error code and give up on your request. There was no way to know beforehand how many records might exist in any given time span, so you could get an error code on nearly any request, unless it was for a very short time span. On the other hand, many requests for long time spans would succeed because they had few records in them. Despite this quirk, the code was really simple:
List<Record> xfer(Timestamp start, Timestamp end) {
    try {
        return tryFetchRecords(start, end);
    } catch (TooManyRecordsException e) {
        // midpoint is a hypothetical helper: Timestamps can't be averaged with +.
        Timestamp mid = midpoint(start, end);
        List<Record> records = xfer(start, mid);
        // List.addAll mutates in place and returns a boolean,
        // so don't return its result directly.
        records.addAll(xfer(mid, end));
        return records;
    }
}
On any request, if the server returned the error code, we would simply bisect the time span, recursively ask for each half separately, and combine the two halves. Should the bisected time span still contain too many records, the time span would be bisected again. The recursion would continue until the time span was small enough that the server could honor the request and return some records. The recursion would then unwind, combining the returned records into larger and larger lists until we had all the records for our original time span. Since the time span would be cut in half on each recurrence, the depth of recursion would be proportional to the logarithm (base 2) of the total number of records, which would be a reasonably small number even with an enormous number of records.

It's certainly possible to avoid recursion and do this iteratively by some form of paging, but the code would be slightly more complex. The server is not very cooperative, so there is no easy way to determine an appropriate page size beforehand, and the server doesn't support a “paging token” to help keep track of progress. The recursive solution finds an appropriate transfer size by trial and error, and keeps track of progress more or less automatically. An iterative paging solution would have to do these things more explicitly and this would make the iterative code a bit more complex. And why add any complexity when it isn't really necessary?

I thought this solution was really cool when I first saw it. I've used this trick for transferring time series data many times. It makes the server very simple to write because the protocol requires so little of it. It simply has to refuse to answer requests that return too many results. The client code is just about the 10 lines above.

But when I first suggest this code to people I usually get “push back” (that is, until they see it work in action, then they usually get on board with it). People seem unsure about the use of recursion and want a more complex protocol where the client and server negotiate a page size or cooperatively pass a paging token back and forth on each request and response. Their eyes glaze over as soon as they see the recursive call. They want to avoid recursion just because it's recursive.

I've seen “aversion to recursion” happen in a lot of circumstances, not just this one. Recursion isn't the solution to everything. No tool solves all problems. But it is an important tool that often offers elegant solutions to many problems. Programmers shouldn't be afraid of using it when it is appropriate.

Joe Marshall: Unsyndicated blog

· 4 days ago
I've noticed that my blog posts are replicated in Planet Lisp and Planet Scheme, and here I am spamming them with random math stuff. So I'm creating a new blog, Jrm's Random Blog, where I can feel free to post about math, science, computers in general, and whatever else bugs me, without spamming the Lisp and Scheme readers. I'll keep posting to Abstract Heresies, but try to keep it more Lisp and computer language focused.

Joe Marshall: Groups, semigroups, monoids, and computers

· 6 days ago
The day after I rant about mathematicians, I make a math post. “Do I contradict myself? Very well, then, I contradict myself, I am large, I contain multitudes.” — Walt Whitman

A group is a mathematical concept. It's pretty simple. It consists of a set, G, and an operation, *, which can be used to combine any two elements of G. What the set contains is not that important. It is the * operation we're interested in, and we can usually swap out G for another set without causing too many problems other than having to change the type signature of *. There are four axioms that * must obey:
  • Closure—combining any two elements of G using * just gives you another element in G.
    Note that this means you can build an arbitrary binary tree of combinations, e.g. (* (* a b) (* (* c d) e)). These trees will always be like a tree of cons cells. In some sense, the closure axiom is equivalent to saying that all the elements of G have the same type and that the * operator operates on values of that type and produces values of that type. The closure axiom along with the binary operation means that we can reduce any tree of combinations to a single value.
  • Associativity—(* (* a b) c) = (* a (* b c)) for any a, b, and c. This implies that you can take any arbitrary tree of combinations, e.g. (* (* a b) (* (* c d) e)), and simply flatten it into a list (* a b c d e), or, given the flat sequence (* a b c d e), add parentheses anywhere we like: (* a (* b c) d e). If we stop here and only have the closure and associativity axioms, we have what is called a “semigroup”. You can use the * operation to “fold” a semigroup down to a single value, or to keep an accumulator and incrementally fold elements into the accumulator.
  • Identity element—There has to be an identity element id such that (* id x) = (* x id) = x for all x. It will be unique. If you see the identity object in a combination (* a b id c d), you can simply remove it: (* a b c d). The identity element also comes in handy as an initial value when you are folding a sequence. If you have some concept that would be a group except it doesn't have an identity element, then you can often just make one up and add it to the set G.
  • Inverse element—For every element in G there has to be another element that, when combined with the first, gives you the identity. So if a is an element in G, there has to be some other element, call it b, such that (* a b) = (* b a) = id. The inverse element is usually notated with a little -1: a⁻¹. If you have an element in a combination right next to its inverse: (* a x x⁻¹ c), you can combine the element and its inverse to get the identity: (* a id c), and then remove the identity: (* a c).
Frequently you run into something that obeys all the axioms but the inverse element axiom. This is called a monoid. A monoid is very much like a group except that you can get "stuck" when manipulating it if you run into one of the non-invertible elements because there's no inverse to "undo" it. There are certain things about monoids that are true only “if the appropriate inverses exist”. You run into that qualifier a lot when dealing with monoids. You don't need that qualifier if you are dealing with a group because they do exist by axiom. Or we could say that calling something a group is simply shorthand for adding “if the appropriate inverses exist” everywhere.

What does this have to do with computers? Consider the set of all subroutines with the operation of concatenation. It is closed — concatenating two subroutines gives you a third subroutine. It is associative — you just concatenate them linearly. There is an identity element, usually called no-op. And many, but not all, subroutines have inverses. So we have a monoid.

Consider the set of all strings with the operation of concatenation. It is closed, associative, the empty string is the identity element. It is a monoid.

Consider the set of functions whose input type is the same as the result type with the operation of composition. It is closed, associative, the identity function is the identity element. It is a monoid. If we consider only the subset of functions that also have inverses, we have a group. This particular monoid or group comes in especially handy because composition of functions is so useful.
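
These monoids are one definition away from runnable code; here is a minimal sketch (mine), assuming MIT Scheme's fold-left:

;; A monoid as first-class data: an operation and its identity element.
;; Folding is all the machinery we need to use one.
(define (make-monoid op id) (cons op id))
(define (monoid-fold m lst) (fold-left (car m) (cdr m) lst))

(define (compose f g) (lambda (x) (f (g x))))

(define string-monoid  (make-monoid string-append ""))
(define compose-monoid (make-monoid compose (lambda (x) x)))

(monoid-fold string-monoid '("fold" "able"))          ; => "foldable"
((monoid-fold compose-monoid (list car cdr)) '(1 2))  ; => 2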

Consider the set of invertible 2x2 matrices with integer components, a determinant of 1 or -1, and the operation of matrix multiply. It is closed, associative, there is an identity matrix, and I already said just consider the invertible ones. It forms a group. This group comes in handy for implementing arbitrary precision arithmetic. (Thanks to Bradley Lucier for the correction of the condition on the determinant. This makes the matrix continue to have integer components upon inversion, keeping things closed.)
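
A minimal sketch of that group's operation and inverse (my illustration, representing a matrix as a nested list ((a b) (c d))):

;; 2x2 integer matrices under multiplication.  With determinant +1 or
;; -1, the inverse below also has integer entries, so the set stays
;; closed under both combination and inversion: a group.
(define (mat* m n)
  (let ((a (caar m)) (b (cadar m)) (c (caadr m)) (d (cadadr m))
        (e (caar n)) (f (cadar n)) (g (caadr n)) (h (cadadr n)))
    (list (list (+ (* a e) (* b g)) (+ (* a f) (* b h)))
          (list (+ (* c e) (* d g)) (+ (* c f) (* d h))))))

(define mat-identity '((1 0) (0 1)))

(define (mat-inverse m)  ; valid only when the determinant is +1 or -1
  (let ((a (caar m)) (b (cadar m)) (c (caadr m)) (d (cadadr m)))
    (let ((det (- (* a d) (* b c))))
      (list (list (* det d) (* det (- b)))
            (list (* det (- c)) (* det a))))))

(mat* '((1 1) (0 1)) (mat-inverse '((1 1) (0 1))))  ; => ((1 0) (0 1))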

The permutations of a list form a group. The integers under addition form a group.

These things are everywhere. And it isn't a coincidence. The concepts of a group, monoid, and semigroup are meant to capture the essence of what it is to have a foldable sequence of elements. (Can I complain about mathematicians here? They make up so much terminology and abstraction that it is virtually impossible to get at what they really mean. We're just talking about sequences of elements and trying to find some minimal axioms that you need to fold them, but trying to find literature that actually says that's what we're doing is like trying to pull hen's teeth.)

So what good are groups, monoids, and semigroups? Aside from the obvious fact that foldable sequences are ubiquitous and really useful, that is. Not immediately apparent from the axioms is that in addition to folding a sequence, you can transform a sequence into a different, but equivalent one. If the appropriate inverses exist (there's that phrase), you can “unfold” some or all elements of a sequence. So by judicious folding and unfolding, you can transform a sequence.

Here's an unusual abstract example. Consider a pipeline which has a set of nodes and communicates values of the same type between the nodes. Values accumulate at the nodes until they are transmitted to the next node in the pipeline. We start with all the values in the initial node (on the right) and transmit them to the left:
(pipeline (node) (node) (node a b c))  ;; transmit the a
(pipeline (node) (node a) (node b c))  ;; transmit the b
(pipeline (node) (node a b) (node c))  ;; transmit the a
(pipeline (node a) (node b) (node c))  ;; transmit the c
(pipeline (node a) (node b c) (node))  ;; transmit the b
(pipeline (node a b) (node c) (node))  ;; transmit the c
(pipeline (node a b c) (node) (node))  ;; done
If the values we transmit are drawn from a group, we can replace each node with the group's * operator:
(* identity identity (* a b c))  ;; transmit the a
(* identity (* identity a) (* b c))  ;; transmit the b
(* identity (* a b) (* identity c))  ;; transmit the a
(* (* identity a) (* identity  b) (* identity c))  ;; transmit the c
(* (* identity a) (* b c) identity)  ;; transmit the b
(* (* a b) (* identity c) identity)  ;; transmit the c
(* (* a b c) identity identity)  ;; done
The astute reader will notice that all we're doing is making use of the associativity axiom and moving the parenthesis around so that the values seem to move between the different nodes. But we preserve the invariant that the “value” of the entire pipeline doesn't change as the values move. The * operator need not be concatenate, which would give simple queuing behavior, but can be any operator satisfying the axioms giving us much more interesting pipelines. One implementation of arbitrary precision arithmetic transmits Möbius transformations along just such a pipeline to refine the upper and lower limits of a computed approximation. In this implementation, the * operator is the composition of Möbius transformations.

Here's a more concrete example. If you have a series of nested functions: (f (g x)) and both f and g take and return the same type, rewrite it as ((compose f g) x) and use a little group theory on it.
(f (g x))
((compose f g) x)
;; or more explicitly
((fold-left compose identity (list f g)) x)
If the appropriate inverses exist, then there will be another function h such that (compose f g) is equal to (compose h f), essentially allowing you to "slide" g to the left "through" f. It is relatively easy to see that h must be equivalent to (compose f g f⁻¹). Mathematicians say that h is conjugate to g. Conjugates always have a form like aba⁻¹. By finding conjugates, you can take a sequence and slide the elements left and right through other elements. This also allows you to fold things out of order. (Or in the pipeline example, transmit items out of order.) If we were left folding into an accumulator, folding h before f is equivalent to folding g after f. Another way of looking at it is this. Suppose we're standing to the left of f and looking through the “lens” of f at g. h is what g "looks like" when viewed through f.

If we want, we can define slide such that (compose slide (compose f g)) is equivalent to (compose h f). slide is (compose h f g⁻¹ f⁻¹). (This isn't a generic slide sequence; it only works on (compose f g). It ought to be an identity because (compose f g) is equivalent to (compose h f).) I complained that mathematicians provided too few concrete examples, so here is a concrete example using list permutations:
> (reverse (rotate-left '(a b c d)))
(a d c b)

;; rewrite as explicit fold-left of compose
> ((fold-left compose identity (list reverse rotate-left)) '(a b c d))
(a d c b)

;; sliding rotate-left through reverse turns it into rotate-right
> ((fold-left compose identity (list rotate-right reverse)) '(a b c d))
(a d c b)

;; A sequence that when composed with (list reverse rotate-left) turns it into
;; (rotate-right reverse)
> (define slide 
    (fold-left compose identity (list rotate-right reverse rotate-right reverse)))
slide

> ((fold-left compose identity (list slide reverse rotate-left)) '(a b c d))
(a d c b)

;; rewrite back to direct procedure calls
> (rotate-right (reverse '(a b c d)))
(a d c b)

;; and slide ought to be an identity
> ((fold-left compose identity (list slide)) '(a b c d))
(a b c d)
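
(The transcripts assume a few helpers the post never defines; here are minimal sketches, in the spirit of MIT Scheme, that reproduce the results shown.)

;; Hypothetical definitions for the helpers used in these transcripts.
(define (compose f g) (lambda (x) (f (g x))))
(define (identity x) x)

(define (fold-left f init lst)   ; as in MIT Scheme / R6RS
  (if (null? lst)
      init
      (fold-left f (f init (car lst)) (cdr lst))))

(define (rotate-left lst)        ; (a b c d) -> (b c d a)
  (append (cdr lst) (list (car lst))))

(define (rotate-right lst)       ; (a b c d) -> (d a b c)
  (let ((r (reverse lst)))
    (cons (car r) (reverse (cdr r)))))

(define (swap-first-two lst)     ; (a b c d) -> (b a c d)
  (cons (cadr lst) (cons (car lst) (cddr lst))))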

Or suppose you have (f (g x)), but for some reason you want (g (f x)) (which would, in general, be a different value unless f and g happen to commute). Again, rewrite (f (g x)) as ((compose f g) x) and apply a little group theory. If the appropriate inverses exist, there will be a function commute-fg such that (compose commute-fg (compose f g)) is equivalent to (compose g f). With a little thought, you can see that commute-fg is equivalent to (compose g f g⁻¹ f⁻¹). (Again, this isn't a generic commute; it only causes this specific f and g to commute.) commute-fg is called a commutator because it makes f and g commute. Commutators always have the form aba⁻¹b⁻¹. By finding commutators and inserting them in the right place, you can take a sequence and swap adjacent elements. Again, a concrete example with lists:
;; an illustration of what swap-first-two does
> (swap-first-two '(a b c d))
(b a c d)

;; we're given
> (reverse (swap-first-two '(a b c d)))
(d c a b)

;; but we want, for some reason to reverse first
> (swap-first-two (reverse '(a b c d)))
(c d b a)

;; rewrite as fold-left of compose
> ((fold-left compose identity (list reverse swap-first-two)) '(a b c d))
(d c a b)

;; define our commutator
;; note that swap-first-two and reverse are their own inverses
> (define commute-fg
    (fold-left compose identity (list swap-first-two reverse swap-first-two reverse)))

;; make f and g commute
;; observe that it returns the desired result
> ((fold-left compose identity (list commute-fg reverse swap-first-two)) '(a b c d))
(c d b a)

There are two interesting things here. First, notice that in both examples I convert (f (g x)) to ((fold-left compose identity (list f g)) x) and then proceed to ignore x and just consider (fold-left compose identity (list f g)) as if x didn't exist. I've abstracted away the x. (Of course I have to eventually supply the x if I want an answer, but it only comes back at the last moment.) Second, notice that although slide and commute-fg are foldable sequences, I use them as if they were higher-order functions operating on the foldable sequence (compose f g) to transform it, first into (compose h f), second into (compose g f). This second thing is a neat trick. We're taking a function that operates on lists and treating it as if it were a higher-order function that operates on functions. This is called the “action” of slide and commute-fg because it appears as if elements of the set G of our group can “act” directly on other elements.

Every element in the underlying set G of a group has an action associated with it which operates directly on other elements in G. This is an important concept in group theory. Now earlier I said that the actual elements of G don't matter much, so the action must be more closely tied to the operator *. And if we swap out G for another set we'll still have the same actions, they'll just be associated with the elements of the new set (in an isomorphic way). The actions are pretty abstract.

There's a lot more one could say about the actions. They are a rich source of interesting math. My brain is getting fatigued with all this abstraction, so I'll leave the topic be for now.

If group theory is about the essence of what it means to have a foldable sequence, then category theory is about the essence of composition. They offer two somewhat different approaches to similar material. What do you do with sequences but compose them? What comes from composition but a sequence? Many concepts in group theory carry over into category theory. Naturally a completely different set of terminology is used, but the concepts are there.

But that's enough group theory for today and category theory can wait until later posts.

Joe Marshall: Math is hard, let's go shopping

· 7 days ago
I find mathematics, with all its weird terminology and abstraction and equations, hard to understand. That's kind of funny coming from someone like me who makes a living from a branch of mathematics. I find computers and programming to be rather easy to understand — probably because I've had a lot of practice. But computer science is just applied logic and programming is arguably just the study of the computable functions, so you'd think math would come naturally. It doesn't.

One problem I've found is that as much as mathematicians pride themselves on rigor, they tend to be a bit sloppy and leave out important details. Computer scientists don't leave out important details because then the programs won't run. It's true that too much detail can clutter things up, but leaving out the detail and relying on “context” just increases the intellectual burden on the reader.

I will give mathematicians credit for thinking about edge cases perhaps more than a computer scientist would. It can be easy to be a bit complacent with edge cases because the computer will likely do something even if you don't think too hard about what it ought to do. But a good computer scientist tries to reduce the number of edge cases or at least make them coherent with the non-edge cases.*

Mathematicians seem to take perverse pleasure in being obscure. Computer scientists strive to be as obvious as possible because like as not, they are the ones that have to revisit the code they wrote and don't want to have to remember what they were thinking at the time. It's just easier to spell things out explicitly and obviously so that you can get back up to speed quickly when you have to debug your own stupid code. Every time I pick up some literature on category theory, I get hit with a “Wall of Terminology” denser than the “Wall of Sound” on a Phil Spector recording. It's fundamentally simple stuff, but it is dressed up in pants so fancy one has a hard time extracting the plain meaning. What seems to be universal in category theory is my difficulty in getting past page 4.

I once read a mathematical paper that talked about an algorithm with three tuning parameters: α, β, and another α. No decent computer programmer would give the same name to two different variables. Which α was which was supposed to be “obvious” from the context. The brainpower needed to keep track of the different αs was absurd and a complete waste of effort when calling the variable something else, like γ, would have done the trick.

And don't ask a mathematician to write computer code. That's the one time they'll leave out all the abstraction. Instead of a nice piece of abstract, functional code, you'll get a mess of imperative code that smashes and bashes its way to a solution with no explanation of how it got there. It's a lot easier to take some abstract, functional code and figure out a more optimal, probably imperative, way to do it than it is to take a more optimal imperative piece of code and figure out the abstract, functional meaning of it.

I've found it to be extremely helpful when a computer science paper includes one or two concrete examples of what it is talking about. That way, if I try to go implement code that does what the paper suggests, there's some indication that I'm on the right track. I'm more confident that I understand the paper if I have working code that produces the exact same values the paper's authors got. It's harder to find concrete examples in a math paper, and it is easier to think you know what it says but be far off base if there aren't any examples.

Maybe I shouldn't blame mathematicians so much and look a little closer to home. Perhaps I should study harder instead of demanding to be spoon fed difficult concepts. But then I read Feynman, S&ICP, S&ICM, and Jaynes and discover that maybe I just need a simple explanation that makes sense to me.

Sturgeon's Revelation is “90% of everything is crap”. This is true of both mathematical papers and computer science papers.



*An old joke illustrates the importance of thinking of edge cases: A programmer implements a bar. The test engineer goes in and orders a beer, orders zero beers, orders 999999999 beers, orders -1 beers, orders a lizard, and declares the bar ready for release. The first customer comes in and asks to use the restroom. The bar catches fire and burns down.

Charles Zhang: SBCL20 in Vienna

· 7 days ago

Last month, I attended the SBCL20 workshop in Vienna. Many thanks to the organizers and sponsors for inviting me to give a talk about my RISC-V porting work to SBCL and allowing me to otherwise throw around some ideas in the air with a bunch of SBCLites.

This was my first Lisp conference. It was really nice being able to meet a lot of people who up until then had only been floating voices on the internet. Given the location, it's no surprise that most of the attendees were European, but what did surprise me was the actual turnout for the event, some even having attended SBCL10. Like seemingly many others, I was certainly not expecting around 25 people to attend. (Robert Smith had given a paltry estimate of about 5!)

On Sunday we had a nice tour of some cool places around Vienna by our gracious host, Phillip Marek. I got to the group right as they were at the Donauturm, and had lunch with them afterwards. We then moved to Karlsplatz, where Phillip hunted for daguerreotypes. Fortune smiled upon us that day: it was a nice, sunny 10°C in Vienna in the middle of December!

Then on Monday, we proceeded to start the workshop proper, at about 8:30 am. We were hosted by the Bundesrechenzentrum (the Austrian Federal Computing Center), and accordingly, after Christophe kicked off the workshop, had some BRZ representatives talk about how the work they do combats things like tax fraud in Austria. We had a nice room with a lot of space, mixer-style tables, and snacks in the back of the room. At first, the schedule was that Douglas Katzman was to go Monday morning, and Robert Smith, in the afternoon, with my talk scheduled for Tuesday morning. I ended up asking Robert if he would switch with me as I was pretty anxious to get my talk over with that day... And thus we pressed forward into our first talk of the day, maybe at around 10:30 am.

SBCL & Unix

Doug Katzman talked about his work at Google getting SBCL to work with Unix better. For those of you who don't know, he's done a lot of work on SBCL over the past couple of years: not only has he added a lot of new features to the GC and made it play better with applications which have alien parts to them, he has also done a tremendous amount of cleanup on the internals and has helped SBCL become even more Sanely Bootstrappable. That's a topic for another time, and I hope Doug or Christophe will have the time to write up about the recent improvements to the process, since it really is quite interesting.

Anyway, what Doug talked about was his work on making SBCL more amenable to external debugging tools, such as gdb and external profilers. It seems like they interface with aliens a lot from Lisp at Google, so it's nice to have backtraces from alien tools understand Lisp. It turns out a lot of prerequisite work was needed to make SBCL play nice like this, including implementing a non-moving GC runtime, so that Lisp objects and especially Lisp code (which are normally dynamic space objects and move around just like everything else) can't evade the aliens and will always have known locations.

Now it's time for questions, and hacking around until the next talk! (oh, wait a second...) Christophe had encouraged us all to 'go forth and produce something' in the meantime, but I needed to add a few more examples to my slides and eat something before I gave my talk. We had some cold sandwiches of various types for the day, and people started working on various projects.

RISC-V porting talk, VOPs

Around 1:10 pm or so, I went up to the podium to get my laptop set up for the talk. The HDMI cable didn't work when plugged into my laptop directly, but curiously, fitting the HDMI cable through a USB3 converter and connecting that to my laptop made the projector work. Anyway, I got a laser pointer, which, now that I think back on it, probably waved around way too much and was probably fairly distracting. The slides are now posted on the website if you're curious what I talked about. I ended up following them pretty closely and was unsure how much detail to get into, because I wasn't sure of the audience's familiarity with the SBCL internals, which porting a new backend is usually going to get pretty deep into.

There was general knowledge of the internal VOP facility in SBCL though, which is usually defined for a given backend to translate the generic, machine-independent low-level intermediate representation (IR2 in internals parlance, VMR (virtual machine representation) in "public" internals documentation parlance) to target-specific machine code. Lots of people want to write their own inline assembly and integrate it with high-level Lisp functions (with register allocation done for them), usually so they can use some hardware feature SBCL doesn't expose at a high level directly. For example, SIMD instructions are popular with people who want to do number crunching in SBCL. VOPs are a nice way to do this, so that's why so many people knew about them. The exception was VOP lifetimes: lots of people were confused about those. The only reason I ended up understanding VOP lifetimes was that I had debugged too many backend issues where the register allocator destroyed a register I needed the value of. In fact, the first patches I got (from Phillip Mathias Schäfer) for SB-ROTATE-BYTE support on RISC-V had lifetime issues, which I fixed before merging. And, incidentally, right after my talk, Doug showed me a tool written by Alastair Bridgewater called voplife.el that visualizes VOP lifetimes and told me that he never writes VOPs without that tool. Well, that would've been nice to have! And then Christophe told me that of course the tool didn't exist when he was doing backend work.
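
For readers who haven't seen one, here is a minimal sketch of what a VOP definition looks like. It is loosely modeled on the define-vop forms in SBCL's own backends, not anything from the talk; the storage classes, primitive types, and instruction mnemonics vary per backend, so treat it as illustrative rather than definitive:

(define-vop (fast-+/fixnum)
  (:translate +)                  ; the Lisp function being implemented
  (:policy :fast-safe)
  (:args (x :scs (any-reg))       ; the register allocator assigns real
         (y :scs (any-reg)))      ; registers to x, y, and r
  (:arg-types tagged-num tagged-num)
  (:results (r :scs (any-reg)))
  (:result-types tagged-num)
  (:generator 2                   ; cost estimate used by the compiler
    (inst add r x y)))            ; emit a three-operand add

The lifetime confusion mentioned above typically bites when a VOP writes its result before it has finished reading its arguments: without the right lifetime annotations, the register allocator is free to assign an argument and the result to the same register.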

Speaking of 'back in my day', in my slides I gave out (to use an Irish expression) about how long bootstrapping took with the emulator. Christophe proceeded to tell me about his experience porting to HPPA machines in the early 2000's, when it took about a full day to wait for the system to bootstrap... It's easy to forget that Moore's law happens (happened?) sometimes.

Oh, and just so I remember for the future, I got some questions from Tobias Rittweiler about how I handled memory model issues. I basically said I didn't, because I was porting a new CPU, not an OS, since the Linux support routines handle almost all of those concerns. Then Doug asked me why Load-Immediate-64 on RISC-V was so complicated: couldn't I have just loaded a word from memory? To which I responded that it's not clear whether it's more expensive to load a word from memory or to materialize it with only register operations. This is a problem they solved in the RISC-V GCC backend, and last time I checked, the RISC-V backend for LLVM just punts and does the basic, unoptimized sequence. Then he asked me why I started with Cheney GC, which I deflected straight away to Christophe, who made the initial decision. He basically said, "it's easy to fit Cheney GC entirely in my head at once." Fair.

Monday lightning talks

After the talk we had some more time to work on stuff, like demos for the lightning talks. I didn't really do much besides talking to people about our internals though. Rui from 3e asked me about supporting tracing through local functions. One of my main interests is actually optimizing away local functions, so, I'm probably not the one who's going to implement it (unless you paid me to), but it seems fairly straightforward to port the support which was added to CMUCL after the fork.

Then we had our Monday afternoon lightning talks. I remember this just being a 'if you have something to say or demo come up and do it' kind of thing. Marco Heisig went up first to talk about SB-SIMD. I had helped him debug getting the VOPs installed into SBCL properly a little bit before his demo, and he showed us some cool support he's adding. He ended up sending a follow up email after the conference with a more formal proposal to integrate it into SBCL. I hope he has the time to move it forward and have it in-tree in some form or another.

Then james anderson, in what is a very memorable moment for me, went up for his 'demo' which ended up being a quite punctuated proclamation: 'We need concurrent GC!' Indeed, we do.

I'm already starting to forget the details of the remaining talks on Monday. Folks who were there, help me remember!

Dinner

We had an official SBCL20 dinner afterwards, and it was time for some Austrian food. I sat in front of Marco and next to Luís Oliveira and enjoyed some Viennese schnitzel. I asked for tap water and got something that looked like but was definitely not tap water...

SBCL & quantum computing

Tuesday morning was a similar drill. We had a new (smaller) room, and this time, we needed our passports for access. Tuesday was lightning talk day, but first, Robert Smith gave a talk about how they use SBCL for quantum computing at Rigetti. They have a state-of-the-art quantum compiler and a quantum simulator, but Robert first gave us a quick primer on some physics and math (including tensor products in a concrete way, which was a nice breath of fresh air after I had what seemed like endless classes characterizing them according to their universal property). His slides are online, check them out! He's also interested in making SBCL play nice with aliens, but in a different way than Doug is. For one thing, he's interested in making an ECL-like API for SBCL to expose their quantum compiler code, compiled with SBCL, as a traditional C API. What really struck me in his talk was their compiler's ability to propagate the fidelity of qubits, letting it sometimes 'miscompile' to get a more 'accurate' answer. (Scare quotes because quantum is spooky.)

Also, they rewrote one of their Lisp applications in another language due to outside pressure, and the rewrite turned out slower. It's also cool to know that, according to Robert, most of his team actually did not have much Lisp exposure before joining, which gives a sense that the community is still kicking.

Tuesday lightning talks

We proceeded to hack on more stuff after the talk and had a hot meal for lunch this time. I actually started working on something this time. A conversation with Doug the previous day had me saying that we do loop invariant code motion and stuff, to which Doug said, "but we don't." So, I looked and he was right, although I was genuinely surprised because it is a transformation our compiler framework easily supports. We do do a lot of traditional optimizations, in addition to some state of the art dynamic language type inference (stuff all written in the late 80's!) since our intermediate representations are well suited for that sort of thing. In fact, SBCL's front-end intermediate representation is essentially CPS in a flow graph, which anticipates a lot of research in the area done in the 90's and 2000's, basically being locally equivalent to SSA and not falling into the trap of being bound to scope trees.

So I started working on loop invariant code motion, and while I didn't quite finish by the end of the conference, I did get a proof of concept afterwards that almost self builds and works alright. Though after discovering some ancient notes by the original implementer (Rob MacLachlan) on the issue, I've decided I took the wrong approach after all. (The issue is that I worked on the level of IR1 instead of IR2.) Oh well.

Meanwhile, we had lightning talks starting around 1:00 pm with a short break at around 2:45 pm, if I recall correctly. The full list of topics that day is on the website, in order. We descended into a bit of a wishlist game, with Phillip talking about where to move hosting for SBCL. (The options were: stay with SourceForge, move to GitHub, move to GitLab, or move to common-lisp.net hosted GitLab. It was honestly quite the controversy.) Then I briefly talked about the loop invariant code motion work I was doing, and asked the audience who had heard of Block Compilation. I don't remember the exact number, but I think there were more who didn't know than who knew. After complaining a little about how SBCL doesn't have it even though CMUCL does, I made it one of my big wishlist items, since I think that the ability to do whole program optimization is pretty important for a high-performance compiler, especially for a dynamic language like Lisp where most of the dynamic facilities go unused once an application is up and running in production (usually). Well, I ended up (re)implementing it yesterday, so maybe people will learn about it again. I might write it up sooner or later. Then Stelian Ionescu talked about his wishlist items (such as gutting a lot of the 'useless' backends) and we opened it up to the floor.

Wrap-up

After the official end of the conference, most of the crew went across the street into a mall to eat and chat for the rest of the night. Doug ended up showing me some cross disassembler stuff after some prompting about its removal, while Luís did a great job getting relocatable-heaps working on Windows next to us, which he promptly got upstreamed after the workshop. Great to see that new projects were motivated and finished as a result of SBCL20. It was a fun time, and, as Zach Beane said, I'm hoping we organize and meet again soon!


Joe MarshallPalindromes, redux, and the Sufficiently Smart Compiler

· 8 days ago
The Sufficiently Smart Compiler is mentioned by authors as shorthand for “a compiler that performs nearly all reasonable optimizations, but in particular this one I want”. Many attempts were made up through the 80's and maybe into the 90's to write a Sufficiently Smart Compiler that would perform all “reasonable” optimizations, and although many impressive results have been obtained, there always seem to be fairly obvious optimizations that remain unoptimized. These days it seems that people realize that there will be good compilers and some very good compilers, but never a Sufficiently Smart Compiler. Nonetheless, it is worth considering a Sufficiently Smart Compiler as a tool for thought experiments.

I was curious what would be necessary for a Sufficiently Smart Compiler to generate optimal code for the palindrome problem given the naive algorithm.

The naive algorithm is inspired by the axioms
  • A zero or one element string is a palindrome.
  • If the first char matches the last char, and the middle is a palindrome, the result is a palindrome.
and gives us this:
(define (palindrome1? string)
  (or (< (string-length string) 2)
      (and (char=? (string-ref string 0)
                   (string-ref string (- (string-length string) 1)))
           (palindrome1? (substring string 1 (- (string-length string) 1))))))

The higher performing algorithm is inspired by the idea of keeping two pointers to each end of a string and comparing the characters at the pointers. If the characters are the same, you move the pointers inward and when they meet, you have seen a palindrome. If at any point the characters differ, you don't have a palindrome:
(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (>= front-pointer rear-pointer)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string rear-pointer))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (- (string-length string) 1)))
As you can see, these really aren't very different to start with. Both algorithms are iterative and both work their way in from the outside of the string. There are basically two differences. First, access to the rear of the string is either by a rear pointer, or by using the string-length of the string and subtracting 1. Second, the iterative call either uses substring or moves the pointers closer together.

First, let's assume that our processor can reference through an indexed offset. This would mean we could point at the element one beyond the rear-pointer and not incur overhead. This isn't an unreasonable assumption for a CISC architecture such as the x86, but it would probably cost one instruction of overhead on a RISC architecture. So the second algorithm becomes this:
(define (palindrome2? string)
  (define (scan front-pointer rear-pointer)
    (or (< (- rear-pointer front-pointer) 2)
        (and (char=? (string-ref string front-pointer)
                     (string-ref string (- rear-pointer 1)))
             (scan (+ front-pointer 1) (- rear-pointer 1)))))
  (scan 0 (string-length string)))

Now this next assumption is a bit more of a stretch. The implementation of palindrome1? uses substring on each iteration and that's going to result in a lot of string copying. If our implementation used “slices” instead of copying the string, then there will be a lot less copying going on:
(define (palindrome1? string)
  (or (< (- (slice-end string) (slice-start string)) 2)
      (and (char=? (string-ref string (slice-start string))
                   (string-ref string (- (slice-end string) 1)))
           (palindrome1? 
             (substring string (+ (slice-start string) 1) (- (slice-end string) 1))))))

It is not uncommon for a compiler to introduce internal procedures for looping, so we can do that.
(define (palindrome1? string)
  (define (scan slice)
    (or (< (- (slice-end slice) (slice-start slice)) 2)
        (and (char=? (slice-ref slice (slice-start slice))
                     (slice-ref slice (- (slice-end slice) 1)))
             (scan (subslice slice (+ (slice-start slice) 1) (- (slice-end slice) 1))))))
  (scan (make-slice string 0 (string-length string))))

We'll enter fantasy land again and let our compiler be smart enough to “spread” the slice data structure into the argument list of scan. This is no doubt asking too much from our compiler, but the information is available and it could in theory be done:
(define (palindrome1? string)
  (define (scan slice-start slice-end)
    (or (< (- slice-end slice-start) 2)
        (and (char=? (slice-ref string slice-start)
                     (slice-ref string (- slice-end 1)))
             (scan (+ slice-start 1) (- slice-end 1)))))
  (scan 0 (string-length string)))

And now we have palindrome2? (modulo renaming).

This doesn't really prove anything. But with a couple of somewhat unlikely compiler tricks, the naive version could be transformed to the more optimized version. It suggests that it would be surprising, but not a complete shock, for an ambitious compiler writer to attempt it.

I wish someone would write that Sufficiently Smart Compiler.

Joe MarshallCons cells vs. Linked Lists

· 9 days ago
Cons cells and linked lists are the meat and potatoes of Lisp programming. Linked lists are the primary structure that everything operates on and cons cells are the Lego blocks they are made of. For an experienced Lisp programmer, cons cells just fade into the background. You know they are there as the glue holding everything together, but it is the linked list that you keep in mind. One could construct all sorts of weird trees, dags, and graphs out of cons cells, but in general you keep things in nice linear singly-linked lists terminated with a nice, full-stop NIL.

Cons cells are nearly the perfect concrete implementation of an abstract two-tuple. They are first-class objects: you can assign them to variables, stuff them in arrays, pass and return them as values, and check them for identity. They are orthogonal to other data types; only a cons-cell returns 't to consp. They are opaque — except for the defined operations of car and cdr, you cannot access the contents of a cons cell. And while they are usually implemented as adjacent memory locations, they hide their representation and there have been many Lisps that have used unusual concrete representations of cons cells like parallel arrays of the car and cdr parts or bit codes to omit the cdr altogether through “cdr coding”. All operations on cons cells can be reduced to the basic operations cons, consp, car, cdr, (setf car), and (setf cdr). (If we had immutable cons cells, we could even get rid of the last two, but then we'd want some other means for creating circular and semi-circular structure.*)

So I find it somewhat surprising that the standard linked list implementation in Lisp is just a terrible example of an abstract data type. This no doubt happened because linked lists were standardized well before abstract data types were really understood.

The big problem with linked lists is that instead of being orthogonal to other data types, they are a subdomain of cons cells. The representation of a singly linked list is completely exposed: it is a cons cell, without even a wrapper object to tell you if you are dealing with the list itself or its representation. It is only by common convention that certain cons cell structures are considered to represent linked lists. And it isn't immediately clear whether the representation is meant to be a pointer to the first pair of the list, or to the entire “spine” of the list. It is often treated both ways. There is little distinction between a list primitive and a cons cell primitive, which usually doesn't get you into trouble, except in those few cases where it can cause major confusion, like when you have to handle “improper” or “dotted” lists.

Lists are mutable because their representation is mutable and not hidden. It is possible to mutate the representation such that it no longer represents a list anymore, “magically” changing any list that includes the mutated structure into something else. This means either a lot of defensive copying must be done if lists are used as arguments or passed as values, or an unenforced convention to avoid mutation of list structure must be developed in the Lisp culture. We've been pretty good at the latter, even documenting when you can and when you cannot rely on lists being mutated by library functions, but there are always a few people who go against the grain for the sake of “efficiency” (or plain orneriness) and write code that is impossible to use because you cannot easily tell what might be mutated behind your back.

With any abstract data type, there are conceptually a pair of functions that are used to transport objects across the abstraction barrier. One, call it abs->rep, takes an abstract object and exposes its representation. It is usually provided automatically by the compiler and called upon entry to the object's methods. In Java, for example, it establishes bindings for the this pointer and the private and protected fields of the object so that the method can use them. The complementary function, call it rep->abs, takes the representation of an object and hides it in an opaque, abstract version for clients of the object to use. The clients have no way to manipulate the representation of the object because they only have access to the opaque, abstract version. In Java, for example, the compiler automatically does this after object construction and when the this pointer is returned properly cast to the abstract data type. The this pointer and private and protected fields of the object go out of scope and are no longer accessible.

These functions are usually provided by the compiler and often have no real implementation. The compiler simply ensures that representation comes into scope when the method is called (conceptually calling abs->rep) and that the representation goes out of scope when the method returns (conceptually calling rep->abs). No actual code is generated or exists at run time. It's easy to forget this is happening because the compiler does all the work for you. You just toggle the little bit in your head about whether you are “inside” the object or “outside” the object. If you forget, you can just examine the lexical nesting to see if the representation is in scope.
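
To make the pair of functions concrete, here is a minimal sketch in Scheme of what a properly opaque list type could look like (using SRFI-9 records; the names olist, abs->rep, and rep->abs are mine). The record constructor plays the role of rep->abs and the accessor plays abs->rep:

(define-record-type <olist>
  (rep->abs spine)          ; hide a representation in an opaque object
  olist?
  (spine abs->rep))         ; expose the representation to the implementation

(define the-empty-olist (rep->abs '()))
(define (olist-cons item lst) (rep->abs (cons item (abs->rep lst))))
(define (olist-first lst) (car (abs->rep lst)))
(define (olist-rest lst) (rep->abs (cdr (abs->rep lst))))

A client that only sees the-empty-olist, olist-cons, olist-first, and olist-rest never touches the cons cells in the spine, so it cannot mutate them or confuse the list with its representation.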

In Lisp, however, for a singly linked list, not only are these functions omitted, they are completely fictitious. It is only in the programmer's head that what was once considered a linked list is now to be considered a pointer to the head cell of the list (abs->rep), and only probably in the programmer's head that the reverse (rep->abs) is happening on the way out. It doesn't matter much if he or she forgets this because the written code is the same either way. It only matters if he or she somewhere down the line uses a cons-cell operation where a list operation is actually what should be used. This can lead to common rookie mistakes like
  • Using cons where list is wanted, yielding (1 . 2) where (1 2) is desired. (The “unwanted dot” problem.)
  • Using list where cons is wanted, yielding (1 (2)) where (1 2) is desired. (The “too many parenthesis” problem.)
  • Confusion about whether ((1 2) 3 4) is meant to be a three-tuple of a list and two integers, or a two-tuple of two lists. (It's both, depending on the unwritten intent of the programmer.)
  • Using cons or list where append is wanted, yielding ((1 2) 3 4) or ((1 2) (3 4)) when (1 2 3 4) is desired. (Again, “too many parenthesis”.)
  • Use of (append ... (list <element>)) to “cons” to the “right end” of a list, leading to O(n²) algorithms rather than O(n). (See the sketch below.)
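
That last mistake deserves a sketch. The idiomatic fix (mine, not from the post) is to cons onto the front and reverse once at the end:

(define (iota-slow n)     ; O(n²): append recopies the result every time
  (let loop ((i 0) (result '()))
    (if (= i n)
        result
        (loop (+ i 1) (append result (list i))))))

(define (iota-fast n)     ; O(n): cons to the front, then reverse once
  (let loop ((i 0) (result '()))
    (if (= i n)
        (reverse result)
        (loop (+ i 1) (cons i result)))))
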
Now don't get me wrong. I like Lisp and I like linked lists. And I'm not suggesting we avoid using them in favor of some other well-designed abstract data type. I just think they're an awful example of how to implement an abstract data type, and perhaps that's why it is difficult for beginners to learn how to use them properly. It might also be worthwhile to implement a Lisp with proper (and immutable) abstract linked lists. It wouldn't make much difference to experienced programmers who are already used to applying the representation/abstraction interface in their heads, but it might make it easier for novices to manipulate linked lists and cons cells (and keep them apart).

If you want to be completely contrary, consider Olin Shivers's suggestion: all objects — cons cells, strings, integers, null, etc. — are lists. It's just that every object other than a cons cell is a zero element dotted list. Now rather than being a subtype of cons cells, lists become a supertype of all objects. This viewpoint can probably be made coherent, but it does raise a lot of questions. Here are some that come to mind:
  • Is (length '(1 2 . 3)) the same as (length '(1 2 3))? If not, what is (length '(1 2 . 3))?
  • Should lists retain their “dottedness” when passed through functions like memq or map? What is (memq 2 '(1 2 . 3))? What about (memq 3 '(1 2 . 3))?
  • What is (reverse '(1 2 . 3))? Is (compose reverse reverse) an identity?
This was extensively discussed on the SRFI-1 mailing list, so I won't rehash the discussion here. The questions I raised above, and many more, were raised and discussed. Eventually, it was decided that continuing to be backwards compatible was an important consideration. (Personally, I think the notion plays havoc with the group theoretic properties of lists, and that is enough to make it suspect.)

There is a good argument that “dotted” lists are rarely used and almost always a mistake, but they are built in to the grammar of Scheme as an indicator of “rest” arguments, so getting rid of them would require some other way to specify “rest” arguments. Racket takes things further by allowing doubly dotted lists to indicate infix notation: (a . < . b)

Just for kicks, I took things in the other direction and wrote some C# code that implements singly-linked lists as their own abstract data type using special, immutable cons cells that require that their CDR be either an existing singly-linked list or the empty list. “Dotted” lists are not a problem because you simply cannot construct one. The representation of a list is explicitly coded as a pointer to the head cons cell of the list. The code illustrates how the abstract list is turned into the pointer to the cons cell when it is carried across the abstraction barrier and how it is turned back into an abstract list when carried back out. Again, I'm not suggesting anyone use the code, or take it as a serious proposal. (For one thing, it doesn't address what to do about circular lists, or the dotted lists in the Scheme grammar.) It was just a fun hack for illustrative purposes. It is available here for those interested.

*Many years back, Henry Baker said “C'mon, cons cells should just be immutable.” (if I am remembering the exact quote correctly). I agree with his sentiment. Combine immutable cons cells with “hash consing” and the appropriate equality primitives and you get directed acyclic graphs (and their space properties) “for free”. We'd either have to do without circular structure or use another means to achieve it. Since circular structure often leads to divergent programs I wouldn't consider it a great loss, but some may disagree. Perhaps they might be assuaged by a nice set of primitive procedures for creating and manipulating circular cons cell structure.

Joe MarshallJust for fun, transformations on lists

· 10 days ago
The mathematician in me likes to think about what happens to data and programs when you apply certain transformations to them. Here's a simple example. Imagine the function swap that simply makes a new cons cell by swapping the car and cdr of an existing cons cell:
(define (swap cell) (cons (cdr cell) (car cell)))

> (swap '(1 . 2))
(2 . 1)

> (swap (swap '(1 . 2)))
(1 . 2)
As we'd expect, two swaps are equivalent to no swaps at all. Indeed, any even number of swaps is equivalent to no swaps, and any odd number of swaps is equivalent to one swap.

What if we call swap on a list?
> (swap '(1 2 3))
((2 3) . 1)
That's odd looking. But swapping again returns it to normal
> (swap (swap '(1 2 3)))
(1 2 3)

But swap only swaps the top-level cell. Let's define deep-swap that descends into the car and cdr if possible:
(define (deep-swap cell)
  (cons (if (pair? (cdr cell)) (deep-swap (cdr cell)) (cdr cell))
        (if (pair? (car cell)) (deep-swap (car cell)) (car cell))))

> (deep-swap '((1 . 2) . (3 . 4)))
((4 . 3) 2 . 1)
Wait, what? Oh, the list printer is just interpreting the second cons cell as a part of a top-level list. We can see this by trying this:
> '((4 . 3) . (2 . 1))
((4 . 3) 2 . 1)
So we just have to be aware of list printer eliding the dots for us.

What if we call deep-swap on a list?
> (deep-swap '(1 2 3 4))
((((() . 4) . 3) . 2) . 1)
Fortunately, deep-swap, like swap, undoes itself.
> (deep-swap (deep-swap '(1 2 3 4)))
(1 2 3 4)
It's easy to see that swap and deep-swap should commute.
> (equal? (swap (deep-swap '((a . b) . (c . d))))
          (deep-swap (swap '((a . b) . (c . d)))))
#t
Alternatively, define compose
;; Simple composition of two functions
(define (compose2 outer inner)
  (lambda (x) (outer (inner x))))

;; Composition of arbitrary number of functions
(define (compose f . fs)
  (if (null? fs)
      f
      (compose2 f (apply compose fs))))

> (equal? ((compose swap deep-swap) '((a . b) . (c . d)))
          ((compose deep-swap swap) '((a . b) . (c . d))))
#t
So you can just move all the swaps together and all the deep-swaps together, then remove pairs of each one.

swap and deep-swap don't have very complex behavior, so let's turn to lists. We can define rotate-left as follows:
(define (rotate-left l) (append (cdr l) (list (car l))))

> (rotate-left '(1 2 3 4))
(2 3 4 1)

> (rotate-left (rotate-left '(1 2 3 4)))
(3 4 1 2)
(This is horribly inefficient, so this is just for fun, not production code.) Now what happens when we combine rotate-left with reverse?
> (reverse (rotate-left (reverse '(1 2 3 4))))
(4 1 2 3)

(define rotate-right (compose reverse rotate-left reverse))

> (rotate-right '(1 2 3 4))
(4 1 2 3)
rotate-left becomes rotate-right when used “under” reverse. Of course rotate-left doesn't commute with reverse: if it did, we could rewrite (reverse (rotate-left (reverse '(1 2 3 4)))) as (reverse (reverse (rotate-left '(1 2 3 4)))), where the reverses cancel each other and we're left with a plain rotate-left.

We can define “deep” versions of reverse, rotate-left, and rotate-right:
(define (deeply f)
  (lambda (l)
    (if (list? l)
        (f (map (deeply f) l))
        l)))

> ((deeply reverse) '((1 2 3) 4 5 (6 7 (8 9 10))))
(((10 9 8) 7 6) 5 4 (3 2 1))

> ((deeply rotate-left) '((1 2 3) 4 5 (6 7 (8 9 10))))
(4 5 (7 (9 10 8) 6) (2 3 1))

> ((deeply rotate-right) '((1 2 3) 4 5 (6 7 (8 9 10))))
(((10 8 9) 6 7) (3 1 2) 4 5)
Naturally, a (deeply rotate-left) will undo a (deeply rotate-right). You might suspect that the composition of (deeply reverse), (deeply rotate-left), and (deeply reverse) is equivalent to (deeply rotate-right), and you'd be right (I suspected as much, too, but it didn't seem so obvious, so I checked).

Notice that the deeper list structures have 3 elements each, but the topmost list structure has 4 elements, so three deep rotations are equivalent to one shallow rotation in the opposite direction, i.e. (compose rotate-left (deeply rotate-left) (deeply rotate-left) (deeply rotate-left)) is an identity. In fact, the shallow rotate-left commutes freely with (compose (deeply rotate-left) (deeply rotate-left) (deeply rotate-left))
;; These are all equivalent identities
(compose rotate-left (deeply rotate-left) (deeply rotate-left) (deeply rotate-left))
(compose (deeply rotate-left) rotate-left (deeply rotate-left) (deeply rotate-left))
(compose (deeply rotate-left) (deeply rotate-left) rotate-left (deeply rotate-left))
(compose (deeply rotate-left) (deeply rotate-left) (deeply rotate-left) rotate-left)

Arbitrary application of these operators will scramble your list structure much like arbitrary rotations will scramble a Rubik's cube. The analogy is more than skin deep: group theory can be used to describe and analyze the combinatorics of both. Group theory tells us that operations of the form F⁻¹GF are likely to be interesting. And indeed:
> ((compose rotate-right reverse rotate-left) '((1 2 3) 4 5 (6 7 (8 9 10))))
(4 (1 2 3) (6 7 (8 9 10)) 5)

(define involute (compose rotate-right reverse rotate-left))
swaps the outside elements with the inside ones. And if we compose a rotate-left with this, we find we've reversed only the last three elements in the list ((1 2 3) (6 7 (8 9 10)) 5 4).

Just using these operators, there seems to be no way to get to get to '((1 2 3) 5 4 (6 7 (8 9 10))) (at least I couldn't find one), so I defined another operator:
(define (call-on-tail f)
  (lambda (x)
    (cons (car x) (f (cdr x)))))
which leaves the head element alone while applying the transformation to the rest.
> ((compose rotate-left reverse (call-on-tail rotate-right) involute)
   '((1 2 3) 4 5 (6 7 (8 9 10))))
((1 2 3) 5 4 (6 7 (8 9 10)))

These functions can move elements up and down the list structure:
(define (leftmost c)
  (if (pair? c)
      (leftmost (car c))
      c))

(define (replace-leftmost c new-value)
  (if (pair? c)
      (cons (replace-leftmost (car c) new-value) (cdr c))
      new-value))

(define (rightmost c)
  (if (pair? c)
      (if (null? (cdr c))
          (rightmost (car c))
          (rightmost (cdr c)))
      c))

(define (replace-rightmost c new-value)
  (if (pair? c)
      (if (null? (cdr c))
          (cons (replace-rightmost (car c) new-value) (cdr c))
          (cons (car c) (replace-rightmost (cdr c) new-value)))
      new-value))

(define (swap-ends l)
  (replace-leftmost (replace-rightmost l (leftmost l)) (rightmost l)))

> (swap-ends '((1 2 3) 4 5 (6 7 (8 9 10))))
((10 2 3) 4 5 (6 7 (8 9 1)))

> ((compose involute swap-ends involute) '((1 2 3) 4 5 (6 7 (8 9 10))))
((1 2 3) 5 4 (6 7 (8 9 10)))

> ((deeply swap-ends) '((1 2 3) 4 5 (6 7 (8 9 10))))
((6 2 1) 4 5 (8 7 (10 9 3)))

> ((compose (deeply swap-ends)
            (deeply swap-ends)
            (deeply swap-ends)
            (deeply swap-ends)
            (deeply swap-ends)) '((1 2 3) 4 5 (6 7 (8 9 10))))
((1 2 3) 4 5 (6 7 (8 9 10)))

There's no real application for all this, except maybe to come up with some puzzles. It's just fun to noodle around with list transformations to see what you can come up with, and to practice your list manipulation skills. You really need a language with a REPL to play around like this. A parsimonious syntax like Lisp helps, too. It would have been a bit more difficult to fool around if I had to put the appropriate commas, curly braces, brackets, and semicolons in just the right places.

None of these operations work on circular lists. You can imagine rotations and reversals working on fully circular lists, though I'm not sure how they'd make sense on lists with circular tails; either way, it would be challenging to make them work. They also don't work on “dotted” lists: they throw an error when they run into the dotted item at the end of the list. But it is fairly easy to imagine how they might be made to work on a dotted list, and that would be much less of a challenge to implement.

Vsevolod DyomkinProgramming Algorithms: Approximation

· 10 days ago

This chapter will be a collection of stuff from somewhat related but still distinct domains. What unites it is that all the algorithms we will discuss are, after all, targeted at calculating approximations to some mathematical functions. There are no advanced data structures involved, neither is the aim to find a clever way to improve the runtime of some common operations. No, these algorithms are about calculations and computing an acceptable result within the allocated time budget.

Combinatorial Optimization

Dynamic Programming is a framework that can be used for finding the optimal value of some loss function when there are multiple configurations of the problem space that result in different values. Such search is an example of discrete optimization for there is a countable number of states of the system and a distinct value of the cost function we're optimizing corresponding to each state. There are also similar problems that have an unlimited and uncountable number of states, but there is still a way to find a global or local optimum of the cost function for them. They comprise the continuous optimization domain. Why is optimization not just a specialized area relevant to a few practitioners but a toolbox that every senior programmer should know how to utilize? The primary reason is that it is applicable in almost any domain: the problem just needs to be large enough to rule out simple brute force. You can optimize how the data is stored or how the packets are routed, how the blueprint is laid out or the servers are loaded. Many people are just not used to looking at their problems this way. Also, understanding optimization is an important prerequisite for having a good grasp of machine learning, which is revolutionizing the programming world.

DP is an efficient and, overall, great optimization approach, but it can't succeed if the problem doesn't have an optimal substructure. Combinatorial Optimization approaches deal with finding a near-optimum for problems where an exhaustive search would require O(2^n) computations. Such problems are called NP-hard, and a classic example is the Travelling Salesman Problem (TSP). The task is to find an optimal order of edges in a cycle spanning all vertices of a fully-connected weighted graph. As we saw previously, this problem doesn't have an optimal substructure, i.e. an optimal partial solution isn't necessarily a part of the best overall one, so taking the shortest edge doesn't allow the search procedure to narrow down the search space when looking at the next vertex. A direct naive approach to TSP would enumerate all the possible variants and select the one with a minimal cost. However, the number of variants is n!, so this approach becomes intractable very fast. A toy example of visiting all the capitals of the 50 US states has about 10^64 variants. This is where quantum computers promise to overturn the situation, but while we're waiting for them to mature, the only feasible approach is developing approximation methods that will get us a good enough solution in polynomial (ideally, linear) time. TSP may look like a purely theoretical problem, but it has real-world applications: besides vehicle routing, automated drilling and soldering in electronics is another example. Yet, even more important is that there are many other combinatorial optimization problems, and, in essence, the approaches to solving one of them apply to all the rest. I.e., like with shortest path, coming up with an efficient solution to TSP allows us to efficiently solve a very broad range of problems over a variety of domains.

So, let's write down the code for the basic TSP solution. As usual, we have to select the appropriate graph representation. From one point of view, we're dealing with a fully-connected graph, so every representation will work, and a matrix would be the most convenient one. However, storing an n^2-sized array is not the best option, especially for a large n. A better "distributed" representation might be useful here. Yet, for the TSP graph, an even better approach would be to do the opposite of our usual optimization trick: trade computation for storage space. When the graph is fully-connected, there usually exists some kind of underlying metric space that contains all the vertices. The common example is the Euclidean space, in which each vertex has a coordinate (for example, the latitude and longitude). Anyway, whichever way to represent the vertex position is used, the critical requirement is the existence of a metric that may be calculated at any time (and fast). Under such conditions, we don't have to store the edges at all. So, our graph will be just a list of vertices.

Let's use the example with the US state capitals. Each vertex will be represented as a pair of floats (lat & lon). We can retrieve the raw data from the Wikipedia article about the US capitols (with an 'o') and extract the values we need with the following code snippet[1], which cuts a few corners:

(defstruct city
  name lat lon)

(defparameter *wp-link* "https://en.wikipedia.org/w/index.php?title=List_of_state_and_territorial_capitols_in_the_United_States&action=edit&section=1")

(defparameter *cs*
  (with ((raw (drakma:http-request *wp-link*))
         (coords-regex (ppcre:create-scanner "\\{\\{coord\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|(\\d+)\\|(\\d+)\\|([.\\d]+)\\|.\\|type"))
         (capitals (list)))
    (flet ((dms->rad (vec off)
             (* (/ pi 180)
                (+ (? vec (+ off 0))
                   (/ (? vec (+ off 1)) 60)
                   (/ (? vec (+ off 2)) 3600)))))
      (dolist (line (split #\Newline (slice raw
                                            (search "{| class=\"wikitable sortable\"" raw)
                                            (search "</textarea><div class='editOptions'>" raw))))
        (when-it (and (starts-with "|" line)
                      (search "{{coord" line))
          (with ((_ coords (ppcre:scan-to-strings coords-regex line))
                 (coords (map* 'read-from-string coords)))
            (push (make-city :name (slice line (position-if 'alpha-char-p line)
                                          (position-if (lambda (ch) (member ch '(#\] #\|)))
                                                       line :start 1))
                             :lat (dms->rad coords 0)
                             :lon (dms->rad coords 3))
                  capitals)))))
    (coerce capitals 'vector)))

CL-USER> (length *cs*)
50

We also need to define the metric. The calculation of distances on Earth, though, is not as straightforward as on a plane. Usually, as a first approximation, the haversine formula is used, which provides an estimate of the shortest distance over the surface "as the crow flies" (ignoring the relief).

(defun earth-dist (c1 c2)
  (with ((lat1 (? c1 'lat))
         (lat2 (? c2 'lat))
         (a (+ (expt (sin (/ (- lat2 lat1) 2))
                     2)
               (* (cos lat1)
                  (cos lat2)
                  (expt (sin (/ (- (? c2 'lon) (? c1 'lon)) 2))
                        2)))))
    (* 1.2742e7 ; Earth diameter
       (atan (sqrt a) (sqrt (- 1 a))))))

With the metric at our disposal, let's define the function that will calculate the length of the whole path and use it for a number of random paths (we'll use the RUTILS function shuffle to produce a random path).

(defun path-length (path)
  (let ((rez (earth-dist (? path 0) (? path -1))))
    (dotimes (i (1- (length path)))
      (:+ rez (earth-dist (? path i) (? path (1+ i)))))
    rez))

CL-USER> (path-length *cs*)
9.451802301259182d7
CL-USER> (path-length (shuffle *cs*))
9.964776273250546d7
CL-USER> (path-length (shuffle *cs*))
1.009761841183094d8

We can see that an average path may have a length of around 100 thousand kilometers. However, we don't know anything about the shortest or the longest one, and to find out reliably, we'd have to evaluate 50! paths... Even as we accept the sad fact that this is not possible with our current technology, it's not time to give up yet. Yes, we may not be able to find the absolute best path, but at least we can try to improve on the random one. Already, the three previous calculations had a variance of 5%. So, if we're lucky, maybe we could hit a better path purely by chance. Let's try a thousand paths using our usual argmin pattern:

(defun random-search (path n)
  (let ((min (path-length path))
        (arg path))
    (loop :repeat n :do
      (with ((path (shuffle path))
             (len (path-length path)))
        (when (< len min)
          (:= min len
              arg path))))
    (values arg
            min)))

CL-USER> (:= *print-length* 2)
2
CL-USER> (random-search *cs* 1000)
(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)
#S(CITY :NAME "Montpelier, Vermont" :LAT 0.772521512027179d0 ...) ...)
7.756170773802838d7

OK, we've got a sizable 20% improvement. What about 1,000,000 combinations?

CL-USER> (time (random-search *cs* 1000000))
Evaluation took:
31.338 seconds of real time
...
(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)
#S(CITY :NAME "Helena, Montana" :LAT 0.813073800024579d0 ...) ...)
6.746660953705506d7

Cool, another 15%. Should we continue increasing the size of the sample? Maybe, after a day of computations, we could get the path length down by another 20-30%. And that's already a good gain. Surely, we could also parallelize the algorithm or use a supercomputer in order to analyze many more variants. But there should be something smarter than simple brute force, right?

Local Search

Local Search is the "dumbest" of these smart approaches, built upon the following idea: if we had a way to systematically improve our solution, instead of performing purely random sampling, we could arrive at better variants much faster. The local search procedure starts from a random path and continues improving it until an optimum is reached. This optimum will be a local one (hence the name), but it will still be better than what we started with. Besides, we can run the optimization procedure many times from different initial points, basically getting the benefits of the brute force approach. We can think of multiple-run local search as sampling + optimization.

(defun local-search (path improve-fn)
  (let ((min (path-length path))
        (cc 0)) ; iteration count
    (loop
      (:+ cc)
      (if-it (call improve-fn path)
             (:= min (path-length it)
                 path it)
             (return (values path
                             min
                             cc))))))

For this code to work, we also need to supply improve-fn. Coming up with it is where the creativity of the algorithmic researcher needs to be channeled. Different problems (and even a single problem) may allow for different approaches. For TSP, several improvement possibilities have been discovered so far, and all of them use the planar (2d) nature of the graph we're processing. That is an additional constraint with a useful consequence: if the paths between two pairs of nodes intersect, there are certainly also shorter paths between them that are nonintersecting. So, swapping the edges will improve the whole path. If we were to draw a picture of this swap, it would look like this (the edges A-D and C-B intersect, while A-B and C-D don't, and hence their total length is shorter):

- A   B -          - A - B -
    X        ==>
- C   D -          - C - D -

This rule allows us to specify the so-called 2-opt improvement procedure:

(defun 2-opt (path)
  (loop :repeat (* 2 (length path)) :do
    (with ((len (length path))
           (v1 (random len))
           (v1* (if (= #1=(1+ v1) len) 0 #1#))
           (v2 (loop :for v := (random len)
                     :when (and (/= v v1) (/= v (1- v1))) :do (return v)))
           (v2* (if (= #2=(1+ v2) len) 0 #2#)))
      (when (< (+ (path-length (vec (? path v1) (? path v2)))
                  (path-length (vec (? path v1*) (? path v2*))))
               (+ (path-length (vec (? path v1) (? path v1*)))
                  (path-length (vec (? path v2) (? path v2*)))))
        (let ((beg (min v1* v2*))
              (end (max v1* v2*)))
          (return (concatenate 'vector
                               (subseq path 0 beg)
                               (reverse (subseq path beg end))
                               (subseq path end))))))))

Note that we do not need to perform a complicated check for path intersection (which requires an algorithm of its own; there are a number of papers dedicated to this task). In fact, we don't care if there is an intersection: we just need to know that the new path, which consists of the newly replaced edges and a reversed part of the path between the two inner nodes of the old edges, is shorter. One more thing to notice is that this implementation doesn't perform an exhaustive analysis of all possible edge swaps, which is what the original 2-opt algorithm suggests (an O(n^2) operation). Here, we select just a random pair. Both variants are acceptable, and ours is simpler to implement.

CL-USER> (local-search *cs* '2-opt)
#(#S(CITY :NAME "Jackson, Mississippi" :LAT 0.5638092223095238d0 ...)
#S(CITY :NAME "Baton Rouge, Louisiana" :LAT 0.5315762080646039d0 ...) ...)
3.242702077795514d7
111

So, right away, we've got a roughly 2x improvement over the random-search path that was obtained with a much larger number of iterations. Iteration counting was added to the code in order to estimate the work we had to do. To make a fair comparison, let's run random-search with the same n (111):

CL-USER> (random-search *cs* 111)
#(#S(CITY :NAME "Boise, Idaho" :LAT 0.7612723873453388d0 ...)
#S(CITY :NAME "Springfield, Illinois" :LAT 0.6946151297363367d0 ...) ...)
7.522044767585556d7

But this is still not 100% fair, as we haven't yet factored in the time needed for the 2-opt call, which is much heavier than what random search does on each iteration. By my estimate, 111 iterations of local-search took 4 times as long, so...

CL-USER> (random-search *cs* 444)
#(#S(CITY :NAME "Lansing, Michigan" :LAT 0.745844229097319d0 ...)
#S(CITY :NAME "Springfield, Illinois" :LAT 0.6946151297363367d0 ...) ...)
7.537249874357127d7

Now the runtimes are the same, but there's not really much improvement in the random search outcome. That's expected: as we have already observed, achieving a significant improvement in random-search results requires performing orders of magnitude more operations.

Finally, let's define multi-local-search to leverage the power of random sampling:

(defun multi-local-search (path n)
  (let ((min (path-length path))
        (arg path))
    (loop :repeat n :do
      (with ((cur (local-search (shuffle path) '2-opt)))
        (when (< #1=(path-length cur) min)
          (:= min #1#
              arg cur))))
    (values arg
            min)))

CL-USER> (time (multi-local-search *cs* 1000))
Evaluation took:
22.394 seconds of real time
...
#(#S(CITY :NAME "Atlanta" :LAT 0.5890359059538811d0 ...)
#S(CITY :NAME "Montgomery, Alabama" :LAT 0.5650930224896327d0 ...) ...)
2.8086843039667137d7

Quite a good improvement that took only 20 seconds to achieve!

As a final touch, let's draw the paths on the map. It's always good to double-check the result using some visual approach when it's available. Here is our original random path (Anchorage and Honolulu are a bit off due to the issues with the map projection):

This is the result of random search with a million iterations:

And this is our multistart local search outcome. Looks nice, doesn't it?

2-opt is the simplest path-improving technique. There are more advanced ones like 3-opt and the Lin-Kernighan heuristic. Yet, the principle remains the same: for local search to work, we have to find a way to locally improve our current best solution.

Another direction of development for the basic algorithm, besides better local improvement procedures and trying multiple times, is devising a way to avoid being stuck in local optima. Simulated Annealing is the most well-known technique for that. The idea is to replace the unconditional selection of a better variant (if it exists) with a probabilistic one. The name and inspiration for the technique come from the physical process of cooling molten materials down to the solid state. When molten steel is cooled too quickly, cracks and bubbles form, marring its surface and structural integrity. Annealing is a metallurgical technique that uses a disciplined cooling schedule to efficiently bring the steel to a low-energy, optimal state. The application of this idea to the optimization procedure introduces the temperature parameter T. At each step, a new state is produced from the current one. For instance, it can be achieved using 2-opt, although the algorithm doesn't require the candidate state to be better than the current one, so even such a simple thing as a random swap of vertices in the path is admissible. Next, unlike with local search, the transition to the candidate state doesn't happen unconditionally: a better candidate is always accepted, while a worse one is accepted only with a probability that grows with T (classically, (exp (- (/ delta T))), where delta measures how much worse the candidate is). Initially, we start with a high value of T and then decrease it following some annealing schedule. Eventually, T falls to 0 towards the end of the allotted time budget. In this way, the system is expected to wander, at first, towards a broad region of the search space containing good solutions, ignoring small fluctuations; then the drift towards low-energy regions becomes narrower and narrower; and, finally, it transitions to ordinary local search according to the steepest descent heuristic.
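
Here is a minimal sketch of the annealing loop in the spirit of the chapter's other snippets (my own illustration: next-fn, the geometric cooling schedule, and the parameter defaults are all arbitrary choices, not code from the book):

(defun annealing (path next-fn &key (n 100000) (t0 100.0) (decay 0.999))
  (let ((cur path)
        (cost (path-length path))
        (temp t0))
    (loop :repeat n :do
      (let* ((cand (funcall next-fn cur))
             (cand-cost (path-length cand))
             (delta (- cand-cost cost)))
        ;; accept improvements unconditionally; accept worsenings with
        ;; probability exp(-delta/T), which shrinks as T cools
        (when (or (minusp delta)
                  (< (random 1.0) (exp (- (/ delta temp)))))
          (setf cur cand
                cost cand-cost))
        (setf temp (* temp decay))))
    (values cur cost)))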

Evolutionary Algorithms

Local search is the simplest example of a family of approaches that are collectively called Metaheuristics. All the algorithms from this family operate, in general, by sampling and evaluating a set of solutions which is too large to be completely evaluated. The difference is in the specific approach to sampling that is employed.

A prominent group of metaheuristic approaches is called Evolutionary (and/or nature-inspired) algorithms. It includes such methods as Genetic Algorithms, Ant Colony and Particle Swarm Optimization, Cellular and even Grammatical Evolution. The general idea is to perform optimization in parallel by maintaining the so-called population of states and alter this population using a set of rules that improve the aggregate quality of the whole set while permitting some outliers in hopes that they may lead to better solutions unexplored by the currently fittest part of the population.

We'll take a brief glance at evolutionary approaches using the example of Genetic Algorithms, which are probably the most well-known technique among them. The genetic algorithm (GA) views each possible state of the system as an individual "genome" (encoded as a vector). GA is best viewed as a framework that requires specification of several procedures that operate on the genomes of the current population:

  • The initialization procedure which creates the initial population. After it, the size of the population remains constant, but each individual may be replaced with another one obtained by applying the evolution procedures.
  • The fitness function that evaluates the quality of the genome and assigns some weight to it. For TSP, the length of the path is the fitness function. For this problem, the smaller the value of the function, the better.
  • The selection procedure specifies which items from the population to use for generating new variants. In the simplest case, this procedure can use the whole population.
  • The evolution operations which may be applied. The usual GA operations are mutation and crossover, although others can be devised also.

Mutation operates on a single genome and alters some of its slots according to a specified rule. 2-opt may be a valid mutation strategy, although even the generation of a random permutation of the TSP nodes may work if it is applied to a part of the genome and not to the whole. By controlling the magnitude of mutation (what portion of the genome is allowed to be involved in it), it is possible to choose the level of stochasticity in this process. But the key idea is that each change should retain at least some resemblance to the previous version, or we'll just end up with stochastic search.
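
For instance, a TSP mutation procedure might look like this sketch (mine: the magnitude default and the segment-shuffle strategy are arbitrary), reusing the same RUTILS shuffle we used for random paths:

(defun mutate (genome &key (magnitude 0.2))
  (let* ((n (length genome))
         (len (max 2 (floor (* magnitude n))))  ; portion to scramble
         (beg (random (1+ (- n len))))
         (new (copy-seq genome)))
    ;; shuffle only the chosen segment, leaving the rest of the genome
    ;; intact so the child still resembles the parent
    (setf (subseq new beg (+ beg len))
          (shuffle (subseq new beg (+ beg len))))
    new))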

The crossbreeding operation isn't, strictly speaking, necessary in a GA, but some implementations use it. This process transforms two partial solutions into two others by swapping some of their parts. Of course, it can't be applied to TSP directly, as it would result in the violation of the main problem constraint of producing a loop that spans all the nodes. Instead, another procedure called the ordered crossover should be used. Without crossbreeding, GA may be considered a parallel version of local search.
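
A sketch of the ordered crossover producing one child (my own rendition of the textbook operator, not the book's code): pin a random slice of the first parent, then fill the remaining slots with the second parent's genes in their original order, skipping the ones already present:

(defun ordered-crossover (p1 p2)
  (let* ((n (length p1))
         (child (make-array n :initial-element nil))
         (beg (random n))
         (end (+ beg (random (1+ (- n beg))))))
    (loop :for i :from beg :below end        ; copy the pinned slice of p1
          :do (setf (aref child i) (aref p1 i)))
    (let ((rest (coerce (remove-if (lambda (g) (find g child)) p2) 'list)))
      (loop :for i :from 0 :below n          ; fill the gaps in p2's order
            :when (null (aref child i))
              :do (setf (aref child i) (pop rest))))
    child))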

Here is the basic GA skeleton. It requires definition of the procedures init-population, select-candidates, mutate, crossbread, and score-fitness.

(defun ga (population-size &key (n 100))
  (let ((genomes (init-population population-size)))
    (loop :repeat n :do
      (let ((candidates (select-candidates genomes)))
        (dolist (ex (mapcar 'mutate candidates))
          (push ex genomes))
        (dolist (ex (crossbread candidates))
          (push ex genomes)))
      (:= genomes (take population-size (sort genomes 'score-fitness))))))

This template is not a gold standard; it can be tweaked and altered, but you've got the general idea. The other evolutionary optimization methods also follow the same principles but define different ways to evolve the population. For example, Particle Swarm Optimization operates by moving candidate solutions (particles) around in the search space according to simple mathematical formulae over their position and velocity. The movement of each particle is influenced by its local best known position, as well as guided toward the global best known positions in the search space. And those are, in turn, updated as better positions are found by other particles. By the way, the same idea underlies the Particle Filter algorithm used in signal processing and statistical inference.
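
In code, the core of PSO is just the velocity update. The sketch below is mine (w, c1, and c2 are the conventional inertia and attraction coefficients); it moves one particle given its own best-known position pbest and the swarm's best gbest:

(defun pso-step (x v pbest gbest &key (w 0.7) (c1 1.5) (c2 1.5))
  (let ((new-v (map 'vector
                    (lambda (vi xi pi gi)
                      (+ (* w vi)                        ; inertia
                         (* c1 (random 1.0) (- pi xi))   ; pull to own best
                         (* c2 (random 1.0) (- gi xi)))) ; pull to swarm best
                    v x pbest gbest)))
    (values (map 'vector '+ x new-v)  ; new position
            new-v)))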

Branch & Bound

Metaheuristics can be, in general, classified as local search optimization methods for they operate in a bottom-up manner by selecting a random solution and trying to improve it by gradual change. The opposite approach is global search that tries to systematically find the optimum by narrowing the whole problem space. We have already seen the same pattern of two alternative ways to approach the task — top-down and bottom-up — in parsing, and it also manifests in other domains that permit problem formulation as a search task.

How is a top-down systematic evaluation of the combinatorial search space even possible? Obviously, not in its entirety. However, there are methods that allow the algorithm to rule out significant chunks that certainly contain suboptimal solutions and narrow the search to only the relevant portions of the domain that may be much smaller in cardinality. If we manage to discard, this way, a large number of variants, we have more time to evaluate the other parts, thus achieving better results (for example, with Local search).

The classic global search is represented by the Branch & Bound method. It views the set of all candidate solutions as a rooted tree with the full set being at the root. The algorithm explores branches of this tree, which represent subsets of the solution set. Before enumerating the candidate solutions of a branch, the branch is checked against upper and lower estimated bounds on the optimal solution and is discarded if it cannot produce a better solution than the best one found so far by the algorithm. The key feature of the algorithm is efficient bounds estimation. When it is not possible, the algorithm degenerates to an exhaustive search.

Here is a skeleton B&B implementation. Similar to the one for Genetic Algorithms, it relies on providing implementations of the key procedures separately for each search problem. For the case of TSP, the function will accept a graph and all the permutations of its vertices comprise the search space. We'll use the branch struct to represent the subspace we're dealing with. We can narrow down the search by pinning a particular subset of edges: this way, the subspace will contain only the variants originating from the possible permutations of the vertices that are not attached to those edges.

(defstruct branch
  (upper most-positive-fixnum)
  (lower 0)
  (edges (list)))

The b&b procedure will operate on the graph g and will have an option to either work until the shortest path is found or terminate after n steps.

(defun b&b (g &key n)
  (with ((cur (vertices g))
         (min (cost cur))
         (arg cur)
         (q (list (make-branch :upper min :lower (lower-bound g ())))))
    (loop :for i :from 0
          :for branch := (pop q) :while branch :do
      (when (eql i n) (return))
      (if (branchp branch)
          (dolist (item (branch-out branch))
            ;; we leave only the subbranches that can,
            ;; at least in theory, improve on the current solution
            (when (< (branch-lower item) min)
              (push item q)))
          (let ((cost (branch-upper branch)))
            (when (< cost min)
              (:= min cost
                  arg branch)))))
    (values arg
            min)))

The branch-out function is rather trivial: it will generate all the possible variants by expanding the current edge set with a single new edge, and it will also calculate the bounds for each variant. The most challenging part is figuring out the way to compute the lower-bound. The key insight here is the observation that each path in the graph is not shorter than half the sum of the shortest edges attached to each vertex. So, the lower bound for a branch with pinned edges e1, e2, and e3 will be the sum of the lengths of these edges plus half the sum of the shortest edges attached to all the other vertices that those edges don't cover. It is the most straightforward and raw approximation that will allow the algorithm to operate. It can be further improved upon — a home task for the reader is to devise ways to make it more precise and estimate if they are worth applying in terms of computational complexity.
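
Expressed in code, that bound could look like the following sketch (edge-length, vertex-edges, and uncovered-vertices are hypothetical helpers for whatever graph representation is chosen):

(defun lower-bound (g pinned-edges)
  (+ (reduce '+ pinned-edges :key 'edge-length)
     ;; half the sum of the shortest edge at every vertex
     ;; not yet covered by a pinned edge
     (/ (loop :for v :in (uncovered-vertices g pinned-edges)
              :sum (reduce 'min (vertex-edges g v) :key 'edge-length))
        2)))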

B&B may also use additional heuristics to further optimize its performance at the expense of producing a slightly more suboptimal solution. For example, one may wish to stop branching when the gap between the upper and lower bounds becomes smaller than a certain threshold. Another improvement is to use a priority queue instead of the stack used in the example, in order to process the most promising branches first.
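
For the priority-queue improvement, one minimal option (an assumption of this sketch, not part of the original skeleton) is to avoid a full-blown priority-queue library and simply keep q sorted by the lower bound:

;; insert a new branch so that Q stays sorted by its lower bound;
;; (pop q) will then always take the most promising branch first
(defun push-prioritized (item q)
  (merge 'list (list item) q '< :key 'branch-lower))

In b&b, (push item q) would then become (:= q (push-prioritized item q)).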

One more thing I wanted to mention in the context of global heuristic search is Monte Carlo Tree Search (MCTS), which, in my view, uses a very similar strategy to B&B. It is the currently dominant method for finding near-optimal paths in the decision trees of turn-based and other similar games (like go or chess). The difference between B&B and MCTS is that, typically, B&B will use a conservative exact lower bound for determining which branches to skip. MCTS, instead, estimates the potential of a branch to yield the optimal solution by sampling a number of random items from the branch and averaging their scores. So, it can be considered a "softer" variant of B&B. The two approaches can also be combined, for example, to prioritize the branches in the B&B queue. The term "Monte Carlo", by the way, is applied to many algorithms that use uniform random sampling as the basis of their operation.
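
To make the contrast concrete, here is a sketch of such a sampling-based estimate; sample-solution, a hypothetical helper that draws a random complete solution from the branch's subspace, is assumed:

(defun estimate-branch (branch n-samples)
  ;; average the costs of a few random complete solutions
  ;; drawn from the branch
  (/ (loop :repeat n-samples
           :sum (cost (sample-solution branch)))
     n-samples))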

Gradient Descent

The key idea behind Local Search was to find a way to somehow improve the current best solution and change it in that direction. It can be similarly utilized when switching from discrete problems to continuous ones. In this realm, the direction of improvement (actually, the best possible one) is called the gradient (or rather, the opposite of the gradient). Gradient Descent (GD) is the principal optimization approach in the continuous space that works in the same manner as Local Search: find the direction of improvement and advance along it. There's also a colloquial name for this approach: hill climbing. It has a lot of variations and improvements that we'll discuss in this chapter. But we'll start with the code for the basic algorithm. Once again, it will be a template that can be filled in with specific implementation details for the particular problem. We see this "framework" pattern recurring over and over in optimization methods, as most of them provide a general solution that can be applied in various domains and appropriately adjusted for each one.

(defun gd (fn data &key n (learning-rate 0.1) (precision 1e-6))
  (let ((ws (init-weights fn))
        (cost (cost fn ws))
        (i 0))
    (loop
      ;; shift the weights one step against the gradient
      (update-weights ws learning-rate
                      (grad fn ws data))
      ;; terminate when the improvement of the cost becomes negligible
      ;; (the old value of COST is read before := updates it)
      ;; or when the iteration budget N is exhausted
      (when (or (< (- cost (:= cost (cost fn ws)))
                   precision)
                (eql n (:+ i)))
        (return)))
    (values ws
            cost)))

This procedure optimizes the weights (ws) of some function fn. Whether or not we know the mathematical formula for fn doesn't really matter: the key is to be able to compute grad, which may be done analytically (using a formula that is simply coded up) or in a purely data-driven fashion (which is what Backprop, which we have seen in the previous chapter, does). ws will usually be a vector or a matrix, and grad will be an array of the same dimensions. In the simplest (and uninteresting) toy case, both are just scalar numbers.

Besides, in this framework, we need to define the following procedures:

  • init-weights sets the starting values in the ws vector according to fn. There are several popular ways to do that: the obvious setting of all weights to zero, which doesn't work in conjunction with backprop; sampling from a uniform distribution with a small amplitude; and more advanced heuristics like Xavier initialization.
  • update-weights has a simple mathematical formulation: (:- ws (* learning-rate gradient)). But as ws is usually a multi-dimensional structure, in Lisp we can't just use - and * on it, as these operations are defined only for numbers (a sketch is given after this list).
  • it is also important to be able to calculate the cost function (also often called the "loss"). As you can see from the code, the GD procedure may terminate in two cases: either it has used up the whole iteration budget assigned to it, or it has approached the optimum so closely that, at each new iteration, the change in the value of the cost function is negligible. Apart from this usage, tracking the cost function is also important for monitoring the "learning" process (another name for the optimization procedure that is popular in this domain). If GD is operating correctly, the cost should monotonically decrease at each step.
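
Here is a minimal sketch of update-weights for the simple case when ws and the gradient are 1-dimensional Lisp arrays of numbers (the multi-dimensional case is analogous):

(defun update-weights (ws learning-rate grad)
  ;; elementwise: ws[i] -= learning-rate * grad[i]
  (dotimes (i (length ws))
    (decf (aref ws i) (* learning-rate (aref grad i))))
  ws)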

This template is the most basic one, and there are a lot of ways to further improve and tune it. One important direction is controlling the learning rate: similar to Simulated Annealing, it may change over time according to some schedule or heuristic.
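
For instance, here is a sketch of one such schedule — exponential decay, chosen purely for illustration:

(defun decayed-learning-rate (initial-rate i &key (decay 0.99))
  ;; the rate shrinks geometrically with the iteration number I
  (* initial-rate (expt decay i)))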

Another set of issues that we won't elaborate upon now are related to dealing with numeric precision, and they also include such problems as vanishing/exploding gradients.

Improving GD

In the majority of interesting real-world optimization problems, the gradient can't be computed analytically using a formula. Instead, it has to be recovered from the data, and this is a computationally intensive process: for each item in the dataset, we have to run the "forward" computation and then compute the gradient in the "backward" step. A diametrically opposite approach, in terms of both computation speed and quality of the gradient, is to take just a single item and use its gradient as an approximation of the actual gradient. From the statistical point of view, after a long sequence of such samples, we should converge to some optimum anyway. This technique, called Stochastic Gradient Descent (SGD), can be considered a form of combining sampling with gradient descent. Yet, sampling can also be applied to the dataset directly: take a small batch of items instead of a single one. This approach is called (Mini-)Batch Gradient Descent, and it combines the best of both worlds: decent performance and a gradient value that is much more predictable and closer to the actual one, which is more suitable for supporting the more advanced approaches, such as momentum.
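
Here is a sketch of such batch-gradient estimation for the scalar toy case; grad1, a per-item gradient function, is an assumed helper:

(defun calculate-gradient (batch)
  ;; average the single-item gradients over the batch
  (/ (reduce '+ (mapcar 'grad1 batch))
     (length batch)))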

In essence, momentum makes the gradient that is calculated on a batch of samples smoother and less prone to oscillation due to the random fluctuations of the batch samples. It is, basically, achieved by using a moving average of the gradient. Different momentum-based algorithms operate by combining the currently computed value of the update with the previous one. For example, simple SGD with momentum will have the following update code:

(let ((dws 0))
  (loop
    (with ((batch (sample data batch-size))
           (g (calculate-gradient batch)))
      (:= dws (- (* decay-rate dws)
                 (* learning-rate g)))
      (:+ ws dws))))

An alternative variant is called the Nesterov accelerated gradient which uses the following update procedure:

(let ((dws 0))
  (loop
    (:+ ws dws)
    (with ((batch (sample data batch-size))
           (g (- (* learning-rate (calculate-gradient batch)))))
      (:= dws (+ (* decay-rate dws) g))
      (:+ ws g))))

I.e., we first perform the update using the previous momentum and only then calculate the gradient and perform the gradient-based update. The motivation for this is the following: while the gradient term always points in the right direction, the momentum term may not. If the momentum term points in the wrong direction or overshoots, the gradient can still "go back" and correct it within the same update step.

Another direction of GD improvement is using an adaptive learning rate. For instance, the famous Adam algorithm tracks a per-cell learning rate for the ws matrix.
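
A scalar-case sketch of the Adam update, following the published algorithm (b1, b2, and eps are its standard hyperparameters; in the real, matrix-valued case the same computation happens per cell):

(let ((m 0)
      (v 0))
  (loop :for i :from 1 :do
    (let ((g (calculate-gradient (sample data batch-size))))
      ;; moving averages of the gradient and of its square
      (:= m (+ (* b1 m) (* (- 1 b1) g))
          v (+ (* b2 v) (* (- 1 b2) (* g g))))
      ;; bias-corrected estimates of both
      (let ((m^ (/ m (- 1 (expt b1 i))))
            (v^ (/ v (- 1 (expt b2 i)))))
        ;; the effective step is damped by the running
        ;; magnitude of the recent gradients
        (:- ws (/ (* learning-rate m^)
                  (+ (sqrt v^) eps)))))))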

These are not all the ways in which plain gradient descent may be made more sophisticated in order to converge faster. I won't mention here second-order methods or conjugate gradients. Numerous papers exploring this space continue to be published.

Sampling

Speaking of sampling, which we have mentioned several times throughout this book — this is a good place to describe a couple of simple sampling tricks that may prove useful in many different problems.

The sampling used in SGD is the simplest form of random selection: pick a random element from the set and repeat the specified number of times. This sampling is called "with replacement". The reason for the name is that, after picking an element, it is not removed from the set (i.e. it can be considered "replaced" by an equal element), so it can be picked again. Such an approach is the simplest to implement and reason about. There's also the "without replacement" version that removes the element from the set after selecting it. It ensures that each element may be picked only once, but it also changes the probabilities of picking elements on subsequent iterations.

Here is an abstract implementation of these sampling methods (abstract since we don't specify the representation of the set and the related size, remove-item, and empty? procedures):

(defun sample (n set &key (with-replacement t))
  (loop :repeat n
        :for i := (random (size set))
        :collect (? set i)
        :unless with-replacement :do
          (remove-item set i)
          (when (empty? set) (loop-finish))))

This simplest approach samples from a uniform probability distribution, i.e. it assumes that the elements of the set have an equal chance of being selected. In many tasks, these probabilities have to be different. For such cases, a more general sampling implementation is needed:

(defun sample-from-dist (n dist)
  ;; here, DIST is a hash-table with keys being items
  ;; and values — their probabilities
  (let ((scale (reduce '+ (vals dist))))
    (loop :repeat n :collect
      (let ((r (* scale (random 1.0)))
            (acc 0))
        (dotable (k v dist)
          (:+ acc v)
          (when (>= acc r)
            (return k)))))))

CL-USER> (sample-from-dist 10 #h(:foo 2
                                 :quux 1
                                 :baz 10))
(:BAZ :BAZ :BAZ :QUUX :BAZ :BAZ :BAZ :BAZ :BAZ :FOO)

I'm surprised how often I have to retell this simple sampling technique. In it, all the items are placed on a [0, 1) interval, occupying parts proportional to their weight in the probability distribution (:baz will have about 77% — 10/13 — of the weight in the distribution above). Then we pick a random point in this interval and determine in which part it falls.

The final sampling approach I'd like to show here — quite a popular one at programming interviews — is Reservoir Sampling. It deals with uniform sampling from a potentially infinite set. Well, how do you represent an infinite set? For practical purposes, it can be thought of as a stream. So, the items are read sequentially from this stream, and we need to decide which ones to collect and which to skip. This is achieved by the following procedure:

(defun reservoir-sample (n stream)
  (let ((rez (make-array n :initial-element nil)))  ; reservoir
    (handler-case
        (loop :for item := (read stream)
              :for i :from 0
              :for r := (random (1+ i))
              :do (cond
                    ;; initially, fill the reservoir with the first N items
                    ((< i n) (:= (? rez i) item))
                    ;; afterwards, replace the item at random position R
                    ;; with probability (/ n (1+ i))
                    ((< r n) (:= (? rez r) item))))
      ;; sampling stops when the stream is exhausted;
      ;; we'll use an input stream and READ items from it
      (end-of-file () rez))))

CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")
           (reservoir-sample 3 in))
#(BAR BAZ FOO)
CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")
           (reservoir-sample 3 in))
#(FOO FOO FOO)
CL-USER> (with-input-from-string (in "foo foo foo foo bar bar baz")
           (reservoir-sample 3 in))
#(BAZ FOO FOO)
CL-USER> (with-input-from-string (in (format nil "~{~A ~}"
                                             (loop :for i :from 0 :to 100
                                                   :collect i)))
           (reservoir-sample 10 in))
#(30 42 66 68 76 5 22 39 51 24) ; note that 5 stayed in the position
                                ; where it was placed initially

Matrix Factorization

Matrix factorization is a decomposition of a matrix into a product of matrices. It has many different variants suited to particular classes of problems. Matrix factorization is a computationally intensive task with many applications: from machine learning to information retrieval to data compression. Its use cases include background removal in images, topic modeling, collaborative filtering, CT scan reconstruction, etc.

Among the many factorization methods, the following two stand out as the most prominent: Singular Value Decomposition (SVD) and non-negative matrix factorization/non-negative sparse coding (NNSC). NNSC is interesting in that it produces much sharper vectors that still remain sparse, i.e. all the information is concentrated in the non-null slots.

Singular Value Decomposition

SVD is the generalization of the eigendecomposition (which is defined only for square matrices) to any matrix. It is extremely important as the eigenvectors define the basis of the matrix and the eigenvalues — the relative importance of the eigenvectors. Once SVD is performed, using the obtained vectors, we can immediately figure out a lot of useful properties of the dataset. Thus, SVD is behind such methods as PCA in statistical analysis, LSI topic modeling in NLP, etc.

Formally, the singular value decomposition of an m x n matrix M is a factorization of the form (* U S V), where U is an m x m unitary matrix, V is an n x n unitary matrix, and S (usually denoted by the Greek letter sigma) is an m x n rectangular diagonal matrix with non-negative real numbers on the diagonal. The columns of U are the left-singular vectors of M, the rows of V are the right-singular vectors, and the diagonal elements of S are known as the singular values of M.
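
As a toy check of this definition, assuming the factors are represented as plain 2-dimensional Lisp arrays, multiplying them back together should reproduce M (up to floating-point error):

(defun m* (a b)
  ;; naive multiplication of 2-dimensional arrays A and B
  (with ((n (array-dimension a 0))
         (m (array-dimension b 1))
         (l (array-dimension a 1))
         (rez (make-array (list n m) :initial-element 0)))
    (dotimes (i n)
      (dotimes (j m)
        (dotimes (k l)
          (:+ (aref rez i j) (* (aref a i k) (aref b k j))))))
    rez))

;; (m* (m* u s) v) should (approximately) reproduce M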

The singular value decomposition can be computed either analytically or via approximation methods. The analytic approach is not tractable for large matrices — the ones that occur in practice. Thus, approximation methods are used. One of the well-known algorithms is QuasiSVD that was developed as a result of the famous Netflix challenge in the 2000s. The idea behind QuasiSVD is, basically, gradient descent. The algorithm approximates the decomposition with random matrices and then iteratively improves it using the following formula:

(defun svd-1 (u v rank training-data &key (learning-rate 0.001))
  (dotimes (f rank)
    (loop :for (i j val) :in training-data :do
      (let ((err (- val (predict rank u i v j))))
        (:+ (aref u f i) (* learning-rate err (aref v f j)))
        (:+ (aref v f j) (* learning-rate err (aref u f i)))))))

The described method is called QuasiSVD because the singular values are not explicit: the decomposition is into just two matrices of non-unit vectors. Another constraint of the algorithm is that the rank of the decomposition (the number of features) should be specified by the user. Yet, for practical purposes, this is often what is actually needed. Here is a brief description of how the method was used for predicting movie ratings in the Netflix challenge.

For visualizing the problem, it makes sense to think of the data as a big sparsely filled matrix, with users across the top and movies down the side, and each cell in the matrix either contains an observed rating (1-5) for that movie (row) by that user (column) or is blank meaning you don't know. This matrix would have about 8.5 billion entries (number of users times number of movies). Note also that this means you are only given values for one in 85 of the cells. The rest are all blank.

The assumption is that a user's rating of a movie is composed of a sum of preferences about the various aspects of that movie. For example, imagine that we limit it to forty aspects, such that each movie is described only by forty values saying how much that movie exemplifies each aspect, and correspondingly each user is described by forty values saying how much they prefer each aspect. To combine these all together into a rating, we just multiply each user preference by the corresponding movie aspect, and then add those forty leanings up into a final opinion of how much that user likes that movie. [...] Such a model requires (* 40 (+ 17k 500k)) or about 20M values — 400 times less than the original 8.5B.

Here is the function that approximates the rating. The QuasiSVD matrix u is user-features and v is movie-features. As you can see, we don't need to further factor u and v into a matrix of singular values and matrices of unit vectors.

(defun predict-rating (rank user-features user movie-features movie)
  (loop :for f :from 0 :below rank
        :sum (* (aref user-features f user)
                (aref movie-features f movie))))

Fourier Transform

The last item we'll discuss in this chapter is not exactly an optimization problem, but it is also a numeric algorithm that is closely related to the previous ones and has broad practical applications. The Discrete Fourier Transform (DFT) is the most important discrete transform, used to perform Fourier analysis in many practical applications: in digital signal processing, the function is any quantity or signal that varies over time, such as the pressure of a sound wave, a radio signal, or daily temperature readings, sampled over a finite time interval; in image processing, the samples can be the values of pixels along a row or column of a raster image.

It is said that the Fourier Transform transforms a "signal" from the time/space domain (represented by observed samples) into the frequency domain. Put simply, a time-domain graph shows how a signal changes over time, whereas a frequency-domain graph shows how much of the signal lies within each given frequency band over a range of frequencies. The inverse Fourier Transform performs the reverse operation and converts the frequency-domain signal back into the time domain. Explaining the deep meaning of the transform is beyond the scope of this book; the only thing worth mentioning here is that operating in the frequency domain allows us to perform many useful operations on the signal, such as determining the most important features, compression (which we'll discuss below), etc.

The complexity of computing DFT naively just by applying its definition on n samples is O(n^2):

(defun dft (vec)
  (with ((n (length vec))
         (rez (make-array n))
         (scale (/ (- (* 2 pi #c(0 1))) n)))
    ;; #c(0 1) is the imaginary unit (i) — Lisp allows us
    ;; to operate on complex numbers directly
    (dotimes (i n)
      (:= (? rez i) (loop :for j :from 0 :below n
                          :sum (* (? vec j) (exp (* scale i j))))))
    rez))

However, the well-known Fast Fourier Transform (FFT) achieves a much better performance of O(n log n). Actually, a whole group of algorithms shares the name FFT, but their main principle is the same. As you might have already guessed from the previous chapters, such a reduction in complexity is achieved with the help of the divide-and-conquer approach. A radix-2 decimation-in-time (DIT) FFT is the simplest and most common form of the Cooley-Tukey algorithm, which is the standard FFT implementation. It first computes the DFTs of the even-indexed inputs (indices: 0, 2, ..., (- n 2)) and of the odd-indexed inputs (indices: 1, 3, ..., (- n 1)), and then combines those two results to produce the DFT of the whole sequence. This idea is applied recursively. What enables such decomposition is the observation that, thanks to the periodicity of the complex exponential, the elements (? rez i) and (? rez (+ i n/2)) may be calculated from the FFTs of the same subsequences. The formulas are the following:

(let ((e (fft-of-even-indexed-part))
      (o (fft-of-odd-indexed-part))
      (scale (exp (/ (- (* 2 pi #c(0 1) i))
                     n)))
      (n/2 (floor n 2)))
  (:= (? rez i) (+ (? e i) (* scale (? o i)))
      (? rez (+ i n/2)) (- (? e i) (* scale (? o i)))))
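
Putting the combination step above into a complete recursive sketch — assuming, as is standard for radix-2 FFT, that the length of vec is a power of 2 (a production implementation would also avoid the repeated copying):

(defun fft (vec)
  (let ((n (length vec)))
    (if (= n 1)
        vec
        (with ((e (fft (coerce (loop :for i :from 0 :below n :by 2
                                     :collect (? vec i))
                               'vector)))
               (o (fft (coerce (loop :for i :from 1 :below n :by 2
                                     :collect (? vec i))
                               'vector)))
               (n/2 (floor n 2))
               (rez (make-array n)))
          (dotimes (i n/2)
            (let ((scale (exp (/ (- (* 2 pi #c(0 1) i))
                                 n))))
              (:= (? rez i) (+ (? e i) (* scale (? o i)))
                  (? rez (+ i n/2)) (- (? e i) (* scale (? o i))))))
          rez))))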

Fourier Transform in Action: JPEG

Fourier Transform — or rather its variant that uses only cosine functions[2] and operates on real numbers — the Discrete Cosine Transform (DCT) is the enabling factor of the main lossy media compression formats, such as JPEG, MPEG, and MP3. All of them achieve a drastic reduction in the size of the compressed file by first transforming it into the frequency domain, then identifying the long tail of low-amplitude frequencies and removing all the data associated with those frequencies (which is, basically, noise). Such an approach allows specifying a threshold for the percentage of data to be discarded or retained. The use of cosine rather than sine functions is critical for compression since it turns out that fewer cosine functions are needed to approximate a typical signal. It also allows sticking to real numbers only. DCTs are equivalent to DFTs of roughly twice the length, operating on real data with even symmetry. There are, actually, eight different DCT variants, and we won't go into detail about their differences.

The general JPEG compression procedure operates in the following steps:

  • an RGB to YCbCr color space conversion (a special color space with luminance and chrominance components, more suited for further processing)
  • division of the image into 8 x 8 pixel blocks
  • shifting the pixel values from [0,256) to [-128,128)
  • applying DCT to each block, from left to right, top to bottom
  • compression of each block through quantization
  • entropy encoding of the quantized matrix (we'll discuss this in the next chapter)
  • the compressed image is later reconstructed through the reverse process, using the Inverse Discrete Cosine Transform (IDCT)

The quantization step is where the lossy part of the compression takes place. It aims at reducing most of the less important high-frequency DCT coefficients to zero: the more zeros, the better the image will compress. Lower frequencies are kept and used to reconstruct the image, because the human eye is more sensitive to them, while higher frequencies are discarded.
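
A toy illustration of this step, assuming block and q-table are 8 x 8 arrays of DCT coefficients and quantization divisors (the real divisor tables are standardized and scaled according to the chosen quality setting):

(defun quantize-block (block q-table)
  (let ((rez (make-array '(8 8))))
    (dotimes (i 8)
      (dotimes (j 8)
        ;; dividing by the divisor and rounding sends most of the
        ;; high-frequency coefficients to zero
        (:= (aref rez i j)
            (round (aref block i j) (aref q-table i j)))))
    rez))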

P.S. Further development of the Fourier-related transforms for lossy compression lies in using the Wavelet family of transforms.

Take-aways

It was not easy to select the name for this chapter. Originally, I planned to dedicate it to optimization approaches. Then I thought that a number of other numerical algorithms need to be presented, but they were not substantial enough to justify a separate chapter. After all, I saw that what all these different approaches are about is, first of all, approximation. And, after gathering all the descriptions in one place and combining them, I came to the conclusion that approximation is, in a way, a more general and correct term than optimization. Although they go hand in hand, and it's somewhat hard to say which one enables the other...

A conclusion that we can draw from this chapter is that the main optimization methods currently in use boil down to greedy local probabilistic search. In both the discrete and continuous domains, the key idea is to quickly find a direction in which we can somewhat improve the current state of the system and advance along that direction. All the rest is, basically, fine-tuning of this concept. There are alternatives, but local search aka gradient descent aka hill climbing dominates the optimization landscape.

Another interesting observation is that many of the approaches we have seen here are more like templates or frameworks than concrete algorithms. Branch & Bound, Genetic Programming, and Local Search define a certain skeleton that should be filled in with domain-specific code which performs the main computations. Such a "big picture" approach is somewhat uncommon in the algorithms world, which tends to concentrate on the low-level details and optimize them down to the last bit. So the skills needed to design such generic frameworks are no less important to algorithm developers than knowledge of the low-level optimization techniques.

SGD, SVD, MCTS, NNSC, FFT — this sphere has plenty of algorithms with abbreviated names for solving particular numerical problems. We have discussed only the most well-known and principal ones with broad practical significance in the context of software development. But, besides them, there are many other famous numerical algorithms like the Sieve of Eratosthenes, the Finite Element Method, the Simplex Method, and so on. Yet, many of the ways to tackle them, and the issues you will encounter in the process, are essentially similar.


[1] It uses the popular drakma HTTP client and cl-ppcre regex library.

[2] The DFT uses a complex exponent, which consists of a cosine and a sine part.

Joe MarshallGendl / SBCL / Ubuntu / WSL / Windows and an experiment in live blogging

· 11 days ago
I'm helping David Cooper by trying to run a demo of his Gendl software. He's preparing a new release and I get to be the guinea pig. For fun, I'm live blogging as we go along just as an experiment.

I'm running SBCL 1.4.5debian under Ubuntu 18.04.3 under Windows Subsystem for Linux (WSL 1) under Windows 10 Home edition. I have to say that Ubuntu on WSL is remarkably stable and useful. It seems to act just like the Ubuntu I'm used to using and runs ELF executables without modification. The GNU tool chain just seems to work and I used apt-get install to fetch and install SBCL, which hasn't complained either. I've git cloned David's release and I'm now awaiting further instruction.

While waiting, I've installed slime and am running SBCL under Emacs 25.2.2. Quicklisp is used to install and start Gendl. This starts up a web server that provides the UI in another thread.

So far, this entire melange of software is working quite smoothly despite the oddball combination of the parts. Lisp has a habit of tickling memory protection bugs and threading bugs. Unix isn't supposed to get along with Windows. Windows isn't known to get along with Unix very well, either, unless you isolate them from each other through a virtual machine. But WSL is doing its job acting as a go-between. I don't know the details of how it works, but I do know that you need to have some virtualization enabled, but it isn't a full virtual machine (Windows 10 Home edition doesn't support Hyper-V). In WSL, the Linux side can see the Windows disk as a mount point, but it doesn't seem that the Windows side can see the Linux disk. WSL gives you a bash shell in a window capable of running Emacs, or you can run an XWindows server like Xming under Windows if you want the full X experience. Performance seems reasonably snappy.



Well, live blogging didn't work. It just felt rude to be typing and not paying full attention while someone was demonstrating software that he had obviously worked very hard on. So I'll give a recap of what I understood from the demo. I'm no expert on CAD systems and most likely misunderstood important points, so take this as a novice's view. I asked David to help me correct this write-up. Not surprisingly, there's one important point I misunderstood, so I'll put in David's explanation.

Gendl is inspired by ICAD. Through use of CLOS and some macros, Gendl provides a DSL that allows you to design an object through a tree-like decomposition of its component pieces. Each component has properties, and these can be inherited by subcomponents (so, for example, when you paint a component a certain color, the color propagates down the component tree to the child components and they get painted, too).

(Here I messed up majorly.) In David's words
The technical term for the properties is “messages.” In many ways this is a message-passing object system, inspired by Flavors, which was inspired by SmallTalk (The ICAD System was built with Flavors, not CLOS).

Note there are two, orthogonal, concepts of "inheriting" going on. Normal class mixins provide one type of inheritance -- an object definition will inherit all the messages of its mixins. We usually call this "inheritance."

The passing of values from parent instance to child instance and other descendant instances in the instance tree is called “passing down” rather than inheritance, to avoid confusion with that other, different, inheritance. Messages can be passed down implicitly (by making them trickle-down-slots), or passed down explicitly by including them as keyword/value pairs in the object specification.

Another way to think of this is that the class (or “type”) of a given instance can inherit from multiple mixins, can contain multiple child object instances, but can have at most one Parent object in the instance tree (and it can have zero Parent objects if it's the root)

Also:

The mixin inheritance relationship is an “is a” relationship (essentially the same as normal CLOS inheritance).

The parent-child relationship is a “has a” relationship, and comes along with automatic demand-driven (i.e. lazy) dependency tracking.

The base component contains a co-ordinate system, so all components have a co-ordinate system relative to the root object. One primary module or application of Gendl is the web-based UI Geysr, which allows you to navigate the component tree, inspect components, change their properties, and visualize how they fit together. This module also provides an “exploded” view of the object if desired, where each subcomponent is rendered displaced a small distance from its actual location. It can also render an exploded view of the assembly process for the object.

The DSL provides a way of specifying a component's location via constraint-like functional messages between components. This means that components can get resized when necessary, angles recomputed when necessary, and even the number of subassemblies recomputed dynamically when the object changes. In the “bench” example David showed me, the number of slats in the bench seat was recomputed dynamically from the seat width, the slat width, and constraints on the slat spacing. The user simply specified the perimeter of the seating area and Gendl figured out how many 2x4's you'd need to cover it. (Maybe I'm easily impressed, but I thought this was pretty neat.)

Another part of Gendl is the Process Planning module which computes a Manufacturing Bill of Processes and Raw Materials. This is much more interesting. Again using the DSL provided by CLOS, you define rules on how to build components from their constituent parts. Then Gendl, starting at the root node, infers a construction plan by recursive descent through the components and combining the rules. When it is done, you have the instructions for building the entire assembly. For the bench example, it started with the purchase of sufficient 2x4s for the frame, seat, and back, then cutting the 2x4s to length, some at an angle for the reclining back, then fastening the cut pieces together, and finally painting the result. The leaf nodes in the process planning tree represent the Bill of Materials for the object under construction. I thought this was pretty cool, too.

While I know nothing about CAD, I do know a little about computer languages, so I'll opine a little on the DSL. The basic operation in Gendl is define-object which is a macro that extends defclass. Objects have slots (fields) that hold values. One feature of Gendl is the ability to define “trickle down” slots whose values are available in the environment of subcomponents. Thus when you attach a new seat slat to the seat of the bench, it can use the trickle down value to match its paint color with that of the rest of the seat. This use of “environmental” properties isn't unique to Gendl, but it is worth note. David says, “Trickle-down slots are essentially a short-hand way of passing messages from a parent down to its descendants.”

The recursive nature of Lisp matches well with the recursive decomposition of objects. The DSL makes it obvious to even the casual observer how the objects are composed. The value of the slots in the object can be defined as any Lisp expression, but drawing from the constraint-like language subset makes it obvious how the pieces relate to each other. You don't specify that the arm rests are 60 inches from the center (although you could), you specify that they butt up against the backrest. Given that constraint, you can move one subassembly and Gendl will automatically recompute the positions of the adjoining assemblies. While this is powerful, I suspect you need a fair amount of experience in setting up the constraints to do it so that it is useful.

Something I completely missed because it is natural to me and other Lisp hackers, but not to CAD designers, is the power of having an Emacs/Slime REPL in parallel with the web-based interface. The web-based interface gives you a natural point-and-click, drag-and-drop UI with instant visual feedback, while at the same time you have the powerful DSL available in a REPL for experimenting directly with the object tree in a natural text-based way. (Sometimes people forget that text itself is a highly abstract visual medium with thousands of years of development behind it.) In David's demo, he would move between the REPL and the UI extremely frequently, keeping both windows open at the same time, sometimes manipulating the object graphically, other times textually. In retrospect, this looks like a huge advantage, but at the time it seemed like an obvious way to use the system. I expect other jaded command-line hackers would have the same experience.

The rules for constructing components are written as CLOS methods that are specialized to the component they are constructing. It is rather obvious what rule applies for a given component. In addition it is obvious what to specialize on for constructing a component. The rule files are straightforward and reasonably terse.

Given the shortness of the demo (there was a lot to grasp), and my own inexperience, I don't think I could write an object description from scratch, but given an existing description I feel confident I could add a small subcomponent. I can see how this product would be useful in design and manufacturing and I think the learning curve doesn't look too steep if one is willing to start small before working up to something hugely complex.

Thanks to David for giving me the demo and for helping correct my blog post. I take credit for all the errors and misconceptions.

Joe MarshallMicro-services

· 13 days ago
All the cool kids are doing micro-services these days. They are either writing their new apps as a micro-service architecture, or are in the process of changing their legacy apps to micro-service architectures. The fact that everyone seems to be doing it surely means there are no drawbacks of any note, or that the advantages vastly outweigh them.

I see these advantages to a micro-service architecture:
  • Architectural impedance match — each architectural feature maps to a few obvious micro-services.  For example, if you are adding a feature that simply surfaces a data field to a user interface, you probably need to modify little more than the component that generates the UI.  If you are adding a richer service, you'll need to add a persistence component and a business logic component in addition to the UI component. But the reason there is such a good match between the architecture and the micro-services is simply because you draw circles around each architectural component in your design and declare it to be a micro-service. Basically, if it is worth drawing as a separate component, you consider it worth implementing it as a micro-service.

    This impedance match works well with Agile/SCRUM planning. At the beginning of each sprint, it is fairly easy to assign resources to handle the changes necessary to the existing components or to write new components. Each service is usually tiny enough that it doesn't exceed the capacity of a single team. If you can assign more than one team to a component, then the component is too large and needs to be broken into more micro-services. For a more complex feature, it is possible to plan to bring the micro-service online over a period of more than one sprint without adding too much to the risk of the component.
  • Deployment impedance match — each architectural feature maps to only a handful of obvious micro-services, each one usually implemented by a stand-alone "plug-in" program.  The micro-services are usually run in a separate container where they can crash and be restarted with little ill effect on the program at large - perhaps some functionality is temporarily lost while the micro-service is restarted, or maybe a hot backup is ready to take over.  The containers are typically something like "docker" and are monitored with "kubernetes" to ensure they are up and running.  Structuring the program as a set of micro-services facilitates this kind of monitoring and restarting. This works well when you can distribute your micro-services over a fleet of virtual and physical machines.
  • Possible bug reduction — From experience, it seems that two 500 line programs have fewer bugs than a single 1000 line program. In theory, this would scale and a program made of dozens of hundred-line subprograms would have far fewer bugs than if that same program were written as a single subprogram with several thousands of lines of code. In practice, however, I think that it isn't so simple.
    Two bug-free programs might have emergent buggy behavior when they try to co-ordinate their behavior. Emergent bugs are very hard to find and fix.
  • Robustness — it is hard to completely knock down a micro-service architecture. No matter how many subprocesses you manage to disable, at least a few will still remain running. If we are satisfied with occasional missing pieces of our program, this is robustness of a sort.
  • Dependency injection — Back in the 80's, we'd limit the spread of complexity by structuring larger programs to only know about those smaller subprograms that they depended directly upon. This was eventually given a name and declared “a good thing”. You get dependency injection almost for free in a micro-services architecture, not least because the services won't even know about each other unless you tell them. You are virtually forced to follow a reasonable design principle because you have to do something to get the services to know about each other, and this minimal effort also happens to minimize unwanted information sharing.

It is a time-honored tradition in computer programming to take any idea that offers some advantage, give it a name, elevate it to a “principle” and use it to the exclusion of any prior good idea. “Micro-service architecture” is the latest fad, and it has so rapidly replaced “object-oriented programming” that you can often find both in the same job description. This is almost a contradiction because a micro-service architecture often cuts straight through the objects manipulated by the program. What is conceptually a single object might be represented in several completely different ways in various micro-services in the system. It might be reconstituted from a relational mapping by way of an ORM, serialized into JSON format for RPCs, and returned as part of a nested GraphQL query before having its guts evacuated and spread out through the Document Object Model for display. The identity of the object through the system is completely lost.

Despite the plethora of wonderful ideas that come from a micro-service architecture, there seem to be some downsides.
  • Ubiquitous process barriers — A process barrier is thrown up between every functional component of a program. This helps limit the effect of a crash, but greatly hinders cooperation and synchronization. Error handling and try/finally clauses don't easily work across process boundaries, and what before might be handled by a critical section of code now might require a distributed locking protocol.
  • Ubiquitous RPCs — all communication between the micro-services has to take place over the tin cans and strings we call remote procedure calls. RPCs, besides being slow, introduce failure modes that normal procedure calls don't have. And if the programmer doesn't think hard about these new failure modes, the RPC will be far less robust than its direct counterpart.
  • Anti-object oriented — all objects in the problem domain of a micro-service architecture must be simple enough to marshal into JSON objects so they can be passed as arguments and returned as values. As noted in a previous essay, the more complex the object, the harder it is to marshal across a procedure boundary, and if fields are hard to marshal, methods are virtually impossible. There is a strong incentive to keep "objects" in JSON format as they are passed through the system and only parse the JSON strings as they are needed by the lowest layer. This is the antithesis of object-oriented design, which is to try to make objects as similar to their real-world counterparts as possible and encapsulate behavior (methods) along with the data.
  • Non robustness — while it is hard to knock over an entire micro-architecture application, it can be all too easy to knock over parts of it. A program that is 90% functional 100% of the time is also 10% broken 100% of the time. If that 10% is an often used part of the program, it can appear very non robust. So what if the header and footer of each page (each its own micro-service, naturally) are up nearly 100% of the time when the body of each page is missing half its data. A colleague of mine noted that a certain micro-architecture he was working on seemed very fragile for something that is supposed to be so robust.
  • Verbosity — the frameworks used to support a micro-service architecture makes plain old Java look downright terse. Objects and methods have dozens of annotations describing how to marshal the objects and hook up the RPCs. Some frameworks make use of the redundancy of Java to aid in automatically constructing a micro-architecture system, but these often rely on reflection and automated code generation that introduces its own set of problems. And Java's habit of putting every class in a separate file means you often have to follow a tortuous trail through the code to find the path from the request coming in to the method that actually handles the logic that implements the request.

Despite these drawbacks, I expect to see a lot more micro-architectures in the future because they match up so well with “factory style” software production. It really makes it easier to manage teams of developers. I expect enthusiasm to be curbed a bit when it is noticed that rewriting a program as a set of micro-services doesn't really result in a dramatic improvement in the reliability of the program or a dramatic reduction in the complexity of a large body of code.

Joe MarshallIf 6 was 9

· 14 days ago
Now if 6 turned out to be 9
I don't mind, I don't mind
Alright, if all the hippies cut off all their hair
I don't care, I don't care
Dig, 'cos I got my own world to live through
And I ain't gonna copy you           -- Jimi Hendrix 
Fortunately, integers are immutable and usually implemented that way (with a few notable exceptions, mostly on early computers), so there would be no harm in Jimi copying 6 around whenever he felt like it. It would always remain 6 and never turn out to unexpectedly be 9. After all, in a computer there are hundreds of thousands of copies of the integer 1 and they agree because they are all the same and are all immutable.

There is no need for “defensive copies” of immutable data, either. This can improve the space usage of a program, sometimes dramatically. It also makes it much easier to encapsulate the representation of abstract data, so object-oriented code becomes easier to write correctly. Without mutators, there's half the code for each field in an object.

Cache coherency is trivial when your data is immutable: simply eject old data. No need for write-back, write-through, or cache snooping.

“Immutable by default” still hasn't taken over the mainstream, but “optionally immutable” has gained a lot of popularity over the past decade, especially as people have realized the advantages of immutable data in distributed systems.

Jimi probably wasn't thinking along these lines when he wrote that song, though.





Joe MarshallRemotely like a procedure call

· 15 days ago
The moment you connect two computers together with a wire, you introduce a number of failure modes that are very hard to deal with: undeliverable messages, fragmentation of messages, duplication of messages, lost messages, out-of-order arrival, and untimely arrival come to mind. There are probably some I am forgetting. With a bit of engineering, you can eliminate some of these failure modes, or reduce the likelihood of them occurring, or trade the likelihood of one kind for another, but ultimately, when you send a network message, you cannot be sure it will reach its destination in a reasonable amount of time, if at all.

Usually, in a network, the computer that initiates the interaction is called the client, while the computer that performs the action is called the server. There are usually several computers along the way — routers, proxies, and switches — that facilitate the interaction, but their job is to act as a wire in the non-erroneous case, return responses if needed, and to report errors back to the client if necessary and possible.

There are several taxonomies of network errors, but one useful way I have found to categorize them is by what sort of actions you might want to take to deal with them. If you are lucky, you don't have to deal with them at all. For instance, it might be reasonable to simply lose the occasional remote info logging message. Or if a web page fails to be delivered, it might be reasonable to rely on the recipient eventually hitting the reload button. This isn't much of a strategy, but it is easy to implement.

Many times, it is possible for the client to detect that the interaction failed. Perhaps the server returns an error code indicating the service is temporarily unavailable, or perhaps a router or switch along the way indicates that it cannot deliver a packet. This isn't the desired state of affairs, but at least the client can come to the conclusion that the interaction didn't take place. Depending on what is causing the failure, there are essentially only two options for proceeding: trying again and giving up. If the error is persistent or long-term, trying again is futile and your only option is to defer the operation indefinitely or abandon it altogether. A robust, distributed system has to be prepared to abandon interactions at any time and have a strategy for what to do to recover from abandoned interactions. (For example, it could pop up a message box indicating a service is not available, or it could send an email saying that it will process the interaction at some undetermined time in the future. I suppose crashing is a strategy, but it stretches the definition of “robust”.)

Trying again, usually after a short wait, is often an appropriate strategy for dealing with transient errors. For instance, an unreachable server may become reachable again after a routing change, or a service could get restarted and start serving again. It may not be possible to tell the difference between a transient error and a persistent one until you have retried several times and have had no success. In this case you need to fall back to the persistent error strategy of deferring the operation indefinitely or abandoning it altogether. If retrying is part of the strategy for handling transient errors, it becomes necessary for the server to deal with the possibility of handling the same message multiple times. It should, through use of timestamps, versioning, or other idempotent mechanisms, be able to ignore duplicate requests silently. Otherwise, the original error is handled through a retry only to cause the server to raise a duplicate request error.

An even less desired failure mode is when a message is lost or cannot be delivered, and this cannot be determined by the client. The message gets sent, but no reply is received. You are now left guessing whether it was the original message that was lost or just the reply. The usual strategy is to eventually time out and resend the message (essentially, you inject a transient failure after a pre-determined amount of time and then handle it like any other transient failure). Again, the server has to be designed to see a message multiple times and to handle duplicates silently and gracefully. Also again, the situation may become persistent and the persistent fallback strategy has to be used.

No one size fits all, so network systems come with libraries for implementing various strategies for error recovery. A robust, distributed system will have been designed with certain failure modes in mind and have an engineered solution for running despite these errors occurring once in a while. These involve designing APIs that can handle stale data through versioning timestamps and a reconciliation process, idempotent APIs, temporarily saving state locally, designating “ultimate sources of truth” for information, and designing protocols that reach “eventual consistency” given enough time. There's no panacea, though — no programming technique or trick that simply results in a robust, distributed system. A robust, distributed system has to be designed to be robust despite being distributed, and you need some experienced people involved in the design and architecture of such a system at the beginning so that it will be robust once it is implemented. There is actual non-trivial work involved.

(It amuses me to see that some people understand part of the problem without understanding much about the underlying tools. I've seen packet timeout and retry mechanism bolted on to TCP streams, which already have this built in to their abstraction.)



Enter the remote procedure call. If nothing goes wrong, a network interaction that involves a return receipt bears a remote resemblance to a procedure call. The syntax can be made similar to a procedure call, it's just that the call begins over here, the computation happens over there, and the return value is sent back over here. The major difference being that it is significantly slower (and just forget about tail recursion. Although there's nothing in theory that would prevent a properly tail-recursive RPC, I've never seen one.)

RPCs are a weak, leaky abstraction. If everything goes well, the network interaction does indeed appear as if it were a (slow) procedure call, but if there are problems in the network interaction, they aren't hidden from the caller. You can encounter persistent errors, in which case the caller has to be prepared to either defer the call indefinitely or abandon it altogether. You can encounter transient errors, which suggests attempting to retry the call, but it may not be clear whether an error is transient or persistent until you've retried and failed several times. If retrying is part of the error recovery strategy, then the callee has to be prepared to receive duplicate requests and discard them or handle them in some other (presumably idempotent) manner. In short, none of the errors that arise from network interactions are abstracted away by an RPC. Furthermore, network errors are often turned into RPC exceptions. This seems natural, but it makes it difficult to separate the strategies for handling network errors from the strategies for handling other exceptions.

A featureful RPC mechanism can aid in figuring out what happened in an error case, and can provide various recovery mechanisms, but it is still up to the programmer to determine what the appropriate recovery strategy is for each error condition.

RPCs are by their nature a late-bound mechanism. There has to be machinery in place to resolve the server that will handle the network interaction, route messages, and possibly proxy objects. All too frequently, this machinery simply doesn't work because sysadmins thoughtlessly apply overly broad firewall rules that prevent the appropriate name resolution and proxying from occurring. To get around this problem some systems have resorted to using HTTP or HTTPS as the underlying RPC mechanism. These protocols are rarely blocked by firewalls as it would render it impossible for sysadmins to watch videos.

HTTP and HTTPS were designed as protocols for transferring web pages, not for general remote procedure calls. For example, errors are handled by returning a 3-digit code and small blurb of text. Shoehorning a featureful RPC mechanism into web interactions is difficult, so we are left with shuffling JSON objects over HTTP connections as our poor-man's approximation to a well-designed RPC mechanism.

RPCs can be a useful, if limited, abstraction. If you are using them, you need to be careful because they don't abstract away the problems of distributed systems and it becomes all too easy to build a fragile distributed system rather than a robust one.


Addendum: It seems that I'm not the only one having problems commenting on things. Arthur Gleckler tried to write a comment but it simply got eaten and returned him to the posting page. I don't know what John Cowan is doing right, but he appears to be able to post comments just fine. Google has this unfortunate habit of letting bit rot set into their products. Then, once enough of it stops working, they kill the product or try to replace it with something no one wants. Anyway, here's what Arthur had to say:
Speaking of tail-recursive RPCs, when Google was redesigning its RPC system, I tried to get the new designers to implement a simple tail-recursive capability so that client A could call server B and receive a response from server C. That could have made a lot of common patterns at Google faster, since it was common for A to call B, which called C, which called D, and so on, e.g. to try different mechanisms for handling a particular search query. Unfortunately, I wasn't persuasive. Frankly, it's hard for people who haven't spent much time with tail-recursive languages to see the benefit, and these were hard-core C++ hackers.

Wimpie NortjeTesting command line interfaces

· 15 days ago

When you make a standalone application intended as a command line tool you may want integration tests to exercise the application through the normal command line user interface.

There are tools that can do this testing on OS executable applications but sometimes it is useful to run these tests as part of the test suite without having to first build a standalone binary.

The goal then is to test a function that only interacts via standard input and output. This code demonstrates how it can be done:

(defun application ()
  (let* (name greet)
    ;; Prompt 1
    (format *query-io* "Name: ")
    (setf name (read-line *query-io*))
    ;; Prompt 2
    (format *query-io* "Greeting: ")
    (setf greet (read-line *query-io*))
    ;; Result
    (format t "~&~A ~A!~%" greet name)))

(def-test greet ()
  (let* (;*STANDARD-INPUT* must contain all the responses at once.
         (*STANDARD-INPUT*
          (make-string-input-stream (format nil "John~%Hi~%")))
         (*STANDARD-OUTPUT* (make-string-output-stream))
         (*QUERY-IO*
          (make-two-way-stream *STANDARD-INPUT* *STANDARD-OUTPUT*))
         result)
    (application)
    (setf result (get-output-stream-string *STANDARD-OUTPUT*))
    ;; RESULT =>
    ;; "Name: Greeting: 
    ;; Hi John!
    ;; "
    (is (string-equal
         "Hi John!"
         (nth 1 (split-sequence:split-sequence
                 #\newline result))))))

The example's method only works for functions that have a fixed request/response sequence; in other words, the test does not need to change its responses dynamically, and all the responses can be prepared before the function is executed.

The reason for this limitation is that make-string-input-stream creates streams with fixed content. Even if the original string changes, the stream content does not. Without the ability to change the stream content after creation there is no way to implement dynamic responses to application prompts.

While it does not seem possible to create interactive output-to-input streams with Common Lisp primitives, UIOP:RUN-PROGRAM can provide such output-to-input streams to external processes. This at least indicates that it can be done for some particular case. On CCL this is accomplished with file descriptor streams. I have not investigated the detail of the implementation nor have I looked how it is done for other implementations, so it may ultimately be a dead end.

It appears as if there is no easy and uncomplicated solution for setting up a *STANDARD-OUTPUT* to *STANDARD-INPUT* pipe that will enable test functions to apply logic in-between responses. If you need such a solution some avenues to investigate are:

  • Test the application as a standalone external process with UIOP:RUN-PROGRAM. This definitely works but it is exactly what we did not want to do.
  • Look for a library that implements some kind of pipe function. cl-plumbing may be an option.
  • Try to implement piping with files or sockets.

Lispers.deHamburg Lispers Meetup, Monday, 6th January 2020

· 17 days ago

Happy New Year!

Our monthly gathering in Hamburg takes place as usual on the first Monday of the month, on 6th January 2020. We meet around 19:00 (“7 p. m.”) at Ristorante Opera, Dammtorstr. 7.

This is an informal event for all Lisp experience levels, and about any Lisp dialect.

Come as you are, bring lispy topics.

Quicklisp newsUpdated Quicklisp client now available

· 17 days ago
I updated the Quicklisp client yesterday. The new version has the following fixes:
  • (ql:quickload '()) is allowed and treated as an empty list of things to load - patch from Masatoshi SANO
  • :defsystem-depends-on prerequisites are loaded automatically when planning how to load systems
To get this update, use (ql:update-client).

Enjoy!

Joe MarshallSemi-abstract data types

· 20 days ago
Way back in the 70's, when solid-state computing involved chisels and rock walls, the concept of the abstract data type (ADT) was developed. The ADT encapsulated its representation as thoroughly as possible. Each ADT had a set of representation invariants that were supposed to be maintained under all circumstances. It was considered an embarrassing bug to be able to cause the representation to violate an invariant if the caller made only API calls, no matter how unusual. Additionally, each ADT was equipped with an abstraction function which described how concrete representations mapped to the abstract objects they represented.

It was, quite frankly, a pain in the butt to document all this.  Especially because the documentation made no difference — it was just text that could say anything it wanted.  In theory, you could replace the representation of any object with an equivalent representation that obeyed the invariants, adjust the abstraction function appropriately, and no one would be the wiser.  In practice, no one ever did this.  Not even once.  So the documentation pointed out that the representation was simply the built-in Cartesian product that any decent language provided and the abstraction function was the overly broad enumeration of the elements of the product.

These days, of course, things are different.  Drawing upon lessons learned in the 80's and 90's, people are using sophisticated IDLs to ensure their abstract data types maintain invariants across processes and languages and ... ha ha ha ha ha.  Sorry.

No, these days people don't bother with abstract data types.  They just use semi-abstract JSON objects that don't even provide the basic encapsulation of their 1970's counterparts.   The representation isn't hidden at all: it is a string.  You can make string to string mappings, aggregate them and nest them in arrays, but you cannot hide the fact that you've got a string.  So much for abstraction.

It isn't even worth parsing the string to determine if it satisfies invariants.  Chances are, you're not going to process it yourself,  but instead are just going to hand it off to another process in string form and let it deal with it.  And why not?  At best you'll simply find that it is indeed well-formed.  At worst, you'll raise an error now that will probably be raised shortly anyway.  And is well-formed data really that important?  It isn't like you're going to store it persistently.....


Patrick SteinAnamorphic Fun

· 20 days ago

I want to make some bookshelf dioramas like these ones by Monde. I also want to make some that take more liberties with the limited space available. So, I am thinking about using forced perspective to create extra depth. I’m also thinking of some designs that could use right-angled mirrors to snake height into depth.

I also want to create extra width (and height) though. Not just depth. How can I do that? I can use anamorphic illusions to make things appear to stretch beyond the edges of the box. And, I can force the viewer to only view it from the illusory angle and enhance the illusion of space if I make the diorama viewable through a peephole.

So, I started experimenting yesterday with anamorphic projection onto the inside of a narrow box for viewing through a peephole.

I used the cl-svg library to create grids like this that look flat when viewed (from the correct distance) through the peephole linked above.

Then, I used cl-jpeg to create images that, when printed and folded into the interior of a diorama, look as if the image extends beyond the sides of the diorama.

If you print this at 72dpi and then fold on the white lines, it will look like a flat, square image with the letters AB on it when you look down into it.

Then, I combined the two.

If you print this at 72dpi and then fold on the white lines, it will look like a flat, square image with the letters AB on it when you look down into it through a 200-degree peephole.

Here is the full source code (peephole.lisp) used to generate the grid and the anamorphic images.

Nicolas MartyanoffLocal CLHS access in Emacs

· 21 days ago

The CLHS, or Common Lisp HyperSpec, is one of the most important resources for any Common Lisp developer. It is derived from the official Common Lisp standard, and contains documentation for every aspect of the language.

While it is currently made available online by LispWorks, it can be useful to be able to access it locally, for example when you do not have any internet connection available.

For this purpose, LispWorks provides an archive which can be downloaded and browsed offline.

While the HyperSpec is valuable on its own, the SLIME Emacs mode provides various functions to make it even more useful.

I have found the following functions particularly useful:

  • slime-documentation-lookup, or C-c C-d h, to browse the documentation associated with a symbol.
  • common-lisp-hyperspec-format, or C-c C-d ~, to look up format control characters.
  • common-lisp-hyperspec-glossary-term, or C-c C-d g, to access terms in the glossary.

With the default configuration, Emacs will use the online HyperSpec. You can have it use a local copy by setting common-lisp-hyperspec-root to a file URI. For example, if you downloaded the content of the CLHS archive to ~/common-lisp/:

(setq common-lisp-hyperspec-root
  (concat "file://" (expand-file-name "~/common-lisp/HyperSpec/")))

And if you configure Emacs to use the EWW web browser, you can work with the CLHS without leaving your editor.

Nicolas MartyanoffASDF in Common Lisp scripts

· 23 days ago

The usual way to develop programs in Common Lisp is to use SLIME in Emacs, which starts an implementation and provides a REPL. When a program needs to be running in production, one can either execute it from source or compile it to an executable core, for example with sb-ext:save-lisp-and-die in SBCL.

While executable cores work well for conventional applications, they are less suitable for small scripts, which should be easy to run without having to build anything.

Writing a basic script with SBCL is easy:

#!/bin/sh
#|
exec sbcl --script "$0" "$@"
|#

(format t "Hello world!~%")

Since UNIX shebangs cannot be used to run commands with more than one argument, it is impossible to call SBCL directly (it requires the --script argument, and #!/usr/bin/env sbcl --script contains two arguments). However, it is possible to start as a simple shell script and just execute SBCL with the right arguments. And since we can include any shell commands, it is possible to support multiple Common Lisp implementations depending on the environment.

This method works. But if your script has any dependency, configuring ASDF can be tricky. ASDF can pick up system directory paths from multiple places, and you do not want your program to depend on your development environment. If you run your script in a CI environment or a production system, you will not have access to your ASDF configuration and your systems.

Fortunately, ASDF makes it possible to manually configure the source registry at runtime using asdf:initialize-source-registry, giving you total control over the places which will be used to find systems.

For example, if your Common Lisp systems happen to be stored in a systems directory at the same level as your script, you can use the :here directive:

#!/bin/sh
#|
exec sbcl --script "$0" "$@"
|#

(require 'asdf)

(asdf:initialize-source-registry
 `(:source-registry :ignore-inherited-configuration
                    (:tree (:here "systems"))))

And if you store all your systems in a Git repository, you can use submodules to include a systems directory in every project, making it simple to manage the systems you need and their version. Additionally, anyone with an implementation installed, SBCL in this example, can now execute these scripts without having to install or configure anything. This is quite useful when you work with people who do not know Common Lisp.

Of course, you can use the same method when building executables: just create a script whose only job is to set up ASDF, load your program, and dump an executable core. This way, you can make sure you control exactly which set of systems is used. And it can easily be run in a CI environment.
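
Such a build script might look like this. A minimal sketch under the same layout as above: my-program and my-program:main are hypothetical placeholders for your system and entry point, and the dump uses the SBCL-specific sb-ext:save-lisp-and-die.

#!/bin/sh
#|
exec sbcl --script "$0" "$@"
|#

(require 'asdf)

(asdf:initialize-source-registry
 `(:source-registry :ignore-inherited-configuration
                    (:tree (:here "systems"))))

;; Load the program through ASDF, then dump a self-contained executable.
(asdf:load-system "my-program")          ; hypothetical system name

(sb-ext:save-lisp-and-die "my-program"
                          :toplevel #'my-program:main  ; hypothetical entry point
                          :executable t)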

Quicklisp newsDecember 2019 Quicklisp dist update now available

· 26 days ago
New projects:
  • 3b-bmfont — BMFont file format readers/writers — MIT
  • bdef — Buffer definition; audio buffer abstraction for sound synthesis systems — MIT
  • cl-argparse — A python argparse inspired command line parser library — MIT
  • cl-maxsat — Common Lisp API to MAX-SAT Solvers — LGPL
  • cl-mount-info — Get information about mounted filesystems on GNU/linux. — LLGPLv3 or later
  • cl-piglow — A Pimoroni PiGlow library for Common Lisp — MIT
  • cl-simple-fsm — Easy and explicit finite state machines in Common Lisp. — MIT
  • constantfold — User-defined constant folding facility — LGPL
  • eazy-documentation — One-shot solution to the CL documentation generator. — LGPL
  • file-select — A library to invoke the native file selection dialogs to open or save files. — zlib
  • gtype — C++/Julia-like parametric types in CL, based on CLtL2 extensions — LGPL
  • lispqr — QR code encoding. — MIT
  • numcl — Numpy clone in Common Lisp, using MAGICL/LLA/MGL-MAT as the backend (in the future) — LGPL
  • osmpbf — Library to read OpenStreetMap PBF-encoded files. — MIT
  • reader — A utility library intended at providing reader macros for lambdas, mapping, accessors, hash-tables and hash-sets. — MIT
  • serializable-object — Provides a simple class and API for the objects serializable in a FASL file — LGPL
  • simple-config — loads and parses a KEY=VALUE style config file — BSD 3-Clause
  • specialized-function — Provides a Julia-like function that automatically compiles a type-specific version of the function from the same code — LGPL
  • trainable-object — Provides a metaclass and APIs for the trainable funcallable instances. — LGPL
Updated projects: adopt, alexandria, april, assert-p, assertion-error, beast, binary-io, bknr-datastore, bobbin, bp, cacau, chancery, ci-utils, cl-collider, cl-digraph, cl-fastcgi, cl-fuse, cl-github-v3, cl-kraken, cl-las, cl-libusb, cl-markdown, cl-murmurhash, cl-naive-store, cl-netpbm, cl-online-learning, cl-patterns, cl-pcg, cl-prolog2, cl-pslib, cl-random-forest, cl-str, cl-torrents, cl-yesql, closer-mop, command-line-arguments, commonqt, concrete-syntax-tree, conf, croatoan, datamuse, date-calc, djula, easy-audio, eclector, esrap, eventbus, femlisp, fxml, gendl, glsl-toolkit, kenzo, lake, lisp-binary, literate-lisp, maiden, mcclim, metatilities-base, method-hooks, nodgui, origin, petalisp, polisher, postmodern, qlot, quickproject, quilc, qvm, rpcq, sb-fastcgi, scalpl, sealable-metaobjects, sel, serapeum, shadow, simplet, skeleton-creator, sly, static-dispatch, stumpwm, trivia, type-i, type-r, umbra, usocket, yason, youtube.

Removed projects: cells-gtk3, cl-digikar-utilities, cl-liballegro, cl-notebook, gbbopen, gtk-cffi, matlisp, matplotlib-cl.

cl-digikar-utilities was removed at the author's request. The other projects were removed because they no longer build, either whole or in part, and attempts to contact the authors have been fruitless.

To get this update, use (ql:update-dist "quicklisp"). Enjoy!

Joe MarshallStringly typed

· 30 days ago
Here is a typical URL: http://www.host.com:8080/path/to/something?arg1=val1&arg2=val2 It has a lot of structure to it. There is the protocol, the hostname, the port, the path, and the query, which itself is made up of key/value pairs. Yet most programs don't treat URLs as data structures; they treat them as strings. (And you have about a 50/50 chance of that & being escaped correctly. Fortunately or not, many programs don't care if things are correctly escaped.)

Here is a UUID: 3e042a1d-0f68-4a78-8732-14e0731d7732 It has five obvious fields, but certain bytes of the UUID are reserved for different purposes. For instance, the most significant bits of groups 3 and 4 are used to encode the type of UUID, and the other bits might encode a timestamp or MAC address. Yet most programs don't treat UUIDs as data structures. In fact, most don't even bother with a separate type; they just make strings out of them.

Here are some email addresses, some of them unusual: very.common@example.com, "john..doe"@example.org, mailhost!username@example.org, user%example.com@example.org These, too, are structured. All contain a fully qualified domain name separated by dots and an @-sign delimiter. Yet most programs don't bother even treating them as a separate type and just make strings out of them.

JSON is a data encoding and transfer format which has nested structures and arrays, but ultimately builds structure out of string to string mappings. Often I have seen JSON format used for serializing objects for interprocess calls, network interactions, or storing objects persistently. Far less often have I seen code that deserializes JSON objects into strongly typed, structured data. It is often left in stringified form and the code uses a handful of ad hoc parsers for extracting the necessary data at the last moment, when it is wanted in a more structured format.

All these are examples of "stringly typed" programming. In this style of programming, you get none of the advantages of strong typing and structured data, and all the fun that comes from accidentally using a string that represents one kind of object where another is called for.

The first problem in a stringly typed program is that the representation of abstract objects is exposed everywhere. Callers no longer have to treat objects abstractly and usually don't bother. Why call a special operator when a string operation will suffice? Second, implementors usually don't provide a complete API. Why bother when users are going to paste strings together anyway? Implementors often don't even document which string operations are valid and which lead to invalid objects. Third, strings are not a very rich data type. You cannot embed other strings within them, for example, without escape sequences, and if everything is a string, the nested escaping quickly gets out of hand. How often have you seen a regexp with half a dozen ‘/’ characters? How often have you encountered a URL with too many or too few levels of “percent” escapes?

Fourth, it encourages implementors to try to be “helpful” with the few API calls they do provide. How often have you seen an API call that recursively unescapes something until there is nothing more to unescape? What if you need a representation of a URL rather than the URL itself? I recall a version of Common Lisp that “helpfully” stripped the “extra” quotation marks off the command line arguments. Unfortunately, this changed the meaning of the argument from a string that represented a pathname to the pathname itself, causing a type error. I eventually resorted to passing the character codes in as arguments and EVALing a (MAP 'STRING 'CODE-CHAR ...) on them to sneak them past the “helpful” command line processor.

Strings make a fine representation for many data types, but you shouldn't avoid the work needed to make a fully abstract data type even if the underlying representation is just a string.
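
To make that concrete, here is a hypothetical sketch of such a type in Common Lisp (the names are illustrative, not from any particular library): callers get real operations, and the string-bashing is confined to one place.

(defstruct url
  scheme host port path
  query)   ; an alist of (key . value) strings, not one big string

(defun url-query-value (url key)
  "Look up a query parameter abstractly instead of via string surgery."
  (cdr (assoc key (url-query url) :test #'string=)))

(defun url-string (url)
  "Render the URL; percent-escaping is omitted for brevity."
  (format nil "~A://~A:~D~A?~{~A=~A~^&~}"
          (url-scheme url) (url-host url) (url-port url) (url-path url)
          (loop for (k . v) in (url-query url) collect k collect v)))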


Addendum to John Cowan:

I cannot for the life of me figure out why I cannot comment on my own posts. I've turned off all my browser extensions and even tried other browsers. Nothing. Anyway, here's what I have to say:

Now I know you're trolling me. You cannot be serious about keeping this data as an unstructured string and parsing it with line noise when you need to extract parts of it. The RFC even provides a BNF grammar. You're just going to ignore that?

I've read through your papers, but they hardly argue against structure — they argue against attempting to impose too much structure. They appear to impose a "top-level" structure and argue against trying to decompose things too far because there is too much variation. I agree, but I'm sure you'll agree with me that adding the Jesuit "S.J." to the country code is not a valid operation. I'm arguing that pulling out structure to the point where it makes sense (and not going into the micro-structure where it doesn't) is a better alternative than keeping everything as a string and trying to use regexps and string-manipulation operators to act on your data. If your friend becomes a Jesuit, you're not going to want to try to use some horrific regexp to split the string between his name and his address and then string-append an "S.J."

Chaitanya GuptaOptimizing array operations for multiple element types

· 32 days ago

While working on qbase64, I stumbled over a peculiar problem: I wanted it to work as fast as possible when optimized array types (SIMPLE-ARRAY, SIMPLE-BASE-STRING, etc.) were passed to the encoding/decoding routines, but I also wanted to support the more general types.

For example, the core encoding routine in qbase64, %ENCODE, looks something like this (simplified):

(defun %encode (bytes string)
  (loop ;; over bytes and write to string
     ...))

It goes through the BYTES array, taking groups of 3 octets and writing each encoded group of 4 characters into STRING.

If I declared its types like this:

(defun %encode (bytes string)
  (declare (type (simple-array (unsigned-byte 8)) bytes))
  (declare (type simple-base-string string))
  (declare (optimize speed))
  (loop ...))

SBCL would produce very fast code, but the function would no longer work for the more general ARRAY or STRING types.

And if I were to redefine the routine with more general types:

(defun %encode (bytes string)
  (declare (type array bytes))
  (declare (type string string))
  (declare (optimize speed))
  (loop ...))

the code produced would be significantly slower.

My experience with generics is limited, but it seemed that generics could solve this problem elegantly. Common Lisp doesn't have generics in that sense, but it does support macros, so I came up with an ugly-but-gets-the-job-done hack.

I created a macro, DEFUN/TD, that would take all the different type combinations I wanted to optimize and support upfront:

(defun/td %encode (bytes string)
   (((bytes (simple-array (unsigned-byte 8))) (string simple-base-string))
    ((bytes (simple-array (unsigned-byte 8))) (string simple-string))
    ((bytes array)                            (string string)))
  (declare (optimize speed))
  (loop ...))

and generate code which would dispatch over the type combinations, then use LOCALLY to declare the types and splice the body in:

(defun %encode (bytes string)
  (cond
    ((and (typep bytes '(simple-array (unsigned-byte 8)))
          (typep string 'simple-base-string))
     (locally
       (declare (type (simple-array (unsigned-byte 8)) bytes))
       (declare (type simple-base-string string))
       (declare (optimize speed))
       (loop ...)))
    ((and (typep bytes '(simple-array (unsigned-byte 8)))
          (typep string 'simple-string))
     (locally
       (declare (type (simple-array (unsigned-byte 8)) bytes))
       (declare (type simple-string string))
       (declare (optimize speed))
       (loop ...)))
    ((and (typep bytes 'array)
          (typep string 'string))
     (locally
       (declare (type array bytes))
       (declare (type string string))
       (declare (optimize speed))
       (loop ...)))
    (t (error "Unsupported type combination"))))

The result is more generated code and an increase in the size of the Lisp image, but now the loop is well optimized for each type combination given to DEFUN/TD. The run-time dispatch might incur a slight penalty, but it is more than offset by the gains made.
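
For reference, the dispatching macro itself can be small. Here is a minimal sketch of how a DEFUN/TD along these lines could be written (not the actual macro from qbase64):

(defmacro defun/td (name lambda-list type-combos &body body)
  "Define NAME once, with BODY compiled separately for each combination
of variable types in TYPE-COMBOS and selected by run-time dispatch."
  `(defun ,name ,lambda-list
     (cond
       ,@(loop for combo in type-combos
               collect
               `((and ,@(loop for (var type) in combo
                              collect `(typep ,var ',type)))
                 (locally
                     ,@(loop for (var type) in combo
                             collect `(declare (type ,type ,var)))
                   ,@body)))
       (t (error "Unsupported type combination")))))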

Alternatives

This was a fairly interesting problem that I hadn't dealt with before; nevertheless it looked like a fairly common one, so I asked on the cl-pro list a couple of years ago how others solved it. Mark Cox pointed me to a few libraries.

All of these are quite interesting and attack more or less the same problem in different ways.

Is there a trick or two that I've missed? Feel free to tell me.

Zach BeaneSBCL20 in Vienna

· 33 days ago

Last week I attended the SBCL 20th anniversary workshop, held in Vienna, Austria. Here are some rough impressions - I wish I had detailed enough notes to recreate the experience for those who did not attend, but alas, this is what you get! It’s incomplete and I’m sorry if I’ve left out someone unfairly - you have to go yourself next time so you don’t miss a thing.

Structure of the gathering

I didn’t go to SBCL10, so I didn’t know what to expect. There were few speakers announced, and I worried (to myself) that this was a sign of insufficient participation, but I couldn’t have been more wrong.

Philipp Marek played host at his employer, BRZ. We had a large room laid out with several tables, ample power, notepads, markers, and treats, fresh food and drink, and a stage area with chairs set out for an audience.

The conference led off with a BRZ representative who explained BRZ’s purpose with an English-language video and encouraged qualified people to consider joining the company.

Christophe Rhodes acted as master of ceremonies. He put the conference in historical context and explained the format. While there were only a small number of planned talks, the remainder of the two-day conference was intended for brainstorming, reminiscing, hacking, and impromptu talks by anyone with an interesting idea, demo, or controversial proclamation. He challenged all of us to pick something to work on and present it by the end of the conference.

With everyone in the same room, it was nice to see people help each other in unexpected ways. People describing a problem would get interrupted by people who could help. And not “Have you tried installing My Favorite Linux?”-type unhelpful speculation - it was always valuable in one form or another. The breadth and depth of SBCL and Common Lisp knowledge meant that, for example, someone could explain why a certain decision was made in CMUCL 30 years ago, why SBCL went in a different direction 20 years ago, and how things could be improved today.

I chose to revisit a problem I first wrote about five years ago. It boils down to this: can a third party (Quicklisp, in this case) control the dynamic environment of each load-system operation when recursively loading a system? The SBCL connection is a little loose, but it relates to how SBCL’s compile-time analysis warnings should be shown prominently for local development, even when loading systems with ql:quickload.

A lot of people pitched in with ideas and suggestions. Mark Evanson in particular acted as a sounding board in a way that helped me get to the essence of the problem. Unfortunately, I wasn’t able to find a solution before the end of the conference.

Robert Smith and Lisp at Rigetti

Robert gave an introduction to the math behind quantum computing, and then described the success Rigetti has had with SBCL and Common Lisp for building its quantum computing simulator and compiler. CL allows exploring and adopting new ideas at a very quick pace. I liked that it wasn’t high-level “Lisp is great!” preaching to the choir, but gave some specific practical examples of how CL features made life easier.

Charlie Zhang and RISC-V

Charlie is a newcomer to SBCL development as of 2019. In the span of a few months, he added a new compiler backend for fun. He talked about the challenges of understanding and extending SBCL, and some of his future plans.

He generated the backend's assembly with the Lisp assembler, which gave him the full advantages of Lisp for code manipulation at the assembly level. It is the first and only backend to do so!

Funniest fact I overheard at dinner after the workshop: “I don’t even use Common Lisp for anything, I just hack on the compiler!”

Doug Katzman and SBCL as a Unix program

Doug Katzman talked about his recent work at Google aimed at making SBCL play well with Unix tools. Specifically debugging and profiling tools that expect a binary to act in a certain way and to provide debugging and stack info in a specific format. This makes it easier to provide insight into how SBCL is spending its time when there are many, many instances running across many, many computers. And with so many instances running, the benefits of improving performance add up in significant ways.

This use case is far removed from my typical live, ongoing, interactive development session with SBCL. It was interesting to learn about the constraints and requirements of Lisp servers in that kind of environment.

Other talks

There were a dozen or more short presentations and talks about various topics. Luís talked about porting an application to SBCL from another Lisp and some of the good and bad parts along the way. Charlie talked about contification optimization for SBCL. There were quick talks about tuning the GC, SIMD extensions, missing features of SBCL, Coalton, CL support in SWIG, and several more.

Bits and pieces

  • There are more commits to SBCL now than ever before, despite there being fewer people contributing overall
  • It would be nice to have support for good CL interaction in popular new editors, but some of the existing components (e.g. LSP) are insufficient to support SLIME-like levels of interaction
  • Siscog has more than 60 people who work on Lisp programs - biggest collection in the world?

Closing wishlist items

Christophe solicited wishlist items to close out the conference. Here are a few that I jotted down:

  • Debug-oriented interpreter, e.g. with macroexpand-always behavior (the current interpreters do not especially improve the debug experience)
  • Profile-based compilation feedback (sort of like a tracing JIT, but not exactly?)
  • Class sealing
  • Fully configurable numeric tower
  • Arbitrary-precision floats
  • Concurrent GC - GC pauses were cited by more than one person as a pain point

Overall

I had a great time at SBCL20 and it renewed my energy for working on Common Lisp projects. It’s great to see in person the people you are helping when you create or contribute to a community project.

Thanks to everyone who attended and spoke, to Philipp Marek for hosting, and to Christophe Rhodes for acting as MC and much more.

It’s too long to wait for SBCL30. How about an SBCL25, or even sooner?

See you there!

Paul RodriguezThe concatenated stream trick

· 36 days ago

Suppose you have a parser that reads from a stream, but the stream is missing something. E.g. you have a function that reads a string from a stream, but while the function expects the next character to be an opening quote, the opening quote has already been read from the stream.

The obvious solution is unread-char. This takes a character — the last character read from the stream — and puts it back into the stream. And this is fine for the situation where your function is being invoked as (say) a reader macro. But what if you need to call the parser by itself?

What you need in this situation is not unread-char but a concatenated stream. The stream returned by make-concatenated-stream returns all the elements from all of its arguments as a single stream. You can add any prefix or suffix you need:

(defun read-delimited-string (stream)
  (let ((stream
          (make-concatenated-stream
           (make-string-input-stream "\"")
           stream)))
    (parse-string stream)))
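
A usage sketch (PARSE-STRING here stands for the author's string-reading function, which expects to see the opening quote):

;; The opening quote was consumed before we were handed the stream:
(with-input-from-string (s "hello\" and the rest")
  (read-delimited-string s))
;; Inside READ-DELIMITED-STRING, PARSE-STRING reads from the concatenated
;; stream and sees "hello" with its opening quote restored.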

Joe Marshall100% Coverage

· 36 days ago
I'm doing some consulting for a company that has a policy of 100% branch coverage on all their tests. I have mixed feelings about this. Certainly for some of the software they develop (avionics, for example), software bugs can get people killed, and I can see that 100% branch coverage helps assure that this is unlikely. (It doesn't guarantee it, unfortunately: just because you've tested all the branches doesn't mean the software is correct. But it goes a long way toward making sure your software behaves predictably.)

Fortunately, I don't work on software that people's lives depend upon. I'm working on supply-chain and inventory software and here the issue is murkier. Sure, you can argue that "For want of a nail...", but realistically, a bug in the supply-chain software is simply an annoyance that means wasted space if you have too many of something or a delay in production if you have too few. And there are already many other sources of production delay. Or perhaps you'll see an ugly display if the bug is in the UI. These are annoying, but hardly life threatening.

There is a significant downside to 100% branch coverage. It means that you have to persuade the code to take branches it wouldn't normally take. Sometimes this is as easy as passing out-of-range values to a procedure, but sometimes it is hard. Code isn’t designed to fail; it is designed to work, so sometimes you have to induce failure to take the non-normal branch. It shouldn’t be easy to induce failure. It should be close to impossible. That makes testing the failure cases close to impossible as well. You have two equally unpleasant alternatives: either adjust the code to make it easier to induce failure, or remove the branch that handles the "impossible" case.

I am a proponent of "defensive" programming. I put "sanity checks" in my code to ensure that the system is in an expected state. It may not be possible for it to get into an unexpected state, but I put the checks in anyway. I add default clauses even if all the cases are covered. My conditional branches often have a clause for the "can't happen" case. I give enums sentinel values that they should never be able to take on. These situations either fail fast or drop you into the debugger, depending on whether you have a debugger or not. This kind of defensive programming introduces branches that should not be possible to take, so it is not possible to get 100% branch coverage. If you want 100% branch coverage, you are left with unpleasant options: hack your code so it can fail sanity checks, enter states not covered by all cases, do things that "can't happen", or take on illegal values; or, alternatively, remove the sanity checks so all branches can be taken.
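
In Lisp terms, that defensive style might look like the following hypothetical dispatch (START-MOTOR and friends are placeholders for real work):

(defun handle-state (state)
  (case state
    (:idle    (start-motor))
    (:running (advance))
    (:done    (shut-down))
    ;; The "can't happen" clause: unreachable if the rest of the program
    ;; is correct, so a branch-coverage tool can never see it taken.
    (otherwise (error "Impossible state: ~S" state))))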

Code can change, and these defensive techniques help ensure that the code detects breaking changes and acts appropriately by dropping into the debugger or failing-fast.

Of course, if you are flying an airplane, you don't want your software to unexpectedly drop into the debugger and the airplane to unexpectedly drop out of the sky. In this case, a policy of 100% branch coverage makes sense. For business software, it is harder to make that case.

Addendum: Stupid tools.

Java provides a nice set of stupid tools. One prime example is the branch coverage tool. It tells you which conditionals in the code are fully tested or only partially tested. What it doesn't tell you is which branch was tested, so you have to guess. Sometimes it is fairly obvious, but other times, especially with slightly more complex conditionals, it is very difficult to figure out which branches are being taken and which are not. Now there are some reasons why the coverage tool cannot determine this basic information, and I'm sure they are logical (I understand it has something to do with the difficulty of mapping the optimized byte code back to the source code), but seriously, a tool that can detect coverage but cannot tell you what exactly is covered is only half-baked. This encourages people to rewrite their conditionals to avoid anything more complex than single tests, which makes the code more verbose and less readable. Thanks a bunch.

Paul RodriguezThe Iterate clause trick

· 37 days ago

Iterate provides defmacro-clause to make it easy to write new clauses. Here's a simple one that builds up an FSet set:

(defmacro-clause (collecting-set x &optional into var)
  `(reducing ,x by #'with into ,var initial-value (empty-set)))
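
Used in a loop (and assuming FSet's WITH and EMPTY-SET are accessible, as the clause already requires), it behaves like the built-in accumulation clauses:

(iterate (for x in '(1 2 2 3))
  (collecting-set x))
;; => an FSet set containing 1, 2, and 3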

But the syntax of defmacro-clause is not as flexible as I would like. In particular, there is no way to supply more than one expression. To collect an FSet map, I would like to be able to write:

(collecting-map k v into result)

But the syntax of defmacro-clause won't allow that.

However, there is still hope. Iterate uses a code walker that expands macros. All we have to do to get both the benefits of defmacro-clause (the ability to collect into a particular variable) and the syntax we want is to write two macros: a macro with the syntax we want, which expands into an auxiliary Iterate clause macro.

(defmacro-clause (collecting-map-aux kv &optional into var)
  `(reducing ,kv by (lambda (map kv)
                      (with map (car kv) (cdr kv)))
             into ,var initial-value (empty-map)))

(defmacro collecting-map (k v &rest args)
  `(collecting-map-aux (cons ,k ,v) ,@args))

And then we can write:

(iterate (for x in xs)
  (collecting-map x (* x 2)))


For older items, see the Planet Lisp Archives.


Last updated: 2020-01-22 03:28