Post navigation

Warning: The formula for “tolerated” comparison in this post is not correct for actual floating-point numbers. Do not use it! The next post will discuss how to compute the required value precisely in the presence of floating-point error.

Tolerant comparison is an important feature in making APL usable in the real world. But it’s also a lot slower than exact comparison. We think this tradeoff is well worth it (and of course, if you don’t want to make it, just set ⎕CT←0). But once in a while we can completely get rid of the slowdown! Added in Dyalog 17.0, along with vectorised comparison code so fast you might not even notice this change, is a cool technique that I call “tolerated comparison”. It works by turning a tolerant comparison with one argument fixed into an intolerant comparison by adjusting that fixed argument. So if you’re comparing one floating-point number to many of them, your code will run 30-40% faster than it would without this improvement.

Why tolerance?

Let’s take a trip to a slightly different APL.

⎕CT←0

A world in which tolerant comparison hasn’t been invented. Two things are equal when they’re the same.

0.1 = 0.3-0.2
0
{(0.1×⍵) = ⍵÷10} ⍳8
1 1 0 1 1 0 0 1

Okay, that’s a scary place. What happened?

The floating-point format can’t possibly represent every real number. Instead, there’s a carefully chosen set containing just shy of 2*64 numbers which can be expressed. For other numbers, we pick the closest available floating-point number. Usually this is a pretty accurate approximation. But once we perform arithmetic with a few of these values, we might go from “closest possible approximation” to “second-closest approximation” or even further away. The differences are still tiny, but when the same value—mathematically speaking—is computed in two different ways the results might be different numbers. Just barely different, but in a world without tolerant comparison that means they won’t be equal any more.

Tolerant comparison is an attempt to fix this. It’s not perfect, but it’s better in a lot of real-world situations. It makes = a useful function. The rule is that two values are equal when their difference is at most ⎕CT times the larger of their magnitudes. This is often called a “relative epsilon comparison” outside of the APL world.

Notation: To avoid confusion in the following text, I have used the APL symbols <≤=≥>≠ for intolerant comparisons, and the non-APL symbols ⋦≲≂≳⋧≉ to refer to APL’s tolerant comparisons which depend on ⎕CT. Watch out for a few quirks due to the scarcity of symbols to use: the actual APL primitives are denoted by non-APL characters, and the simpler symbol < “less-than” corresponds to the more complicated ⋦ “less-than and not approximately equal to”.

Tolerant comparisons

The famous formula for tolerant equality a≂b in terms of intolerant inequality is

(|a-b) ≤ ⎕CT×(|a)⌈(|b)

This allows for a little variation in the arguments, provided that neither one is zero. And it has some other nice properties, assuming ⎕CT≤1:

It’s reflexive: a≂a for any number a.

It’s symmetric in a and b: reversing them makes no difference in the result. This is because ⌈ and (|-) are symmetric functions.

It scales linearly: multiplying both a and b by the same nonzero value (even a negative number) does not change the result.

It’s convex in each variable: if a≂b and a≂d for some a, and c is between b and d, then a≂c.

In fact, tolerant equality as in APL is the only solution (or family of solutions, taking the ⎕CT parameter into account) to those constraints, other than the trivial one in which any two numbers are equal.

The above figure illustrates tolerant comparison of two real numbers. They are exactly equal only on the grey diagonal line, and tolerantly equal within the narrow strip around it. The diagram uses a much larger value for ⎕CT than Dyalog’s default of 1e¯14, or even its maximum allowed value of 2*¯32, either of which would make that strip invisible.

One way to define tolerant inequalities in terms of tolerant equality is to define, for instance, (a≲b) ≡ (a<b)∨(a≂b) using intolerant less-than and tolerant equality. But we’ll find it more useful to define ≲ and ≳ directly in terms of intolerant comparisons as with ≂. The formula for ≲ is:

(a-b) ≤ ⎕CT × 0⌈a⌈-b

and for ≳ we can use either (a≳b) ≡ (b≲a) or (a≳b) ≡ (-a)≲(-b) to obtain

(b-a) ≤ ⎕CT × 0⌈b⌈-a

or, negating both sides and swapping the inequality,

(a-b) ≥ ⎕CT × 0⌊a⌊-b

The 0⌈ term isn’t strictly necessary (as we’ll show later), but it helps avoid some confusion when a and b have different signs. Sometimes it’s used to speed up comparison by checking if a≤b intolerantly before performing any multiplication—if it is, the multiplication isn’t needed.

Above we can see how the equation for tolerant ≲ splits into two symmetrical half-formulas, each of which is just a little off from the exact inequality a≤b. The tolerant inequality accepts either of these conditions.

The region of tolerant equality is now the intersection of the regions for tolerant ≲ and ≳. That is:

(a≂b) ≡ (a≲b)∧(a≳b)

We’ll find this decomposition handy later on since tolerant inequalities are so much easier to work with.

Tolerating an inequality

What happens if we fix one side of the inequality? Here’s a diagram for a≲b with a dashed line drawn where b is equal to some fixed B and a varies.

There’s only one place where the black line bounding the region where a≲b intersects the line b=B. The value of a at this intersection has an interesting property: when a is (exactly) less than or equal to that value, then it is tolerantly less than or equal to B. If we want to compare B to many other floating-point values, and we know the a coordinate of the intersection above, then we can skip all of the tolerant comparison arithmetic and just use exact comparison with that value!

Once we have a way to do this for ≲ all other inequalities will follow, since (a≳B) ≡ (-a)≲(-B) and (a⋧B) ≡ ~(a≲B).

Maybe you can figure out the value from the graphs above. But it’s also easy to find the value (and prove it works!) using a fact about the maximum function: for real numbers x, y, and z, x≤y⌈z is the same as (x≤y) ∨ (x≤z). One of the advantages of treating inequalities as functions like any other is that formulas like these are easy to manipulate. Here we go:

Start with the formula for tolerant inequality, writing q in place of ⎕CT.

(a-B) ≤ q × 0⌈a⌈-B

Distributing q× over ⌈, we have

(a-B) ≤ 0 ⌈ (q×a) ⌈ (-q×B)

and using (x≤y⌈z) ≡ ((x≤y) ∨ (x≤z)) twice (just like in the previous step, this can be thought of as distributing x≤ over three arguments of an associative function all at once), we can expand this formula to

((a-B) ≤ 0) ∨ ((a-B) ≤ q×a) ∨ ((a-B) ≤ -q×B)

which simplifies to

(a ≤ B) ∨ ((a×1-q) ≤ B) ∨ (a ≤ B×1-q)

or (here we need q<1, so that 1-q is positive; otherwise we’d have to reverse the middle inequality)

(a ≤ B) ∨ (a ≤ B÷1-q) ∨ (a ≤ B×1-q)

Recombining, we get

a ≤ B ⌈ (B÷1-q) ⌈ (B×1-q)

So that’s our formula: now the tolerant equality is an intolerant equality, where the right side depends only on B!

Reading the signs

So we have a formula for a “tolerated” value b = B ⌈ (B÷1-q) ⌈ (B×1-q). When do the different possible maxima come into play?

If B=0, then everything on the right is zero, and b=0.

If B>0, then B×1-q is smaller than B, since q>0. But that means ÷1-q is larger than 1, and B÷1-q is larger than B. So b=B÷1-q.

If B<0, then the all the inequalities that we got with B>0 are reversed. This means b=B×1-q.

Note that none of these cases depend on the branch b=B, even though the first one touches it. That means the definition of tolerant inequality would be the same even if we remove that branch. By working backwards through the above derivation we find that the initial 0⌈ has no effect, and we can remove it from the definition.

Equality

Tolerant equality obviously can’t be reduced to intolerant equality, since many values can be tolerantly equal to a number but only one is intolerantly equal to it. But recalling that (a≂b) ≡ (a≲b)∧(a≳b), we can express it as two intolerant inequalities: tolerate b for ≲ as above. Then tolerate it for ≳ (this always yields a value at least as large as the first). a is tolerantly equal to b if it lies between the two tolerated values.

This case is particularly important for the set functions ⍳∊∩∪~, which by definition use tolerant equality to match values in one argument to values in the other. If the non-principal argument (the one containing values to be looked up, like the right argument to ⍳) is short enough, then each element in it is looked up using a simple linear search through the principal argument. This means that one number at a time is compared to the entire principal argument (up to the first match found). So for floats we can get a substantial improvement by tolerating each element on the right before comparison.

Caution!

That was the easy part.

In this post we’ve been assuming that floating-point numbers behave like mathematical real numbers. Given that the whole reason tolerant comparison was introduced is that they don’t, maybe that wasn’t the best assumption.

The formulas B÷1-q and B×1-q for tolerated comparison are flawed: they produce values that differ slightly from the actual number that should be used in an intolerant equivalent to tolerant comparison. This is very dangerous: many programs could break if two values compare equal in one comparison but not in another! In the next post we’ll dive into the world of floating-point error analysis and find out how to get the right values. And we’ll discover that imperfect computation isn’t all bad when we find a faster alternative to a very slow floating-point division.

It is possible to compute the progressive index without looping or recursion, as the two FinnAPL functions demonstrate.

Why It Works

The basic idea of ⍺ pix ⍵ is to substitute for each item of ⍺ and ⍵ an equivalent representative, collectively c and d, whence the result obtains as c⍳d. The equivalent representative used here is ranking, specifically the ranking of the indices in ⍺.

The ranking of an array ⍵ is a permutation of order ≢⍵. The smallest major cell is assigned 0; the next smallest is assigned 1; and so on. Ties are resolved by favoring the earlier-occurring cell. The ranking can be computed by ⍋⍋⍵. For example:

⍋⍋⍺⍳⍺,⍵ rankings of indices in ⍺ of ⍺ and ⍵, favoring ⍺⍋⍋⍺⍳⍵,⍺ rankings of indices in ⍺ of ⍵ and ⍺, favoring ⍵

The first ⍴⍺ items of the former are those for ⍺ and the first ⍴⍵ of the latter are those for ⍵, and we get

pix ← {((⍴⍺)⍴⍋⍋⍺⍳⍺,⍵) ⍳ ((⍴⍵)⍴⍋⍋⍺⍳⍵,⍺)}

The second version depends on the following properties of permutations. Let p be a permutation. Then p[⍋p] ←→ ⍳≢p, the identity permutation, and therefore ⍋p is the inverse of p. Furthermore, p[p⍳⍳≢p] ←→ ⍳≢p and so p⍳⍳≢p is also the inverse of p. The inverse is unique (that’s why it’s called the inverse), therefore ⍋p ←→ p⍳⍳≢p.

We saw previously that if p is a permutation then ⍋p ←→ p⍳⍳⍴p. The special code for ⍋p makes the two expressions run at roughly the same speed. The slight advantage for ⍋⍋x versus (⍋x)⍳⍳⍴x would increase if and when ⍋⍋ is recognized as an idiom.

A General Solution

Index-of works on cells rather than just scalars. Likewise, progressive index-of can also be extended to work on cells. The core algorithm remains the same. The generalization obtains by first reshaping ⍵ to have the same rank as ⍺ (having major cells with the same shape), applying the core algorithm, and then reshaping its result to have the same leading shape as the original ⍵. Thus:

The following are my attempts at the Phase I problems of the 2018 APL Problem Solving Competition. There are not necessarily “right answers” as personal style and taste come into play. More explanation of the code is provided here than common practice. All solutions pass all the tests specified in the official problem description.

The solutions for problems 1 and 3 are due to Brian Becker, judge and supremo of the competition. They are better than the ones I had originally.

1. Oh Say Can You See?

Given a vector or scalar of skyscraper heights, compute the number of skyscrapers which can be seen from the left. A skyscraper hides one further to the right of equal or lesser height.

visible ← {≢∪⌈\⍵}

Proceeding from left to right, each new maximum obscures subsequent equal or lesser values. The answer is the number of unique maxima. A tacit equivalent is ≢∘∪∘(⌈\).

2. Number Splitting

Split a non-negative real number into the integer and fractional parts.

split ← 0 1∘⊤

The function ⍺⊤⍵ encodes the number ⍵ in the number system specified by numeric vector ⍺ (the bases). For example, 24 60 60⊤sec expresses sec in hours, minutes, and seconds. Such expression obtains by repeated application of the process, starting from the right of ⍺: the next “digit” is the remainder of the number on division by a base, and the quotient of the division feeds into the division by the next base. Therefore, 0 1⊤⍵ divides ⍵ by 1; the remainder of that division is the requisite fractional part and the quotient the integer part. That integer part is further divided by the next base, 0. In APL, remaindering by 0 is defined to be the identity function.

You can have a good argue about the philosophy (theology?) of division by 0, but the APL definition in the context of ⍺⊤⍵ gives practically useful results: A 0 in ⍺ essentially says, whatever is left. For example, 0 24 60 60⊤sec expresses sec as days/hours/minutes/seconds.

3. Rolling Along

Given an integer vector or scalar representing a set of dice, produce a histogram of the possible totals that can be rolled.

⍳⍵ produces all possible rolls for the set of dice (the cartesian product of ⍳¨⍵) whence further application of +/¨ and then , produce a vector of all possible totals. With the vector of possible totals in hand, the required unique totals and corresponding histogram of the number of occurrences obtain readily with the key operator ⌸. (And rather messy without the key operator.) For each unique total, the operand function {⍺('*'⍴⍨≢⍵)} produces that total and a vector of * with the required number of repetitions.

The problem statement does not require it, but it would be nice if the totals are listed in increasing order. At first, I’d thought that the totals would need to be explicitly sorted to make that happen, but on further reflection realized that the unique elements of ,+/¨⍳⍵ (when produced by taking the first occurrence) are guaranteed to be sorted.

Since the zodiac signs are assigned in cycles of 12, the phrase 12| plays a key role in the solution. The residue (remainder) function | is inherently and necessarily 0-origin; the 1+ accounts for the 1-origin indexing required by the competition. Adding 0>⍵ overcomes the inconvenient fact that there is no year 0.

Essentials of the computation are brought into sharper relief if each zodiac sign is denoted by a single character:

For working with irregular-sized non-overlapping intervals, the pre-eminent function is ⍸, interval index. As with the Chinese zodiac, essentials of the computation are brought into sharper relief if each zodiac sign is denoted by a single character:

The single-character signs, Unicode U+2648 to U+2653, were found by Google search and then confirmed by https://www.visibone.com/htmlref/char/cer09800.htm. It is possible that the single-character signs do not display correctly in your browser; the line of code can be expressed alternatively as (1+d⍸⍵)⊃13⍴⎕ucs 9800+12|8+⍳12.

6. What’s Your Angle?

Check that angle brackets are balanced and not nested.

balanced ← {(∧/c∊0 1)∧0=⊃⌽c←+\1 ¯1 0['<>'⍳⍵]}

In APL, functions take array arguments, and so too indexing takes array arguments, including the indices (the “subscripts”). This fact is exploited to transform the argument string into a vector c of 1 and ¯1 and 0, according to whether a character is < or > or “other”. For the angles to be balanced, the plus scan (the partial sums) of c must be a 0-1 vector whose last element is 0.

The closely related problem where the brackets can be nested (e.g. where the brackets are parentheses), can be solved by checking that c is non-negative, that is, ∧/(×c)∊0 1 (and the last element is 0).

7. Unconditionally Shifty

Shift a boolean vector ⍵ by ⍺, where positive means a right shift and negative means a left shift.

shift ← {(≢⍵)⍴(-⍺)⌽⍵,(|⍺)⍴0}

The problem solution is facilitated by the rotate function ⍺⌽⍵, where a negative ⍺ means rotate right and positive means rotate left. Other alternative unguarded code can use ↑ or ↓ (take or drop) where a negative ⍺ means take (drop) from the right and positive means from the left.

8. Making a Good Argument

Check whether a numeric left argument to ⍺⍉⍵ is valid.

dta ← {0::0 ⋄ 1⊣⍺⍉⍵}

This is probably the shortest possible solution: Return 1 if ⍺⍉⍵ executes successfully, otherwise the error is trapped and a 0 is returned. A longer but more insightful solution is as follows:

dt ← {((≢⍺)=≢⍴⍵) ∧ ∧/⍺∊⍨⍳(≢⍴⍵)⌊(×≢⍺)⌈⌈⌈/⍺}

The first part of the conjunction checks that the length of ⍺ is the same as the rank of ⍵. (Many APL authors would write ⍴⍴⍵; I prefer ≢⍴⍵ because the result is a scalar.) The second part checks the following properties on ⍺:

all elements are positive

the elements (if any) form a dense set of integers (from 1 to ⌈/⍺)

a (necessarily incorrect) large element would not cause the ⍳ to error

The three consecutive ⌈⌈⌈, each interpreted differently by the system, are quite the masterstroke, don’t you think? ☺

9. Earlier, Later, or the Same

Return ¯1, 1, or 0 according to whether a timestamp is earlier than, later than, or simultaneous with another.

ts ← {⊃0~⍨×⍺-⍵}

The functions works by returning the first non-zero value in the signum of the difference between the arguments, or 0 if all values are zero. A tacit solution of same: ts1 ← ⊃∘(~∘0)∘× - .

10. Anagrammatically Correct

Determine whether two character vectors or scalars are anagrams of each other. Spaces are not significant. The problem statement is silent on this, but I am assuming that upper/lower case is significant.

anagram ← {g←{{⍵[⍋⍵]}⍵~' '} ⋄ (g ⍺)≡(g ⍵)}

The function works by first converting each argument to a standard form, sorted and without spaces, using the local function g, and then comparing the standard forms. In g, the idiom {⍵[⍋⍵]} sorts a vector and the phrase ⍵~' ' removes spaces and finesses the problem of scalars.

A reasonable tacit solution depends on the over operator ⍥, contemplated but not implemented:

f⍥g ⍵ ←→ f g ⍵
⍺ f⍥g ⍵ ←→ (g ⍺) f (g ⍵)

A tacit solution written with over: ≡⍥((⊂∘⍋ ⌷ ⊢)∘(~∘' ')). I myself would prefer the semi-tacit ≡⍥{{⍵[⍋⍵]}⍵~' '}.

Sometimes I want an additional functionality in the IDE. (Are you a RIDE user? We’ll cover that too!) For example, the other day, I was tracing through some very long functions to find an error which was being caught by a trap. Since the error was being caught, I couldn’t just let the function run until it would suspend. Again and again, I would press too many times, causing the error to happen and be trapped, and thus having to start all over again.

I wish I could select a line and run until there, I thought. Sure, I could set a break-point there and then continue execution, but that would drop me into the session upon hitting the break-point, and then I’d have to trace back into the function, and remember to clear the break-point. A repetitive work-flow indeed.

Make it so!

Luckily, I know someone who loves doing repetitive tasks: ⎕PFKEY. This is what I needed done:

Toggle break-point (to set it)

Resume execution

Trace

Toggle break point (to clear it)

A quick look in Options > Configure… > Keyboard Shortcuts > Code revealed that the command codes for these are BP, RM, TC, and BP again, so I tried:

Keep it so!

Of course, I wouldn’t want to be bothered with setting this up in every session. So here’s a trick to set up F-keys (or anything else for that matter). When Dyalog APL starts up, it will look for MyUCMDs\setup.dyalog in your Documents folder ($HOME/MyUCMDs/setup.dyalog on non-Windows). If this file contains a function named Setup, it will be run whenever APL starts:

Cool, but how about the RIDE?

Right, the RIDE doesn’t support ⎕PFKEY. However, Edit > Preferences > Shortcuts lets you both find the relevant command codes and assign them to F-keys. Just put <BP><RM><TC><BP> (type or paste those sixteen characters, with angle brackets and everything — don’t press the keys they symbolise!) in the PF10 input field:

The RIDE saves these preferences for you. Note that you can’t assign F-keys in $HOME/MyUCMDs/setup.dyalog because ⎕PFKEY has no effect in the RIDE, but you can still use that file to initialise other things.

Taking it one step further…

After using this for a while, I realised that I often want to “step into” a specific line . That is, I found myself pressing and then – (the default keystroke for tracing). So I’ve assigned – the same sequence, but with an additional trailing TC action:

The chart above compares the performance of ∘.∧ in Dyalog versions 16.0 and 17.0. The improvement is a factor of 3-4 across the whole range (except at multiple-of-8 lengths, where 16.0 spikes up). That’s not easy to get, especially for a function like ∘.∧ which was already the target of a fair amount of special code. And the improvements are completely general. They apply to any scalar dyadic which has a Boolean result on Booleans, which is all but +-○.

How did it happen? The answer, at least for the parts of the graph left of 1024, follows. It takes us through a surprising number of APL primitives, which I think is a great demonstration of the ways APL thinking can lead to faster algorithms even in other languages.

Binary choices

One of the many performance bumps in Dyalog version 17 is that selecting from an array of two options using a Boolean array is much faster. Dyalog 17 also takes advantage of this speed improvement by using it for a few other functions: scalar dyadics when one argument is a singleton and the other is Boolean, ⍳ (or ∊) where the right (left) argument is Boolean, and outer products where the left argument is Boolean.

Selection has been sped up for all kinds of indexed array, but the most interesting case is that in which the cells of the indexed array are Boolean, with a cell shape that’s not a multiple of 8 (when working with Boolean algorithms, we tend to call such shapes “odd”). In this case the rows of the result aren’t byte-aligned, so to generate them quickly we will need some Boolean trickery. As it turns out, a lot of Boolean trickery.

The centre of our attention will be the function Replicate (/) when the right argument is Boolean and the left is a scalar. That means something like 3/1 0 0 1 0. Replicate doesn’t seem obviously connected to indexing, but like indexing, it turns each bit of the argument into multiple bits in the result. In fact, this is by far the most important and difficult step in implementing Index, since Index may be phrased straightforwardly in terms of Replicate using exclusive or (≠):

y[x;] ←→ (((≢y)/⍪x) ∧⍤1 ≠⌿y) ≠⍤1 ⊣⌿y

If for some reason that doesn’t look straightforward to you, I’ll explain how it all fits together later. But first, Replicate!

The trouble with Replicate

Just implementing Replicate on a Boolean array isn’t difficult. Read one bit at a time from the argument, and then write it to the result the required number of times. However, reading and writing one bit at a time is slow. A fast algorithm for Replicate will work an entire machine word at a time (that is, 64 bits on 64-bit systems).

But even fairly well-implemented word-at-a-time strategies will quickly run into another problem: branching. A simple if/then construct in C can be very cheap or very expensive depending on whether the CPU is able to predict in advance which path will be taken. With odd-sized replicates the not-quite-periodic way that the writes interact with byte boundaries will make nearly all useful branching unpredictable. But there are many techniques which can allow the CPU to make choices without having to change the path it takes through a program. Coming up is one rather exotic entry in the field of branchless programming.

Small expansion factors

When beginning work on fixed-size replicates, I quickly found that, in Dyalog version 16, it was much faster to add a length-1 axis before the first, replicate that, and transpose it to the end than to replicate the last axis directly.

This is largely because I had spent a lot of time improving Transpose with Boolean arguments of various shapes in version 16 (the comparison above looks very different in version 15). Transpose on a short but wide Boolean matrix uses a technique that I call bit interleaving, which would unfortunately require another blog post to explain. To transpose a matrix with r rows, I move across the rows, using some special tricks to insert (r-1) zeroes after each bit in each one, and combine them using bitwise or. To instead replicate a vector by r, I modified this to expand the one argument as though it were going to be combined with other rows, but instead of doing that, I combined it with itself by multiplying by the number whose base-2 representation consists of r ones. (C programmers may notice that multiplying by this number is the same as shifting to the left by r and then subtracting the original value. In fact, using multiplication is faster—x86 variable shifts are slow!)

Bit interleaving is a somewhat complicated algorithm, but it had all been done by the time I started on Replicate, and modifying it didn’t take long.

Large expansion factors, the obvious way

The bit-interleaving algorithm is very fast for small numbers and gets slower as the expansion factor increases. Above 64 it is no longer usable because it only writes one word at a time, while each bit should expand to more than a word. I chose to stop using it after 32, as it turns out we can get just about the same performance at 33 using a different approach.

The old approach, on the other hand, is only about half as fast at this expansion factor. Version 16’s algorithm reads from the right argument one bit at a time, then writes that bit the appropriate number of times to the result by writing one partial byte and then using memset (which may spill over into the next bit’s area since memset can only write whole bytes; it will be overwritten when we get to the next bit). This works fine for large enough left arguments, and scales up very well: memset is provided by the C standard library and will push bytes to memory as fast as possible. But for small values it is not so good: the call to memset branches a few times based on the unpredictable size of the write, which incurs a steep branch misprediction penalty.

There’s no easy way to get around the penalty for writing some bits at a variable offset. If the expansion factor isn’t a multiple of eight, then the number of bytes each expanded bit touches will always vary by one, and having to check whether to write that byte results in an unpredictable branch. There are ways to avoid a check like this (overwriting is one), but these tend to be fast only on a small range of expansion factors and to be very tricky to write and bug-prone, especially at the boundaries of the result.

A new idea

So the problem is that writing a weird number of bits at a time is expensive. But it’s possible to get away with only writing one bit for each “memset” we want to perform! The trick is found in a pair of inverse functions well-known for their usefulness in Boolean APL code. These are {2≠/0,⍵} and ≠\, and they are analogous to the pairwise difference {¯2-/0,⍵} and progressive sum +\. Dyalog 16 has fast code for ≠\ provided by Bob Bernecky, so I knew it was a good building block for other high-performance primitives.
The diagram above illustrates the relationships between scans and pairwise differences. In the bitwise world consisting only of even and odd, ≠ is like +. But each number is its own negative, so it’s also like -. More precisely, since ⍺≠⍵ is the same as 2|⍺+⍵ on Booleans (check for yourself!) and modulus distributes over addition when the final result has a modulus applied to it, ≠\ is equivalent to 2|+\. Similarly, {2≠/⍵} or {¯2≠/⍵} is the same as {2|¯2-/⍵}. The diagram includes the extra zero in {2≠/0,⍵} as part of the vectors in the top half of the diagram, but omits it from the functions used to label arrows to avoid clutter.

Let’s take a look at the pairwise difference after we replicate a vector:
It’s very sparse—the only one bits come at indices which are multiples of five. This is because expanding v multiplies the indices of its pairwise difference by five:

Of course ↑ isn’t particularly fast on these arguments, so we will have to write our own version, which sets the entire result to zero and then writes bits in the appropriate places. Implemented properly, this goes pretty quick, the only substantial costs being a variable shift and one byte-sized write per element on the right. It’s completely branchless. Follow up with the fast ≠\ algorithm and we have a pretty good solution.

This solution performs more writes than it has to, since it writes each difference bit even if it is zero. But trying to get rid of the extra writes is a losing battle. Checking the value before each write takes much longer than just writing it, and even a fancy strategy, like using the count-leading-zeros instruction to quickly skip over all zeros before each one, will be slower for arguments without many zeros in the difference vector. But there is another way to make this algorithm a little faster, coming up after we tie up some loose ends…

Back to Index, and more xor

What about my unexplained formula for indexing? The expression

y[x;] ←→ (((≢y)/⍪x) ∧⍤1 ≠⌿y) ≠⍤1 ⊣⌿y

also relies on insight about ≠, but in a completely different context. Let’s simplify it a bit to see what’s going on. Suppose we have two Boolean arrays b and c of the same shape, and we want to use the bit a to select one of them: we want an expression that returns b if a=0 and c if a=1. For real numbers, we might use a linear combination:

b + a×(c-b)

If a=0, then the c-b term is removed and we get b. If a=1, then it stays and we have (b+(c-b)) = c.
This formula is also valid for Boolean arrays, of course, but introducing an integer c-b would be a poor choice from a performance standpoint. Instead, we can use the insight that ≠ performs the function of both + and - in the bitwise world, and transliterate our formula into

b ≠ a∧(c≠b)

Again, if a=0, then the c≠b is zeroed out, and we end up with b≠0, which is the same as b. If a=1, then it stays, leaving (b≠(c≠b)), which gives c.

If y is a shape 2 m array y←b,[¯0.5]c, then we have b ≡ ⊣⌿y and (c≠b) ≡ ≠⌿y. So b ≠ a∧(c≠b) is, rearranging some, the same as (a ∧ ≠⌿y) ≠ ⊣⌿y. But there is an implicit scalar extension in the ∧ function; we can make this explicit by writing ((m/a) ∧ ≠⌿y) ≠ ⊣⌿y. Now it’s clear that to select using each bit of a vector x, we need to turn each bit into a row with length m, and then apply the ∧ and ≠ functions on rows, giving the formula I showed earlier.

How about Outer Product? That one’s easier, with no Boolean magic required. For vector arguments, ⍺ ∘.f ⍵ is identical to ((≢⍵)/⍪⍺) f ((≢⍺),≢⍵)⍴⍵, and the same computation works for general arrays if we use the ravel length ×/∘⍴ instead of the count ≢. When the right argument has more than around 1024 bits, replicate becomes slower than the old way of just pairing each bit of the left argument with the whole of the right, so we use that method instead.

Speeding up ≠\

We left off with an expansion algorithm whose cost is dominated by the time to compute ≠\ on a vector. That computation is based on a fast method for ≠\ on a single machine word. It moves along the argument vector (which is the same as the result vector in this case) one word at a time, keeping track of the most recent bit of the result. Then each word of the result is obtained from the fast xor-scan of that word in the argument, xor-ed with the carried bit. For arbitrary input vectors, this is the best algorithm. But the word-length ≠\ computation is slowing us down, and it turns out we can do a lot better.

Writing a single bit has the same cost as writing an entire aligned machine word. So writing one bit at a time, while much better than writing a variable number of bits with arbitrary alignment, represents some wasted effort. Can we make the impending ≠\ computation easier by using full-word writes?

A definite “yes” for that question. In fact, we can eliminate all of the intra-word computation. Consider the effect of a bit that we write on the final result after applying ≠\. That bit will be spread out over all of the subsequent bits, flipping each one. In order to produce this effect on a single word (leaving the task of flipping subsequent words for later), we can thus xor an aligned word with a word which is zero before that bit and one at that bit and later. We can get this word just by shifting the all-ones word right by an appropriate amount. And then we can omit the expensive word-size ≠\ computation entirely, so that to perform the final pass we simply repeatedly xor the next word with the current one’s last bit. This greatly reduces the cost of the final pass at very little expense to the pass before it (xor-ing rather than setting a word requires reading that word’s value first, so it’s not quite free).

To summarize, there are three passes:

Set all bits of the result to zero.

For each bit in the right argument, xor the appropriate word in the result with a word which is all ones after the place where that bit starts in the result.

Iterate across words of the result, xor-ing each with the last bit of the previous word (after finishing computation on that word).

For expansion factors between 32 and 256, this is the fastest method I know of. Below 32, interleaving is faster, and above 256 simply using memset will do the trick.

The xor-based algorithm also works for Replicate or Expand when the left argument is a vector, and in Dyalog 17 it is used for both of these whenever the average expansion factor, which is equal to the ratio of result length to argument length, is 256 or less.

The end result

The graph above shows the gains in scalar-Boolean replicate between versions 16.0 and 17.0. Note the huge scale of the vertical axis—replicate is just short of a hundred times faster at the far left! Version 17’s Replicate is split into three parts, with the performance of each individual algorithm shown under the overall result. From left to right, these are the algorithm with bit interleaving, the xor-based algorithm described in the last section, and an algorithm with memset. The last of these is the same basic strategy used by version 16, but there are some improvements in version 17 that make it about 45% faster for short arguments.

The seams between these algorithms are not quite perfect on my machine, but they are not off by enough to damage performance significantly. The cutoffs favor the middle, xor-based algorithm a little. I think this is appropriate because that algorithm is likely to have consistent performance even on older machines and different architectures. In contrast, bit interleaving uses the BMI2 instruction set (available on x86 processors since 2013) and is slower if it is not present, and memset may perform differently with other implementations of the standard library or depending on available vector instruction sets. Our xor-based algorithm uses only very widely available instructions, and one of the costs—the variable shift used on the all-ones word—is actually much faster on other architectures because of a design flaw in x86. An excellent addition to Dyalog APL!

Motivation

I have been working on the Dyalog APL quicksort implementation. The following programming puzzle arose in the process of doing the QA for this work.

⍵ is a simple array. Write a function sorted, without using ⍋ or ⍒, such that sorted ⍵ is 1 if ⍵ is sorted in ascending order and 0 otherwise.

The point about not using grade is that this is supposed to be an independent check that grade is correct (remains correct) after the new work.

Real Vectors

The simplest case is when ⍵ is a numeric vector. If furthermore ⍵ are not complex numbers (a case addressed later), then

∧/ 2≤/⍵

each item being less than or equal to the next one, checks that ⍵ is sorted. Since ⍋ uses exact comparisons, here we must set ⎕ct←⎕dct←0. Morever, in order that decimal floating-point numbers (DECFs) be compared correctly, here ⎕fr←1287.

Real Arrays

More generally, when ⍵ is a non-complex numeric matrix, we must check that each row precedes or is equal to the next row. If c and d are consecutive rows, then corresponding items are compared and at the first item where they differ, c[i] must be less than d[i].

~ 0 ∊ (2>⌿⍪⍵) ⍲ <\ 2≠⌿⍪⍵

The expression incorporates two refinements:

If ⍵ is not a matrix, first apply ⍪⍵.

Instead of checking c[i] is less than d[i], check that c[i] is not greater than d[i]. This finesses the case where c≡d and there is no first item where they differ; that is, the case where <\2≠⌿⍪⍵ is all 0s for that row.

<\on a boolean vector has 0s after the first 1, (and is all 0 if there are no 1s). Therefore, <\2≠⌿⍪⍵ finds the first item (if any) where one cell differs from the next cell, and that item must not be greater than the corresponding item in the next cell.

(Muse: since x above are random numbers, there is a possibility that it is sorted and the first test above can be 1. But: if each elementary particle in the visible universe were a computer and every nanosecond each of them creates a random matrix and tests it for sortedness as above, running from the beginning of the time to the end of our lives, it is still a very safe bet that no 1 would result.)

For integer arrays, there is an alternative of using the signs of the arithmetic difference between successive cells:

{~ 0 ∊ 1≠t×<\0≠t← × 2-⌿⍪⍵} x[⍋x;]
1

(The sign where consecutive cells first differ must not be 1.) However, computing the difference on floating point numbers can founder on overflow:

Complex Numbers

Two complex numbers are ordered first by the real parts and then by the imaginary parts. (This is part of the TAO extension implemented in Dyalog APL version 17.0.) Therefore, a complex array can be tested for sortedness by testing an equivalent real array with each number replaced by their real and imaginary parts, thus:

(¯1⌽⍳1+⍴⍴⍵) ⍉ 9 11∘.○⍵
↑9 11∘○¨⍵
9 11○⍤1 0⊢⍵

Although the second expression is the shortest, it is less efficient in time, space, and number of getspace calls. The last expression is favored for its brevity and performance.

The number of getspace is a worthwhile measure. Part of the QA process is a rather stringent procedure called the “Shuffle QA”. The entire Shuffle QA takes several weeks to run and its running time is directly related to the number of getspace.

Character Arrays

None of the functions < ≤ ≥ > - × are permitted on characters. This is solved by application of ⎕ucs, converting characters to integers while preserving the ordering.

Putting It All Together

Other Considerations

That ⍵⌷⍨⊂⍋⍵ is sorted is a necessary but not sufficient condition that ⍋⍵ is correct. For example, an “adversary” can supply the following results for ⍋⍵ so that ⍵⌷⍨⊂⍋⍵ is sorted:

?≢⍵
(≢⍵)⍴?≢⍵
¯1↓⍋⍵
∊ i {⊂⍵[?⍨≢⍵]}⌸⍨ ⍵⌷⍨⊂i←⍋⍵

The last expression randomly permutes the grade indices of equal cells, a result which violates the requirement that grade indices of equal cells are in ascending order. That is, grade must be stable.

In Dyalog APL version 17.0, grade has been extended to work on non-simple arrays, the much discussed TAO, total array ordering. Checking that a non-simple array is sorted without using grade requires facilities discussed in the paper TAO Axioms and is beyond the scope of this note.