Chapter 18: Sets, Classes and Relations

In this chapter we look at more of the built-in functions of J.
The connecting theme
is, somewhat loosely, working with sets, classes and relations.

Suppose that, for some list, for the purpose at hand,
the order of the items is irrelevant and the presence of duplicate items
is irrelevant.
Then we can regard the list as (representing) a finite set.
In the abstract, the set 3 1 2 1 is considered to be the same set as
1 2 3.

The word "class" we will use in the sense in which, for example,
each integer in a list belongs either to
the odd class or to the even class.

By "relation" is meant
a table of two or more columns,
expressing a relationship between a value in one column
and the corresponding value in another.
A relation with two columns, for example, is a set of pairs.

18.1 Sets

18.1.1 Membership

There is a built-in verb e. (lowercase e dot, called "Member").
The expression x e. y tests whether
x matches any item of y, that is, whether x is a member of the
list y.
For example:

y=: 'abcde'

'a' e. y

'w' e. y

'ef' e. y

abcde

1

0

1 0

Evidently the order of items in y is irrelevant and so is the presence
of duplicates in y.

z=: 'edcbad'

'a' e. z

'w' e. z

'ef' e. z

edcbad

1

0

1 0

We can test whether a table contains a particular row:

t =: 4 2 $ 'abcdef'

'cd' e. t

ab
cd
ef
ab

1

18.1.2 Less

There is a built-in verb -. (minus dot, called "Less").
The expression x -. y produces a list of the items of
x except those which are members of y.

x =: 'consonant'

y =: 'aeiou'

x -. y

consonant

aeiou

cnsnnt

Evidently the order of items in y is irrelevant and so is the presence of
duplicates in y.

18.1.3 Nub

There is a built-in verb ~. (tilde dot, called "Nub"). The expression
~. y produces
a list of the items of y without duplicates.

nub =: ~.

y =: 'hook'

nub y

~.

hook

hok

We can apply nub to the rows of a table:

t

nub t

ab
cd
ef
ab

ab
cd
ef

18.1.4 Nub Sieve

The verb "nub sieve" (~:) gives a
boolean vector which is true only at
the first occurrence of each distinct item,
that is, at the items which make up the nub.

y

b =: ~: y

b # y

nub y

hook

1 1 0 1

hok

hok

18.1.5 Functions for Sets

The customary functions on sets, such as set-union,
set-intersection or set-equality, are easily defined
using the built-in functions available. For example
two sets are equal
if all members of one are members of the other,
and vice versa.

seteq =: *./ @: (e. , e.~)

1 2 3 seteq 3 1 2 1

1 2 3 seteq 1 2

1

0
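
Set-union and set-intersection can be sketched in the same spirit.
The names union and inter below are illustrative choices, not built-in verbs,
and these definitions are only one possibility among several.

```j
NB. union: join the two lists, then remove duplicates
union =: ~. @ ,
NB. inter: the items of x which are also members of y, without duplicates
inter =: ~. @ ([ -. -.)

1 2 3 union 3 4 5      NB. 1 2 3 4 5
1 2 3 inter 3 1 5      NB. 1 3
```

The intersection is written with Less applied twice: x -. y removes from x
the members of y, and removing those leftovers from x leaves the common items.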

18.2 The Table Adverb

Recall that the adverb / generates a verb; for example +/ is a verb
which sums lists.
More precisely, it is the monadic case of +/ which sums lists.
The dyadic case of +/ generates a table:

x =: 0 1 2

y =: 3 4 5 6

z =: x +/ y

0 1 2

3 4 5 6

3 4 5 6
4 5 6 7
5 6 7 8

The general scheme is that if we have

z =: x f/ y

then z is a table such that the value at row i column j
is given by applying f dyadically
to the pair of arguments i{x
and j{y. That is,
z contains all possible pairings
of an item of x with an item of y.
Here is another example:

x =: 'abc'

y =: 'face'

x =/ y

abc

face

0 1 0 0
0 0 0 0
0 0 1 0

The result shows, in the first row,
the value of 'a' = 'face', in the second row
the value of 'b' = 'face'
and so on.
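
As a further illustration of the general scheme, here is a small
multiplication table, with a check that one entry agrees with the
product of the corresponding items. The values chosen for x and y are arbitrary.

```j
x =: 1 2 3
y =: 1 2 3 4
z =: x */ y            NB. a 3 by 4 table of products

NB. the value at row 1, column 2 of z ...
(<1 2) { z             NB. 6
NB. ... is given by applying * to 1{x and 2{y
(1 { x) * (2 { y)      NB. 6
```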

18.3 Classes

18.3.1 Self-Classify

Consider the problem of finding the counts
of letters occurring in a string (the
frequency-distribution of letters).
Here is one approach.

We form a table testing each letter for equality
with the nub.

y =: 'hook'

nub y

(nub y) =/ y

hook

hok

1 0 0 0
0 1 1 0
0 0 0 1

The expression ((nub y) =/ y) can be abbreviated as (= y).
The monadic case of the built-in verb = is called
"Self-classify".

y

nub y

(nub y) =/ y

= y

hook

hok

1 0 0 0
0 1 1 0
0 0 0 1

1 0 0 0
0 1 1 0
0 0 0 1

If we sum each row of = y we obtain
the counts, in the order of the letters in the nub.

y

= y

+/ " 1 =y

hook

1 0 0 0
0 1 1 0
0 0 0 1

1 2 1

The counts can be paired with the letters of the
nub:

y

nub y

(nub y) ;" 0 (+/ " 1 =y)

hook

hok

+-+-+
|h|1|
+-+-+
|o|2|
+-+-+
|k|1|
+-+-+

18.3.2 Classification Schemes

Gardeners classify soil-types as acid, neutral or
alkaline, depending on the pH value. Suppose
that a pH less than 6 is classed as acid,
6 to 7 is neutral, and more than 7 as alkaline.
Here now is a verb to classify a pH value,
returning A for acid, N for neutral and L for alkaline (or limy).

classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))

classify 6

classify 4.8 5.1 6 7 7.1 8

N

AANNLL

The classify function we can regard as defining a
classification scheme.
The letters ANL, which are in effect
names of classes, are called the keys of the scheme.

18.3.3 The Key Adverb

Given some data (a list, say), we can classify each item
to produce a list of corresponding keys.

data =: 7 5 6 4 8

k =: classify data

7 5 6 4 8

NANAL

We can select and group together all the data in,
say, class A (all the data with key A):

data

k

k = 'A'

(k = 'A') # data

7 5 6 4 8

NANAL

0 1 0 1 0

5 4

Now suppose we wish to count the items in each class.
That is, we aim to apply the monadic verb # separately
to each group of items all of the same key.
To do this we can use the built-in adverb /. (slash dot, called "Key").

data

k =: classify data

k # /. data

7 5 6 4 8

NANAL

2 2 1

For another example, instead of counting the members
we could exhibit the members, by applying the box verb <.

data

k =: classify data

k < /. data

7 5 6 4 8

NANAL

+---+---+-+
|7 6|5 4|8|
+---+---+-+

The verb we apply can discover for itself the class of
each separate argument,
by classifying the first member.
Here the verb u
produces a boxed list of the key and the count:

u =: (classify @: {.) ; #

data

k =: classify data

k u /. data

7 5 6 4 8

NANAL

+-+-+
|N|2|
+-+-+
|A|2|
+-+-+
|L|1|
+-+-+

The general scheme for the "Key" adverb is as follows.
In the expression x u /. y, we take y to be a list,
and x is a list of keys of corresponding items of y according
to some classification scheme,
and u is the verb to be applied separately to each class.
The scheme is:

x u /. y means (= x) (u @ #) y

To illustrate:

y =: 4 5 6 7 8
x =: classify y
u =: <

y

x

= x

(= x) (u @ #) y

x u /. y

4 5 6 7 8

AANNL

1 1 0 0 0
0 0 1 1 0
0 0 0 0 1

+---+---+-+
|4 5|6 7|8|
+---+---+-+

+---+---+-+
|4 5|6 7|8|
+---+---+-+

We see that each row of =x selects items from y,
and u is applied to this selection.
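
Any verb which accepts a list can be applied in this way.
As a sketch, using the earlier data and its keys, here are the total
and the mean of each class; classify is repeated so that the lines can be run on their own.

```j
classify =: ({ & 'ANL') @: ((>: & 6) + (> & 7))
data =: 7 5 6 4 8
k =: classify data      NB. NANAL

k +/ /. data            NB. 13 9 8     totals for classes N, A, L
k (+/ % #) /. data      NB. 6.5 4.5 8  means for classes N, A, L
```

The results appear in the order in which the keys first occur in k.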

18.3.4 Letter-Counts Revisited

Recall the example of finding the counts of letters in a string.

y =: 'LETTUCE'

= y

(nub y) ; " 0 +/ "1 (= y)

LETTUCE

1 0 0 0 0 0 0
0 1 0 0 0 0 1
0 0 1 1 0 0 0
0 0 0 0 1 0 0
0 0 0 0 0 1 0

+-+-+
|L|1|
+-+-+
|E|2|
+-+-+
|T|2|
+-+-+
|U|1|
+-+-+
|C|1|
+-+-+

Here is a variation. We note that
we have in effect a classification scheme where
we have as many different classes as different letters: each letter
is (the key of) its own class. Thus we can write an
expression of the form y u /. y.

The applied verb u will see, each time, a list of letters, all the same.
It counts them, with #,
and takes the first, with {., to be a label for the class.

u =: {. ; #

y

= y

y u /. y

LETTUCE

1 0 0 0 0 0 0
0 1 0 0 0 0 1
0 0 1 1 0 0 0
0 0 0 0 1 0 0
0 0 0 0 0 1 0

+-+-+
|L|1|
+-+-+
|E|2|
+-+-+
|T|2|
+-+-+
|U|1|
+-+-+
|C|1|
+-+-+

18.4 Relations

Suppose there are a number of publications,
such as:

"Pigs" by Smith, on the subject of pigs

"Pets" by Brown, on cats and dogs

"Dogs" by Smith and James, on dogs

and we aim to catalog such publications.
A suitable data structure for such a catalog might be a table
relating authors to titles and another
table relating
titles to subjects. For example:

author    title
Smith     "Pigs"
Brown     "Pets"
Smith     "Dogs"
James     "Dogs"

title     subject
"Pigs"    pigs
"Pets"    dogs
"Pets"    cats
"Dogs"    dogs

Such tables we may call "relations".
The order of the rows is not significant.
Here, for the sake of simplicity, we will stick to
relations with two columns.

Now we choose a representation for our relations.
For a first approach, we choose tables of boxed strings.
The authors-titles relation is:
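
In J the two tables might be entered as follows. This is a minimal sketch:
the names AT and TS are those used in the next section, building each table
by reshaping a list of boxed strings is only one way of doing it,
and the rows of TS are given in the order of the symbol display later in the chapter
(the order of rows is not significant).

```j
NB. each row is a pair of boxed strings
AT =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
TS =: 4 2 $ 'Pigs';'pigs';'Pets';'cats';'Pets';'dogs';'Dogs';'dogs'

$ AT                   NB. 4 2
{. AT                  NB. the first row: the boxed strings Smith and Pigs
```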

18.4.1 Join of Relations

From the authors-titles relation AT and the titles-subjects
relation TS we can compute an authors-subjects relation
showing which author has written a title on which subject.
We say that AT and TS are to be joined with respect to
titles, and we would expect the join to look like this:

author    subject
Smith     pigs
Brown     cats
Brown     dogs
Smith     dogs
James     dogs

The plan for this section is to look at a function for
computing joins, then at an improved version,
and then at the advantage of representing relations
as tables of symbols rather than boxed strings.
Finally we look at some performance comparisons.

A method is as follows.
We consider all possible pairs consisting of a row
at from table AT and a row ts from table TS.
Each pair at,ts is of the form:

author; title; title; subject

If title matches title, that is, item 1 matches item 2,
then we extract author and subject, that is, items 0 and 3.
Verbs for testing and extracting from at,ts pairs
can be written as:

test =: 1&{ = 2&{
extr =: 0 3 & {

and these verbs can be plugged into a suitable
conjunction to do the pairing. In writing this conjunction,
we aim to avoid requiring the whole set of possible pairs
to be present at the same time, since this set may be large.
We also aim to avoid any duplicates in the result.
Here is a first attempt.
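
One way to write such a conjunction is sketched below, as an explicit
definition with a double loop. The names PAIR and join are those used in the text;
the body of PAIR is an assumption about how the method just described
might be coded, not necessarily the only or best way.
The relations are repeated so that the sketch can be run on its own.

```j
AT =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
TS =: 4 2 $ 'Pigs';'pigs';'Pets';'cats';'Pets';'dogs';'Dogs';'dogs'

test =: 1&{ = 2&{      NB. does title match title?
extr =: 0 3 & {        NB. author and subject

NB. x (u PAIR v) y : pair each row of x with each row of y,
NB. and keep v of the pair whenever u of the pair is true
PAIR =: 2 : 0
:
z =. 0 2 $ a:                     NB. empty result table of two columns
for_at. x do.
  for_ts. y do.
    if. u at , ts do. z =. z , v at , ts end.
  end.
end.
~. z                              NB. remove duplicate rows
)

join =: test PAIR extr

AT join TS
```

Only one at,ts pair exists at any time, and the nub at the end removes duplicates.
For the relations above the result should have the five rows of the
expected authors-subjects relation.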

The join verb as defined above is slow, because
the test and extr verbs are applied to a single x,y pair at
a time - they are scalar computations. Performance will
be better if we can give these verbs as much data as
possible to work on at one time. (This is a universal rule in J.)
Vector or array arguments are better. Here is a revised
vector-oriented version of PAIR and join, which still
avoids building the entire set of pairs.
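
A sketch of such a version follows. The name vjoin is the one used later
in the text; the name VPAIR and the body of the definition are assumptions.
Each row of x is paired with all the rows of y at once, so that the
test and extr verbs each see a whole table rather than a single pair.

```j
AT =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'
TS =: 4 2 $ 'Pigs';'pigs';'Pets';'cats';'Pets';'dogs';'Dogs';'dogs'

test =: 1&{ = 2&{
extr =: 0 3 & {

VPAIR =: 2 : 0
:
z =. 0 2 $ a:
for_at. x do.
  w =. |: at ,"1 y                NB. 4 rows: author, title, title, subject
  z =. z , |: v (u w) #"1 w       NB. keep the matching columns, then extract
end.
~. z
)

vjoin =: test VPAIR extr

AT vjoin TS
```

The table w is transposed so that 1&{ and 2&{ select whole lists of titles,
which test then compares in a single step.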

Representing relations as tables of boxed strings,
as above, is less than efficient.
For a repeated value, the entire string is repeated.
Values are compared by comparing entire strings.

Now we look at another possibility. Rather than boxed strings,
a relation can be represented by a table of
symbols.

18.4.2 What are Symbols?

Symbols are for efficient computation with string data.
Symbols are a distinct data-type, in the same way
that characters, boxes and numbers are distinct data-types.
A symbol is a scalar which identifies, or refers to,
a string.

A symbol can be created by applying the built-in verb
s: (lowercase s colon) to a boxed string.

a =: s: <'hello'

Now the variable a has a value of type symbol. We inspect
this value in the usual way:

a
`hello

and see that the value is displayed
as the original string preceded by a left-quote.
Even though a looks like a string when displayed,
it is a scalar.

a

$ a

# $ a

`hello

0

The original string is stored in a data-structure,
maintained automatically by the J system, called the
symbol-table.
Strings are not duplicated within the symbol-table.
Hence if another symbol b is created from the same string as a,
then b is equal to a.

a

b =: s: <'hello'

b = a

`hello

`hello

1

Notice that the comparison is simple scalar equality, with
no need to compare the original strings.
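
The same verb s: applied to a list of boxed strings gives a list of symbols.
As a small sketch (the name syms is chosen for this illustration), the lines
below also use the dyadic form 5 s: which, as far as we are aware,
converts symbols back to boxed strings.

```j
syms =: s: 'Smith';'Brown';'Smith'   NB. a list of three symbols
$ syms                               NB. 3
(0 { syms) = 2 { syms                NB. 1 : same string, same symbol
5 s: syms                            NB. back to a list of boxed strings
```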

Our relations above can be converted to arrays of
symbols, and joined as before.

sAT =: s: AT

sTS =: s: TS

sAT vjoin sTS

`Smith `Pigs
`Brown `Pets
`Smith `Dogs
`James `Dogs

`Pigs `pigs
`Pets `cats
`Pets `dogs
`Dogs `dogs

`Smith `pigs
`Brown `cats
`Brown `dogs
`Smith `dogs
`James `dogs

Symbols are lexicographically ordered to reflect the ordering
of the original strings. Hence
tables of symbols can be sorted:

sAT

/:~ sAT

`Smith `Pigs
`Brown `Pets
`Smith `Dogs
`James `Dogs

`Brown `Pets
`James `Dogs
`Smith `Dogs
`Smith `Pigs

18.4.3 Measurements Compared

Here is a utility verb giving time in seconds
to evaluate an expression,
averaged over say 4 executions.

time =: (8j5 & ":) @: (4 & (6!:2))

The examples of relations above are too small
for meaningful performance measurements,
so we make larger relations
by replicating each say 100 times.
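
The following sketch shows how such a measurement might be set up.
It is an illustration only: it times sorting rather than joining, the names
bigAT and bigsAT are assumptions, and no timings are quoted because they
depend on the machine and the version of J.

```j
time =: (8j5 & ":) @: (4 & (6!:2))

AT =: 4 2 $ 'Smith';'Pigs';'Brown';'Pets';'Smith';'Dogs';'James';'Dogs'

bigAT  =: 100 # AT        NB. each row replicated 100 times: 400 rows
bigsAT =: s: bigAT        NB. the same relation as a table of symbols

time '/:~ bigAT'          NB. sorting the rows of boxed strings
time '/:~ bigsAT'         NB. sorting the rows of symbols
```

The argument of time is a string containing the expression to be evaluated.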

18.4.4 Saving and Restoring the Symbol Table

Suppose that data is an array of symbols.
For a symbol in data, the original string ('hello' for example) is stored
only in the symbol table, not in data itself. The original string is needed
to display the value of the symbol.

Suppose that we write data to a file, aiming to
read it back in a new session. At the beginning of a
new session, the symbol table is empty.
Thus we must save the symbol table from the earlier
session, and reinstate it at the beginning of the new session.

First, here are two utility functions to save a value to a file
and retrieve it. (See Chapter 27 and
Chapter 28 for more about data in files.)

save =: 4 : '(3!:1 x ) 1!:2 < y '
retr =: 3 : '3!:2 (1!:1 < y )'

Save the data to a file named, say, data.xyz:

data save 'data.xyz'

The symbol table is not itself a variable,
but the expression 0 s: 10 gives a value for it.
We save this value to a file named, say, symtab.xyz:

(0 s: 10) save 'symtab.xyz'

Start a new J session. The symbol table is initially empty, so begin by
reinstating it from the file saved in the earlier session:

10 s: (retr 'symtab.xyz')
1

Now, with the correct symbol table in place,
we can retrieve the array of symbols data from its file: