Conventional and Inverted Grouping of Codes for Chemical Data

EUGENE MILLER, DELBERT BALLARD, JOHN KINGSTON, and MORTIMER TAUBE

In an earlier paper (1) a theoretical demonstration of the basic similarity of all storage and retrieval devices was attempted. The ability to describe storage and retrieval systems in terms of a common set of variables with overall cost, measured in capital investment and human time, as the major distinguishing characteristic, was one of the first results of the general theory:

One of the first insights we gain from the attempt to formulate a generalized storage and retrieval theory which is analogous to communication theory is that the theory must be invariant with respect to the physical form of diverse systems. Edge-punched cards, interior-punched cards, Microcards, Uniterm cards, Batten cards, magnetic tapes, or even conventional alphabetical or classed catalogs may differ from one another in cost, ease of storage, ease of retrieval, size, complexity, etc., but they are alike with respect to basic potentialities for handling different types of literature searches. For equal amounts of coding space all systems can enter an equal amount of information and for an equal number of needles, reading heads or electronic circuitry, all physical systems deliver the identical product for any search (2).

This general conclusion was further elaborated in another paper (3); but, until now, it has not been possible to present an empirical and concrete demonstration of the validity of this theoretical approach. Such a demonstration would consist of a cost comparison of two different systems, each of which contained identical stores of information and delivered the same response to the same question. If two operating systems are compared, it is usually difficult to insure that the indexing is equally adequate or that background and peripheral service functions of the systems do not affect their operation and cost.

"Conventional and Inverted Grouping of Codes for Chemical Data."
Proceedings of the International Conference on Scientific Information -- Two Volumes.
Washington, DC: The National Academies Press, 1959.

EUGENE MILLER, DELBERT BALLARD, JOHN KINGSTON, and MORTIMER TAUBE Documentation, Inc., Washington, D.C.

In the “Conference Plans and Criteria for Papers” issued by the Planning Committee of the International Conference on Scientific Information, this problem was recognized and it was suggested that “It appears that persons and groups who are engaged in the development of systems embodying new principles for organizing subject matter could effectively compare their systems by applying them to a common set of documents.”
It is not always easy to isolate a set of documents; and it is still more difficult to insure the same competence of indexers to perform the analyses. It is fortunate, therefore, that a report of a recent innovation by the Office of Research and Development of the Patent Office (4) describes an isolated group of patents indexed in a manner which makes it practicable to set up the same information (the same group of patents) in another storage and retrieval system and obtain a direct comparison of costs for the two systems.
The systems to be compared are a Matrex system using inverted code groups and an International Business Machines (IBM) punched card system using conventional code groups. Before proceeding with the comparison, some preliminary remarks about devices are in order.
The Matrex device, Fig. 1, is a species of the Batten system, like the Peek-a-boo system, using superimposed aperture cards and light penetration to detect the matching of codes. It differs from most Batten systems in its use of a drill for multiple entering rather than a punch which enters codes one card at a time.
FIGURE 1.
The Patent Office system utilizes a standard IBM keypunch and a Census Bureau multicolumn sorter, which is a modified IBM 101 electronic statistical machine equipped with dial switches (instead of a plug board) for setting up the search elements. In this paper, we shall consider this machine roughly equivalent to the IBM 101.
A summary of the Patent Office system
The Patent Office system covers 2350 steroid patents, which include 900 duplicates or cross-references, leaving 1450 patents to be coded. A steroid compound, the subject of a steroid patent, is a complicated chemical structure. The basic nucleus of a steroid has 22 points at which different chemical groups or substituents can be attached. According to the Patent Office chemists, a particular steroid compound is describable by locating any one of 24 descriptors at any one of 22 positions of the steroid nucleus. Figure 2 (5) gives both the names of the descriptors and a table of positions. The compound illustrated has a “=” at the 4th, 9th, and 17th positions; α or allo at the 1st, 3rd, 5th, 11th, and 17th positions; COOH or COOR at the 20th position; etc. In addition to the 24 descriptors requiring position designations, 60 descriptors are given which characterize a steroid without reference to any position. These are indicated in Fig. 3 (6). According to the Patent Office Report:
The punched card is roughly divided into two sections. Columns 1 to 48 are reserved for recording the descriptors for substituents associated with the steroid nucleus at a designated location thereof. The second portion, columns 60 to 69 is reserved for general descriptors not identifiable with any particular position on the nucleus (7).
The first 48 columns are divided into 24 two-column fields for the 24 substituents, and each two-column field provides 22 positions (with 2 left over) for the positions at which the substituents are associated with the nucleus. Although there are two unused holes in each of the 24 fields, no use is made of them for indicating the general presence of a particular substituent without reference to position. Apparently this information is not used in coding or in a search.
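The field layout just described can be sketched in a few lines. This is only an illustration: the 12 punch rows per column are those of a standard 80-column card, but the assignment of nucleus positions to particular holes within a field is an assumption made here (positions 1 to 12 in the first column, 13 to 22 in the second), since the report excerpt gives only the column ranges. The function names are hypothetical.

```python
# Sketch of the card layout described above. The hole-to-position mapping
# inside a field is an ASSUMPTION; the source gives only the column ranges.
ROWS_PER_COLUMN = 12   # a standard 80-column card has 12 punch rows per column
N_DESCRIPTORS = 24     # positional descriptors, one two-column field each
N_POSITIONS = 22       # attachment positions on the steroid nucleus


def field_columns(descriptor: int) -> tuple[int, int]:
    """Return the two card columns (1-based) that form a descriptor's field."""
    if not 1 <= descriptor <= N_DESCRIPTORS:
        raise ValueError("descriptor must be in 1..24")
    first = 2 * descriptor - 1
    return first, first + 1


def hole_for(descriptor: int, position: int) -> tuple[int, int]:
    """Return (column, row index 0..11) for a descriptor at a nucleus position.

    Assumed mapping: positions 1-12 fill the first column of the field,
    positions 13-22 the second column.
    """
    if not 1 <= position <= N_POSITIONS:
        raise ValueError("position must be in 1..22")
    first, second = field_columns(descriptor)
    index = position - 1
    column = first if index < ROWS_PER_COLUMN else second
    return column, index % ROWS_PER_COLUMN


holes_per_field = 2 * ROWS_PER_COLUMN
print(field_columns(1))               # (1, 2)
print(field_columns(24))              # (47, 48)
print(holes_per_field - N_POSITIONS)  # 2 unused holes per field
print(hole_for(1, 17))                # (2, 4)
```

Under this layout the 24 fields occupy columns 1 to 48, and each field has 24 holes of which 22 are used, matching the counts given in the text.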
Figure 4 (8) illustrates the punched card corresponding to the compound shown in Fig. 2. The card can be read quite easily. Note that “=” (columns 1 and 2) has been punched in the 4th, 9th, and 17th position, etc.
Since this paper is concerned with the cost of entering and retrieving a given quantity of information from two storage and retrieval systems, it is assumed that the indexing of the patent, the determination of suitable descriptors, etc., are entirely adequate. The problem, then, is reduced to entering the descriptors in a Matrex system and searching the system by any descriptor or combination of them.

FIGURE 2.
FIGURE 3.
FIGURE 4.
The Matrex device
In the IBM punched card system, each card represents an item (a compound or patent) on which descriptor codes are punched; it is thus an instance of conventional code grouping. In a Matrex system, by contrast, each card represents a descriptor, on which are recorded the codes for the compounds or patents characterized by that descriptor. That is to say, a Matrex system, like the Peek-a-boo system and many similar systems, uses inverted rather than conventional grouping of codes.
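The difference between the two groupings can be shown with a small sketch. The item numbers and descriptor labels below are illustrative only and are not taken from the steroid file; `invert` is a hypothetical helper name.

```python
from collections import defaultdict

# Conventional grouping: one record per item, listing its descriptors.
# Item numbers and descriptor labels are illustrative, not from the source.
conventional = {
    1: {"= @4", "= @9", "alpha @1", "COOH @20"},
    2: {"= @4", "alpha @1"},
    3: {"COOH @20"},
}


def invert(records):
    """Regroup item -> descriptors as descriptor -> items."""
    inverted = defaultdict(set)
    for item, descriptors in records.items():
        for descriptor in descriptors:
            inverted[descriptor].add(item)
    return dict(inverted)


inverted = invert(conventional)
print(sorted(inverted["= @4"]))      # [1, 2]
print(sorted(inverted["COOH @20"]))  # [1, 3]
```

Both structures hold exactly the same item and descriptor pairs; only the key by which they are grouped changes.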
For this experiment a Matrex card having room for 10,000 compounds or patents was chosen (Fig. 5). It will be noted that the illustrated card has a numbered tab, in this instance 19 (9 in the 2nd decade). Since the indexing system requires 24 descriptors × 22 positions (528 Matrex cards) plus 60 cards for general descriptors, 588 cards are required for the complete system.
The 60 cards representing general descriptors are in one group; and the cards representing the 24 positional descriptors are separated into 24 sets of 22 cards. Each of the 60 general descriptors is assigned a number, and the 60 cards are grouped randomly. The numbered cards representing the positioned descriptors are also filed randomly within eight groups, each group consisting of three numbered sets of 22 as in Fig. 6.
The random filing of cards within groups, made possible by the use of Radex tabbed cards, is neither integral to all uses of the Matrex system nor restricted to it; but in this application to positions of descriptors it serves to decrease materially the time required to select and refile cards. Items are entered into the Matrex system by selecting the appropriate descriptor cards from the file and drilling a hole in all the selected cards in the position corresponding to the number of the item.
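A minimal sketch of entry and search follows, modelling each descriptor card as an integer bit mask with one bit per item position. The class name, method names, and the numbering of items from 1 to 10,000 are assumptions for illustration; the physical card addresses an item by the coordinates of its hole, which is not modelled here.

```python
# Each descriptor card is modelled as an integer bit mask, one bit per item.
CAPACITY = 10_000  # item positions on the card chosen for the experiment


class MatrexFile:
    """Hypothetical model of a file of drilled descriptor cards."""

    def __init__(self):
        self.cards = {}  # descriptor label -> bit mask of drilled positions

    def enter(self, item, descriptors):
        """Drill one hole at the item's position through all selected cards."""
        if not 1 <= item <= CAPACITY:
            raise ValueError("item number outside card capacity")
        for d in descriptors:
            self.cards[d] = self.cards.get(d, 0) | (1 << (item - 1))

    def search(self, descriptors):
        """Superimpose cards; light passes only where every card is drilled."""
        mask = (1 << CAPACITY) - 1
        for d in descriptors:
            mask &= self.cards.get(d, 0)
        return [i + 1 for i in range(CAPACITY) if (mask >> i) & 1]


f = MatrexFile()
f.enter(1, ["= @4", "= @9", "alpha @1", "COOH @20"])
f.enter(2, ["= @4", "alpha @1"])
print(f.search(["= @4", "alpha @1"]))  # [1, 2]
print(f.search(["= @4", "COOH @20"]))  # [1]
```

Entry touches only the cards for the item's own descriptors, and a search touches only the cards named in the query, whatever the number of items already in the file.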

FIGURE 5.
The patent illustrated in the table (Fig. 2) and the punched card (Fig. 4), which can be considered item 1, is entered by selecting from the Matrex file the cards for = at 4, 9, 17; α at 1, 3, 5, 11, 17; COOH at 20; etc. After all cards are selected, they are placed together in the drill jig and a hole is drilled through the entire set at position one.
It is obvious that any items selected by any descriptor or combination of descriptors in the IBM system will also be delivered by the Matrex system; and that any item not delivered by the IBM system will also not be delivered by the Matrex system. A sample diagram illustrating the complete convertibility of code grouping is shown in Fig. 6, which compares a conventionally grouped Zatocoding system with an inverted group of codes having the same coding space.
FIGURE 6.
The identity of coding between conventional grouping and inverted grouping is secured by matching the number of code positions on the Zatocard to the number of cards in the inverted system. Similarly in the Matrex system for steroids there are as many cards as there are holes assigned for use on a punched card.
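The claimed identity of results can be checked mechanically on synthetic data. The sketch below uses randomly generated items (not the steroid file), one code position per descriptor, and conjunctive queries only; all names are hypothetical.

```python
import random

random.seed(0)
DESCRIPTORS = list(range(588))  # one code position, or one card, per descriptor

# Conventional grouping: item -> set of descriptor codes (random test data).
items = {n: set(random.sample(DESCRIPTORS, 20)) for n in range(1, 201)}

# Inverted grouping of the same data: descriptor -> set of item numbers.
cards = {d: set() for d in DESCRIPTORS}
for n, codes in items.items():
    for d in codes:
        cards[d].add(n)


def search_conventional(query):
    """Scan every item record, as a sorter passes every card in the deck."""
    return {n for n, codes in items.items() if query <= codes}


def search_inverted(query):
    """Superimpose only the descriptor cards named in the query."""
    return set.intersection(*(cards[d] for d in query))


for _ in range(1000):
    query = set(random.sample(DESCRIPTORS, random.randint(1, 3)))
    assert search_conventional(query) == search_inverted(query)
print("all queries agree")
```

The two searches always return the same items; what differs is the work done, since the conventional search reads every record while the inverted search reads only the cards in the query.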
In the following comparison, dollars have been used with reference to equipment; and “times” for operations for which salary rates are unknown. In all cases such times are exact only to a degree since stop watch measurements were used in some cases and not in others. However, the magnitudes are sufficiently revealing for our purposes.
Basis for comparison
In order to arrive at some basis for comparison of the Matrex and punched card apparatus for searching steroid patents, the Patent Office staff provided ten hypothetical queries and ten input problems were derived from 10 randomly selected patent cards. It might be thought that with identical indexing and identical problems, a conclusive comparison could be made. Actually, the experiment was carried out under other than ideal or perhaps even typical conditions. For example, the Patent Office group did not have readily available a tabulator which would have enormously simplified the print-out of the answers. Further, a Type 82 sorter (650 cards per minute) was used whereas a Type 83 at 1000 cards per minute would have given better speeds. On the other hand, the unique Census Bureau multicolumn sorter with dial-set wiring was used instead of the more common Type 101 which would be somewhat more costly to operate because of the more tedious plugboard wiring.
Ten input and query problems may be much too small a sample for a valid conclusion although the problems were taken entirely at random. The entire steroid patent collection (1556 patents in the machine system) may also be too small to permit reasonable extrapolation of the experimental results to larger collections. None of these inexactitudes is hopeless, however, and anyone can qualify the results to suit his requirements and draw reasonable conclusions. Perhaps the most troublesome prediction involves the tolerable size of a search answer. The search costs in either system are seriously affected by whether a search yielding, say, as many as 100 or 500 answers is acceptable, or whether the search result must be limited to a smaller number. Moreover, re-searching to vary the size of the yield may be excessively costly under certain conditions. This will be discussed later.
Input data
The following data are for system input after the data have been coded for the machines. Therefore normal total input costs cannot be determined from these data. Actually, the conditions here are not typical either for the punched card or Matrex device. The punched card data do not include conversion to punched card codes (a step normally not required in a Matrex device), and the Matrex file is aided by the use of random-filing tabs, which reduce the time needed both to locate the Matrex cards and to return them to the deck.

According to the Patent Office, the time required to prepare a punched card, i.e., a patent, for the system is 2 minutes (including verification) from a code sheet. This is to be compared with the data for ten input problems for the Matrex device (Table 1).
TABLE 1 Termatrex input (times in minutes:seconds)

Patent no.   Terms   Selection   Drill    Return to rack   Total elapsed time
2742485      24      4:14        :39.9    1:24.4           6:17
2763669      19      3:10        :45.1     :55.9           4:51
2275969      16      2:12        :24.9     :44.7           3:21
2785189      24      3:21        :21.2     :21.9           4:04
2662089      19      2:41        :22.3     :18.5           3:22
2159569      23      2:45        :20.2     :19.0           3:24
2186906      19      2:58        :16.1     :30.8           3:45
2734066      23      2:50        :21.9     :24.1           3:36
2408832      16      2:16        :15.4     :39.9           3:11
1985747       9      1:30        :14.9     :33.6           2:19
Average      19.2    2:48        :24.2     :37.3           3:48

Average elapsed time 3:48
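The averages in Table 1 can be recomputed from the ten rows. The sketch below assumes the Drill and Return entries are seconds with tenths and the other times are minutes:seconds; that reading is an inference from the fact that the three components then add up to each row's total. The helper names are hypothetical.

```python
def seconds(text):
    """Convert 'm:ss' or ':ss.s' (minutes optional) to seconds."""
    minutes, _, secs = text.rpartition(":")
    return 60 * int(minutes or 0) + float(secs)


def mmss(total_seconds):
    """Format a number of seconds as m:ss, rounded to the nearest second."""
    whole = round(total_seconds)
    return f"{whole // 60}:{whole % 60:02d}"


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# (terms, selection, drill, return to rack, total elapsed) per patent
ROWS = [
    (24, "4:14", ":39.9", "1:24.4", "6:17"),
    (19, "3:10", ":45.1", ":55.9", "4:51"),
    (16, "2:12", ":24.9", ":44.7", "3:21"),
    (24, "3:21", ":21.2", ":21.9", "4:04"),
    (19, "2:41", ":22.3", ":18.5", "3:22"),
    (23, "2:45", ":20.2", ":19.0", "3:24"),
    (19, "2:58", ":16.1", ":30.8", "3:45"),
    (23, "2:50", ":21.9", ":24.1", "3:36"),
    (16, "2:16", ":15.4", ":39.9", "3:11"),
    (9, "1:30", ":14.9", ":33.6", "2:19"),
]

avg_terms = mean(r[0] for r in ROWS)
avg_selection = mean(seconds(r[1]) for r in ROWS)
avg_drill = mean(seconds(r[2]) for r in ROWS)
avg_return = mean(seconds(r[3]) for r in ROWS)
avg_total = mean(seconds(r[4]) for r in ROWS)

print(avg_terms)             # 19.2
print(mmss(avg_selection))   # 2:48
print(round(avg_drill, 1))   # 24.2 (seconds)
print(round(avg_return, 1))  # 37.3 (seconds)
print(mmss(avg_total))       # 3:49
```

The recomputed averages for terms, selection, drill, and return agree with the table. The mean of the ten listed totals comes to 3:49, one second more than the 3:48 given in the table, a difference small enough to be rounding.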
It can be argued that the times are more nearly equal when punching instructions must be prepared. But this is beside the point for the purposes of this paper.
Output data
The Patent Office supplied ten queries for use in this experiment. The characteristics of these queries and the results are summarized below. The Patent Office supplied a blanket figure of 6 1/2 minutes time to present the answer cards to any query of the deck of 1556 patent cards on the Census Bureau’s multicolumn sorter. We can assume that somewhat more time would be required on a standard IBM 101 machine.
The results produced by the runs indicated by the Patent Office were duplicated on a Type 82 IBM sorter (650 cards per minute). In order to keep the running time to the lowest possible value, a predetermination was made of the order of search terms giving the final answer in the minimum sorting time. This was determined absolutely for this experiment. In normal practice this ideal sort may not always be achieved, although a skilled searcher might do nearly as well. An important point to remember in examining the following table is that the search yield is merely a presentation of the answers and not a read-off or print-out that may be taken to the Patent files. In the case of the IBM machines, the presentation is a group of punched cards in nonsequential array. In the Matrex device, the presentation is a series of dots of light whose coordinates must be read off. We must assume that in all instances the patent numbers must be copied down or tabulated on a sheet of paper.
In the Appendix is given the detail of the sorting operation for each query which is of interest in showing how the deck was reduced on each pass. We were hopeful that such data could be generalized to write into an equation for prediction purposes, but the sample is too small. Such an equation based on several hundred random queries would be useful for cost determination purposes.
The question of read-out is important, cost-wise. With a Matrex device it takes approximately 6 seconds per item to read the coordinates and write down the number. We find that it takes about 4.5 seconds to write down a number from an interpreted punched card. A print-out from punched cards could be achieved much faster with a tabulator, but the cost of the tabulator must be reckoned. The steroid punched card deck need not be maintained in numerical order for search purposes, but the answer cards should be sorted for read-out purposes.
TABLE 2

         Number     Number of   Searching time, minutes     No. of cards
Query    of terms   answers     Matrex    101     Sorter    handled in sorter
1        4          363         1.5       6.5     7.5       3495
2        7          155         2.3               8.1       3434
3        7          148         2.4               8.8       3633
4        5           18         2.4               6.5       1701
5        6           61         1.8               6.2       2120
6        4           17         1.2               6.3       1690
7        5            3         1.0               5.8       1739
8        8           69         2.3               8.5       2744
9        6          344         1.4               8.3       3887
10       7           15         1.7               6.0       1737
Total                           18.0              72.0
Average  5.9        119.3       1.8       6.5     7.2       2618.0 = 1.68 × deck of 1556 cards
All these factors must be related to the size of the answer. This would be of no concern if all searches yielded answers such as query Nos. 4, 6, 7, and 10 in Table 2. But query 1, with 363 answers presented, is another matter. With the Matrex device it would take 36.3 minutes to write out the answers; with punched cards, 27.2 minutes after a numerical sort. A tabulator could print out the answer from punched cards in about 6.3 minutes.
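The read-out figures quoted for query 1 follow directly from the per-item rates given earlier (about 6 seconds per item on the Matrex device, about 4.5 seconds from an interpreted punched card). A short check, with a hypothetical function name:

```python
def readout_minutes(answers, seconds_per_item):
    """Time to copy down the answers by hand, in minutes."""
    return answers * seconds_per_item / 60


MATREX_SECONDS = 6.0  # read coordinates and write down the number
CARD_SECONDS = 4.5    # write down a number from an interpreted punched card

print(round(readout_minutes(363, MATREX_SECONDS), 1))  # 36.3
print(round(readout_minutes(363, CARD_SECONDS), 1))    # 27.2
print(round(readout_minutes(3, MATREX_SECONDS), 1))    # 0.3
```

For a small yield such as the 3 answers of query 7, read-out takes well under a minute on either system, which is why the size of the answer dominates the comparison only for the large queries.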
A tabulator is one of the more expensive pieces of equipment and a relatively high load factor is necessary to justify it.

If a presentation of 363 answers is too large, the remedies are: (a) addition of another term, (b) substitution of terms, (c) complete rephrasing of the query. The first is easy to do with all three devices but probably simplest with the Matrex device. The second is still easy on the Matrex device, and considerably more difficult on the punched card machines. Rephrasing takes us back to the relative costs indicated by the time of search in Table 2.
Equipment costs
Table 3 shows the cost of equipment required for input and searching with punched card and Matrex devices.
TABLE 3

Punched card (sorter)      Punched card (101)       Matrex device
Punch     $2,067.00        Punch     $2,067.00      Light box, template, and drill
Sorter     6,943.00        101       31,800.00      (10,000 items capacity)    $285.00
If a tabulator were added to the punched card system for read-out purposes an additional cost of $29,627.00 would have to be considered.
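Summing the Table 3 figures per configuration makes the comparison explicit. The totals below are simple additions of the listed prices and are not stated in the source; the variable names are hypothetical.

```python
PUNCH = 2067.00
SORTER = 6943.00
IBM_101 = 31800.00
TABULATOR = 29627.00
MATREX = 285.00  # light box, template, and drill (10,000 items capacity)

configurations = {
    "Punched card (sorter)": PUNCH + SORTER,
    "Punched card (101)": PUNCH + IBM_101,
    "Matrex device": MATREX,
}

for name, cost in configurations.items():
    print(f"{name}: ${cost:,.2f}")
# Punched card (sorter): $9,010.00
# Punched card (101): $33,867.00
# Matrex device: $285.00

# Optional tabulator for read-out on the punched card systems.
print(f"${PUNCH + SORTER + TABULATOR:,.2f}")   # $38,637.00
print(f"${PUNCH + IBM_101 + TABULATOR:,.2f}")  # $63,494.00
```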
Extrapolation to larger collection
Theoretical considerations have pointed to the unfavorable effects of growth of a collection in a system not using inverted grouping. There follows a brief analysis of this effect as exemplified by the punched card steroid index. Many intangible and policy questions enter into the problem of how many answers a search device should present for digestion by the inquirer. It is impossible to say that he should or should not go to the file and examine X number of actual documents. If the volume is large, print-out aids are necessary and the investment must be faced. By stopping short of the print-out question, the effects of collection growth can be brought sharply into form.
Let us assume that the steroid patent collection grows to 10,000 punched cards. (This happens to be the limit of a cheap Matrex device; more sophisticated Matrex devices can be had for collections of 40,000 for between $1000 and $1500.) The input costs of either punched cards or Matrex are not affected by growth, if we ignore the relatively low cost of tabulating card stock.
The story is quite different on the output side. The cost of answer presentation on the Matrex is the same whether 10 or 10,000 items are in the collection. With conventional punched cards the cost of search goes up linearly with the size of the deck. In examining just how much, we encounter the most interesting phenomenon that the more sophisticated 101 approach suffers greatest from growth and becomes far less efficient than an ordinary sorter.
If the 101 takes 6.5 minutes to search a deck of 1556 cards, it would take 41.8 minutes to search 10,000 cards.
Assuming equal proportions of drop out and the same operating efficiency as on the smaller deck, 10,000 cards would require
The operating efficiency on the smaller deck is
It is reasonable to assume that, with 10,000 cards, the operating efficiency may be raised to a figure of 80%. Thus the search time would be equivalent to
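The displayed computations for this passage are not reproduced in the text, so the following is a reconstruction offered as a sketch, not the authors' own working. It assumes that search time scales linearly with deck size, and that "operating efficiency" means pure sorter running time (cards handled divided by the rated 650 cards per minute) as a fraction of measured elapsed time. Under those assumptions the arithmetic reproduces the 41.8 and 32.4 minute figures of Table 4.

```python
DECK = 1556           # cards in the present collection
GROWN = 10_000        # hypothetical collection size
SORTER_RATE = 650     # Type 82 sorter, cards per minute
CARDS_HANDLED = 2618  # average cards passed per query (Table 2)
T_101 = 6.5           # minutes per query on the 101 (Patent Office figure)
T_SORTER = 7.2        # average minutes per query on the sorter (Table 2)

scale = GROWN / DECK

# 101: the whole deck is read for each query, so time scales with deck size.
t_101_grown = T_101 * scale

# Sorter with the same drop-out proportions and unchanged efficiency.
t_sorter_same_efficiency = T_SORTER * scale

# ASSUMED definition: efficiency = machine running time / elapsed time.
efficiency_small_deck = (CARDS_HANDLED / SORTER_RATE) / T_SORTER

# Sorter on the grown deck at the 80% efficiency suggested in the text.
t_sorter_80 = (CARDS_HANDLED * scale / SORTER_RATE) / 0.80

print(round(t_101_grown, 1))               # 41.8
print(round(t_sorter_same_efficiency, 1))  # 46.3
print(round(efficiency_small_deck, 2))     # 0.56
print(round(t_sorter_80, 1))               # 32.4
```

The intermediate values (46.3 minutes and an efficiency of about 56%) are products of this reconstruction and do not appear in the surviving text.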
Table 4 summarizes the times involved in entering material and searching the present collection of 1556 steroid patents and a hypothetical collection of 10,000:
TABLE 4

                  Entry time     Time for answer presentation per query, minutes
                  per patent,    Collection of    Collection of
                  minutes        1556 patents     10,000 patents
Matrex device     3.5            1.8              1.8
Type 82 sorter    2              7.2              32.4
IBM 101           2              6.5              41.8
ACKNOWLEDGMENT
The authors are indebted to Mr. Don Andrews and Mr. Julius Frome of the U.S. Patent Office for their cooperation in supplying certain basic information for this paper.