Thursday, December 14, 2017

What is a document? - Part 2

Back in 1985, when I
needed to create a “document” on a computer, I had only two
choices. (Yes, I am indeed avoiding trying to define “document”
just yet. We will come back to it when we have more groundwork laid
for a useful definition.) The first choice involved typing into what
is known generically as a “text editor”. Back in those days,
US-ASCII was the main encoding for text and it allowed for just the
basic letters, numbers and a few punctuation symbols. The
so-called “text files” created by these “text
editors” could be viewed on screens which typically had 80 columns
and 25 rows. They could also be printed onto paper, using either “dot
matrix” printers or higher-resolution computerized typewriters,
such as the so-called “golf ball” typewriters/printers, which
mimicked a human typist using a ribbon-based impact printer.

The second choice was to wedge the text into little boxes called "fields" to be stored in a "database". Yes, my conceptual model of
text in computers in those early days was a very binary one. (Some nerd
humour in the last sentence.)

On one hand, I could type stuff into
small “boxes” on a screen which typically resulted in the
creation of some form of “structured” data file e.g. a CODASYL
database [1]. On the other hand, I could type stuff into an
expandable digital sheet of paper without imposing any structure on
the text, other than a collection of text characters, often chunked
with what we used to call CRLF separators (Carriage Return, Line
Feed).
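
To make the contrast concrete, here is a minimal sketch of the same information held both ways. The record layout (a 20-character name field and a 15-character city field) and all of the names in it are made up purely for illustration; they are not taken from any particular database system of the era.

```python
from dataclasses import dataclass

NAME_WIDTH = 20  # hypothetical field widths, chosen only for this example
CITY_WIDTH = 15


@dataclass
class PersonRecord:
    """Structured: every value lives in a named, fixed-size 'box'."""
    name: str
    city: str

    def pack(self) -> bytes:
        # Pad (or truncate) each field to its fixed width, as a
        # record-oriented data file would.
        name = self.name.ljust(NAME_WIDTH)[:NAME_WIDTH]
        city = self.city.ljust(CITY_WIDTH)[:CITY_WIDTH]
        return (name + city).encode("ascii")


record = PersonRecord(name="Ada Lovelace", city="London")
packed = record.pack()

# Unstructured: just characters, chunked into lines by CRLF separators.
free_text = "Ada Lovelace lives in London.\r\nShe writes programs.\r\n"
lines = free_text.split("\r\n")[:-1]  # drop the empty piece after the final CRLF
```

A program can find the city in the packed record by position alone (bytes 20 to 34), whereas finding it in the free text means interpreting the prose.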

(Aside: You can see the typewriter influence in the terminology
here. Return the carriage (holding the print head) to the left of the
page. Feed the page upwards by one line. So Carriage Return + Line
Feed = CR/LF).

(Aside: I find that the
origins of some of this terminology are often news to younger
developers who wonder why moving to a new line is two characters
instead of one on some machines. Surely “newline” is one thing?
Well, it was two originally because one command moved the carriage
back (the “CR”) and another command moved the paper up a line
(the “LF”), hence the common pairing: CR/LF. When I explain this I
double up by explaining “uppercase/lowercase”. The origins of the
latter, in particular, are not well known to digital natives in my
experience.)
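
The two-character pairing is easy to see at the byte level. The following is a small sketch, not a definitive tool: the helper `count_line_endings` and the sample strings are invented for this illustration. It relies only on the fact that in ASCII, Carriage Return is code 13 and Line Feed is code 10.

```python
CR = b"\r"  # Carriage Return, ASCII 13 (0x0D)
LF = b"\n"  # Line Feed, ASCII 10 (0x0A)

# The same two lines of text with two different line-ending conventions.
crlf_text = b"first line\r\nsecond line\r\n"  # CR followed by LF
lf_text = b"first line\nsecond line\n"        # LF on its own


def count_line_endings(data: bytes) -> dict:
    """Count CR/LF pairs and any CR or LF bytes that appear unpaired."""
    pairs = data.count(CR + LF)
    return {
        "CRLF": pairs,
        "lone LF": data.count(LF) - pairs,
        "lone CR": data.count(CR) - pairs,
    }


crlf_counts = count_line_endings(crlf_text)
lf_counts = count_line_endings(lf_text)
```

Here `crlf_counts` reports two CR/LF pairs and nothing else, while `lf_counts` reports two lone line feeds, which is exactly the "two characters versus one" difference described above.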

From my first
encounters with computers, this difference in how the machines
handled storing data intrigued me. On one hand, there were
“databases”. These were stately, structured, orderly digital
objects. Mathematicians could say all sorts of useful things about
them and create all sorts of useful algorithms to process them.
Databases were designed for automation.

On the other hand,
there was the rebellious, free-wheeling world of text files.
Unstructured. Disorderly. A pain in the neck for automation.
Difficult to reason about and create algorithms for, but
fantastically useful precisely because they were unstructured and
disorderly.

I loved text files back
then. I still love them today. But as I began to dig deeper into
computer science I began to see that the binary world view (database
versus text, structured versus unstructured) was simple, elegant and
wrong. Documents can indeed be “structured”. Document processing can indeed be automated. It is possible to
reason about documents and create algorithms for them, but it took me
quite a while to get to grips with how this can be done.

My journey of discovery
started with an ADM 3A+ terminal to a VAX 11/780 mini-computer (by
day) [2] and an Apple IIe personal computer running CP/M (by
night) [3].

For the former, a program called RUNOFF. For the latter, a program called WordStar and one of my favorite pieces of hardware of all time: an Epson FX-80 dot matrix printer.