Friday, February 23, 2018

What is a document - Part 7

The word “document”
is, like the word “database”, simple on the outside and complex on
the inside.

Most of us carry around pragmatically fuzzy definitions
of these in our heads. Since the early days of personal computers
there have been software suites/bundles available that have included
distinct tools to manage “documents” and “databases”,
treating them as different types of information object. The first such package
I used was called SMART running on an IBM PC XT machine in the late
Eighties. It had a 10MB hard disk. Today, that is hardly enough to store a single document, but I digress...

I have used many other office suites since then, most of which have
withered on the vine in enterprise computing, with the notable
exception of Microsoft Office. I find it interesting that of the
words typically associated with office suites, namely “database”,
“word processor”, “presentation”, and “spreadsheet”, the
two that are today most tightly bound to Microsoft Office are
“spreadsheet” and “presentation”, to the point where “Excel”
and “PowerPoint” have become generic terms for “spreadsheet”
and “presentation” respectively. I also think it is interesting that Excel has become the de facto heart of Microsoft Office in the business community, with Word/Access/PowerPoint being of secondary importance as "must haves" in office environments, but again I digress...

In trying to chip away
at the problem of defining a “document” I think it is useful to
imagine having the full Microsoft Office suite at your disposal and
asking the question “when should I reach for Word instead of one of
the other icons when entering text?” The system I worked on in the Nineties, mentioned
previously in this series, required a mix of classic field-type
information along with unstructured paragraphs/tables/bulleted lists.
If I were entering that text into a computer today with Microsoft
Office at my disposal, would I reach for the word processor icon or
the database icon?

I would reach for the
Word icon. Why? Well, because there are a variety of techniques I can
use in Word to enter/tag field-type textual information and many techniques
for entering unstructured paragraphs/tables/bulleted lists. The
opposite is not true. Databases tend to excel (no pun intended) at
field-type information but be limited in their support for
unstructured paragraphs/tables/bulleted lists – often relegating
the latter to “blob” fields that are second-class citizens in the
database schema.

Moreover, these days, the tools available for
post-processing Word's .docx file format make it much easier than
ever before to extract classic “structured XML” from Word
documents, while keeping the vital familiarity and ease of use for the
authors/editors I mentioned previously.
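To make the post-processing idea concrete, here is a minimal sketch using only the Python standard library. It relies on two facts about the format: a .docx file is a zip archive, and the main text lives in word/document.xml as WordprocessingML. Everything else is my own assumption for illustration: the function names, the "Normal" fallback label for unstyled paragraphs, and the convention of turning a paragraph's style id into an element name. A real pipeline would need to deal with tables, runs with mixed formatting, headers/footers and much more.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used inside word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def extract_paragraphs(docx_bytes):
    """Return a list of (style_id, text) pairs, one per paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    result = []
    for p in root.iter(f"{{{W}}}p"):
        style_el = p.find("w:pPr/w:pStyle", NS)
        # Assumption: label paragraphs without an explicit style as "Normal"
        style = style_el.get(f"{{{W}}}val") if style_el is not None else "Normal"
        # A paragraph's text is spread across one or more w:t elements
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        result.append((style, text))
    return result


def to_structured_xml(paragraphs):
    """Hypothetical convention: the style id becomes the element name."""
    doc = ET.Element("document")
    for style, text in paragraphs:
        tag = style.lower() if style.isidentifier() else "para"
        ET.SubElement(doc, tag).text = text
    return ET.tostring(doc, encoding="unicode")
```

The point of the sketch is that the author only ever picks a paragraph style in Word. The mapping from styles to "structured XML" happens afterwards, at the back end, where the author never sees it.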

Are there exceptions?
Absolutely. There are always exceptions. However, if your data
structure necessarily contains a non-trivial amount of unstructured
or semi-structured textual content and if your author/edit community wants to
think about the content in document/word-processor terms, I believe
today's version of Word, with its .docx file format, is, generally
speaking, a much better starting point than any database front-end,
spreadsheet front-end, web-browser front-end or structured XML editing tool front-end.

Yes, it can get messy to do the post-processing of the data. But given a choice between a solution
architecture that guarantees me beautifully clean data at the
back-end but an author/edit community who hate it, versus a solution
architecture that involves extra content enrichment work at the
back-end but happy author/edit users, I have learned to favor the latter every time.

Note I did not start there! I was on the opposite side of this for many,
many years, thinking that structured author/edit tools that enforce structure at the front-end were the way to go. I built a few beautiful structured systems that
ultimately failed to thrive because the author/edit user community
wanted something that did not “beep” as they worked on content. When writing the
books I wrote for Prentice-Hall (books on SGML and XML - of all things!), I myself wanted something that did not beep!

Which brings me
(finally!), to my answer to the question “What is a document?”. My
answer is that a document is a textual information artifact where the
final structure of the artifact itself is only obvious after
it has been created/modified and thus requires an author/edit user
experience that gets out of the way of the user's creative
processes until the user decides to impose structure – if they
decide to impose a structure at all.

There is no guaranteed schema validity other than
that most generic of schemas that splits text into flows, paragraphs,
words, glyphs, etc. and allows users to combine content and presentation as they see fit.

On top of that low-level structure, anything goes – at least until the
point where the user has decided that the changes to the information
artifact are “finished”. At the point where the intellectual work has been done figuring out what the document should say and how it should say it, it is completely fine - and generally very useful - to be able to validate against higher-level, semantic structures such as "chapter", "statute", "washing machine data sheet" etc.

The big lesson of my career to date in high-volume document management/processing is that if you seek to impose this semantic structure on the author/edit community, rather than have them come to you and ask for some structure imposition, you will struggle mightily to have a successful system.