Data, Text and Strings (oh, my!)

The “Strings and Text are not the same” post on Musing Mortoray discusses the difference between “text” and “strings,” and it got me thinking. Rather than weigh down his (or her) comment section with a very long comment, I thought I’d go on a bit about it here.

I agree totally with the basic premise: that “text” and “strings” are different beasts. I also agree that text-handling depends on the text. There might be some difference in how we define text and string, and thinking about how I define them turned up a lot of thoughts on the matter. This isn’t intended as an opposition post (except on one point with regard to HTML). What follows is just one programmer’s opinion.

Let me start with the idea of text, which I take to be a generic, very inclusive, abstract class of physical object. The books on the shelves around me are filled with text. The CD and DVD covers have various (sometimes very inventive) forms of text on them. This post and the one that triggered it are both text objects. A hand-written letter is a form of text.

Almost by definition, language written down in readable form is text (for broad definitions of “written down” and “readable”). So far, I think, Mortoray and I are in full accord. Where we diverge may be in how we define a “string.”

For me, a string is a computer language term for a kind of data object. It has a generic, but notably concrete, definition (whereas text is generic and highly abstract). To see how a string fits into the picture, I have to go all the way back to the ones and zeros. (I know this will be obvious and old-hat to many, but bear with me; I’m laying groundwork here.)

Inside the computer it’s all just same-sized chunks of ones and zeros. (Technically, they’re not even that; they’re voltage levels interpreted as either zero or one.) Chunks are often stuck together (in various ways) to make bigger chunks. To make this useful, we interpret the chunks in meaningful ways. Taken individually and collectively, the chunks are data, the most generic form of information.
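To make the “interpretation” point concrete, here is a minimal Python sketch. The four byte values are arbitrary ones I picked for illustration; the point is only that the same chunks read as an integer, a float, or a few characters depending on how we choose to look at them.

```python
import struct

# Four 8-bit chunks; the values are arbitrary, chosen for illustration.
chunk = bytes([0x48, 0x69, 0x21, 0x00])

# Interpretation 1: one unsigned 32-bit integer (little-endian byte order).
as_uint = int.from_bytes(chunk, "little")

# Interpretation 2: one 32-bit IEEE 754 float built from the same bits.
as_float = struct.unpack("<f", chunk)[0]

# Interpretation 3: the first three chunks as ASCII characters.
as_text = chunk[:3].decode("ascii")

print(as_uint)   # 2189640
print(as_float)  # a tiny (denormal) float
print(as_text)   # Hi!
```

Nothing in the chunks themselves says which reading is “right.” That comes from the type we impose on them.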

We can stop right here. Computation does not require anything more than the most fundamental data types (think Turing machine). Once we treat chunks as encoded numbers, that’s all we absolutely need. (Plus an important distinction between “data” (in the usual sense) and a special kind of data we call “code.”)

But to be more effective (and clear!) it’s handy to impose higher data abstractions on the chunks. To my way of thinking, data first breaks down into three generic classes: text (strings, chars, regex, etc.), numbers (integers, floats, complex, etc.) and other (dates, locations, images, etc.). These three main classes break down into various physical classes (some of which are named in the parentheses).

I do mean “class” not “type,” and I mean “class” in the abstract sense of “a kind of thing” (and not a class definition in an object-oriented language). We’re still talking data abstraction here; no metal involved. Unless referencing a specific language, the terms “string” or “integer” or “date” are still abstract and generic, but they’re a lot less abstract than “text” or “number” because we’re starting to get into semantics and physical format.

Dates, integers and strings all have specifics that affect how we store them as data. One obvious parameter is that the nature of the data controls how many chunks we need to represent that data. An interesting thing about strings is that they have variable length compared to most other data types.
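A quick sketch of the length point, assuming a 32-bit integer format and UTF-8 as the string encoding (both are my choices for the example, not anything the argument depends on):

```python
import struct

# A 32-bit integer occupies four bytes whatever its value.
small = struct.pack("<i", 7)
large = struct.pack("<i", 2_000_000_000)

# A string's size depends on its content (and on the encoding chosen).
short = "hi".encode("utf-8")
longer = "hello, world".encode("utf-8")
accented = "h\u00e9llo".encode("utf-8")  # 5 characters, 6 bytes in UTF-8

print(len(small), len(large))                  # 4 4
print(len(short), len(longer), len(accented))  # 2 12 6
```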

Another thing we can say about a string is that it has some sort of context or semantic; it means something. But all data has some context. All data means something. Strings, like integers, are general enough data types to be used in a variety of ways, so about all we can really say about strings is that they have length and meaning, and only the first point is special.

Of course, text also has length and a meaning. Strings and text are not different in this regard. But I think they do differ in a big way in level of abstraction, as covered above. It’s arguable that they even differ in kind. I believe we all agree on that much.

For one thing, text (at least to me) implies multiple lines of text. Just about everything I would label as “text” does have multiple lines. On the other hand, I think of strings as short bits of text: single lines at best. There are many places in the computer world where a multi-line string is a special case. I definitely see a string as a special kind of text. (There is even a common phrase: “a string of text.”)

If I were to read a text file, I might very well read it into an array of strings, particularly if the text file had a line-oriented structure (e.g. properties or config file). If the file wasn’t particularly line-oriented, say an XML or HTML file, I’d probably read the whole thing into a single string or parse it on the fly.
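Roughly, the two approaches look like this in Python. The helper names (`read_lines`, `read_whole`) and the sample file names and contents are made up for the sketch; it writes its own temporary files so it can run anywhere.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Line-oriented file -> list of strings, one per line."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def read_whole(path):
    """Free-form file -> one single string."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample files, created only for this demonstration.
    cfg = Path(tmp) / "app.properties"
    cfg.write_text("name=demo\nport=8080\n", encoding="utf-8")
    page = Path(tmp) / "page.html"
    page.write_text("<p>Hello,\nworld</p>\n", encoding="utf-8")

    config_lines = read_lines(cfg)   # ['name=demo', 'port=8080']
    html_source = read_whole(page)   # one string, line breaks included

print(config_lines)
print(repr(html_source))
```

In the first case the line breaks are structure, so they disappear into the array. In the second they are just more characters in the string.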

Which brings us to a crucial point of disagreement: the idea that HTML is not text. In my view, HTML absolutely is text. So are XML, XSD, XSLT, SQL, JSON, DDL et many alii. A possibly important aspect: they are texts in specific languages. (One crucial distinction: all are computer languages, but only XSLT is a computer programming language. Specifically, HTML is not a programming language!)

These languages are not disqualified as text because editor software ignorant of the language mangles the text. An HTML editor edits HTML as expected, just as an XML editor edits XML. They are text because they can be loaded into any text editor and be edited as text by a person knowledgeable in the language. Obviously, syntax-aware editors are better than simple text editors, and structured editors are better still. But that doesn’t mean the source text isn’t text. (To me the term “codes” just means “numbers” and is too ambiguous for use. What these languages do contain is tokens.)

In the mark-up languages (SGML, HTML, XML, etc.) the presence of meta-data tokens might lead one to classify these texts as non-textual, but I see them as texts in a specific language. There are language-specific editing considerations, as there might be in any language. The key, though, is that any text editor can edit any text file.

In fact, one of the great things about text is how universal it is. This is why XML is the monster hit it is: text with structure and data type! It’s possible that HTML, CSS and JavaScript (and HTTP itself!) all being text formats was instrumental in the huge success of “the web.”

The line is actually pretty fuzzy between text and structured, marked-up text. In the old days, we used *bold* and _italics_ meta-symbols. (There are websites that detect these and bold or italic accordingly.) One can even view punctuation, such as exclamation and question marks (or even the period), as meta-tokens that add meaning to the text. And I’ve seen coders use HTML tags (particularly made up ones) in text <seriously>to make a point.</seriously> Did my using two made up “HTML” tags make this not text? And what about the use of double-quotes to mean not really?
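For what it’s worth, detecting those old-style meta-symbols takes only a couple of regular expressions. This is a rough sketch of the idea, not how any particular website does it; `render_emphasis` is a name I made up, and the rules (markers must hug the words they wrap and must not sit inside a word) are my own assumptions.

```python
import re

# *bold*: opening star not preceded by a word character, no space just
# inside either star, closing star not followed by a word character.
BOLD = re.compile(r"(?<!\w)\*(?=\S)([^*\n]+)(?<=\S)\*(?!\w)")
# _italics_: same rules, with underscores.
ITALIC = re.compile(r"(?<!\w)_(?=\S)([^_\n]+)(?<=\S)_(?!\w)")


def render_emphasis(text):
    """Turn *word* into <b>word</b> and _word_ into <i>word</i>."""
    text = BOLD.sub(r"<b>\1</b>", text)
    return ITALIC.sub(r"<i>\1</i>", text)


print(render_emphasis("This is *bold* and _italic_ text."))
# This is <b>bold</b> and <i>italic</i> text.
print(render_emphasis("snake_case_name and 2 * 3 * 4 stay as they are"))
# snake_case_name and 2 * 3 * 4 stay as they are
```

The input is plain text before and after; only the reader’s (or program’s) willingness to treat the symbols as mark-up changes.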

In my experience, in the computer world, text is the Yin to the Yang of binary. Data that isn’t binary is text. (If it’s not text, it’s usually something bizarre.) Text is code for “won’t blow up anything that handles text” or “won’t look like Martian when printed” (keeping in mind that coders are smart about what isn’t Martian). In the old days, text meant seven-bit ASCII (or EBCDIC or whatever). As we’ve become more global, we’ve grown towards Unicode, so now the definition of text is a bit broader. (Definitely requires graceful eight-bit handling.)
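Here is one way that litmus test might be approximated in code. It is a heuristic sketch under my own assumptions (UTF-8 as the encoding, and tab, newline, carriage return and form feed as the only acceptable control characters); `looks_like_text` is a hypothetical helper, not a standard function.

```python
import unicodedata

# Control characters a text editor is expected to handle gracefully.
ALLOWED_CONTROLS = "\t\n\r\f"


def looks_like_text(data: bytes) -> bool:
    """Heuristic: valid UTF-8 with no stray control characters."""
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return all(
        ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
        for ch in decoded
    )


print(looks_like_text(b"name=demo\nport=8080\n"))          # True
print(looks_like_text("na\u00efve caf\u00e9".encode("utf-8")))  # True
print(looks_like_text(b"\x89PNG\r\n\x1a\n"))               # False
print(looks_like_text(bytes([0, 1, 2, 3])))                # False
```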

If it isn’t binary, it’s text. Yin/Yang. For me, that’s the bottom line.

But this may be mostly terminology (or pedantry). I think the real issue Mortoray is addressing is that string is too generic and that, in context, most strings have deeper semantics than “string” so they should be more complex data types. On that point, I agree completely.

There are many places where this is addressed by software. I have an XML editor (XML Spy) that allows me to create and edit XML without ever seeing the source text. I’ve also used (but generally not liked) HTML WYSIWYG editors. To be honest, I typically use a text editor (gvim) for almost all flavors of text file. I like having full control of my XML and HTML.

The complaint that basic string operations are often inappropriate for structured or marked-up text is accurate, but I wouldn’t expect generic operations to apply to special cases. String operations are necessarily generic and intended as a foundation for general string handling. And they work fine in many general cases. But special cases demand special handling.

In object-oriented terms, handling HTML strings is a specialization of string handling and appropriately sub-classed. Most browsers do this when reading an HTML website. They convert the HTML text to a DOM object that reflects the tree structure of the page. If one is working with HTML or JSON or XML it’s fairly easy to work with an object model and to consider source text as just for input and output.
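As a small illustration of “source text is just for input and output,” here is a sketch using Python’s standard `xml.etree.ElementTree` and `json` modules. The sample document and settings are invented for the example, and a browser’s DOM is of course far richer than this.

```python
import json
import xml.etree.ElementTree as ET

# Input: source text (a made-up sample document).
source = "<page><title>Demo</title><item>one</item><item>two</item></page>"

# Work with the object model, not with the characters.
root = ET.fromstring(source)
title = root.findtext("title")
items = [item.text for item in root.findall("item")]
ET.SubElement(root, "item").text = "three"

# Output: back to source text only at the very end.
output = ET.tostring(root, encoding="unicode")
print(title, items)
print(output)

# The same pattern with JSON: parse, change the object, serialize.
settings = json.loads('{"port": 8080}')
settings["port"] += 1
print(json.dumps(settings))  # {"port": 8081}
```

No string operation ever touches the mark-up itself, which is exactly why the generic string operations being a poor fit stops mattering.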

It’s difficult for a language to provide this natively, but many provide capabilities in their libraries. Even so, it’s not unusual for a coder to create sub-types that handle structured text in ways tailored to the application. Long-time coders often have personal libraries that provide advanced functionality they’ve found useful.

To wrap it up, in my view, text is an abstraction, strings are computer objects, and complex text requires types that are knowledgeable about them. (In fact that last point is just basic OOD.)

2 thoughts on “Data, Text and Strings (oh, my!)”

I think we are definitely at a loss for appropriate terminology. This makes it hard to pin down the concepts we’re discussing.

I think your definition of “text” comprises a valid group. Things like HTML and C++ are definitely members of this broad group. It isn’t, however, the group I meant when I used the term “text”. I was speaking of the higher-level concept of words, paragraphs, and human consumable literature.

If we take an HTML document, from both our definitions, we have two meanings of “text”. From your view the sequence of codes, which we present as human glyphs, makes up a text. One that is understood by the computer and modified by a programmer. From my view, this is an encoding of a higher level text. The thing that the browser presents to the user is what my definition of “text” contained. Let’s call this high-level one a “document” for now, though it isn’t really a better term.

The contrast can be seen if we consider how many ways the document could be encoded by structured text. The exact same document could be encoded in Word, Latex, or HTML. Even within those encodings a lot of variation is allowed in the low-level text without changing the document in any fashion.

That aside, there is a very particular technical detail that distinguishes the classes. HTML is not actually a Unicode format. It can deal with Unicode encodings but it doesn’t understand them. Parsing does not consider combining characters as part of a whole; they are strictly individual tokens. This is what my experiment showed: an accented quote is seen by an HTML parser as a quote followed by an accent. This is very different from the intent of Unicode to treat this as a single grapheme cluster.
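The code-point view behind that behaviour can be seen with Python’s standard `unicodedata` module. This sketch does not reproduce the parser experiment itself; it only shows that a base character plus a combining accent is stored as two separate code points, which is what any code-point-based tokenizer sees.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character,
# two code points.
decomposed = "e\u0301"
# NFC normalization folds the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# A double quote plus the same accent has no precomposed form, so it
# stays two code points even after normalization.
accented_quote = '"\u0301'
normalized = unicodedata.normalize("NFC", accented_quote)
print(len(normalized))                 # 2
print(normalized[0] == '"')            # True: a plain quote comes first

# The accent is flagged as a combining mark (non-zero combining class).
print(unicodedata.combining("\u0301") != 0)  # True
```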

Perhaps the term you’re reaching for is “content” (aka “text content” aka “raw text” aka “straight text”). Sometimes also called “plain text” but that’s more properly used in opposition to encrypted text.

If we take that HTML document and strip the tags, what remains is the document’s content. If we save a Word document “as text” and remove all the pretty formatting, what remains is the document’s content. RTF, some PDFs (the text-based ones! 🙂 ), TeX and LaTeX are all document mark-up languages that use “in-band, text” meta-data to add structure and extra semantics to a text. (The “in-band” means the meta-data is embedded in the text stream.)
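Stripping the tags is easy to sketch with Python’s standard `html.parser` module. `ContentExtractor` and `strip_tags` are names invented for this example, and it is deliberately naive: it keeps every piece of character data, including anything inside script or style elements.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect the character data between tags; ignore the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html_source):
    """Return the content of an HTML string with the mark-up removed."""
    extractor = ContentExtractor()
    extractor.feed(html_source)
    extractor.close()
    return "".join(extractor.parts)


print(strip_tags("<p>Hello, <b>world</b>!</p>"))  # Hello, world!
print(strip_tags("<p>Fish &amp; chips</p>"))      # Fish & chips
```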

Crucially, note that the distinction here is between formatted text and unformatted text. The nature of the content is a separate layer of abstraction. You can format a novel, a Wiki page, a PDF or an SQL or C++ file. (Formatting a source code file usually makes it a display document and not a source code file anymore, but creating display versions of code is an obvious practice when documenting code.)

Another crucial distinction is between narrative and source code text documents. The former is what’s usually meant by “human consumable literature.” Narrative includes novels, magazine articles, poems, emails, Wiki articles, most blogs and lots more. Source code is a special kind of text that is human consumable only to programmers. (Good source code does have narrative content; a well-written source file should read like a short story or letter to someone.) But unlike formatted documents that have an unformatted “raw” content, source code documents are the content.

FWIW, the terms “text” and “document” do have fairly widely used and accepted definitions (so if you use them in unique ways, you’ll have to define your terms often, which can be a bother). “Text” really does pretty much just mean “not binary.” The litmus test is whether a “text” file can be loaded into a “text” editor with normal results. “Document” is an inclusive term for complex (typically complete) information files. There are PDF documents, MS Word and MS Excel documents, image documents, and so forth.

Straight text files and source code files are technically “documents” but most people seem to call them “files” rather than “documents.” There may be an unconscious bias that divides unformatted documents from formatted or structurally complex ones.