Parsing with Partial Knowledge

Bill Venners: There's something that's been bothering me about the way people have
been raving about XML. One of the big claims is that because XML data is self-describing,
with data wrapped by tags like <customerid>12345</customerid>,
clients can figure things out even for documents that don't strictly adhere to their schema and specification.
I hear claims that XML is more flexible, because providers of documents can be sloppy and just add new pieces of
data here and there. Clients can just ignore tags they don't recognize and find data even if it
is in the wrong place according to the schema. The Java class file
is not XML, but like XML it is a data structure and file format. There is a detailed specification for the Java
class file that describes all the data and semantics, and also
clearly defines the way in which class files can be extended. Providers and consumers of Java class files
adhere strictly to the specification. This approach
of strict compliance with a specification and schema makes more sense to me. I like what you have said about self-describing data,
but I'm concerned about the leap that some XML enthusiasts seem to make
that because the data is self-describing, the way in which a particular schema can evolve
doesn't have to be clearly specified or followed, because they assume clients will just ignore anything they don't
understand.

You write in your book, "You can parse a plain text file with only partial knowledge of its
format." How often do we lose the format specification? Or is this more about not needing to
"read the manual" (the specification) because the data is more user-friendly?

Dave Thomas: Oh no, it's not so you don't have to read the manual. It's that, if all
you have is a pile of data, I'm sure you'd much rather have something in there that gives you
some hints to the semantics, as well as just the data itself.
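
As an illustration of the claim discussed above, that a client can find the data it recognizes and ignore tags it does not, here is a minimal Python sketch. The document content is hypothetical: the <meta> wrapper and the <loyaltytier> and <total> elements are invented to stand in for tags the client was never told about. Only <customerid> comes from the example in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document. The client below knows only about <customerid>;
# <meta>, <loyaltytier>, and <total> are invented stand-ins for unknown tags.
DOCUMENT = """
<order>
  <meta>
    <customerid>12345</customerid>
    <loyaltytier>gold</loyaltytier>
  </meta>
  <total currency="USD">19.99</total>
</order>
""".strip()


def find_customer_ids(xml_text):
    """Return the text of every <customerid> element, wherever it is nested."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the element is found even if it is not
    # where a schema would have placed it. Unknown tags are simply skipped.
    return [el.text.strip() for el in root.iter("customerid") if el.text]


customer_ids = find_customer_ids(DOCUMENT)
print(customer_ids)
```

Note that this still requires the document to be well-formed XML. Tolerating unknown tags is not the same as tolerating an unspecified format, which is the distinction the question is probing.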

Andy Hunt: We mean using partial knowledge of the format in a forensic
sense. You want to go back and dig out account numbers. If the data is tagged such that you
can see which pieces of data are account numbers, it becomes a much easier job than just
having to dig through a bunch of numbers.
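
To make the forensic scenario concrete, here is a minimal Python sketch of digging account numbers out of a dump whose overall layout is unknown. The dump, the <account> tag name, and the other fields are all assumptions made up for illustration. The only knowledge the code relies on is that account numbers are wrapped in that one tag.

```python
import re

# Hypothetical data dump. We do not know what "rec=", "bal=", or <region>
# mean, and we do not need to: we only recognize the <account> tag.
DUMP = """\
rec=0001 <name>A. Smith</name> <account>889-2231</account> bal=1200.50
rec=0002 <name>B. Jones</name> ???? <account>104-7765</account>
rec=0003 <name>C. Wu</name> <region>EU</region>
"""

# Non-greedy match so each tag pair is captured separately.
ACCOUNT_TAG = re.compile(r"<account>(.*?)</account>")


def extract_account_numbers(text):
    """Return the contents of every <account>...</account> pair, in order."""
    return ACCOUNT_TAG.findall(text)


accounts = extract_account_numbers(DUMP)
print(accounts)
```

Without the tags, the same job would mean guessing which of the numeric fields in each record is the account number.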

Bill Venners: So the metadata makes the data itself more programmer-friendly. I
don't have to go to the manual. It's like there's a miniature, really terse manual in
the data itself.

Dave Thomas: Yes, and I think you're also assuming there's a manual.

Bill Venners: Well, that's part of what I'm asking. How often is there no manual?

Dave Thomas: Most of the time there is no manual. If I give you a Word 1 file,
where's the manual? If I ship you the output of my stock controller system, where's the
manual? If I'm gone, if my program's gone, what are you going to do with that file? There
are terabytes of data sitting around in an unusable state, because the software that reads
them is gone. Yes, you could probably sit there and reverse engineer it, but it would be a
whole lot easier to reverse engineer it if it were self-describing.