What I Really Meant to Say II

A few weeks ago, I spoke at the No Fluff Just Stuff conference (highly recommended, by the way). During one of the speaker panels, one of my answers got blogged. Since the blogged answers were probably representative of what I said, but not of what I was thinking, I'm now taking the chance to correct the record. In my last post, I explained why programming languages are unlikely to change much in the next 5 years.

Today, I'd like to address the second part of what I said. I said something like "Instead of changes in programming languages, there are going to be changes in libraries. The biggest of which is that we're all going to be worrying about how to deal with huge masses of semi-structured and slightly incompatible information."
And then I put a plug in for GATE.

Emad Benjamin blogged this and then wrote "I think he is walking the google line here."

Which is interesting, but, I think, false. I tend to doubt that Google is walking this particular line. And ownership of the line in question really belongs to Jaron Lanier anyway. (I also highly recommend Jaron's talks. If he's in your area and speaking, GO.)

Let's talk about the line. Here are some premises:

Today, approximately 50 years after the first computer programs were written, the vast majority of systems are still standalone.

In fact, most business systems are silos.

SQL is all about being able to access data, no matter which RDBMS it's in.

CORBA was all about interoperability.

The web succeeded in a phenomenal way for a lot of reasons. But at least one of them was the fact that, finally, people could actually access each other's information.

Suites like Microsoft Office use data compatibility as a powerful sales force. If enough people upgrade, then everyone else has to as well (Office used to do this; other products still do).

Data warehousing is really all about data integration.

Most XML usage is about being able to access other people's data.

Web services were adopted because of interop.

Web services continue to evolve in the direction of increased interoperability (coarse-grained service-oriented architectures are better for interop than SOAP).

In fact, I would claim that most, not all but most, of the major trends in Enterprise Software (which is, for the most part, where the money is) for the past 20 years have been either primarily or largely concerned with interoperability.

Now, if you think about it, there are two possible reasons for this. One is that we, as an industry, know how to do interoperability really well, and we're sticking to what we're good at. Just like a shoemaker who is really good at making wingtips, and therefore only makes wingtips, we only do interoperability.

I don't really buy that one. Do you?

The other is that interoperability is one of the major problems with computer systems today. After a couple dozen industry-wide shared solutions, most applications still can't share data or work together in any meaningful way.

Now, you could say "as soon as we all adopt the same set of XML formats for all our business processes and all use loosely-coupled service-oriented architectures, this problem will go away." And I will cheerfully smile and nod my head and tell you that, by golly, you have a point there. And I will also make a mental note: do not buy software from this guy.

Instead of the panacea du jour, I will put forth the following deeply skeptical (of current solutions) and highly optimistic (about future technology) proposition:

We will get out of the interoperability mess when a piece of software can run across a file format it has never seen before and, perhaps with an end-user (not a programmer) answering a few simple questions, figure out if the file contains useful data, what that useful data is, and be able to handle files in the same format later without any additional help.

Then, and only then, will we start to make a dent in the interop problem.
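
To make the proposition a little more concrete, here is a toy sketch in Python. It is nowhere near the fluency described above, and it is not how GATE or Lucene work. It assumes the unknown file is delimited text with a header row, and the names `guess_delimiter`, `FormatMemory`, and `ask_user` are invented for illustration. What it shows is the shape of the loop: guess the structure, ask the end user one simple question the first time, and handle the same format later with no further help.

```python
import csv
import io

# Assumption: the unknown file is plain delimited text with a header row.
CANDIDATE_DELIMITERS = [",", "\t", ";", "|"]


def guess_delimiter(text):
    """Return the first candidate that appears equally often (and at least
    once) on every non-empty line, or None. Quoting is ignored in this sketch."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for delim in CANDIDATE_DELIMITERS:
        counts = {ln.count(delim) for ln in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return delim
    return None


class FormatMemory:
    """Remembers, per format signature, which columns a user said were useful."""

    def __init__(self, ask_user):
        self.ask_user = ask_user  # hypothetical callback: column names -> wanted names
        self.known = {}           # (delimiter, header tuple) -> wanted column names

    def extract(self, text):
        delim = guess_delimiter(text)
        if delim is None:
            return None  # not something this sketch can recognize
        rows = [r for r in csv.reader(io.StringIO(text), delimiter=delim) if r]
        header, body = rows[0], rows[1:]
        signature = (delim, tuple(header))
        if signature not in self.known:
            # First encounter: ask the end user once, then remember the answer.
            self.known[signature] = self.ask_user(header)
        wanted = self.known[signature]
        index = [header.index(name) for name in wanted]
        return [{header[i]: row[i] for i in index} for row in body]


# Hypothetical usage: the "end user" is a stub that always picks the same columns.
questions_asked = []


def stub_user(columns):
    questions_asked.append(columns)
    return ["name", "email"]


memory = FormatMemory(stub_user)
first = memory.extract("id;name;email\n1;Ada;ada@example.org\n")
second = memory.extract("id;name;email\n2;Bob;bob@example.org\n")
print(first, second, len(questions_asked))
```

The second file is handled without asking anything, because its signature was already seen. The hard part, of course, is everything this sketch assumes away: formats that are not tabular, headers that do not match exactly, and data whose meaning is not evident from its layout.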

And I think that by 2009 we will be starting to approach this level of fluency. Programs will simply be able to handle any data we throw at them, in a reasonable and robust manner. And tools like Lucene and GATE (and JTidy, for that matter), along with really fast CPUs and tons of memory, are a very good start.

Was that walking the google line?

2 Comments

simon_hibbs
2004-08-06 03:01:24

I'm afraid I'm sceptical
Interop of this kind is very format sensitive because that's an intrinsic characteristic of that kind of work, not because computers are particularly bad at doing it. Even human clerks can be completely thrown if given information on an out-of-date form, or a form from a different organisation, or with the fields filled out in a way they're not familiar with. That's not because they're stupid. It's because such things introduce genuine ambiguity: the format of the data is based on assumptions that aren't always evident from the format itself.

Natural language parsing (GATE) is great, but even we humans don't use natural language to communicate format-critical and high-value data between us. We use special formal sub-languages, structured forms and special data types, and did so long before computers came along, because natural language itself doesn't cut it. Natural language parsing will have virtually no impact on the kinds of enterprise interoperability you talk about for this reason.

There's a lot of money in it because it's an important problem that also happens to be intrinsically very difficult.

What we need is a way to formalise and express the assumptions that go into the definition of a specific data format. If that can be done in a standard way, then that would be a genuine advance. It would only work if it is based on rigorously defined standards, and even then if you don't have the format description, a file of that format is still going to be next to useless. Of course, XML could be a huge help here.

Simon Hibbs

wegrosso
2004-08-07 08:28:37

I'm afraid I'm sceptical
Thanks much for the reply. I both agree and disagree. To a certain extent, you're right, of course. We do use structured language to communicate precise information.

I'm not convinced it's because we couldn't use natural language. A filled-out form could often easily be converted into a rather dull natural-language essay ("Then, when I was 18, oh this must have been about 1985, I went to college at SUNY Stony Brook").

The problem isn't that natural language isn't expressive, it's that it isn't concise and we (humans) have to hunt down the important information. Forms structure the information so it's easy to find.

Past that, I wasn't claiming that we would replace other formats with natural language (though that's a fun thought). What I was driving at was something like: ASCII representations and XML, combined with information extraction techniques, make it more possible for programs to be robust in the face of unexpected or unknown data formats.

I'm not sure how far we'll get along this path. Your skepticism, or rather your faith that we won't get anywhere, may turn out to be well-founded.

In which case, we'll keep solving interoperability.
