Posted by Cliff on Monday May 01, 2000 @12:32AM
from the more-document-formats-than-you-can-shake-a-stick-@ dept.

Cactvs142 asks: "Have to make a choice for writing my science report in chemistry. I want to use XML and end up with PDF, so Word (etc.) and LaTeX fail. It comes down to this choice: DocBook or TEI.
DocBook is used by O'Reilly, but TEI already supports MathML. Which one is the better choice?"

Well, if you know one of them but not the other, it's probably a good idea to use the one you know. It's been a while since chemistry, but I don't think you use many math symbols in it (unless this is really advanced chemistry aka physics), so TEI's math abilities don't seem much of a bonus.

And if you do have a lot of math, LaTeX seems a hell of a lot nicer than the example W3C gives for formatting x**2 + 4x + 4 = 0. In MathML, [hopefully /. won't fuck this up] it's:

If you actually have to do serious math in this paper, I'd consider you nuts for doing it in anything but LaTeX. If not, well, personally I like DocBook but having never used TEI I won't recommend anything.
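To make the size difference concrete, here is a minimal sketch that builds presentation MathML for x**2 + 4x + 4 = 0 and sets it beside the LaTeX source. The markup is a hand-written approximation, not the W3C example the post refers to, and the namespace declaration is left out for brevity; Python is used only so the result can be checked for well-formedness.

```python
import xml.etree.ElementTree as ET

# LaTeX source for the equation (inline math mode).
LATEX = r"$x^2 + 4x + 4 = 0$"


def mathml_quadratic() -> str:
    """Build presentation MathML for x^2 + 4x + 4 = 0.

    Hand-written approximation; the namespace declaration is omitted.
    """
    math = ET.Element("math")
    row = ET.SubElement(math, "mrow")

    def add(tag, text, parent=row):
        # Append a leaf element such as <mi>x</mi> to the given parent.
        el = ET.SubElement(parent, tag)
        el.text = text
        return el

    sup = ET.SubElement(row, "msup")  # x squared: base, then exponent
    add("mi", "x", sup)
    add("mn", "2", sup)
    add("mo", "+")
    add("mn", "4")
    add("mo", "\u2062")               # invisible multiplication sign
    add("mi", "x")
    add("mo", "+")
    add("mn", "4")
    add("mo", "=")
    add("mn", "0")
    return ET.tostring(math, encoding="unicode")


MATHML = mathml_quadratic()

if __name__ == "__main__":
    print(len(LATEX), "characters of LaTeX:", LATEX)
    print(len(MATHML), "characters of MathML:", MATHML)
```

Every identifier, number, and operator gets its own element in MathML, which is why the markup runs to several times the length of the LaTeX.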

I'm curious what advantages using XML offers. LaTeX can be made to produce very nice-looking PDFs (or at least, LaTeX can be made to produce nice-looking PostScript, which can then be distilled into a PDF), and LaTeX is available for pretty much any platform and any text editor.

I'm not trying to start any jihads; just curious what the other side has to offer, as right now, I'm pretty happy with LaTeX.

(As an aside, it did take a little work to get dvips to use good fonts on my system; your mileage may vary.)

XML is keen, sure, but so is SGML, and look how it's taken off. Both have their places, although those roles are probably a lot more limited than those being proposed these days. I can see investing in SGML and XML for writing documentation that has to be worked on by large numbers of people or live for a long time, but for anything else I can't quite see the point.

I have to imagine that a big objection to LaTeX is that LaTeX source code doesn't do enough to separate content from appearance, because it doesn't have generic commands to denote specific kinds of content. That's a cop-out, though, because it would be fairly trivial to create a standard set of such commands (with a new document class or package), and using them consistently would even allow you to translate your documents to SGML or XML at some later date if you had reason to do so. And it would be a lot easier on authors, as well.
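As a rough illustration of that translation step, the sketch below turns a few semantic LaTeX commands into XML elements. The command names (\compound, \apparatus, \term) and the <para> wrapper are hypothetical, invented for this example; they do not come from any existing document class or DTD, and a real converter would need a proper TeX parser rather than a regular expression.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical semantic commands mapped to hypothetical element names.
# None of these are standard LaTeX or part of any real DTD.
SEMANTIC = {"compound": "compound", "apparatus": "apparatus", "term": "term"}

# Matches a known command followed by one brace group with no nested braces.
_PATTERN = re.compile(r"\\(%s)\{([^{}]*)\}" % "|".join(SEMANTIC))


def latex_to_xml(source: str) -> str:
    """Wrap known semantic commands in XML elements; escape everything else."""
    out = []
    pos = 0
    for m in _PATTERN.finditer(source):
        out.append(escape(source[pos:m.start()]))   # text before the command
        tag = SEMANTIC[m.group(1)]
        out.append("<%s>%s</%s>" % (tag, escape(m.group(2)), tag))
        pos = m.end()
    out.append(escape(source[pos:]))                # trailing text
    return "<para>" + "".join(out) + "</para>"


if __name__ == "__main__":
    print(latex_to_xml(r"Add \compound{NaCl} to the \apparatus{beaker} & stir."))
```

The translation only works because the commands say what the text is rather than how it should look; a document full of \textbf and \vspace gives a converter nothing to map.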

One of the big problems I see with XML/SGML is that the idea of their separating content from appearance has taken hold to the point that people who haven't actually tried to use them don't realize that for each DTD, you need one or more separate applications to translate documents written using that DTD into other, more directly usable formats (such as TeX, LaTeX, HTML, PostScript, or PDF). There is no ``XMLtoPDF'' or ``SGMLtoHTML'' tool. Instead you have DTD-specific tools such as the debiandoc2format tools on my system. I can't quite see how this system is much different, let alone better, than having to use a TeX system to translate LaTeX source code into other formats. I can't even see how it's much better than the situation that existed when people were using AMS-TeX, AMS-LaTeX, LaTeX 2.09, and Plain TeX to write papers, all of which were incompatible to one degree or another, even though they were based on the same core system.

SGML and XML have a long way to go on the road to decent tools for writing documents using their DTDs, as well. Psgml mode for Emacs is swell, but it's not even close to being as useful as AUCTeX is for LaTeX, or even the most basic HTML editors are for HTML. Short of something like Frame, I think SGML needs some serious tool development before it's ready to be a serious contender.

LaTeX, on the other hand, has lots of quite good tools to help you write it, ranging from source-code-oriented tools such as AUCTeX to WYSIWYG tools such as LyX, Textures, and Scientific Word.

Finally, the idea that appearance is unimportant is ludicrous. Maybe someday in the distant future, when everything is viewed in some sort of electronic pad (à la Star Trek), SGML and XML will come into their own. Now, however, the ultimate page-display technology is paper, and I have yet to see a document written in SGML or XML that produced decent printed output without extensive tinkering. Whether that fact can best be explained by accidental weaknesses in the applications doing the translations or deliberate weaknesses imposed by the idea that printed output is irrelevant, I can't say. But attention to the end appearance of documents is an important area that will have to be addressed before I can seriously consider doing all, or even much, of my writing using SGML or XML.

I'm currently a college senior, finishing up a triple major in music, philosophy, and CS, and I've written ALL of my college papers in LaTeX (even music history papers, complete with typeset musical examples). It is far superior to DocBook, which, although nice for what it does, is a little crocky and has nasty syntax. (You can also write a legal DocBook document which is not legal SGML or XML.) LaTeX gives you far more control.

LaTeX is easily extensible and the math commands are unparalleled. In addition, you can make PDF files directly with pdflatex, a standard part of the teTeX distribution (which comes with Red Hat).

SGML's "takeoff" is pretty well established, friend. You would be hard pressed to find a professional, hot-type emulating typesetting system which does not speak SGML. And as a basis for a solid, enterprise-level document management system, its fundamental constructs (and by extension XML's) are unparalleled. Largely thanks to SGML's age (20+ years), these tools are typically proprietary and originally written for minis. Ports to current hardware are thin on the ground thanks to the advent of "desktop publishing" and the triumph of the "good-enough" school of page layout. You may have noticed the diminishing quality of type layout in major magazines and newspapers in the last decade, or perhaps it's just me and the other old dogs of the typesetting trade. I digress. The fact is that outside of scientific publishing, where TeX and its children rightfully rule, tools for generating PostScript all speak SGML fluently.

...it would be fairly trivial to create a standard set of such commands (with a new document class or package), and using them consistently...

This is the trivial point of markup languages in general. Agreed that for the exercise of personal writing the overhead of XML/SGML validating and processing is absurd, but in any environment requiring the repurposing of content and/or extensive, multi-authored revisions, the structure/style distinction is vital.

The balance of your critique is specific to the desktop platforms and Linux in particular, where the demand for professional-quality type layout and DMS is somewhat limited and, as you say, the popularity and quality of the TeX tools have discouraged innovation. As to the general transformation of arbitrary SGML/XML, have a look at the W3C's XSLT specification and particularly its implementation at xml.apache.org [apache.org].
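The division of labor behind that suggestion, one generic engine plus a separate set of rules for each vocabulary, can be sketched without an XSLT processor. What follows is not XSLT and not how the Apache tools work; it is a toy tree-renaming engine written against Python's standard library. The DocBook element names are real, but the mapping to HTML is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

# Per-vocabulary rules: source element name -> HTML element name.
# article, title, para, and emphasis are real DocBook elements; this
# particular mapping is an assumption made for the example.
DOCBOOK_TO_HTML = {"article": "div", "title": "h1", "para": "p", "emphasis": "em"}


def transform(node, rules):
    """Generic engine: copy the tree, renaming each element per the rules.

    Elements with no rule fall back to <span> so their text is not lost.
    """
    out = ET.Element(rules.get(node.tag, "span"))
    out.text = node.text
    out.tail = node.tail
    for child in node:
        out.append(transform(child, rules))
    return out


def to_html(xml_source, rules):
    """Parse the source, transform it, and serialize the result."""
    return ET.tostring(transform(ET.fromstring(xml_source), rules), encoding="unicode")


if __name__ == "__main__":
    doc = "<article><title>Report</title><para>Mix <emphasis>slowly</emphasis>.</para></article>"
    print(to_html(doc, DOCBOOK_TO_HTML))
```

Supporting another DTD means writing another rules table, not another program, which is roughly the answer a stylesheet language offers to the complaint that every DTD needs its own converter.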

I can't quite see how this system is much different, let alone better, than having to use a TeX system to translate LaTeX source code into other formats.

Neither more nor less, but for LaTeX file formats being proprietary to a particular layout/pagination engine. The dearth of quality free editors should remedy itself shortly. AUCTeX exists thanks to the desire to simplify TeX/LaTeX.

Lastly, it is not that appearance is unimportant, quite the contrary. It is important enough to deserve treatment separate from the management of content. The last time you opened a trade publication of almost any sort, including your O'Reilly books, a financial prospectus, any printed material not devoted to marketing or enslaved by graphic content (which sadly includes most newspapers and magazines, it seems) you saw a document whose content was managed with SGML. Interestingly, many of the latter types of publications which rushed to tools like Quark are now clamoring for XML compliance from these vendors in order to manage their content and streamline its repurposing for the Web.

SGML's "takeoff" is pretty well established, friend. You would be hard pressed to find a professional, hot-type emulating typesetting system which does not speak SGML. And as a basis for a solid, enterprise-level document management system, its fundamental constructs (and by extension XML's) are unparalleled. ... The balance of your critique is specific to the desktop platforms and Linux in particular, where the demand for professional-quality type layout and DMS is somewhat limited and, as you say, the popularity and quality of the TeX tools have discouraged innovation.

I think it would be fair to say that all of my critique is specific to desktop platforms, and not just to Linux, either. SGML and XML clearly have a role to play in the world of professional publishing and ``enterprise-level document management systems'', and I acknowledged that role up front. But the specific question asked by Cactvs142 dealt with whether he should write his lab reports using DocBook or TEI, which hardly falls into the areas you describe.

In the world of professional publishing, I suspect that you have either proprietary (and expensive!) software to make creating SGML/XML content not much more complex than writing with a standard word processor (I'm thinking of tools such as Frame, or newspaper systems that provide the user with a terminal complete with specialized function keys to apply styles, etc.) or professionals who are both well versed in and happy to use the raw interfaces. I imagine the same is true for people who have to work with ``enterprise-level document management systems''.

But working with SGML and XML on desktop systems today means working with their raw forms: what tools there are require users to get intimately involved with the internal structures of the systems -- applying tags by hand in a text editor and running obscure command-line tools to generate usable output.

In the world of the average user (corporate or home), however, even command-line interfaces can be scary. People want (and rightly so) an interface like Word's. Users don't want to (or can't) remember obscure tags; they want to write their memos, letters, essays, or lab reports and have them look the way they want, without having to do anything more complicated than click on a few buttons and choose some menu items.

Even in the world of the hacker, the current tools for SGML and XML make it harder to use than it needs to be (for instance, it would be nice if psgml-mode switched to a DocBook mode when the DTD in use is DocBook, with a menu containing DocBook tags).

The only tool I'm aware of for editing XML that looks like it might be heading in the right direction (based on the screenshot alone -- the demo isn't much different, given that it doesn't allow you to actually change a document you're viewing) is Conglomerate [conglomerate.org], and it isn't even close to being ready for public release. If you know of other user-friendly tools, please tell us -- the tools I see at xml.apache.org [apache.org] are all server-oriented, not authoring tools by any stretch of the imagination, and the same seems to be true after a quick search on freshmeat [freshmeat.net].

When it comes to separating content from appearance, I'm all for it. I do so with both my HTML and LaTeX code. But if SGML/XML editors are rare, tools for creating and editing the style sheets that govern the appearance of those SGML/XML documents don't even seem to be on the drawing board. (Editing text files doesn't count as a user-friendly interface.)

Finally, as for the existence of ``professional, hot-type emulating typesetting systems [that] speak SGML'', it's great that those systems are out there, but just because such systems are available doesn't mean they're universally (or even commonly) used. I'm wrapping up the editing of a book to be published by a fairly well-known technical publisher (a division of a very well-known publisher). The book was written using LaTeX, but the publisher doesn't even want to see PostScript, let alone the LaTeX source; instead, they want the author to print the book out and send them the printed copy. For his last book, the author only had access to a 600 dpi printer, and the quality of the finished book shows that lack. This book will, at least, be printed using a 1200 dpi printer. If a major publishing house is still publishing commercial books in this primitive way, can anyone seriously expect the average business or home user to switch to using SGML or XML for managing far less complex documents?

I couldn't agree more. My intent was a defense of the importance of the concepts behind markup theory and the fact that such theory has been broadly applied. I certainly don't think the overhead of such a system is warranted for SOHO or personal computing. I do think that it would be beneficial for the typical desktop word-processing/publishing apps to speak some form of *ML, always keeping in mind the tradeoff between the user's desire to apply one-off local formatting to get the job done and the need to do something else with the data later.

I cannot argue with the lack of functional, friendly editors for either *ML or its stylesheets. Have we identified a need? Is there a place for a WYSIWYG word processor which manages the tension between local formatting and reuse for the user, which integrates and abstracts away the management of DTD/schemas and stylesheets? Maybe, maybe not. Does our friend want to post his chemistry paper to the web, archive it? Does giving LyX an *ML fluent backend accomplish this goal?

I find it absolutely incredible that a publisher is shooting plates from 600dpi laser output, although I can probably guess the publisher.

I cannot argue with the lack of functional, friendly editors for either *ML or its stylesheets. Have we identified a need? Is there a place for a WYSIWYG word processor which manages the tension between local formatting and reuse for the user, which integrates and abstracts away the management of DTD/schemas and stylesheets? Maybe, maybe not. Does our friend want to post his chemistry paper to the web, archive it? Does giving LyX an *ML fluent backend accomplish this goal?

Well, some applications do keep their documents in an SGML or XML format: AbiWord [abisource.com] saves files with an abw extension that are (according to file) ``exported SGML document text''. AbiSource's FAQ [abisource.com] states that their native file format is XML (and gives some reasons and a link to their DTD [abisource.com]). A brief example document might look like the following:

<abiword>

<section>

<p style="Plain Text">This text is in the plain text style, which is kind of like \texttt. Bizarre.</p>

<p style="Normal"></p>

<p style="Block Text">And this text is in the block text style, which I'm guessing is like quotation or quote or <blockquote>.</p>

<p style="Normal"></p>

<p style="Normal">Looks like I was right.</p>

</section>

</abiword>
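One way to see what a file like this actually records is to parse it and list each paragraph's style attribute. The sketch below does that with Python's standard library on a trimmed copy of the example; the bare <abiword> and <section> wrappers, with no attributes, are an assumption about the parts of the file not shown here.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example document. The attribute-free <abiword> and
# <section> wrappers are assumed; a real file likely carries more detail.
ABW = """<abiword><section>
<p style="Plain Text">This text is in the plain text style.</p>
<p style="Normal"></p>
<p style="Block Text">And this text is in the block text style.</p>
<p style="Normal"></p>
<p style="Normal">Looks like I was right.</p>
</section></abiword>"""


def paragraph_styles(source):
    """Return (style, text) for every <p> element, in document order."""
    root = ET.fromstring(source)
    return [(p.get("style"), (p.text or "").strip()) for p in root.iter("p")]


if __name__ == "__main__":
    for style, text in paragraph_styles(ABW):
        print("%-12s %s" % (style, text or "(empty)"))
```

Every element is a p, and the only thing distinguishing one from another is a named style, so the XML describes formatting choices rather than document structure.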

AbiWord's interface is very Word-like, including rulers for margin adjustment and toolbars with buttons to open, save, and close files; set text in multiple columns; make text italic, bold, underlined, and so forth; and set justification. As such, it's very useful for people moving from other computing platforms who are looking for replacements for their commercial word-processing applications, but that also means that it's not ideal as a structured SGML/XML editor (without a significant amount of additional work, at any rate).

Other applications, such as Dia [lysator.liu.se], a diagramming application, save their output in recognizable XML (file says ``XML 1.0 document text'', and the document looks like XML when you view it in a text editor).

Finally, as I mentioned previously, Conglomerate [conglomerate.org] is specifically designed to be an XML editor, but isn't really available yet. Conglomerate looks like an SGML editor -- it clearly shows the tagging of various bits of text.

In any case, I'm not sure I'm convinced that a WYSIWYG editor is the best approach -- I think some sort of simplified syntax (such as that provided by LaTeX) that could easily be translated to SGML or XML might be a better solution. I like the ideas shown in the screenshots of Conglomerate [conglomerate.org] because they make the structural and other markup very clear without the illusion of control that would be provided by an interface such as AbiWord's. What I think is needed is an interface that would make the structure and other tags clear, and allow tags to be inserted by typing them directly (presumably with some macro features) or selecting from a mutable menu or palette (that would only present viable options for the selected text or cursor location). Although I hate to say it (because I hate them), I don't know that ``wizards'' would be entirely out of place for help with some complicated constructs.

I do think that you're right to believe that such tools will evolve themselves into existence, and that, as they do, more and more people may find themselves using SGML or XML without even realizing that they are. Time will tell.

I find it absolutely incredible that a publisher is shooting plates from 600dpi laser output, although I can probably guess the publisher.

That makes two of us. I was surprised that they didn't want the LaTeX source, shocked when I found out they didn't want the PostScript, and horrified when I found out what they actually used.