Sex, software, politics, and firearms. Life's simple pleasures…

Draining the manual-page swamp

One of my long-term projects is cleaning up the Unix manual-page corpus so it will render nicely in HTML.

The world is divided into two kinds of people. One kind hears that, just nods and says “That’s nice,” having no idea what it entails. The other kind sputters coffee onto his or her monitor and says something semantically equivalent to “How the holy jumping fsck do you think you’re ever going to pull that off?”

The second kind has a clue. The Unix man page corpus is scattered across tens of thousands of software projects. It’s written in a markup – troff plus man macros – that is a tag soup notoriously resistant to parsing. The markup is underspecified and poorly documented, so people come up with astoundingly perverse ways of abusing it that just happen to work because of quirks in the major implementation but confuse the crap out of analysis tools. And the markup is quite presentation-oriented; much of it is visual rather than structural and thus difficult to translate well to the web – where you don’t even know the “paper” size of your reader’s viewer, let alone what fonts and graphics capabilities it has.

Nevertheless, I’ve been working this problem for seventeen years and believe I’m closing in on success in, maybe, another five or so. In the rest of this post I’ll describe what I’m doing and why, so I have an explanation to point to and don’t have to repeat it.

First, how we got where we are. Unix documentation predates even video terminals. When the first manual page was written in the very early 1970s, the way you displayed stuff was to print it – either on your teletype or – slightly later – a phototypesetter.

Crucially, while the phototypesetter could do fairly high-quality typesetting with multiple fonts and kerning, the teletype was limited to a single fixed-width font. Thus, from nearly the beginning, the Unix documentation toolchain was adapted to two different output modes, one assuming only very limited capability from its output device.

At the center of the “Unix documentation toolchain” was troff (for phototypesetters) and its close variant nroff (for ttys). Both interpreted a common typesetting language. The language is very low-level and visually oriented, with commands like “insert line break” and “change to specified font”. Its distinguishing feature is that (most) troff requests are control words starting with a dot at the beginning of a line; thus, “insert line break” is “.br”. But some requests are “escapes”, begun with a backslash and placed inline; thus, “\fI” means “change to italic font”.
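As a rough illustration of that split, here is how a toy Python tokenizer might separate dot requests from inline escapes. This is a sketch for exposition only, not any real troff implementation; real troff also honors the ' control character, comment lines, and much else.

```python
import re

def classify_troff(line):
    """Classify one troff source line: dot request vs. running text.

    Toy sketch: a leading dot marks a request; everything else is
    text that may carry inline escapes such as \\fI (switch to italic).
    """
    if line.startswith("."):
        fields = line[1:].split()
        return ("request", fields[0] if fields else "")
    # Inline font escapes ride along inside running text.
    return ("text", re.findall(r"\\f[A-Z]", line))

print(classify_troff(".br"))
print(classify_troff(r"now \fIitalic\fR again"))
```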

Manual pages were never written directly in troff. Instead, they were (and are) written mostly in macros expanded to sequences of troff requests by a preprocessor. Instead of being purely visual, many of these macros are structural; they say things like “start new paragraph” or “item in bulleted list”. I say “mostly” because manual pages still contain low-level requests like font changes.

Text-substitution macro languages have become notorious for encouraging all manner of ingenious but ugly and hard-to-understand hackery. The troff language helped them get that reputation. Users could define their own macros, and sometimes did. The design encouraged visual microtweaking of pages to get the appearance just right – provided you knew things like your paper size and the font capabilities of your output device exactly. In the hands of an expert troff could produce spare, elegant typesetting that still looks good decades later.

By 1980 there was already a large corpus – thousands, at least – of manual pages written in troff markup. The way it was rendered was changing, however.

First, ttys were displaced by tube terminals – this was in the late 1970s, around the time I started programming. nroff was quickly adapted to produce output for these, which is why we still use the “man” command in terminal emulators today. That’s nroff behind it turning man-page markup into fixed-width characters on your screen.

Not long after that, people almost completely stopped printing manual pages. The payoff from cute troff tricks declined because tube terminals were such a limited rendering device. This encouraged a change in the way people wrote them – simpler, with less purely visual markup, more structural. Today there’s a noticeable gradient in markup complexity by age of the page – newer ones tend to be simpler, and you almost never see the really old-school style of elaborate troff tricks outside of the documentation of GNU troff itself.

Second, in the early 1980s, laser printers and Postscript happened. Unix man pages themselves changed very little in response because nroff-to-terminals had already become so important, but the entire rest of the range of troff’s use cases simplified to “generate Postscript” over the next decade. Occasionally people still ask it to emit HP’s printer language; that’s about the only exception left. The other back-end typesetting languages troff used to emit are all dead.

But the really big disruption was the World Wide Web.

By about 1997 it was becoming obvious that in the future most documentation would move to the web; the advantages of the hyperlink were just too obvious to ignore. The new wave in documentation markup languages, typified by DocBook, was designed for a Web-centric world in which – as with nroff on terminals – your markup can’t safely make a lot of assumptions about display size or fonts.

To deal with this, the new document markup languages were completely structural. But this created a huge problem. How were we going to get the huge pile of grubby, visually-marked-up Unix man pages into purely structural markup?

Yes, you can translate a straight visual markup into a sort of pidgin HTML. That’s what tools like man2html and troff2html do. But this produces poor, ugly HTML that doesn’t exploit the medium well. One major thing you lose is tables. The man pages of these tools are full of caveats and limitations. Basically, they suck.

Trying to jawbone every project maintainer in the world into hand-converting their masters to something web-friendly seemed doomed. What we really needed was mechanical translation from structural man macros (including table markup) to a structural markup.

When I started thinking about this problem just after Y2K, the general view among experts was that it was impossible, or at least infeasibly hard barring strong AI. Trying to turn all that messy, frequently malformed visual tag soup into clean structure seemed like a job only a human could handle, involving recognition of high-level patterns and a lot of subtle domain and context knowledge.

Ah, but then there was (in his best Miss Piggy voice) moi.

I have a background in AI and compiler technology. I’m used to the idea that pattern-recognition problems that seem intractable can often be reduced to large collections of chained recognition and production rules. I’ve forgotten more about writing parsers for messy input languages than most programmers ever learn. And I’m not afraid of large problems.

The path forward I chose was to lift manual pages to DocBook-XML, a well-established markup used for long-form technical manuals. “Why that way?” is a reasonable question. The answer is something a few experiments showed me: the indirect path – man markup to DocBook to HTML – produces much better-quality HTML than the rather weak direct-conversion tools.

But lifting to DocBook-XML is a hard problem, because the markup used in man pages has a number of unfortunate properties even beyond those I’ve already mentioned. One is that the native parser for it doesn’t, in general, throw errors on ill-formed or invalid markup. Usually such problems are simply ignored. Sometimes they aren’t ignored, but instead produce defects that are hard for a human reader scanning quickly to notice.

The result is that manual pages often have hidden cruft in them. That is, they may render OK but they do so essentially by accident. Markup malformations that would throw errors in a stricter parser pass unnoticed.

This kind of cruft accumulates as man pages are modified and expanded, like deleterious mutations in a genome. The people who modify them are seldom experts in roff markup; what they tend to do is monkey-copy the usage they see in place, including the mistakes. Thus defect counts tend to be proportional to age and size, with the largest and oldest pages being the cruftiest.

This becomes a real problem when you’re trying to translate the markup to something like DocBook-XML. It’s not enough to be able to lift clean markup that makes structural sense; you have to deal with the accumulated cruft too.

Another big one, of course, is that (as previously noted) roff markup is presentational rather than semantic. Thus, for example, command names are often marked by a font change, but there’s no uniformity about whether the change is to italic, bold, or fixed width.

XML-DocBook wants to do structured tagging based on the intended semantics of text. If you’re starting from presentation markup, you have to back out the intended semantics based on a combination of cliche recognition and context rules. My favorite tutorial example: a string marked by a font change and containing “/” gets wrapped in a DocBook filename tag if the name of the enclosing section is “FILES”.
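That tutorial rule is easy to sketch in Python. This is a toy illustration only; the function name and the assumption that \fI…\fR marks the filename are mine, not doclifter’s actual internals.

```python
import re

def lift_filenames(section_name, text):
    """Toy version of the FILES-section cliche rule described above."""
    if section_name.upper() != "FILES":
        return text

    def tag(match):
        content = match.group(1)
        if "/" in content:
            # The font-change cliche plus a "/" plus the FILES context
            # together justify semantic markup.
            return "<filename>" + content + "</filename>"
        return match.group(0)

    # \fI...\fR (italic) is one common cliche for marking filenames.
    return re.sub(r"\\fI(.*?)\\fR", tag, text)

print(lift_filenames("FILES", r"\fI/etc/passwd\fR holds accounts"))
```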

But different people chose different cliches. Sometimes you get the same cliche used for different semantic purposes by different authors. Sometimes multiple cliches could pattern-match to the same section of text.

A really nasty problem is that roff markup is not consistent (he said, understating wildly) about whether or not its constructions have end-of-scope markers. Sometimes it does – the .RS/.RE macro pair for changing relative indent. More often, as for example in font changes, it doesn’t. It’s common to see markup like “first we’re in \fBbold,\fIthen italic\fR.”

Again, this is a serious difficulty when you’re trying to lift to a more structured XML-based markup with scope enders for everything. Figuring out where the scope ends should go in your translation is far from trivial even for perfectly clean markup.
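One way to picture the scope-ender problem: the converter has to invent closing tags the roff source never wrote. A minimal Python sketch, with invented tag names rather than DocBook’s real ones, and ignoring \fP, numeric font positions, and the .B/.I macros:

```python
import re

FONT_TAGS = {"B": "emphasis-bold", "I": "emphasis-italic"}

def close_font_scopes(text):
    """Turn run-on font escapes into properly paired tags."""
    out, open_tag = [], None
    for piece in re.split(r"(\\f[A-Z])", text):
        if piece.startswith("\\f"):
            if open_tag:                      # invent the missing scope end
                out.append("</%s>" % open_tag)
                open_tag = None
            font = piece[2]
            if font in FONT_TAGS:             # \fR just ends the scope
                open_tag = FONT_TAGS[font]
                out.append("<%s>" % open_tag)
        else:
            out.append(piece)
    if open_tag:
        out.append("</%s>" % open_tag)
    return "".join(out)

print(close_font_scopes(r"first we're in \fBbold,\fIthen italic\fR."))
```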

Now think about how all the other problems interact with the cruft. Random undetected cruft can be lying in wait to confuse your cliche recognition and trip up your scope analyzer. In truth, until you start feeling nauseated or terrified you have not grasped the depth of the problem.

The way you tackle this kind of thing is: Bite off a piece you understand by writing a transformation rule for it. Look at the residuals for another pattern that could be antecedent to another transformation. Lather, rinse, repeat. Accept that as the residuals get smaller, they get more irregular and harder to process. You won’t get to perfection, but if you can get to 95% you may be able to declare victory.
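The bite-off-a-piece loop reduces to a toy engine like this: apply pattern-to-replacement rules until the text stops changing. The two rules here are invented examples for illustration, not doclifter’s actual rule set.

```python
import re

RULES = [
    (re.compile(r"\.br\n"), ""),                   # swallow bare line breaks
    (re.compile(r"\\fB(.*?)\\fR"), r"<b>\1</b>"),  # lift bold runs
]

def transform(text, rules=RULES):
    """Apply chained transformation rules to a fixed point."""
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            new = pattern.sub(replacement, text)
            if new != text:
                text, changed = new, True
    return text

print(transform(".br\n\\fBNAME\\fR"))
```

Whatever the rules leave untouched is the residual; you study it, write the next rule, and repeat.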

A major challenge is keeping the code structure from becoming just as grubby as the pile of transformation rules – because if you let that slide it will become an unmaintainable horror. To achieve that, you have to be constantly looking for opportunities to generalize and make your engine table-driven rather than writing a lot of ad-hoc logic.

It took me a year of effort to get to doclifter 1.0. It could do a clean lift to DocBook on 95% of the 5548 man pages in a full Red Hat 7.3 workstation install. (That’s a bit less than half the volume of the man pages on a stock Ubuntu installation in 2018.) The reaction of topic experts at the time was rather incredulous. People who understood the problem had trouble believing doclifter actually worked, and no blame for that – I’m good, but it was not a given that the problem was actually tractable. In truth even I was a little surprised at getting that good a coverage rate without hitting a wall.

Those of you a bit familiar with natural-language processing will be unsurprised to learn that at every iteration 20% of the remaining it-no-work pages gave me 80% of the problems, or that progress slowed in an inverse-geometric way as I got closer to 1.0.

In retrospect I was helped by the great simplification in man markup style that began when tube terminals made nroff the renderer for 99% of all man page views. In effect, this pre-adapted man page markup for the web, tending to select out the most complex and intractable troff features in favor of simple structure that would actually render on a tube terminal.

Just because I could, I also taught doclifter to handle the whole rest of the range of troff markups – ms, mm, me and so forth. This wasn’t actually very difficult once I had the framework code for man processing. I have no real idea how much this capability has actually been used.

With doclifter production-ready I had the tool required to drain the swamp. But that didn’t mean I was done. Oh no. That was the easy part. To get to the point where Linux and *BSD distributions could flip a switch and expect to webify everything I knew I’d have to push the failure rate of automated translation another factor of five lower, to the point where the volume of exceptions could be reasonably handled by humans on tight deadlines.

There were two paths forward to doing that. One was to jawbone project maintainers into moving to new-school, web-friendly master formats like DocBook and asciidoc. Which I did; as a result, the percentage of man pages written that way has gone from about 2% to about 6%.

But I knew most projects wouldn’t move, or wouldn’t move quickly. The alternative was to prod that remnant 5%, one by one, into fixing their crappy markup. Which I have now been doing for fifteen years, since 2003.

Every year or two I do a sweep through every manual page in sight of me, which means everything on a stock install of the currently dominant Linux distro, plus a boatload of additional pages for development tools and other things I use. I run doclifter on every single one, make patches to fix broken or otherwise untranslatable markup, and mail them off to maintainers. You can look at my current patch set and notes here.

I’ve had 579 patches accepted so far, so I am getting cooperation. But the cycle time is slow; there wouldn’t be much point in sampling the corpus faster than the refresh interval of my Linux distribution, which is about six months.

In a typical round, about 80 patches from my previous round have landed and I have to write maybe two dozen new ones. Once I’ve fixed a page it mostly stays fixed. The most common exception to that is people modifying command-option syntax and forgetting to close a “]” group; I catch a lot of those. Botched font changes are also common; it’s easy to write one of those \-escapes incorrectly and not notice it.

There are a few categories of error that, at this point, cause me the most problems. A big one is botched syntax in descriptions of command-line options, the simplest of which is unbalanced [ or ] in option groups. But there are other things that can go wrong; there are people, for example, who don’t know that you’re supposed to wrap mandatory switches and arguments in { } and use something else instead, often plain parentheses. It doesn’t help that there is no formal standard for this syntax, just tradition – but some tools will break if you flout it.
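The simplest of those checks, bracket balancing, can be sketched in a few lines of Python. This is a hypothetical lint pass; doclifter’s real synopsis parser is a full grammar, not a bracket counter.

```python
def synopsis_balanced(synopsis):
    """Check [ ] and { } nesting in a command synopsis line."""
    pairs = {"]": "[", "}": "{"}
    stack = []
    for ch in synopsis:
        if ch in "[{":
            stack.append(ch)
        elif ch in "]}":
            # A closer must match the most recent opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack  # leftover openers are also a defect

print(synopsis_balanced("foo [-a] [-b file] {-c | -d}"))
print(synopsis_balanced("foo [-a [-b file]"))
```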

A related one is that some people intersperse explanatory text sections in their command synopses, or follow a command synopsis with a summary paragraph. The proper boundary to such trailing paragraphs is fiendishly difficult to parse because distinguishing fragments of natural language from command syntax is hard, and DocBook markup can’t express the interspersed text at all. This is one of the very few cases in which I have to impose a usage restriction in order to get pages to lift. If you maintain a manual page, don’t do these things! If doclifter tells you “warning – dubious content in Synopsis”, please fix until it doesn’t.

Another bane of my life has been botched list syntax, especially from misuse of the .TP macro. This used to be very common, but I’ve almost succeeded in stamping it out; only around 20 instances turned up in my latest pass. The worst defects come from writing a “bodiless .TP”, an instance with a tag but no following text before another .TP or a section macro. This is the most common cause of pages that lift to ill-formed XML, and it can’t be fixed by trickier parsing. Believe me, I’ve tried…
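Detecting a bodiless .TP mechanically is straightforward, which is why it can at least be reported even though it can’t be parsed around. A sketch (the macro names follow man(7); the function itself is illustrative, not doclifter’s code):

```python
def bodiless_tps(lines):
    """Report .TP instances whose tag has no following body text."""
    defects = []
    for i, line in enumerate(lines):
        if line.startswith(".TP"):
            # .TP consumes the next line as its tag...
            j = i + 2  # ...so the body should start here.
            if j >= len(lines) or lines[j].startswith((".TP", ".SH", ".SS")):
                defects.append(i + 1)  # 1-based line number of the bad .TP
    return defects

page = [".SH OPTIONS", ".TP", "-a", ".TP", "-b", "turn on b mode"]
print(bodiless_tps(page))  # the first .TP (line 2) has no body
```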

Another big category of problems is people using very low-level troff requests that can’t be parsed into structure, like .in and .ce and especially .ti requests. And yet another is abuse of the .SS macro to embolden and outdent text that isn’t really a section heading.

But over time I have actually been succeeding in banishing a lot of this crap. Counting pages that have moved to web-friendly master formats, the percentage of man-page content that can go automatically to really high-quality HTML (with tables and figures and formulas properly carried along) is now over 99%.

And yes, I do think what I see in a mainstream Linux distro is a sufficiently large and representative sample for me to say that with confidence. Because I notice that my remaining 75 or so awkward cases are now heavily concentrated around a handful of crotchety GNU projects; groff itself being a big one.

I’ll probably never get it perfect. Some fraction of manual pages will always be malformed enough to terminally confuse my parser. Strengthening doclifter enough to not barf on more of them follows a pattern I’ve called a “Zeno tarpit” – geometrically increasing effort for geometrically decreasing returns.

Even if I could bulletproof the parser, perfection on the output side is hard to even define. It depends on your metric of quality and how many different rendering targets you really want to support. There are three major possibles: HTML, PostScript, and plain text. DocBook can render to any of them.

There’s an inescapable tradeoff where if you optimize for one rendering target you degrade rendered quality for the others. Man-page markup is an amphibian – part structural, part visual. If you use it visually and tune it carefully you will indeed get the best output possible on any individual rendering target, but the others will look pretty terrible.

This is not a new problem. You could always get especially pretty typography in man if you decided you cared most about the Postscript target and used troff as a purely visual markup, because that’s what it was designed for. But you take a serious hit in quality of generated HTML, because in order to do *that* right you need *structural* markup and a good stylesheet. You take a lesser hit in plain-text rendering, especially near figures and tables.

On the other hand, if you avoid purely visual markup (like .br, .ce, \h and \v) and emphasize structural tags, you can make a roff master that will render good HTML, good plain text, and something acceptable if mediocre on Postscript/PDF. But you get the best results not by naive translation but by running a cliche recognizer on the markup and lifting it to structure, then rendering that to whatever you want via stylesheets. That’s what doclifter does.

Lifting to DocBook makes sense under the assumption that you want to optimize the output to HTML. This pessimizes the Postscript output, but the situation is not quite symmetrical among the three major targets; what’s good for HTML tends to be coupled to what’s good for plain text. My big cleanup of the man page corpus, now more than 97% complete after seventeen years of plugging at it, is based on the assumption that Postscript is no longer an important target, because who prints manual pages anymore?

Thus, the changes I ship upstream try to move everyone towards structural markup, which is (a) less fragile, (b) less likely to break on alien viewers, and (c) renders better to HTML.

I recall fighting with it years ago. At least some man pages were saying that info was the new way, so I tried it. For some reason, I found it very difficult to understand – hard to tell how the information was organized and hard to memorize commands.

If I had cared enough, I would have found another tutorial on info. Instead, I just continued to use man pages.

The only people who use GNU info are diehard EMACS-heads, and not all of them. If I had my way, the entire set of info documentation would be translated to HTML, and then the original bits permanently recycled.

Yes, it is. Unfortunately, the info model is so weird to anyone used to HTML that you really have to do a semantic-level rewrite before it will convert well. Trying to lift it to DocBook as an intermediate step won’t work either, because page-sized frames are so fundamental in info. Yes, I’ve looked at this problem.

My understanding of software best-practices is more recent, but at what point did the separation of content from layout become widely understood?

The Wiki claims texinfo originally came out in 1986, at which point I would have assumed that everything had moved to terminals which at least could support scrolling. Having the concept of page size baked into something unless explicitly designed for printing/layout strikes me as a terrible idea. I’m trying to better understand the thinking patterns which would have resulted in a design decision like this.

From the wikipedia article on info: “The C implementation of info was designed as the main documentation system of GNU based Operating Systems and was then ported to other Unix-like operating systems. However, info files had already been in use on ITS emacs.”

From the Wikipedia article on Texinfo:

“The Texinfo format was created by Richard M. Stallman, combining another system for print output in use at MIT called BoTeX, with the online Info hyperlinked documentation system, also created by Stallman on top of the TECO implementation of Emacs.”

So not so much “modeled on a previous service on ITS” as “Stallman migrated his project to Unix from ITS”.

If I wanted to convert a mass of strangely-formatted files into HTML, knowing there was little chance of getting the individual authors to fix any problems… I would have started at the other end; redirect the man output to files, then parse those files into HTML.

I used that method for converting data from various incompatible systems back in the day, some of which didn’t have compatible network protocols or removable storage formats. More than once, I had to get the dataset as dumped to a serial printer hooked to a PC running QModem…

You forgot to close the quote on “Zeno tarpit (which my rss feed version ended at the open quote).

Then there’s GNU Info. Emacs had some nice things since you could use the editor navigation, but there was the info program that couldn’t do much and was confusing. I wonder how much of that is still around. In particular, the subtleties of GCC were only in Info.

I wonder if you can go round trip – roff to doclift then back to (clean) roff that would produce binary identical renderings, sort of F(F**-1(X)) ?= X.

I don’t think anybody “honors” man page markup any more. We’ve just lived with it because draining the swamp is so hard.

I’ve essentially solved the technically hard part of that problem. Now we need some distro to decide it can get a competitive advantage by having *every* package use my tools to install HTML at a standard location where man can see it and redirect to a browser if one is up. I’ve even written the hooks required for this – they’ve been in the standard version of man for years.

The man page organization still gets some love, and not wrongly – it’s good style for command references, though not for long-form docs.

This runs afoul of that RedHat-ism that says “even on a server, ‘browser’ means a browser on the remote machine from which the user is logged in”, thus preventing browser access to server-local HTML (which is really annoying for the HTMLized Intel compiler manuals).
When I complained (years ago), RedHat responded that they were “doing the right thing”®

Man is awesome. More than that, man is *useful*. If your software doesn’t ship with man pages you’re a luser and you should go get a job working for Microsoft or the FBI.

And if all your man page does is refer to your info docs? An unmarked grave is better than you deserve.

When I want to pull up a short precis of what options I need to feed to some command to get all the f*king angels dancing on that pinhead together, I don’t want to have to hope the morons in charge of the system bothered to put a text-based web browser/pager on it. I want to type “man” and read the f*king doc, then if it’s not the command I want, I want to see the “SEE ALSO” bit at the bottom.

Don’t want to spend 5 minutes on google finding the HTML edition of the man page for THIS version, because the BSD version is slightly different etc.

I don’t want to even have to click over to a web browser.

The two biggest things about windows that PISS ME THE F*K OFF are crappy logs and crappy “help” pages. Those “help” pages are a net LOSS to mankind.

I really don’t care what the format is, I want the documentation for that system ON that system, and readable in the sparsest terminal you can imagine. Because some poor bastard on the back side of nowhere is going to be trying to remotely manage a system over a microwave link in a bloody sandstorm and *NOTHING* GUI is going to get through. And he’s trying like hell to get that system, or the one right next to it, fixed without rebooting, much less reinstalling like a good Windows admin would, because the only people within physical reach of the box are too stupid to be allowed to breed, much less log in as root, and if he loses this connection he’s going to have to talk to these mucking forons AND THERE IS NO ALCOHOL ALLOWED.

Ever thought of offloading the editing of offending man pages in maintained projects to other people? Like, make a list of stuff that fails, assign each item to a volunteer who would go through the trouble of fixing the document and submitting the patches to the maintainer.

>Like, make a list of stuff that fails, assign each item to a volunteer who would go through the trouble of fixing the document and submitting the patches to the maintainer.

I don’t need to crowdsource this. The volume of patches required is not all that high year-over-year, though they’ve added up cumulatively to a pretty impressive number (not far shy of 900) in the last 17 years. I’ve got the machinery set up so that one person can keep the process moving.

Eric: Of the two types of people you mention, I’m a third. Rather than “that’s nice” or “how?”, my response is “why?”. When I reach for a man page, it’s generally because I’m looking for documentation to be rendered in a terminal. On the rare occasions that I look up a man page on the Web, I’m generally content to have it rendered as 80 column fixed width text. If I’m not looking specifically for documentation to be rendered in a terminal, I generally don’t care if a man page exists, and if a man page happens to be what I pull up, I don’t need it to be any more pretty-printed than it would be as rendered in a terminal. Trying to get man pages to render as HTML strikes me as a case of fixing what ain’t broke (as long as said man pages render properly in a terminal). Of course, I’m a young whippersnapper who grew up on Windows *.hlp files, and then the Web, so I’m used to other things than man filing the role of non-terminal documentation (these days mostly plaintext or PDF for local, non-hypertext documentation, and hyperlinked HTML for non-local).

>if a man page happens to be what I pull up, I don’t need it to be any more pretty-printed than it would be as rendered in a terminal.

Wrong-o, space cadet!

You’re right that there is not normally a functionally significant difference in purely visual quality between man markup rendered to a terminal and the equivalent HTML.

Truly, terminal-emulator monospace vs browser proportional and bold vs. italic do not matter all that much in the great scheme of things, even given that there are minor edge cases in which not having both is awkward. (Graphics designers will heap scorn upon me for not finding other visual differences Vastly Significant, but graphics designers can piss up a rope.)

However, O youth, ye hast unwisely forgotten the most important thing one medium hath that the other hath not.

Hyperlinks.

The reason man needs to die and all these pages need to move to the Web is because that is a precondition for rich semantic link structures.

I will repeat my last, I want the documentation for this system ON this system, and I don’t want to have to depend on anything else working for me to read that documentation, because at the sharp end IT WON’T.

One time for my sins I brought a small “test/dev” datacenter back from a sudden power failure with absolutely NO outside network connectivity. This was at the sort of facility where PEOPLE DIE when it goes down.

Let me repeat that, this wasn’t Twitter. This wasn’t Facebook. It wasn’t Amazon. It was something that /mattered/. I’m not allowed to go into much more detail other than no, we had no links to the outside because they were offline too.

>Documentation FOR the system ON the system, AND on the install media.

Your point is well taken, and I see that my previous use of “on the Web” might be taken to imply undue dependence on remote sites. It was not intended that way.

Here’s how I think things should work:

* Every package installs man pages in HTML format in an HTML document tree parallel to the groff manpage tree. (If an HTML page is installed, its man-markup progenitor need not be.) You are not dependent on net access except to the degree that pages have links pointing to other pages *not* installed.

* When you call man on a page, it looks for HTML first. If you have a graphical browser up, the HTML is rendered in that browser and man exits. If you do not have a browser up, man shells to lynx which displays the page in your terminal.

* If there is no HTML, man falls back to old behavior (render man page with groff and page it). At present this would be required for fewer than 1% of pages.
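That dispatch can be sketched as a pure function. The paths and predicate arguments here are invented for illustration; the real hooks live inside man(1) itself.

```python
import os

def man_command(name, section, html_root, have_gui_browser, page_exists):
    """Return the command man should run, per the lookup order above."""
    html = os.path.join(html_root, "man" + section, name + ".html")
    if page_exists(html):
        if have_gui_browser:
            return ["xdg-open", html]   # hand the page to the GUI browser
        return ["lynx", html]           # text browser in the terminal
    return ["man", section, name]       # fall back to classic groff rendering

cmd = man_command("ls", "1", "/usr/share/doc-html",
                  have_gui_browser=False, page_exists=lambda p: True)
print(cmd)  # ['lynx', '/usr/share/doc-html/man1/ls.html']
```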

I think this covers all the use cases. You get the best rendering supported by the package and your local environment. The hazard of being off-net is minimized, but you do get local hyperlinks that work.

We don’t have to change everything to HTML at once, and there is never a flag day. As more package docs can be rendered cleanly to HTML, things get monotonically better.

I’ve done pretty much all the technical work required for this to happen, including the modifications to man and the tools to get decent HTML from man markup. What needs to happen now is for some distro to add the right steps to their packaging routine.

The next step beyond that is some kind of semantic-web enrichment for the HTML docs. I don’t know how that should work; I do know it can’t happen without the steps I just described.

* When you call man on a page, it looks for HTML first. If you have a graphical browser up, the HTML is rendered in that browser and man exits. If you do not have a browser up, man shells to lynx which displays the page in your terminal.

If hyperlinks are added to man pages, that’s a good enough reason to move to HTML, provided that the above is changed to the below:

* When you call man on a page, it looks for HTML first. If HTML is available, man shells to lynx which displays the page in your terminal.

I don’t want man messing with my graphical browser session, and I don’t want it displaying anywhere but a terminal.

Now, one thing I wouldn’t have a problem with is “man foo”, typed into my browser’s address bar, bringing up a locally installed, pretty-printed HTML man page, or some kind of graphical help browser system for my DE that displayed man pages. I’m perfectly fine with graphical hypertext documentation, I grew up on WinHelp, and later the Web, but man from a terminal should always display in that terminal.

ISTR that you use a tiling window manager. I think that may influence your perception of how acceptable the behavior you propose is. Browsers often steal focus when an external event causes them to open a new tab, and people who don’t use tiling window managers generally run their browsers full-screen (and, unrelated to tiling window managers, some browsers will open externally triggered pages in an existing tab, replacing whatever’s there). Even if the browser doesn’t steal focus, the user may very well want to view man pages in a different sized window than their browser occupies.

Gark, no. I work in the DoD, and STIGs are a cast-iron bitch. If the only way to read system documentation is to have it dynamically start a local web server, the answer will be “don’t have no stinking on-machine documentation, or any of the tools associated therewith”, and that sucks semi-infinite hosewater. That, or they’ll require HTTPS with DoD-signed certs and FIPS 140-2 validated crypto modules and all that kitty-cat dance.

I’d love to have the option, but if it’s the only option, this will hobble people working in particularly paranoid, pedantic, ponderous environments.

To me, the correct approach is not for man to decide where the documentation gets displayed, but for the user to. For example, if I typed man://foo into the browser address bar, serve me the HTML version; but keep the status quo when man is invoked from the command line. That way I get to open it in a private window if I don’t want it cluttering my history, the browser doesn’t steal focus, and it doesn’t interfere with piping the command-line output to grep or other utilities.
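Registering a URL scheme handler would do it; here is a sketch of the URL-parsing half (the man:// scheme, the function name, and the paths are all hypothetical; nothing ships this today):

```shell
# Hypothetical parser for man:// URLs, the kind of helper an xdg scheme
# handler (MimeType=x-scheme-handler/man) would call before opening the
# locally installed HTML page.
# "man://foo.1" -> "foo 1", "man://ls" -> "ls"
parse_man_url() {
    url="${1#man://}"               # strip the scheme
    name="${url%.*}"                # drop a trailing ".<section>", if any
    section="${url##*.}"
    [ "$section" = "$url" ] && section=""   # no section suffix present
    echo "$name${section:+ $section}"
}
# The handler would then do something like (path hypothetical):
#   xdg-open "/usr/share/doc/html-man/$name.$section.html"
```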

Not a question, but a comment: in the conversion to HTML you can render the pages that won’t take your markup contributions in exactly the fashion you’d prefer. If someone doesn’t accept your correction, put them on a list, and have the HTML version use your corrections (if current) or carry a large, visible error message if not.

My only complaint is that if you set “the web” as the new standard for documentation, people will use it — all of it. Hipsters will flood in and add trendy, PowerPoint-ish designs with lots of whitespace and parallax scrolling, which brings me to all the JavaScript cruft that will necessarily be added — all under the rubric of “Linux needs to look modern, 90s web pages don’t cut it anymore”. (And don’t get me started on when Canonical, in an effort to show positive cash flow and maybe even profit, puts “Stories from Around the Web”, “You’ll never believe what this child celebrity looks like now” type ads on Ubuntu’s system documentation.) On top of all this — it will be cumbersome to browse or not work at all on Lynx; you will need to load a multi-hundred-megabyte browser runtime just to read a page of documentation.

Of course that last is a tax we’ll probably willingly pay anyway. We’re living in an era when the best open-source programmer’s editor for Linux runs in a browser runtime. And is written by Microsoft.

* Every package installs its manpage in whatever format it’s written in. That package has as a required dependency the viewer for that format.

* The man command is enhanced to either handle those formats, or fire off an external processor for those formats. This can be controlled by options or environment variables.

Man isn’t a pager; it’s just a formatter. It hands off the paging/display to another program (for fun, do “man -P vim man” and see how it blows up, at least on RHEL 6), so if you want the HTML docs on your system, you install lynx/w3/whatever and set HTMLPAGER to chrome, and it knows what to do. If HTMLPAGER is unset it uses the default (set in /etc/man/htmlpager, of course).
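The selection logic the commenter describes fits in a few lines (HTMLPAGER and /etc/man/htmlpager are the commenter’s hypothetical names, not options any shipping man implements):

```shell
# Pick the HTML pager: an explicit HTMLPAGER wins, then a system-wide
# default file, then lynx. Variable and config path are hypothetical.
html_pager() {
    conf="${1:-/etc/man/htmlpager}"
    if [ -n "$HTMLPAGER" ]; then
        echo "$HTMLPAGER"        # user's explicit choice
    elif [ -r "$conf" ]; then
        head -n 1 "$conf"        # system-wide default
    else
        echo "lynx"              # built-in fallback
    fi
}
```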

Enhancing man itself gives you the option of what to save in catman as well. Do you want to save the “rendered” groff output, or the HTML?

Of course, you know what will happen once you start using HTML manpages, right?

Some “clever” boy will put javascript in there. And you know what THAT leads to, right?

Presumably he means that at that point “I don’t think there’s any way to do that through the doclifter pipeline.” will be irrelevant and Javascript will appear in man pages.

At that point the next logical step is for somebody to write a man page that serves up an emscripten build of bochs, boots Debian, auto-launches a DE, and opens up a browser displaying the same man page, recursively. Bonus points if the page runs two bochs instances concurrently, so that the number of bochs instances at each level of recursion is twice the number at the previous level.

“All hardware and software attempts to expand until it can boot Linux. Anything that cannot so expand will be assimilated. Now it can boot Linux. Resistance is futile.”

Some “clever” boy will put javascript in there. And you know what THAT leads to, right?

Manpage authors being lectured on how they shouldn’t rely on Javascript.

You might wonder “Why would anyone want Javascript in a manpage he’s viewing?”, but the manpage for systat(1) would be able to describe the systat -vmstat layout a lot better if it had ginormous, rich tooltips for labels like “ozfod” (pages optimally zero-filled on demand, whatever that means) and “Share” (Shared (virtual) RAM, but what counts as shared?)

I could imagine some manpage writer of 2020 putting in all sorts of great information about how to read the systat -vmstat blinkenlights, yet forgetting to make that information available to people who aren’t running X servers.
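For what it’s worth, a lifted HTML page could carry that kind of label documentation without any JavaScript at all; a hypothetical fragment (the explanations are the ones guessed at above):

```html
<!-- Hypothetical: an HTML systat(1) page attaching explanations to
     field labels with plain <abbr> markup, no scripting required. -->
<abbr title="pages optimally zero-filled on demand">ozfod</abbr>
<abbr title="shared (virtual) RAM">Share</abbr>
```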

Again, we’re living in an era where best practice for shipping desktop apps (at least, desktop apps that aren’t tied to a specific platform) is to write the thing in HTML/CSS/JS and ship it with a 200 MiB browser runtime.

Nobody gives a shit about being able to read documentation from the terminal in 2018. Virtually all devs have terminals on one 24″ monitor, a browser open on another, and an IDE open on yet another.

And this info will be available to people without X servers because Google Chrome supports Wayland and Firefox support is coming. To say nothing of all those open source devs (arguably the majority) using macOS as their primary environment.

I routinely use man in a terminal window, especially since in some environments ($JOB – 3) we had RHEL 3 through 6, with 7 coming in, HPUX, Solaris, VMS, AIX, Digital Unix and a few others.

To quote the bard:

And therefore as a stranger give it welcome.
There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.

Reading documentation in your browser, as a dev, makes a lot of sense. You’re sitting in your comfy chair in your nice quiet office with a cup of coffee, and if your network is down, well, there’s the foosball table, or maybe a wander over to Starbucks for another cup.

But if you’re not a dev, you often aren’t ANY of that.

There are more than a few other jobs where you don’t have any of those. Including (sometimes) internet access.

The current OSS developer culture is one of “optimize for my pet common case (a.k.a. modern systems), and fuck everyone else”. Good luck convincing the developers of the documentation system of the future that your exotic systems reachable only by satcom link are worth even a moment of their time. They’ll probably give you helpful advice like “try using a system from this century”.

That’s totally wrong. The source code of the AT&T Version 3 UNIX manual pages is still preserved by TUHS. It is pure roff without any macro set. Among the most frequently used requests you find: .br .fi .he .in .nf .pa .sp .ti .ul. The macro set in v4 to v6 was merely 74 lines, with almost no resemblance to the v7 man(7) macros. The man(7) macros only appeared in v7. So for almost the first ten years, manual pages were more or less pure roff.

2. “About 1997 it was becoming obvious that in the future most documentation would move to the web; the advantages of the hyperlink were just too obvious to ignore. The new wave in documentation markup languages, typified by DocBook, was designed for a Web-centric world in which – as with nroff on terminals – your markup can’t safely make a lot of assumptions about display size or fonts. To deal with this, the new document markup languages were completely structural. But this created a huge problem. How were we going to get the huge pile of grubby, visually-marked-up Unix man pages into purely structural markup?”

That’s badly misleading. Long before 1997, Cynthia Livingston had already solved 100% of that problem. On a USENIX grant and with direct commit access to the SCCS repo of the UC Berkeley CSRG, she rewrote all UNIX manual pages in the new roff macro language she designed and implemented herself, mdoc(7). It was not only structural, but semantic. She did the crucial part of that work in 1990 (!) and finished polishing it before 1994.
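For readers who haven’t met it, here is the shape of a minimal mdoc(7) page; the macros name document roles (.Nm the utility name, .Nd the one-line description, .Fl a flag, .Ar an argument) rather than fonts. This is a generic illustration, not taken from any particular page:

```troff
.Dd January 1, 2018
.Dt FOO 1
.Os
.Sh NAME
.Nm foo
.Nd frob the input
.Sh SYNOPSIS
.Nm
.Op Fl v
.Op Ar file
```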

3. “To get to the point where Linux and *BSD distributions could flip a switch and expect to webify everything…”

That’s badly misleading as well. None of the BSD operating systems (FreeBSD, OpenBSD, NetBSD, DragonFly, Minix 3, illumos) is using the ancient man(7) format in the first place, so there is no problem to solve for *BSD whatsoever. It is almost purely a home-grown GNU/Linux problem, caused by the failure to abandon the obsolete man(7) language more than two decades ago.

>The source code of the AT&T Version 3 UNIX manual pages is still preserved by TUHS. It is pure roff without any macro set

OK, I didn’t know that. By Version 6 my statement was true. I don’t accept your claim about V6 because I’ve seen those macros; they weren’t naked roff.

>Long before 1997, Cynthia Livingston had already solved 100% of that problem. […] It was not only structural, but semantic.

Oh how I wish that were true. mdoc has some compromises in it that I can understand in light of the groff engine it was written over, but which violate the “structural” trait quite seriously. It is also not entirely semantic, though it comes close. I know this because doclifter includes an mdoc parser; the edge cases where Ms. Livingston didn’t get those traits quite perfect are serious pains in my ass and account for a substantial fraction of my residuals. Which is not to say she botched it in any sense – overall, mdoc was good, craftsmanlike work and probably about as complete a job as possible under its constraints.

>None of the BSD operating systems (FreeBSD, OpenBSD, NetBSD, DragonFly, Minix 3, illumos) is using the ancient man(7) format in the first place

Are you really claiming that BSD ports never packages anything from Linux with old-style pages? That would be quite surprising.

1. v4, v5, v6: Well, here are the macros used in the v6 manual pages (and consequently, in 1BSD and 2BSD as well): https://minnie.tuhs.org/cgi-bin/utree.pl?file=V6/usr/man/man0/naa
As I said, that is 74 lines, 12 macros: .bd .bn .dt .fo .it .i0 .lp .sh .s1 .s2 .s3 .th, with very little resemblance to man(7).
.bd vaguely resembles the later .I, .dt the later .DT (now deprecated), .lp the later .IP, .sh the later .SH, and .th the later .TH, but the rest are totally alien, and the longest one except .th is five lines, so the pages are still more or less pure roff in v6. At least *much* closer to pure roff than to v7 man(7).

2. You dramatise the difficulty of mdoc(7) parsing. Yes, there are a few quirks that I discussed in detail at BSDCan 2011, but they hardly occur in practice. Basically, mdoc(7) parsing is simple; we got it perfect in mandoc(1), with absolutely no remaining problems, more than half a decade ago. (Of course, you are *not* overstating the difficulty of lifting pages written in man(7); that is a seriously hard task indeed. I’m still dreaming about supporting semantic searching in apropos(1) and semantic markup in man.cgi(8) for legacy man(7) pages, and when I get around to that, I will use doclifter (because redoing that hard work would be very wasteful) and feed the results into Kristaps Dzonsons’ docbook2mdoc(1); I have mentioned that plan a few times at conferences already. The reason I haven’t really started yet is that the few remaining legacy man(7) pages are not very relevant, so there are more urgent matters than worrying about man(7).)

3. No, of course BSD *ports* package many legacy man(7) pages from GNU/Linux land; I was talking about the *operating systems*: kernel manuals, base-system library manuals, POSIX userland manuals, and native BSD application manuals. Ports manual pages, in particular those that are not mdoc(7), are typically of inferior quality in the first place, so the obsolete formatting language and the lack of semantic markup are the least of the concerns with respect to them. By the way, since last year, mandoc(1) parses the manual pages of 99.7% of our ports to an abstract syntax tree (i.e. a structural representation) that renders to visually acceptable HTML. Two years ago, it was 97.5%. Of course, as opposed to doclifter, it does *not* contain heuristics to translate presentational to semantic markup; mandoc will simply render \fB to <b> no matter what, even in the SYNOPSIS.
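That purely presentational translation can be mimicked in a single sed substitution; a toy illustration of the policy, emphatically not mandoc’s actual code:

```shell
# Translate roff bold spans \fB...\fR to <b>...</b> with no attempt to
# guess what the bold text means (name? flag? file?). Toy example only.
roff_bold_to_html() {
    printf '%s\n' "$1" | sed 's/\\fB\([^\\]*\)\\fR/<b>\1<\/b>/g'
}
```

A semantic lifter like doclifter has to work much harder, inferring from context whether that bold span was a command name, an option, or just emphasis.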

That’s because I had to do it, man. I think the biggest single PITA was the .O and .X enclosure macros. Table elements and lists were pretty bad, too. There are some cases I’m still not sure I got right.

Oops, very sorry, i overlooked your reply, and now a user made me aware of it here at the BSDCan conference. I said “dramatise” because you said mdoc(7) manuals account for “a substantial fraction of my residuals”. That simply means your mdoc(7)-parser is sub-par. I’m not saying the task is trivial, but it can be done with reasonable effort. In mandoc(1), we relatively easily got the mdoc(7) parser to a 100% success rate without a single residual in real-world manuals many years ago. Getting the man(7) parser to 95% was much harder than getting the mdoc(7) parser to 100% and took until 2015. Getting the man(7) parser from 95% to 99% took three more years, and that is where we are now, with slightly less than 30 residuals remaining.

>you said mdoc(7) manuals account for “a substantial fraction of my residuals”. That simply means your mdoc(7)-parser is sub-par.

I suspect you will change your mind when you know what the numbers are. In 13697 man pages I now have only 15 residuals (0.11%). Of these, 3 are in mdoc markup. None of those are core BSD pages; all but one are from NTP Classic and the last is from groff. You’re getting a 100% conversion rate because your sample is better behaved, and even so I have half your number of total residuals.

The only mdoc construct left that doclifter doesn’t handle is the .O/.X enclosure macros.

>Cmake appears to be emerging as a de facto standard for new C and C++ projects

Two-phase system with generated makefiles. Not the horror that autoconf is, but one-phase systems like waf and scons are better. Among other problems, parallel builds are difficult and error-prone with two-phase builders.

May not matter much. My estimate of C/C++’s time-to-live has been falling.

CMake has its own problems. In addition to what ESR mentioned, the syntax is unwieldy and clunky, not to mention error-prone. Getting it to work just right is often an art. I say that as a CMake user.

Also, you have to have the CMake binary installed in order to build, in addition to whatever build system you use in the end. This is true even if the project source comes with a pre-generated build system: the build files will actually call out to the CMake binary, not only to regenerate the build system, but also to do things like create directories and symlinks and install files. (Why? Because of Windows, of course! Windows doesn’t come with things like ln(1) and install(1) out of the box.)
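You can see this in any CMake-generated Makefile: the rules delegate even filesystem chores back to the cmake binary, roughly like this (an abridged and simplified sketch, not a verbatim excerpt):

```make
# Sketch of what CMake emits into its generated Makefiles: even mkdir
# and install go through "cmake -E" / "cmake -P" for portability.
CMakeFiles/foo.dir/foo.c.o: foo.c
	$(CMAKE_COMMAND) -E make_directory CMakeFiles/foo.dir
	$(CC) $(C_FLAGS) -c foo.c -o CMakeFiles/foo.dir/foo.c.o

install:
	$(CMAKE_COMMAND) -P cmake_install.cmake
```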

I can only surmise that the reason it took off at all is that it can generate project files for IDEs like Visual Studio. Oh, and that it has better support for Windows than most other build systems. (Being better than autotools in itself is, like, the lowest bar in the world. ;)

>CMake has its own problems. In addition to what ESR mentioned, the syntax is unwieldy and clunky, not to mention error-prone. Getting it to work just right is often an art. I say that as a CMake user.

I forgot the big one: tracing/debugging build recipe errors on a two-phase system is inherently nasty. It’s hard to back up from where the breakage manifests in a Makefile to the source recipe you actually want to modify.

As much as we disagree on politics, I am a huge fan of yours because you undertake this kind of work. Despite all our disagreements about the ideal structure of society, I can’t possibly thank you enough for all the work you do on foundation-level infrastructure which is invisible to most users!

If anyone out there has control of any money, please give Eric a grant!

I recently added a feature to the old script(1) command from the Linux term-utils package. Its man page is troff, complete with ugly cruft that nobody dared touch through years of successive revisions. To wit, a code sample is given as:
.RS
.RE
.sp
.na
.RS
.nf
if test -t 0 ; then
script
exit
fi
.fi
.RE
.ad
.PP

The RS and RE macros in close succession are a NOOP — a sure sign of cruft scaring away maintainers.
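For comparison, one plausible cleanup: the leading .RS/.RE pair goes away entirely, and .na/.ad are redundant because no-fill mode (.nf) doesn’t adjust lines anyway (assuming, of course, that no surrounding context depends on them):

```troff
.sp
.RS
.nf
if test -t 0 ; then
    script
    exit
fi
.fi
.RE
.PP
```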

I would love to reformat these man pages if you gave me a syntactic markup model.