Will not happen. There is only so much you can clean up automatically. As always GIGO.

Hitch, dialogue quote issues? Those can be solved easily, there are tools for that... I also know it was bad, but not that bad... I understand your desire for automated analysis more and more...

Tox, m'dear:

oh, no, not THAT type of thing: not broken dialogues, per se. I meant, that for some bizarro-world reason, the typist had in two instances (two diff. books) created one paragraph style for dialogue paragraphs, and one for narrative. Obviously, that's easy-peasy to solve.

For broken dialogues, I have your wondrous tool. But then there are simply broken paragraphs, outside of dialogue, and I'm working on more all-inclusive regex/searches to fix those, as much as possible...and lastly, this one real doozy (the one of which I spoke), in which there were sentences somewhat like this:

I tol him to leave the room "I'll be back, "David will make sure of that.' he smiled.

Now, obviously, no formatter can "save" that. It's just bugger-all bad. But I've given some contemplation to the idea of playing with your broken dialogues, Tox, to create a pass that marks up--identifies--all the broken dialogues, and then hand it BACK to the author to review, for preliminary clean up. Just toying with it. Not solidified in ye olden brain yet. Just pushing around in "the little grey cells." ;-)

In what way? It is my understanding that Calibre's book editor is to be similar to Sigil in lots of ways. And since Sigil is, to all intents and purposes, dead as far as future development goes, I would have expected Calibre to be the more logical choice, seeing as its Edit Book is still very much in the early stages (even though what is currently available has been implemented in record time) and therefore likely easier to adapt to your needs. In fact, from what I understand with my limited scope from your post, the functionality you speak of could maybe be put to use as enhancements to the same.

I realize the book editor is not done.
But what I see being discussed is that it will have more 'automatic (only?)' features. I keep thinking about 'Tidy' (I do use 'Pretty').

As an example:
Early Sigil would just stomp on your NCX file. The current Sigil waits for you to do the deed.

In what way? It is my understanding that Calibre's book editor is to be similar to Sigil in lots of ways. And since Sigil is, to all intents and purposes, dead as far as future development goes, I would have expected Calibre to be the more logical choice, seeing as its Edit Book is still very much in the early stages (even though what is currently available has been implemented in record time) and therefore likely easier to adapt to your needs. In fact, from what I understand with my limited scope from your post, the functionality you speak of could maybe be put to use as enhancements to the same.

Sigil is written in C and the source is readily available. Calibre's editor is similar, but in no way a copy of Sigil, and is written in Python. It was written as a new editor from the ground up. Depending on what the OP wants to do, the two are clearly not heading in the same direction, and it would be hard to presume one is a better fit than the other.

[pedant mode]Sigil is written in C++ not C. I'm a C programmer for my day job. I keep meaning to learn C++ but I never seem to find the time. They are related and there's some overlap but they are different.

Anyway I believe the OP said he was a C++ coder so I really am being pedantic [/pedant mode]

I'm now not really sure if we're saying the same thing, or different things. At this juncture, I don't see any tool, at all, that is assisting in providing clean, properly-formatted XML into Sigil or any other workflow. My comprehension of your posts is that this is what you were considering creating, as it's extremely unlikely that any of the current writing tools on the market, whether Word, Scrivener, etc., are going to go in that direction?

Yes, the main goal is getting well-formed, semantic XML. If such XML is XHTML, it should also be valid. This job could easily be done by the various programs that are used to write, edit and transform texts, but not all programs care about the quality of their output or encourage practices that produce such output.

Quote:

Originally Posted by Hitch

I don't think that's hard; the truth is that you either take a word-processed document, or something that's been through, say, INDD, and you can a) clean it and b) then export it to HTML in order to create an ePUB for instant commercial use, and then c) export it into MOBI for commercial use, or b) clean it to create semantic XML in the first place, which then has to be processed again to create an ePUB and/or MOBI. In the former case, you essentially run 1+ processes, in the latter it's 2+ or 3, as creating a mobi from a good ePUB is simplicity itself. I think it's as simple as, XML isn't natively suited for print or faux-print layout, as it's basically Markup. Writers and editors don't write in Markup.

b) only takes two steps if the program used to write the text failed to encourage/enforce the use of style templates in order to get semantic XML output initially, so a tool that cleans XML output in 2+ steps is just a workaround (but won't be uncommon considering the current situation, I fear). I don't see why XML wouldn't be natively suited for print, since it is hierarchically structured like print layouts are. Transformation from XML to print is quite easy with XSLT to FO or to LaTeX, which will almost always result in better quality output in less time. If a self-publisher prepares more than one book (and probably even with no more than one book) for distribution, an automated workflow will save time when processing the input, when replicating later edits into all target formats, and when adding other or future target formats for all projects which were prepared as input for the workflow. One click could convert an entire collection of texts into a new output format.
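To make the XML-to-print step a bit more concrete, here is a minimal sketch. It is written in Python rather than XSLT, and the element-to-LaTeX mapping (`LATEX_WRAPPERS`), the function name `to_latex` and the sample fragment are assumptions made up for illustration. It does not escape LaTeX special characters, so it is a toy and not a workflow.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical mapping from semantic XHTML elements to LaTeX wrappers.
LATEX_WRAPPERS = {
    "h1": ("\\chapter{", "}\n\n"),
    "p": ("", "\n\n"),
    "em": ("\\emph{", "}"),
}

def to_latex(element):
    """Recursively render an XHTML element as LaTeX text (no escaping of special characters)."""
    tag = element.tag.replace(XHTML_NS, "")
    before, after = LATEX_WRAPPERS.get(tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(to_latex(child))
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)

# Made-up sample input: clean, semantic XHTML without direct formatting.
SAMPLE = (
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    "<h1>Chapter One</h1>"
    "<p>He was <em>very</em> late.</p>"
    "</body>"
)

latex = to_latex(ET.fromstring(SAMPLE))
print(latex)
```

The point is only that the conversion is a plain tree walk once the input is semantic; supporting a further output format would mean adding another mapping, not touching the manuscripts.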

Quote:

Originally Posted by Hitch

When Amazon came into the marketplace, they bought Mobipocket creator, and the Kindle ran on HTML 3.2. This drove the bookmaking market. I can't say I've done a boatload of XML cleanup, but the XML I've tried to export from Word, to investigate this idea (XML to XSLT) hasn't looked like a party to clean. Moreover, the retailers change their standards and their devices every 5 minutes. No major reader runs on XML; so...I think it was, quite simply, creating a process that would be able to reuse a file, to create other outputs, in a market that is primarily driven by entertainment books, seemed like extra work and an extra step that's unnecessary. PLUS, even if you assume arguendo that it's a good idea, then you have the problem of (say, with Textbooks), trying to export the initial content into a usable form for the author/editor to do an UPDATED version...whereas, with HTML, you can reimport the content easily back into Word or another word-processor for an author and editor to work collaboratively to update the material for a next Edition or updated textbook. Trust me: they are NOT going to sit there over something that looks like an RSS feed or XML and edit it. I think that's a major hurdle, too.

Technically, XHTML, EPUB content documents and RSS are all XML. So, if a word processor is used to edit XHTML, its output is already usable for XSLT, provided the output isn't crappy (like with Microsoft Word) and semantic markup is ensured by the word processor GUI.

Quote:

Originally Posted by Hitch

If you say so. I admit, I've not seen anything that looks remotely user-friendly to which I could point my clients. And as I said somewhere in this thread, a major converter of books in India just invested a ton of money to invent/develop a system by which XML could be displayed in a Word-like, browser interface in order to provide a collaborative environment for textbook revisers/editors to work in. I'd have thought that if the environment existed, they wouldn't have spent all that money to create it, specifically for one client. I know someone else on this very forum considering creating a markup editor at one point in time; I don't know what happened with that.

The only two things that require a substantial amount of work are implementing true WYSIWYG for print (if not done by a PostScript rendering canvas) and implementing high-end collaborative editing features.

Quote:

Originally Posted by Hitch

Yes, but again: all of those, every single one, all depend on the cleaned, ready-to-go XML being prepared and ready. I see that as the huge stumbling block, myself. For commercial users, it would have to be as simple, and as easy, as "simply" exporting and cleaning to HTML/XHTML, and it would have to be something that we could convince our users that they want, and are willing to pay for. THAT would also be a fairly big block; convincing them that they want a cleaned XML file that they themselves likely won't ever open or use, or even foresee a need for. But, I could be wrong.

Again: the word processor should take care of this initially, so that clean, semantic XHTML is always produced. Word processor developers fail at the moment to provide such a feature (well, they already have style templates... just disable direct formatting). But even for word processors that write Pseudo-HTML, an editor to clean it up into XHTML would be useful, both for the writer himself and for a person who does formatting for writers.

Quote:

Originally Posted by Toxaris

My biggest questionmark would actually be the first step, from a wordprocessor to 'clean' XML. Clean XML is important, because if it is clean, it is relatively easy to go to anything else again.

There is no valid excuse for a word processor in the 21st century to export HTML (or even Pseudo-HTML) instead of XHTML. Also, it should be semantic, if automated processing should be made possible. I agree with your list of problems you identified, some could be solved (use of styles enforced, no direct formatting controls in the GUI), others not (use of whitespace for optical positioning - probably a word processor could at least help with it, as it already does spell checking and all kinds of more advanced stuff).

Quote:

Originally Posted by Toxaris

It will be an almost impossible task to be able to filter/convert the output of all these programs to XML/XHTML while maintaining all the markup and taking the bizarre things writers do in their documents into account. I only do it for Word and that is already a nightmare sometimes. Writers still surprise me with their working methods and output.

Indeed, so if the output was done badly in the first place, a tool would be great to strip away all direct formatting and then allow style templates to be applied to the text. The Pseudo-HTML to XHTML conversion can't be done in all cases, but that's what the word processor absolutely has to fix, not some other program. I wouldn't start to parse all kinds of crappy Pseudo-HTML output. That Microsoft Word is incredibly bad at outputting valid XHTML could be either deliberate protectionism for their proprietary software, or a result of their incompetence in the field of web technology (just think about Microsoft Internet Explorer).
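As a rough illustration of "strip away all direct formatting", here is a minimal Python sketch. It assumes the input is already well-formed, and the list of attributes counted as direct formatting (`PRESENTATIONAL_ATTRS`) is my own guess, not a complete or authoritative one. Semantic markup such as `class` is left alone.

```python
import xml.etree.ElementTree as ET

# Attributes treated as direct formatting in this sketch (an assumption, not a standard list).
PRESENTATIONAL_ATTRS = {"style", "align", "color", "face", "size"}

def strip_direct_formatting(root):
    """Remove presentational attributes from every element in place; keep everything else."""
    for element in root.iter():
        for name in PRESENTATIONAL_ATTRS & set(element.attrib):
            del element.attrib[name]
    return root

# Made-up sample: one semantic class plus several pieces of direct formatting.
SAMPLE = (
    '<p class="recipe" style="margin-left: 2em" align="center">Use '
    '<span style="color: rgb(255, 0, 0)">2 egg whites</span> instead.</p>'
)

cleaned = ET.tostring(strip_direct_formatting(ET.fromstring(SAMPLE)), encoding="unicode")
print(cleaned)
```

What remains is the text plus whatever semantic markup was already there; the information that was only encoded visually is gone and has to be re-applied by hand with style templates.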

Quote:

Originally Posted by Toxaris

The ambition is good, but the number of writers that wants to be bothered with this is very slim, especially for novelists.

I found out that at least self-publishers spend lots of hours or lots of money to do formatting for e-book and print, which could be replaced by an integrated processing software or an (online) service to do it for them.

Quote:

Originally Posted by Toxaris

However, the second part of converting the clean XML to other outputs could be very useful. That being said, XML itself is meaningless without the structure. What structure should be used? XHTML? A kind of LaTeX perhaps?

As XML can be used to specify all kinds of custom XML formats, it is too generic to be supported as the main input format (which would mean all kinds of custom XML formats need to be supported), unless one would develop a "mapping tool" to map custom XML to XHTML elements (for instance). As XHTML is in most cases the standard for representing structured documents on the web and in e-books, that would already be pretty usable, so the wheel won't have to be reinvented, and it's also quite common for word processors and other software (including websites via browser) to output XHTML. A custom XML definition would be of advantage if a processing system provider wants to support specific features, which would be triggered by the custom XML elements. If compatible, XHTML could be transformed into the custom XML format. Also, specialized programs that are intended as a front end for the processing workflow could output the custom XML format initially. This way, an (online) service could provide the software for its writers to write in, and they would get, for instance, output in all formats and automated distribution into online shops. Alternatively, an (online) service would get manuscripts in all kinds of formats, would strip them down to the basic text, and then prepare that for the custom XML format which corresponds to the processing workflow it is going to use.
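A "mapping tool" of the kind mentioned above could, in its simplest form, be a lookup table plus a tree walk. The following Python sketch assumes a made-up custom format (`book`, `chapter-title`, `para`, `dialogue`); the table `MAPPING` and the function `map_to_xhtml` are hypothetical names, and the XHTML namespace is left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: custom element name -> (XHTML element, class name or None).
MAPPING = {
    "book": ("div", "book"),
    "chapter-title": ("h1", None),
    "para": ("p", None),
    "dialogue": ("span", "dialogue"),
}

def map_to_xhtml(element):
    """Rebuild the tree with XHTML element names; unknown elements become classed spans."""
    tag, css_class = MAPPING.get(element.tag, ("span", element.tag))
    mapped = ET.Element(tag)
    if css_class:
        mapped.set("class", css_class)
    mapped.text = element.text
    mapped.tail = element.tail
    for child in element:
        mapped.append(map_to_xhtml(child))
    return mapped

CUSTOM = (
    "<book><chapter-title>One</chapter-title>"
    "<para>He said <dialogue>hello</dialogue>.</para></book>"
)

xhtml = ET.tostring(map_to_xhtml(ET.fromstring(CUSTOM)), encoding="unicode")
print(xhtml)
```

Falling back to a `span` that carries the original element name as its class is one possible design choice: nothing semantic is thrown away, even for elements the table doesn't know yet.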

Quote:

Originally Posted by Hitch

So...I'm with Tox. Getting the clean XML is the major hurdle. I just don't know how to get there from here. And I test-exported a *clean* Word file to XML last night...and, ayup, good luck with THAT.

Oh, Microsoft Word has fooled you ;-) The term "XML export" is technically nonsense, because XML isn't a format in itself, but a way to define all kinds of formats. So there is no "XML format" per se, one would have to ask "which XML format?" (because there are lots of them, XHTML included). So what Word calls ".xml", is their "Word XML format", and yes, of course, such a thing is useless, if they're not even capable of outputting valid XHTML in the first place. I do not talk about stupid custom XML formats, but of reasonable ones.

Quote:

Originally Posted by Hitch

My personal favorite? The "every paragraph is aligned differently" approach. I don't know what the hell is going on out there, educationally, but we've had a number of manuscripts in which dialogue paragraphs are unindented, and narrative are indented, or vice-versa. No, these aren't the James Joyce's of the future; they're illiterate (literally. I'm not being mean. The books are usually hardly readable). There appears to be someone out there "teaching" aspiring authors that this is the correct way to write.

Well, a word processor could refuse to show a visual difference for paragraphs, all paragraphs would be shown equally, so the use of a style template would just be a semantic markup for the processing that follows later. Indentation by whitespace only comes to mind, because a writing tool is abused to create layout. Whitespace isn't part of the text, and a writer is supposed to write text, not other stuff. A whitespace directly followed by another whitespace could be marked as spelling error. Combining writing and typesetting is a very bad idea in the first place.
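Marking doubled whitespace the way a spell checker marks typos is cheap to do. A minimal Python sketch, in which the function name and the choice of which characters count as whitespace (space, tab, no-break space) are my assumptions:

```python
import re

# Two or more consecutive spaces, tabs or no-break spaces; line breaks are not counted.
DOUBLE_WHITESPACE = re.compile(r"[ \t\u00a0]{2,}")

def find_whitespace_runs(text):
    """Return (offset, length) for every run of two or more whitespace characters."""
    return [(match.start(), len(match.group())) for match in DOUBLE_WHITESPACE.finditer(text)]

# Made-up sample: an indent faked with five spaces and a double space after a full stop.
sample = "Chapter One\n     It was late.  Very late."
runs = find_whitespace_runs(sample)
print(runs)
```

An editor could underline these runs instead of silently removing them, so the author learns that the indentation belongs in the style and not in the text.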

Quote:

Originally Posted by Hitch

So: how does a front-end piece of software fix THAT and produce clean XML?

I don't know your actual situation, but you could do the following things, if you have control over it:

Encourage manuscript senders to provide semantic, valid XHTML by educating them on how to do so. If semantic, valid XHTML is provided, you could charge nothing or less than usual prices for print and e-book preparation, since you won't have to put any manual work into it.

Require manuscript submission by online form. Authors should paste their text into it, and the text will lose all direct formatting, since the form is plain text. If it is a WYSIWYG editor to paste to, strip all direct formatting programmatically. If wished, you could provide style templates the author could apply to the text in order to semantically prepare it for your processing system.

Allow manuscript submission in all kinds of formats, and copy the plain text from it (maybe by plain text export, maybe by copy and paste). Provide the plain text for the author to do the markup for you, and just let him apply the style templates to the text which your processing system supports. Alternatively, you may do this task yourself as part of your service, with a tool like I initially thought Sigil could be. You have to fix the formatting of the text anyway, so why do e-book and print preparation separately and by hand, instead of applying semantic markup and produce e-book and print (and website from a database and whatever) from it?

Quote:

Originally Posted by Toxaris

Will not happen. There is only so much you can clean up automatically. As always GIGO.

Yes, as with "garbage in, garbage out": I would limit any attempt to throwing the garbage of the input away and writing highly usable files out. This way, visually encoded information and implied information gets lost, and there's no way to automatically convert it into semantic encoding. I would make it as easy as possible to encode this lost information manually in a semantic way, which could be done by the person who provided the garbage or by the person to whom (for whatever reason) the task is given to make something beautiful out of the garbage.

Quote:

Originally Posted by Hitch

I tol him to leave the room "I'll be back, "David will make sure of that.' he smiled.

Now, obviously, no formatter can "save" that. It's just bugger-all bad. But I've given some contemplation to the idea of playing with your broken dialogues, Tox, to create a pass that marks up--identifies--all the broken dialogues, and then hand it BACK to the author to review, for preliminary clean up. Just toying with it. Not solidified in ye olden brain yet. Just pushing around in "the little grey cells." ;-)

Yes, nobody can fix this automatically. The semantics of this text markup lie. Additionally, the markup itself is broken. "I'll be back, " would be identified as one part, and 'll be back, "David will make sure of that.' could be considered another, since ' is used as delimiter and apostrophe at the same time.
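While such a sentence can't be repaired automatically, it can at least be flagged for review. The following is a minimal Python sketch of one possible heuristic, not the method of any existing tool: each straight double quote is classified as opening or closing by its neighbouring characters, and a paragraph is flagged when the quotes don't alternate. The function names are made up, and the heuristic only looks at straight double quotes, so it will also flag legitimate cases such as speech that continues into the next paragraph or a quote directly followed by punctuation.

```python
def classify_double_quotes(paragraph):
    """Label each straight double quote 'open', 'close' or 'ambiguous' from its neighbours."""
    labels = []
    for i, char in enumerate(paragraph):
        if char != '"':
            continue
        before = paragraph[i - 1] if i > 0 else " "
        after = paragraph[i + 1] if i + 1 < len(paragraph) else " "
        if before.isspace() and not after.isspace():
            labels.append("open")
        elif not before.isspace() and after.isspace():
            labels.append("close")
        else:
            labels.append("ambiguous")
    return labels

def is_broken_dialogue(paragraph):
    """True if the quotes do not strictly alternate open/close or one is left open."""
    expected = "open"
    for label in classify_double_quotes(paragraph):
        if label != expected:
            return True
        expected = "close" if expected == "open" else "open"
    return expected == "close"

BROKEN = 'I tol him to leave the room "I\'ll be back, "David will make sure of that.\' he smiled.'
FIXED = 'I told him to leave the room. "I\'ll be back," he smiled.'

print(is_broken_dialogue(BROKEN), is_broken_dialogue(FIXED))
```

Note that simply counting the double quotes would not catch the example, since it contains two of them; both are opening quotes, which is what the alternation check picks up. False positives seem acceptable for a pass whose only job is to hand suspicious paragraphs back to the author.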

But as with your idea to identify broken dialogues, that's exactly what I'm proposing: Keep the stuff which is already in good quality, and throw away what is not (I myself am mostly concerned about this for XML, text tools for such purpose would be a different topic, but why not work in this field also, since a solution is needed for self-publishers as well?). In this specific case, either you or the author has to re-apply apostrophes and quotation marks. If the author has to do it, make it as easy as possible for him. If you have to do it, make it as easy as possible for you. As an advanced solution, don't allow quotation marks in your (online) editing software at all, but let the author mark quotations and direct speech semantically (my initial question of this thread was, if Sigil could be that software) by something like

Code:

I tol him to leave the room <dialogue>I'll be back, David will make sure of that.</dialogue> he smiled.

which could be automatically translated to

Code:

I tol him to leave the room “I'll be back, David will make sure of that.” he smiled.

(while even taking care of typographical quotation marks and other things like that - all the formatting, essentially). You could then easily output all dialogue text, or all text but dialogue text. Since authors don't do XML markup in a text editor, you need a tool to enable yourself or the author to apply the semantic style template "dialogue" to the selected text "I'll be back, David will make sure of that."
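The translation step could look roughly like the following Python sketch. The `<dialogue>` element is the one from the example above; the wrapping `<p>` element and the function names are assumptions added so that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

def render_dialogue(paragraph_xml):
    """Render a paragraph as plain text, wrapping <dialogue> content in typographic quotes."""
    root = ET.fromstring(paragraph_xml)
    parts = [root.text or ""]
    for child in root:
        text = "".join(child.itertext())
        if child.tag == "dialogue":
            text = "\u201c" + text + "\u201d"
        parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)

def dialogue_only(paragraph_xml):
    """Return just the dialogue passages of a paragraph."""
    root = ET.fromstring(paragraph_xml)
    return ["".join(element.itertext()) for element in root.iter("dialogue")]

SAMPLE = (
    "<p>I tol him to leave the room "
    "<dialogue>I'll be back, David will make sure of that.</dialogue> he smiled.</p>"
)

rendered = render_dialogue(SAMPLE)
spoken = dialogue_only(SAMPLE)
print(rendered)
print(spoken)
```

Switching to another quotation style (guillemets, single quotes) would then be a change in one place in the renderer, not a search-and-replace through the manuscript.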

Quote:

Originally Posted by At_Libitum

In what way?

In this way:

The crucial part about the "feature" of direct formatting shown in the screenshot is that the text "use 2 egg whites instead for healthier version" gets marked as

Code:

style="color: rgb(255, 0, 0)"

instead of semantic encoding like

Code:

class="alternative"

Such an approach of direct formatting makes things very difficult for processing software: what is the color red, rgb(255, 0, 0), meant to represent? Is rgb(254, 0, 0) intended to mark the same thing or something different? Even with two occurrences of rgb(255, 0, 0) at different portions of the text, red could be used in one place to mark alternatives and in another place to mark important warnings, and to software both will look the same, indistinguishable. The information about what "use 2 egg whites instead for healthier version" means is encoded visually with red color in order to be understood implicitly by the reader. In software, there's absolutely no way to get to an implicit understanding (well, for humans also - you don't know what you don't know, otherwise you would know it, right?). The only way to solve this problem is to provide the information explicitly, either by Calibre, or by me as developer of a processing software. I could hard code which red text is of which type, but then my software wouldn't be general purpose anymore, it would be specific to one single book. So what Calibre would do in case such direct formatting gets introduced (or is already present as a "feature" in the software) is downgrade the software by making its output less usable, even if it might look like a good new feature to the user. If text gets formatted by Calibre with this feature, it excludes authors from the benefits of automated text processing and the ability to change the layout within Calibre quickly instead of through time-consuming manual work. Furthermore, software would have to implement a CSS parser if it wants to read Calibre output. Note that with a semantic approach, with style templates, there would be no noticeable difference for the user in the task of formatting text as red, except that a style has to be defined first.
There's still the risk of style templates being abused for visual markup (let's say a style template "red"), but if ever red + bold is needed, a new style template would be needed, so both markups would be distinguishable. I don't know of any solution to prevent such abuse, but at least if style templates get imported from an (online) service, one can make sure that the output file will be usable by the service without any further conversion problems.
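To show what the difference means for processing software, here is a small Python sketch. The `alternative` class and the egg-white text come from the example above; the `warning` class, the other sample sentences and the function name are made up. With class names, grouping text by meaning is a trivial lookup; the span that only carries a color can't be assigned to either group without guessing.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def text_by_class(xhtml):
    """Group the text of all elements by their class attribute; unclassed elements are skipped."""
    groups = defaultdict(list)
    for element in ET.fromstring(xhtml).iter():
        css_class = element.get("class")
        if css_class:
            groups[css_class].append("".join(element.itertext()))
    return dict(groups)

SAMPLE = (
    "<div>"
    '<p>Add 1 egg. <span class="alternative">use 2 egg whites instead for healthier version</span></p>'
    '<p><span class="warning">Do not eat raw dough.</span></p>'
    '<p><span style="color: rgb(255, 0, 0)">Is this a warning or an alternative?</span></p>'
    "</div>"
)

groups = text_by_class(SAMPLE)
print(groups)
```

No CSS parser is involved here, and the third span simply doesn't show up in any group, which is exactly the information loss described above.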

Oh, Microsoft Word has fooled you ;-) The term "XML export" is technically nonsense, because XML isn't a format in itself, but a way to define all kinds of formats. So there is no "XML format" per se, one would have to ask "which XML format?" (because there are lots of them, XHTML included). So what Word calls ".xml", is their "Word XML format", and yes, of course, such a thing is useless, if they're not even capable of outputting valid XHTML in the first place. I do not talk about stupid custom XML formats, but of reasonable ones.

Well, not quite. XML is meaningless. It is only a markup language and adds structure. I can export whatever I want as XML, as long as I honor the structure. Without the schema however, the XML is useless. In the schema we define what the tags mean and what the structure should look like. XHTML is not a format, it is just XML with a (more or less) strictly defined schema.
Word XML is just that. It is perfectly valid XML with a schema specifically for Word documents, just as the intention was. In principle it is possible to load the XML in Word and have your original document. The same applies for their HTML output. It is valid, even if it is not what we would like.
All XML 'formats' are custom, but some schemas are public and agreed upon by various parties.

That is also one of the issues. A schema needs to be agreed to correctly identify the semantic value of the tags. You cannot expect all (or any) wordprocessor to honor the schema you would like. So, you would need to map the XML schema from the wordprocessor to your schema. That will not always be possible.

You also greatly overestimate the willingness of writers to change their ways and their reaction to being forced to work in a certain way. They would rather use another program or even Wordpad than to change their wow. Only a small amount of writers is willing to do that.

You might take a look at my Word add-in. I create clean HTML output (or XHTML directly in an ePUB) out of Word, but at a price. Styling like margins and fonts will be removed. It would be relatively easy to create an export for another format (e.g. Markdown) in the same way.

I like the idea, but I think you are too optimistic. However, if I can help to improve things, I probably will.

Well, not quite. XML is meaningless. It is only a markup language and adds structure. I can export whatever I want as XML, as long as I honor the structure. Without the schema however, the XML is useless. In the schema we define what the tags mean and what the structure should look like. XHTML is not a format, it is just XML with a (more or less) strictly defined schema.

Well, not quite. XML is somewhat self-descriptive, if good names are used. The absence of a schema doesn't have any effect on that at all: you can still read and interpret an XML file. Even if a schema is available, the schema doesn't tell what a tag means, it just defines the structure. And even if you have a format specification document, you might still not know how to implement tags which are defined in the schema. And I wonder how you define the word "format".

Quote:

Originally Posted by Toxaris

Word XML is just that. It is perfectly valid XML with a schema specifically for Word documents, just as the intention was. In principle it is possible to load the XML in Word and have your original document. The same applies for their HTML output. It is valid, even if it is not what we would like.

The HTML output of Microsoft Word can't be valid to any schema, since HTML isn't XML. Furthermore, as far as I know, there is no schema for HTML4, validation is done by DTD. I looked at the so called "XHTML" output of a recent Word version, and it wasn't even well-formed.
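Well-formedness, as opposed to validity, is easy to test, because any XML parser will refuse input that isn't well-formed. A minimal Python sketch with two made-up fragments (the function name is mine); it says nothing about validity against a DTD or schema:

```python
import xml.etree.ElementTree as ET

def well_formedness_error(markup):
    """Return None if the markup parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        return str(error)
    return None

GOOD = "<p>One<br/>Two</p>"
BAD = "<p>One<br>Two</p>"  # unclosed <br>: fine in HTML 4, not well-formed as XML

print(well_formedness_error(GOOD))
print(well_formedness_error(BAD))
```

A file that fails this check can't be fed to XSLT or any other XML tooling at all, which is why it is the first hurdle for word processor output.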

Quote:

Originally Posted by Toxaris

All XML 'formats' are custom, but some schemas are public and agreed upon by various parties.

With the term "custom" in the previous posts I referred to XML files in a structure defined by yourself, with or without schema, and even to such ones which are "uncommon" (for the purpose of this thread, also to XML definitions which are common in general, but less common in comparison with XHTML), with or without public schema.

Quote:

Originally Posted by Toxaris

That is also one of the issues. A schema needs to be agreed to correctly identify the semantic value of the tags. You cannot expect all (or any) wordprocessor to honor the schema you would like. So, you would need to map the XML schema from the wordprocessor to your schema. That will not always be possible.

No. Everybody can write his own schema and just validate his own files against it. Why would anybody want to do so? For software it is a very convenient way to check input, so the source code can trust that certain elements are there, instead of checking it all the time or with a lot of code. Also, if other programs or people provide you with their XML files in a custom structure, you could write your own schema for it, according to the elements your software will recognize (and adjust, if you discover new or different elements in files in the future). I would be glad if word processors would honor XHTML, and hopefully in the most semantic way.
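To illustrate why checking input up front is convenient for software, here is a Python sketch. Python's standard library has no schema validator, so this uses a hand-rolled table of allowed child elements as a stand-in for a real schema; the element names and the `validate` function are assumptions, and an actual setup would rather use a proper schema language with an existing validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical "schema": the allowed child elements for each element the software recognizes.
ALLOWED_CHILDREN = {
    "book": {"chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": {"dialogue"},
    "dialogue": set(),
}

def validate(element, path=""):
    """Return a list of problems; an empty list means the input is accepted."""
    here = f"{path}/{element.tag}"
    if element.tag not in ALLOWED_CHILDREN:
        return [f"{here}: unknown element"]
    problems = []
    for child in element:
        if child.tag in ALLOWED_CHILDREN and child.tag not in ALLOWED_CHILDREN[element.tag]:
            problems.append(f"{here}/{child.tag}: not allowed inside <{element.tag}>")
        else:
            problems.extend(validate(child, here))
    return problems

VALID = "<book><chapter><title>One</title><para>Hi <dialogue>yo</dialogue></para></chapter></book>"
MISPLACED = "<book><para>stray</para></book>"
UNKNOWN = "<book><chapter><footnote/></chapter></book>"

for document in (VALID, MISPLACED, UNKNOWN):
    print(validate(ET.fromstring(document)))
```

Once a file passes, the rest of the code can rely on the structure without checking it again at every step, and when a new element turns up in someone's files, only the table has to be adjusted.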

Quote:

Originally Posted by Toxaris

You also greatly overestimate the willingness of writers to change their ways and their reaction to being forced to work in a certain way. They would rather use another program or even Wordpad than to change their wow. Only a small amount of writers is willing to do that.

Well, you're right, but only up to a certain point. I know that some writers prefer to shoot themselves in the foot. Without doubt, I wouldn't even try to convince them to get good output files, with less or no manual work to produce e-books and PDFs, because they really like bad output files, much manual work and crappy e-books as well as crappy print results. For all the other writers, in case of a word processor, I would bring up a setup wizard at first start of the program, inform the user why and how he has to use styles, let him define several styles, and then work on the text. In general, not much would be different, except you couldn't just select a font or a font size. Even if font selection (and similar GUI components) would persist, you could change the font, but would be asked which style you're currently editing or if you want to create a new style, or you would automatically change the font at all portions of the text which are marked with the style that is currently selected. So there would be little difference for the writer (additionally, I would assume a writer writes text, while you assume that a writer does typesetting).

Quote:

Originally Posted by Toxaris

You might take a look at my Word add-in. I create clean HTML output (or XHTML directly in an ePUB) out of Word, but at a price. Styling like margins and fonts will be removed. It would be relatively easy to create an export for another format (e.g. Markdown) in the same way.

From the description of your add-in on the websites linked in your signature, it looks like you're throwing out all direct formatting, but retain semantic style markup. So I wonder why you don't agree that a word processor should encourage semantic style markup and disable direct formatting, since the latter is obviously useless for all other software except the word processor itself. To save the time of the author, who potentially spends time with direct formatting, he could do something useful instead by applying templates, which your add-in could retain. Also, if the output of your add-in should be used for e-book or print preparation (or as input for an automated processing workflow), the output file needs to be extended with semantic markup, using Sigil. So not only the time of the author is wasted, if he uses direct formatting, also the time of the Sigil person is wasted, who has to do the semantic markup afterwards completely from scratch (in the worst case). In an ideal workflow, the author would do semantic markup with style templates initially (everywhere where he would use direct formatting anyway), all of it would be retained by your add-in, and if some markup still would be missing for preparation of e-book and print creation, the Sigil person would add just the missing markup. The key thing here is that the direct formatting is useless for the writer and the preparation guy in any case (and therefore a waste of time and resources), so a word processor does a bad job by allowing direct formatting. The developer of the word processor just gets away with it, because the author will find out about the consequences when it is far too late, and then not blame the developer of the word processor, but the poor formatting guy, because the root of the problem is unknown to the author.

Quote:

Originally Posted by Toxaris

I like the idea, but I think you are too optimistic. However, if I can help to improve things, I probably will.

Well, do you have any needs for your own projects? I'm mostly driven by my own personal need, currently just small "book" projects. But over time, I hope to provide more and more general-purpose processing tools, which could be used by self-publishers or to set up an (online?) service. On the one hand, it's a lot of work and won't be sufficient for all kinds of uses at first; on the other hand, once a solution is implemented, a lot of texts can be processed with it. The problem of getting good semantic XML will still need to be addressed, but that's exactly why I was wondering if Sigil could be used for it (to let the author do the semantic markup of his text with Sigil if he failed to do it right in the first place, and then take the prepared EPUB (XHTML) file from Sigil as input for an automated processing system). But there are also alternative ways to get a semantic XML/XHTML file from the author; one could be to write a JavaScript-based online/offline text editor for semantic editing. Currently, I write semantic XHTML myself as input for conversion to EPUB, but as OpenOffice (therefore LibreOffice too, I assume) is already capable of valid, semantic XHTML output, I should probably work on a way to educate the author (video tutorial), a website to provide this education, a list of style names to use in OpenOffice, an upload form for the author to submit OpenOffice XHTML output on the mentioned website, and a schema to check if the uploaded file matches the expected style names, so that the file could then be automatically processed to EPUB, and later to PDF. I know how this description reads, but existing free software would provide shortcuts, the development could be done collectively as free software, and over time the system would expand, so it could become a real option for self-publishers that would reduce manual labor for authors, formatters and developers.
Maybe it would not be in the scope of the website, but depending on the interfaces, theoretically, somebody could distribute the prepared files from there directly to online e-book shops and print-on-demand services. Being built as and with free software, that system would not be an online service by some provider, but could be set up by anybody, online or offline. The free software license would make sure that every improvement is available to everybody else, so essentially a community would work together instead of competing against each other. I myself don't necessarily need such a large system; I'm glad to develop my own little system to use for my book projects and maybe for people I work with, and if it grows beyond that because my results are freely licensed, fine. In any case, I'm interested in whether somebody else is doing something similar with free software, and whether there could be a joint effort to provide a common solution for a larger audience.

The HTML output of Microsoft Word can't be valid against any schema, since HTML isn't XML. Furthermore, as far as I know, there is no schema for HTML4; validation is done by DTD. I looked at the so-called "XHTML" output of a recent Word version, and it wasn't even well-formed.

I never said HTML is XML, because it isn't. It is a different language altogether. The only thing they have in common is the angle brackets... They can be combined and we call that XHTML, more or less.
However, you are making a mistake here. Word does NOT output XHTML, nor does it make that claim. It can output HTML (in two flavours), XML (again in two flavours) and DOCX. Of course there are more formats, but let's ignore them for now.
The HTML output is valid HTML 4.01 by default. The problem most people have with it is that it is full of code to make sure the output in a browser resembles the original document AND that it can be understood by Word upon importing to make it a Word document again. It does that well enough; that it is not practical for subsequent processing is another story. That is also not its purpose.
The XML output is valid XML. The structure used is described in detail on the various Microsoft websites. It has the same premise as the HTML output, that it must be understood by Word upon importing. That makes it less valuable for semantics. To give a short example of where issues will arise: let's say I make a word italic. In the code, <w:i /> (amongst other things) will be used to identify that it is italic. Now, when I create a style that applies italic, that code will not be there, only the code to apply the style. From Word's perspective that makes sense, since italic is embedded in the style. From a semantic point of view it makes things a whole lot more difficult (the same applies to the HTML output btw). It also makes it a whole lot harder to map to other XML schemas.
I mention the docx format because that is essentially the same as the XML, only divided into multiple structured files in a container.
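
To make the italic example concrete, here is a minimal Python sketch (an illustration only, not how the add-in works) of deciding whether a run is italic when the italic may come either from direct formatting or from a character style. It assumes the standard WordprocessingML element names (w:rPr, w:i, w:rStyle, w:style, w:basedOn); the function names are hypothetical, and paragraph styles, document defaults and Word's toggle behaviour are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(el):
    """Return the w:val attribute of an element, or None if absent."""
    return el.get(f"{{{W}}}val")


def italic_flag(rpr):
    """True/False if this w:rPr sets italic explicitly, None if it says nothing."""
    if rpr is None:
        return None
    i = rpr.find("w:i", NS)
    if i is None:
        return None
    # <w:i/> without w:val means "on"; w:val="0"/"false"/"off" means "off"
    return _val(i) not in ("0", "false", "off")


def build_style_table(styles_root):
    """Map styleId -> (italic flag, parent styleId) from a w:styles element."""
    table = {}
    for style in styles_root.findall("w:style", NS):
        sid = style.get(f"{{{W}}}styleId")
        based = style.find("w:basedOn", NS)
        parent = _val(based) if based is not None else None
        table[sid] = (italic_flag(style.find("w:rPr", NS)), parent)
    return table


def style_is_italic(table, sid):
    """Walk the basedOn chain until some style states italic on or off."""
    seen = set()
    while sid is not None and sid in table and sid not in seen:
        seen.add(sid)  # guard against circular basedOn references
        flag, parent = table[sid]
        if flag is not None:
            return flag
        sid = parent
    return False


def run_is_italic(run, table):
    """Direct formatting wins; otherwise fall back to the run's character style."""
    rpr = run.find("w:rPr", NS)
    direct = italic_flag(rpr)
    if direct is not None:
        return direct
    rstyle = rpr.find("w:rStyle", NS) if rpr is not None else None
    return style_is_italic(table, _val(rstyle)) if rstyle is not None else False
```

The point of the sketch is only that the italic information is present in the files, but it has to be looked up through the style chain rather than read off the run itself.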

Quote:

Originally Posted by skreutzer

Well, you're right, but only up to a certain point. I know that some writers prefer to shoot themselves in the foot. Without doubt, I wouldn't even try to convince them to get good output files, with less or no manual work to produce e-books and PDFs, because they really like bad output files, much manual work and crappy e-books as well as crappy print results. For all the other writers, in case of a word processor, I would bring up a setup wizard at first start of the program, inform the user why and how he has to use styles, let him define several styles, and then work on the text. In general, not much would be different, except you couldn't just select a font or a font size. Even if font selection (and similar GUI components) would persist, you could change the font, but would be asked which style you're currently editing or if you want to create a new style, or you would automatically change the font at all portions of the text which are marked with the style that is currently selected. So there would be little difference for the writer (additionally, I would assume a writer writes text, while you assume that a writer does typesetting).

So basically you are suggesting creating YAWP (Yet Another Word Processor) that works semantically and persuading all writers to use that one instead of the ones they are accustomed to, like Word, OpenOffice, WordPerfect, etc. There is no way you will get those corporations to change their export to your liking, however sane it may be.

Quote:

Originally Posted by skreutzer

From the description of your add-in on the websites linked in your signature, it looks like you're throwing out all direct formatting, but retain semantic style markup. So I wonder why you don't agree that a word processor should encourage semantic style markup and disable direct formatting, since the latter is obviously useless for all other software except the word processor itself.
...

Oh, but I do agree in part. I do not think that disabling direct formatting would be a wise decision. It only is when the document is the first step in a process. If the document is also the end state, there is no reason to disable it.

Quote:

Originally Posted by skreutzer

Well, do you have any needs for your own projects?
...
Currently, I write semantic XHTML myself as input for conversion to EPUB, but as OpenOffice (therefore LibreOffice too, I assume) is already capable of valid, semantic XHTML output, I should probably work on a way to educate the author (video tutorial), a website to provide this education, a list of style names to use in OpenOffice, an upload form for the author to submit OpenOffice XHTML output on the mentioned website, and a schema to check if the uploaded file matches the expected style names, so that the file could then be automatically processed to EPUB, and later to PDF.

No, not really. For my own work the add-in works fine and I made it available for others to use in case they would find it useful. It is a real time-saver for me and the results are much, much cleaner.
I tried to work with OpenOffice, but it is just not for me. I miss several features (not for ePUB creation) and don't like the interface. I also do not like the output to be honest and I am not the only one. There is a reason why there is also a program to take the output from OpenOffice to prepare it for ePUB. I believe it is called ePUBWriter.

Well, you're right, but only up to a certain point. I know that some writers prefer to shoot themselves in the foot. Without doubt, I wouldn't even try to convince them to get good output files, with less or no manual work to produce e-books and PDFs, because they really like bad output files, much manual work and crappy e-books as well as crappy print results. For all the other writers, in case of a word processor, I would bring up a setup wizard at first start of the program, inform the user why and how he has to use styles, let him define several styles, and then work on the text.

And, this is where you lost me. I've reviewed, analyzed, quoted, and discussed nearly 3-4,000 manuscripts over the past 5 years, mostly the last 4. Would you like me to tell you precisely--precisely--how many authors went back and cleaned up their manuscripts, after I gave them a) tutorials, b) manuals, c) good economic reasons to do it and d) detailed instructions, on items ranging from styles to broken paragraphs? Go ahead and ask me. Because I'll tell you, and here's a hint: the answer does not have two syllables. Out of ALL OF THOSE manuscripts, of which, over 90% needed some type of cleaning, styling, and of which, nearly 10% or more had myriad problematic broken paragraphs. (And don't get me STARTED on trying to get a clean, proofed manuscript from a publisher that's had a book scanned and OCR'ed!!!).

You are more than welcome to give this idea a go, but trust me when I tell you: given that there are dozens of word processors out there that can already do this, for all intents and purposes, why would the people who ALREADY won't do this, do it with yours?

Quote:

In general, not much would be different, except you couldn't just select a font or a font size. Even if font selection (and similar GUI components) would persist, you could change the font, but would be asked which style you're currently editing or if you want to create a new style, or you would automatically change the font at all portions of the text which are marked with the style that is currently selected. So there would be little difference for the writer (additionally, I would assume a writer writes text, while you assume that a writer does typesetting).

No...you're assuming that the writer is going to format the text. You've just SAID so. An author sits there and decides that they want to create a "text message" style for text messages from his protagonist to someone else, so s/he hits the tab key. In your scenario, you're going to, at that moment, force them to make all these styling decisions, while they, to their minds, are in full artistic and creative flow? Uhhhhhhhhhh....trust me when I say, I can hear the screams now. Why not just have them use Jutoh, instead?

Quote:

From the description of your add-in on the websites linked in your signature, it looks like you're throwing out all direct formatting, but retain semantic style markup. So I wonder why you don't agree that a word processor should encourage semantic style markup and disable direct formatting, since the latter is obviously useless for all other software except the word processor itself. To save the time of the author, who potentially spends time with direct formatting, he could do something useful instead by applying templates, which your add-in could retain.

Or, just use one of the nine bajillion free Word-for-print or Word-for-ebook templates that are already out there, like Guy Kawasaki's. You should try Tox's add-in before you make assumptions about it, it's pretty cool.

Quote:

Also, if the output of your add-in should be used for e-book or print preparation (or as input for an automated processing workflow), the output file needs to be extended with semantic markup, using Sigil. So not only the time of the author is wasted, if he uses direct formatting, also the time of the Sigil person is wasted, who has to do the semantic markup afterwards completely from scratch (in the worst case). In an ideal workflow, the author would do semantic markup with style templates initially (everywhere where he would use direct formatting anyway), all of it would be retained by your add-in, and if some markup still would be missing for preparation of e-book and print creation, the Sigil person would add just the missing markup. The key thing here is that the direct formatting is useless for the writer and the preparation guy in any case (and therefore a waste of time and resources), so a word processor does a bad job by allowing direct formatting. The developer of the word processor just gets away with it, because the author will find out about the consequences when it is far too late, and then not blame the developer of the word processor, but the poor formatting guy, because the root of the problem is unknown to the author.

Yeah, but: you have this view that the author WANTS to know. Now, obviously, the authors I know are those that don't want to know, but you don't have to spend very many days on the KDP forums to find out that basically: they don't want to know. Hear this: they would rather use the dreaded Smashwords "nuclear method" (clear all formatting) than learn to use Styles. I say this, and I hold it to be true, because in 5 years--FIVE--I've had ONE author ask me to teach him how to use Styles. ONE. Out of at least Three, more likely Four THOUSAND with whom I've corresponded in detail about their manuscripts. Work those odds.

Quote:

Well, do you have any needs for your own projects? I'm mostly driven by my own personal need, currently just small "book" projects. But over time, I hope to provide more and more general-purpose processing tools, which could be used by self-publishers or to set up an (online?) service. On the one hand, it's a lot of work and won't be sufficient for all kinds of uses at first; on the other hand, once a solution is implemented, a lot of texts can be processed with it. The problem of getting good semantic XML will still need to be addressed, but that's exactly why I was wondering if Sigil could be used for it (to let the author do the semantic markup of his text with Sigil if he failed to do it right in the first place, and then take the prepared EPUB (XHTML) file from Sigil as input for an automated processing system). But there are also alternative ways to get a semantic XML/XHTML file from the author; one could be to write a JavaScript-based online/offline text editor for semantic editing. Currently, I write semantic XHTML myself as input for conversion to EPUB, but as OpenOffice (therefore LibreOffice too, I assume) is already capable of valid, semantic XHTML output, I should probably work on a way to educate the author (video tutorial), a website to provide this education, a list of style names to use in OpenOffice, an upload form for the author to submit OpenOffice XHTML output on the mentioned website, and a schema to check if the uploaded file matches the expected style names, so that the file could then be automatically processed to EPUB, and later to PDF. I know how this description reads, but existing free software would provide shortcuts, the development could be done collectively as free software, and over time the system would expand, so it could become a real option for self-publishers that would reduce manual labor for authors, formatters and developers.
Maybe it would not be in the scope of the website, but depending on the interfaces, theoretically, somebody could distribute the prepared files from there directly to online e-book shops and print-on-demand services. Being built as and with free software, that system would not be an online service by some provider, but could be set up by anybody, online or offline. The free software license would make sure that every improvement is available to everybody else, so essentially a community would work together instead of competing against each other. I myself don't necessarily need such a large system; I'm glad to develop my own little system to use for my book projects and maybe for people I work with, and if it grows beyond that because my results are freely licensed, fine. In any case, I'm interested in whether somebody else is doing something similar with free software, and whether there could be a joint effort to provide a common solution for a larger audience.

Nobody here, software-wise, is competing. All the products, software, etc., that have been discussed here, whether Calibre, Sigil, Tox's add-ins, add-ons or macros, etc., are all OS and donorware. That's it. But there's a realism factor, as well...maybe I am jaded. In fact, I'd bet money on it. I once really, really TRIED to get into XML-->XSLT and just couldn't get there from here, as previously discussed. If something comes along that authors will adopt, OR allows me to easily convert/channel what authors REALLY do into XML, great. I'm all for it. I just don't...I don't FEEL it yet.

Hitch

To be fair, my add-in is not OS at this time. It may become open source later, but at this time not. I am more than willing to help anybody with additional options or functions, but not the code as of yet. The usage is and will remain free. There are no restrictions in its usage.

That said, I have done my share of XML conversions with XSLT. In fact, that was also my first idea with regard to creating clean (X)HTML for ePUB. That went out the window very fast, since it would cripple the result to an undesirable degree. Too much could not be converted with XSLT. I don't really care too much about lists and tables with col/rowspans, but whole pieces of formatting (like bold/italic) could get lost if they are part of a style (as I mentioned before). So, that is why I decided to do it differently.
I actually revisited the idea with OpenXML conversion and ran up against the same limitations. No way to solve the inheritance of certain formatting in styles. Is it the fault of Microsoft? No, not really, since the information is there and from their point of view it is perfectly logical. They cannot solve it within the current specification of their OpenXML definition. It would save me a lot of work, but they do not have any need for it. They would rather build in ePUB exporting capabilities first.
I know that there have been many requests for exports from Word that are clean, but there are also difficulties there. I am already thinking, for future developments of my add-in, of trying to create a basic stylesheet based upon the layout in Word. Simple stuff like indents, centering and the like. I don't know if that will happen, but I am thinking about it. It has some quite serious impacts and I do not know if there is a need for it.

To be fair, my add-in is not OS at this time. It may become open source later, but at this time not. I am more than willing to help anybody with additional options or functions, but not the code as of yet. The usage is and will remain free. There are no restrictions in its usage.

Yes, of course, sorry, I misspoke. I meant, your add-in is not for sale as a commercially competitive product, fighting for dollars against other commercial products. My bad!

Quote:

That said, I have done my share of XML conversions with XSLT. In fact, that was also my first idea with regard to creating clean (X)HTML for ePUB. That went out the window very fast, since it would cripple the result to an undesirable degree. Too much could not be converted with XSLT. I don't really care too much about lists and tables with col/rowspans, but whole pieces of formatting (like bold/italic) could get lost if they are part of a style (as I mentioned before). So, that is why I decided to do it differently.

Ditto. And Tox is very familiar with my own limitations; I'm not the hardcore guy that he is. It was just too much bloody WORK.

I never said HTML is XML, because it isn't. It is a different language altogether. The only thing they have in common is the angle brackets... They can be combined and we call that XHTML, more or less.
However, you are making a mistake here. Word does NOT output XHTML, nor does it make that claim. It can output HTML (in two flavours), XML (again in two flavours) and DOCX. Of course there are more formats, but let's ignore them for now.
The HTML output is valid HTML 4.01 by default. The problem most people have with it is that it is full of code to make sure the output in a browser resembles the original document AND that it can be understood by Word upon importing to make it a Word document again. It does that well enough; that it is not practical for subsequent processing is another story. That is also not its purpose.
The XML output is valid XML. The structure used is described in detail on the various Microsoft websites. It has the same premise as the HTML output, that it must be understood by Word upon importing. That makes it less valuable for semantics.

You are right, I just assumed Microsoft Word XHTML output in error. I recently got the chance to look at the Word 2010 output on somebody else's computer, so your list of supported XML-related formats seems to be accurate. Based upon this information, the current situation is more or less that the only usable direct output of Word in XML form is Word XML, so one would have to get rid of Word-specific additions (such as spell checking information), and then it still wouldn't be semantic, where the latter problem is the crucial one for automated processing. Other applications solve the issue of unreadable output (since it is supposed to be read again by the same application in order to restore the original document) by providing a "Save as" option to an intermediate format (lossless export/import) and an "Export" option to an end format (irreversible loss of information on export, probably just a basic import).
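
As a rough illustration of "getting rid of Word-specific additions", the following Python sketch removes proofing markers from a WordprocessingML tree. It assumes that spell and grammar checking information is stored as w:proofErr elements; the function name strip_proofing is hypothetical, and other Word-specific noise (revision IDs, bookmarks, and so on) is not handled.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def strip_proofing(root):
    """Remove every w:proofErr element below root; return how many were removed."""
    removed = 0
    # Snapshot the element list first so removal does not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{W}}}proofErr":
                parent.remove(child)
                removed += 1
    return removed
```

This only removes clutter. As said above, the result is still not semantic; it is merely easier to read and to process further.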

Quote:

Originally Posted by Toxaris

To give a short example of where issues will arise: let's say I make a word italic. In the code, <w:i /> (amongst other things) will be used to identify that it is italic. Now, when I create a style that applies italic, that code will not be there, only the code to apply the style. From Word's perspective that makes sense, since italic is embedded in the style. From a semantic point of view it makes things a whole lot more difficult (the same applies to the HTML output btw). It also makes it a whole lot harder to map to other XML schemas.

None of this is a problem at all; instead, it is exactly the semantic markup I'm in favour of. For processing software, <w:i/> would mark text portions as "being of the same kind", so the user would be able to select what <w:i/> means and/or how it should be represented. All text in <w:i/> could be output as italic, as bold or red, or not output at all. Unfortunately, it is very likely that <w:i/> is used to encode visual information in a semantically indifferent way, which is the fault of the Word GUI, so that the visual appearance of "italic" will get applied to a lot of text which semantically should not always be represented as the same <w:i/>, but probably as separate elements such as <w:important/>, <w:emphasis/>, <w:special_term/> or whatever, so that text marked this way could be handled separately by software which reads the XML output. However, the user will soon find out that the wrong paradigm of visual markup (since a writing program shouldn't try to do typesetting at the stage of writing) instead of semantic markup results in less usable output for a processing workflow, so in order to automatically get EPUBs, PDFs or whatever from the input file, the markup of <w:i/> has to be redone as separate, differentiating markup of meaning, not of visual appearance (while the markup of meaning can still have a visual appearance attached to it, so WYSIWYG will always remain in place).

Regarding your description of styles, that is the ideal solution to the problem. If the style definition is hard to read and apply (if it isn't XML, as CSS isn't XML and therefore would require the reading software to parse CSS), a user would have to define the style again with the same or a similar visual appearance for the processing software, or a converter/parser would transform the Word/CSS style definition into something that the processing software could read and apply. However, my processing workflow would most likely expect some style like "emphasis", and would either always apply the same visual appearance to it for PDF output, no matter what the visual appearance was in the word processor, or let the user decide which visual appearance for "emphasis" is preferred. Something like <p class="MsoNormal"> is perfectly fine; the user would, corresponding to the description of <w:i/> above, just define what default text should look like in EPUB output, in PDF output, in SQL output, in whatever output, since there could indeed be different requirements.
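
A minimal Python sketch of that idea, under stated assumptions: the text has already been split into (style name, text) segments, and each target format has its own user-editable table saying how a style is rendered. The style names, the RENDERING table and the render function are all hypothetical, and escaping of special characters for each target is left out to keep it short.

```python
# Per-target rendering rules: style name -> (opening markup, closing markup).
# A user could edit this table without touching the text itself.
RENDERING = {
    "epub": {
        "emphasis": ("<em>", "</em>"),
        "special_term": ('<span class="term">', "</span>"),
    },
    "latex": {
        "emphasis": (r"\emph{", "}"),
        "special_term": (r"\textsc{", "}"),
    },
    "plain": {
        "emphasis": ("*", "*"),
    },
}


def render(segments, target):
    """Render (style, text) segments for one target; unknown styles pass through."""
    rules = RENDERING[target]
    parts = []
    for style, text in segments:
        opening, closing = rules.get(style, ("", ""))
        parts.append(f"{opening}{text}{closing}")
    return "".join(parts)
```

The same segments, for example `[(None, "He said "), ("emphasis", "no"), (None, ".")]`, can then be passed to `render` once per target format, which is the "mark up once, output many times" workflow described above.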

Just to complete this overview: the worst case would be direct formatting like <p style="font-size: 11pt; font-family: Arial; color: rgb(255, 0, 0)">. There is no information encoded here which would tell a program if the <p> is supposed to be handled the same way as or differently from other <p>s; instead, they just look the same, on purpose or accidentally. Some <p>s should probably be handled the same way, even if their visual appearance is different, and other <p>s should probably be handled differently, even if their visual appearance is the same. There are also unnecessary dependencies introduced, such as the ability to parse CSS, to know what an alternative for the "Arial" font could be (if not available for the target format), and to interpret RGB color codes (if the target format or software only supports strings like "black", "blue", "red").
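
A check for exactly this worst case could look roughly like the Python sketch below: it walks a well-formed XHTML file and reports every element that carries a style attribute (direct formatting) or a class name outside an agreed list. The list EXPECTED_CLASSES and the function check_markup are hypothetical; a real file with named entities such as &nbsp; would need a parser that knows the XHTML DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of style names the processing workflow knows about.
EXPECTED_CLASSES = {"MsoNormal", "emphasis", "special_term"}


def check_markup(xhtml_text, expected=EXPECTED_CLASSES):
    """Return a list of human-readable problems found in the XHTML text."""
    problems = []
    root = ET.fromstring(xhtml_text)
    for el in root.iter():
        tag = el.tag.split("}")[-1]  # drop the namespace part of the tag name
        style = el.get("style")
        if style is not None:
            problems.append(f"<{tag}> has direct formatting: {style!r}")
        for cls in (el.get("class") or "").split():
            if cls not in expected:
                problems.append(f"<{tag}> uses unknown style name: {cls}")
    return problems
```

An empty result would mean the file only uses style names the workflow can map; anything else could be handed back to the author for correction.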

Quote:

Originally Posted by Toxaris

So basically you are suggesting creating YAWP (Yet Another Word Processor) that works semantically and persuading all writers to use that one instead of the ones they are accustomed to, like Word, OpenOffice, WordPerfect, etc. There is no way you will get those corporations to change their export to your liking, however sane it may be.

Basically yes, but I wouldn't necessarily persuade all writers, just show them the benefits, so that they can't complain if they again wasted time and/or money on something which could have been done right initially. I also know that word processors won't enforce semantic markup and disable direct formatting. However, I'm convinced that a tool is needed (which I would use as a front end for automated processing workflows) which would allow applying semantic markup (template styles) to plain text (either by plain text input initially or by stripping away all direct formatting while importing) and output semantic XML, be it XHTML or "custom XML". OpenOffice does this very well (could be improved by fewer clicks for defining and applying styles, and import/export of styles if not already present), but also allows direct formatting, so there is always the risk of some direct formatting being used here and there. Sigil does this a little bit, which could be expanded. LyX does this, but it won't be released soon and is intended for much more advanced documents than just plain text. I myself can write semantic XHTML by hand, but I would like to have an application I could point people to, which they could use if they don't know XHTML, and whose GUI would also prevent them from falling into the trap of direct formatting. I would provide them with the style template definition for import which my processing workflow would recognize. And even if the people I work with don't like to write in the application I pointed them to, I could still let them write in whatever they want, export it as plain text, and let them apply styles afterwards in the semantic markup application, let them find people who apply the styles for them, or apply the styles myself using the application. Such a setup could be used by other people as well, who would then define their own style definitions to match their own processing workflows.

Quote:

Originally Posted by Toxaris

Oh, but I do agree in part. I do not think that disabling direct formatting would be a wise decision. It only is when the document is the first step in a process. If the document is also the end state, there is no reason to disable it.

If it is in its end state and you want direct formatting for microtypography or something like it, you shouldn't apply it to the original text from the "writer mode" step, but to the text of a "typesetting mode" step, where you will lose all direct formatting if you change something in the original text for the affected part of the document (maybe limited to the chapter, if the changes don't affect the chapters before or after the affected one).

Quote:

Originally Posted by Hitch

And, this is where you lost me. I've reviewed, analyzed, quoted, and discussed nearly 3-4,000 manuscripts over the past 5 years, mostly the last 4. Would you like me to tell you precisely--precisely--how many authors went back and cleaned up their manuscripts, after I gave them a) tutorials, b) manuals, c) good economic reasons to do it and d) detailed instructions, on items ranging from styles to broken paragraphs? Go ahead and ask me. Because I'll tell you, and here's a hint: the answer does not have two syllables. Out of ALL OF THOSE manuscripts, of which, over 90% needed some type of cleaning, styling, and of which, nearly 10% or more had myriad problematic broken paragraphs. (And don't get me STARTED on trying to get a clean, proofed manuscript from a publisher that's had a book scanned and OCR'ed!!!).

Well, Hitch, so if you've reviewed, analyzed, quoted and discussed nearly 3-4,000 manuscripts without the chance of getting them into good shape initially or after educating the authors, and especially if you (!) do the cleaning, then you're not the back end of a processing workflow, you're the front end. I guess people pay you to do the cleaning? So then it is your job to provide a processing system with usable semantic input. It's you who decides if you want to do it multiple times by manual adjustments for every target format, or if you want to do it once by automated processing for all target formats. Without question there are situations like OCRed text, where it is impossible to technically ensure semantic markup, and also some authors will refuse to provide clean input in the first place. I don't object to the situation that writers don't care about the benefits of automated processing because they prefer to waste time (by doing your job on their own) or waste money (by paying you to do the semantic markup for them), but I want to have software available which would allow the writers to do this the right way initially if they care, software which will make your job easier and less time-consuming, and from which the automated processing workflows of whoever sits at the front end will benefit. That said, some word processors tend to worsen the situation, but I too see no way they could be brought to change their minds and provide a better solution.

As for the topic of my initial question, I would really like to hear how you clean the input from authors in terms of structure, and if/how the added structural information is used for later automated processing.

Quote:

Originally Posted by Hitch

No...you're assuming that the writer is going to format the text. You've just SAID so. An author sits there and decides that they want to create a "text message" style for text messages from his protagonist to someone else, so s/he hits the tab key. In your scenario, you're going to, at that moment, force them to make all these styling decisions, while they, to their minds, are in full artistic and creative flow? Uhhhhhhhhhh....trust me when I say, I can hear the screams now. Why not just have them use Jutoh, instead?

No, quite the opposite. I suggest that they should concentrate only on writing the text, and do the formatting (at their option) later. But in case they wish to do the formatting on the fly, they should be able to apply styles just as they apply direct formatting now. However, I would prevent them from "creating a text message style" without applying a style template for it, and I would also prevent the use of the tab key. This is the point where a writer decides whether he wants to shoot himself in the foot or not. If he decides to do so, all of it will need to be redone at a later stage, without question.

Quote:

Originally Posted by Hitch

Or, just use one of the nine bajillion free Word-for-print or Word-for-ebook templates that are already out there, like Guy Kawasaki's. You should try Tox's add-in before you make assumptions about it, it's pretty cool.

Two separate templates for print and e-book already indicate that you have to do the work twice, and update the files twice if you change something in the text at a later stage. Also, this can't be automated. I can't use the Word add-in, since Microsoft Word isn't free software, so I'm technically, legally and ethically unable to.

Quote:

Originally Posted by Hitch

Yeah, but: you have this view that the author WANTS to know. Now, obviously, the authors I know are those that don't want to know, but you don't have to spend very many days on the KDP forums to find out that basically: they don't want to know. Hear this: they would rather use the dreaded Smashword's "nuclear method" (clear all formatting) than learn to use Styles. I say this, and I hold it to be true because in 5 years--FIVE--I've had ONE author ask me to teach him how to use Styles. ONE. Out of at least Three, more likely Four THOUSAND with whom I've corresponded in detail about their manuscripts. Work those odds.

In addition to my answer above, I'm not much concerned that they don't want to know. Basically, if they don't care, it will be a waste of _their_ time and/or money, not mine.

Quote:

Originally Posted by Hitch

Nobody here, software-wise, is competing. All the products, software, etc., that have been discussed here, whether Calibre, Sigil, Tox's add-ins, add-ons or macros, etc., are all OS and donorware. That's it. But there's a realism factor, as well...maybe I am jaded. In fact, I'd bet money on it. I once really, really TRIED to get into XML-->XSLT and just couldn't get there from here, as previously discussed. If something comes along that authors will adopt, OR, allows me to easily convert/channel what authors REALLY do into XML, great. I'm all for it. I just don't...I don't FEEL it yet.

Non-free software is always competing, and if not for money, then for user data or control over the users. Except in the case of mere ignorance, if it weren't competing, it would be free software. Even in the case of mere ignorance, non-free software competes for dependency on its program, in opposition to other non-free software or free software. Don't worry, you don't have to get into XML -> XSLT yourself. I guess it would already be sufficient to get a basic processing workflow set up, so one could use and build upon it for the few cases where the flexibility is really needed. Even as a special application, it could have a larger effect for more people if it is used as a service. For instance, if you have to convert all of your books to a future format, or to the format of a company or service you work with, I guess it will always be a better idea to do it automatically instead of manually.

Quote:

Originally Posted by Toxaris

To be fair, my add-in is not OS at this time. It may become open source later, but at this time not. I am more than willing to help anybody with additional options or functions, but not the code as of yet. The usage is and will remain free. There are no restrictions in its usage.

I really appreciate your willingness, but as Microsoft Word isn't free software, there's absolutely no way to freely use your add-in, no matter whether the source code is open or closed. Everybody may get your add-in, with or without code, and do whatever they want with it, but nobody gets Microsoft Word, so you and all users of the add-in are completely dependent on a restrictive, proprietary environment.

Quote:

Originally Posted by Toxaris

That said, I have done my share of XML conversions with XSLT. In fact, that was also my first idea with regard to creating clean (X)HTML for ePUB. That went out the window very fast, since it would cripple the result to an undesired level. Too much could not be converted with the XSLT. I don't really care too much about lists and tables with col/rowspans, but whole pieces of formatting (like bold/italic) could get lost if they are part of a style (as I mentioned before). So, that is why I decided to do it differently.

Oh, yes, absolutely! I would do exactly the same: lose the style information, but keep the style names, to attach visual appearances to them later (maybe the very same style information, if extracted or converted from the original XHTML) - as a general concept for a universal processing workflow with various output formats. If the original CSS definition isn't used to apply the visual appearance to the style names, one could also provide several pre-defined styles to choose from, or define a new one by assigning fonts and so on to the style names found in the XHTML. However, with my html2epub tool I'm going to keep the CSS style definition as it is in the original XHTML and just copy it. If it didn't render properly in the original XHTML, then it won't in the EPUB either, but that's not my fault, it's the fault of the application that wrote the XHTML (anyway, one might want to fix the CSS and/or apply one's own CSS definition to the CSS style names in the semantic markup). Since html2epub can be used standalone without any processing workflow (all parts of a workflow should always be only loosely linked together), its output will in future just provide the same visual appearance as the XHTML in a browser. The exception is the limitations imposed by an e-reader, which indeed will take away a lot of what could be part of the visual appearance, but that's not the fault of the XHTML, the EPUB or the word processor. It is the fault of the e-reader's CSS implementation, which is, as far as I know, done by porting a browser onto the e-reader device. Further, there are things which can't be done with XSLT, but in such cases I solve it programmatically in a real programming language, not in a transformation specification language.
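To make the "drop the style information, keep the style names" idea a bit more concrete, here is a rough Python sketch. It is not how html2epub or Toxaris' add-in actually work. The function names `strip_styles_keep_names` and `build_css`, and the `theme` dictionary, are made up for illustration, and it assumes the input is well-formed XHTML in the usual XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_styles_keep_names(xhtml):
    """Remove <style> elements and inline style attributes, keep class names.

    Hypothetical helper: returns the cleaned XHTML as a string plus the
    sorted list of style (class) names that were found in the document.
    """
    ET.register_namespace("", XHTML_NS)  # serialize without an ns0: prefix
    root = ET.fromstring(xhtml)
    names = set()
    # Take a snapshot first so the tree is not modified while being walked.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{XHTML_NS}}}style":
                parent.remove(child)  # embedded stylesheet: appearance only
    for element in root.iter():
        element.attrib.pop("style", None)  # inline appearance goes away
        names.update(element.get("class", "").split())  # the name stays
    return ET.tostring(root, encoding="unicode"), sorted(names)


def build_css(names, theme):
    """Attach an appearance to each style name from a theme dictionary.

    Hypothetical helper: names without an entry in the theme get an empty
    rule, so someone can fill them in by hand later.
    """
    rules = []
    for name in names:
        declarations = theme.get(name, {})
        body = " ".join(f"{prop}: {value};" for prop, value in declarations.items())
        rules.append(f".{name} {{ {body} }}")
    return "\n".join(rules)


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    "<style>p{color:red}</style></head><body>"
    '<p class="dialogue" style="color:red">Hi</p>'
    '<p class="narrative">There</p></body></html>'
)
clean, names = strip_styles_keep_names(sample)
css = build_css(names, {"dialogue": {"font-style": "italic"}})
```

The point of splitting it in two is that the cleaned XHTML only has to be produced once, while `build_css` can be run with a different theme for each output format.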

I suspect that my view of the possibility of this functioning, widely, is utterly jaded by my own experiences, and I am thus just mostly talking to myself, to hear myself talk. I don't like this at the best of times, so I'm just going to bug out of this discussion. I don't genuinely think I have anything to truly add, other than cynicism, so ...that's not helpful.