Details

Description

I've done quite a bit of digging in the web site and the mailing lists to track down this issue, and there do seem to be some issues around white-space handling. What I'm not clear on is whether the already-documented issues around white-space handling match the behavior I'm seeing, so I'll log this bug to explain.

Except for a few peculiar cases, the behavior documented in the specification regarding white-space-treatment does seem to be implemented correctly, but I'm wondering if the specification is either misinterpreted or wrong to begin with.

I'll attach some FO/PDF examples to explain better, but the bottom-line issue is that there does not appear to be a way to get spaces at the beginning of a line to preserve while not creating hanging indents for long blocks that have formatter-generated line feeds. The "ignore-if-surrounding-linefeed" property value obviously solves the hanging indent problem but also prevents spaces at the start of a block from preserving.

Curiously, if I add inline children to the block the treatment of the whitespace is different, and further, if I wrap the various blocks with a single parent block then it changes the whitespace treatment again. Please see attachments to see what I'm talking about.

I've tried this in both FOP 0.95beta and FOP Trunk with the same results.

To summarize, I see 3 questionable items:
1. Shouldn't the whitespace_without_wrapping_block.pdf match the whitespace_with_wrapping_block.pdf?
2. In whitespace_without_wrapping_block.pdf, is the behavior of Example 2 correct where whitespace is preserved inside inline elements even when whitespace-treatment != "preserve"?
3. In whitespace_without_wrapping_block.pdf, is there a way to get Example 1 behavior and Example 5 behavior with the same block property settings (to prevent Example 4 behavior)?

Andreas L. Delmelle
added a comment - 29/May/08 10:12 (In reply to comment #0)
<snip />
Thanks for the extensive report, and the testcases!
> To summarize, I see 3 questionable items:
> 1. Shouldn't the whitespace_without_wrapping_block.pdf match the
> whitespace_with_wrapping_block.pdf?
Confirmed. Something is definitely wrong here.
> 2. In whitespace_without_wrapping_block.pdf, is the behavior of Example 2
> correct where whitespace is preserved inside inline elements even when
> whitespace-treatment != "preserve"?
No, this is definitely a bug. The behavior seems to be wrong in both cases.
The result should be identical to Example 3, only with additional borders.
Example 1 is also incorrect with the wrapping block. The trailing whitespace on the last line should definitely be preserved.
Technically:
XMLWhiteSpaceHandler does not seem to properly remove the leading/trailing spaces in the inlines due to an implicit start-of-block/end-of-block in Example 2.
On the one hand the afterLinefeed member is not correctly set when handleWhiteSpace() is entered the first time for the surrounding block. Easily fixed.
On the other hand, the pendingInlines are not processed when handleWhiteSpace() is entered the second time for that block, when the block ends. Slightly more complicated, but still quite straightforward.
I'll look into a fix for this soon. For now, the correct behavior in whitespace_without_wrapping_block can be simulated by adding a space character before and after the inline. In that case, white-space removal is properly triggered and you get the correct result.
> 3. In whitespace_without_wrapping_block.pdf, is there a way to get Example 1
> behavior and Example 5 behavior with the same block property settings (to
> prevent Example 4 behavior)?
>
Not sure if I'm following here... Can you clarify? Do you wish to override the behavior of the first /and/ the last line? I know the XSL-FO specification defines fo:initial-property-set to affect only the first line-area generated by an fo:block, but FOP does not implement this yet.

Sean Griffin
added a comment - 29/May/08 11:47 (In reply to comment #4)
> (In reply to comment #0)
>
> Thanks for the extensive report, and the testcases!
Glad I could help!
> > 3. In whitespace_without_wrapping_block.pdf, is there a way to get Example 1
> > behavior and Example 5 behavior with the same block property settings (to
> > prevent Example 4 behavior)?
> >
>
> Not sure if I'm following here... Can you clarify? Do you wish to override the
> behavior of the first /and/ the last line? I know the XSL-FO specification
> defines fo:initial-property-set to affect only the first line-area generated by
> an fo:block, but FOP does not implement this yet.
>
Sorry, I probably wasn't very clear. White-space-preserve is set on both Example 1 and Example 4. The behavior of Example 1 was expected but the behavior of Example 4 was not expected...at first. The preservation of the space after each formatter-generated line feed looks funny and I thought it was a bug. But after thinking about it and reading the white-space handling portion of the specification I began to see why it's being done...after all, we are saying to preserve all white space and I saw no mention in the spec that formatter-generated line feeds should replace surrounding space characters. So I removed white-space-preserve to make Example 4 look like Example 5 (what I want), but of course that made Example 1 look like Example 3 (what I didn't want).
Technically this "worked" in FOP 0.20.5, but that's not saying much since it had other problems related to white-space handling. Basically, I don't see anyone wanting the behavior shown in Example 4 (unless they actually put in a text-indent), so I'm questioning if it's truly working as expected.
To explain a little about what I'm doing, I'm wrapping user-entered text in a block, and I want to ensure I keep their formatting. But I appear to be in a catch-22 because if I do that then I also get this "hanging indent" problem for blocks that have more than 1 line area.

Andreas L. Delmelle
added a comment - 29/May/08 12:26 (In reply to comment #5)
>
> Sorry, I probably wasn't very clear. White-space-preserve is set on both
> Example 1 and Example 4. The behavior of Example 1 was expected but the
> behavior of Example 4 was not expected...at first. The preservation of the
> space after each formatter-generated line feed looks funny and I thought it was
> a bug. But after thinking about it and reading the white-space handling
> portion of the specification I began to see why it's being done...after all, we
> are saying to preserve all white space and I saw no mention in the spec that
> formatter-generated line feeds should replace surrounding space characters. So
> I removed white-space-preserve to make Example 4 look like Example 5 (what I
> want), but of course that made Example 1 look like Example 3 (what I didn't
> want).
Yep, either you preserve white-space surrounding linefeeds or you don't.
Note that white-space-treatment (in XSL-FO 1.1 at least) is defined in terms of preserving/discarding glyph-areas for XML white-space characters during line-building. So the preservation is not restricted to spaces surrounding explicit linefeed-characters.
> Technically this "worked" in FOP 0.20.5, but that's not saying much since it
> had other problems related to white-space handling. Basically, I don't see
> anyone wanting the behavior shown in Example 4 (unless they actually put in a
> text-indent), so I'm questioning if it's truly working as expected.
The fact that Example 4 only has preserved spaces at the start of the lines is because all the line-breaks are implicit, and if the line-breaking algorithm has a choice of either breaking before or after a space, it will always break before it. The result is therefore correct, even though the chances of anyone seeking that behavior are very slim. Trailing spaces on a line will normally only appear in case there are also explicit linefeeds or nested blocks, like:
<fo:block white-space-treatment="preserve">
<fo:block linefeed-treatment="preserve">text
text</fo:block>
<fo:block>text <fo:block /> text</fo:block>
</fo:block>
> To explain a little about what I'm doing, I'm wrapping user-entered text in a
> block, and I want to ensure I keep their formatting. But I appear to be in a
> catch-22 because if I do that then I also get this "handing indent" problem for
> blocks that have more than 1 line area.
Actually, it seems like you want to do more than just 'keep the original formatting'. This scenario is very different from the case where one would use preserved linefeeds combined with wrap-option="no-wrap". It seems like you need a mixture of both, since you do seem to be needing formatter-generated linebreaks as well.
Example 1 is also slightly different than Example 5, since it contains nested blocks. Moving white-space-treatment="preserve" to the inner blocks may be an option, but I don't know if that fits in your processing logic.

Sean Griffin
added a comment - 29/May/08 17:08 (In reply to comment #7)
> ...if the line-breaking algorithm
> has a choice of either breaking before or after a space, it will always break
> before it.
Here's where I think our opinions might differ. I believe it should break after the space. I equate the space character's significance in line wrapping the same as a hyphen. Technically I know they are quite different, but functionally, with hyphenation, the break is placed after the hyphen not before, and it seems the same rule should be used with spaces.
To test this theory I first opened up MS Word, turned on the "Show Formatting Marks" option, and typed a few lines of text that Word wraps on its own. The space characters are kept on the line before the wrap as opposed to after.
Since XSL is based off CSS I wondered what happened in internet browsers with HTML, so I tried the same thing there with a span border on a large block of text. Internet Explorer keeps the space at the end of the prior line before the wrap. Firefox trims the space similar to the XSL white-space-treatment="ignore-if-surrounding-linefeed", so it didn't really apply there. Unfortunately CSS doesn't have the level of control over whitespace that XSL does, but it seems the root of the issue isn't white-space-treatment but how the line-areas are created.
I searched in the spec for quite awhile trying to find where it clearly says which line-area gets the whitespace in a wrapping block-area, but I couldn't find it. The closest I found was this, which unfortunately is a little ambiguous.
4.7.2 Line-building
The partitioning occurs at legal line-breaks. Specifically, if A is the last area of Si and B is the first area of Si+1, then the rules of the language, script and hyphenation constraints ... in effect must permit a line-break between A and B, within the context of all areas in Si and Si+1.

Andreas L. Delmelle
added a comment - 30/May/08 00:26 (In reply to comment #8)
> (In reply to comment #7)
> > ...if the line-breaking algorithm
> > has a choice of either breaking before or after a space, it will always break
> > before it.
>
> Here's where I think our opinions might differ. I believe it should break
> after the space.
Oh, but the result /is/ correct, strictly speaking. A bit unexpected, maybe, but definitely not a bug.
The point is well taken though. I've been looking at the related code, and was beginning to wonder...
This is more meant for the layout-specialists, but using simple start-alignment, the sequence currently generated for a preserved space consists of:
a glue
a penalty p=0
a glue
an auxiliary box w=0
a penalty p=INFINITE
a glue
From a higher-level point of view (the LineLayoutManager) a break on the first penalty will always be favored over a break on the second, hence why I think the algorithm chooses to break before the space rather than after. With a preceding and following word, the above sequence would be enclosed by boxes corresponding to those words. If the break /has/ to be somewhere in between the two word-boxes, the preserved space in between always appears at the start of the next line.
Again, not incorrect, but not the most elegantly looking outcome either.
Actually, it's even slightly worse. Given a sequence of those preserved spaces, as many as possible will be placed on the line as trailing white-space. That is: all but the very last one. The zero-penalty appears to be always favored as the last break in the sequence...
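
The effect of the six-element sequence listed above can be reproduced with a small model. This is only a minimal sketch, not FOP's LineLayoutManager or BreakingAlgorithm: it assumes the classic Knuth-Plass convention that glue and penalties directly after a break are discarded up to the next box, it uses 1000 as a stand-in for INFINITE, and it borrows the widths 120096, 3336 and 126768 from the sequence dump quoted elsewhere in this report. All class and function names are hypothetical.

```python
from dataclasses import dataclass

INFINITE = 1000  # assumed stand-in for the "never break here" penalty value


@dataclass(frozen=True)
class Element:
    kind: str          # "box", "glue" or "penalty"
    width: int = 0     # natural width (boxes and glue)
    penalty: int = 0   # penalty value (penalties only)
    label: str = ""


def split_at(elements, index):
    """Break at elements[index] and return (line1, line2).

    The break element itself is consumed; glue and penalties that follow
    it are discarded until the next box (classic Knuth-Plass behavior).
    """
    line1 = elements[:index]
    rest = elements[index + 1:]
    while rest and rest[0].kind != "box":
        rest = rest[1:]
    return line1, rest


# One preserved space between two words, in the order given in the comment.
preserved_space = [
    Element("box", width=120096, label="word a"),
    Element("glue", label="alignment glue"),
    Element("penalty", penalty=0, label="break opportunity"),
    Element("glue", label="cancelling glue"),
    Element("box", width=0, label="auxiliary box"),
    Element("penalty", penalty=INFINITE, label="no break"),
    Element("glue", width=3336, label="preserved space"),
    Element("box", width=126768, label="word b"),
]

# Take the first penalty that permits a break, i.e. the p=0 one.
break_index = next(
    i for i, e in enumerate(preserved_space)
    if e.kind == "penalty" and e.penalty < INFINITE
)
line1, line2 = split_at(preserved_space, break_index)

print([e.label for e in line1])  # ['word a', 'alignment glue']
print([e.label for e in line2])  # starts with the auxiliary box, then the space
```

In this model the zero-width auxiliary box stops the discarding, so the preserved space survives as the first visible thing on the second line, which matches the start-of-line spaces seen in Example 4.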
<snip />
> I searched in the spec for quite awhile trying to find where it clearly says
> which line-area gets the whitespace in a wrapping block-area, but I couldn't
> find it. The closest I found was this, which unfortunately is a little
> ambiguous.
>
Indeed, the rules about where exactly line-breaks are supposed to end up are not defined by XSL-FO itself. FOP uses Unicode UAX#14 ( http://www.unicode.org/reports/tr14/ ) as reference for the most part, which does not explicitly forbid a break before a space (although it is discouraged, IIC)

Andreas L. Delmelle
added a comment - 31/May/08 03:58
FWIW: the issues with Example 1 and Example 2 have been fixed in FOP trunk. In both cases, with or without wrapping block, the result is now as in the attached PDF.

Andreas L. Delmelle
added a comment - 31/May/08 04:00
Fix applied to FOP Trunk.
see: http://svn.apache.org/viewvc?rev=661999&view=rev
I'm keeping the issue open FTM, as a reminder for the dubious/inelegant way of handling preserved white-space around formatter-generated linebreaks. Strictly speaking not a bug, but I agree with the reporter that the current behavior is not really what it should be...

Sean Griffin
added a comment - 26/Oct/08 20:26 I've dusted off the investigation of this issue in an effort to possibly fix it. I've done some research into the Knuth related concepts and read some old mailing list entries from Simon and Manuele around 2006 trying to get my head around the layout manager behavior. I'm still not sure I have it, but I'll take a crack at it.
Given markup like this:
<fo:block><fo:inline>aaaaaaaaaaaaaaaaaa bbbbbbbbbbbbbbbbbbb</fo:inline></fo:block>
Note, there are 2 spaces (0x20) between the a and b.
The BreakingAlgorithm is given a KnuthSequence constructed with these elements:
[box w=120096, glue w=0 stretch=10008 shrink=0, aux. penalty p=0 w=0, glue w=3336 stretch=-10008 shrink=0, box w=126768, penalty p=INFINITE w=0, glue w=0 stretch=216000 shrink=0, penalty p=-INFINITE w=0 (forced break)]
Based on my novice analysis, it appears the various KnuthElements provide the following purposes:
box w=120096 --> string of 'a' characters
glue w=0 stretch=10008 shrink=0 --> handles alignment in case the following possible break is honored?
aux. penalty p=0 w=0 --> possible break
glue w=3336 stretch=-10008 shrink=0 --> 1 character of whitespace
box w=126768 --> string of 'b' characters
Now, if I add white-space-collapse="false" and white-space-treatment="preserve" to the block in the markup above I get this KnuthSequence in the BreakingAlgorithm:
[box w=120096, glue w=0 stretch=10008 shrink=0, aux. penalty p=0 w=0, glue w=0 stretch=-10008 shrink=0, aux. box w=0, aux. penalty p=INFINITE w=0, glue w=3336 stretch=0 shrink=0, glue w=0 stretch=10008 shrink=0, aux. penalty p=0 w=0, glue w=0 stretch=-10008 shrink=0, aux. box w=0, aux. penalty p=INFINITE w=0, glue w=3336 stretch=0 shrink=0, box w=126768, penalty p=INFINITE w=0, glue w=0 stretch=216000 shrink=0, penalty p=-INFINITE w=0 (forced break)]
Which I analyze to provide the following purposes:
box w=120096 --> string of 'a' characters
glue w=0 stretch=10008 shrink=0 --> handles alignment in case the following possible break is honored?
aux. penalty p=0 w=0 + glue w=0 stretch=-10008 shrink=0 --> possible break
aux. box w=0 --> prevents whitespace removal
aux. penalty p=INFINITE w=0 --> disables next glue from being break possibility
glue w=3336 stretch=0 shrink=0 --> 1 character of whitespace on line 1
glue w=0 stretch=10008 shrink=0 --> handles alignment in case the following possible break is honored?
aux. penalty p=0 w=0 + glue w=0 stretch=-10008 shrink=0 --> possible break
aux. box w=0 --> prevents whitespace removal
aux. penalty p=INFINITE w=0 --> disables next glue from being break possibility
glue w=3336 stretch=0 shrink=0 --> 1 character of whitespace on line 2
box w=126768 --> string of 'b' characters
This matches what Andreas shows as the sequence for a preserved space (glue, penalty=0, glue, aux. box w=0, penalty=inf, glue). Is my analysis of each KnuthElement and the purpose it serves correct? I still don't understand how it gets the stretch values that it does, but ignoring that for now, it seems that a possible fix to this undesirable behavior is to move the break possibility from the beginning to the end of the boilerplate sequence. Something like this...
aux. penalty p=INFINITE w=0 --> disables next glue from being break possibility
glue w=3336 stretch=0 shrink=0 --> 1st character of whitespace on line 1
glue w=0 stretch=? shrink=0 --> in case the following possible break is honored?
aux. penalty p=0 w=0 + glue w=0 stretch=? shrink=0 --> possible break
It seems this would move the possible break /after/ the glue with w=3336, thereby keeping the preserved spaces all on the first line and ensuring the next line after the break starts at the left margin.
This is complicated stuff, so I apologize if I have it all wrong, but I know it's just a matter of time before my clients complain about this behavior, and I'd like to help fix the issue if possible before it blows up.
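
To make the proposed reordering concrete, here is a rough sketch of what it would do to a single preserved space between the two words. It is a toy model under stated assumptions, not FOP code: breaks are taken at the first penalty below INFINITE (1000 is an assumed stand-in), glue and penalties after the break are discarded up to the next box, and stretch values are left out entirely because the proposal itself leaves them open. The function names are hypothetical.

```python
INFINITE = 1000  # assumed stand-in for the "never break here" penalty value


def lines_after_break(elements):
    """Split at the first penalty below INFINITE.

    Elements are (kind, value, label) tuples; value is the width for
    boxes and glue, and the penalty value for penalties. Glue and
    penalties directly after the break are dropped up to the next box.
    """
    idx = next(
        i for i, (kind, value, _) in enumerate(elements)
        if kind == "penalty" and value < INFINITE
    )
    line1 = elements[:idx]
    line2 = elements[idx + 1:]
    while line2 and line2[0][0] != "box":
        line2 = line2[1:]
    return line1, line2


def natural_width(line):
    """Sum the widths of boxes and glue; penalties contribute nothing here."""
    return sum(value for kind, value, _ in line if kind != "penalty")


# Proposed order: the break opportunity comes after the space glue.
proposed = [
    ("box", 120096, "word a"),
    ("penalty", INFINITE, "no break before the space"),
    ("glue", 3336, "preserved space"),
    ("glue", 0, "alignment glue"),
    ("penalty", 0, "break opportunity"),
    ("glue", 0, "alignment glue"),
    ("box", 126768, "word b"),
]

line1, line2 = lines_after_break(proposed)
print(natural_width(line1))              # 123432 = 120096 + 3336
print([label for _, _, label in line2])  # ['word b']
```

Under these assumptions the space stays at the end of the first line and the second line starts directly with the next word, which is the behavior the proposal is after. Whether the real algorithm would also pick this break depends on the stretch values, which this sketch does not model.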

Manuel Mall
added a comment - 26/Oct/08 21:38 Sean,
I haven't analysed what you wrote on the Knuth sequences but it may be worthwhile, unless you have done it already, to compare it against
http://wiki.apache.org/xmlgraphics-fop/LineBreaking .

> Based on my novice analysis, it appears the various KnuthElements provide the
> following purposes:
<snip />

Entirely correct interpretation.

A box is never a break-possibility, unless when preceded by a penalty indicating one. Glues are always a break-possibility, unless when preceded by a penalty prohibiting one. That's the general idea.
If a glue simply appears in between two boxes, then when it is chosen as the effective break, it dissolves. To generate the effect of preserved spaces or account for alignment other than "justify", one needs a sequence of those elements to represent the different effects (break/no-break).
If a glue is followed by a glue, then the latter becomes the more favorable break. The former could then simply be discarded as a possibility.
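
The rule of thumb above can be written down as a small predicate. This is a sketch of the rule exactly as worded in this comment, not of FOP's BreakingAlgorithm and not of the stricter classic Knuth-Plass rule (where glue only counts when it directly follows a box). INFINITE = 1000 is an assumed stand-in and the function name is hypothetical; the element list is the collapsed-space sequence quoted earlier in this report, reduced to widths and penalty values.

```python
INFINITE = 1000  # assumed stand-in for the "never break here" penalty value


def break_candidates(elements):
    """Return the indices where a break is possible under the rule of thumb.

    elements are (kind, value) tuples: value is the width for boxes and
    glue, and the penalty value for penalties.
    - A penalty below INFINITE indicates a possible break.
    - A glue is a possible break unless the element right before it is a
      penalty that prohibits breaking (value >= INFINITE).
    - A box is never a possible break on its own.
    """
    candidates = []
    for i, (kind, value) in enumerate(elements):
        if kind == "penalty":
            if value < INFINITE:
                candidates.append(i)
        elif kind == "glue":
            prev = elements[i - 1] if i > 0 else None
            prohibited = (
                prev is not None
                and prev[0] == "penalty"
                and prev[1] >= INFINITE
            )
            if not prohibited:
                candidates.append(i)
    return candidates


collapsed_space = [
    ("box", 120096),         # string of 'a' characters
    ("glue", 0),             # alignment glue
    ("penalty", 0),          # possible break
    ("glue", 3336),          # the single collapsed space
    ("box", 126768),         # string of 'b' characters
    ("penalty", INFINITE),   # no break before the filler glue
    ("glue", 0),             # end-of-paragraph filler glue
    ("penalty", -INFINITE),  # forced break
]

print(break_candidates(collapsed_space))  # [1, 2, 3, 7]
```

The filler glue at index 6 is excluded because the infinite penalty in front of it prohibits the break, while the forced break at the end is always a candidate.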

> This matches what Andreas shows as the sequence for a preserved space (glue,
> penalty=0, glue, aux. box w=0, penalty=inf, glue). Is my analysis of each
> KnuthElement and the purpose it serves correct? I still don't understand how
> it gets the stretch values that it does,

A point which has been put into question recently: 10008 is exactly the width of 3 normal spaces, indeed to handle alignment other than "justify", but it has been proven to have nasty side-effects for long blocks with a relatively small line-width (multi-column documents), where three spaces would represent a large portion... The suggestion has been raised to make this a percentage of the line-width, and IIC, we would also need to take into account the font-size.
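
As a quick check of the numbers, the sketch below confirms that the stretch in the dump is three space widths and shows how large that fixed amount is relative to a line. The two line widths are purely hypothetical values chosen for illustration, in the same units as the dump; they do not come from the test files.

```python
SPACE_WIDTH = 3336  # width of one normal space in the sequence dump
STRETCH = 10008     # stretch of the alignment glue in the sequence dump


def stretch_share(line_width):
    """Fraction of the line width taken up by the fixed stretch."""
    return STRETCH / line_width


print(STRETCH == 3 * SPACE_WIDTH)  # True

# Hypothetical line widths: a narrow column and a line three times as wide.
for line_width in (150_000, 450_000):
    print(line_width, f"{stretch_share(line_width):.1%}")
```

Because the stretch is a constant, its share grows as the column gets narrower, which is the side-effect a line-width-relative value would avoid.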

On the one hand, the TextLM optimizes the search for linebreaks by merging words into one single element, not 1 element per character. (Even with hyphenation, we only get one box per hyphenated word-fragment.) In terms of the algorithm, there is no difference between a non-interrupted sequence of fixed-size boxes or a single box spanning the same width. Most elementary representation: one box per regular character, one glue for a space. Since we already know that the letter-boxes will be kept together, we only generate the one box. If hyphenation is enabled, the word-box is later split into multiple boxes, with additional flagged penalties in between.
On the other hand, spaces generate multiple elements for one single space character (and sequences of space-characters are currently not glued together to a single element, IIC).
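
The equivalence between per-character boxes and one merged box per word can be illustrated with a short sketch. It is not the TextLM implementation: it assumes every letter has the same width, 6672, which is what the dump's box widths work out to (120096 / 18 and 126768 / 19), and it ignores stretch, shrink and hyphenation. The function names are hypothetical.

```python
GLYPH_WIDTH = 6672  # assumed per-letter width, inferred from 120096 / 18
SPACE_WIDTH = 3336  # width of one space in the sequence dump


def per_character(text):
    """Most elementary form: one box per letter, one glue per space."""
    return [
        ("glue", SPACE_WIDTH) if ch == " " else ("box", GLYPH_WIDTH)
        for ch in text
    ]


def merged(text):
    """Optimized form: one box per run of letters, one glue per space."""
    elements = []
    for ch in text:
        if ch == " ":
            elements.append(("glue", SPACE_WIDTH))
        elif elements and elements[-1][0] == "box":
            # Extend the current word box instead of adding a new element.
            elements[-1] = ("box", elements[-1][1] + GLYPH_WIDTH)
        else:
            elements.append(("box", GLYPH_WIDTH))
    return elements


text = "a" * 18 + " " + "b" * 19
print(len(per_character(text)))  # 38 elements
print(merged(text))  # [('box', 120096), ('glue', 3336), ('box', 126768)]
```

Both lists have the same total width, and since no break can fall between adjacent boxes, the merged form gives the breaking algorithm the same choices with far fewer elements.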

Looking closer at the Wiki again, I realize that the sequence for a simple preserved space looks surprisingly similar to that of a simple break in case of centered text, apart from the stretch/shrink... and in that case, the trailing glue there is /meant/ to always be pushed to the next line.

> it seems that a possible fix to this undesirable behavior is to move the break
> possibility from the beginning to the end of the boilerplate sequence.

Could indeed very well be the solution. If so, the auxiliary box may not even be needed anymore.
I'll look into it. At any rate, it seems like the sequence should be drastically simplified. Specifying white-space-preserve should not mean that suddenly, it becomes more attractive to break before the space. The break should still be strongly discouraged. In the most elementary case, if a glue is preceded by a box, that condition is easily satisfied.
I think the cases where white-space-preserve really plays a part come down to:
1) white space around preserved linefeeds
2) necessary breaks in the middle of a sequence of non-collapsed white-space

For 1), the solution so far has been to end the current paragraph and start a new one. One TextLM returns a sequence of element-lists to the LineLM.
If a space were simply represented by a glue, it would dissolve higher up. Due to the added auxiliary box, at least the auxiliary glue is preserved and does generate the right effect here.

For 2), I'm thinking of very extreme (and highly unusual) cases, where it becomes necessary to choose 'a' break, but the choice is between white-space characters only. If white-space treatment is "preserve", a portion of white-space should, strictly speaking, be pushed to the next line, and influence alignment there... but ideally, if it all fits on one line, that possibility should obviously be preferred above all else.

Andreas L. Delmelle
added a comment - 25/Nov/08 13:12

(In reply to comment #14)
> (In reply to comment #12)
>
> Sorry to chime in so late...
>
> > Based on my novice analysis, it appears the various KnuthElements provide the
> > following purposes:
> <snip />
>
> Entirely correct interpretation.
>
> A box is never a break-possibility, unless when preceded by a penalty
> indicating one. Glues are always a break-possibility, unless when preceded by a
> penalty prohibiting one. That's the general idea.

I'm afraid this is wrong. You can break only at two places:

1) a penalty element whose penalty value is not infinite; then the width of the penalty must be taken into account.
2) a glue element that's immediately preceded by a box; then you discard the glue's length, shrink and stretch.

Also, when an element is chosen as a breaking point, all the following glue and penalty elements (if any) are discarded up to the next box element. The presence of aux. box w=0 at places in the sequence is meant to prevent the triggering of that mechanism.
See section “Breaking Rules” at the following page: http://wiki.apache.org/xmlgraphics-fop/KnuthsModel
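The rules above can be sketched as follows. This is an illustration only: the element classes and helper names are hypothetical, not FOP's API, and the widths and stretch values in the sample sequence are placeholders. The sample follows the preserved-space sequence quoted earlier (glue, penalty=0, glue, aux. box w=0, penalty=inf, glue), placed between two word boxes.

```python
import math
from dataclasses import dataclass

INFINITE = math.inf


@dataclass
class Box:
    width: int


@dataclass
class Glue:
    width: int
    stretch: int = 0
    shrink: int = 0


@dataclass
class Penalty:
    width: int
    value: float


def legal_breaks(elements: list) -> list:
    """Indices where a break is allowed under the two rules above."""
    breaks = []
    for i, el in enumerate(elements):
        if isinstance(el, Penalty) and el.value < INFINITE:
            breaks.append(i)  # finite penalty
        elif isinstance(el, Glue) and i > 0 and isinstance(elements[i - 1], Box):
            breaks.append(i)  # glue immediately preceded by a box
    return breaks


def next_line_start(elements: list, break_index: int) -> int:
    """Skip the glue and penalty elements after a break, up to the next box."""
    i = break_index + 1
    while i < len(elements) and not isinstance(elements[i], Box):
        i += 1
    return i


# word, then the preserved-space sequence, then the next word
preserved_space = [
    Box(12000),
    Glue(0), Penalty(0, 0), Glue(3336), Box(0), Penalty(0, INFINITE), Glue(0),
    Box(12000),
]

print(legal_breaks(preserved_space))
print(next_line_start(preserved_space, 2))
```

In this sketch the only legal breaks fall at the start of the space sequence, and the zero-width auxiliary box stops the discarding, so the final glue survives at the start of the next line.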

<snip/>
> For 2), I'm thinking of very extreme (and highly unusual) cases, where it
> becomes necessary to choose 'a' break, but the choice is between white-space
> characters only. If white-space treatment is "preserve", a portion of
> white-space should, strictly speaking, be pushed to the next line, and
> influence alignment there... but ideally, if it all fits on one line, that
> possibility should obviously be preferred above all else.

This is probably the biggest issue. This may require handling a sequence of white spaces as a whole instead of each character individually. Sorry, I don't have enough energy ATM to look at this issue in more detail. Making sure that every combination of white space options (white-space-treatment, white-space-collapse, linefeed-treatment...) is handled correctly requires an extensive study.

Vincent Hennebert
added a comment - 26/Nov/08 03:28

Thanks for the rectification, although I wasn't really wrong. At most, not accurate enough. ;-P

> <snip/>
> > For 2), I'm thinking of very extreme (and highly unusual) cases, where it
> > becomes necessary to choose 'a' break, but the choice is between white-space
> > characters only. If white-space treatment is "preserve", a portion of
> > white-space should, strictly speaking, be pushed to the next line, and
> > influence alignment there... but ideally, if it all fits on one line, that
> > possibility should obviously be preferred above all else.
>
> This is probably the biggest issue. This may require to handle a sequence of
> white spaces in its whole instead of each character individually. Sorry, I
> don't have enough energy ATM to look at this issue into more details. Being
> sure that every combination of white space options (white-space-treatment,
> white-space-collapse, linefeed-treatment...) is handled correctly requires an
> extensive study.

I was thinking about introducing a special type of auxiliary glue, with the possibility to break it in two at a position that is not fixed at the time the element is generated. (more like a combination of two glues, whose combined width is known, but not the width of the two individual elements.)
The LineLM would then treat this as a whole, but not an unbreakable whole, see how big a portion it can fit on one line, and insert an auxiliary box for the remaining width (rather than /always/ adding that auxiliary box in the TextLM when white-space-treatment='preserve').
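A rough sketch of the splitting idea, assuming the combined width of the white-space run is known up front and the split point is only decided once the remaining line width is known. `split_space_run` is a hypothetical helper, not an existing FOP method, and it ignores stretch and shrink entirely.

```python
def split_space_run(run_width: int, available_width: int) -> tuple:
    """Return (width kept on the current line, width pushed to the next line)."""
    fits = max(0, min(run_width, available_width))
    return fits, run_width - fits


print(split_space_run(10008, 4000))
print(split_space_run(3336, 5000))
```

The second value would be the width of the auxiliary box inserted at the start of the next line; when it is zero, no auxiliary box is needed at all.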

Andreas L. Delmelle
added a comment - 26/Nov/08 11:10

(In reply to comment #16)
> (In reply to comment #15)
> >
> > This is probably the biggest issue. This may require to handle a sequence of
> > white spaces in its whole instead of each character individually. Sorry, I
> > don't have enough energy ATM to look at this issue into more details. Being
> > sure that every combination of white space options (white-space-treatment,
> > white-space-collapse, linefeed-treatment...) is handled correctly requires an
> > extensive study.
>
> I was thinking about introducing a special type of auxiliary glue, ...

Or maybe even, we could benefit from a space-resolution pass in line-layout too. Replace the white-space sequences by one unresolved SpaceElement, and resolve those in the LineLM, at the end of collecting the inline elements for a paragraph.
That would probably be the most comprehensive approach, since it could then be folded into space-start/space-end resolution (currently non-functional), and it would make it much easier to detect sequences of consecutive preserved white-space characters across FO boundaries...
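To make the idea concrete, here is a small sketch of the first half of such a pass: replacing each run of consecutive spaces with one unresolved element that records how many characters it stands for. The `Word` and `UnresolvedSpace` classes are hypothetical, and the actual resolution step in the LineLM is not shown.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    text: str


@dataclass
class UnresolvedSpace:
    count: int  # number of consecutive space characters in the run


def group_spaces(text: str) -> list:
    """Replace each run of spaces with a single unresolved element."""
    out = []
    for is_space, chars in groupby(text, key=lambda c: c == " "):
        chunk = "".join(chars)
        out.append(UnresolvedSpace(len(chunk)) if is_space else Word(chunk))
    return out


print(group_spaces("a   bc "))
```

Runs that span FO boundaries would need the grouping to happen after the inline elements of the whole paragraph are collected, which this string-based sketch does not model.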

Andreas L. Delmelle
added a comment - 26/Nov/08 12:34

I know it's been a long time, but I was just wondering if anyone was able to get anywhere on the last remaining issue described in this bug. The last discussion centered around whether the Knuth sequence for preserved whitespace was more complicated than it needed to be and that, possibly, by moving the penalty=0 after the glue instead of before the issue would be fixed...assuming it doesn't then cause problems with alignment/justification.

As predicted, one of my clients is finally complaining about the behavior and is asking when it will be fixed. Obviously I can jump in and try to fix myself to help in the effort, but I'm guessing that in the time it takes me to learn the layout algorithm and Knuth concepts someone with more experience in this stuff could have already resolved the issue. Plus, it sounds like Andreas might have already started working on a fix?

Sean Griffin
added a comment - 03/Aug/09 11:03

Hi Sean,

I'm afraid this bug doesn't seem to be high on the priority list of any of the committers. This issue is non-trivial and also affects non-trivial code, so fixing it would require some involvement.

You might be happy with the following workaround, though: in a pre-processing step, replace every space character with a non-breaking space (U+00A0) followed by a zero-width space (U+200B). That will force the line-breaking algorithm to break after the space and not before.
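A minimal sketch of that pre-processing step, applied to a plain text string. The function name is made up for illustration, and it is assumed the caller only passes text content (not markup) from blocks where the spaces are meant to be preserved.

```python
NBSP = "\u00a0"  # non-breaking space
ZWSP = "\u200b"  # zero-width space


def protect_spaces(text: str) -> str:
    """Replace every space with NBSP followed by ZWSP."""
    return text.replace(" ", NBSP + ZWSP)


print(repr(protect_spaces("  indented line")))
```

Only the ordinary space character (U+0020) is replaced here; tabs and linefeeds are left untouched.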

HTH,
Vincent

(In reply to comment #19)
> I know it's been a long time, but I was just wondering if anyone was able to
> get anywhere on the last remaining issue described in this bug. The last
> discussion centered around whether the Knuth sequence for preserved whitespace
> was more complicated than it needed to be and that, possibly, by moving the
> penalty=0 after the glue instead of before the issue would be fixed...assuming
> it doesn't then cause problems with alignment/justification.
>
> As predicted, one of my clients is finally complaining about the behavior and
> is asking when it will be fixed. Obviously I can jump in and try to fix myself
> to help in the effort, but I'm guessing that in the time it takes me to learn
> the layout algorithm and Knuth concepts someone with more experience in this
> stuff could have already resolved the issue. Plus, it sounds like Andreas
> might have already started working on a fix?

Vincent Hennebert
added a comment - 06/Aug/09 03:22

Vinesh Kumar
added a comment - 09/Oct/15 13:16 We are seeing this issue in FOP 2.0 as well. It has become a showstopper for our upgrade.
Has any solution been identified other than replacing each space with NBSP and ZWSP?