On Tue, 06 Jul 2010 14:51:21 -0400, Eric Sunshine wrote:
> Taken literally, Thomas's statement does not apply to this case. An
> entity (&aacute;) is not a literal character (á), and vice-versa.
> kramdown correctly represents the entity internally as just that: an
> entity, not as its external representation. It does not confuse the
> entity with a character in a string.
> No particular external
> representation of the entity (&aacute;, &#225;, &#xE1;) is more
> correct than any other, and none is incorrect (except the reported
> bug with &zcaron;).
Are you sure? There may be other cases in which keeping a particular representation is desirable or even necessary. We don't know what post-processing and rendering agents may be used after kramdown, or which set of entities they will accept. kramdown's behavior of changing entities from numeric to named not only violates the principle of least surprise, but also makes it very difficult to preserve numeric entities through kramdown.
> I expect that Thomas could augment the internal entity object so that
> it remembers its input representation,
Yes, it should not be that hard to add an instance variable specifying :named or :numeric, or simply containing the original entity string token from the parse.
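To make that concrete, here is a rough sketch of what I have in mind. It is not kramdown's actual code: the names `Entity`, `NAMES`, `parse_entity`, and `emit_entity` are all hypothetical, and the name table is a three-entry stand-in for a real one. The entity object keeps its code point, plus both the :named/:numeric flag and the original token.

```ruby
# Hypothetical sketch; none of these names come from kramdown itself.
# Tiny illustrative name table (a real one would cover all HTML entities).
NAMES = { 'aacute' => 225, 'amp' => 38, 'lt' => 60 }.freeze

# An entity that remembers what it denotes and how it was written.
Entity = Struct.new(:code_point, :form, :original)

# Matches &#xHH; (group 1), &#NNN; (group 2), or &name; (group 3).
ENTITY_RE = /\A&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));\z/

# Returns an Entity, or nil if the token is malformed or the name is unknown.
def parse_entity(token)
  m = ENTITY_RE.match(token)
  return nil unless m

  if m[1]
    Entity.new(m[1].hex, :numeric, token)
  elsif m[2]
    Entity.new(m[2].to_i, :numeric, token)
  elsif NAMES.key?(m[3])
    Entity.new(NAMES[m[3]], :named, token)
  end
end

# mode :as_input reproduces the author's spelling; :named and :numeric
# force one representation (falling back to numeric if no name is known).
def emit_entity(entity, mode = :as_input)
  case mode
  when :as_input then entity.original
  when :numeric  then "&##{entity.code_point};"
  when :named
    name = NAMES.key(entity.code_point)
    name ? "&#{name};" : "&##{entity.code_point};"
  else
    raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end
```

With something like this, `emit_entity(parse_entity('&#xE1;'))` hands back the token exactly as the user typed it, while the :named and :numeric modes remain available for anyone who wants the output normalized.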
> but this would sully the
> presently clean abstraction,
If we accept that named and numeric entities can in some cases be functionally different (as in the zcaron case), then it is not proper to abstract to the point where the named/numeric distinction is lost.
> The present behavior of emitting symbolic
> references when possible (unless explicitly disabled) seems a decent
> compromise if the output is expected to be read by humans.
When Thomas first started the discussion on entities, I was in favor of named entities for conversion of non-ASCII characters. I did not realize we were talking about changing one type of HTML entities into another. I don't think it's proper for kramdown to change numeric entities to named ones.
I think if a (human) user inputs a numeric entity, they are clearly competent enough to understand it on output. Furthermore, since numeric entities are not as readable, we should assume that the human had some reason for inputting the numeric instead of the named, and it should be left as numeric in the output.
If some automated pre-process is feeding numeric entities to kramdown, then I think that process should be changed to provide named entities instead, to enforce human readability earlier in the chain.
Shawn