Re: [Sax-devel] Entity references in attribute values

> The spec for ContentHandler.setDocumentLocator() disagrees;
> it says "The locator allows the application to determine the end
> position of any document-related event...". So ">" in this case.
I just spent a week reading all of those docs too! So just theoretically
speaking-- what are some of the things this token interface would need? And
could we come up with a list of things that a parser would need to do to
support round tripping?
Jeff Rafter
Defined Systems
http://www.defined.net
XML Development and Developer Web Hosting

Thread view

How should entity references in attribute values be reported?
Is there any interest out there in making a "really" lexical handler? It
would be nice to have one that reports character entities, character
references and entities in attribute values. And then beyond that there
could be a "really, really" lexical handler that allows byte-for-byte
round-tripping.
Paul Prescod

Paul Prescod <paul@...> asks:
> How should entity references in attribute values be reported?
Right now, they should not be reported. Just like the
PE references used to construct markup declarations,
these are below the "markup event horizon" of SAX.
> Is there any interest out there in making a "really" lexical handler? It
> would be nice to have one that reports character entities, character
> references and entities in attribute values. And then beyond that there
> could be a "really, really" lexical handler that allows byte-for-byte
> round-tripping.
When such issues have come up before I've talked about a
"token handler". Such a handler happens not to be on the list
of things I'm interested in doing; I think it's fair to say that so
far there's been more interest (occasionally) in having one
than doing the work to make one.
Such a handler would need to be done in conjunction with
modification to some parser(s).
- Dave
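
To make the "not reported" point concrete, here is a minimal sketch, assuming
a JAXP-provided SAX2 parser that supports the standard lexical-handler
property. The class and method names (AttrEntityDemo, Recorder, parse) are
made up for illustration. It records every startEntity event and the first
attribute value it sees; with an entity reference that appears only inside
an attribute value, the application gets the expanded text and no event
named after the entity.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

class AttrEntityDemo {
    /** Collects what the parser reports for a single document. */
    static final class Recorder extends DefaultHandler2 {
        final List<String> entities = new ArrayList<>();
        String attrValue;

        @Override
        public void startEntity(String name) {
            entities.add(name); // LexicalHandler callback
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (attrValue == null && atts.getLength() > 0) {
                attrValue = atts.getValue(0); // value after entity expansion
            }
        }
    }

    static Recorder parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Recorder rec = new Recorder();
        // Standard SAX2 property id for registering a LexicalHandler.
        parser.setProperty("http://xml.org/sax/properties/lexical-handler", rec);
        parser.parse(new InputSource(new StringReader(xml)), rec);
        return rec;
    }

    public static void main(String[] args) throws Exception {
        Recorder rec = parse("<!DOCTYPE e [<!ENTITY foo \"bar\">]><e attr='&foo;'/>");
        System.out.println("attr = " + rec.attrValue);
        System.out.println("entity events = " + rec.entities);
    }
}
```

The application has no way to tell from these callbacks whether the source
said attr='&foo;' or attr='bar', which is the round-tripping problem raised
above.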

> When such issues have come up before I've talked about a
> "token handler". Such a handler happens not to be on the list
> of things I'm interested in doing; I think it's fair to say that so
> far there's been more interest (occasionally) in having one
> than doing the work to make one.
>
> Such a handler would need to be done in conjunction with
> modification to some parser(s).
I have thought a lot about this as well-- I even have a somewhat modified
AElfred2 Java that does a better job of handling the reporting of Locator
information (but it is horribly out of date so I gave up for the time
being). I think that in order to do such a thing you would need to legislate
quite a bit more than SAX currently does (or could do with non-parser
producers).
For example one of the first questions we are faced with is "When do I
report X?". Should you report startElement (locator position) at "<" or at
">"? I think the former, though some may disagree. If you say the former--
that involves storage of the position (which may involve buffer chunks and
translating to actual document positions). When do you report a malformed
entity? A missing end-tag?
Once you decide all of those questions (there are lots) you must deal with
storing position within the parser far better than parsers do now (for
optimization purposes and memory, possibly-- though we are talking more
about depth of nesting than size of document). Additionally, do you really
want to know byte information or character information?
I could see something like this being very useful-- but difficult. SAX3
stuff.
Jeff Rafter
Defined Systems
http://www.defined.net
XML Development and Developer Web Hosting
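
The byte-versus-character question is easy to see in a few lines. This is
only a sketch: the OffsetDemo class and its utf8ByteOffset helper are
hypothetical, and it assumes the document is encoded as UTF-8. It converts
a character offset (counted in Java chars, i.e. UTF-16 code units) into the
matching byte offset, and the two diverge as soon as a non-ASCII character
appears earlier in the document.

```java
import java.nio.charset.StandardCharsets;

class OffsetDemo {
    /**
     * Byte offset in the UTF-8 encoding of doc that corresponds to the given
     * char offset. Assumes charOffset does not split a surrogate pair.
     */
    static int utf8ByteOffset(String doc, int charOffset) {
        return doc.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String doc = "<\u00e9>x</\u00e9>"; // element named with e-acute, 2 bytes in UTF-8
        int charOffset = doc.indexOf('x');
        System.out.println("char offset " + charOffset
                + ", byte offset " + utf8ByteOffset(doc, charOffset));
    }
}
```

A parser that only tracks one of the two would have to re-encode the prefix
like this (or keep a running count of both) to report the other.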

> I think that in order to do such a thing you would need to legislate
> quite a bit more than SAX currently does (or could do with non-parser
> producers).
I wouldn't be surprised to find that's true, but ...
> For example one of the first questions we are faced with is "When do I
> report X?". Should you report startElement (locator position) at "<" or at
> ">"? I think the former, though some may disagree.
The spec for ContentHandler.setDocumentLocator() disagrees;
it says "The locator allows the application to determine the end
position of any document-related event...". So ">" in this case.
- Dave
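
A quick way to check what a given parser does with the locator is to read it
inside startElement. The sketch below assumes a JAXP-provided parser that
supplies a Locator at all (supplying one is optional for parsers); LocatorDemo
and positionAtFirstStartElement are names invented here. The start tag is
deliberately spread over three lines so that a position taken at "<" and one
taken at ">" land on different lines.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorDemo {
    /** Returns {line, column} as seen inside the first startElement callback. */
    static int[] positionAtFirstStartElement(String xml) throws Exception {
        final int[] pos = {-1, -1};
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator; // may never be called by some parsers
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (pos[0] == -1 && locator != null) {
                    pos[0] = locator.getLineNumber();
                    pos[1] = locator.getColumnNumber();
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return pos;
    }

    public static void main(String[] args) throws Exception {
        // "<" is on line 1, ">" is on line 3.
        int[] pos = positionAtFirstStartElement("<a\n  x='1'\n>text</a>");
        System.out.println("line " + pos[0] + ", column " + pos[1]);
    }
}
```

A parser that follows the "end position" wording should print the line of
the ">" rather than the line of the "<"; the exact column is best treated as
parser-dependent.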

Jeff Rafter wrote:
>
> > The spec for ContentHandler.setDocumentLocator() disagrees;
> > it says "The locator allows the application to determine the end
> > position of any document-related event...". So ">" in this case.
>
> I just spent a week reading all of those docs too! So just theoretically
> speaking-- what are some of the things this token interface would need? And
> could we come up with a list of things that a parser would need to do to
> support round tripping?
Off the top of my head:
* report entity events in attribute values
* report which type of quotes attributes used
* report when characters were represented as character references,
versus predefined entities, versus directly
* report insignificant whitespace in markup
Really, entity events in attribute values are in a whole different class
of importance to me, however. I mean most people won't complain if you
change their attribute quotes but when you expand the entities they
worked hard to define....
Paul Prescod
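
As a rough illustration of what such callbacks might look like, here is a
hypothetical interface covering the first three items in the list above
(whitespace inside markup would need further events). Nothing here is part
of SAX: AttrTokenHandler, AttrSourceWriter, and every method name are
assumptions made for this sketch. The writer shows the payoff, which is
that the original attribute text can be rebuilt from the events.

```java
/** Hypothetical callbacks for the lexical detail of one attribute. Not a SAX API. */
interface AttrTokenHandler {
    void startAttribute(String name, char quote); // quote is ' or "
    void literalText(String text);                // characters written directly
    void characterReference(String raw);          // e.g. "&#38;" exactly as written
    void entityReference(String name);            // e.g. "foo" for "&foo;", or "amp"
    void endAttribute();
}

/** Rebuilds the source text of an attribute from the events it receives. */
class AttrSourceWriter implements AttrTokenHandler {
    private final StringBuilder out = new StringBuilder();
    private char quote;

    @Override
    public void startAttribute(String name, char quote) {
        this.quote = quote;
        out.append(name).append('=').append(quote);
    }

    @Override
    public void literalText(String text) {
        out.append(text);
    }

    @Override
    public void characterReference(String raw) {
        out.append(raw);
    }

    @Override
    public void entityReference(String name) {
        out.append('&').append(name).append(';'); // unexpanded, as the author wrote it
    }

    @Override
    public void endAttribute() {
        out.append(quote);
    }

    String source() {
        return out.toString();
    }
}
```

Feeding it startAttribute("attr", '\''), entityReference("foo"),
literalText(" and "), characterReference("&#38;"), endAttribute() yields the
attribute with its reference and quote style intact.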

> > ... what are some of the things this token interface would need? And
> > could we come up with a list of things that a parser would need to do to
> > support round tripping?
>
> Off the top of my head:
>
> * report entity events in attribute values
> * report which type of quotes attributes used
> * report when characters were represented as character references,
> versus predefined entities, versus directly
> * report insignificant whitespace in markup
For "<element attr = '&foo;' />" I think it'd need to report each
kind of token ... possibly some of these events could be grouped:
- "<"
- "element"
- " "
- "attr"
- " "
- "="
- " "
- "'"
- "&foo;"
- "'"
- " "
- "/>"
Doing that level of reporting in scopes where PEs can be reported
gets to be more awkward.
- Dave
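
The token list above can be reproduced with a small hand-written splitter.
This is only a sketch for a single start tag or empty-element tag, not a
parser: TagTokenizer and its methods are hypothetical, it does no
well-formedness checking beyond unterminated quotes and references, and it
treats anything that is not a delimiter or whitespace as a name character.
It does show the round-tripping property being discussed, since joining the
tokens gives back the input unchanged.

```java
import java.util.ArrayList;
import java.util.List;

class TagTokenizer {
    /** Splits one start tag or empty-element tag into lexical tokens. */
    static List<String> tokenize(String tag) {
        List<String> tokens = new ArrayList<>();
        int n = tag.length();
        int i = 0;
        while (i < n) {
            char c = tag.charAt(i);
            int start = i;
            if (c == '/' && i + 1 < n && tag.charAt(i + 1) == '>') {
                i += 2; // "/>"
            } else if (c == '<' || c == '>' || c == '=') {
                i++; // single-character delimiter
            } else if (Character.isWhitespace(c)) {
                while (i < n && Character.isWhitespace(tag.charAt(i))) i++;
            } else if (c == '\'' || c == '"') {
                int close = tag.indexOf(c, i + 1);
                if (close < 0) throw new IllegalArgumentException("unterminated attribute value");
                tokens.add(String.valueOf(c)); // opening quote
                splitValue(tag.substring(i + 1, close), tokens);
                tokens.add(String.valueOf(c)); // closing quote
                i = close + 1;
                continue;
            } else {
                while (i < n && isNameChar(tag.charAt(i))) i++;
                if (i == start) throw new IllegalArgumentException("unexpected character at " + i);
            }
            tokens.add(tag.substring(start, i));
        }
        return tokens;
    }

    /** Splits an attribute value into literal runs and "&...;" references. */
    static void splitValue(String value, List<String> tokens) {
        int i = 0;
        while (i < value.length()) {
            int amp = value.indexOf('&', i);
            if (amp < 0) {
                tokens.add(value.substring(i));
                return;
            }
            if (amp > i) tokens.add(value.substring(i, amp));
            int semi = value.indexOf(';', amp);
            if (semi < 0) throw new IllegalArgumentException("unterminated reference");
            tokens.add(value.substring(amp, semi + 1));
            i = semi + 1;
        }
    }

    private static boolean isNameChar(char c) {
        return !Character.isWhitespace(c) && "<>=/'\"".indexOf(c) < 0;
    }

    public static void main(String[] args) {
        String tag = "<element attr = '&foo;' />";
        List<String> tokens = tokenize(tag);
        System.out.println(tokens);
        System.out.println(String.join("", tokens).equals(tag));
    }
}
```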