dxml

dxml.parser

This implements a range-based StAX parser for XML 1.0 (which will work with
XML 1.1 documents assuming that they don't use any 1.1-specific features).
For the sake of simplicity, sanity, and efficiency, the DTD section is not
supported beyond what is required to parse past it.

Start tags, end tags, comments, CDATA sections, and processing instructions
are all supported and reported to the application. Anything in the DTD is
skipped (though it's parsed enough to parse past it correctly, and that
can result in an XMLParsingException if that XML isn't valid enough to be
correctly skipped), and the XML declaration at the top is skipped if present
(XML 1.1 requires that it be there, XML 1.0 does not).
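As a rough illustration of which entities are reported, the following sketch collects the type of every entity in a small document. The helper entityTypes is hypothetical (it is not part of dxml), and the EntityType member names comment, text, and elementEnd are assumed to match the library version in use.

```d
import dxml.parser;

/// Collects the type of every entity that the parser reports for `xml`.
/// (`entityTypes` is a hypothetical helper, not part of dxml.)
EntityType[] entityTypes(string xml)
{
    EntityType[] types;
    foreach (entity; parseXML(xml))
        types ~= entity.type;
    return types;
}

unittest
{
    // The XML declaration is skipped, so the first entity is <root>.
    auto xml = "<?xml version='1.0'?>" ~
               "<root><!-- note --><item>text</item><?pi data?></root>";
    with (EntityType)
        assert(entityTypes(xml) ==
               [elementStart, comment, elementStart, text,
                elementEnd, pi, elementEnd]);
}
```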

Regardless of what the XML declaration says (if present), any range of
char will be treated as being encoded in UTF-8, any range of wchar
will be treated as being encoded in UTF-16, and any range of dchar will
be treated as having been encoded in UTF-32. Strings will be treated as
ranges of their code units, not code points.

Since the DTD section is skipped, entity references other than the five
which are predefined by the XML spec cannot be properly processed (since
wherever they were used in the document would be replaced by what they
referred to, and that could affect the parsing). As such, if any entity
references which are not predefined are encountered outside of the DTD
section, an XMLParsingException will be thrown. The predefined
entity references and any character references encountered will be checked
to verify that they're valid, but they will not be replaced (since that
does not work with returning slices of the original input).

Each code unit is considered a column, so depending on what a program
is looking to do with the column number, it may need to examine the
actual text on that line and calculate the number that represents
what the program wants to display (e.g. the number of graphemes).
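For instance, a program that wants to display a grapheme-based column could convert the reported column itself. The sketch below uses only Phobos; graphemeColumn is a hypothetical helper, and it assumes that the column is 1-based, counts UTF-8 code units, and falls on a code point boundary.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Converts a 1-based code unit column into a 1-based grapheme column
/// for the given line of text. (`graphemeColumn` is a hypothetical helper.)
size_t graphemeColumn(string line, size_t codeUnitCol)
{
    // Everything before the column, measured in code units.
    auto prefix = line[0 .. codeUnitCol - 1];
    return prefix.byGrapheme.walkLength + 1;
}

unittest
{
    // U+00E9 takes two UTF-8 code units, so 'b' sits at code unit
    // column 4 but at grapheme column 3.
    assert(graphemeColumn("a\u00E9b", 4) == 3);
}
```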

The purpose of this is to simplify the code using the parser, since most
code does not care about the difference between an empty tag and a start
and end tag with nothing in between. But since some code may care about
the difference, the behavior is configurable.

Helper function for creating a custom config. It makes it easy to set one
or more of the member variables to something other than the default without
having to worry about explicitly setting them individually or setting them
all at once via a constructor.

The order of the arguments does not matter. The types of each of the members
of Config are unique, so that information alone is sufficient to determine
which argument should be assigned to which member.
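A minimal sketch of how this might look in practice, assuming the helper is named makeConfig and that the flag types are SkipComments, SkipPI, and SplitEmpty with correspondingly named Config members:

```d
import dxml.parser;

unittest
{
    // Only skipComments changes; every other member keeps its default.
    auto config = makeConfig(SkipComments.yes);
    assert(config.skipComments == SkipComments.yes);
    assert(config.skipPI == Config.init.skipPI);
    assert(config.splitEmpty == Config.init.splitEmpty);

    // Argument order is irrelevant, because each member has a unique type.
    assert(makeConfig(SplitEmpty.yes, SkipPI.yes) ==
           makeConfig(SkipPI.yes, SplitEmpty.yes));
}
```

Since the configuration is a template argument of parseXML, the result would typically be stored in an enum or passed directly, e.g. parseXML!(makeConfig(SkipComments.yes))(xml).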

This Config is intended for making it easy to parse XML by skipping
everything that isn't the actual data as well as making it simpler to deal
with empty element tags by treating them the same as a start tag and end
tag with nothing but whitespace between them.
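The effect can be sketched as follows, assuming this Config is the one named simpleXML and that it skips both comments and processing instructions. The helper simpleTypes is hypothetical.

```d
import dxml.parser;

/// Collects the entity types reported when parsing with simpleXML.
/// (`simpleTypes` is a hypothetical helper, not part of dxml.)
EntityType[] simpleTypes(string xml)
{
    EntityType[] types;
    foreach (entity; parseXML!simpleXML(xml))
        types ~= entity.type;
    return types;
}

unittest
{
    auto xml = "<root><!-- ignored --><empty/><?target ignored?></root>";
    // The comment and the PI are skipped, and <empty/> is reported as
    // a start tag followed by an end tag.
    with (EntityType)
        assert(simpleTypes(xml) ==
               [elementStart, elementStart, elementEnd, elementEnd]);
}
```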

If there is an entity other than the end tag following the text, then
the text includes up to that entity.

Note however that character references (e.g. "&#42;") and the predefined
entity references (e.g. "&apos;") are left unprocessed in the text. In
order for them to be processed, the text should be passed to either
normalize or asNormalized. Entity references which are not predefined are
considered invalid XML, because the DTD section is skipped, and thus they
cannot be processed properly.
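The difference between the raw slice and the processed text might look like this. It is a sketch that assumes normalize lives in dxml.util and returns a string, as in the version of dxml that this page describes. The helper firstText is hypothetical.

```d
import dxml.parser;
import dxml.util : normalize;

/// Returns the raw slice of the first text entity in `xml`, or null if
/// there is none. (`firstText` is a hypothetical helper.)
string firstText(string xml)
{
    foreach (entity; parseXML(xml))
        if (entity.type == EntityType.text)
            return entity.text;
    return null;
}

unittest
{
    auto raw = firstText("<p>1 &lt; 2 &amp;&amp; &#42;</p>");
    // The parser validates the references but leaves them in place.
    assert(raw == "1 &lt; 2 &amp;&amp; &#42;");
    // normalize replaces them with the characters they stand for.
    assert(raw.normalize == "1 < 2 && *");
}
```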

EntityRange is essentially a StAX parser, though it evolved into that rather
than being based on what Java did, and it's range-based rather than
iterator-based, so its API is likely to differ from other implementations.
The basic concept should be the same though.

One of the core design goals of this parser is to slice the original input
rather than having to allocate strings for the output or wrap it in a lazy
range that produces a mutated version of the data. So, all of the text that
the parser provides is either a slice or
std.range.takeExactly of the input. However, in some cases,
for the parser to be fully compliant with the XML spec,
dxml.util.normalize must be called on the text to mutate certain
constructs (e.g. removing any '\r' in the text or
converting "&lt;" to '<'). But
that's left up to the application.

The parser is not @nogc, but it allocates memory very minimally. It
allocates some of its state on the heap so it can validate attributes and
end tags. However, that state is shared among all the ranges that came from
the same call to parseXML (only the range farthest along in parsing
validates attributes or end tags), so save does not
allocate memory unless save on the underlying range allocates memory.
The shared state currently uses a couple of dynamic arrays to validate the
tags and attributes, and if the document has a particularly deep tag-depth
or has a lot of attributes on a start tag, then some reallocations may
occur until the maximum is reached, but enough is reserved that for most
documents, no reallocations will occur. The only other times that the
parser would allocate would be if an exception were thrown or if the range
that is passed to parseXML allocates for any reason when calling any of the
range primitives.

If invalid XML is encountered at any point during the parsing process, an
XMLParsingException will be thrown. If an exception has been thrown,
then the parser is in an invalid state, and it is an error to call any
functions on it.
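One way an application might wrap this is sketched below. parsesCleanly is a hypothetical helper that abandons the range as soon as an exception is thrown, since the parser must not be used afterwards.

```d
import dxml.parser;
import std.range : walkLength;

/// Returns true if `xml` can be parsed from start to finish without an
/// XMLParsingException. (`parsesCleanly` is a hypothetical helper.)
bool parsesCleanly(string xml)
{
    try
    {
        // walkLength pops every entity, which forces the whole
        // document to be parsed and validated.
        parseXML(xml).walkLength;
        return true;
    }
    catch (XMLParsingException e)
    {
        // The range is in an invalid state here, so it is simply dropped.
        return false;
    }
}

unittest
{
    assert(parsesCleanly("<root><a/></root>"));
    assert(!parsesCleanly("<root><a></b></root>")); // mismatched end tag
    assert(!parsesCleanly("<root>&foo;</root>"));   // entity reference that is not predefined
}
```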

However, note that XML validation is reduced for any entities that are
skipped (e.g. for anything in the DTD section, validation is reduced to what
is required to correctly parse past it, and when
Config.skipPI == SkipPI.yes, processing instructions are only validated
enough to correctly skip past them).

As the module documentation says, this parser does not provide any DTD
support. It's not possible to properly support the DTD section while
returning slices of the original input, and the DTD portion of the spec
makes parsing XML far, far more complicated.

A quick note about carriage returns: per the XML spec, they're all
supposed to either be stripped out or replaced with newlines before the XML
parser even processes the text. That doesn't work when the parser is slicing
the original text and not mutating it at all. So, for the purposes of
parsing, this parser treats all carriage returns as if they were newlines or
spaces (though they won't count as newlines when counting the lines for
TextPos). However, they will appear in any text fields or
attribute values if they are in the document (since the text fields and
attribute values are slices of the original text).
dxml.util.normalize can be used to strip them along with converting
any character references in the text. Alternatively, the application can
remove them all before calling parseXML, but it's not necessary.

The type used when any slice of the original input is used. If R
is a string or supports slicing, then SliceOfR is the same as R;
otherwise, it's the result of calling
std.range.takeExactly on the input.

Note that the type determines which
properties can be used, and it can determine whether functions which
an Entity or EntityRange is passed to are allowed to be called.
Each function lists which EntityTypes are allowed, and it is an
error to call them with any other EntityType.

Note that this is the direct name in the XML for this entity and
does not contain any of the names of any of the parent entities that
this entity has. If an application wants the full "path" of the
entity, then it will have to keep track of that itself. The parser
does not do that as it would require allocating memory.

In the case of EntityType.pi, this is the
text that follows the name, whereas in the other cases, the text is
the entire contents of the entity (save for the delimiters on the
ends if that entity has them).

Returns the Entity representing the entity in the XML document
which was most recently parsed.

void popFront();

Move to the next entity.

The next entity is the next one that is linearly in the XML document.
So, if the current entity has child entities, the next entity will be
the first child entity, whereas if it has no child entities, it will be
the next entity at the same level.

Note that because an XMLParsingException will be thrown on invalid XML,
it's actually possible to call front and popFront without checking empty
if the only way that empty would be true is if the XML were invalid (e.g.
if at a start tag, it's a given that there's at least one end tag left in
the document unless it's invalid XML).

However, of course, caution should be used to ensure that incorrect
assumptions are not made that allow the document to reach its end
earlier than predicted without throwing an XMLParsingException,
since it's still an error to call front or
popFront if empty would return true.
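As a sketch of that reasoning, the hypothetical helper below never checks empty, because each call to front or popFront is made at a point where valid XML guarantees that another entity exists. It assumes string input, so that the text property is a string.

```d
import dxml.parser;

/// Returns the text that directly follows the root start tag, or null if
/// the root element does not begin with text.
/// (`leadingRootText` is a hypothetical helper.)
string leadingRootText(string xml)
{
    auto range = parseXML(xml);

    // Comments and PIs may precede the root element. No empty check is
    // needed, since a valid document must still contain a root element.
    while (range.front.type != EntityType.elementStart &&
           range.front.type != EntityType.elementEmpty)
    {
        range.popFront();
    }

    if (range.front.type == EntityType.elementEmpty)
        return null;

    // A start tag is always followed by at least its end tag.
    range.popFront();
    return range.front.type == EntityType.text ? range.front.text : null;
}

unittest
{
    assert(leadingRootText("<!-- c --><root>hi<a/></root>") == "hi");
    assert(leadingRootText("<root><a/></root>") is null);
    assert(leadingRootText("<root/>") is null);
}
```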

@property auto save();

Forward range function for obtaining a copy of the range which can then
be iterated independently of the original.
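A typical use is lookahead. In this sketch, the hypothetical helper peekNextType advances a saved copy, and the caller is responsible for knowing that a next entity exists.

```d
import dxml.parser;

/// Returns the type of the entity after `range.front` without advancing
/// the caller's range. (`peekNextType` is a hypothetical helper.)
EntityType peekNextType(R)(R range)
{
    auto lookahead = range.save;
    lookahead.popFront();
    return lookahead.front.type;
}

unittest
{
    auto range = parseXML("<root><a/></root>");
    assert(peekNextType(range) == EntityType.elementEmpty);
    // The original range is still at the root start tag.
    assert(range.front.name == "root");
}
```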

EntityRange takeNone();

Returns an empty range. This corresponds to
std.range.takeNone except that it doesn't create a
wrapper type.

R skipContents(R)(R entityRange) if (isInstanceOf!(EntityRange, R));

Takes an EntityRange which is at a start tag and iterates it until
it is at its corresponding end tag. It is an error to call skipContents when
the current entity is not EntityType.elementStart.
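For example, skipContents can be used to step over whole subtrees while listing the direct children of the root element. The sketch assumes string input with nothing before the root start tag. rootChildNames is a hypothetical helper.

```d
import dxml.parser;

/// Returns the names of the direct child elements of the root element.
/// Assumes that the first entity is the root start tag.
/// (`rootChildNames` is a hypothetical helper.)
string[] rootChildNames(string xml)
{
    auto range = parseXML(xml);
    range.popFront(); // move from the root start tag to its first child

    string[] names;
    for (; range.front.type != EntityType.elementEnd; range.popFront())
    {
        if (range.front.type == EntityType.elementStart)
        {
            names ~= range.front.name;
            // Jump to the matching end tag, ignoring all descendants.
            range = range.skipContents();
        }
        else if (range.front.type == EntityType.elementEmpty)
            names ~= range.front.name;
    }
    return names;
}

unittest
{
    auto xml = "<root><a><x>1</x></a>text<b/><c></c></root>";
    assert(rootChildNames(xml) == ["a", "b", "c"]);
}
```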

Skips entities until the end tag is reached that corresponds to the start
tag that is the parent of the current entity.

Returns:
The given range with its front now at the end tag which
corresponds to the parent start tag of the entity that was
front when skipToParentEndTag was called. If the current
entity does not have a parent start tag (which means that it's
either the root element or a comment or PI outside of the root
element), then an empty range is returned.
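A small sketch of both outcomes, assuming string input (so that name is a string). enclosingElementName is a hypothetical helper.

```d
import dxml.parser;

/// Returns the name of the element that encloses `range.front`, or null
/// if there is no parent start tag.
/// (`enclosingElementName` is a hypothetical helper.)
string enclosingElementName(R)(R range)
{
    auto end = range.skipToParentEndTag();
    return end.empty ? null : end.front.name;
}

unittest
{
    auto range = parseXML("<root><a><b/><c/></a><d/></root>");

    // The root element has no parent, so an empty range comes back.
    assert(enclosingElementName(range) is null);

    range.popFront(); // <a>
    range.popFront(); // <b/>
    // From <b/>, the parent end tag is </a>.
    assert(enclosingElementName(range) == "a");
}
```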

Treats the given string like a file path except that each directory
corresponds to the name of a start tag. Note that this does not try to
implement XPath as that would be quite complicated, and it really doesn't
fit with a StAX parser.

A start tag should be thought of as a directory, with its child start tags
as the directories it contains.

All paths should be relative. EntityRange can only move forward
through the document, so using an absolute path would only make sense at
the beginning of the document. As such, absolute paths are treated as
invalid paths.

"./" and "../" are supported. Repeated
slashes such as in "foo//bar" are not supported and are
treated as an invalid path.

If range.front.type == EntityType.elementStart, then
range.skipToPath("foo") will search for the first child
start tag (be it EntityType.elementStart or
EntityType.elementEmpty) with the name "foo". That start tag must be a
direct child of the current start tag.

If range.front.type is any other EntityType, then
range.skipToPath("foo") will return an empty range,
because no other EntityTypes have child start tags.

For any EntityType, range.skipToPath("../foo")
will search for the first start tag with the name "foo" at the same level
as the current entity. If the current entity is a start tag with the name
"foo", it will not be considered a match.

range.skipToPath("./") is a no-op. However,
range.skipToPath("../") will result in the empty range
(since it doesn't target a specific start tag).

range.skipToPath("foo/bar") is equivalent to
range.skipToPath("foo").skipToPath("bar"),
and range.skipToPath("../foo/bar") is equivalent to
range.skipToPath("../foo").skipToPath("bar").

Returns:
The given range with its front now at the requested entity if
the path is valid; otherwise, an empty range is returned.
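The path rules above can be exercised with a sketch like the following. nameAtPath is a hypothetical helper that starts at the first entity of the document (assumed to be the root start tag) and assumes string input.

```d
import dxml.parser;

/// Applies `path` from the first entity of `xml` and returns the name of
/// the entity that was reached, or null if the result is an empty range.
/// (`nameAtPath` is a hypothetical helper.)
string nameAtPath(string xml, string path)
{
    auto range = parseXML(xml).skipToPath(path);
    return range.empty ? null : range.front.name;
}

unittest
{
    auto xml = "<carrot><foo><bar>text</bar></foo><baz/></carrot>";

    assert(nameAtPath(xml, "foo/bar") == "bar"); // nested child
    assert(nameAtPath(xml, "baz") == "baz");     // direct child, empty element tag
    assert(nameAtPath(xml, "./") == "carrot");   // no-op

    // Invalid or untargeted paths yield the empty range.
    assert(nameAtPath(xml, "/carrot") is null);  // absolute path
    assert(nameAtPath(xml, "foo//bar") is null); // repeated slash
    assert(nameAtPath(xml, "../") is null);      // no specific start tag
}
```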