Type definitions used throughout PXP
This module re-exports all the types listed in
Pxp_core_types.S (and finally defined in Pxp_core_types.I), so
the user only has to
open Pxp_types to get all relevant type definitions. The re-exported
definitions are shown below.

Identifiers

External identifiers are names for documents. A System identifier is
a URL. PXP (without extensions) only supports file URLs in the form
file:///directory/directory/.../file. Note that the percent encoding
(% plus two hex digits) is supported in file URLs. A public identifier
can be looked up in a catalog to find a local copy of the file; this
type is mostly used for well-known documents (e.g. after
standardization). A public identifier can be accompanied by a
system identifier (Public(pubid,sysid)), but the system identifier
can be the empty string. The value Anonymous should not be used
to identify a real document; it is intended as a placeholder for when
an ID is not yet known. Private identifiers are used by PXP internally.
These identifiers have, unlike system or public IDs, no textual
counterparts.
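
As an illustration of the file URL form with percent encoding, here is a minimal sketch that turns an absolute Unix path into such a URL. The helper file_url_of_path is hypothetical and not part of PXP, and the set of characters left unencoded is a conservative assumption.

```ocaml
(* Illustrative helper, not part of PXP: turn an absolute Unix path into a
   file URL, percent-encoding every byte outside a conservative safe set. *)
let file_url_of_path (path : string) : string =
  let is_safe c =
    match c with
    | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '/' | '-' | '_' | '.' | '~' -> true
    | _ -> false
  in
  let buf = Buffer.create (String.length path + 8) in
  Buffer.add_string buf "file://";
  String.iter
    (fun c ->
      if is_safe c then Buffer.add_char buf c
      else
        (* % plus two hex digits for the byte value *)
        Buffer.add_string buf (Printf.sprintf "%%%02X" (Char.code c)))
    path;
  Buffer.contents buf
```

Because the path is expected to start with a slash, the result has the three-slash form described above.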

A resolver ID is a version of external identifiers used during
resolving (i.e. the process of mapping the identifier to a real
resource). The same entity can have several names during resolving:
one private ID, one public ID, and one system ID. For resolving
system IDs, the base URL is also remembered (usually the system ID
of the opener of the entity).

The standard conversion of an ext_id into a resolver ID works as follows.
A System ID is turned into a resolver_id where only rid_system is
set. A Public ID is turned into a resolver_id where both rid_public
and rid_system are set. A Private ID is turned into a resolver_id
where only rid_private is set. An Anonymous ID is turned into a
resolver_id without any value (all components are None).
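
The conversion rules can be sketched with local stand-in types. The type definitions below only mirror the description in this section; the representation of private_id as int, the field name rid_system_base, and the function name to_resolver_id are assumptions made for the example.

```ocaml
(* Local stand-ins that mirror the types described above. *)
type private_id = int  (* assumption: an int here, only for brevity *)

type ext_id =
  | System of string
  | Public of (string * string)
  | Anonymous
  | Private of private_id

type resolver_id = {
  rid_private : private_id option;
  rid_public : string option;
  rid_system : string option;
  rid_system_base : string option;  (* assumed field name for the base URL *)
}

let empty_rid =
  { rid_private = None; rid_public = None; rid_system = None;
    rid_system_base = None }

(* The conversion rules from the text, case by case. *)
let to_resolver_id = function
  | System sysid -> { empty_rid with rid_system = Some sysid }
  | Public (pubid, sysid) ->
      { empty_rid with rid_public = Some pubid; rid_system = Some sysid }
  | Private p -> { empty_rid with rid_private = Some p }
  | Anonymous -> empty_rid
```

Note that a Public ID with an empty system identifier still yields Some "" for rid_system in this sketch.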

Attribute value

Value s: The attribute is declared as a non-list type, or the
attribute is undeclared; and the attribute is either defined with
value "s", or it is missing but has the default value s.

Valuelist [s1;...;sk]: The attribute is declared as a list type,
and the attribute is either defined with value "s1 ... sk"
(space-separated words),
or it is missing but has the default value "s1 ... sk".

Implied_value: The attribute is declared without default value,
and there is no definition for the attribute.
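
The three cases can be summarized in a small sketch. The type is a local copy of the constructors named above; the helper att_value_of and its parameters (is_list, default, defined) are hypothetical and only serve to spell out the rules.

```ocaml
type att_value =
  | Value of string
  | Valuelist of string list
  | Implied_value

(* Hypothetical helper: [is_list] says whether the attribute is declared with
   a list type, [default] is the declared default (if any), and [defined] is
   the value written in the XML text (if any). *)
let att_value_of ~is_list ~default defined =
  let chosen = match defined with Some _ -> defined | None -> default in
  match chosen with
  | None -> Implied_value
  | Some s when is_list ->
      (* Split on spaces and drop empty words caused by repeated spaces. *)
      Valuelist (List.filter (fun w -> w <> "") (String.split_on_char ' ' s))
  | Some s -> Value s
```

An undeclared attribute corresponds to is_list being false and default being None.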

Pools

A probabilistic string pool tries to map strings to pool strings in order
to make it more likely that equal strings are stored in the same memory
block.
The int argument is the size of the pool; this is the number of entries
of the pool. However, not all entries of the pool are used; the fraction
argument (default: 0.3) determines the fraction of the actually used
entries. The higher the fraction is, the more strings can be managed
at the same time; the lower the fraction is, the more likely it is that
a new string can be added to the pool.

Tries to find the passed string in the pool; if the string is in the
pool, the pool string is returned. Otherwise, the function tries to
add the passed string to the pool, and the passed string is returned.
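
One way to read this description is the following sketch. It is not PXP's implementation: the names make_pool and lookup_or_add, the hashing scheme, and the rule that a colliding string is simply not stored are all assumptions.

```ocaml
type pool = {
  slots : string option array;  (* one slot per pool entry *)
  max_used : int;               (* at most this many slots may be filled *)
  mutable used : int;
}

(* [size] must be positive. *)
let make_pool ?(fraction = 0.3) size =
  { slots = Array.make size None;
    max_used = int_of_float (fraction *. float_of_int size);
    used = 0 }

let lookup_or_add pool s =
  let i = Hashtbl.hash s mod Array.length pool.slots in
  match pool.slots.(i) with
  | Some pooled when String.equal pooled s -> pooled  (* hit: share the block *)
  | None when pool.used < pool.max_used ->
      (* miss, free slot, quota left: remember [s] for later lookups *)
      pool.slots.(i) <- Some s;
      pool.used <- pool.used + 1;
      s
  | _ -> s  (* collision or quota exhausted: leave the pool unchanged *)
```

In this reading, a sparsely filled table has fewer collisions, which is why a lower fraction makes it more likely that a new string finds a free slot.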

Configuration

type config = {

warner : collect_warnings;

(*

An object that collects warnings.

*)

swarner : symbolic_warnings option;

(*

Another object getting warnings expressed as polymorphic
variants. This is especially useful to turn warnings into
errors. If defined, the swarner gets the warning
first before it is sent to the classic warner.

*)

enable_pinstr_nodes : bool;

(*

If true, processing instructions (PIs) are represented by
nodes of their own in the document tree. If not enabled, PIs
are attached to their surrounding elements, and the exact
location within the element is lost.

For example, if the XML text
is <s><?x?>foo<?y?></s>, the parser normally produces only an element
object for s, and attaches the PIs x and y to it (without
order); the details of x and y can only be found out
with the pinstr method of the surrounding element. The only
subnode is the data node for "foo".
If enable_pinstr_nodes is true, the node for element s will contain
two additional subnodes of type T_pinstr, one as left sibling
of "foo", and one as right sibling.
Any code processing such a tree must be prepared for
processing instructions occurring as normal tree members that are no
longer attached to the surrounding nodes.

The event-based parser reacts to the enable_pinstr_nodes mode
by emitting E_pinstr events exactly at the locations where the
PIs occur in the text.

*)
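
The difference between the two modes can be pictured with a toy model. The node and item types and the function build are invented for this illustration; the real PXP nodes are objects and look different.

```ocaml
(* Toy node model that only shows the shape of the tree. *)
type node =
  | Element of string * string list * node list
      (* name, targets of attached PIs, children *)
  | Data of string
  | Pinstr of string  (* stands in for a T_pinstr node *)

(* Content of an element as a flat list of items in document order. *)
type item = PI of string | Text of string

let build ~enable_pinstr_nodes name items =
  if enable_pinstr_nodes then
    (* PIs become ordinary children and keep their position. *)
    Element
      (name, [], List.map (function PI t -> Pinstr t | Text d -> Data d) items)
  else
    (* PIs are attached to the element; their position is lost. *)
    let pis =
      List.filter_map (function PI t -> Some t | Text _ -> None) items in
    let children =
      List.filter_map (function Text d -> Some (Data d) | PI _ -> None) items in
    Element (name, pis, children)
```

For the text <s><?x?>foo<?y?></s>, the first branch yields three children below s, while the second yields a single data child with x and y attached to s. The list of attached targets happens to be ordered here, but that order carries no meaning.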

enable_comment_nodes : bool;

(*

When enabled, comments are represented as nodes with type
T_comment. If not enabled, comments are ignored.

*)

enable_super_root_node : bool;

(*

The enable_super_root_node mode changes the layout of the document
tree: The top-most node is no longer the top-most element of the
document (i.e. the element root), but a special node called the
super root node (T_super_root). The top-most element is then
a child of the super
root node. The super root node can have further children, namely
comment nodes and processing instructions that are placed before
or after the top-most element in the XML text. However, the exact
behaviour depends on whether the other special modes in the
configuration are also enabled:

If enable_pinstr_nodes is also true, processing instruction
nodes (T_pinstr) can occur as children of the super root
node when processing instructions occur before or after the
root element. If enable_pinstr_nodes is false, these
instructions are simply attached to the super root node as
they would be attached to ordinary elements within the tree.
Note that processing instructions in the DTD part of the XML
text are not meant here (i.e. instructions between the square
brackets, or in an external DTD). These instructions are always
attached to the DTD object (see Pxp_dtd.dtd).

If enable_comment_nodes is also true, comment nodes can
occur as children of the super root node when comments
occur before or after the root element. If enable_comment_nodes
is false, comments are ignored.

*)

drop_ignorable_whitespace : bool;

(*

Ignorable whitespace is whitespace between XML nodes where
the DTD does not specify that #PCDATA must be parsed. For example,
if the DTD contains

<!ELEMENT a (b,c)>
<!ELEMENT b (#PCDATA)*>
<!ELEMENT c EMPTY>

the XML text <a><b> </b> <c></c></a> is legal. There are two
spaces:

Between <b> and </b>. Because b is declared with #PCDATA,
this space character is not ignorable, and the parser will
create a data node containing the character.

Between </b> and <c>. Because the declaration of a does not
contain the keyword #PCDATA, character data is not expected
at this position. However, XML allows whitespace to be
written here in order to improve the readability of the XML
text. Such whitespace is considered "ignorable
whitespace". If drop_ignorable_whitespace is true, the parser
will not create a data node containing the character.
Otherwise, the parser does create such a data node.

Note that c is declared as EMPTY. XML does not allow space
characters between <c> and </c>, so the question of
whether such characters are to be ignored does not arise: they are
simply illegal and will lead to a parsing error.

In the well-formed mode, the parser treats every whitespace
character occurring in an element as non-ignorable.

Event-based parser: ignored. (Maybe there will be a stream filter
with the same effect if I find time to program it.)

*)
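
The three cases of the example can be condensed into a small decision function. This is only a sketch for the validating tree parser; the types content_decl and outcome and the function whitespace_outcome are made up for the illustration.

```ocaml
(* Simplified view of how the content of an element is declared in the DTD. *)
type content_decl =
  | Empty          (* declared as EMPTY, like c in the example *)
  | Children_only  (* child elements only, no #PCDATA, like a *)
  | Mixed          (* declaration contains #PCDATA, like b *)

type outcome = Data_node | Dropped | Parse_error

(* What happens to a run of whitespace found directly inside an element. *)
let whitespace_outcome ~drop_ignorable_whitespace decl =
  match decl with
  | Mixed -> Data_node  (* never ignorable *)
  | Children_only ->
      if drop_ignorable_whitespace then Dropped else Data_node
  | Empty -> Parse_error  (* whitespace is illegal here *)
```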

encoding : rep_encoding;

(*

Specifies the encoding used for the internal representation
of any character data.

*)

recognize_standalone_declaration : bool;

(*

Whether the standalone declaration is recognized or not.
This option does not have an effect on well-formedness parsing:
in this case such declarations are never recognized.

Recognizing the standalone declaration means that the
value of the declaration is scanned and passed to the DTD,
and that the standalone-check is performed.

This means: If a document is flagged standalone='yes',
some additional constraints apply. The idea is that a parser
without access to any external document subsets can still parse
the document, and will still return the same values as the parser
with such access. For example, if the DTD is external and if
there are attributes with default values, it is checked that there
is no element instance where these attributes are omitted - the
parser would return the default value but this requires access to
the external DTD subset.

Event-based parser: The option has an effect if the `Parse_xml_decl
entry flag is set. In this case, it is passed to the DTD whether
there is a standalone declaration, ... and the rest is unclear.

*)

store_element_positions : bool;

(*

Whether the file name, the line and the column of the
beginning of elements are stored in the element nodes.
This option may be useful to generate error messages.

Positions are only stored for:

Elements

Processing instructions if T_pinstr nodes are created for
them (see enable_pinstr_nodes)

For all other node types, no position is stored.

You can access positions by the method position of nodes.

Event-based parser: If true, the E_position events will be
generated.

*)

idref_pass : bool;

(*

Whether the parser does a second pass and checks that all
IDREF and IDREFS attributes contain valid references.
This option works only if an ID index is available. To create
an ID index, pass an index object as id_index argument to the
parsing functions (such as Pxp_tree_parser.parse_document_entity).

"Second pass" does not mean that the XML text is again parsed;
only the existing document tree is traversed, and the check
on bad IDREF/IDREFS attributes is performed for every node.

Event-based parser: this option is ignored.

*)
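
The idea of the second pass can be sketched as a plain tree traversal against an index of known IDs. The record type element, the use of a Hashtbl as index, and the function dangling_idrefs are assumptions for this example and do not reflect the PXP index interface.

```ocaml
(* Toy tree: each element lists the IDs its IDREF/IDREFS attributes refer to. *)
type element = {
  name : string;
  idrefs : string list;
  children : element list;
}

(* [index] plays the role of the ID index filled while parsing. The function
   walks the finished tree and returns every reference without a target. *)
let rec dangling_idrefs (index : (string, unit) Hashtbl.t) (e : element) :
    string list =
  let here = List.filter (fun r -> not (Hashtbl.mem index r)) e.idrefs in
  here @ List.concat_map (dangling_idrefs index) e.children
```

The XML text is not read again; only the tree and the index are consulted.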

validate_by_dfa : bool;

(*

If true, and if DFAs are available for validation, the DFAs will
actually be used for validation.
If false, or if no DFAs are available, the standard backtracking
algorithm will be used.

DFAs are only available if accept_only_deterministic_models is
true (because in this case, it is relatively cheap to construct
the DFAs). DFAs are a data structure which ensures that validation
can always be performed in linear time.

I strongly recommend using DFAs; however, there are examples
for which validation by backtracking is faster.

Event-based parser: this option is ignored.

*)

accept_only_deterministic_models : bool;

(*

Whether only deterministic content models are accepted in DTDs.

Event-based parser: this option is ignored.

*)

disable_content_validation : bool;

(*

When set to true, content validation is disabled; however,
other validation checks remain activated.
This option is intended to save time when a validated document
is parsed and it can be assumed that it is valid.

Do not forget to set accept_only_deterministic_models to false
to save maximum time (or DFAs will be computed which is rather
expensive).

Event-based parser: this option is ignored.

*)

name_pool : Pxp_core_types.I.pool;

enable_name_pool_for_element_types : bool;

enable_name_pool_for_attribute_names : bool;

enable_name_pool_for_attribute_values : bool;

enable_name_pool_for_pinstr_targets : bool;

(*

The name pool maps strings to pool strings such that strings with
the same value share the same block of memory.
Enabling the name pool saves memory, but makes the parser
slower.

Event-based parser: As far as I remember, some of the pool
options are honoured, but not all.

*)

Setting the namespace processing option to a namespace_manager enables namespace
processing. This works only if the namespace-aware implementation
namespace_element_impl of element nodes is used in the spec;
otherwise you will get error messages complaining about missing
methods.

Note that PXP uses a technique called "prefix normalization" to
implement namespaces on top of the plain document model. This means
that the namespace prefixes of elements and attributes are changed
to unique prefixes if they are ambiguous, and that these
"normprefixes" are actually stored in the document tree. Furthermore,
the normprefixes are used for validation. (See
Intro_namespaces for details.)

Event-based parser: If namespace processing is enabled, the events
E_ns_start_tag and E_ns_end_tag are generated instead of E_start_tag
and E_end_tag, respectively.

Experimental feature.
If defined, the escape_contents function is called whenever
the tokens "{", "{{", "}", or "}}" are found in the context
of character data contents. The first argument is the token.
The second argument is the entity manager; it can be used to
access the lexing buffer directly. The result of the function
is the string to substitute.

"{" is the token Lcurly, "{{" is the token LLcurly, "}" is the
token Rcurly, and "}}" is the token RRcurly.
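
To make the token mapping concrete, here is a simplified model of such a substitution over a piece of character data. It is not how the PXP lexer works: the local token type, the function substitute, the greedy matching of doubled braces, and the omission of the entity manager argument are all assumptions of the sketch.

```ocaml
type token = Lcurly | LLcurly | Rcurly | RRcurly

(* Scan character data, hand each brace token to [escape], and splice in the
   returned string. Doubled braces are matched greedily (an assumption). *)
let substitute (escape : token -> string) (data : string) : string =
  let n = String.length data in
  let buf = Buffer.create n in
  let rec go i =
    if i < n then begin
      let double = i + 1 < n && data.[i + 1] = data.[i] in
      match data.[i], double with
      | '{', true -> Buffer.add_string buf (escape LLcurly); go (i + 2)
      | '}', true -> Buffer.add_string buf (escape RRcurly); go (i + 2)
      | '{', false -> Buffer.add_string buf (escape Lcurly); go (i + 1)
      | '}', false -> Buffer.add_string buf (escape Rcurly); go (i + 1)
      | c, _ -> Buffer.add_char buf c; go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf
```

A typical escape function might map the doubled tokens to literal braces and treat the single ones as delimiters of an embedded expression.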

Experimental feature.
If defined, the escape_attributes function is called whenever
the tokens "{", "{{", "}", or "}}" are found inside attribute
values. The function takes three arguments: The token (Lcurly,
LLcurly, Rcurly or RRcurly), the position in the attribute value,
and the entity manager.
The result of the function is the string substituted for the
token.

Example:
The attribute is "a{b{{c", and the function is called as
follows:

escape_attributes Lcurly 1 mng -
result is "42" (or an arbitrary string, but in this example it
is "42")

Deprecated.
Same as default_config, but namespace processing is turned on.
Note however, that a globally defined namespace manager is used.
Because of this, this config should no longer be used. Instead, do

Sources

Sources specify where the XML text to parse comes from. The type
source is often not used directly, but sources are constructed
with the help of the functions from_channel, from_obj_channel,
from_file, and from_string (see below). Note that you can
usually view the type source as an opaque type. There is no need
to understand why it enumerates these three cases, or to use them
directly. Just create sources with one of the from_* functions.

The type source is an abstraction on top of resolver (defined in
module Pxp_reader). The resolver is a configurable object that knows
how to access files that are

identified by an XML ID (a PUBLIC or SYSTEM name)

named relative to another file

referred to by the special PXP IDs Private and Anonymous.

Furthermore, the resolver knows a lot about the character encoding
of the files. See Pxp_reader for details.

A source is a resolver that is applied to a certain ID that should
be initially opened.

This function creates a source that reads the XML text from the
passed in_channel. By default, this source is not able to read
XML text from any other location (you cannot read from files etc.).
The optional arguments allow you to modify this behaviour.

Keep the following in mind:

Because this source reads from a channel, it can only be used once.

The channel will be closed by the parser when the end of the channel
is reached, or when the parser stops for another reason.

Unless the alt argument specifies something else, you cannot
refer to entities by SYSTEM or PUBLIC names (error "no input method
available").

Even if you pass an alt method that can handle SYSTEM, it is
not immediately possible to open SYSTEM entities that are defined
by a URL relative to the entity that is accessed over the in_channel.
You first must pass the system_id
argument, so the parser knows the base name relative to which
other SYSTEM entities can be resolved.
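
Why the base name matters can be seen from a deliberately naive sketch of relative resolution. The functions has_scheme and resolve_system_id are hypothetical; the sketch only replaces the last path segment of the base and handles neither dot segments nor relative names starting with a slash, so it is no substitute for a real URL resolver.

```ocaml
(* True if [s] starts with something like "file:" or "http:". *)
let has_scheme s =
  match String.index_opt s ':' with
  | Some i -> i > 0 && not (String.contains (String.sub s 0 i) '/')
  | None -> false

(* Resolve [rel] against [base]. Absolute URLs are returned unchanged;
   otherwise the last path segment of [base] is replaced by [rel]. *)
let resolve_system_id ~base rel =
  if has_scheme rel then rel
  else
    match String.rindex_opt base '/' with
    | Some i -> String.sub base 0 (i + 1) ^ rel
    | None -> rel  (* no usable base: nothing to resolve against *)
```

Without a base there is nothing that a name like other.dtd could be relative to, which is the situation of a channel that has no system_id.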

For more instructions on how to construct sources and resolvers,
see Intro_resolution.

Arguments:

alt: A list of further resolvers that are used to open further entities
referenced in the initially opened entity. For example, you can pass
new Pxp_reader.resolve_as_file() to enable resolving of
file names found in SYSTEM IDs.

system_id: By default, the XML text found in the in_channel does not
have any visible ID (to be exact, the in_channel has a private ID, but
this is hidden). Because of this, it is not possible to open
a second file by using a relative SYSTEM ID. The parameter system_id
assigns the channel a SYSTEM ID that is only used to resolve
further relative SYSTEM IDs.
This parameter must be encoded as a UTF-8 string.

fixenc: By default, the character encoding of the XML text is
determined by looking at the XML declaration. Setting fixenc
forces a certain character encoding. Useful if you can assume
that the XML text has been recoded by the transmission media.

Deprecated arguments:

id: This parameter assigns the channel an arbitrary ID (like system_id,
but PUBLIC, anonymous, and private IDs are also possible - although
not reasonable). Furthermore, setting id also enables resolving
of file names. id has higher precedence than system_id.

system_encoding: (Only useful together with id.) The character encoding
used for file names. (UTF-8 by default.)

This source reads initially from the file whose name is passed as
string argument. The
filename must be UTF-8-encoded (so it can be correctly rewritten into
a URL).

This source can open further files by default, and relative URLs
work immediately.

Arguments:

alt: A list of further resolvers, especially useful to open
non-SYSTEM IDs, and non-file entities.

system_encoding: The character encoding the system uses to represent
filenames. By default, UTF-8 is assumed.

enc: The character encoding of the string argument. As mentioned, this
is UTF-8 by default.

Examples.

The source

from_file "/tmp/file.xml"

reads from this file, which is assumed to have the ID
SYSTEM "file://localhost/tmp/file.xml". Other files can be
included either by an absolute SYSTEM file name or by a
relative SYSTEM name.

A from_channel source with suitable alt and system_id arguments
does roughly the same, but uses a channel for the initially opened
entity. Because of the alt argument, it is possible to reference
other entities by absolute SYSTEM name. The system_id assignment
makes it possible that SYSTEM names relative to the initially used
entity are resolvable.

`Entry_document: The parser reads a complete document that
must have a DOCTYPE and may have a DTD.

`Entry_declarations: The parser reads the external subset
of a DTD

`Entry_element_content:
The parser reads an entity containing contents, but there must
be one top element, i.e. "misc* element misc*". At the beginning,
there can be an XML declaration as for external entities.

`Entry_content:
The parser reads an entity containing contents, but without the
restriction of having a top element. At the beginning,
there can be an XML declaration as for external entities.

`Entry_expr: The parser reads a single element, a single
processing instruction or a single comment, or whitespace, whatever is
found. In contrast to the other entry points, the expression
need not be a complete entity, but can start and end in
the middle of an entity.

More entry points might be defined in the future.

The entry points have a list of flags. Note that `Dummy is
ignored and only present because O'Caml does not allow empty
variants.
For `Entry_document, and `Entry_declarations, the flags determine
the kind of DTD object that is generated.

Without flags, the DTD
object is configured for well-formedness mode:

Elements, attributes, and notations found in the XML text are not
added to the DTD; entity declarations are added, however. Additionally,
the DTD is configured such that it does not complain about missing
elements, attributes, and notations (dtd#arbitrary_allowed).

The flags affecting the DTD have the following meaning.
Keep in mind that the event parser can only conduct some validation
checks because it does not represent the XML nodes as a tree.

`Extend_dtd_fully: Elements, attributes, and notations are added
to the DTD. The DTD mode dtd#arbitrary_allowed is enabled.
If the resulting event stream is validated later, this mode
has the effect that the actually declared elements, attributes,
and notations are validated as declared. Also, non-declared
elements, attributes, and notations are not rejected, but
handled as in well-formed mode.

`Val_mode_dtd: The DTD object is set up for validation, i.e. all
declarations are added to the DTD, and dtd#arbitrary_allowed is
disabled. Furthermore, some validation checks are already done
for the DTD (e.g. whether the root element is declared).
If the resulting event stream is validated later, all validation
checks are conducted (except for the XML declaration - see the
next flag - this check must be separately enabled).

`Parse_xml_decl: By default, the XML declaration
<?xml version="1.0" encoding="..." standalone="..."?> is
ignored except for the encoding attribute. This flag causes
the XML declaration to be parsed completely.
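
The effect of the flags on the DTD object can be tabulated in a short sketch. The record dtd_setup and the function setup_of_flags are invented for the illustration, and the rule that `Val_mode_dtd wins when it is combined with `Extend_dtd_fully is an assumption.

```ocaml
type dtd_flag = [ `Extend_dtd_fully | `Val_mode_dtd | `Parse_xml_decl ]

type dtd_setup = {
  declarations_added : bool;  (* elements, attributes, notations enter the DTD *)
  arbitrary_allowed : bool;   (* mirrors the mode dtd#arbitrary_allowed *)
  xml_decl_parsed : bool;     (* whole XML declaration, not only the encoding *)
}

let setup_of_flags (flags : dtd_flag list) : dtd_setup =
  let validating = List.mem `Val_mode_dtd flags in
  let extending = List.mem `Extend_dtd_fully flags in
  { declarations_added = validating || extending;
    arbitrary_allowed = not validating;
    xml_decl_parsed = List.mem `Parse_xml_decl flags }
```

An empty flag list gives the well-formedness setup: no declarations added, and arbitrary elements, attributes, and notations allowed.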