StaxBuilder, StaxParser, StaxUtil and StaxReader have been improved and have graduated to the nux.xom.io package. This includes...

A facility to write to a StAX XMLStreamWriter (StreamingSerializerFactory).

A subclass of nu.xom.Builder that has the same behaviour as the superclass, except that it runs over StAX instead of SAX (StaxUtil.createBuilder(XMLInputFactory, NodeFactory)).
It can be used to plug in SAX or StAX polymorphically.
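
To illustrate what running over StAX instead of SAX means, here is a minimal sketch that pulls events from a JDK javax.xml.stream.XMLStreamReader and records element names in document order. It uses only the JDK. The class StaxPullSketch and its method are hypothetical and not part of Nux; a real builder would hand each event to a NodeFactory rather than to a list.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of Nux or XOM.
final class StaxPullSketch {

    /** Returns the local names of all elements, in document order. */
    static List<String> elementNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            // Pull model: the application asks for the next event,
            // instead of receiving SAX callbacks.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not parse input: " + e.getMessage(), e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementNames("<articles><article/><article/></articles>"));
    }
}
```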

Added NodeBuilder and various performance improvements to the redirecting factory and the streaming bnux serializer.

Added the following ant targets for massive integration tests:

Target: Description

download-testdata: Downloads some 50000 XML files from various public test data repositories.

test-xqts: Runs the Official W3C XQuery Test Suite (XQTS).

test-bnux: Parses all 50000 files with XOM/SAX/Xerces, roundtrips them with bnux binary XML and compares results against original documents parsed with XOM/SAX/Xerces, testing for equality wrt. Canonical XML as well as XOM's more strict assertEquals() test routine.

test-staxbuilder: Same as test-bnux except that it uses XOM/StAX/Woodstox instead of bnux.

test-staxserializer: Same as test-bnux except that it parses with XOM/SAX/Xerces, serializes with XOM/StAX/Woodstox, reparses with XOM/SAX/Xerces, then compares results against original documents parsed with XOM/SAX/Xerces.

Before running any of these targets, tell them to use enough memory and the latest stable Xerces version, like this:
export JAVA_OPTS='-Xmx200m -Djava.endorsed.dirs=/Users/hoschek/unix/java/share/apache/xerces-2.8.0'
For tests involving StAX, download Woodstox (e.g. wstx-asl-3.0.jar) and copy it into nux/lib.

Added Streaming Serialization of Very Large Documents in the nux.xom.io package.
With memory consumption close
to zero, the new StreamingSerializer
enables writing arbitrarily large XML documents onto a destination, such as an OutputStream,
both for standard textual XML as well as bnux binary XML (and StAX).
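
The general idea of streaming serialization can be sketched with the JDK's own javax.xml.stream.XMLStreamWriter: each element is written as soon as it is produced, so no document tree is ever held in memory. This is only an illustration under that assumption; StreamingWriteSketch is a hypothetical class and does not use the Nux StreamingSerializer API. The writeToString helper buffers the output only to make the demo easy to inspect.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical illustration class; not part of Nux or XOM.
final class StreamingWriteSketch {

    /** Writes an items element with `count` item children, one at a time. */
    static void writeItems(OutputStream out, int count) {
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            w.writeStartDocument();
            w.writeStartElement("items");
            for (int i = 0; i < count; i++) {
                // Only the current item exists in memory; nothing is accumulated.
                w.writeStartElement("item");
                w.writeAttribute("id", Integer.toString(i));
                w.writeCharacters("payload " + i);
                w.writeEndElement();
            }
            w.writeEndElement();
            w.writeEndDocument();
            w.flush();
            w.close(); // does not close the underlying stream
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Demo helper only: a real caller would pass a file or socket stream instead. */
    static String writeToString(int count) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeItems(out, count);
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(writeToString(3));
    }
}
```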

Added streaming bnux deserialization
for handling arbitrarily large input documents;
uses an InputStream and an application provided NodeFactory just like a XOM Builder does.

Added bnux serialization to an OutputStream.

To enable true streaming, a serialized bnux document now consists internally of one or more
independent pages, each at most 64 KB large. Each page is a tokenized byte array containing a portion of the XML document, in document order.
Once a page has been read/written, related (heavy) state can be discarded, freeing memory.
No more than one page needs to be held in memory at any given time.
For very large documents this reduces memory consumption, increases throughput and reduces latency.
For small to medium sized documents it makes next to no difference.
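
The paging idea can be sketched as follows: a stream is cut into length-prefixed pages of at most 64 KB, and both writer and reader reuse a single page buffer. Note that this is not the actual bnux wire format. The 4-byte length prefix, the empty end-marker page and the class PagedStreamSketch are assumptions made purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration class; the framing below is NOT the real bnux format.
final class PagedStreamSketch {
    static final int MAX_PAGE = 64 * 1024; // page size limit mentioned in the text

    /** Copies `in` to `out` as length-prefixed pages; returns the number of pages. */
    static int writePages(InputStream in, OutputStream out) {
        try {
            DataOutputStream data = new DataOutputStream(out);
            byte[] page = new byte[MAX_PAGE]; // the only large buffer, reused per page
            int pages = 0;
            int n;
            while ((n = in.readNBytes(page, 0, MAX_PAGE)) > 0) {
                data.writeInt(n);       // assumed header: payload length
                data.write(page, 0, n); // payload
                pages++;
            }
            data.writeInt(0); // assumed end marker: an empty page
            data.flush();
            return pages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the pages back one at a time; returns the total payload size. */
    static long readPages(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(in);
            byte[] page = new byte[MAX_PAGE];
            long total = 0;
            int n;
            while ((n = data.readInt()) != 0) {
                if (n < 0 || n > MAX_PAGE) {
                    throw new IOException("corrupt page length: " + n);
                }
                data.readFully(page, 0, n);
                total += n; // a real decoder would consume the page's tokens here
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pages = writePages(new ByteArrayInputStream(new byte[150_000]), out);
        System.out.println(pages + " pages, "
                + readPages(new ByteArrayInputStream(out.toByteArray())) + " payload bytes");
    }
}
```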

Removed deprecated methods XOMUtil.toByteArray() and XOMUtil.toString().
The methods remain available but have been moved into class FileUtil.

Added more test document collections in samples directory.

Added package nux.xom.sandbox,
a playground for kicking around various ideas and prototypes without any
API compatibility guarantees. Code quality varies from sketchy to reliable,
but is generally not nearly as well designed and tested as the remainder of Nux.
In the future some of these classes may (or may not) graduate into stable
packages.

Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue to work as well).

Upgraded to xom-1.1-rc1 (with compatible
performance patches).
Plain xom-1.0 should continue to work as well, albeit less efficiently.

Numerous bnux Binary XML
performance enhancements for serialization and deserialization
(UTF-8 character encoding, buffer management, symbol table, pack sorting, cache locality, etc).
Overall, bnux is now about twice as fast, and, perhaps more importantly, has a much
more uniform performance profile, no matter what kind of document flavour is thrown at it.
It routinely delivers 50-100 MB/sec deserialization performance, and 30-70 MB/sec
serialization performance (commodity PC 2004).
It is roughly 5-10 times faster than xom-1.1 with xerces-2.7.1
(which, in turn, is faster than saxonb-8.5, dom4j-1.6.1 and xerces-2.7.1 DOM).
Further, preliminary measurements indicate bnux deserialization and serialization
to be consistently 2-3 times faster than Sun's FastInfoSet implementation, using XOM.
Saxon's PTree could not be tested as it is only available in the commercial version.
The only remaining area with substantial potential for performance improvement seems to be complex
namespace handling.
This might be addressed by slightly restructuring private XOM internals
in a future version.

BinaryXMLTest now also has command line support for testing and benchmarking
Saxon, DOM and FastInfoSet (besides bnux and XOM).

Rewrote XQueryCommand. The new
nux/bin/fire-xquery
is a more powerful, flexible and reliable command line test tool that runs a given
XQuery against a set of files and prints the result sequence.
In addition, it supports schema validation, XInclude (via XOM),
an XQuery update facility, malformed HTML parsing (via TagSoup) and much more.
It's available for Unix and Windows, and works like any other decent Unix command
line tool.

Added nux.xom.xquery.ResultSequenceSerializer,
which serializes an XQuery/XPath2 result sequence onto a given output stream, using
various configurable serialization options such as encoding and indentation.
Implements the
W3C XQuery/XSLT2 Serialization Draft Spec.
Also implements an alternative wrapping algorithm that ensures that any arbitrary
result sequence can always be output as a well-formed XML document.

Added XQueryFactory.createXQuery(File file, URI baseURI) and
XQueryPool.getXQuery(File file, URI baseURI) to allow for separation
of the location of the query file and input XML files.

The default XQuery DocumentURIResolver now recognizes the ".bnux" file
extension as binary XML, and parses it accordingly.
For example, a query can be 'doc("samples/data/articles.xml.bnux")/articles/*'

Added FileUtil.listFiles().
Returns the URIs of all files whose path matches at least one of the given
inclusion wildcard or regular expressions but none of the given exclusion
wildcard or regular expressions; starting from the given directory, optionally
with recursive directory traversal, insensitive to underlying operating system
conventions.
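
The inclusion/exclusion semantics can be sketched with the JDK's java.nio.file glob matchers. This is not the FileUtil.listFiles() implementation or its signature; ListFilesSketch, its parameters and the demo helper are hypothetical, and only glob patterns are handled here (the JDK also accepts a "regex:" syntax).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration class; not the Nux FileUtil API.
final class ListFilesSketch {

    /** URIs of regular files matching at least one include glob and no exclude glob. */
    static List<URI> listFiles(Path dir, boolean recursive,
                               List<String> includes, List<String> excludes) {
        List<PathMatcher> in = toMatchers(includes);
        List<PathMatcher> ex = toMatchers(excludes);
        int depth = recursive ? Integer.MAX_VALUE : 1;
        try (Stream<Path> paths = Files.walk(dir, depth)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> matchesAny(in, dir.relativize(p)))
                    .filter(p -> !matchesAny(ex, dir.relativize(p)))
                    .map(Path::toUri)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static List<PathMatcher> toMatchers(List<String> globs) {
        return globs.stream()
                .map(g -> FileSystems.getDefault().getPathMatcher("glob:" + g))
                .collect(Collectors.toList());
    }

    private static boolean matchesAny(List<PathMatcher> matchers, Path relativePath) {
        return matchers.stream().anyMatch(m -> m.matches(relativePath));
    }

    /** Demo: builds a small temporary tree (not cleaned up) and lists it. */
    static List<String> demo(boolean recursive) {
        try {
            Path root = Files.createTempDirectory("listfiles-sketch");
            Files.createDirectories(root.resolve("sub"));
            for (String name : List.of("a.xml", "a-draft.xml", "notes.txt", "sub/b.xml")) {
                Files.write(root.resolve(name), new byte[0]);
            }
            URI base = root.toUri();
            // "**" crosses directory boundaries, "*" does not.
            return listFiles(root, recursive, List.of("**.xml"), List.of("**draft*")).stream()
                    .map(u -> base.relativize(u).toString())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(true));
    }
}
```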

Arbitrary Lucene fulltext queries can be run from Java or
from XQuery/XPath/XSLT via a simple extension function.
The former approach is more flexible whereas the latter is more convenient.
Lucene analyzers can split on whitespace, normalize to lower case
for case insensitivity, ignore common terms with little
discriminatory value such as "he", "in", "and" (stop words),
reduce the terms to their natural linguistic root form such as
"fishing" being reduced to "fish" (stemming), resolve
synonyms/inflexions/thesauri (upon indexing and/or querying), etc.
Also see Lucene Query Syntax
as well as Query Parser Rules.
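
To make the analyzer steps above concrete, here is a toy pipeline in plain Java: split on whitespace, lower-case, drop stop words, then stem. It does not use Lucene at all; ToyAnalyzerSketch, its tiny stop word list and its deliberately naive "strip a trailing ing" stemmer are assumptions for illustration, and real analyzers are far more careful.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not a Lucene Analyzer.
final class ToyAnalyzerSketch {
    // Tiny sample list; real stop word lists are much longer.
    private static final Set<String> STOP_WORDS = Set.of("he", "in", "and", "the", "a");

    /** Tokenizes on whitespace, lower-cases, removes stop words, stems. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // Lower-case and strip punctuation so "Fishing," becomes "fishing".
            String term = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}]", "");
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue;
            }
            terms.add(stem(term));
        }
        return terms;
    }

    /** Naive stemmer: only strips a trailing "ing" from longer words. */
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        return term;
    }

    public static void main(String[] args) {
        System.out.println(analyze("He was Fishing in lakes and rivers"));
    }
}
```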

Background: The first prototype was put together over the weekend.
The functionality worked just fine, except that it took ages
to index and search text in a high-frequency environment. Subsequently I wrote a complete
reimplementation of the Lucene interfaces and contributed that
back to Lucene (the bits in org.apache.lucene.index.memory.*).
Next, I placed a smart cache in front of it (the bits in
nux.xom.pool.FullTextUtil / FullTextPool).
The net effect is that fulltext queries over realtime data
now run some three orders of magnitude faster while preserving the same
general functionality (e.g. 100000-500000 queries/sec ballpark).
In fact, you'll probably notice little or
no overhead when adding fulltext search to your streaming apps.
See MemoryIndexBenchmark and
XQueryBenchmark.

Explore and enjoy, perhaps using the queries and sample data from the
samples/fulltext directory as a starting point.

Removed deprecated XQueryUtil.normalizeTexts(). The same functionality remains available through XOMUtil.Normalizer.PRESERVE.normalize().

nux.xom.pool: Added a configurable XML caching framework.
Classes DocumentFactory, DocumentPool, DocumentMap and PoolConfig
enable efficient compact thread-safe pooling/caching of XOM document objects.
Cached documents typically consume 20-100 times less memory than the equivalent XOM main memory tree.
Usage is safe: It survives stress tests looking for memory leaks, race conditions, etc.
Plugins for dependency chain invalidation could be added in the future,
but for the moment this isn't explicitly supported.
Comments on this or any other Nux aspect are always welcome.
See API.

nux.xom.pool.*: All pools and ThreadLocals now internally use SoftReferences to allow for automatic garbage collection of cached objects in low-memory situations.
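
For readers unfamiliar with the technique, a cache built on SoftReferences can be sketched roughly like this. SoftCacheSketch is a hypothetical class, not the Nux pool implementation: it shows only the core idea that the garbage collector may clear a cached value under memory pressure, after which the value is simply loaded again.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical illustration class; not part of nux.xom.pool.
final class SoftCacheSketch<K, V> {
    private final ConcurrentMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SoftCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, reloading it if it was never cached or was collected. */
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            // Two threads may both load here; the last put wins, which is harmless
            // as long as the loader is side-effect free.
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    /** Number of entries, including ones whose referent has already been cleared. */
    int size() {
        return map.size();
    }
}
```

A fuller version would also register the references with a ReferenceQueue so that entries whose values have been collected can be purged from the map.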

nux.xom.pool.*: All pools now have a constructor that takes a PoolConfig object.

XQuery/XPath: Now requires saxon-8.4 (bundled with the download).
See the Saxon documentation for the changelog.
In particular, note that the namespace axis is no longer supported in XQuery.

XQuery/XPath: On output, now auto-converts any Saxon NodeInfo implementation
(e.g. TinyTree, StandardTree, JDOM, DOM)
while fully preserving node identities, even in the presence of documentless nodes.
This required some minor refactoring internal to NodeWrapper.

XQuery/XPath: Better documentation on how to use extension functions and modules.

Added XOMUtil.Normalizer with standard XML algorithms for text and whitespace normalization of trees.

Deprecated XQueryUtil.normalizeTexts(). The same functionality is now available through XOMUtil.Normalizer.PRESERVE.normalize().

The obnoxious dependencies on jars for DOM Level 3, JAXP-1.3 and JaxMe
have now disappeared, even under JDK < 1.5! This functionality is no longer needed at all,
meaning less baggage and fewer installation and classpath problems for all of us :-)

XQuery/XPath: By default the doc() function now uses a DocumentURIResolver
that uses a non-validating XOM Builder to parse documents.
This can be overridden by passing in your custom DocumentURIResolver.

XQuery/XPath: By default a top-level atomic value in the result sequence is
converted to an Element named "atomic-value" with a child Text
node holding the atomic value's standard XPath 2.0 string representation.
An "atomic-value" element is decorated with a namespace and a W3C XML Schema type attribute.
The XPath 2.0 string representation continues to be accessible via Node.getValue().
Because XOM has no concept of a namespace node, the same conversion occurs for XPath namespace nodes in the result sequence.
(The standard XPath 2.0 string representation of a namespace node is its URI).
"Normal" nodes and anything not at top-level continue to be returned "as is", without conversion.

XQuery/XPath: now accepts any XOM Node as context node and as variables,
not just a ParentNode. A variable can now also be bound to a node list
(i.e. a XOM Nodes object).
This makes it possible to pass the output of one query as input into another query.

net.sf.saxon.xom.DocumentWrapper: the id() function now finds the first rather than the last element
in invalid documents that have multiple elements with the same ID.
Performance is also improved via a hash index.
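
The "first element wins" behaviour combined with a hash index can be sketched in a few lines. This is not the DocumentWrapper code; IdIndexSketch and its minimal Elem stand-in are hypothetical, and the sketch assumes the elements are supplied in document order.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not the net.sf.saxon.xom implementation.
final class IdIndexSketch {

    /** Minimal stand-in for an element: a name plus an ID value (may be null). */
    static final class Elem {
        final String name;
        final String id;

        Elem(String name, String id) {
            this.name = name;
            this.id = id;
        }
    }

    /** Builds an ID index in one pass; on duplicate IDs the first element is kept. */
    static Map<String, Elem> buildIndex(List<Elem> elementsInDocumentOrder) {
        Map<String, Elem> index = new HashMap<>();
        for (Elem e : elementsInDocumentOrder) {
            if (e.id != null) {
                index.putIfAbsent(e.id, e); // a plain put() would keep the last one instead
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Elem> index = buildIndex(List.of(
                new Elem("first", "x"), new Elem("second", "x"), new Elem("third", "y")));
        System.out.println(index.get("x").name);
    }
}
```

After the index is built, each id() lookup is a single hash probe rather than a scan of the document.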

This release is synchronized and works in tandem with the
recent saxon-8.2 and xom-1.0 final releases.

Nux now works with the recent saxon-8.2 release,
hence implementing W3C XQuery Working Draft 29 October 2004.
The download includes the saxon8.jar file from saxonb-8.2.
As far as we can tell from testing, it is not necessary to include any
JAXP-1.3 jars (i.e. dom.jar, jaxp-api.jar) in the classpath,
even when running saxon-8.2 with JDK < 1.5.
This is good news because it avoids licensing problems as well as classpath,
version, redistribution and deployment problems.
Let us know if you find Nux use cases where JAXP 1.3 is required.
In any case, this Nux release should continue to work fine with the old saxon-8.1.1.

Added class BinaryXMLCodec,
which serializes (encodes) and deserializes (decodes) XOM XML documents to and from
an efficient and compact custom binary XML data format (termed bnux
format), without loss or change of any information. Serialization and
deserialization are much faster than with the standard textual XML format, and
the resulting binary data is more compressed than textual XML.

Requires a small backwards compatible
external patch
to the XOM DocType, making method setInternalDTDSubset public.
Copy the file into the XOM source codebase, and rebuild XOM from source
with cd xom; ant jar

Versions are now labelled as "Beta", meaning: No known bugs exist, and no incompatible changes are planned.
Please stress this release to shake out any remaining bugs potentially lurking in remote corners.

For simple and complex continuous queries and/or transformations over very large or infinitely long XML input documents,
we have added a convenient streaming path filter
API, combining full XQuery support with straightforward filtering.
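
The filtering half of this idea can be sketched with plain JDK StAX: track the current element path, and whenever it equals a given location path, process that subtree and then drop its state before moving on. This is not the Nux streaming path filter API and runs no XQuery; PathFilterSketch is hypothetical, and "processing" here just means collecting the subtree's text.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not the Nux streaming path filter.
final class PathFilterSketch {

    /** Returns the text content of every element whose path equals `path`, e.g. "/feed/entry". */
    static List<String> textOfMatches(String xml, String path) {
        List<String> results = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>(); // names from the root down to the current element
        StringBuilder current = null;             // non-null only while inside a match
        int matchDepth = -1;
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        stack.addLast(r.getLocalName());
                        if (current == null && ("/" + String.join("/", stack)).equals(path)) {
                            current = new StringBuilder();
                            matchDepth = stack.size();
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (current != null) {
                            current.append(r.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if (current != null && stack.size() == matchDepth) {
                            results.add(current.toString());
                            current = null; // discard per-match state before moving on
                        }
                        stack.removeLast();
                        break;
                    default:
                        break;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        String xml = "<feed><entry>a<b>1</b></entry><skip>x</skip><entry>c</entry></feed>";
        System.out.println(textOfMatches(xml, "/feed/entry"));
    }
}
```

For a truly unbounded input, results would be handed to a callback as they are found instead of being collected in a list.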

If you want it, you can get the
external patch.
The patch adds an additional constructor (needed for thread-safety and flexibility) in a backwards compatible way.
Copy the file into the XOM source codebase, and rebuild XOM from source
with cd xom; ant jar

The license statement now makes it clear that package net.sf.saxon.xom
is under the Mozilla license (co-developed with Michael Kay, the Saxon author).