Canonicalization for subtrees can be done using a simple and efficient
algorithm whose performance almost same as plain xml serialization, and
it produces the same output as the regular canonicalization algorithm.
This is not really a new proposal, many implementations already do this;
implementations go to great length to determine if this signature can be
done in an optimized way, otherwise they fall back to the generic
nodeset based approach. Do we really need to support the generic
nodeset? In another email chain I listed all the possible cases
http://lists.w3.org/Archives/Public/public-xmlsec/2009Apr/0045.html ,
cases 1-4 can be done using the subtree algorithm.
Description of the algorithm
*Data model*
Section 2.1 of the C14n spec talks about the Data Model being based on
the XPath Nodeset.
For the subtree algorithm the data model is different - the input is
a) either a complete document, or a list of elements. Each element
represents a complete subtrees, the elements are ordered in document
order, and no element is a descendant of another element in the list.
b) optionally a set of exclusion elements or attributes. The exclusion
elements are not ordered in any way
The algorithm efficiency depends on these lists being small, usually <10
*Processing Model*
The processing model is fairly simple, just do a traversal of each of
the subtrees, prune traversal when you reach an exclusion node. For each
node type follow the rules mentioned in Section 2.3 processing model.
For namespace nodes do this
- At the root of the subtree look combine all the namespace
declarations of that subtree's ancestors plus that subtree's root node,
and emit them. Also put them into a hashmap. Now as you traverse the
tree, whenever you come across a namespace declaration, check if it
already in the hashmap, if it is not there then emit it, and add the
namespace declaration to the hashmap. Also if you have added the
declaration, remember to remove it, once you have finished traversing
the node.
-------------------------
Why is this faster? This is because it avoids a lot of expensive things
that the current algorithm does
1. Does not create a large array to store a nodeset. Remember the
Nodeset can have tens of thousands of elements, whereas the
inclusion / exclusion list only has a handful of elements
2. Does not sort the nodeset - Sorting the nodeset by document order
is expensive
3. Does not traverse the entire document - only traverses the
required subtrees.
4. Does not use XPath at all. Xpath's namespace axis causes an
explosion of nodes.
5. Does not use the XPath namespace axis or the ancestor axis at all,
instead it maintains a current set of namespaces, and adds/removes
to it, as it is traversing the document. This is much more
efficient then looking at the ancestor axis for every element.
Pratik