Streamlining Iterators in QueryPath 3.x

Dec 1 2009

Work has officially begun on QueryPath 3.x. The upcoming release is focused on implementing and supporting many of the new features introduced in PHP 5.3, including enhanced SPL support, namespaces, closures, and phar archives.

(This is from the original QueryPath 3.x branch, and uses the backslash to identify the IteratorIterator's namespace. If that doesn't make sense to you, just ignore it.)

What makes the SPL iteration classes so compelling is the ease with which they can be selectively extended. By overriding a single method, current(), we can create a custom iterator that wraps every item in a QueryPath object. Because of that, a loop like this works:

<?php
foreach ($qp as $item) {
  // $item is a QueryPath because of current().
}
?>

Each time an item is retrieved from the QueryPath collection, it is wrapped in a QueryPath object.
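To make the mechanism concrete, here is a minimal, self-contained sketch of the same idea that runs without QueryPath. ItemWrapper and WrappingIterator are hypothetical names invented for this illustration; ItemWrapper merely stands in for a QueryPath object. Only current() is overridden, and everything else is inherited from IteratorIterator.

```php
<?php
// Hypothetical stand-in for a QueryPath object; not part of QueryPath.
class ItemWrapper {
  public $item;

  public function __construct($item) {
    $this->item = $item;
  }
}

// Only current() is overridden. rewind(), valid(), key() and next()
// are inherited unchanged from IteratorIterator.
class WrappingIterator extends \IteratorIterator {
  // The attribute below silences a return-type deprecation notice on
  // PHP 8.1+; older PHP versions treat the line as a comment.
  #[\ReturnTypeWillChange]
  public function current() {
    // A new wrapper object is created on every call.
    return new ItemWrapper(parent::current());
  }
}

$seen = array();
foreach (new WrappingIterator(new \ArrayIterator(array('a', 'b', 'c'))) as $item) {
  $seen[] = get_class($item) . ':' . $item->item;
}

echo implode(', ', $seen), "\n"; // ItemWrapper:a, ItemWrapper:b, ItemWrapper:c
```

Note that the sketch allocates one wrapper per element, which is the per-item cost discussed next.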

As nice as this is, it has a drawback, as my previous benchmarks showed. It is slow, and the primary reason for slowness is the creation of a new QueryPath object for each item.

This got me thinking: What if the same QueryPath object were re-used, instead of creating a new QueryPath for each item? With this in mind, I re-visited the code.

The New Iterator Code

The QueryPath class already has a method for manually setting the contents of the QueryPath item list. While this method has been around for a long time, it has been marked as private to prevent outside access. However, while working on these benchmarks I realized that making it public would not only solve my problem, but also open the door for others to create useful outer iterators on QueryPath objects.

Thus, the first step in fixing the problem was to mark the setMatches() method as public.

With that done, the QueryPathIterator class can be re-written as follows:

<?php
class QueryPathIterator extends \IteratorIterator {
  public $options = array();
  private $qp = NULL;

  public function current() {
    // Re-using the QueryPath object cuts off 4/5 of the iteration time
    // on large sets.
    if (!isset($this->qp)) {
      // First item: create the one QueryPath object that will be re-used.
      $this->qp = qp(parent::current(), NULL, $this->options);
    }
    else {
      // Later items: swap the new item into the existing QueryPath object.
      $splos = new \SplObjectStorage();
      $splos->attach(parent::current());
      $this->qp->setMatches($splos);
    }
    return $this->qp;
  }
}
?>

The real change in the code above is that only one QueryPath object is created. It is then re-used for each item in the collection.
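One consequence of re-using a single object is worth illustrating. The sketch below mirrors the pattern with a hypothetical stand-in class, ReusableWrapper, whose setMatches() is a simplified assumption (it takes the raw item rather than an SplObjectStorage). It is not the real QueryPath class, and ReusingIterator is likewise an invented name. The sketch shows that every iteration hands back the same instance, so items should be read inside the loop rather than collected for later.

```php
<?php
// Hypothetical stand-in for QueryPath with a public setter. The real
// setMatches() takes an SplObjectStorage; this one is simplified.
class ReusableWrapper {
  private $matches;

  public function __construct($item) {
    $this->setMatches($item);
  }

  public function setMatches($item) {
    $this->matches = $item;
  }

  public function get() {
    return $this->matches;
  }
}

class ReusingIterator extends \IteratorIterator {
  private $wrapper = NULL;

  // Silences a return-type deprecation notice on PHP 8.1+; older PHP
  // versions treat the line as a comment.
  #[\ReturnTypeWillChange]
  public function current() {
    if (!isset($this->wrapper)) {
      // First call: pay the construction cost once.
      $this->wrapper = new ReusableWrapper(parent::current());
    }
    else {
      // Later calls: swap the contents of the existing object.
      $this->wrapper->setMatches(parent::current());
    }
    return $this->wrapper;
  }
}

$values = array();
$kept = array();
foreach (new ReusingIterator(new \ArrayIterator(array(1, 2, 3))) as $w) {
  $values[] = $w->get(); // Read inside the loop: 1, then 2, then 3.
  $kept[] = $w;          // Stores the same instance three times.
}

$sameInstance = ($kept[0] === $kept[2]); // true
$lastSeen = $kept[0]->get();             // 3, the last item swapped in
```

In this sketch, a caller that needs to hold on to an item past the current iteration would have to copy it first (for the stand-in, `clone $w` is enough). Whether and how that applies to a real QueryPath object depends on QueryPath's own copy semantics.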

While this solution is considerably more complex (it requires much more knowledge of QueryPath's internals), it represents a tremendous performance gain.

With the re-architecting of the QueryPathIterator, the time goes down substantially. Here's a representative example:

Test of iterator took 0.018096 sec.

So the new version is just about three times as fast as the original. While this is still slower than working directly with the DOMNode objects, it is a substantial improvement, and should help when working with moderately sized lists of nodes. (Recall that the benchmark is based on a node list with 1100 elements.)
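For readers who want to take a similar measurement, here is a rough sketch of a timing harness built on microtime(). It is not the benchmark script behind the numbers above: time_iteration() is a hypothetical helper, and a plain ArrayIterator of 1100 integers stands in for the node list, so the absolute times it prints are not comparable to the figures in this post.

```php
<?php
// Hypothetical helper, not taken from the original benchmark script.
// Iterates over anything foreach-able and reports the item count and
// the elapsed wall-clock time in seconds.
function time_iteration($iterable) {
  $count = 0;
  $start = microtime(true);
  foreach ($iterable as $item) {
    $count++;
  }
  $elapsed = microtime(true) - $start;
  return array($count, $elapsed);
}

// 1100 plain integers stand in for the 1100-element node list.
list($count, $elapsed) = time_iteration(new \ArrayIterator(range(1, 1100)));
printf("Test of iterator took %f sec.\n", $elapsed);
```

Passing a wrapping iterator and a re-using iterator over the same data to such a helper, and repeating each run several times, would give a fairer comparison than a single measurement.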

This new improvement will come along in QueryPath 3.x. Because the 2.x branch will likely remain active for a longer period of time as PHP 5.2 continues to be dominant, I may backport this change and release it as part of the next QueryPath 2.x release.