Wednesday, May 15, 2013

Been a long time since my last post here; busy year! And man, does this blog layout look dated... priorities...

Here's a quick and easy fix for a problem I couldn't find a complete solution to elsewhere. A nice input form UI will let the user press the ENTER key to submit the form. Browsers don't handle this action predictably or consistently, especially when there's more than one form on a page. Furthermore, if you aren't actually using a submit-type element, browsers won't do anything at all, since they have no way of knowing your intent.

This extender for knockout.js + jQuery adds that functionality. All you need to do is add an enterkey binding to an element that wraps your form; pressing ENTER in any input control that's a descendant of the element containing the binding will cause the bound controller method to be invoked.
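A minimal sketch of such a binding follows. The helper function, the environment guard, and the handler shape here are my own assumptions, written to illustrate the idea rather than reproduce the original extender:

```javascript
// True when a keyboard event represents the ENTER key.
function isEnterKey(event) {
    return (event.which || event.keyCode) === 13;
}

// Only register the binding when knockout and jQuery are actually loaded.
if (typeof ko !== 'undefined' && typeof jQuery !== 'undefined') {
    ko.bindingHandlers.enterkey = {
        init: function (element, valueAccessor, allBindings, viewModel) {
            // Delegate keypress from any descendant input control of the
            // bound element to the view-model method given in the binding.
            jQuery(element).on('keypress', ':input', function (event) {
                if (isEnterKey(event)) {
                    valueAccessor().call(viewModel);
                    return false; // suppress the default submit behavior
                }
                return true;
            });
        }
    };
}
```

Usage would then look like `<div data-bind="enterkey: save">...</div>`, where `save` is a method on your view model.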

Tuesday, October 16, 2012

CsQuery 1.3 has been released. You can get it from NuGet or from the source repository on GitHub.

New HTML5 Compliant Parser

This release replaces the original HTML parser with the validator.nu HTML5 parser: a complete, standards-compliant HTML5 parser built from the same codebase used in Gecko-based web browsers (e.g. Firefox). You should expect excellent compatibility with the DOM that a web browser would render from the same markup. Problems people have had in the past with character set encoding, invalid HTML parsing, and other edge cases should simply go away.

In the process of implementing the new parser, some significant changes were made to the input and output API to take advantage of its capabilities. While these revisions are generally backwards compatible with 1.2.1, there are a few potentially breaking changes, summarized as follows:

DomDocument.DomRenderingOptions has been removed. The concept of assigning output options to a Document doesn't make sense any more (if it ever did); rather, you define options for how output is rendered at the time you render it.

IOutputFormatter interface has changed. This wasn't really used for anything before, so I doubt this will impact anyone, but it's conceivable that someone coded against it. The interface has been revised somewhat, and it is now used extensively to define a model for rendering output.

Hopefully, these changes won't impact you much or at all. But with this small price comes a host of new options for parsing and rendering HTML.

Create Method Options

In the beginning, there was but a single way to create a new DOM from HTML: Create. And it was good. But as the original parser evolved towards HTML5 compliance, the CreateFragment and CreateDocument methods were added to make intent explicit. Different rules apply depending on the context: a full document must always have an html tag (among others), for example, but you wouldn't want missing tags added if your intent was to create a fragment that isn't supposed to stand alone.

The new parser has some more toys. It lets us define an expected document type (HTML5, HTML4 Strict, HTML4 Transitional). We can tell it the context we expect our HTML to be found in when it starts parsing. We can choose to discard comments, and decide whether to permit self-closing XML tags. All of these things went into the Create method, giving you complete control over how your input gets processed.
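As a sketch of what this looks like in practice (the overload shape and enum member names below reflect my reading of the 1.3 API; verify them against the Create method documentation before relying on them):

```csharp
using System;
using CsQuery;

class CreateWithOptions
{
    static void Main()
    {
        string html = "<div><span>hi</span>";

        // Parse as a full document under HTML5 rules, discarding comments.
        // HtmlParsingMode, HtmlParsingOptions, and DocType names here are
        // assumptions based on the 1.3 release notes.
        CQ dom = CQ.Create(html,
            HtmlParsingMode.Document,
            HtmlParsingOptions.IgnoreComments,
            DocType.HTML5);

        Console.WriteLine(dom["span"].Text());
    }
}
```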

New Overloads

The basic Create method has overloads to accept a number of different kinds of input:

When calling the basic methods, the "default" values of each of these will be used. The default values are defined on the CsQuery.Config object (the "default defaults" are shown here -- if you change these on the config object, your new values will be used whenever a default is requested):

If you pass a method both Default and some other option(s), it will merge the default values with any additional options you specified. On the other hand, passing options that do not include Default will result in only the options you passed being used.

The other methods remain more or less unchanged. CreateDocument and CreateFragment now simply call Create using the appropriate HtmlParsingOption to define the intended document type.

The Create method offers a wide range of options for input and parsing. The other methods were created for convenience, before an API to handle input features had been thought out. Though I don't intend to deprecate them right away, I likely won't extend them to support the various options. Anything you can do with these methods can be done about as easily with `Create` and a helper of some kind. For example, if you want to load a DOM from a file using options other than the defaults, you can just pass `File.Open(..)` to the standard `Create` method.
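For example (the file name is hypothetical; the point is that Create accepts a Stream, so no dedicated file helper is needed):

```csharp
using System;
using System.IO;
using CsQuery;

class LoadFromFile
{
    static void Main()
    {
        // Create accepts a Stream as well as a string, so loading a DOM
        // from a file is just a matter of opening one.
        using (var stream = File.OpenRead("mydoc.html"))
        {
            CQ dom = CQ.Create(stream);
            Console.WriteLine(dom["title"].Text());
        }
    }
}
```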

Render Method Options

The Render method signatures look pretty much the same as in 1.2.1, but a lot has changed behind the scenes. The IOutputFormatter interface, which used to be more or less a placeholder, now runs the show. All output is controlled by OutputFormatters implementing this interface. Any Render method that doesn't explicitly identify an OutputFormatter will use the default formatter provided by the service locator CsQuery.Config.GetOutputFormatter:

public static Func<IOutputFormatter> GetOutputFormatter {get;set;}

You can replace the default locator with any delegate that returns an IOutputFormatter. Additionally, you can assign a single instance of a class to the CsQuery.Config.OutputFormatter property, which, if set, supersedes the service locator. When using this approach, the object must be thread safe, since new instances will not be created for each use.
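In code, the two approaches look something like the following sketch. MyFormatter is a hypothetical class implementing IOutputFormatter; this is illustrative, not verbatim API:

```csharp
using CsQuery;

// Option 1: replace the service locator with any delegate returning an
// IOutputFormatter. A fresh instance per call keeps this safe even if
// the formatter holds per-render state.
Config.GetOutputFormatter = () => new MyFormatter();

// Option 2: assign one shared instance. When set, this supersedes the
// service locator, so MyFormatter must be thread safe.
Config.OutputFormatter = new MyFormatter();
```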

There are a number of built-in IOutputFormatter objects accessible through the static OutputFormatters factory:

Each of these except the last returns an OutputFormatter configured with a particular HtmlEncoder. The last strips out HTML and returns just the text contents (to the best of its ability). The factory also has Create methods that let you configure it with specific DomRenderingOptions too. Complete details of these options are in the Render method documentation.
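As a sketch of usage (the factory member names here are from my reading of the 1.3 API and should be checked against the Render method documentation):

```csharp
using CsQuery;
using CsQuery.Output;

class RenderExample
{
    static void Main()
    {
        CQ dom = CQ.Create("<div><!-- note -->Caf\u00e9</div>");

        // Use a prebuilt formatter configured with a particular HtmlEncoder...
        string minimal = dom.Render(OutputFormatters.HtmlEncodingMinimum);

        // ...or create one configured with specific DomRenderingOptions.
        string noComments = dom.Render(
            OutputFormatters.Create(DomRenderingOptions.RemoveComments));

        System.Console.WriteLine(minimal);
        System.Console.WriteLine(noComments);
    }
}
```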

Bug Fixes

Issue #51: Fix an issue with compound subselectors whose target included CSS matches above the level of the context.

Fix: :empty could incorrectly return false when nodes other than text or element nodes (e.g. comments) are present.

Other New Features

The completely new HTML parser and input and output models aren't enough for you? Well, there are a couple of other minor new features.

CsQuery should compile under Mono now, after implementing a suggested change to `CsQuery.Utility.JsonSerializer.Deserialize` that avoids an unimplemented Mono framework feature.

`CQ.DefaultDocType` has been marked as obsolete and will be removed in a future version. Use `Config.DocType` instead.

`CQ.DefaultDomRenderingOptions` has been marked as obsolete and will be removed in a future version. Use `Config.DomRenderingOptions` instead.

There are other changes in the complete change log; however, many of them relate to the deprecated parser and are no longer relevant.

Thanks To The Community

This is a big project, and the new parser is a huge step forward. I think you'll find this release fast, stable, flexible, and standards-compliant. I owe a debt to the people who suffered through the development and beta releases over the last couple of months; without their patience and feedback, this would not have been possible. A bug report is a gift! So thanks to all the givers. The following is a list of everyone who's contributed code or bug reports recently. (If I missed anyone, it wasn't intentional!) Thanks, and please keep it coming.

Thursday, September 13, 2012

I've just about finished my transition from Visual Studio 2010 to Visual Studio 2012. While this has probably been the easiest of any VS update I can remember, it wasn't without a few painful moments. Here's a summary of the annoyances and the solutions I found.

Uppercase Menus

Why, Microsoft, why? I don't want my menus to shout at me. It just looks so... 1992. Luckily, the fix is a piece of cake: it just requires adding a registry key:
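The value in question is the one that circulated at the time for suppressing the all-caps menus; as always with registry edits, proceed with care and restart Visual Studio afterward:

```
reg add HKCU\Software\Microsoft\VisualStudio\11.0\General /v SuppressUppercaseConversion /t REG_DWORD /d 1
```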

Impenetrable color themes

Metro has its moments, but eliminating any visual distinction between windows, boundaries, and areas is not one of them. Neither of the two themes that come with VS2012 was especially workable for me.

Visual Studio 2012 Color Theme Editor to the rescue. The "blue" theme that comes packaged with this painless extension is comfortingly familiar to those used to VS2010's default scheme. Hooray! I can find the edge of a window again.

Ultrafind (and other non-updated VS2010 extensions)

Did you use Ultrafind with VS2010? If not, I feel sorry for you. If so, you probably miss it now, since it hasn't been updated for VS2012.

Not content to wait for an update, I threw caution to the wind and figured I'd see what happens if I just shoehorned it into VS2012. What do you know-- it works. Here's how to get your VS2010 extensions running in VS2012. Warning: I know nothing about what, if any, differences there may be in the extension model from VS2010 to VS2012. This works for me. It's absolutely not guaranteed to work for you or for all extensions, but there's not likely much harm you can do.

1. Locate your VS2010 user extensions folder.

Start by opening up C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\devenv.pkgdef which shows you the locations from where extensions are loaded. Anything you've installed will likely be in "UserExtensionsFolder":

2. Copy them.

Within this folder should be a subfolder for each extension you've installed. Copy just the folders for the extensions you want to migrate from here to the same folder for VS2012: the same path, but with "11.0" instead of "10.0". For Ultrafind, the folder is "Logan Mueller [MSFT]".

3. Clear cache.

There are two ".cache" files in the Extensions folder. Just delete them. This step might not be needed; I tried this a couple of times with and without it. If you don't do it, VS seems to get confused about which extensions are enabled. If you do, you may need to re-enable other extensions you have installed.

4. Enable.

You should now be able to restart VS2012 and see your extension in the extension manager. Cross your fingers and click Enable.

Build Version Increment (and other add-ins)

You can use a similar technique for add-ins. It's even easier.
The one I really care about is Build Version Increment, which seems even less likely than Ultrafind to get an update any time soon (since it was barely updated for VS2010!).

1. Find the add-ins folder.

Go to Tools->Options->Add In Security from within VS to find the add-in search path. (I happen to keep mine in dropbox so they stay in sync across several machines). If you've never touched this, your add-ins are probably located in %VSMYDOCUMENTS%\Addins, which is here:

C:\Users\{username}\Documents\Visual Studio 2010\Addins

I have no idea why it's in a completely different place than extensions. Never question Microsoft logic.

2. Copy files

As before, just copy the files for the add-ins you want to migrate into the same folder for VS2012. An add-in could be just a single file called "something.addin"; for BuildVersionIncrement there's also a DLL.

3. Update version.

Just change that "10.0" to "11.0" and save. That's all. Restart Visual Studio. If the add-in isn't immediately available, go to Tools->Add-In Manager; it should be listed, and you can enable it there.
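For reference, a .addin file is XML, and the version lives under the HostApplication element (the fragment below shows just the relevant piece):

```xml
<HostApplication>
    <Name>Microsoft Visual Studio</Name>
    <!-- was 10.0; change to 11.0 so VS2012 will load the add-in -->
    <Version>11.0</Version>
</HostApplication>
```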

... and why it matters

Sure enough, it's got the scarlet-letter "Deprecated" tag. What the...? These jQuery pseudo-selectors are probably the first thing I ever learned about using jQuery. This seems to be a confusing move, at best.

Most of these jQuery extension selectors are easily replaced with longform CSS. Indeed, this is the rationale presented with the original request: they're redundant. For example, :checkbox is literally the same as input[type=checkbox]. While I've always liked the terseness of the jQuery aliases, I could live without them.

The problem is specifically with the :text selector. The CSS version input[type=text] does not work the same as the jQuery :text selector: when there's no type attribute, :text will still select the input, but the CSS version will not, because CSS works only against attributes actually present in the markup. This matters for "text" inputs because "text" is the default value. It's perfectly legal, valid, and even encouraged by some (because it's terse) to omit the "type" attribute for the ubiquitous text input. The simplest possible text input is just <input />.
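The behavior difference can be captured in a few lines. Here's a plain-JavaScript sketch of the two matching rules; the element objects are simplified stand-ins for DOM nodes, not real DOM code:

```javascript
// jQuery's :text matches an <input> whose type is "text" OR missing;
// the CSS selector input[type=text] requires the attribute to exist.
function matchesJqueryText(el) {
    var type = el.getAttribute('type');
    return el.tagName === 'INPUT' &&
        (type == null || type.toLowerCase() === 'text');
}

function matchesCssText(el) {
    var type = el.getAttribute('type');
    return el.tagName === 'INPUT' &&
        type != null && type.toLowerCase() === 'text';
}

// A minimal stand-in for a DOM input element.
function fakeInput(type) {
    return {
        tagName: 'INPUT',
        getAttribute: function (name) {
            return name === 'type' && type !== undefined ? type : null;
        }
    };
}
```

The closest pure-CSS equivalent to :text is the clunkier `input[type=text], input:not([type])`.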

Okay, this is not the end of the world. "Deprecated" is a lot different from "removed." jQuery contains features that were deprecated years ago, and it's not especially likely that these will be removed any time soon. But most people are uncomfortable writing new code that uses features they know are slated for removal. So, starting with jQuery 1.8, you have to either always include a "type" attribute, even though it's not required, or use a deprecated feature to select all "text" inputs.

So, this post is mostly an observation. If at some point in the future :text stopped working, you could always use a simple plugin to replace it. No big deal. But it's certainly a curious feature to remove. It's at the core of jQuery's original purpose: making it easy to work with HTML; filling the void left by the DOM and CSS. The :text filter clearly fills such a void; this change undoes something useful.

Wednesday, August 8, 2012

CsQuery 1.2 has been released. You can get it from NuGet or from the source repository on GitHub.

This release does not add any significant new features, but it coincides with the first formal release of the CsQuery.Mvc framework. This framework simplifies integrating CsQuery into an MVC project by letting you intercept the HTML output of a view before it's rendered and inspect or alter it with CsQuery. It also adds an HtmlHelper method for CsQuery so you can create HTML directly in Razor views. It's on NuGet as CsQuery.Mvc.

Breaking Change

Though this change is unlikely to affect many people, it is a significant change to the public API for DOM element creation. Any code which creates DOM elements using "new" such as:

IDomElement obj = new DomElement("div");

will not compile, and should be replaced with:

IDomElement obj = DomElement.Create("div");

This was necessary to support a derived object model for complex HTML element implementations, to better model the browser DOM. Previously, any element-type-specific functionality was handled conditionally. This was OK when the DOM model was mostly there to support a jQuery port, but as I have worked to create a more accurate representation of the browser DOM itself, it became clear this was not sustainable going forward.

In the new model, some DOM element types will be implemented using classes that derive from DomElement. This means that creating a new element must be done from a factory so that element types which have more specific implementations will be instances of their unique derived type.

Any code that used CQ.Create or Document.CreateElement will be unaffected: this will only be a problem if you had been creating concrete DomElement instances using new.

Bug Fixes

CsQuery.Mvc

As usual I'm behind on documentation, but the usage of CsQuery.Mvc is simple, and there's an example MVC3 project in the GitHub repo.

The CsQuery MVC framework lets you directly access the HTML output from an MVC view. It adds a property Doc to the controller and methods Cq_ActionName that run concurrently with action invocations, letting you manipulate the HTML via CsQuery before it's rendered. There's basic documentation in the readme and there's also an example MVC application showing how to use it. You can also take a look at the CsQuery.Mvc.Tests project which is, itself, an MVC application.

Using the CsQuery HTML helper requires adding a reference to CsQuery.Mvc in Views/web.config as usual for any HtmlHelper extension methods:

Tuesday, July 10, 2012

CsQuery 1.1.3 has been released. You can get it from NuGet or from the source repository on GitHub.

New features

This release adds an API for extending the selector engine with custom pseudo-class selectors. In jQuery, you can do this with code like James Padolsey's :regex extension. In C#, we can do a little better, since we have classes and interfaces to make our lives easier. To that end, in the CsQuery.Engine namespace, you can now find:

The two different incarnations of the base IPseudoSelector interface represent two different types of pseudoclass selectors, which jQuery calls basic filters and child filters. Technically there are also content filters but these work the same way as "basic filters" in practice.

If you are only testing characteristics of the element itself, use a filter-type selector. If an element's inclusion in a set depends on its children (such as :contents, which tests text-node children) or on its position relative to its siblings (such as :nth-child), then you should probably use a child-type selector. In many cases you could do it either way. For example, :nth-child could be implemented by looking at each element's ElementIndex property and figuring out whether it's a match, but it's much more efficient to start from the parent and handpick each child at the right position.

The basic API

To create a new filter, implement one of the two interfaces. They both share IPseudoSelector:

In both cases, you should set the min/max values to the number of parameters you want your filter to accept (the default is 0). "Name" should be the name of this filter as it will be used in a selector. Then choose the one that works best for your filter:

public abstract class PseudoSelectorChild : PseudoSelector, IPseudoSelectorChild
{
    /// <summary>
    /// Test whether an element matches this selector.
    /// </summary>
    ///
    /// <param name="element">
    /// The element to test.
    /// </param>
    ///
    /// <returns>
    /// true if it matches, false if not.
    /// </returns>

    public abstract bool Matches(IDomObject element);

    /// <summary>
    /// Basic implementation of ChildMatches, which runs the Matches method
    /// against each child. This should be overridden with something
    /// more efficient if possible. For example, selectors that inspect
    /// the element's index could get their results more easily by
    /// picking the correct results from the list of children rather
    /// than testing each one.
    ///
    /// Also note that the default iterator for ChildMatches only
    /// passes element (e.g. non-text node) children. If you want
    /// to design a filter that works on other node types, you should
    /// override this to access all children instead of just the elements.
    /// </summary>
    ///
    /// <param name="element">
    /// The parent element.
    /// </param>
    ///
    /// <returns>
    /// A sequence of children that match.
    /// </returns>

    public virtual IEnumerable<IDomObject> ChildMatches(IDomContainer element)
    {
        return element.ChildElements.Where(item => Matches(item));
    }
}

If you implement one of the abstract classes, you get most of the functionality pre-rolled:

Name is the un-camel-cased name of the class itself; e.g. class MySpecialSelector would become the selector :my-special-selector.

MinimumParameterCount and MaximumParameterCount are 0, meaning no parenthesized parameters.

Arguments is parsed into a protected string[] Parameters property (using comma as a separator), guided by the min/max values. Additionally, you can override QuotingRule ParameterQuoted(int index) and return a value telling the class how to parse each parameter. The index refers to the zero-based position of the parameter, and QuotingRule is an enum indicating how quoting should be handled for the parameter at that position: NeverQuoted, AlwaysQuoted, or OptionallyQuoted. NeverQuoted means single and double quotes are treated as regular characters; AlwaysQuoted means single- or double-quote bounds are required; OptionallyQuoted means quotes, if present, are treated as bounding quotes but are not required.

The PseudoSelectorChild class implements ChildMatches by simply passing each element child to the Matches function. If you want to test other types of children (like text nodes) or have a smarter way to choose matching children, then override it.
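For instance, a minimal child-type filter built on the abstract class above might look like the following. The selector itself is hypothetical, and only Matches is overridden, so the default ChildMatches drives it:

```csharp
using CsQuery;
using CsQuery.Engine;

// A hypothetical ":has-data-id" filter matching child elements that
// carry a data-id attribute. Per the un-camel-casing convention, the
// class name HasDataId yields the selector name :has-data-id, used
// like: doc["ul > li:has-data-id"].
public class HasDataId : PseudoSelectorChild
{
    public override bool Matches(IDomObject element)
    {
        return element.HasAttribute("data-id");
    }
}
```

Place the class in a namespace called CsQuery.Extensions so it's detected automatically, as described below.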

Adding Your Selector to CsQuery

Here's the cool part: to add your selector to CsQuery, you don't need to do anything. If you include it in a namespace called CsQuery.Extensions, it will automatically be detected. This works as long as the extension can be found in the assembly that first invokes a selector when the application starts. If for some reason that might not be the case, you can force CsQuery to register the extensions explicitly by calling, from the assembly in which they're found:

CsQuery.Config.PseudoClassFilters.Register();

You can also pass an Assembly object to that method. Finally, you can register a filter type explicitly:

This is actually a relatively complicated pseudo-selector. To see some simpler examples, just look at the source code for the CsQuery CSS selector engine. Most of the native selectors have been implemented using this API. The exceptions are pseudo-selectors that match only on indexed characteristics, e.g. all the tag and type selectors such as :input and :checkbox. These could have been set up the same way, but they wouldn't be able to take advantage of the index if they were implemented as filters.

Speaking Of Which... Selector Performance

Many of the same rules about selector performance apply here as they do in jQuery. Don't do this:

var sel = doc[":some-filter"].Filter("div");

Do this:

var sel = doc["div:some-filter"];

Obviously that's a pretty silly example - most people wouldn't go out of their way to do the first. But generally speaking, you should order your selectors this way:

ID, tag, and class selectors first;

attribute selectors next;

filters last

Unlike jQuery, it doesn't matter whether a filter or selector is "native" to CSS or not - everything is native in CsQuery. What matters is whether it's indexed. All attribute names (but not values), node names (tags), classes, and ID values are indexed. It doesn't matter if you combine selectors -- the index can still be used as long as you're selecting on one of those things. But you should try to organize your selectors to choose the most specific indexed criteria first.

It's very fast for CsQuery to pull records from the index. So if you are targeting an ID, which is unique, always use that first. Classes are probably the next best, followed by tag names, and attributes last. Nodes with a certain attribute will be identified in the index just as fast as anything else, but the engine still has to check each such node's attribute value against your selection criteria.

Wednesday, June 27, 2012

I put together some performance tests to compare CsQuery to the only practical alternative that I know of (Fizzler, an HtmlAgilityPack extension). I tested against three different documents:

The sizzle test document (about 11 k)

The wikipedia entry for "cheese" (about 170 k)

The single-page HTML 5 spec (about 6 megabytes)

The overall results are:

HAP is faster at loading the string of HTML into an object model. This makes sense, since I don't think Fizzler builds an index (or perhaps it builds only a relatively simple one). CsQuery takes anywhere from 1.1 to 2.6x longer to load the document. More on this below.

CsQuery is faster for almost everything else. Sometimes by factors of 10,000 or more. The one exception is the "*" selector, where sometimes Fizzler is faster. For all tests, the results are completely enumerated; this case just results in every node in the tree being enumerated. So this doesn't test the selection engine so much as the data structure.

CsQuery did a better job of returning the same results as a browser. Each of the selectors here was verified against the same document in Chrome using jQuery 1.7.2, and CsQuery's numbers match the browser's. The discrepancies in Fizzler's results are probably because HtmlAgilityPack handles optional (missing) tags differently. Additionally, nth-child is not implemented completely in Fizzler; it only supports simple values (not formulae).

The most dramatic results come from running a selector for a single ID, or a nonexistent ID, in a large document. CsQuery returns the result (an empty set) over 100,000 times faster than Fizzler. This is almost certainly because Fizzler doesn't index IDs; its other selectors are much faster than this (though still substantially slower than CsQuery).

Size Matters

In the very small document (the 11k sizzle test document), CsQuery still beats Fizzler, but by much less. The ID selector is still substantially faster, at about 15x. For more complex selectors, the margin is just over 1x to 3x faster.

On the other hand, in very large documents, the edge that Fizzler has in loading the documents seems to mostly disappear. CsQuery is only about 10% slower at loading the 6 megabyte "large" document. This could be an opportunity for optimizing CsQuery - this seems to indicate that overhead just in creating a single document is dragging performance down. Or, it could be indicative of the makeup of the respective test documents. Maybe CsQuery does better with more elements, and Fizzler with more text - or vice versa.

You can see a detailed comparison of all the tests so far here in a google doc:

"FasterRatio" is how much faster the winner was than the loser. Yellow ones are CsQuery; red ones are Fizzler.

Red in the "Same" column means the two engines returned different results.

Try It Out

This output can be created directly from the CsQuery test project under "Performance."