Introduction

Nowadays everything is centered around the web. We are constantly downloading or uploading data, and our applications are getting chattier, since users want to synchronize their data. At the same time the update process is becoming more direct, as volatile parts of our applications are placed in the cloud.

The whole web movement does not only involve binary data. The movement is mainly driven by HTML, since most of the output is finally rendered to the description code known as HTML. This description language is decorated with two nice additions: styling in the form of CSS3 (yet another description language) and scripting in the form of JavaScript (officially specified as ECMAScript).

This unstoppable trend has been gaining momentum for more than a decade. Nowadays a well-designed webpage is a cornerstone of every company. The good thing is that HTML is fairly simple, and even people without any programming knowledge can create a page. In the simplest approximation we just insert some text in a file and open it in the browser (probably after renaming the file to *.html).

To make a long story short: even in our applications we sometimes need to communicate with a webserver that delivers HTML. This part is quite fine and solved by the framework; we have powerful classes that handle the whole communication, including the required TCP and HTTP actions. However, once we need to do some work on the document itself we are basically lost. This is where AngleSharp comes into play.

Background

The idea for AngleSharp was born about a year ago (I will explain in the next paragraphs why AngleSharp goes beyond HtmlAgilityPack or similar solutions). The main reason to use AngleSharp is to have access to the DOM as you would have it in the browser, the only difference being that you use C# (or any other .NET language). There is another difference (by design): the names of the properties and methods changed from camelCase to PascalCase (i.e. the first letter is capitalized).

Therefore, before we go into the details of the implementation, we need to have a look at the long-term goals of AngleSharp. There are actually several:

Parsers for HTML, XML, SVG, MathML and CSS

Create CSS stylesheets / style rules

Return the DOM of a document

Run modifications on the DOM

*Provide the basis for a possible renderer

The core parser is certainly given by the HTML5 parser. A CSS4 parser is a natural addition, since HTML documents contain stylesheet references and style attributes. Having another XML parser that complies with the current W3C specification is a good addition as well, since SVG and MathML (which can also occur in HTML documents) can be parsed as XML documents. The only difference lies in the generated document, which has different semantics and uses a different DOM.

The * point is quite interesting. The basic idea is not that a browser written entirely in C# will be built upon AngleSharp (although that could happen). The motivation lies in creating a new, cross-platform UI framework, which uses HTML as a description language with CSS for styling. Of course this is a quite ambitious goal, and it will certainly not be solved by this library alone; however, this library would play an important role in the creation of such a framework.

In the next couple of sections we will walk through some of the important steps in creating an HTML5 and a CSS4 parser.

HTML5 parser

Writing an HTML5 parser is much harder than most people think, since the HTML5 parser has to handle a lot more than just those angle brackets. The main issue is that a lot of edge cases arise with ill-defined documents / document fragments. Formatting is also not as easy as it seems, since some tags have to be treated differently than others.

All in all it would be possible to write such a parser without the official specification; however, one either has to know all edge cases (and manage to bring them into code or on paper) or the parser will simply work on only a fraction of all webpages.

Here the specification helps a lot and gives us the whole range of possible states with every mutation that is possible. The main workflow is quite simple: We start with a Stream, which could either be directly from a file on the local machine, the data from the network or an already given string. This Stream is given to the preprocessor, which will control the flow of reading from the Stream and buffering already read contents.

Finally we are ready to hand some data to the tokenizer, which transforms the data from the preprocessor into a sequence of useful objects. These temporary objects are then used to construct the DOM. The tree construction might have to switch the state of the tokenizer on several occasions.

The following image shows the general scheme that is used for parsing HTML documents.

In the following sections we will walk through the most important parts of the HTML5 parser implementation.

Tokenization

Having a working stream preprocessor is the basis for any tokenization process, and the tokenization process in turn is the basis for the tree construction, as we will see in the next section. What does the tokenization process do exactly? It transforms the characters that have been processed by the input stream preprocessor into so-called tokens. Those tokens are objects, which are then used to construct the tree that will become the DOM. In HTML there are not many different tokens. In fact we just have a few:

Tag (with the name, an open/close flag, the tag's attributes and a self-closed flag)

Doctype (with additional properties)

Character (with the character payload)

Comment (with the text payload)

EOF

The state-machine of the tokenizer is actually quite complicated, since there are many (legacy) rules that have to be respected. Also some of the states cannot be entered from the tokenizer alone. This is also kind of special in HTML as compared to most parsers. Therefore the tokenizer has to be open for changes, which will usually be initiated by the tree constructor.

The most used token is the character token. Since we might need to distinguish between single characters (for instance, when we enter a <pre> element an initial line feed character has to be ignored) we have to return single character tokens. The initial tokenizer state is the PCData state. The method is as simple as the following:
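In a simplified form the state method can look like the following sketch (the names here are illustrative, not the exact AngleSharp signatures):

HtmlToken Data(Char c)
{
    switch (c)
    {
        case '&':
            // character references are resolved and emitted as character tokens
            return CharacterReference(_src.Next());
        case '<':
            // a tag is starting, so we continue in the tag open state
            return TagOpen(_src.Next());
        case Char.MinValue:
            // the preprocessor signals the end of the stream
            return HtmlToken.EOF;
        default:
            // everything else becomes a single character token
            return HtmlToken.Character(c);
    }
}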

There are some states which cannot be reached from the PCData state. For instance, the Plaintext or RCData states can never be entered by the tokenizer alone. Additionally, the Plaintext state can never be left. The RCData state is entered when the HTML tree construction detects e.g. a <title> or a <textarea> element. On the other side we also have a Rawtext state that can be invoked by e.g. a <noscript> element. We can already see that the number of states and rules is much bigger than we might initially think.

A quite important helper for the tokenizer (and other tokenizers / parsers in the library) is the SourceManager class. This class handles an incoming stream of (character) data. The definition is shown in the following image.

This helper is more or less a Stream handler, since it takes a Stream instance and reads it with a detected encoding. It is also possible to change the encoding during the reading process. In the future this class might change, since until now it is based on the TextReader class to read text with a given Encoding from a Stream. It might be better to handle this with a custom class that supports reading backwards with a different encoding out of the box.
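Stripped down to its essentials, such a helper could look like this sketch (a simplification under the assumptions above, not the actual class definition):

using System;
using System.IO;
using System.Text;

sealed class SourceManager
{
    readonly StreamReader _reader;
    readonly StringBuilder _buffer = new StringBuilder();
    Int32 _pos;

    public SourceManager(Stream stream, Encoding encoding)
    {
        // currently based on a TextReader with a given encoding
        _reader = new StreamReader(stream, encoding);
    }

    // Advances by one character; Char.MinValue signals the end of the stream.
    public Char Next()
    {
        if (_pos == _buffer.Length)
        {
            var current = _reader.Read();

            if (current == -1)
                return Char.MinValue;

            // already read content is buffered, so we can step backwards
            _buffer.Append((Char)current);
        }

        return _buffer[_pos++];
    }

    // Steps back one character, which the buffer makes possible.
    public void Back()
    {
        if (_pos > 0)
            _pos--;
    }
}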

Tree construction

Once we have a working stream of tokens we can start constructing the DOM. There are several lists we need to take care of:

Currently open tags.

Active formatting elements.

Special flags.

The first list is quite obvious. Since we will open tags, which will include other tags, we need to memorize what kind of path we've taken along the road. The second one is not so obvious. It could be that currently open elements have some kind of formatting effect on inserted elements. Such elements are considered formatting elements. A good example is the <b> tag (bold). Once it is applied, all* contained elements will have bold text. There are some exceptions (*), but this is what makes HTML5 non-trivial.

The third list is actually very non-trivial and impossible to reconstruct without the official specification. There are special cases for some elements in some scenarios. This is why the HTML5 parser distinguishes between <body>, <table>, <select> and several other sections. This differentiation is also required to determine if certain elements have to be auto-inserted. For instance the following snippet is automatically transformed:

<pre contenteditable>

The HTML parser does not recognize the <pre> tag as being a legal tag before the <html> or <body> tags. Thus a fallback is initiated, which first inserts the <html> tag and afterwards the <body> tag. Inserting the <body> tag directly within the <html> tag also creates an (empty) <head> element. Finally, at the end of the file everything is closed, which implies that our <pre> node is also closed, as it should be.

<html><head></head><body><pre contenteditable=""></pre></body></html>

There are hard edge cases, which are quite suitable to test the state of the tree constructor. The following is a good test for finding out if the "Heisenberg algorithm" is working correctly and invoked in case of non-conforming usage of tables and anchor tags. The invocation should take place on inserting another anchor element.

<a href="a">a<table><a href="b">b</table>x

The resulting HTML DOM tree is given by the following snippet (without <html>, <body> etc. tags):

<a href="a">
a
<a href="b">b</a><table></table></a><a href="b">x</a>

Here we see that the character b is taken out of the <table>. The hyperlink therefore has to start before the table and continue afterwards. This results in a duplication of the anchor tag. All in all these transformations are non-trivial.

Tables are responsible for some edge cases. Most of the edge cases are due to text having no cell environment within a table. The following example demonstrates this:

A<table>B<tr>C</tr>D</table>

Here we have some text that does not have a <td> or <th> parent. The result is the following:

ABCD
<table><tbody><tr/></tbody></table>

The whole text is moved before the actual <table> element. Additionally, since we have a <tr> element being defined, but neither <tbody> nor <thead> nor <tfoot>, a <tbody> section is inserted.

Of course there is more than meets the eye. A big part of valid HTML5 parsing goes into error correction and constructing tables. The formatting elements also have to fulfill some rules. Everyone who is interested in the details should take a look at the code. Even though the code might not be as readable as usual LOB application code, it should still be possible to read it with the appropriate comments and the inserted regions.

Tests

A very important point was to integrate unit tests. Due to the complicated parser design most of the work has not been dictated by the paradigm of TDD; however, in some parts tests have been placed before any line of code was written. All in all it was important to have a wide range of unit tests in place. The tree constructor of the HTML parser is one of the primary targets of the testing library.

Also the DOM objects have been subject to unit tests. The main objective here was to ensure that these objects are working as expected. This means that errors are only thrown on the defined illegal operations and that integrated binding capabilities are functional. Such errors should never occur during the parsing process, since the tree constructor is expected to never try an illegal operation.

Another testing environment has been set up with the AzureWebState project, which aims to crawl webpages from a database. This makes it easy to spot a severe problem with the parser (like StackOverflowException or OutOfMemoryException) or potential performance issues.

Reliability tests are not the only kind of tests we are interested in. If we need to wait too long for the parsing result we might be in trouble. Modern web browsers require between 1ms and 100ms for parsing typical webpages, hence everything that goes beyond 100ms has to be optimized. Luckily we have some great tools. Visual Studio 2012 provides a great tool for analyzing performance; however, in some scenarios PerfView seems to be the best choice for me (it works across the whole machine and is independent of VS).

A quick look at the memory consumption gives us some indication that we might want to do something about allocating all those HtmlCharacterToken instances. Here a pool for character tokens could be very beneficial. However, a first test showed that the impact on performance (in terms of processing speed) is negligible.
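To illustrate the pooling idea (this is only a sketch, not code from the library; the pool and the mutable Data property are assumptions):

using System;
using System.Collections.Generic;

sealed class CharacterTokenPool
{
    readonly Stack<HtmlCharacterToken> _free = new Stack<HtmlCharacterToken>();

    // Re-uses a pooled token instead of allocating a new one per character.
    public HtmlCharacterToken Rent(Char data)
    {
        var token = _free.Count > 0 ? _free.Pop() : new HtmlCharacterToken();
        token.Data = data;
        return token;
    }

    // Tokens are returned once the tree constructor is done with them.
    public void Return(HtmlCharacterToken token)
    {
        _free.Push(token);
    }
}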

CSS4 parser

There are already some CSS parsers out there, some of them written in C#. However, most of them only do really simple parsing, without evaluating selectors, and ignore the specific meaning of a certain property or value. Also, most of them are way below CSS3 or do not support any @-rules (like namespace, import, ...) at all.

Since HTML is using CSS as its layout / styling language it was quite natural to integrate CSS directly. There are several places where this has been proven to be very useful:

Selectors are required for methods like QuerySelector.

Every element can have a style attribute, which has a non-string DOM representation.

The stylesheets are considered by the DOM directly.

The <style> element has a special meaning for the HTML parser.

At the moment external stylesheets are not parsed directly. The reason is quite simple: AngleSharp should require the least amount of external references. In the ideal case AngleSharp should be easy to port (or even exist) as a portable class library (where the intersection would be between "Metro", "Windows Phone" and "WPF"). This might not be possible at the moment, due to the use of TaskCompletionSource at certain points, but this is actually the reason why the whole library is not decorated with Task instances or even await and async keywords all over the place.

Tokenization

The CSS tokenizer is not as complicated as the HTML one; what makes it somewhat complex is that it has to handle a lot more types of tokens. In the CSS tokenizer we have, among others:

String (either in single or double quotes)

Url (a string in the url() function)

Hash (mostly for selectors like #abc or similar, usually not for colors)

The CSS tokenizer is a simple stream-based tokenizer, which returns an iterator of tokens. This iterator can then be used; every method in the CssParser class takes such an iterator. The great advantage of using iterators is that we can basically use any token source. For instance, we could use another method to generate a second iterator based on the first one. This method would only iterate over a subset (like the contents of some curly brackets). The great advantage is that both streams advance together, yet we do not have to implement very complicated token management.
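As an illustration, such a linked iterator (like the LimitToCurrentBlock helper that appears again below) could be sketched as follows; the token type names are assumptions:

static IEnumerable<CssToken> LimitToCurrentBlock(IEnumerator<CssToken> source)
{
    var open = 1; // the opening curly bracket has already been consumed

    while (source.MoveNext())
    {
        if (source.Current.Type == CssTokenType.CurlyBracketOpen)
            open++;
        else if (source.Current.Type == CssTokenType.CurlyBracketClose && --open == 0)
            yield break; // the block ends here, but the source stream lives on

        yield return source.Current;
    }
}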

In the top-level parsing loop we just ignore some tokens. In the special case of an at-keyword we start a new @-rule; otherwise we assume that a style rule has to be created. Style rules start with a selector, as we know. A valid selector places more constraints on the possible input tokens, but in general any token is accepted as input.

Quite often we want to skip any whitespace between the current position and the next position. The following snippet allows us to do that:
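A sketch of such a helper (the real method in the CssParser may differ in detail):

static Boolean SkipToNextNonWhitespace(IEnumerator<CssToken> source)
{
    while (source.MoveNext())
    {
        // stop at the first token that is not whitespace
        if (source.Current.Type != CssTokenType.Whitespace)
            return true;
    }

    return false; // the end of the token stream has been reached
}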

Additionally we also get the information if we reached the end of the token stream.

Stylesheet creation

The stylesheet is then created with all the information. Right now special rules like the CSSNamespaceRule or CSSImportRule are parsed correctly but ignored afterwards. This has to be integrated at some point in the future.

Additionally we only get a very generic (and meaningless) property called CSSProperty. In the future the generic property will only be used for unknown (or obsolete) declarations, while more specialized properties will be used for meaningful declarations like color: #f00 or font-size: 10pt. This will then also influence the parsing of values, which must take the required input type into consideration.

Another point is that CSS functions (besides url()) are not included yet. However, these are quite important, since the toggle(), calc() and attr() functions are being used more and more these days. Additionally rgb() + rgba() and hsl() + hsla() and others are mandatory.

Once we hit an at-rule we basically need to parse special cases for special rules. The following code snippet describes this:
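Roughly, the dispatch can be pictured like this (the rule names come from the CSS specification, while the Create* method names are assumptions about the parser internals):

CSSRule CreateAtRule(IEnumerator<CssToken> source, String name)
{
    switch (name)
    {
        case "media":     return CreateMediaRule(source);
        case "font-face": return CreateFontFaceRule(source);
        case "import":    return CreateImportRule(source);
        case "namespace": return CreateNamespaceRule(source);
        case "keyframes": return CreateKeyframesRule(source);
        default:          return CreateUnknownRule(source, name);
    }
}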

Let's see how the parsing for the CSSFontFaceRule is implemented. Here we see that we push the font-face rule onto the stack of open rules for the duration of the process. This ensures that every rule gets the right parent rule assigned.
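In a condensed form the method can be sketched like this (the _open stack field and the exact signatures are assumptions):

CSSFontFaceRule CreateFontFaceRule(IEnumerator<CssToken> source)
{
    var fontface = new CSSFontFaceRule();
    // push the rule for the duration of the process, so that everything
    // created inside gets the right parent assigned
    _open.Push(fontface);
    AppendDeclarations(LimitToCurrentBlock(source).GetEnumerator(), fontface.Style);
    _open.Pop();
    return fontface;
}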

Additionally we use the LimitToCurrentBlock method to stay within the current curly brackets. Another thing is that we re-use the AppendDeclarations method to append declarations to the given font-face rule. This is no general rule, since e.g. a media rule will contain other rules instead of declarations.

Tests

A very important testing class is represented by CSS selectors. Since these selectors are used on many occasions (in CSS, for querying the document, ...) it was very important to include a set of useful unit tests. Luckily the guys who maintain the Sizzle selector engine (which is primarily used in jQuery) have already solved this problem.

So we compare known results with the results of our evaluation. Additionally we also care about the order of the results, which verifies that the tree walker is doing the right thing.

DOM implementation

The whole project would be quite useless without returning an object representation of the given HTML source code. Obviously we have two options:

Defining our own format / objects

Using the official specification

Due to the project's goal the decision was quite obvious: The created objects should have a public API that is identical / very similar to the official specification. Users of AngleSharp will therefore have several advantages:

The learning curve is non-existent for people who are familiar with the DOM

Porting code from C# to JavaScript is simplified even further

Users who are not familiar with the HTML DOM will also learn something about the HTML DOM

Other users will probably learn something as well, since everything can be accessed by intellisense

The last point is quite important here. A huge effort of the project went into (at least beginning to do a little bit of) writing something that represents a suitable documentation of the whole API and its functions. Therefore enumerations, properties and methods, along with classes and events, are documented. This means that a variety of learning possibilities is available.

Additionally, all DOM objects are decorated with a special kind of attribute, called DOMAttribute or simply DOM. This attribute can help to find out which objects (in addition to the most common types like String or Int32) can be used in a scripting language like JavaScript. In a way this integrates the IDL that is used in modern browsers.

The attribute also decorates properties and methods. A special kind of property is the indexer. Most indexers are named item by the W3C; however, since JavaScript is a language that supports indexed access, we don't see this name very often. Nevertheless, even here the decoration has been placed, which lets us choose how to use it.

The basic DOM structure is displayed in the next figure.

It was quite difficult to find a truly complete reference. Even though the W3C creates the official standard, it is often in contradiction with itself. The problem is that the current specification is DOM4. If we take a look into any browser we will see that either not all elements from it are available, or that additional other elements are available. Using DOM3 as a reference point therefore makes more sense.

AngleSharp tries to find the right balance. The library contains most of the new API (even though not everything is implemented right now, e.g. the whole event system or the mutation objects), but also contains everything from DOM3 (or previous versions) that has been implemented and used across all major browsers.

Performance

The whole project has to be designed with performance in mind; however, this means that sometimes not very beautiful code can be found. Also, everything has been programmed as close as possible to the specification, which has been the primary goal. The first objective was to apply the specification and create something that works. After this had been achieved, some performance optimizations were applied. In the end we can see that the whole parser is actually quite fast compared to the ones known from the big browsers.

A big performance issue is the actual startup time. Here the JIT process is not only compiling the MSIL code to machine code, but also performing (necessary) optimizations. If we start some sample runs we can immediately see that the hot paths are not optimized at all. The next screenshot shows a typical run.

However, the JIT also does a great job at these optimizations. The code has been written in such a fashion that inlining and other crucial (and mostly trivial) optimizations are more likely to be performed by the JIT. A quite important speed test has been adopted from the JavaScript world: Slickspeed. This test shows us a whole range of data:

The performance of our CSS tokenizer.

The performance of our Selector creator.

The performance of our tree walker.

The reliability of our CSS Selectors.

The reliability of our node tree.

On the same machine the fastest implementation in JavaScript makes heavy use of document.querySelectorAll. Hence our test is nearly a direct comparison against a browser (in this case Opera). The fastest implementation takes about 12ms in JavaScript. In C# we are able to produce the same result in 3ms (same machine, release mode on a 64-bit CPU).

Caution: this result should not convince you that C# / our implementation is faster than Opera / any browser, only that the performance is at least in a solid range. It should be noted that browsers are usually much more streamlined and probably faster; however, the performance of AngleSharp is quite acceptable.

The next screenshot has been taken while running the Slickspeed test. Please note that the real time aggregate is higher than 3ms, since the times have been added as integers to ensure an easy comparison with the original JavaScript Slickspeed benchmark.

In total we can say that the performance is already quite OK, even though no major efforts have been put into performance optimization. For documents of modest size we will certainly stay far below 100ms and eventually (with enough warm-up, small document size and CPU speed) come close to 1ms.

Using the code

The easiest way to get AngleSharp is by using NuGet. The link to the NuGet package is at the end of the article (or just search for AngleSharp in the NuGet package manager official feed).

The solution that is available on the GitHub repository also contains a WPF application called Samples. This application looks like the following image:

Every sample uses the HTMLDocument instance in another way. The basic way of getting the document is quite easy:
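For instance (assuming the URL overload of the DocumentBuilder.Html entry point; DocumentBuilder is the same helper that appears later for CSS):

var document = DocumentBuilder.Html(new Uri("http://www.codeproject.com/"));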

At the moment four sample usages are described. The first is a DOM browser. The sample creates a WPF TreeView that can be navigated through. The TreeView control contains all enumerable children and DOM properties of the document. The document is the HTMLDocument instance that has been received from the given URL.

Reading out these properties can be achieved with the following code. Here we assume that element is the current object in the DOM tree (e.g. the root element of a document like the HTMLHtmlElement, or an attribute like Attr etc.).
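A sketch of this property inspection via reflection (System.Reflection; the actual view model code of the sample is more elaborate):

var properties = element.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance);

foreach (var property in properties)
{
    // indexers require arguments and are handled separately
    if (property.GetIndexParameters().Length > 0)
        continue;

    var value = property.GetValue(element, null);
    Console.WriteLine("{0} ({1}) = {2}", property.Name, property.PropertyType.Name, value);
}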

Hovering over an element that does not contain items usually yields its value (e.g. a property that represents an int value would display the current value) as a tooltip. Next to the name of the property the exact DOM type is shown. The following screenshot shows this part of the sample application.

The renderer sample might sound interesting at first, but in fact it just uses the WPF FlowDocument in a very rudimentary way. The output is actually not very readable and far away from the rendering that is done in other solutions (e.g. the HTMLRenderer project on CodePlex).

Nevertheless the sample shows how one could use the DOM to get information about various types of objects and use that information. As a little gimmick, <img> tags are rendered as well, putting at least a little bit of color into the renderer. The screenshot has been taken while being on the English version of the Wikipedia homepage.

Much more interesting is the statistics sample. Here we gather data from the given URL. There are four statistics available, which might be more or less interesting:

The gathering method shown below is first applied to the root element of the document; from there it recursively calls itself on the child elements. Later on the dictionaries can be sorted and evaluated using LINQ.
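Such a recursive gathering could look like this sketch (the sample keeps one dictionary per statistic):

void Inspect(Element element, Dictionary<String, Int32> tags)
{
    Int32 count;
    tags.TryGetValue(element.TagName, out count);
    tags[element.TagName] = count + 1;

    // continue with all child elements
    foreach (var child in element.Children)
        Inspect(child, tags);
}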

Additionally we perform some statistics on the text content in form of words. Here any word has to be at least two letters long. For this sample OxyPlot has been used to display the pie charts. Obviously CodeProject likes to use anchor tags (who doesn't?) and a class called t (in my opinion a very self-explanatory name!).

The final sample shows the usage of the DOM method querySelectorAll. Following the C# naming convention we use it as QuerySelectorAll. The list of elements is filtered as one enters the selector in the TextBox element. The background color of the box indicates the status of the query: a red box tells us that an exception would be thrown due to a syntax error in the query.

The code is quite easy. Basically we take the document instance and call the QuerySelectorAll method with a selector string (like * or body > div or similar). Everyone who is familiar with the basic DOM syntax from JavaScript will recognize it instantly:
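For example (here we simply print the results instead of filling the view model):

var elements = document.QuerySelectorAll("body > div");

foreach (var element in elements)
    Console.WriteLine(element.TagName);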

Finally we take the list of elements (QuerySelectorAll gives us an HTMLCollection, i.e. a list of Element instances, while QuerySelector only returns a single element or null) and push it to the observable collection of the viewmodel.

Update

Another demo of interest might be the handling of stylesheets. The sample application has been updated with a short demo that reads out an arbitrary webpage and shows the available stylesheets (either <link> or <style> elements).

All in all the example looks like the following (with the available sources on the left and the stylesheet tree on the right).

Again the required code is not very complicated. In order to get the available stylesheets of an HTMLDocument object we only need to iterate over the elements of its StyleSheets property. Here we will not obtain objects of type CSSStyleSheet but StyleSheet. This is a more general type, as specified by the W3C.
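For example (assuming the property names follow the W3C StyleSheet interface in PascalCase):

foreach (var sheet in document.StyleSheets)
{
    // Href is null for inline <style> elements
    Console.WriteLine(sheet.Href ?? "inline stylesheet");
}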

In the next part we actually need to create a CSSStyleSheet containing the rules. Here we have two possibilities:

The stylesheet originates from a <style> element and is inlined.

The stylesheet is associated with a <link> element and has to be loaded from an external source.

In the first case we already have access to the source. In the second case we need to download the source code of the CSS stylesheet first. Since most of the time is going to be used by receiving the source code we need to ensure that our application stays responsive.
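Both cases can be sketched as follows (DownloadAsync stands in for whatever HTTP helper is used to fetch the external source; it is not part of AngleSharp):

async Task<CSSStyleSheet> GetStyleSheetAsync(StyleSheet sheet)
{
    // inline <style> element vs. external <link> stylesheet
    var source = sheet.Href == null
        ? sheet.OwnerNode.TextContent
        : await DownloadAsync(sheet.Href);

    return DocumentBuilder.Css(source);
}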

Finally we add the new elements (nodes with sub-nodes with possible sub-nodes with possible ...) in chunks of 100 - just to stay a little bit responsive while filling the tree.

The deciding part is actually to use DocumentBuilder.Css for constructing the CSSStyleSheet object. In the CssRuleViewModel it is just a matter of distinguishing between the various rules to ensure that each one is displayed appropriately. There are three types of rules:

Additionally declarations and values have their own constructors as well, even though they are not required to be as selective as the shown constructor.

Update 2

Another interesting (yet obvious) possibility of using AngleSharp is to read out the HTML tree. As noted before, the parser takes HTML5 parsing rules into account, which means that the resulting tree could be non-trivial (due to various exceptions / tolerances that have been applied by the parser).

Nevertheless most pages try to be as valid as possible, which does not require any special parsing rules at all. The following screenshot shows how the tree sample looks:

The code for this could look less complicated than it does; however, it distinguishes between various types of nodes. This is done to hide new lines, which (as defined) end up as text nodes in the document tree. Additionally, multiple spaces are combined into one space character.

Here we just return an iterator that iterates over all nodes and returns the created TreeNodeViewModel instance, if any.
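A sketch of this iterator (TreeNodeViewModel is the sample's own type, and the exact filtering is an assumption):

IEnumerable<TreeNodeViewModel> SelectFrom(IEnumerable<Node> nodes)
{
    foreach (var node in nodes)
    {
        // skip pure whitespace text nodes (e.g. new lines between elements)
        if (node is Element || (node is TextNode && !String.IsNullOrWhiteSpace(node.TextContent)))
            yield return new TreeNodeViewModel(node);
    }
}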

Points of Interest

When I started this project I was already quite familiar with the official W3C specification and the DOM from a JavaScript programmer's point of view. However, implementing the specification did not only improve my knowledge about web development in general, but also about (possible) performance optimizations and interesting (yet widely unknown) issues.

I think that having a well-maintained DOM implementation in C# is definitely something nice to have for the future. I am currently busy doing other things, but this is a kind of project I will definitely pursue for the next couple of years.

That being said, I hope that I could gain a little bit of attention and that some folks might be interested in committing some code to the project. It would be really nice to get a nice and clean (and as perfect as possible) HTML parser implementation in C#.

References

The whole work would not have been possible without the outstanding documentation and specifications supplied by the W3C. I have to admit that some documents seem not very useful or just outdated, while others are perfectly fine and up to date. It's also very important to question some points there, since (mostly only very small) mistakes can be found as well.

This is a list of my personal most-used (W3C and WHATWG) documents:


About the Author

Florian is from Regensburg, Germany. He started his programming career with Perl. After programming C/C++ for some years he discovered his favorite programming language C#. He did work at Siemens as a programmer until he decided to study Physics. During his studies he worked as an IT consultant for various companies.

Florian is also giving lectures in C#, HTML5 with CSS3 and JavaScript, and other topics. Having graduated from University with a Master's degree in theoretical physics he is currently busy doing his PhD in the field of High Performance Computing.

If an incorrect index is queried, the indexer must throw an exception, not silently ignore the error. Returning null may cause the program to fail in some other, unrelated place later, complicating finding the true source of the error. This code should be rewritten like this:

3. CssParser, when created from a Stream argument, always closes the stream. It should offer an option to leave the stream open (as a property and/or constructor argument).

4. Is IsQuirksMode supported? I could only find assignments to the property.

5. CssParser has properties for async parsing, but essentially it just calls Task.Run. Unless the code uses true async methods (for reading streams, for example), providing "async" methods should be avoided, as it confuses users: they don't know whether they benefit from an asynchronous call or not.

Furthermore, the parser has a started field which is not protected from race conditions. If all "async" code is like this, it should be completely removed from the library, as it provides no benefits (users can call Task.Run if they actually need it).

6. Some classes are never used, like Rect and Counter. What is their purpose?

1. This is because JS (or other scripting environments) do not throw an exception either - this is intentional.

2. It's not the same. The one from a List<T> uses a struct, which cannot be used in linked iterators, since each time one passes the iterator it will be reset. Hence we use yield so the C# compiler generates a nice class for us. Problem solved!

3. This is an interesting option - I will include this option in a future release.

4. It's currently not supported - but the official specification uses it at some points (there are some points where they refer to it; however, it only applies in some parsing parts which have currently been left out).

5. Well, actually you benefit from it. Running the parser async will run it in another thread, which enables you to have a "live" DOM view during the generation. Of course this is not strictly thread-safe, since you could then do manipulations while the DOM is still being generated (hence one shouldn't do that, i.e. do live manipulation). I figured out that having the parser running in a new thread is more beneficial than just using async methods for the stream.

Last remark on this: async has just been implemented - before, this method threw a NotImplementedException. The reason why it took me so long is exactly the discussion above. However, I did a lot of tests and running it in its own thread is the most simple and effective solution.

6. Same purpose as CssRect in your code for Rect - as far as Counter is concerned: this one will be important for CSS counters. However, at the moment both are not used, for the same reasons as specified in another answer: the CSS value OM is not implemented right now.

1. Your library is a .NET library, not JS library. As such, it should follow .NET recommendations.

Why do you value scripting more? I bet the library will mostly be used in C#/VB. And if anyone wants to provide scripting support (for example, via IronPython), then classes can easily be wrapped. Iron* libraries have many features for modifying the behavior of .NET classes, as there're many differences. If you use a library in IronPython and want camelCase properties, you'll have to wrap classes anyway.

Do you have a real use case for avoiding exceptions? I don't know any scripting engines where it'll save more than 5% of code.

According to the official specification there is no exception on an unindexed element. It's just that there is no element.

What should I do? Throw an exception and let people have to deal with it in their scripting engine? The only way to get such a response (null) is by INTENTIONALLY having a wrong index. That could easily be avoided by checking either the index or the response. Otherwise one would also have to check for exceptions. What do you prefer? I take the red pill and follow the specification.

I know all that stuff, since I am a C# MVP and I am in good contact with the language team. Here we have a compute-intensive task that could take longer than 100ms. Now the user could of course do the same thing; however, there are many more benefits. Internally async will also be used, along with asynchronous requests. Those network requests are real async - so this is not just some fancy Task.Run thread spawning.

> According to official specification there is no exception on an unindexed element.

ArgumentOutOfRangeException? That's what List, Array and virtually all other collection classes throw if an invalid index is passed.

> Throw an exception and let people have to encounter that in their scripting engine?

Yes. This is what those who use List and numerous other classes will need to do if they want to avoid exceptions (according to the rules of their scripting engine or for any other reason).

> The only way to get such a response (null) is by INTENTIONALLY having a wrong index.

Mistakes happen. Just that. I can write for (int i = 0; i <= array.Length; i++) and I won't notice the error right away; the compiler won't complain either. But I get a null out of nowhere. If there were an exception, I would have noticed it the first time the code was executed.

> Internally also async will be used and also asynchronous requests.

How can the CSS parser cause requests? If I understood correctly, all HTTP requests are made via IHttpRequester. I've followed its RequestAsync method's incoming calls and none of them come from CssParser. I see async methods used only in DocumentBuilder. Neither its methods nor CssParser.ParseAsync are called from the library.

Unless I've missed something, there's no benefit in having a ParseAsync method, as a user who wants to offload the operation to a separate thread can do this themselves with Task.Run. And those who don't want to do that won't worry that they miss scalability benefits.

The official W3C specification does tell you what to do in such cases - namely: return nothing. This is the official behavior and there is no discussion on that, since the official specification has top priority.

This is a very newbish mistake, and if you then get a null reference exception you did two things wrong: not checking for a null reference, as is always required in unsure cases, and going beyond the top boundary.

Any CSS parser can easily perform requests (@import rule!), however, this feature of performing other requests is obviously intended for the HTML parser.

The *Async method will stay for sure, since there will be more to it than just parsing (and this will run in the original thread). However, the example I mentioned should already be enough to convince you that this is a good idea. If you didn't get that then I am sorry, but this is just ignorant.

The problem is, you value W3C spec over .NET recommendations. I have the opposite priorities. In my opinion, good libraries and applications are those that behave natively in their environment. But you probably prefer using Bash on Windows through Cygwin and Powershell on Linux through WINE.

Now seriously. You can see an example of spec-vs-spec in .NET: XmlDocument vs. XDocument. You prefer XmlDocument, I prefer XDocument. Simple as that.

I don't think you see the advantages of having a DOM built after the official specification within (!) the .NET platform.

Actually I am using bash on any one of my various Linux systems, either directly or over PuTTY.

Additionally I think you have the wrong impression when you think I would prefer XmlDocument over XDocument. However, I think that the DOM is also not the best API (hence JS got more popular once jQuery was out there, which is basically a very nice DOM wrapper). Nevertheless it seems to be a quite good basis - and I try to do the same here. Start with everything that's in the spec and then add custom things (important here: add).

So maybe I just was not able to show you the purpose of AngleSharp well enough. The main goal is to have a completely valid HTML5 parser written only in C#. That goal seems to be reached; now it's time to build the other sugar into it, as well as very useful helpers, plus optimizations (it is, however, already faster than e.g. HtmlAgilityPack - while still following all rules like e.g. foster parenting or even the HTML5 template element, which is not included in any browser but Chrome).

> I don't think you see the advantages of having a DOM built after the official specification within (!) the .NET platform.

You're right, I don't see advantages.

Naming conventions aside, what real advantages does swallowing exceptions in indexers provide? When you add support for JS scripting, does it help much? When you port other libraries using the DOM to .NET, does it help? Is teaching people the bare DOM interface worth it, considering almost nobody uses it (<insert StackOverflow jQuery joke>)?

On the other hand, violating the fail-fast principle will cause grief. You may be a superman who never makes mistakes, but a lot of others are not. Lack of uniformity between .NET classes and your library will cause grief too. That's another basic principle, the principle of least surprise: anyone who uses the library from .NET expects it to behave like other .NET classes.

What real advantages does avoiding collection interfaces provide? If I want to add "syntactic sugar", I can't make your class implement another interface; I'll have to create a plethora of wrappers, not to mention implement interfaces in not the most optimal way (throwing the correct exception from indexers, like in other .NET collections, would require three boundary checks in total, four more than when implemented optimally). You, on the other hand, can just inherit from the Collection<T> class.

If you currently aim to implement W3C standards, but don't mind implementing more sensible interfaces later - that's fine, I appreciate having the right priorities. However, your replies gave me the impression that you don't want to see the ICollection interface even later, just because it has a few redundant properties.

By the way, XDocument doesn't rely on XmlDocument implementation, so it's not merely syntactic sugar.

P.S. Have you compared performance of AngleSharp with HtmlParserSharp? I think that would be interesting.

Of course I compared performance (have you ever had a look at the documentation? everything here does not sound so ...) - AngleSharp is at least 10% faster, up to 30% for large documents.

I also never said that XDocument is based on XmlDocument - so I don't know why you write something like that. Still, I don't see why you insist on the ICollection - first, I already said that I will probably include it later as an explicit interface; second, I don't know anyone who has written extension methods for ICollection. Usually you write extension methods for IEnumerable - it's a much better ansatz.

So if you want to talk to me about coding standards, then have a look at your 1-1 port (sorry for being nasty). I just looked at some classes and you are not doing a good job there. No sealing, exposing variables directly, a C++-like naming convention for variables, ... and you talk about conventions and coding standards?

I don't get that hate here and I think I will stop discussing. There has been some good input, but unfortunately also a lot of pointless discussion.

What kind of real advantages and where are you asking those questions? Advantages in using the library? Advantages for using the library in a specific use case? Be more specific (the advantages of using the standard DOM API should be quite obvious - no need to learn a new API).

The performance is better than in the other two libraries you mention. For instance, parsing and querying a large document takes between 7ms (warmed up, best time) and 300ms (first run) with CsQuery, while AngleSharp does the job in 3ms (warmed up, best time) up to 180ms (first run). On average it's a 4ms to 8ms comparison.

Advantages of avoiding exceptions in indexers, of not implementing ICollection.

Advantages of "standard DOM" are not obvious ("obvious" is synonymous with "proved").
1. Most web devs use jQuery and never touch bare DOM.
2. .NET devs are more likely to know XmlDocument from .NET than DOM from browsers.
3. It cannot take advantage of features of specific programming languages.
4. It violates fail-fast principle.

1. Web devs do / should know the real DOM (those who only know jQuery are either beginners or not real web devs - sooner or later one needs to know what's going on).
2. .NET is declining and the web is still growing, so getting to know the most important API in this field is crucial.
3. Which programming languages? What are you talking about? It already takes a lot of advantage of the .NET world, and more is to come.
4. Why should one fail if one accesses some element that is not there? If you want that, then please go native and do some pointer arithmetic, just to end up at an invalid memory address. So null is returned, as written in the specification. Just because others (especially Microsoft) have chosen another path does not mean that this path has to be chosen again. Every browser does it like AngleSharp, so this is certainly not a wrong way.

Actually you are alone with your opinion as I get more than a dozen mails and IMs per day on how great this library is.

Okay, if you believe that the W3C spec is the greatest thing since sliced bread, the argument is pointless. But I still don't get something. The W3C spec is awesome, right? A library strictly following it is awesome, right? Except for some minor things, XmlDocument strictly follows the spec, right? Therefore, XmlDocument is awesome, right? XDocument, on the other hand, is a total abomination - it does not follow the W3C spec, its API is absolutely weird (it silently ignores adding nulls, it implements namespaced names with operator+ etc.). XDocument is plainly not modern, not JavaScript-ish, not Web-2.0-ish. It must be burned with fire!

But you prefer XDocument. How come? I... I don't understand. This lack of understanding forced me to write three more messages already. You can stop this mutual torture right now.

Oh, and since when .NET is declining?

P.S. Come on, the lib is awesome, I've never stated otherwise. I just like perfection. Too bad perfection can be different.

I also like perfection but I do not understand why you insist on implementing e.g. ICollection. The only benefit I see is the Count property. I am not avoiding implementing this interface, I just

a.) haven't thought about it yet in detail (currently the focus is on getting everything to work - then comes extending it, refactoring and optimizations)
b.) do not see much purpose for it on most collections (however, there are surely some that might be suited for this and, as I said, I will definitely implement it there at least explicitly)

So what I want to achieve here is implementing everything that is official first, getting it to work as expected and then (and I think I already started with this in form of the public extension methods and other things like ToText) making it beautiful to work with.

BTW: DOM4 also specifies some more methods that are definitely more modern and polished than the old ones. Those will be also implemented.

Unfortunately Microsoft decided to put more focus on consumers, especially with Windows Store apps. Hence, for about a year now, .NET has been stagnating or slowly declining. It is still the best platform in my opinion, and C# is as elegant and well-thought-out as ever; however, since the trend is going strongly in favor of tablets / smartphones and the web, there seems to be a turning point.

To finish this XmlDocument vs XDocument discussion: XmlDocument tried to follow a much earlier spec and had some weird issues (some of which you already mentioned). The most important thing it was missing (it wasn't specified at that time) is QuerySelector / QuerySelectorAll. In my opinion these are the methods that are most useful nowadays.

According to C# coding standards, "CSS" must be replaced with "Css". I see that you try to adhere to the official CSSOM, but since you've already converted item methods to indexers and changed camelCase to PascalCase, I see no reason to violate the C# class naming conventions. This would also improve uniformity, as currently half of the classes are prefixed with "CSS" and the other half with "Css". This also applies to HTML and other classes.

Implementing only the IEnumerable<T> interface in "list" classes is probably a bad idea, as it complicates the access to properties. Having a Length property and an indexer while not implementing any standard interface with a Count property should be avoided too.

If we look at the "good old" System.Xml.XmlDocument and related classes, we'll notice that lists are implemented as collections. These implement IEnumerable and ICollection. I think it should be a minimum.

A more modern LINQ-y System.Xml.Linq.XDocument and related classes use generic IEnumerable<T> interface and don't use any specific "list" classes.

Looks like you want to make the library featureful, and that probably involves modification of styles using object model, not only getting read-only model from the parser. So I think collections should implement ICollection<T> and IList<T>. This will greatly simplify queries to the model. Or, since you target .NET 4.5, you can implement IReadOnlyList<T> in the meantime, while modification is not truly supported.

(In my Alba.CsCss library, in many cases, I was forced to implement IEnumerable<T> as Firefox code heavily relies on manually constructed linked lists. While it looks relatively "natural", there're many queries and conversions involved. So my library is not a good example of public interfaces, unfortunately.)

I've seen those linked lists, but I would rather replace those with a List. It will be more efficient in C#.

As far as the naming goes: I think all publicly visible classes of the CSSOM are CSS*, since the official specification says so. There are also some good reasons for mixing them: CssRule is an enumeration and CSSRule is a class - one could have changed CssRule to CssRuleType or similar, but those names fit. Some classes that start with Css* are parser-related, like CssParser itself. But they have nothing to do with the specification, which is why I think this cut is alright.

Maybe some more methods will come; right now it's just a portrait of the official specification. The official specification does not give you much access to the CSSOM - but I will probably change that and add some more methods. This is also the reason for not having a Length attribute - if it is not specified by the W3C, I omitted it. I don't think ICollection has to be used - IEnumerable already gives you LINQ access, and ICollection requires members like CopyTo or IsSynchronized to be implemented, which would not be meaningful at all for those kinds of objects.

As far as the naming for properties and methods goes: here is where the DOMAttribute comes into play. Indexers do not violate it, because indexers can also carry this attribute, and in the end it's up to the user (or scripting engine) to decide if an indexer can / should be used.

One last thing: I am not targeting .NET 4.5 - it's PCL78. Unfortunately there is a huge difference, which does not make my life easier (also, I am still thinking about the best way of making an easy-to-maintain .NET 4 release).

In the end this is good input. Right now the focus is somewhere else (not on CSS), but it's on the list and everything will be in there. Some features you mentioned are already fixed (like the naming), some are interesting (like some missing IEnumerable implementations) and some are obsolete (like the ICollection). Thanks a lot!

I don't understand: why not name classes according to C# naming standards and apply DomAttribute to them, like you did with properties and methods?

As for naming different classes CSSRule and CssRule — sorry, this is extremely confusing and just crazy. Especially considering you aim at scripting — some scripting languages are case-insensitive.

You don't have to implement all ICollection<T> methods yourself. You can just inherit from Collection<T> or ReadOnlyCollection<T>. Or you can proxy it to a wrapped instance of List/Collection — they both implement all necessary properties. CopyTo may not be used often, but it is meaningful. Synchronization properties aren't very useful, but having them won't hurt.

Hm, I think the enum was never called CssRule but CssRuleType - so the name is still different for case-insensitive languages.

Well, I think having properties or methods that have no meaning and are not officially specified is not very useful. I still don't see any benefit that comes with ICollection. If it's really about a proper length property, then I think it's either already there (in form of Length, due to the specification) or there are reasons for (publicly) omitting it.

As far as the naming goes: I don't see any harm in class names like HTMLDocument as compared to HtmlDocument. There are even examples in the .NET Framework that do not use the single-capitalized version, even though the majority of classes follows it. So for me this is not a hard rule, nor a very logical one (and I think the majority of developers also does not insist on it).

Developers can have extension methods for ICollection<T>, some methods expect collections, some optimizations rely on collection type (Count LINQ method works faster on collections than on just enumerables) etc. Therefore implementing all supported interfaces is very useful.

> Well, I think having properties or methods that have no meaning and are not officially specified are not very useful.

These methods come from .NET Framework. For example, CSSOM doesn't have GetEnumerator method and IEnumerable<T> interface, but you've used them. Why? Because they are highly useful for developers who use the library. Because it would be very inconvenient to use indexer and length property to iterate over the collection.

CSSOM defines methods which are required to be implemented, but it does not force you to implement them and only them. ToString and GetHashCode don't have a meaning in many cases too.

> I don't see any harm in class names like HTMLDocument

The reason why Microsoft has chosen PascalCase for long abbreviations is that it makes the code more readable and avoids SCREAMING_CAPS. (There's already WinAPI with crazy naming conventions, they don't want to repeat this mistake.) PascalCase is basically easier to read than SCREAMINGCAPS.

Stylistically, it's very good if all code follows the same conventions. I like that in C# my code looks pretty. I cringe every time I look at C++ code Where every_library FollowsItsOwn NAMING_CONVENTION.

A more practical reason: tools like Resharper allow to search for classes by using first letters of the words in the class names. If a class is named CssStyleRule, I can enter "CSR" (or "csr") and it'll suggest the matching names. If CAPS are used, I'll have to enter every letter: "CSSSR" — before it even starts narrowing down the search to classes I need.

And although there're a few violations of the naming conventions in the .NET Framework, almost all classes follow them. Conventions should be followed unless they cause issues, not vice versa (not following rules unless it causes issues).

Well, the argument with extension methods is good, but I dislike having two properties, Count and Length. Maybe I'll implement the interface for explicit usage, which would be a good way around that.

First, I think ReSharper has useful aspects, but it has created a generation of dumb developers (sorry to say, but one should be able to think without such tools). Second, it's not a caps-only name. HTML is HTML and not Html. It's HyperText Markup Language. Also, I think the code is easier to maintain if the DOM name is the same as the official name. On the other side, using camelCase instead of PascalCase would have been too much. Hence I went for the middle.

There are now over 2k NuGet downloads and reactions from a lot of people who are actively using it - and they think the API is great. They probably would have had the same opinion with e.g. HtmlDocument instead of HTMLDocument, but since they never complained I assume this is just fine, and breaking it now would be too much.

To conclude: I welcome all your input and there have been some good points, but there is certainly no more discussion on the naming convention of the existing classes and methods. There have been reasons, and no matter if you like them or not, they follow the basic .NET rules, maintain readability, and still stay very close to the original DOM names (only methods and properties change their first letter from lower to upper case), which is also great for people who want to learn about the DOM in a programming environment like VS with features like IntelliSense or built-in documentation.

Don't get me wrong: The discussion would have been great before the actual release, but even though AngleSharp is currently just an alpha version and far away from completeness, I do not want to change the existing API for obvious reasons. I hope you understand this.

The CSS parser in its current state cannot be used to correctly calculate styles. All values are represented as value lists, and that makes calculating the complete values of complex (especially shorthand) properties too complex.

Example:

margin: 20px is represented by margin: list { 20px }, while it actually means margin: rect { 20px, 20px, 20px, 20px }, or even margin-left: 20px; margin-top: .... The style with a margin can be overridden only partially, for example by specifying only margin-left.

border-color: rgba(82,168,236,0.8) becomes border-color: (the same thing happens with gradients and other properties).

In your readme, you say that one of possible use-cases is minification. If you actually want to support minification, you'll have to support all non-standard properties and selectors, all popular CSS hacks, multiple values for the same property (this is perfectly valid) etc. I don't think it is feasible, as it will hinder calculation of styles and other advanced features.

You can find more errors if you parse a complex CSS file. To find the aforementioned bugs, I've parsed Twitter Bootstrap's bootstrap.css (I haven't listed everything as there're too many issues).

Right now the CSS parser does not convert the values to real values. It's just about grouping and a vague representation.

In my readme is also the roadmap - where you can see that getComputedStyle() is part of the future. This part will have to rely on CSS values, which means that implementing this feature will include a final implementation of the whole CSS value model.

The examples you mention all work - just the string representation does not. This is related to the not fully implemented value model (you see that lists of values are all represented as a single value).

Thanks for your input - it will certainly be helpful once I touch the CSS again (it was just much more important to have a 100% HTML5 parser impl. ready and the QuerySelector / QuerySelectorAll working).

They are already included (if you mean QuerySelector and QuerySelectorAll) and working (passing all sizzle tests and all of the currently included W3C tests). They are at least as fast as using jQuery (usually even faster - though one doesn't "see" the difference). HTML documents with thousands of nodes are inspected within a few ms.

There are some useful DOM manipulation helpers (with very similar names to the ones in jQuery - like Html() or Text()). They are available in form of extension methods (the required namespace is AngleSharp).

Actually those CSS queries are (or have been) one of the goals of AngleSharp - in my opinion it's much nicer to operate on the DOM with CSS queries, and one will learn CSS selectors along the way (I think this knowledge is important).

Excellent, thanks for the great library! I'm developing a screen scraper application; I tried other libraries (HtmlAgilityPack etc.) but none comes close to yours when it comes to manipulating the DOM.

1) In HtmlSelectElement the Options property is missing. I can get the options from Children.
2) Form submit(): not sure how you are going to implement it. After submit, will you refresh the existing DOM or return the HTML from the HTTP POST?
3) JavaScript engine implementation in the future? If not, at least integration with any .NET JavaScript engines like Jurassic, Jint, EcmaScript.NET, JScript.NET etc.

Still testing and trying to implement some missing functionality like form submit.

Submit and other features rely on having a layer that performs HTTP requests. This is not supported by the given project type; however, a future release (have a look at the roadmap) will support this by registering a proper type.

Also please note that this is a parser, not a browser or anything like that. So this lib will have no JS implementation included, but it is quite easy to write your own library which works with AngleSharp and builds upon such projects. I will probably release a V8 connector lib (like AngleSharp.V8) in the next couple of weeks.

Thanks a lot - if you see other bugs or feel that any method is missing: just post here or on GitHub. I will try to add unit tests for most things; but I always feel that my time is too limited for all that.

/// <summary>
/// Returns the Element whose NAME is given by elementName. If no such element exists, returns null.
/// The behavior is not defined if more than one element has this NAME.
/// </summary>
/// <param name="elementName">A case-sensitive string representing the unique NAME of the element being sought.</param>
/// <returns>The matching element.</returns>
[DOM("getElementByName")]
public Element GetElementByName(String elementName)
{
    return GetElementById(_children, "name", elementName);
}