Friday, August 24, 2012

I keep running into problems with the unit tests for a project I'm working on at work, in particular with calls to Assert.AreEqual(expected, actual).

Some Background Information

Currently, I'm working on software that integrates an accounting system with online banking web services. Instead of entering transactions manually in the accounting system and then again at the bank's website, firms can enter them once in the accounting system, and the integration software I'm writing handles making the appropriate web service calls to initiate the actual transactions at the bank.

There is a consortium out there known as the Interactive Financial eXchange Forum, usually known simply as IFX. It's an industry group that writes specifications for the transfer of financial data using XML and web services. Many financial institutions and software vendors, such as Intuit, the maker of Quicken, use specifications from the IFX.

Some of these specifications are available for download for free. The most recent versions (and the most useful parts, e.g. XSD schemas) are often available to members only (and that pricetag is quite steep). I'm using version 1.7. Within the version 1.7 schemas, there is a type known as NC, which stands for "narrow-character" string. A narrow-character string is limited to the 7-bit ASCII characters below 0x7F.

My first thought was to write an extension method to perform the conversion:

public static string UTF8ToLatin1(this string value);
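The body isn't shown above, but a minimal sketch of what it might have looked like is below. The exact conversion logic here (dropping any character outside the narrow range) is my assumption, not necessarily what the original method did:

```csharp
using System;
using System.Text;

public static class StringExtensions
{
    // Hypothetical implementation: keep only narrow (7-bit ASCII) characters.
    public static string UTF8ToLatin1(this string value)
    {
        if (value == null) throw new ArgumentNullException("value");

        var builder = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c < 0x7F)
                builder.Append(c);
        }
        return builder.ToString();
    }
}
```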

This worked well until I discovered that I had to constantly remember that if a string variable needed to be a narrow-character string, I needed to call that extension method any time I changed its value. Needless to say, I found occasions where I forgot to call the method. So while cool, this was not a good long-term solution.

A New Class is Born

So then I decided to get "cute"; but hopefully not too cute. It turns out that string is a sealed class. This is unfortunate, but there's probably a good reason for it. So I decided to write my own wrapper class that wraps a string value, provides all the necessary validation, delegates much of its behavior to string, and remains inheritable (there are several specializations of a narrow-character string used throughout the IFX specification schemas). I called it NCString.

Everything was going pretty well until I realized that I couldn't simply assign a string to a NCString object, and vice versa, without using some sort of property or method accessor. Well, I don't know about you, but that just smelled to me.

I then went researching to see if there was any way to create some sort of overloaded casting operator (sort of like C++'s static_cast and dynamic_cast operators, and what-not). It turns out, .NET has what they call conversion operators. I had forgotten all about these, mainly because they look a lot different since CLR 2.0 than from CLR 1.0, and honestly, I never had the need to implement one before (except for the occasional ToXXXX/FromXXXX methods).

Conversion Operators

In CLR 1.0, conversion operators were merely method calls in the form of ToXXX(...)/FromXXX(...) for languages that don't support operator overloading, and named operators called op_Explicit and op_Implicit for those languages that do support operator overloading.

In CLR 2.0, conversion operators got some syntactic sugar (at least, in C#). Explicit conversion operators, when used, look like regular casts from one type to another (much like how you can create a constructor in C++ that takes the type to convert from, which then allows you to use the C++ cast convention instead of calling the constructor directly). Implicit operators are even cooler in that the syntactic sugar allows you to assign the object to convert from directly to the object being converted to, as if both objects were of the same type. Obviously, there's much room for abuse here, so there are some compiler-enforced rules as well as general guidelines on their use (and on explicit conversion operators'), which can be found here and here.

So in my NCString class, I created the required constructors, implemented the requisite object.Equals methods, overloaded the equality operators, and finally created implicit conversion operators, both to and from string. Now, if you took the time to read the guidelines I linked to above, you will notice that I may have violated one of the guidelines:
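Condensed, the class looked something like the sketch below. This is illustrative only: the validation details are assumptions, and the real class also overloads the equality operators and has more members.

```csharp
using System;
using System.Linq;

// A trimmed-down sketch of NCString; the real class has more members.
public class NCString : IEquatable<NCString>
{
    private readonly string value;

    public NCString(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        if (value.Any(c => c >= 0x7F))
            throw new ArgumentException("Value contains non-narrow characters.", "value");
        this.value = value;
    }

    // Implicit conversion to string.
    public static implicit operator string(NCString ncString)
    {
        return ncString == null ? null : ncString.value;
    }

    // Implicit conversion from string; note it can throw via the constructor.
    public static implicit operator NCString(string value)
    {
        return value == null ? null : new NCString(value);
    }

    public bool Equals(NCString other)
    {
        return other != null && this.value == other.value;
    }

    public override bool Equals(object obj) { return this.Equals(obj as NCString); }
    public override int GetHashCode() { return this.value.GetHashCode(); }
    public override string ToString() { return this.value; }
}
```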

Do not provide an implicit conversion operator if the conversion is potentially lossy.

This is sort of a judgement call. On the one hand, you could consider the conversion from string to NCString as "lossy", because potentially, a string can hold a wider range of characters than what is allowed in a NCString. However, if you try to assign a string with characters that fall outside the allowable range for a NCString, a System.ArgumentException is thrown from the constructor of NCString, which, curiously, is shown as an example in the Framework Design Guidelines documentation, and yet also seems to violate the following guideline:

Do not throw exceptions from implicit casts.

(If anyone has anything to say about the examples shown in the Framework Design Guidelines versus the guidelines themselves, I'd be interested to hear; leave a comment.)

As I was saying, on the one hand, the implicit conversion from string to NCString could be considered lossy. On the other hand, both types are strings; it's just that one has a more restrictive character set (i.e., it's not like converting from a double to an int, which could result in actual data loss). So, in light of that, it makes the NCString class much simpler to work with if I can convert between instances of NCString and string without needing to write an explicit cast all the time.

Unit Testing Problems Begin

About this time, some unit testing problems began to rear their ugly heads when it came to calls to Assert.AreEqual(expected, actual). I initially began writing this blog post thinking that the generic version of this method was not properly inferring the parameter types, or that the non-generic version was being called instead of the generic version. Well, it turns out to be the latter rather than the former.

When I began writing unit tests, I would have code similar to the following:
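The original test code isn't reproduced here, but it was along these lines (the test and method names are illustrative, not from the actual project):

```csharp
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class NCStringTests
{
    [TestMethod]
    public void MemoIsNarrowCharacterString()
    {
        NCString expected = "SOME MEMO TEXT";  // implicit string -> NCString
        string actual = "SOME MEMO TEXT";      // value produced by the code under test

        // I expected this to compare the string values; instead it failed.
        Assert.AreEqual(expected, actual);
    }
}
```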

Identity Crisis

I expected that, 1) with the method having a generic method signature overload, 2) .NET's ability to infer generic method type parameters, and 3) since the parameters I passed were two different types with conversions from one to the other, actual would be cast to the type of expected, after which the two would be tested for equality.

As I just mentioned above, in .NET, generic methods can most often infer their generic parameters, so you don't need to explicitly tell .NET what the type of the generic parameter is. This is precisely what's causing me issues. I'm used to letting .NET infer my generic parameter types (especially when it comes to LINQ statements). But in this case, it's not being inferred. Why not?

The answer is rather simple; but throw in some implicit conversion operators and the answer becomes more complex. So here's the not-so-simple answer. The compiler did (probably) attempt to use the generic method. But because I didn't specify the type of the generic parameter T, instead letting .NET try to infer it, the inference failed. Why? Because of the implicit conversion operators. What type should .NET infer for the parameters? string? NCString? Each parameter is implicitly convertible to the other. Furthermore, even if .NET did choose one type over the other, there really isn't a way to dynamically apply a user-defined conversion from one type to the other through reflection. So in the end, the compiler had to choose the overload that takes two object parameters. And, of course, object has an Equals method that does little more than test for type equality and referential equality.

Finding Myself

There are two solutions to this problem. One easy, the other somewhat easy.

First, the somewhat easy one (my first approach, mostly because I didn't look to see what else was available in the Assert class). Explicitly tell the compiler what the generic type parameter should be so that the generic method signature is used: Assert.AreEqual<NCString>(expected, actual);. Because the two types are implicitly convertible to/from one another, and I've explicitly specified the generic type parameter to be a NCString, the actual parameter will be converted to a NCString and NCString's implementation of Equals will be used.

The really easy solution (for my case) is to use the overload that takes two string parameters and a bool (indicating whether the comparison should ignore case). Using this overload, I have to supply the third boolean parameter (in my case, always set to false), but no explicit type cast or conversion needs to be written. (Note, however, that an implicit type conversion is still performed from NCString to string.)
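In code, the two solutions look like this:

```csharp
// Solution 1: specify the generic type parameter explicitly, so the
// generic overload is used; actual is implicitly converted to NCString
// and NCString's Equals implementation does the comparison.
Assert.AreEqual<NCString>(expected, actual);

// Solution 2: use the (string, string, bool ignoreCase) overload;
// expected is implicitly converted to string, and false means the
// comparison is case-sensitive.
Assert.AreEqual(expected, actual, false);
```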

Which is better? Hard to say; it mostly depends on what you're trying to test for equality. In my case, both types were strings (one more specialized than the other), so there's not much of a difference either way, nor much more typing for one over the other.

Some Final Questions and Concluding Remarks

So, I wonder: does the .NET Unit Test Framework always use the Assert.AreEqual(expected, actual) overload that takes two object parameters whenever the two parameters are of different types, or just when the two parameters are not primitives for which an overload exists and a generic type parameter was not specified?

Anyway, I just wanted to post this, because this has been bothering me for a while. I didn't expect this behavior, but now that I've worked through it, I understand it. I hope that I can save you many hours of troubleshooting your unit tests.

Hello everyone. I'm back again. Before we get started, I just kind of wanted to put out a notice. I'm working on my blog template. So if you come here and see that stuff doesn't look exactly the way it should, well, you've been warned. This includes the mobile template (so far, I'm not impressed with blogger's mobile templates).

Wow, so many articles in such a short amount of time! Yes, I'm being facetious ;). Let's hope I can continue this pattern. After all, I started this blog 3 years ago to start writing down things I learn and which I'd like not to forget.

Well, this blog post certainly fits the bill. I know I've done some research on this once before, but I failed to write it down. So I wasted some more time re-researching this problem. What a waste. So, to help you not waste your time, I hope that you find this blog post helpful. Enough already, let's dive in!

Extensible Markup Language NAME Tokens

I first learned eXtensible Markup Language (XML) back in 2000-something-or-other and haven't used it much. A lot of people use XML everywhere for anything. I'm not a big believer in that, actually. It's a great tool and it makes sense to use it where and when it needs to be used.

Having said that, there have been three occasions over the last eight months where I had to brush up on my XML skills and really know it well. I'll put in a shameless plug for a book that, while old, has helped me tremendously (I haven't seen its equal): XML Primer Plus by Nicholas Chase.

Like I said, the book is a bit dated now, but it still has very valuable information in it and not much has changed. Having said that, one thing that has changed is the set of allowable characters in NMTOKENs. I recently needed to validate values I read from user input to ensure that they were in fact valid id attribute values. I naïvely assumed the following regular expression would do the trick:

if ( ... && Regex.IsMatch(id, @"^\d.*|\w*[\p{P}-[_]]+.*$")) { ... }

So, basically, an XML NMTOKEN can have anything in it except punctuation characters (other than the underscore). I just realized as I'm typing this that I completely forgot to check whether the NMTOKEN started with a number, which also isn't valid.

This might have been fine back in 1999, but the XML standard has changed since then. It's still version 1.0 (well, there is a version 1.1, but let's not go there), but it's the 5th edition.

The current XML specification defines a NAME token (to which ID tokens must adhere, and which is a specialized form of an NMTOKEN) as follows:

[Definition: A Name is an Nmtoken with a restricted set of initial characters.] Disallowed initial characters for Names include digits, diacritics, the full stop and the hyphen.

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Note:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.

At this point, I thought, "No problem"; I'll just use those productions in a Regular Expression, while making sure the token doesn't start with xml or some variation thereof, and that'll be that. During unit testing, I discovered that out of 29 invalid sequences that I tried (all in the ASCII range, so this unit test was not comprehensive in any way for the time being), only 7 were flagged. What's going on here?

Well, I read the MSDN documentation for .NET Regular Expressions. Here was the expression I used that was failing the unit test:
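The expression itself was shown as an image in the original post; reconstructed from the W3C productions quoted above, it looked roughly like this (a reconstruction, not the verbatim original):

```csharp
using System.Text.RegularExpressions;

static class XmlIdValidator
{
    // The character classes translate the W3C NameStartChar/NameChar
    // productions verbatim -- including [\u10000-\uEFFFF], which .NET
    // quietly parses as the character \u1000, the range '0'-'\uEFFF',
    // and the literal 'F', since \u must be followed by exactly four
    // hexadecimal digits. That is why invalid input slips through.
    public static bool IsValidId(string id)
    {
        return Regex.IsMatch(id,
            @"^(?i:(?!xml))" +
            @"([:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF" +
            @"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F" +
            @"\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]" +
            @"|[\u10000-\uEFFFF])" +
            @"([:A-Z_a-z0-9.\-\u00B7\u00C0-\u00D6\u00D8-\u00F6" +
            @"\u00F8-\u02FF\u0300-\u036F\u0370-\u037D\u037F-\u1FFF" +
            @"\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF" +
            @"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]" +
            @"|[\u10000-\uEFFFF])*$");
    }
}
```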

Now, I must admit, .NET's Regular Expression syntax is a bit funky. Anyway, the (?i:(?!xml)) says if the string case-insensitively starts with XML, then it doesn't match. Otherwise, use the production rules as shown above from the W3C.

The catch is, .NET strings are encoded in UTF-16 (and therefore, the escape sequence \u can only be followed by 4 hexadecimal digits), but according to the production for a NameStartChar I need to detect some characters outside of this range (the [\u10000-\uEFFFF] character class inside of the Regular Expression). Now, I had done some research on internationalization and Unicode, but I never really had to pay much attention to it. I'm the one who's mostly using my software. However, this is not the case this time. This software may be used by firms doing banking all over the country and around the world. So I definitely need to check that the ID string I'm getting is valid. During my research, I came across Joel Spolsky's blog on Unicode and character encoding. This was a good start (and you should read this if you haven't already), but I needed to know more. Specifically, how can I encode a Unicode character that falls outside of the range 0 - 65535 into a 16-bit value?

UTF-16 Surrogate Pairs

The answer is UTF-16 surrogate pairs. I had heard of these, but I didn't know how to generate them. This article helped me out. In UTF-16, no characters are assigned to the code points in the range 0xD800 - 0xDFFF. This range is used in an algorithm to generate the UTF-16 surrogate pairs that represent Unicode character code points above 65535. The algorithm is really simple and is outlined below.

Take the hex value of the Unicode character to encode as UTF-16 and subtract 0x10000 from it.

Take the result from step 1 above and shift it right 10 bits (0xA).

Take the result from step 2 and add 0xD800. This gives you the first surrogate of the surrogate pair.

Again taking the result from step 1, AND the value with 0x3FF to mask off all but the lower 10 bits.

Add 0xDC00 to the result from step 4 above. This represents the second surrogate of the UTF-16 surrogate pair.

The resulting 16-bit surrogates from above properly encode Unicode character code points above 65535 in UTF-16. Let's run through a quick example.

Encoding a Unicode Character Code Point Above 65535 into UTF-16

Let's start with the Unicode character code point U+18657. (I don't know what this character is, I just chose something at random.) Following our algorithm above:

0x18657 - 0x10000 = 0x8657

0x8657 >> 0xA = 0x21

0x21 + 0xD800 = 0xD821 (This is the first surrogate of the surrogate pair.)

0x8657 & 0x3FF = 0x257

0x257 + 0xDC00 = 0xDE57 (And this is the second surrogate of the surrogate pair.)

So, from the algorithm above, the Unicode character code point U+18657 can be encoded into UTF-16 using the surrogate pair U+D821 U+DE57.
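The algorithm above translates directly into code. Here's a small sketch (the BCL's char.ConvertFromUtf32 does the same job, but spelling it out step by step matches the walkthrough above):

```csharp
using System;

static class SurrogateEncoder
{
    // Encodes a code point above U+FFFF into a UTF-16 surrogate pair,
    // following the five steps above.
    public static char[] ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int value = codePoint - 0x10000;              // step 1
        char high = (char)((value >> 10) + 0xD800);   // steps 2 and 3
        char low = (char)((value & 0x3FF) + 0xDC00);  // steps 4 and 5
        return new[] { high, low };
    }
}

// ToSurrogatePair(0x18657) yields { '\uD821', '\uDE57' }, matching the
// worked example above.
```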

Putting it All Together

Finally, we come to the end. I needed to replace the character class [\u10000-\uEFFFF] with a valid \u escape sequence construct. That involves calculating the range of the surrogate pairs for the character class. I used the algorithm above to calculate the range of surrogate pairs which results in the following pattern that should be used to replace the invalid character class: ([\uD800-\uDB7F][\uDC00-\uDFFF]). This pattern will match all UTF-16 encoded Unicode character code points between the range of U+10000 - U+EFFFF, which is exactly what we want. Here's the final Regular Expression that will validate an XML ID token:
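The final expression was also shown as an image originally; reconstructed, it looks like this (same classes as in the failing attempt, with the invalid character class replaced by the surrogate-pair pattern):

```csharp
using System.Text.RegularExpressions;

static class XmlIdValidator
{
    // A reconstruction of the final pattern, not the verbatim original.
    public static bool IsValidId(string id)
    {
        return Regex.IsMatch(id,
            @"^(?i:(?!xml))" +
            @"([:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF" +
            @"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F" +
            @"\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]" +
            @"|[\uD800-\uDB7F][\uDC00-\uDFFF])" +
            @"([:A-Z_a-z0-9.\-\u00B7\u00C0-\u00D6\u00D8-\u00F6" +
            @"\u00F8-\u02FF\u0300-\u036F\u0370-\u037D\u037F-\u1FFF" +
            @"\u200C-\u200D\u203F-\u2040\u2070-\u218F\u2C00-\u2FEF" +
            @"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]" +
            @"|[\uD800-\uDB7F][\uDC00-\uDFFF])*$");
    }
}
```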

A Few Closing Remarks

First, the second and third Regular Expressions shown in this article could be simplified (e.g. by getting rid of all the alternation (|) constructs between character classes and creating one big character class). Second, and most important, this probably isn't exactly what you should normally do. The productions given by the W3C are meant to be inclusive, as they noted in their justification of the productions. But it would probably be more efficient to write a Regular Expression that matches on what should not be present in an XML ID token (though that Regular Expression would be almost as long and complex).

I did test my final Regular Expression as shown above, and it did pass (again, my unit test was not comprehensive with regard to characters outside the 7-bit ASCII character set). Still, since I used the production shown in the XML specification, I have no reason to think that this Regular Expression would let an invalid ID token "slip" through.

Wednesday, August 22, 2012

UPDATE - 16 January 2013

The contents of this article have been updated to correct a problem that existed at the time of publication. In addition, I have uploaded this updated Visual Studio extension to the Visual Studio Gallery.

Hello everyone. It's been way too long. It seems I'm always doing this and that and never have time to write down anything, anywhere—whether that be on this blog, or on Facebook. But today, I have something special for those of you who have that favorite Visual Studio 2010 Extension that they just can't live without, but it hasn't been updated for Visual Studio 2012.

In order to update the extension, you will need access to the extension's source code; so these instructions will have limited impact for most of you out there unless you can get your hands on the source code for that favorite extension.

Fortunately for me, the author of my favorite extension, AllMargins, did post his source code, especially since this extension is no longer available from the Visual Studio Gallery.

Install the Visual Studio 2010 and 2012 SDKs

In order to author extensions for Visual Studio, you must have the Visual Studio SDK installed. The same is true for modifying them. If you don't already have the SDKs installed, go download and install them.

Prepare the Source Code for Visual Studio 2012

If you want to maintain backward compatibility for the extension, it would be best not to open the original Visual Studio 2010 solution in Visual Studio 2012. That's because Visual Studio Extension projects for Visual Studio 2010 are not compatible with Visual Studio 2012. Well, that's not exactly true. The real truth is that the VSIX manifest XML schema has changed between Visual Studio 2010 and Visual Studio 2012. The old schema still works for Visual Studio 2012, but Microsoft would have you use the new schema going forward. The problem is, if you use the new schema, you won't be able to use the extension in Visual Studio 2010. Don't worry though, Visual Studio 2012 will not upgrade the schema, just the project file(s); but in doing so, the projects will no longer open in Visual Studio 2010.

Why should you care? Well, for one, testing. You can't simply instruct VsixInstaller to install the extension to an experimental instance of Visual Studio, unfortunately. That really makes it hard to test the extension for more than one version of Visual Studio. Even so, you really should be testing your extension on a virtual machine that has the target version of Visual Studio installed but does not have the Visual Studio SDK installed. If the Visual Studio SDK is installed on your test machine, your extension will always work, but you may find that it doesn't work for users who don't have the Visual Studio SDK installed. Your extension should not require that the Visual Studio SDK be installed (unless you're writing a Visual Studio Extension project template! Lol...).

If you want to be able to develop and test in both Visual Studio 2010 and Visual Studio 2012 on the same machine, here's my suggestion.

Create a new empty solution called <Your Extension>.2012 in the same place as the original solution.

Open Windows Explorer to the location of one of the projects that are in the Visual Studio 2010 solution.

Select the project file and copy and paste it into the same folder. The file will be renamed 'Copy of <Your Project File>.csproj' (or .vbproj, if you use Visual Basic). Rename the file to <Your Project File>.2012.csproj.

Add all the files that are part of the Visual Studio 2010 project to the newly created Visual Studio 2012 project.

Repeat these steps for each of the other projects that were part of the original Visual Studio 2010 solution.

Now you can work on the source code in both Visual Studio 2010 and Visual Studio 2012, opening the respective solution for the version of Visual Studio you're using to make changes. When you build the Visual Studio 2012 solution, it will be deployed to the Visual Studio 2012 experimental instance. The same is true for 2010, the extension will be deployed to the Visual Studio 2010 experimental instance. Also, it allows you to work on a single code-base that should work in both environments so long as you don't use features exclusive to Visual Studio 2012.

While upgrading this extension to work with Visual Studio 2012, I discovered that there is an enumeration value in the VSIX XML schema for specifying the Express editions of Visual Studio. However, the only "extensions" that Express SKUs support are project/item template extensions.

We're now ready to start modifying the various files to make this extension compatible with Visual Studio 2012!

Update the *.vsixmanifest file

The first order of business in updating the extension is to modify the *.vsixmanifest file. Here's an example of one of the files from the AllMargins.vsix extension:
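The manifest wasn't preserved in this post; the following is a reconstruction using the 2010-era VSIX schema, shown as it looks after the edits described in the next paragraphs. The Id GUID, author, and description are placeholders, not the values from the real AllMargins source:

```xml
<?xml version="1.0" encoding="utf-8"?>
<Vsix Version="1.0.0" xmlns="http://schemas.microsoft.com/developer/vsx-schema/2010">
  <Identifier Id="SettingsStore.00000000-0000-0000-0000-000000000000">
    <Name>SettingsStore</Name>
    <Author>Placeholder Author</Author>
    <Version>1.3</Version>
    <Description>Settings store support for AllMargins (modified VSIX package; not the original).</Description>
    <SupportedProducts>
      <VisualStudio Version="10.0">
        <Edition>Pro</Edition>
      </VisualStudio>
      <VisualStudio Version="11.0">
        <Edition>Pro</Edition>
      </VisualStudio>
    </SupportedProducts>
    <SupportedFrameworkRuntimeEdition MinVersion="4.0" />
  </Identifier>
</Vsix>
```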

By removing the MaxVersion attribute for the SupportedFrameworkRuntimeEdition element, we're telling Visual Studio that so long as the machine supports the .NET Framework 4.0, the extension can be used.

Also for good measure, I updated the Version value (e.g. 1.3) and updated the Description element noting that I changed the VSIX package and it is not the original. As an aside, every time you make a change to the extension that you wish to distribute, you must increment the value for the Version element.

Now I ran across a tricky problem with updating the AllMargins extension. This extension has several "sub-extensions" that are contained within the AllMargins extension itself. And some of these "sub-extensions" have yet other "sub-extensions". The long and short is, during testing, Visual Studio had a problem locating some of the required DLLs.

What I found out is that some of the sub-extension projects contained an additional file—a *.pkgdef file. This is an older file format that was used with VSPackages back in the VSI days of Visual Studio 2005/2008. It is not unlike a *.reg file, though with a slightly different syntax.

This file for the SettingsStore sub-extension (for which the VSIX manifest file was shown above) looks like this:
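The file's contents weren't preserved in this post; a binding-path pkgdef of this kind would look something like the following (the GUID here is a placeholder, not the real one from SettingsStore):

```
[$RootKey$\BindingPaths\{00000000-0000-0000-0000-000000000000}]
"$PackageFolder$"=""
```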

This creates a key named for the GUID under the HKCU\Software\Microsoft\VisualStudio\<Maj. Version Num>_Config\BindingPaths key. $PackageFolder$ expands to the folder where the extension was installed by VSIXInstaller.

Apparently, in Visual Studio 2010, just placing this file within your VSIX package was enough for Visual Studio to go ahead and create the requisite registry keys and values. However, this does not work with Visual Studio 2012 and attempts to use extensions relying on this extension result in a System.IO.FileNotFoundException. In order to resolve this issue, you must place some additional content in your VSIX manifest file as shown below.

<References>
  <Reference Id="Microsoft.VisualStudio.MPF" MinVersion="10.0">
    <Name>Visual Studio MPF</Name>
  </Reference>
  <!-- Any other references you may have -->
</References>
<Content>
  <MefComponent>|%CurrentProject%|</MefComponent><!-- Expands to the assembly name -->
  <VsPackage>SettingsStore.pkgdef</VsPackage><!-- Not required by VS2010, but is required by VS2012 in order for the file to be processed -->
</Content>

With all of that out of the way, save the file and we're done with the *.vsixmanifest file. If there are several VSIX extensions that make up the extension you're updating (as is the case for AllMargins.vsix), you will need to do this for each *.vsixmanifest file in the solution.

At this point, at the original date of publication (22 August 2012), I had you retarget the projects in the solution and replace some DLL references. If you want to maintain compatibility with Visual Studio 2010, this is not necessary. So I have removed that section of the original blog entry.

Recompile a Debug Build

You should now recompile the solution using a Debug build. This will automatically install the extension(s) into the Visual Studio experimental instance (i.e. it won't muck up the Visual Studio that you use for everyday coding). Make sure that everything works correctly. If all is well, you're ready for the final step.

Recompile in Release Mode

If your testing above went OK, then change the build mode from Debug to Release and recompile. Then simply navigate to the bin\Release directory under the project folder for the extension and double-click the VSIX file to install it.

Now, in this case, I was not required to change a single line of code. This may not be the case for every extension you wish to upgrade. YMMV.