System.WeakReference is a special runtime class that holds a reference to an object without preventing that object from being collected and finalized once no other live references remain. The only problem is that the encapsulated value is of type "object", and so using it requires a lot more casting than it should. More often than not, a WeakReference will only encapsulate a value of one type, so this casting is often superfluous.

Enter Sasa.Weak<T>, which is a struct that wraps WeakReference and provides a typed interface to encapsulated values. As a struct, it does not incur any additional memory allocation costs, and the casts it performs are operations you would likely have to do anyway, so the overhead of the typed interface is virtually nil. Sasa.Weak<T> is available in the core Sasa.dll assembly.
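A minimal sketch of what such a typed wrapper can look like (the member names here are illustrative, not necessarily Sasa's exact API):

```csharp
using System;

// Illustrative sketch only; Sasa.Weak<T>'s actual members may differ.
struct Weak<T> where T : class
{
    readonly WeakReference reference;

    public Weak(T value)
    {
        // One WeakReference allocation; the struct wrapper itself adds none.
        reference = new WeakReference(value);
    }

    // This cast is the one clients of the raw WeakReference would write by hand.
    public T Value => reference == null ? null : reference.Target as T;

    public bool IsAlive => reference != null && reference.IsAlive;
}
```

Clients then write `new Weak<string>(s)` and read `w.Value`, which yields the string, or null once the target has been collected.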

Sunday, March 24, 2013

A while ago, I wrote a small library for describing, importing, and exporting tabular data to and from the Excel XML and CSV formats. I just never got around to releasing it, but I'm making a more concerted effort recently to push releases forward. The full documentation is online, and the binaries here. The license is LGPL.

Describing Tabular Data

Tabular.dll is an assembly via which you can declaratively describe tabular data:

There are really only three classes of interest, Table, Row, and Cell. A table consists of a series of rows, a row consists of a series of cells, and each cell consists of a string value together with an alleged data type describing the string contents.

The DataType enumeration is the list of recognized data strings. Cell provides numerous implicit coercions from CLR types to Cell and sets the appropriate data type, so declarative tables are simple to describe as the above code sample demonstrates.
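Since the original sample isn't reproduced here, the following is a self-contained sketch of the design just described — a Table of Rows of Cells, with implicit coercions tagging the data type. It mirrors the description above, not Tabular's actual source:

```csharp
using System;
using System.Collections.Generic;

// A declarative table via collection initializers and implicit coercions.
var table = new Table
{
    new Row { "Name", "Age", "Joined" },
    new Row { "Alice", 32, DateTime.Today },
};

// Sketch of the described Table/Row/Cell design, not Tabular's code.
enum DataType { String, Number, DateTime, Boolean }

class Cell
{
    public string Value;
    public DataType Type;
    // Implicit coercions set the appropriate data type automatically.
    public static implicit operator Cell(string s) => new Cell { Value = s, Type = DataType.String };
    public static implicit operator Cell(int i) => new Cell { Value = i.ToString(), Type = DataType.Number };
    public static implicit operator Cell(DateTime d) => new Cell { Value = d.ToString("o"), Type = DataType.DateTime };
    public static implicit operator Cell(bool b) => new Cell { Value = b.ToString(), Type = DataType.Boolean };
}

class Row : List<Cell> { }
class Table : List<Row> { }
```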

It's also quite simple to describe a table that's derived from some enumerable source:

The CsvFormat enumeration describes whether the CSV data should be formatted in the safer quoted data format, or in the less safe raw format.

Import/Export of Excel Data

Tabular.Excel.dll provides import and export features for the Excel XML file format. The Excel 2002 XML schema is used for simplicity. See the docs for the full details, but here's a relatively straightforward overview:

Sasa.Types is a static class containing a number of extension methods on System.Type, together with extensions that mirror some of the CLR metadata instructions which aren't typically available in C#. It's available in the core Sasa.dll.

Sasa.Types.Create

Sasa.Types.Create is a static method used to create a dynamic type in a dynamic assembly, often for code generation purposes. It automates various steps and provides a boolean parameter indicating whether to save the assembly to a file, so you can run verification passes on it:
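The steps being automated look roughly like the following standard System.Reflection.Emit sequence. This is the underlying BCL API, not Sasa's code; on .NET Framework, AssemblyBuilderAccess.RunAndSave plus AssemblyBuilder.Save writes the assembly to disk so peverify can inspect it, which is presumably what the saveAssembly flag toggles:

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// The boilerplate Create automates: assembly -> module -> type.
var name = new AssemblyName("MyDynamicAssembly");
var asm = AssemblyBuilder.DefineDynamicAssembly(name, AssemblyBuilderAccess.Run);
var module = asm.DefineDynamicModule(name.Name);
var type = module.DefineType("MyDynamicType", TypeAttributes.Public);
// ... define fields/methods on 'type' here ...
var created = type.CreateType();
```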

The first overload is for static fields, and the second overload is for instance fields, since it accepts an expression taking an instance and returning the field value.

The one caveat is that C# can't enforce that you properly reference fields, or that you're using the right overload to access static vs. instance fields. Instance fields require an instance in order to reference them, so you must use the overload that accepts two generic arguments. Static fields only require the use of one generic argument.

Sasa.Types.FieldName

Sasa.Types.FieldName is an extension method on FieldInfo that extracts a "normalized" field name. By "normalized", I mean that if the field was a compiler-generated backing field for an auto property, then it will extract the property name. Otherwise, it will just return the field name itself:

Note that this method currently depends on the naming convention used by the compiler, so it may not be 100% future-proof. If the convention ever does change, I anticipate updating this implementation to reflect that.
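The convention in question names an auto property's backing field "<PropertyName>k__BackingField", so the normalization can be sketched like this (a sketch of the described behavior, not Sasa's source):

```csharp
using System;
using System.Reflection;

// Sketch of the normalization described above.
static string FieldName(FieldInfo field)
{
    var name = field.Name;
    // Auto-property backing fields are named "<PropertyName>k__BackingField".
    const string suffix = ">k__BackingField";
    if (name.StartsWith("<") && name.EndsWith(suffix))
        return name.Substring(1, name.Length - 1 - suffix.Length);
    return name;
}

class Foo
{
    public int Bar { get; set; }  // compiler generates "<Bar>k__BackingField"
    public int baz;               // ordinary field: returned as-is
}
```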

Sasa.Types.GetTypeTree

Sasa.Types.GetTypeTree is an extension method on System.Type that extracts the whole sequence of generic type instances, declarations and arguments that make up a type:

The first overload is for static properties, and the second overload is for instance properties since it accepts an expression taking an instance and returning the property value.

Sasa.Types.ShortGenericType

Sasa.Types.ShortGenericType is a set of extension methods on System.Type that generates an abbreviated string describing a generic type, ie. assembly references omit versioning and public key information:
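To see why this is useful, note that AssemblyQualifiedName for even a simple List<int> spells out Version, Culture, and PublicKeyToken for both the list and its type argument. The abbreviation can be sketched as follows (my sketch of the described behavior, not Sasa's implementation):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch: strip Version/Culture/PublicKeyToken clauses from an
// assembly-qualified type name, keeping only type and assembly names.
static string ShortName(Type t) =>
    Regex.Replace(t.AssemblyQualifiedName,
                  @",\s*(Version|Culture|PublicKeyToken)=[^,\]]*", "");

// e.g. ShortName(typeof(List<int>)) omits all versioning and key info.
```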

Well there are plenty of explanations of C#'s basic syntax, classes, structs, etc., but nothing that specifically addresses the functional mindset a Haskell programmer would be starting with, so I wrote a reply providing links to the various familiar concepts from functional programming found in C#, and described various caveats that might be surprising to a Haskell user. I'll reproduce the post here for posterity:

I think there's a reasonable C# subset for functional programming, so if you stick to that you should be able to pick it up relatively quickly. Read up on:

LINQ -- you can use the query comprehension syntax, or the regular first-class function syntax. The former is sugar for the latter.

lambdas and delegates are nominally typed, so you can't implicitly or explicitly coerce a delegate of one type into a delegate of a compatible signature, ie. System.Predicate<int> is signature compatible with System.Func<int, bool>, but they are not interconvertible without doing some magic like I do in my Sasa.Func.Coerce library function.

methods and delegates are multiparameter, not curried like in OCaml and Haskell, thus leading to all the Func* and Action* overloads.

The "void" return type is not a type, so it can't be used as a generic argument. Hence the need for all the Action* delegate types distinct from the Func* delegate types. Action* differ only in the fact that they return void.

generic parameters on methods are strictly more general than generic parameters on delegates (which are types). Method generics support first-class polymorphism, while type declaration generics do not.

classes are always implicitly option types, ie. they are nullable, while struct types always have a "valid" value, ie. are not nullable. Struct types are then useful for eliminating null reference exceptions in programs, as long as the default struct value is meaningful.

Sasa.Strings.HardWrapAt

Sasa.Strings.HardWrapAt is a set of extension methods that insert new lines directly at the index specified by a parameter, without regard to words or other considerations, and return an enumerable sequence of lines:
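The behavior can be sketched as (illustrative, not Sasa's source):

```csharp
using System;
using System.Collections.Generic;

// Hard wrapping: break the string every 'width' characters,
// ignoring word boundaries entirely.
static IEnumerable<string> HardWrapAt(string s, int width)
{
    for (int i = 0; i < s.Length; i += width)
        yield return s.Substring(i, Math.Min(width, s.Length - i));
}

// HardWrapAt("hello world", 4) yields "hell", "o wo", "rld"
```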

The overload with the generic type parameter simply calls ToString() on its argument; it was added to avoid the overhead of calling ToString on the second argument before knowing whether the first string was empty:

The overloads take a variable length list of either characters or strings to use for splitting, basically reversing the order in the base class libraries for convenience.

Sasa.Strings.ToBase64

Sasa.Strings.ToBase64 is a convenient extension method for converting a Unicode string into a Base64-encoded string, using an optional text encoding parameter. If no encoding is provided, UTF-8 is the default:
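The described behavior amounts to the following standard BCL calls (a sketch, not Sasa's source):

```csharp
using System;
using System.Text;

// Encode the string's bytes as Base64; UTF-8 when no encoding is given.
static string ToBase64(string s, Encoding encoding = null)
{
    encoding = encoding ?? Encoding.UTF8;
    return Convert.ToBase64String(encoding.GetBytes(s));
}

// ToBase64("foo") == "Zm9v"
```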

Sasa.Strings.WordWrapAt

Sasa.Strings.WordWrapAt is an extension method that inserts newlines at the whitespace boundary nearest but less than the provided column parameter. In other words, unlike Strings.HardWrapAt, which inserts the newline at the specified index, this method searches back for the nearest whitespace boundary less than the column boundary:
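A sketch of that search-back behavior (illustrative, not Sasa's implementation):

```csharp
using System;
using System.Collections.Generic;

// Word wrapping: break at the space nearest but below 'column',
// falling back to a hard break when a word exceeds the column width.
static IEnumerable<string> WordWrapAt(string s, int column)
{
    int start = 0;
    while (s.Length - start > column)
    {
        // Search backward from start+column for the nearest space.
        int brk = s.LastIndexOf(' ', start + column, column);
        if (brk <= start)
        {
            yield return s.Substring(start, column);  // no space: hard break
            start += column;
        }
        else
        {
            yield return s.Substring(start, brk - start);
            start = brk + 1;                          // skip the space itself
        }
    }
    yield return s.Substring(start);
}

// WordWrapAt("hello world foo", 6) yields "hello", "world", "foo"
```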

The simplest overload simply takes the inclusive lower and exclusive upper bounds for the sequence, and the second overload additionally takes a step size designating the increment between each number emitted.

These are defined as generic methods which make use of Sasa.Operators<T>, Sasa's generic operators class.

Sasa.Numbers.DownTo

Sasa.Numbers.DownTo is the complement to Sasa.Numbers.UpTo, where instead of generating a sequence of increasing numbers, it generates a sequence of decreasing numbers:

The simplest overload simply takes the inclusive upper and exclusive lower bounds for the sequence, and the second overload additionally takes a step size designating the decrement between each number emitted.
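Specialized to int, the pair behaves like the following (a sketch; the real methods are generic over T via Sasa.Operators<T>):

```csharp
using System;
using System.Collections.Generic;

// Inclusive lower bound, exclusive upper bound.
static IEnumerable<int> UpTo(int lower, int upper, int step = 1)
{
    for (int i = lower; i < upper; i += step)
        yield return i;
}

// Inclusive upper bound, exclusive lower bound.
static IEnumerable<int> DownTo(int upper, int lower, int step = 1)
{
    for (int i = upper; i > lower; i -= step)
        yield return i;
}

// UpTo(0, 5)  yields 0, 1, 2, 3, 4
// DownTo(5, 0) yields 5, 4, 3, 2, 1
```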

These are defined as generic methods which make use of Sasa.Operators<T>, Sasa's generic operators class.

Sasa.Numbers.Bound

The Sasa.Numbers.Bound extension method ensures that a value falls between inclusive lower and upper limits:
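For int, this is an ordinary clamp (a sketch of the described semantics; the real method is generic via Sasa.Operators<T>):

```csharp
using System;

// Clamp 'value' to the inclusive range [lower, upper].
static int Bound(int value, int lower, int upper) =>
    Math.Max(lower, Math.Min(upper, value));

// Bound(12, 0, 10) == 10; Bound(-3, 0, 10) == 0; Bound(5, 0, 10) == 5
```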

Sasa.Result<T> encapsulates the result of a computation: either its return value, or the error it generated, if any. For instance, you could spawn a thread to perform some calculation and deposit the result into an instance of Result<T>. Sasa.Result<T> is available in the core Sasa.dll.

Result<T> implements all the usual equality tests on itself, and on T, so you can perform direct comparisons to values.

Sasa.Result<T>.Value

The Sasa.Result<T>.Value property allows clients to obtain the value that was returned from the computation. If an error was instead generated, then this throws InvalidOperationException with the InnerException property set to the original exception:

This property is also required by IValue<T>, which Result<T> implements.

Sasa.Result<T>.HasValue

The Sasa.Result<T>.HasValue property allows clients to check whether the result is a legitimate value, or if it's an error result. If HasValue returns true, then clients can safely access the Value property. If it's false, then doing so will throw InvalidOperationException:

There are also implementations of Select and SelectMany for the more abstract IResult<T> interface, but because Result<T> is a struct these would incur too much unnecessary boxing, so we override the Select and SelectMany extension methods with instance methods.

Sasa.Result.Try

The Sasa.Result.Try static extension methods allow clients to execute some code within an exception handling block, which then returns the appropriate result for you, ie. either a value or an error:

The last overload over IEnumerable is useful because such sequences are often lazily evaluated, which means they may harbor hidden errors that you will only incur while iterating. This overload then turns a lazy sequence with possibly unsafe errors into a lazy sequence of exception-safe result types.
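A self-contained sketch of the idea — a pared-down Result type plus the two Try shapes; the real Sasa.Result<T> is richer than this:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// An unreliable lazy sequence: the error only surfaces during iteration.
IEnumerable<int> Unreliable()
{
    yield return 1;
    throw new InvalidOperationException("boom");
}

var safe = Result.Try(Unreliable()).ToList();   // [value 1, error]
var parsed = Result.Try(() => int.Parse("42")); // a value, not an exception

// Pared-down sketch of Result<T> and Try, mirroring the description above.
struct Result<T>
{
    public readonly T Value;
    public readonly Exception Error;
    public bool HasValue => Error == null;
    public Result(T value) { Value = value; Error = null; }
    public Result(Exception error) { Value = default(T); Error = error; }
}

static class Result
{
    // Run 'body' inside an exception handler, returning value-or-error.
    public static Result<T> Try<T>(Func<T> body)
    {
        try { return new Result<T>(body()); }
        catch (Exception e) { return new Result<T>(e); }
    }

    // Turn a lazy, possibly-throwing sequence into a lazy sequence of
    // exception-safe results.
    public static IEnumerable<Result<T>> Try<T>(IEnumerable<T> source)
    {
        using (var it = source.GetEnumerator())
        {
            while (true)
            {
                Result<T> r;
                try
                {
                    if (!it.MoveNext()) yield break;
                    r = new Result<T>(it.Current);
                }
                catch (Exception e) { r = new Result<T>(e); }
                yield return r;
                if (!r.HasValue) yield break;  // iterator is dead after a throw
            }
        }
    }
}
```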

Sasa.Option is an abstraction to deal with nullable/optional values. It is available in the core Sasa.dll. Reference types are already nullable, and structs have System.Nullable, so why write yet another abstraction to deal with optional values?

The answer is pretty simple: there is no other way to write a function whose generic arguments are clearly and plainly optional. This is partially complicated because C# doesn't consider type constraints when selecting method overloads, otherwise you could do something like this:
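Something along these lines, which C# rejects precisely because the constraints are ignored during overload resolution (illustrative, intentionally non-compiling):

```csharp
// Does not compile: overload resolution ignores the constraints,
// so these two methods are considered ambiguous.
int Foo<T>(T notNull) where T : class { ... }
int Foo<T>(T? possiblyNull) where T : struct { ... }
```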

The nullable overload is clear from the type information, but this type information is never used and C# complains about ambiguous methods. Technically, you could simply write Foo like so:

int Foo<T>(T possiblyNull)
{
    if (possiblyNull == null)
        ...
    else
        ...
}

This will work even if you pass in Nullable<int> because a Nullable<int> can be compared to null, and so will take the proper branch. However, it's not at all clear from the type signature of Foo that possiblyNull is an optional value. So Sasa.Option was created to easily specify this sort of contract, and have the contract enforced by the type system:

Option<T> has safe implicit conversions from T, and comparisons to null are fully defined.
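A minimal sketch of the contract (implicit conversion from T, Value throws when absent — not Sasa's exact implementation, which also handles nullable structs more carefully than this):

```csharp
using System;

// The optional contract is now explicit in the signature:
static int Foo(Option<string> possiblyNull) =>
    possiblyNull.HasValue ? possiblyNull.Value.Length : 0;

// Pared-down sketch of Option<T>, mirroring the description above.
struct Option<T>
{
    readonly T value;
    public bool HasValue { get; }
    Option(T value) { this.value = value; HasValue = value != null; }

    public T Value =>
        HasValue ? value : throw new InvalidOperationException("No value.");

    // Safe implicit conversion from T.
    public static implicit operator Option<T>(T value) => new Option<T>(value);
}
```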

Sasa.Option<T>.Value

Sasa.Option encapsulates a value that may or may not be present. The Value property of the Option<T> struct returns the encapsulated value if it exists, or throws InvalidOperationException if no value exists:

This post will deal with the Sasa.Func static class in the stand-alone core Sasa assembly. This core assembly is concerned mainly with addressing limitations in the core .NET base class libraries. For instance, it contains type-safe, null-safe and thread-safe event operations, extensions on IEnumerable, useful extensions to numbers, and so on.

Sasa.Func is particularly concerned with providing type-safe extensions on delegates. You can view the whole API online. Sasa.Func is available in the core Sasa.dll.

Sasa.Func.Id

The simplest starting point is Sasa.Func.Id. Use this method whenever you need a delegate that simply returns its argument. This is fairly common when using the System.Linq API. Usage:
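In lieu of the original sample, a sketch of the identity function and a typical LINQ use (illustrative, not Sasa's source):

```csharp
using System;
using System.Linq;

// Id simply returns its argument; handy wherever LINQ wants a selector.
static T Id<T>(T value) => value;

// e.g. flattening a sequence of sequences:
var flat = new[] { new[] { 1, 2 }, new[] { 3 } }.SelectMany(Id).ToArray();
// flat is { 1, 2, 3 }
```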

Sasa.Func.Create

However, Func.Create is strictly more powerful than Delegate.CreateDelegate because it addresses certain limitations of the CLR I discovered two years ago. It was previously impossible to create an open instance delegate to either virtual methods, or to generic interface methods. Func.Create handles both cases by automatically generating a small dispatch thunk which wraps the invocation for you, and returns a delegate of the appropriate type.

The example above uses reflection to access the method metadata, but Sasa does provide a type-safe way to obtain the MethodInfo without reflection via Sasa.Types. This will be covered in a future post.

Sasa.Func.AsFunc

One somewhat frustrating limitation of the CLR is that System.Void is not considered a value, and so cannot be used as a type parameter that is used in return position. So for instance, you can't create a Func<void>. This relegates void to second-class status, where all other types produce values as first-class citizens.

This effectively divides the logical space of function types into those that return void (System.Action), and those that return a value (System.Func), and you cannot mix the two. Every operation that abstracts over the return type must then be written twice: once for functions that return a value, and again for functions that return void.

Sasa.Func.AsFunc provides a wrapper around the various System.Action delegates, effectively transforming them into the corresponding System.Func instance with a return value of Sasa.Empty. Func.AsFunc is also an extension method on the System.Action overloads, to make this wrapping as concise as possible. Usage:
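A sketch of the idea for the single-argument case (Empty here is a stand-in unit type; Sasa's actual signatures may differ):

```csharp
using System;

// Lift an Action<T> into a Func<T, Empty>, unifying void and value returns.
static Func<T, Empty> AsFunc<T>(Action<T> action) => x =>
{
    action(x);
    return default(Empty);
};

// One abstraction over the return type now covers both cases:
var log = AsFunc<string>(Console.WriteLine);
Empty unit = log("a side effect, but now with a first-class result");

// A unit type standing in for "void".
struct Empty { }
```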

Sasa.Func.Getter

A somewhat recurring pattern in C# programming is generating a delegate to access the value of a property. It's a little wasteful to generate a whole new delegate that closes over an object instance and then accesses the property, considering the object already has a method getter, ie. for property Foo, the C# compiler generates a get_Foo method.

Sasa.Func.Getter allows you specify an expression naming a property, and will return a typed delegate to the direct method getter for that property. Usage:
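A sketch of how this works with standard expression trees and Delegate.CreateDelegate (the mechanics are BCL; the Getter name and shape follow the description, not Sasa's source):

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Extract the PropertyInfo from the expression and bind a delegate
// straight to its get_ method -- no closure over the instance.
static Func<TObj, TProp> Getter<TObj, TProp>(Expression<Func<TObj, TProp>> property)
{
    var member = (MemberExpression)property.Body;
    var get = ((PropertyInfo)member.Member).GetGetMethod();
    return (Func<TObj, TProp>)Delegate.CreateDelegate(typeof(Func<TObj, TProp>), get);
}

var getX = Getter((Point p) => p.X);   // direct delegate to get_X

class Point { public int X { get; set; } }
```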

At the moment, the whole expression tree is generated every time this method is invoked, but a future extension to Sasa's ilrewriter will eliminate this entirely and generate direct operations on CIL metadata.

Sasa.Func.Setter

The dual to Sasa.Func.Getter, Sasa.Func.Setter obtains a typed delegate for the direct setter method of an object. Usage:

Sasa.Func.Open, Sasa.Func.OpenAction

The typical closed delegate has type System.Func<string> and encapsulates the reference to the object being converted to a string.

The open instance delegate would have type System.Func<object, string>, so the object being converted to a string must be passed in each time.

Sasa.Func.Open and Sasa.Func.OpenAction methods serve the same purpose, namely to create a so-called open instance delegate, where the 'this' parameter is not encapsulated within the delegate itself, but is itself the first parameter passed to the delegate.

This allows you to reuse the same delegate multiple times on different objects without needing a different delegate for each object you want to convert to a string, or whatever other operation desired. This is also how efficient dispatch works in Sasa.Dynamics, ie. the cached delegate in Type<T>.Reduce(IReduce, T) is a statically computed, cached open instance delegate to the method that handles type T in the IReduce interface.
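For the simple non-virtual, non-generic case the BCL supports this directly, which illustrates the idea:

```csharp
using System;

// One open instance delegate to string.ToUpper serves every string:
// 'this' becomes the explicit first (and only) parameter.
var toUpper = (Func<string, string>)Delegate.CreateDelegate(
    typeof(Func<string, string>),
    typeof(string).GetMethod("ToUpper", Type.EmptyTypes));

Console.WriteLine(toUpper("abc"));  // the same delegate works on any string
```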

Sasa.Func.VOpen, Sasa.Func.VOpenAction

These methods serve the same purpose as Sasa.Func.Open and Sasa.Func.OpenAction above, but they operate directly on value types (hence the V prefix). As before, the first parameter to a method is always a reference to the object being manipulated, but structs aren't reference types, so a method for struct T actually accepts a "ref T" as its first argument. Thus, open instance delegates that modify their struct argument must have a different signature, namely that of the Sasa.VAction and Sasa.VFunc delegate types, all of which take a "ref T" as the first argument.

I'm not entirely satisfied with the naming of Func.Open, Func.VOpen, Func.OpenAction, and Func.VOpenAction, so I'm very open to suggestions. Part of the problem is that overload resolution does not take type constraints into account, so even though Open and VOpen have constraints T:class and T:struct respectively, they need a different name or the compiler complains of ambiguous methods. We also seem to need different names for Open/OpenAction or there is another ambiguity as to whether we want a delegate that returns a value, or that returns void.

Sasa.Func.Fix

Sasa.Func.Fix is a set of overloaded methods for generating recursive lambdas. Delegates built from lambdas can't refer to themselves to make recursive calls. Sasa.Func.Fix addresses this by providing what's known as a fixpoint function. Usage:
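A sketch of a fixpoint combinator in this style — the lambda receives a self-reference as its first argument (illustrative shape, not necessarily Sasa's exact signatures):

```csharp
using System;

// Tie the recursive knot: 'self' is passed back into the lambda.
static Func<T, R> Fix<T, R>(Func<Func<T, R>, T, R> f)
{
    Func<T, R> self = null;
    self = x => f(self, x);
    return self;
}

// The lambda can now call itself through its first argument:
var factorial = Fix<int, int>((fac, n) => n <= 1 ? 1 : n * fac(n - 1));
// factorial(5) == 120
```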

Sasa.Func.Coerce

The type language for delegates is somewhat limited compared to the type language for interfaces and classes. For instance, interface methods support first-class polymorphism, where delegates do not. Combined with the fact that delegates are nominal types in their own right, this causes a proliferation of delegate types that are identical in type signature, but differ in nominal type and so cannot be substituted for each other. For instance, System.Predicate<T> is equivalent to System.Func<T, bool>, but you cannot use a delegate of one type in a place where the other delegate type is expected.

The Sasa.Func.Coerce extension method allows you to coerce one delegate type into another equivalent type. Usage:
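One way such a coercion can work is to rebind the delegate's underlying method (and any captured target) to the new delegate type — a sketch of the approach, not necessarily how Sasa.Func.Coerce is implemented:

```csharp
using System;

// Rebind 'from' to a new delegate type with a compatible signature.
static TTo Coerce<TTo>(Delegate from) where TTo : class =>
    (TTo)(object)Delegate.CreateDelegate(typeof(TTo), from.Target, from.Method);

Predicate<int> isEven = n => n % 2 == 0;
var asFunc = Coerce<Func<int, bool>>(isEven);  // same body, new nominal type
```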

Sasa.Func.Generate

The Sasa.Func.Generate method overloads are the real workhorses behind the scenes. These methods eliminate all the boilerplate in generating and debugging a DynamicMethod, and are statically typed in the delegate type to generate. The most useful overloads by far are the ones that accept a boolean "saveAssembly" parameter. If saveAssembly:true is passed in, then the method is generated in a newly created assembly that is written to disk. You can then run peverify on it to check for any verification errors in your dynamically generated code. A single change to this variable can switch between debugging and production modes.

Func.Generate is used throughout Sasa, even in Sasa.Func to create the thunks to dispatch open instance delegates for virtual methods and generic interface methods. Usage:
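Underneath, the non-saving path amounts to the standard DynamicMethod dance, which Generate wraps with static typing (a sketch of the BCL steps being automated, not Sasa's code):

```csharp
using System;
using System.Reflection.Emit;

// Create a DynamicMethod, emit CIL into it, bake a typed delegate.
var dm = new DynamicMethod("add1", typeof(int), new[] { typeof(int) });
var il = dm.GetILGenerator();
il.Emit(OpCodes.Ldarg_0);   // push the argument
il.Emit(OpCodes.Ldc_I4_1);  // push constant 1
il.Emit(OpCodes.Add);       // add them
il.Emit(OpCodes.Ret);       // return the sum

var add1 = (Func<int, int>)dm.CreateDelegate(typeof(Func<int, int>));
// add1(41) == 42
```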

The basic interface defines the operations from which you can construct a lexer that can recognize individual characters, compose a fixed sequence of lexers, choose from a fixed set of alternate lexers, or recursively apply a lexer until it terminates. The "Tag" operation simply adds a descriptive name to the lexer for debugging purposes. This interface is sufficient to lex pretty much any sort of token stream you'd need to, but won't necessarily build the most efficient lexer for complex strings.

The Alternate operation additionally accepts a "merge" parameter of the following delegate type:

The Alternate lexer processes a series of possible lexes, and then invokes the LexMerge function to choose among them. LexMerge can keep all the lexes and the parser would return a parse forest, or it can apply some set of rules to choose the best lex to use, eg. choose longest match, or enforce only a single match.

These extended lexers don't add any lexing power, since they can be implemented in terms of the basic lexer operations as the code comments show. However, the implementations were added to optimize performance.

In general, the lexer takes a string as input and produces a series of outputs designating the tokens. For more detailed information, you can review ILexerState and the lexer implementation details, but for the most part you will never need to know these details since lexing is completely abstract when defining grammars.

Sasa.Parsing.Pratt

Having covered lexing and produced a stream of tokens, we can now move to parsing those tokens using a simple, yet efficient Pratt parser. Pratt parsing performs no backtracking and is technically Turing-complete, so you should be able to parse as complicated a token stream as you like. However, we can start simply with the canonical calculator example. Here's an interface that defines the semantics of simple arithmetic:

The MathGrammar accepts a math interpreter as an argument, and while declaring various parsing rules, registers callbacks into the interpreter to process the parsed values. The base interface of Sasa.Parsing.Pratt.Grammar accepts a LexMerge function to choose among alternate lexes. Sasa parsing provides a number of choices, though longest match will probably meet the most common needs.

Proceeding with operator declarations, here's the signature for the Infix operators:

Rule<T> is a parsing rule that encapsulates the lexer, the precedence of the operator, and the function to apply when this rule is triggered. The expanded declaration for the addition operator should now be clear:

Infix(op: "+", bindingPower: 10, selector: math.Add);

The "op" parameter is simply a shortcut that declares the operator symbol to associate with this rule, and creates a lexer that recognizes only that symbol. The parser triggers this rule when it recognizes a token stream where this symbol appears in infix position, ie. where this symbol is surrounded by two valid parses.

The "bindingPower" parameter is the absolute precedence of the declared operator. You can see from the math grammar above that addition has lower precedence than multiplication and equal precedence to subtraction, just as we've all learned in elementary arithmetic. Given the definition of Infix, Postfix and Prefix are obviously the analogous declarations for postfix and prefix operators. Infix is left-associative by default, and InfixR is simply Infix with a right-associative default.

It simply declares that a token stream that starts with "leftGrouping", followed by some other parse, and terminated by "rightGrouping", should trigger this rule with the given precedence. Brackets have the highest precedence of all symbols in arithmetic, so the math parser declares groupings with precedence = int.MaxValue.

Grammar's Match rule is where the parsing all starts, since it recognizes and parses numbers. Here is the signature:

The "id" parameter is simply a unique identifier for this rule, similar to how tagging is used for identifying lex rules. The Match rule accepts an arbitrary lexer as a parameter so you can recognize any sort of token sequence as belonging to this rule. The "bindingPower" is the precedence of this rule, as before, and the "parse" parameter is where we actually take the string recognized by the lexer, and turn it into a real value produced by the parser.

The only values we're concerned with are numbers, so the declaration for parsing numbers should be clear:

Match("(digit)", OneOrMore(Where(char.IsDigit)), 0, math.Int);

This declares a rule called "(digit)" with a lexer that recognizes characters that are digits. The rule has a precedence of 0, the lowest of all the rules in the math grammar. When a number is recognized, it then dispatches into the math interpreter, which can turn a digit string into a value of type T, the type of values operated on by that interpreter.

What's particularly noteworthy in this design is that the types of values produced by the parser are completely abstract, and so we don't even need to produce actual numbers for values. For instance, we can create an implementation of IMathSemantics that produces an abstract syntax tree describing the expression. Furthermore, we can actually extend the parser definition more or less arbitrarily to add new operations, new parseable values, etc.

Extensible Parsers

Starting with IMathSemantics, we can easily extend the math language by inheriting from IMathSemantics and adding variable declarations and applications:

Note that the type T must be able to support both values and bindings to values, so we can't provide an implementation of math semantics where T = int. We must add a notion of an environment into which bindings are injected. This is pretty straightforward. Here's the type we'll use to unify numbers and environments:

delegate Pair<Env<string, int>, int> Eval(Env<string, int> env);

So T will be a delegate that accepts an environment as a parameter, and produces a new environment and an integer value as a result. Here's the corresponding implementation of IEquationSemantics:

A little more involved than the simple math language, but most of that is due to passing around the environment. The original operations are all essentially the same, just operating on the tuple's second int value. Only the new Let/Var operations are markedly different. Let adds a new binding to the environment, and Var looks up the closest binding with the given name in the current environment.

Like the semantics, the equation grammar inherits the implementation of the simple math grammar and simply adds a few new rules. The Match rule identifies variable names that must start with a letter, followed by an arbitrarily long sequence of letters and numbers.

The TernaryPrefix rule identifies operators that have the following form: PREFIX expression INFIX expression POSTFIX. This is again a shortcut for defining a common operator type. In this case, we use the let-binding syntax from F# and other ML languages, ie. let x = 3 in x + 1.

The Disambiguate declaration is intended to simplify the common case where we only want a single valid parse. Basically, the keywords "let" and "in" also match the "(ident)" rule for variable names. Without the disambiguate declaration, some parses would be ambiguous, and since we want to produce a single parse for any given string, Disambiguate allows you to specify the priority of the parses. In this case, the "let" operator rule dominates over the identifier rule, so "let" will never be parsed as an identifier.

As with MathGrammar, the concrete type of the parse, T, is completely abstract, so we can produce values, or parse trees, or any other type that can satisfy the constraints of IEquationSemantics. This parsing design allows independent extensions of grammars, and parsed values, the latter of which can include direct interpreters, compilers, or other data structures that satisfy the required semantics.

Turing-Complete Parsing

So far we've reviewed simple parses with a set of pre-defined parse structures, like infix, prefix and ternary operators. However, Pratt parsing is Turing-complete, so the parsing can be arbitrarily complex. Digging into this machinery will demonstrate how the above operator rules were created, and how you can define your own operator rules.

I haven't bothered separating out the semantics for numbers and lists into its own interface for this simple example. The basic idea of defining parsing rules is defining the symbols that should appear in the token stream, and then registering either a NullDenotation, or a LeftDenotation, depending on the type of rule it's supposed to be.

A NullDenotation is a rule that has no parsed values to its left. The rule simply recognizes the starting tokens, and then takes the current parser as an argument and proceeds to manually process a token stream, ultimately returning a parsed value:

NullDenotation: Func<PrattParser<T>, T>

A simple example of a null denotation is a prefix operator like negation. Parses to the right of a negation operator exist, but none to its left.

A LeftDenotation is a rule that expects some parses to its left. The signature of a LeftDenotation is:

LeftDenotation: Func<Token<T>, PrattParser<T>, T, T>

A LeftDenotation accepts the token representing the operator, the current parser, and the parsed value immediately to the left. Infix and Postfix operators are simple examples of a LeftDenotation.

The operation of the Infix, Postfix and Prefix declarations previously covered should now be clear:

The parser.Parse(int) operation performs a parse according to the expected precedence of the next rule. This makes Pratt parsing a simple and efficient forward-only parser.

We can use any looping or testing constructs within NullDenotation and LeftDenotation, so the token streams that can be recognized are arbitrarily complex. The grammars definable using Sasa.Parsing are declarative, simple and quite efficient, so download and give it a whirl!