C# - A Split-and-Merge Expression Parser in C#

I published a new algorithm for parsing a mathematical expression in C++ in the May and July 2015 issues of CVu Journal (see items 1 and 2 in “References”). It took two articles because one astute reader, Silas S. Brown, found a bug in the first algorithm implementation, so I had to make some modifications to it. Thanks to that reader, the algorithm became more mature. I’ve also fixed a few smaller bugs since then. Now I’m going to provide an implementation of the corrected algorithm in C#.

It’s not too likely you’ll ever have to write code to parse a mathematical expression, but the techniques used in the algorithm can be applied to other scenarios, as well, such as parsing non-standard strings. Using this algorithm you can also easily define new functions that do whatever you wish (for example, make a Web request to order a pizza). With small adjustments, you can also create your own C# compiler for your new custom scripting language. Moreover, you just might find the algorithm interesting in its own right.

Edsger Dijkstra’s shunting-yard algorithm, published in 1961, is often used for parsing mathematical expressions (see item 3 in “References”). But it’s good to have an alternative that has the same time complexity and is, in my opinion, easier to implement and to extend.

Note that I’m going to use the “virtual constructor” idiom for the function factory implementation. This idiom was introduced in C++ by James Coplien (see item 4 in “References”). I hope you’ll find its use interesting in the C# world, as well.

The Split-and-Merge Algorithm

The demo program in Figure 1 illustrates the split-and-merge algorithm for parsing a mathematical expression.

Figure 1 A Demo Run of the Split-and-Merge Algorithm

The algorithm consists of two steps. In the first step the string containing the expression is split into a list of Cell objects, where each Cell is defined as follows:
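A minimal sketch of such a cell, assuming the member names Value and Action (they are illustrative and may differ from the original listing):

```csharp
// Sketch of a cell: a number plus the action that follows it in the expression.
// The member names Value and Action are assumptions made for illustration.
public class Cell
{
    public Cell(double value, char action)
    {
        Value = value;
        Action = action;
    }

    // The number held by this cell.
    public double Value { get; set; }

    // The operator to the right of the number, or ')' for the last cell.
    public char Action { get; set; }
}
```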

The action is a single character that can be any of the mathematical operators: ‘*’ (multiplication), ‘/’ (division), ‘+’ (addition), ‘-’ (subtraction) or ‘^’ (power), or a special character denoting the end of an expression, which I hardcoded as ‘)’. The last element in the list of cells to be merged will always have this special action ‘)’, that is, no action, but you can use any other symbol or parenthesis instead.

In the first step, the expression is split into tokens that are then converted into cells. All tokens are separated by one of the mathematical operators or a parenthesis. A token can be either a real number or a string with the name of a function. The ParserFunction class, defined later, handles all of the functions in the string to be parsed and also converts a token to a number. It may also call the whole string parsing algorithm recursively. If there are no functions and no parentheses in the string to parse, there will be no recursion.

In the second step, all the cells are merged together. Let’s see the second step first because it’s a bit more straightforward than the first one.

Merging a List of Cells

The list of cells is merged one by one according to the priorities of the actions; that is, the mathematical operators. These priorities are calculated as follows:
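A minimal sketch of such a priority function follows. The numeric values, the class name MergeRules and the helper CanMerge are assumptions; only the relative order of the priorities matters:

```csharp
// Sketch of operator priorities. The numbers are assumed values;
// what matters is that '^' > '*','/' > '+','-' > end-of-expression.
public static class MergeRules
{
    public static int GetPriority(char action)
    {
        switch (action)
        {
            case '^': return 4;
            case '*':
            case '/': return 3;
            case '+':
            case '-': return 2;
            default: return 0; // ')' or any other non-operator
        }
    }

    // Hypothetical helper: the left cell can absorb the right one when
    // the left action's priority is not lower than the right action's.
    public static bool CanMerge(char leftAction, char rightAction)
    {
        return GetPriority(leftAction) >= GetPriority(rightAction);
    }
}
```

Under this sketch, two cells can be merged when the priority of the left cell’s action isn’t lower than the priority of the right cell’s action.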

For example, merging Cell(8, ‘-’) and Cell(5, ‘+’) will lead to a new Cell(8 – 5, ‘+’) = Cell (3, ‘+’).

But what happens if two cells can’t be merged because the priority of the left cell is lower than the right cell? What happens then is a temporary move to the next, right cell, in order to try to merge it with the cell next to it, and so on, recursively. As soon as the right cell has been merged with the cell next to it, I return to the original, left cell, and try to remerge it with the newly created right cell, as illustrated in Figure 3.

Note that, from the outside, the Merge method is called with the mergeOneOnly parameter set to false, so it won’t return before completing the whole merge. In contrast, when the Merge method is called recursively (when the left and the right cells can’t be merged because of their priorities), mergeOneOnly is set to true because I want to return to where I was as soon as I complete an actual merge in the MergeCells method.

Also note that the value returned from the Merge method is the actual result of the expression.
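The merge step described above can be sketched as follows. This is an illustration under stated assumptions rather than the original listing: the Cell members, the priority values and the helper names CanMergeCells and GetPriority are assumed, and the list is assumed to end with a cell whose action is ‘)’.

```csharp
using System;
using System.Collections.Generic;

public class Cell
{
    public Cell(double value, char action) { Value = value; Action = action; }
    public double Value { get; set; }
    public char Action { get; set; }
}

public static class Merger
{
    // Assumed priority values; only their relative order matters.
    static int GetPriority(char action)
    {
        switch (action)
        {
            case '^': return 4;
            case '*':
            case '/': return 3;
            case '+':
            case '-': return 2;
            default: return 0;
        }
    }

    static bool CanMergeCells(Cell left, Cell right)
    {
        return GetPriority(left.Action) >= GetPriority(right.Action);
    }

    // Applies the left cell's action to both values and stores the result
    // in the left cell, which then takes over the right cell's action.
    static void MergeCells(Cell left, Cell right)
    {
        switch (left.Action)
        {
            case '^': left.Value = Math.Pow(left.Value, right.Value); break;
            case '*': left.Value *= right.Value; break;
            case '/': left.Value /= right.Value; break;
            case '+': left.Value += right.Value; break;
            case '-': left.Value -= right.Value; break;
        }
        left.Action = right.Action;
    }

    // Precondition: the last cell in listToMerge has the ')' action.
    public static double Merge(Cell current, ref int index,
                               List<Cell> listToMerge, bool mergeOneOnly = false)
    {
        while (index < listToMerge.Count)
        {
            Cell next = listToMerge[index++];
            while (!CanMergeCells(current, next))
            {
                // Temporarily move right and merge 'next' with its neighbor first.
                Merge(next, ref index, listToMerge, true);
            }
            MergeCells(current, next);
            if (mergeOneOnly)
            {
                break;
            }
        }
        return current.Value;
    }
}
```

For the cells of “3-2*4”, the first cell can’t be merged with the second one, so the second and third cells are merged first into Cell(8, ‘)’), and then the first cell is merged with the result, giving 3 – 8.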

Splitting an Expression into a List of Cells

The first part of the algorithm splits an expression into a list of cells. Mathematical operator precedence isn’t taken into account in this step. First, the expression is split into a list of tokens. All tokens are separated by any mathematical operator or by an open or close parenthesis. The parentheses may, but don’t have to, have an associated function; for example, “1- sin(1-2)” has an associated function, but “1- (1-2)” doesn’t.

First, let’s look at what happens when there are no functions or parentheses, just an expression containing real numbers and mathematical operators between them. In this case, I just create cells consisting of a real number and a consequent action. For example, splitting “3-2*4” leads to a list consisting of three cells:

Split (“3-2*4”) = { Cell(3, ‘-’), Cell(2, ‘*’), Cell(4, ‘)’) }

The last cell will always have a special END_ARG action, which I define as:

const char END_ARG = ')';

It can be changed to anything else, but in that case the corresponding opening parenthesis START_ARG must be taken into account, as well, which I define as:

const char START_ARG = '(';

If one of the tokens is a function or an expression in parentheses, the whole split-and-merge algorithm is applied to it using recursion. For example, if the expression is “(3-1)-1”, the whole algorithm is first applied to the expression in the parentheses. That subexpression evaluates to 2, so the split yields { Cell(2, ‘-’), Cell(1, ‘)’) }.

The LoadAndCalculate method adds all of the cells to the listToMerge list and then calls the second part of the parsing algorithm, the merge function. The StringBuilder item will hold the current token, adding characters to it one by one as soon as they’re read from the expression string.
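To make the interplay of splitting, recursion and merging concrete, here is a compact sketch that handles numbers, operators and parentheses only (no functions). It is a simplification under stated assumptions: the class name MiniParser, the method signatures and the priority values are illustrative, and a unary minus directly before a parenthesis is not supported.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public class Cell
{
    public Cell(double value, char action) { Value = value; Action = action; }
    public double Value { get; set; }
    public char Action { get; set; }
}

public static class MiniParser
{
    const char START_ARG = '(';
    const char END_ARG = ')';

    public static double Evaluate(string expression)
    {
        int from = 0;
        return LoadAndCalculate(expression.Replace(" ", ""), ref from);
    }

    // Splits data starting at 'from' into cells up to the matching ')'
    // (or the end of the string) and then merges them.
    static double LoadAndCalculate(string data, ref int from)
    {
        var listToMerge = new List<Cell>();
        var item = new StringBuilder(); // characters of the current token
        bool done = false;
        while (!done && from < data.Length)
        {
            char ch = data[from++];
            if (ch == START_ARG)
            {
                // Recursion: evaluate the part in parentheses, then replace
                // its END_ARG action with the action that actually follows it.
                double value = LoadAndCalculate(data, ref from);
                char action = from < data.Length && ValidAction(data[from])
                    ? data[from] : END_ARG;
                from++; // consume the operator or the closing parenthesis
                listToMerge.Add(new Cell(value, action));
                done = action == END_ARG;
            }
            else if (ch == END_ARG || (ValidAction(ch) && item.Length > 0))
            {
                listToMerge.Add(new Cell(ParseNumber(item), ch));
                item.Clear();
                done = ch == END_ARG;
            }
            else
            {
                item.Append(ch); // includes a leading '-' sign
            }
        }
        // Make sure the list ends with an END_ARG cell; malformed input
        // such as "2+" fails here with a FormatException.
        if (listToMerge.Count == 0 || listToMerge[listToMerge.Count - 1].Action != END_ARG)
        {
            listToMerge.Add(new Cell(ParseNumber(item), END_ARG));
        }
        int index = 1;
        return Merge(listToMerge[0], ref index, listToMerge);
    }

    static double ParseNumber(StringBuilder item) =>
        double.Parse(item.ToString(), CultureInfo.InvariantCulture);

    static bool ValidAction(char ch) => "*/+-^".IndexOf(ch) >= 0;

    static int GetPriority(char a) =>
        a == '^' ? 4 : (a == '*' || a == '/') ? 3 : (a == '+' || a == '-') ? 2 : 0;

    static double Merge(Cell current, ref int index, List<Cell> list, bool mergeOneOnly = false)
    {
        while (index < list.Count)
        {
            Cell next = list[index++];
            while (GetPriority(current.Action) < GetPriority(next.Action))
            {
                Merge(next, ref index, list, true);
            }
            switch (current.Action)
            {
                case '^': current.Value = Math.Pow(current.Value, next.Value); break;
                case '*': current.Value *= next.Value; break;
                case '/': current.Value /= next.Value; break;
                case '+': current.Value += next.Value; break;
                case '-': current.Value -= next.Value; break;
            }
            current.Action = next.Action;
            if (mergeOneOnly) break;
        }
        return current.Value;
    }
}
```

With this sketch, MiniParser.Evaluate("(3-1)-1") first evaluates “3-1” recursively and then merges Cell(2, ‘-’) with Cell(1, ‘)’).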

The StillCollecting method checks if the characters for the current token are still being collected. This isn’t the case if the current character is END_ARG or any other special “to” character (such as a comma if the parsing arguments are separated by a comma; I’ll provide an example of this using the power function later). Also, the token isn’t being collected anymore if the current character is a valid action or a START_ARG:

I know that I’m done collecting the current token as soon as I get a mathematical operator described in the ValidAction method, or parentheses defined by the START_ARG or END_ARG constants. There’s also a special case involving a “-” token, which is used to denote a number starting with a negative sign.
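The check can be sketched as below. The class name TokenRules and the exact parameter list are assumptions, and the sketch covers only the rules stated above (operators, parentheses, the “to” character and a leading minus sign):

```csharp
// Sketch of the token-collection rules; names and signatures are assumed.
public static class TokenRules
{
    public const char START_ARG = '(';
    public const char END_ARG = ')';

    // True for the mathematical operators used as cell actions.
    public static bool ValidAction(char ch)
    {
        return ch == '*' || ch == '/' || ch == '+' || ch == '-' || ch == '^';
    }

    // item: the token collected so far; ch: the current character;
    // to: an extra terminating character, such as ',' between arguments.
    public static bool StillCollecting(string item, char ch, char to)
    {
        // A '-' at the very start of a token is the sign of a negative number.
        if (item.Length == 0 && ch == '-')
        {
            return true;
        }
        return !(ValidAction(ch) || ch == START_ARG || ch == END_ARG || ch == to);
    }
}
```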

At the end of this splitting step, all of the subexpressions in parentheses and all of the function calls are eliminated via the recursive calls to the whole algorithm evaluation. But the resulting actions of these recursive calls will always have the END_ARG action, which won’t be correct in the global expression scope if the calculated expression isn’t at the end of the expression to be evaluated. This is why the action needs to be updated after each recursive call, like so:
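A simplified sketch of such an update is shown here. It assumes that the recursive call has left the position just past the closing parenthesis and that whitespace has already been removed. The class name ActionUpdater and the signature are hypothetical:

```csharp
// Sketch: decide which action follows a sub-expression that was just evaluated.
public static class ActionUpdater
{
    public const char END_ARG = ')';

    static bool ValidAction(char ch)
    {
        return ch == '*' || ch == '/' || ch == '+' || ch == '-' || ch == '^';
    }

    // 'from' points at the character right after the evaluated sub-expression.
    // If an operator follows, it is consumed and becomes the cell's action;
    // otherwise the sub-expression ends the current scope and END_ARG is kept.
    public static char UpdateAction(string data, ref int from)
    {
        if (from < data.Length && ValidAction(data[from]))
        {
            return data[from++];
        }
        return END_ARG;
    }
}
```

For “(3-1)-1”, the sub-expression ends at position 5, where a ‘-’ follows, so the resulting cell becomes Cell(2, ‘-’) rather than Cell(2, ‘)’).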

If the extracted token isn’t a function, this code will try to convert it to a double. Otherwise, an appropriate, previously registered function will be called, which can in turn recursively call the LoadAndCalculate method.

User-Defined and Standard Functions

I decided to implement the function factory using the virtual constructor idiom that was first published by James Coplien (see item 4 in “References”). In C#, this is often implemented using a factory method (see item 5 in “References”) that uses an extra factory class to produce the needed object. But Coplien’s older design pattern doesn’t need an extra factory façade class and instead just constructs a new object on the fly using the implementation member m_impl that’s derived from the same class:

private ParserFunction m_impl;

The special internal constructor initializes this member with the appropriate class. The actual class of the created implementation object m_impl depends on the input parameters, as shown in Figure 6.

The second function is a “catchall” that’s called when no function is found that corresponds to the last extracted token. It tries to convert the token to a real number. If the token is neither a real number nor an implemented function, an exception will be thrown:
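The idiom and the catchall can be sketched together as follows. This is a reduced illustration, not the original listing: the Evaluate signature is simplified to take an already computed argument, and the names NumberFunction, GetValue and s_functions are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public class ParserFunction
{
    private static readonly Dictionary<string, ParserFunction> s_functions =
        new Dictionary<string, ParserFunction>();

    private readonly ParserFunction m_impl;

    // Constructor used by derived classes: the object is its own implementation.
    protected ParserFunction()
    {
        m_impl = this;
    }

    // "Virtual constructor": the implementation object is chosen from the token.
    public ParserFunction(string item)
    {
        ParserFunction impl;
        if (!s_functions.TryGetValue(item, out impl))
        {
            impl = new NumberFunction(item); // catchall
        }
        m_impl = impl;
    }

    public static void AddFunction(string name, ParserFunction function)
    {
        s_functions[name] = function;
    }

    // Forwards to whichever implementation was selected.
    public double GetValue(double argument)
    {
        return m_impl.Evaluate(argument);
    }

    protected virtual double Evaluate(double argument)
    {
        return argument;
    }
}

// Catchall: treats the token as a number, or throws if it is not one.
class NumberFunction : ParserFunction
{
    private readonly string m_item;

    public NumberFunction(string item)
    {
        m_item = item;
    }

    protected override double Evaluate(double argument)
    {
        double number;
        if (!double.TryParse(m_item, NumberStyles.Float,
                             CultureInfo.InvariantCulture, out number))
        {
            throw new ArgumentException("Could not parse token [" + m_item + "]");
        }
        return number;
    }
}

public class ExpFunction : ParserFunction
{
    protected override double Evaluate(double argument)
    {
        return Math.Exp(argument);
    }
}
```

In this sketch, new ParserFunction("exp") yields an object that behaves like the registered ExpFunction, while new ParserFunction("2.5") falls through to the catchall and returns the number itself.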

Earlier I said I’d provide an example using a power function, which requires two arguments separated by a comma. This is how to write a function requiring multiple arguments separated by a custom separator:
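The idea can be sketched as below: the first argument is read up to the comma and the second up to the closing parenthesis. To keep the example self-contained, Parser.LoadAndCalculate here is a stand-in that reads only a plain number up to the given terminator. In the full algorithm it would evaluate an arbitrary sub-expression. The signatures are assumptions.

```csharp
using System;
using System.Globalization;

public static class Parser
{
    public const char END_ARG = ')';

    // Stand-in for the real method: parses a plain number that ends at 'to'
    // and moves 'from' past the terminator.
    public static double LoadAndCalculate(string data, ref int from, char to)
    {
        int end = data.IndexOf(to, from);
        if (end < 0)
        {
            end = data.Length;
        }
        double value = double.Parse(data.Substring(from, end - from),
                                    CultureInfo.InvariantCulture);
        from = Math.Min(end + 1, data.Length);
        return value;
    }
}

public class PowFunction
{
    // 'data' holds the text after "pow(", for example "2,10)".
    public double Evaluate(string data, ref int from)
    {
        double arg1 = Parser.LoadAndCalculate(data, ref from, ',');
        double arg2 = Parser.LoadAndCalculate(data, ref from, Parser.END_ARG);
        return Math.Pow(arg1, arg2);
    }
}
```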

Any number of functions can be added to the algorithm from the user code, like so:

ParserFunction.AddFunction("pi", new PiFunction());
ParserFunction.AddFunction("exp", new ExpFunction());
ParserFunction.AddFunction("pow", new PowFunction());

Wrapping Up

The split-and-merge algorithm presented here has O(n) time complexity, where n is the number of characters in the expression string. This is so because each token will be read only once during the splitting step and then, in the worst case, there will be at most 2(m - 1) – 1 comparisons in the merging step, where m is the number of cells created in the first step.

So the algorithm has the same time complexity as the Dijkstra algorithm (see item 3 in “References”). It might have a slight disadvantage compared to the Dijkstra algorithm because it uses recursion. On the other hand, I believe the split-and-merge algorithm is easier to implement, precisely because of the recursion, and easier also to extend with the custom syntax, functions, and operators.

Vassili Kaplan is a former Microsoft Lync developer. He is passionate about programming in C# and C++. He currently lives in Zurich, Switzerland, and works as a freelancer for various banks. You can reach him at iLanguage.ch.

Thanks to the following Microsoft technical expert for reviewing this article: James McCaffrey