Markdown Monster 1.09 releasedJan. 24, 2018 West Wind Technologies has released Markdown Monster 1.9 of its popular Markdown Editor, Viewer and Weblog Publishing tool. Markdown is an easy to...

West Wind Web Connection 6.18 releasedJan. 10, 2018 West Wind Technologies has released Web Connection 6.18 of its Web Development toolset for Visual FoxPro. Web Connection is a rich tool for building...

I've been building a number of solutions lately that relied heavily on parsing text. One thing that seems to come up repeatedly is the need to split strings but making sure that certain string tokens are excluded. For example, a recent MarkDown parser I've built for Help Builder needs to make sure it first excludes all code snippets, then performs standard parsing then puts the code snippets back for custom parsing.

Another scenario is when Help Builder imports .NET classes and it has to deal with generic parameters. Typically parameters are parsed via commas to separate them, but .NET generics may add commas as part of generic parameter lists.

Both of those scenarios require that code be parsed by first pulling out a token from a string and replacing it with a placeholder, then performing some other operation and then putting the the original value back.

For me this has become common enough that I decided I could really use a couple helpers for this. Here are two functions that help with this:

TokenizeString() basically picks out anything between one or more start and end delimiter and returns a collection of these values (tokens). If you pass the source string in by reference the source is modified to embed token place holders into the the passed string replacing the extracted values.

You can then use DetokenizeString() to detokenize either individual string values or the entire tokenized string.

This allows you to basically work on the string without the tokenized values contained in it which can be useful if the tokenized text requires separate processing or interferes with the string processing of the original string.

An Example – .NET Generic Parameter Parsing

Here's an example of the comma delimited list of parameters I mentioned above. Assume I have a list of comma delimited parameters that needs to be parsed:

Notice that this list contains generic parameters embedded in the < > brackets so I can't just run ALINES() on this list. The following code strips out the generic parameters first, then parses the list then adds the token back in. The Tokenization allows picking out a subset of substrings and replace them with tokens so additional parsing can be done without the noise of the generic parameters in brackets that would otherwise break the parse logic. This is quite common in text parsing where you often deal with patterns that you are matching – and trying to avoid edge cases where the pattern breaks down. This is where I've found tokenization super useful.

Specialized Use Cases

In Help Builder I have tons of use cases where this applies as documents are rendered: In parsing code snippets out of documents for parsing because the code snippets are rendered 'raw' while the rest of the document gets rendered as encoded context. Links that require special fixup before being embedded into the document – the tokenization allows easy capture of the links, replacing the the captured token value and writing it back out with a new value. In Web Connection the various template parsers do something very similar with expressions and code blocks that get pulled out of the document then injected back in later as expanded values.

There are lots of variations of how you can use these tokens effectively.

This isn't the sort of thing you run into all the time, but for me it's been surprisingly frequent that I've had to do stuff like this and while this isn't terribly difficult to do manually, it's very verbose code that's ugly to write as part of an application. These two functions greatly simplify application code as it's shrunk to a couple of simple helper functions.

Feedback for this Weblog Entry

I have to do this stuff frequently as well, and interestingly, like in your case, it usually involves parsing a comma-delimited list (usually column names, such as in a SQL SELECT statement) where one of the items in the list could contain commas, such as SELECT [This is a dumb column name, right], SomeOtherField ... I pretty much use the same technique (placeholders) but hadn't thought of creating a helper class.

It's been real handy to have this function. In Help Builder I dozens of places where this stuff comes in handy. My MarkdownParser class in Web Connection too uses this in a couple of places. It's reduced a few very common use cases to two lines of code.