Text Transformation with MGrammar and the Oslo SDK

My job involves moving data of different shapes and sizes, and linking systems and business processes together. Normally, I use integration or data warehousing tools. Until I started using the "M" language in the Oslo SDK CTP, I had never considered building my own Domain Specific Language (DSL) as a tool in my repertoire.

If you've been following my Oslo SDK articles (http://www.codeguru.com/columns/experts/article.php/c15779/), you've been introduced to MSchema, MGraph, and the Repository, all components of the Oslo SDK modeling backbone. MGrammar is the third component of the "M" language. However, instead of defining data structure like MSchema, MGrammar defines data transformation, in particular the transformation of human-readable text. Continuing with the sample model I developed in the other articles, I'm going to show you how MGrammar can be employed to populate Repository data.

Oslo's goal is to deliver a foundation for building and storing models of all types. Models are application metadata formatted for runtime consumption. Separate Microsoft initiatives aim to build runtimes and tooling into applications such as Visual Studio that are Oslo model aware.

As I mentioned earlier, the M language is composed of MSchema, MGraph, and MGrammar. A complete introduction to M is beyond the scope of this article. MSchema and MGraph were covered in my prior articles; this article will acquaint you with MGrammar.

MGrammar Overview

Unlike XML, text is a natural human-consumable data medium. Although text can be semi-structured like XML, it is not standardized the way XML is. Parsing text to store it in, for example, a relational database with traditional development tools is not hard to do, but it is hard to do right. MGrammar bridges the gap between plain, human-readable, composable text data and XML, making semi-structured text parsing more approachable.

In a typical MGrammar program, a developer defines the patterns to search for in the text and defines how each matched pattern is translated into MGraph. MGraph looks a lot like inline C# collections. Oslo utilizes MGraph to populate MSchema models in the Repository.
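To make the pattern-to-graph idea concrete, here is a minimal sketch in Python rather than MGrammar. It is only an analogy: the pattern, the "Pair" label, and the to_graph function are hypothetical names invented for illustration, and nested Python dictionaries stand in for MGraph's labeled collections.

```python
import re

# Hypothetical pattern: a name, an equals sign, and a value.
PATTERN = re.compile(r"(?P<Name>\w+)=(?P<Value>\w+)")


def to_graph(text):
    """Translate every match into a labeled node and return the nodes as a list."""
    return [{"Pair": match.groupdict()} for match in PATTERN.finditer(text)]


if __name__ == "__main__":
    print(to_graph("host=web01 port=8080"))
```

The point of the sketch is the division of labor: one part describes what to look for, the other describes the shape of the data produced for each match.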

In MGrammar, a developer defines a set of Rules for transforming text into MGraph. MGrammar has three types of Rules:

Token rules work and look a lot like regular expressions.

Syntax rules can be composed of Tokens and define the MGraph produced from text input.

Interleave rules define ignored text.
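The three kinds of rules can be approximated in Python, purely as an analogy and not as MGrammar syntax. In this sketch, the token names (Number, Word), the whitespace interleave, and the "Setting" syntax are all assumptions made up for illustration.

```python
import re

# Token-like rules: named regular expressions (names are hypothetical).
TOKENS = {
    "Number": r"\d+",
    "Word": r"[A-Za-z]+",
}

# Interleave-like rule: text that is skipped between tokens.
INTERLEAVE = re.compile(r"\s+")


def tokenize(text):
    """Split text into (token_name, value) pairs, skipping interleaved whitespace."""
    pos, out = 0, []
    while pos < len(text):
        skipped = INTERLEAVE.match(text, pos)
        if skipped:
            pos = skipped.end()
            continue
        for name, pattern in TOKENS.items():
            m = re.compile(pattern).match(text, pos)
            if m:
                out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token rule matches at position {pos}")
    return out


def syntax_setting(tokens):
    """Syntax-like rule: a Word followed by a Number, projected to a labeled node."""
    if [name for name, _ in tokens] != ["Word", "Number"]:
        raise ValueError("input does not match the Setting pattern")
    (_, key), (_, value) = tokens
    return {"Setting": {"Key": key, "Value": int(value)}}
```

For example, syntax_setting(tokenize("port 8080")) yields a "Setting" node with the key "port" and the value 8080, while the whitespace between the two tokens never reaches the syntax rule.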

There are other features of MGrammar. However, a complete survey of the language is beyond the scope of this article and Rules are really the core of the language. So, I'm going to focus on Rules and, in particular, Token and Syntax rules. Using a sample, I'll illustrate how some basic Token and Syntax capabilities are employed to parse text.

As with other "M" programs, MGrammar applications are scoped to a module. Also like MSchema, an MGrammar application can import and export libraries of modules. As you can see, MGrammar supports comments and, as I mentioned before, the language syntax supports other things such as preprocessor directives like #if and #define. Keep in mind, though, that the Oslo SDK is a CTP and MGrammar's current incarnation is not complete. The Language keyword begins the application definition.

Main is the application entry point. Main must always be a Syntax. Text input into the MGrammar must match a defined Syntax or the application emits an error. In the example, there are three Syntaxes defining three distinct patterns: Req, ServerConfig, and Nil. Later in the article, I'll explain how a Syntax is constructed. I want to start with the Tokens.

Syntaxes handle the input and resulting MGraph output, also called a Projection. Everything to the right of the "=>" symbol is the Projection (output text).

Syntaxes can include scoped variables. Variables can be used in the Projection. Syntaxes can be composed from multiple Syntaxes and Tokens.

In the example, Req uses a variable called InputText. InputText contains the full text passed into the application and is used in the MGraph output. ServerConfig utilizes a variable called InputNum in a similar fashion.

Nil is a special-case Syntax. If nothing is passed into the application, an empty Projection is generated. Empty is an MGrammar keyword.

Main includes all of the defined Syntaxes. As I mentioned before, Main is the entry point of the application.
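The way Main combines Req, ServerConfig, and Nil can be sketched in Python as a rough analogy. The rule and variable names come from the sample described above, but the patterns they match here (digits for ServerConfig, text beginning with a letter for Req) are assumptions, as is the shape of the returned dictionaries. Trying the alternatives in a fixed order is a simplification of this sketch, not a claim about how MGrammar resolves alternatives.

```python
import re


def server_config(text):
    """ServerConfig-like alternative: binds the matched digits to input_num."""
    m = re.fullmatch(r"\d+", text)  # assumed pattern
    if m is None:
        return None
    input_num = m.group()
    return {"ServerConfig": [input_num]}


def req(text):
    """Req-like alternative: binds the whole input to input_text."""
    if re.fullmatch(r"[A-Za-z].*", text) is None:  # assumed pattern
        return None
    input_text = text
    return {"Req": [input_text]}


def nil(text):
    """Nil-like alternative: empty input yields an empty projection."""
    return {} if text == "" else None


def main(text):
    """Main-like entry point: try each alternative, fail if none match."""
    for rule in (nil, server_config, req):
        projection = rule(text)
        if projection is not None:
            return projection
    raise ValueError("input matches no defined syntax")
```

Calling main("8080") returns a ServerConfig node, main("restart web01") returns a Req node, main("") returns an empty projection, and input that fits none of the alternatives raises an error, mirroring the rule that input must match a defined Syntax.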

Running the Sample

There are some tricks to running the sample code.

First, to build and run an MGrammar application, you must select the "Sample Enabled" Intellipad. Next, you must select "Minibuffer" and type SetMode("MGMode") to enable MGrammar. The graphic below demonstrates enabling MGrammar mode.

Conclusion

MGrammar is a feature of the "M" programming language shipping with the Oslo SDK. MGrammar was built to make transforming semi-structured text easier. Rules are the core of MGrammar. Syntax and Token rules define the text patterns and transformations.

About the Author

Jeffrey Juday is a software developer specializing in enterprise integration solutions utilizing BizTalk, SharePoint, WCF, WF, and SQL Server. Jeff has been developing software with Microsoft tools for more than 15 years in a variety of industries including: military, manufacturing, financial services, management consulting, and computer security. Jeff is a Microsoft BizTalk MVP. Jeff spends his spare time with his wife Sherrill and daughter Alexandra. You can reach Jeff at me@jeffjuday.com.
