Introduction

I developed an interest in Rich Text back in my Delphi days. The .NET Framework certainly provides much better support for Rich Text, but, interestingly, text shading is not supported, and it was my interest in shading that prompted this article.

By shading I mean the constant width background shading used by MSDN and others to highlight code examples. Articles such as this also use shading, for the same purpose. Oddly, shading comes as standard with MS Word – even my 1997 version.

Shaded text is always presented in a mono-spaced font, such as Courier New or Lucida Sans Typewriter, and non-shaded text, for contrast, is commonly presented in a proportionally spaced font.

For some time I have maintained what is essentially a customized C# help file, drawing on Code Project articles, MSDN, my own code (occasionally!) and other sources, and I wanted to be able to implement shading whether or not it was used in the source material. I was also frustrated by the loss of formatting which can occur when a code snippet is ported across to a Rich Text Box, and so this article started to take shape.

It turns out that the shading effect can be achieved quite simply, and I will explain how it can be done.

I will go on to show how to incorporate syntax highlighting – a bit more of a challenge but quite straightforward once one appreciates what needs to be done. The color scheme can be set to match your IDE settings, or anything else you might choose.

This article will show how to apply:

shading with no syntax highlighting,

syntax highlighting with no shading, and

shading with syntax highlighting

to a text selection in a Rich Text Box.

All is done with a single mouse click.

The approach

Regular Expressions are at the heart of this exercise. Though I use Regular Expressions in a very minor way in implementing shading, the technique is used extensively with syntax highlighting, and indeed it is hard to see how it would be possible to perform syntax highlighting without using Regular Expressions.

Those new to Regular Expressions would benefit from reading on if only to see how Regular Expressions are used in a real world example, rather than the contrived examples which textbooks commonly have no choice but to use.

Where to start?

This is what we are setting out to achieve:

To be able to format a Rich Text selection we need to be able to edit the underlying escape sequences from which the Rich Text control rt1 will derive its formatted contents. That underlying version of the Rich Text is plain text and is easily edited, or at least it will be after we deal with some preliminaries.

Let us start by clearing rt1 and clicking Show/Hide RTF. t1 now holds the .Rtf version of what we see in rt1 (which of course is empty). The code is:

You see that a Rich Text Box has at least one font defined. The first font in the font table will be identified as \f0 and in this case that font is Verdana, because that is the font property I have set for rt1.

You may define as many fonts as you like. If you specify a font which the system is unable to implement, it will (as far as I am aware) substitute the font defined as \f0, and failing that it would use whatever the system default is.

Although we will not need to build a font table, the following is what a multi-font table might look like:

We can always find the font table, if we need to, because it starts in a defined way and terminates with double curly brackets:

{\fonttbl … … … … }}
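As a minimal sketch of that idea, the following C# locates the font table with a regular expression. The class name FontTableSketch, the method Extract and the sample header are hypothetical and are not taken from the demo application; the two-font header merely illustrates what a multi-font table can look like. The pattern assumes that no font entry contains a nested group, since a nested group would produce an earlier "}}".

```csharp
using System;
using System.Text.RegularExpressions;

public static class FontTableSketch
{
    // Illustrative RTF header only (an assumption): \f0 is Verdana,
    // \f1 is Courier New.
    public const string SampleRtf =
        @"{\rtf1\ansi\deff0{\fonttbl{\f0\fnil\fcharset0 Verdana;}" +
        @"{\f1\fmodern\fcharset0 Courier New;}}\viewkind4\uc1\pard\f0\fs20\par}";

    // Matches from "{\fonttbl" up to the first "}}": the brace closing the
    // last font entry followed by the brace closing the table itself.
    private static readonly Regex FontTable =
        new Regex(@"\{\\fonttbl.*?\}\}", RegexOptions.Singleline);

    // Returns the font table text, or an empty string if none is found.
    public static string Extract(string rtf)
    {
        Match m = FontTable.Match(rtf);
        return m.Success ? m.Value : string.Empty;
    }
}
```

Calling FontTableSketch.Extract on the sample header returns only the {\fonttbl … }} group, leaving the rest of the header untouched.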

More important than the font table, for our purposes, is the (optional) color table. The rt1 text, in the absence of any instruction to the contrary, will be black, the default text color. If you start typing, the text will certainly be black. (Note the absence of a color table in the case of an empty rt1, though had you set the text color property to other than black, you would see a color table with one entry).

We are going to need a color table for two reasons – we need to specify a shade color and we need at least four more colors for syntax highlighting.

I will show how to introduce a color table but first we must deal with a problem which will immediately arise. Colors are referenced according to their position in the color table. The escape sequences \cf1, \cf2, \cf3 … reference consecutive entries in the color table. (\cf0 is not explicitly defined, and can be regarded as black, though you are free to redefine, say, \cf7 as black also, if you wish. You can do what you like. Though there would be no point to it, you could define three differently indexed but identical green colors.)

Our problem is that, having defined a color table, whenever we examine the background rt1 encoding, we will see that only those colors which have been invoked for the current rt1 selection appear in the table. Any unused or not yet used colors are no longer in the table, and the index values will have changed. My color table specifies \cf1 for black, \cf5 for green, \cf6 for blue and so on. I need to be able to color keywords blue by invoking an escape sequence which is synchronized with my nominated color table. I cannot allow blue to change to green because the table has been compromised.

The way around this is to utilize a second Rich Text Box rt2. (There are other reasons why it is advantageous to use a second Rich Text Box, unrelated to color referencing). You will see how this enables us to define an invariant color table.

Once the formatted text has been pasted back into rt1 we don't know (and don't care) how the color indexing might have changed.

Now we can start

A code snippet in rt1, having been selected, is programmatically copied to rt2. The first advantage in doing this is the rt1 selection will remain selected and we can later copy the formatted text back on top of that selection without having to re-select the text, which could be messy.

We will have set the rt2 font property to the mono-spaced font of our choice and we can now pad out each line to the intended width of the shading, which will give us a uniform right hand edge (the shading effect can be set to full width, or something less). The .Rtf version of rt2 will be copied to a string workstring and all processing will be done via workstring. (There is no reason for rt2 to ever be visible):
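The padding step on its own might be sketched as follows. This is not the demo's listing: PaddingSketch and PadLines are hypothetical names, and the sketch works on the plain text of the selection before it is assigned to rt2. It relies on the font being mono-spaced, so that an equal character count means an equal width, and it expands each tab to four spaces, which is a simplifying assumption rather than true tab-stop handling.

```csharp
using System;

public static class PaddingSketch
{
    // Pads every line with trailing spaces to the same character count.
    // Lines already longer than 'width' are left unchanged.
    public static string PadLines(string text, int width)
    {
        string[] lines = text.Replace("\r\n", "\n").Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            // Assumption: a tab counts as four spaces.
            lines[i] = lines[i].Replace("\t", "    ").PadRight(width);
        }
        return string.Join("\n", lines);
    }
}
```

The width would be chosen to suit the desired shading: the full width of the control in characters, or something less.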

The second Replace operation locates the end of the font table and appends the color table. The colors making up colorDefinitions have been declared as string constants in the form \red#\green#\blue#;.
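A hedged sketch of appending a color table after the font table is given below. Only the name colorDefinitions and the \red#\green#\blue#; form come from the text; the class and method names are hypothetical, and the RGB values and the entries at indexes 2 and 4 are placeholders chosen for illustration. The leading semicolon is the empty entry for \cf0, so black lands at index 1, the shade color at 3, green at 5 and blue at 6, matching the indexes mentioned in this article.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorTableSketch
{
    // Entries in the \red#\green#\blue#; form. All RGB values are assumptions.
    const string Black  = @"\red0\green0\blue0;";        // index 1
    const string Maroon = @"\red128\green0\blue0;";      // index 2 (placeholder)
    const string Shade  = @"\red240\green240\blue240;";  // index 3
    const string Teal   = @"\red0\green128\blue128;";    // index 4 (placeholder)
    const string Green  = @"\red0\green128\blue0;";      // index 5
    const string Blue   = @"\red0\green0\blue255;";      // index 6

    // The leading ';' leaves index 0 as the default (auto) color.
    public const string ColorDefinitions =
        ";" + Black + Maroon + Shade + Teal + Green + Blue;

    // Finds the end of the font table and appends the color table after it.
    public static string AppendColorTable(string rtf)
    {
        var fontTable = new Regex(@"\{\\fonttbl.*?\}\}", RegexOptions.Singleline);
        string colorTable = @"{\colortbl " + ColorDefinitions + "}";
        // A MatchEvaluator avoids any '$' substitution handling in the
        // replacement; the count of 1 limits the change to the first match.
        return fontTable.Replace(rtf, m => m.Value + colorTable, 1);
    }
}
```

Because the result is just a string, the indexes in this table stay fixed for as long as the editing is done on the string rather than on a Rich Text Box.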

(Alternative means of creating a color table do not bear thinking about.)

workstring now contains the plain text, encoded, version of the selected Rich Text, complete with font and color tables.

Applying Shade

Now we can shade the selected text. Remember, we are working with plain text now, and that text includes our chosen font table, color table and the rt1 text selection, together with the header escape sequences which we don't have to construct – they are supplied.

The new color table is stable because it is just plain text, and it will remain unchanged during our editing. When finished, the edited encoding will be copied back to rt1, where it will replace the (still current) text selection.

The escape sequence which will shade the selection is \highlight3 where the numeric is the index in my color table of the desired shade color (I have hardwired this, but the color index could be easily kept under program control).

(Remember, when running the demo, that if you do not have the nominated mono-spaced font, something else will be substituted and it will probably be proportionally spaced, so make sure you use Courier New or one of the other common mono-spaced fonts).

The escape sequence \highlight3 has the effect of setting the text background color to the third color in the color table. Specifically, \f0\fs#<text> is replaced by \f0\highlight3\fs#<text>. If you examine the encoded background, however, after the highlighting has been implemented, you will see that the color indexes (not the colors) have changed.

(You may have noticed that the backslash in the Regular Expression (\\f0) is escaped, but not the backslash in the Replacement. Take a moment to think about why this is necessary).
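The replacement just described might be sketched like this. ShadeSketch and ApplyShade are hypothetical names, and the pattern assumes that \f0 is immediately followed by \fs and the font size digits, as in the description above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShadeSketch
{
    // Inserts \highlightN between \f0 and the \fs# that follows it.
    public static string ApplyShade(string workstring, int shadeIndex)
    {
        // Pattern: '\' is a regex metacharacter, so each literal backslash
        // is written as \\ (inside a verbatim string).
        string pattern = @"\\f0(\\fs\d+)";
        // Replacement: '\' is an ordinary character here; only '$' is
        // special, and $1 re-inserts the captured \fs# sequence.
        string replacement = @"\f0\highlight" + shadeIndex + "$1";
        return Regex.Replace(workstring, pattern, replacement);
    }
}
```

For example, ShadeSketch.ApplyShade(@"\pard\f0\fs20 int x;\par", 3) yields \pard\f0\highlight3\fs20 int x;\par.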

That is all there is to shading. To complete the operation, all we need do is:

rt1.SelectedRtf = workstring;

(If you select some text in rt1 and invoke Shade only, you will find on clicking Show/Hide RTF that the escape sequence for shading appears as \highlight1, not \highlight3. Although the table index has changed (there is only one color defined in the table, the shade color, so its index is now 1), the correct color is still referenced. This is a nice example of what I was referring to earlier).

Syntax Highlighting

We can incorporate syntax highlighting in five or six simple steps.

We will, in this order, color Keywords, Class names, Characters, Literals and Comments. Two styles of comment are catered for: those which start with // and those which take the form of blocks enclosed by /* and */. I have ignored the /// construct because I don't need it. It would, however, be quite easy to include.

Keywords, Class names, Characters, Literals embedded in Comments (and Comments embedded in Literals) must not be highlighted and you will see how this is done. We get into a bit of a bind, however, dealing with Literals in Comments as against Comments in Literals so I include a small clean up routine which looks after that. I will show that code later.

Let us start with Keywords:

Highlighting Keywords

Because they are a discrete set, I store the C# Keywords as a resource string. The keywords are 77 in number and can be picked up here.

The keywords cannot be used in the form they are supplied, however. For example, the first three keywords in an alphabetic listing are abstract, as and base. The abstract part of abstraction, the as part of has and the base part of baseless would be highlighted. To prevent this happening, each entry in the keyword list is placed between a pair of word boundary tags \b. The entries are converted to \babstract\b, \bas\b, \bbase\b, and so on.

Further, to enable the keyword list to be scanned via a single Regular Expression, entries are OR'ed together with the | character. The converted resource string now looks like:

\babstract\b|\bas\b|\bbase\b| … … … |\bwhile\b

so the pattern to be matched, in plain language, is abstract OR as OR base OR … … OR while
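The conversion from a plain keyword list to that pattern can be sketched as below. KeywordPatternSketch and BuildPattern are hypothetical names, and the sketch assumes the resource string holds the keywords separated by spaces.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordPatternSketch
{
    // Turns "abstract as base" into "\babstract\b|\bas\b|\bbase\b".
    public static string BuildPattern(string keywords)
    {
        string[] words = keywords.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        // Regex.Escape is a precaution; C# keywords contain no metacharacters.
        return string.Join("|",
            words.Select(w => @"\b" + Regex.Escape(w) + @"\b"));
    }
}
```

With the word boundaries in place, the pattern matches as in "x as y" but not the "as" inside "has". An equivalent and slightly shorter form would be \b(?:abstract|as|base)\b.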

KY translates to \cf6 (which is blue in my color table) and TX to \cf1, which is black. The effect is to color all keywords blue (including any embedded in comments and literals – they will be cleaned up later):

Consider the keyword public, for example. Wherever it appears in workstring it has been changed to \cf6 public\cf1, and so on for the other keywords.
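A minimal sketch of the keyword pass, using a delegate to wrap each match, follows. The names KY and TX come from the text, but their exact values are assumptions: each is given a trailing space here because RTF consumes the space after a control word as a delimiter. KeywordColorSketch and ColorKeywords are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordColorSketch
{
    const string KY = @"\cf6 ";  // keyword color (blue in the article's table)
    const string TX = @"\cf1 ";  // normal text color (black)

    // Wraps every keyword match as KY + keyword + TX.
    public static string ColorKeywords(string workstring, string pattern)
    {
        return Regex.Replace(workstring, pattern, m => KY + m.Value + TX);
    }
}
```

This sketch makes no attempt to skip RTF control words or text inside comments and literals; in the approach described here, the latter are cleaned up in a later pass.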

We will leave Class names to one side for the moment, because it is not really possible to put together a complete list of Class names. I will suggest a compromise later.

The delegates are straightforward and essentially reduce to inserting and removing \cf# escape sequences.
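As an illustration of a delegate that both inserts and removes \cf# sequences, here is a hedged sketch of a comment pass. It is not the demo's code: the names are hypothetical, \cf5 and \cf1 follow the green and black indexes mentioned earlier, and line comments are assumed to end at the next \par. It does not handle // or /* appearing inside a string literal, which the article deals with in a separate clean-up routine.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentColorSketch
{
    const string CM = @"\cf5 ";  // comment color (green in the article's table)
    const string TX = @"\cf1 ";  // normal text color (black)

    // A line comment runs to the next \par (or end of input);
    // a block comment runs to the first "*/".
    static readonly Regex Comment = new Regex(
        @"//.*?(?=\\par|$)|/\*.*?\*/", RegexOptions.Singleline);

    // Any \cf# sequence, with its delimiting space if present.
    static readonly Regex ColorCode = new Regex(@"\\cf\d+ ?");

    // Strips colors applied by earlier passes inside each comment,
    // then wraps the whole comment in the comment color.
    public static string ColorComments(string workstring)
    {
        return Comment.Replace(workstring,
            m => CM + ColorCode.Replace(m.Value, "") + TX);
    }
}
```

Running the comment pass last means that keywords colored inside a comment by an earlier pass revert to the comment color.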

Highlighting Class Names

Though I speak of class names, you will see by looking at your IDE that words other than class names are also colored. The default highlight color for these other user types is the same as the default for class names. When I speak of class names, therefore, I should be understood to be including these other user types, because there is merit in highlighting them also.

Highlighting of class names, if it is considered worth doing, requires some thought. Though there are no programming difficulties, we are dealing with an open ended, undefined list. Further, how do we know whether a word is a class name?

There is no way that I can think of to define a class name list, as we are able to do with keywords. There could be thousands of library class names and, furthermore, a programmer can conjure up new classes and name them at will so we have to decide how we will handle this situation.

My solution is to have a class name list which is maintained in isolated storage and, whenever we see a class name not in the list, we add it to the list. Class names are easily recognized in code because, leaving aside the ones we are all familiar with, in my experience the Pascal naming convention is always used, and, were that rule to fail, context and usage will serve to identify class names.

In this demonstration program I maintain the class names list in the form: Classname2 Classname17 Classname5 Classname86 … … which is an alias for: \bClassname2\b|\bClassname17\b|\bClassname5\b| … … and so forth.

The first form I call the unformattedClassList and the second formattedClassList. The formatted list is synchronized to the unformatted list.

The unformatted list, which you can see in rt3, can be hand edited or it can be updated by simply double clicking a recognized class name in rt1. I also provide for sorting the list so that you may more easily look through it.

Similarly you can remove a class name from rt3 by double-clicking the name in rt3.
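The list handling described above could be sketched as follows. The names unformattedClassList and formattedClassList come from the text; the class ClassListSketch and its methods are hypothetical, and the sketch leaves out the isolated storage and the double-click handlers entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ClassListSketch
{
    // Splits the space-separated unformatted list into names.
    static List<string> Split(string unformattedClassList)
    {
        return unformattedClassList
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }

    // Adds a name unless it is already in the list.
    public static string Add(string unformattedClassList, string name)
    {
        List<string> names = Split(unformattedClassList);
        if (!names.Contains(name)) names.Add(name);
        return string.Join(" ", names);
    }

    // Removes a name if present.
    public static string Remove(string unformattedClassList, string name)
    {
        List<string> names = Split(unformattedClassList);
        names.Remove(name);
        return string.Join(" ", names);
    }

    // Sorts the names (ordinal order) for easier browsing.
    public static string Sort(string unformattedClassList)
    {
        List<string> names = Split(unformattedClassList);
        names.Sort(StringComparer.Ordinal);
        return string.Join(" ", names);
    }

    // Derives the formattedClassList: \bName1\b|\bName2\b|...
    public static string Format(string unformattedClassList)
    {
        return string.Join("|",
            Split(unformattedClassList).Select(n => @"\b" + n + @"\b"));
    }
}
```

Regenerating the formatted list from the unformatted one after every change is one simple way to keep the two synchronized.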

This way of handling class names has the advantage of simplicity and works well.

Some comments on the demo application

The demo application comes with a resource string which is loaded as the source material on the first run. You can experiment with and save your own source. Your source can be saved to isolated storage, so you never need to go looking for it. The isolated storage files are created on the first run, and I have included two buttons (T1 and T2) which enable you to examine the list of isolated storage files and to delete them, should you wish to force a "first run".

Though I have done quite rigorous testing, I cannot say that all permutations and combinations have been gone through, so I would be happy to deal with any issues which arise.

Regular Expressions

There is a wealth of material available on the internet, including many Code Project articles, and I have not found it necessary to buy a text, so I am not able to recommend one. For those new to regular expressions, it is worth knowing that they feature large in the Perl world and, although there are syntactic differences between their Perl and C# usage, the differences do not matter too much. The most comprehensive treatment of this topic, in my experience, is to be found in the Perl literature.

Rich Text Format

When I first got interested in rich text I bought the RTF Pocket Guide (O'Reilly) and it is my companion whenever I am wrestling with this topic. My edition was published in 2003 and I imagine it would still be in print. Its price at that time was $US 12.95 and I strongly recommend it.

Isolated Storage

Again, all that you need is available on the internet, and the topic has been well covered in Code Project and MSDN articles. Google will take you where you want to go.

Conclusion

It is likely that some readers will be looking at one or more of the three topics (Rich Text, Regular Expressions and Isolated Storage) for the first time, and I hope the code shown here is of some value to them. I decided it was not practical to include in this article an explanation of the rather cryptic regular expressions I have used here. The length of the article would have doubled and it would perhaps still not have been an adequate treatment. I might submit at a later time an in-depth article dealing with this extremely powerful tool.

Finally an illustration of a range of examples where syntax highlighting can be seen working. The list is not exhaustive:

Comments and Discussions

Hi, thank you for the great article.
Do you think it is possible to use your code for retrieving plain text from an RTF string?
It would be very useful code.
To explain why: I need to write a SQL CLR function which takes an RTF string and returns plain text. I cannot use System.Windows.Forms in SQLCLR.
That is why I use regular expressions to retrieve the text. But for some RTF input the result is not correct.

Hi, I've rewritten your main loop to use a StringBuilder. String concatenation ( += ) is really slow because it needs to (re-)allocate memory every time it's performed. A StringBuilder allocates blocks of memory, only expanding when it's full. This should give a dramatic increase in performance, especially with large files or large word lists.

I'm currently working on a Snippet Manager application and am using a RichTextBox to display the data. I wanted to implement highlighting but was leery about doing parsing, as it is quite an undertaking; then I ran across your article.
I hadn't really thought about using Regex in conjunction with RTF, as I didn't know much about either, but have since learned a little about both.
If you're interested, I have a stable and usable version that I would be happy to email you so you can see first hand what I've done.

Thanks,
Mike
