Introduction

Converting a file from one format to another is a common need. Most of the tools that do it cost money; not a lot, but they aren’t free. Microsoft Word, however, can read and write several of the file types you may want to convert between. This article shows how to use C# and Word Automation to read in a file of one type and write it out as another.

Background

I needed to convert RTF files to TXT. As I looked for solutions, I had a hard time finding one that was free, so I decided to use Word Automation instead. Note that this example uses Office 2003. If you do not have Office 2003, it will probably still work, except for the XML output, which is new in Word 2003. The sample program included in the download with the source code lets you set an input file, which is loaded into the Word object. A ComboBox then lists the formats that you can convert to.

The code

To use Word Automation, your project needs a reference to the Word object library.

Open the Add Reference dialog for your project and click on the COM tab. Towards the bottom of the list you will find Microsoft Word 11.0 Object Library. Select it and add it to the references. You can now access the Word functionality in code.
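With the reference in place, the conversion comes down to opening the input file and calling SaveAs with a target format. The following is a minimal sketch of that idea, not the code from the download. It assumes C# 4.0 or later (so the optional COM parameters can be omitted) and a reference to Microsoft.Office.Interop.Word; the class name WordConverter and the method name Convert are made up for illustration.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

public static class WordConverter
{
    // Hypothetical helper: opens inputPath in Word and saves a copy in the
    // requested format. Word generally expects full paths, not relative ones.
    public static void Convert(string inputPath, string outputPath, Word.WdSaveFormat format)
    {
        Word.Application app = new Word.Application();
        Word.Document doc = null;
        try
        {
            app.Visible = false;

            // Open the source file read-only so the original is never modified.
            doc = app.Documents.Open(FileName: inputPath, ReadOnly: true);

            // Write the document back out in the target format.
            doc.SaveAs(FileName: outputPath, FileFormat: format);
        }
        finally
        {
            // The casts avoid the method/event name ambiguity on Close and Quit.
            if (doc != null)
                ((Word._Document)doc).Close(SaveChanges: false);
            ((Word._Application)app).Quit(SaveChanges: false);
        }
    }
}
```

A call such as WordConverter.Convert(@"C:\docs\in.rtf", @"C:\docs\out.txt", Word.WdSaveFormat.wdFormatText) would then produce the plain-text file; the paths here are placeholders. Closing the document and quitting Word in the finally block keeps a stray WINWORD.EXE process from lingering if the conversion fails.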

Conclusion

So this is a pretty simple solution. It is important to note that when you save to XML, Word's formatting markup is included in the output, and some of it may not be what you want in your XML document. Still, the solution works well for converting RTF to TXT.

My thanks to all the other CodeProject articles on Word automation that helped me put this solution together.

About the Author

I started my programming career over 18 years ago doing COBOL and SAS on an MVS mainframe. It didn't take long for me to move into Windows programming. I started my Windows programming in Delphi (Pascal) with a Microsoft SQL Server back end. I started working with VB.NET when beta 2 came out in 2001. After spending most of my programming life as a Windows programmer, I started to check out ASP.NET in 2004. I achieved my MCSD.NET in April 2005. I have done a lot of MS SQL database work, and I have a lot of experience with Windows services and Web services as well. I spent three years as a consultant programming in C#. I really enjoyed it and found the switch between VB.NET and C# to be mostly syntax. In my current position I am programming in both VB.NET and C#. Lately I have been using VS2012 and writing a Windows 8 app. You can search for the app; it is called ConvertIT.

On a personal note I am a born again Christian, if anyone has any questions about what it means to have a right relationship with God or if you have questions about who Jesus Christ is, send me an e-mail. ben.kubicek[at]netzero[dot]com You need to replace the [at] with @ and [dot] with . for the email to work. My relationship with God gives purpose and meaning to my life.

I am pretty sure there is a property that tells you when the Word document has changed, but I am not sure what it is. I would suggest a quick Google search to find the property in Word. You should then be able to record a macro in Word to confirm which property it is, so you can use it in your code.

I like your approach for getting a wide variety of output formats, but for text it is overkill.

The rich edit control supports streaming data out as text by setting a type parameter:
SF_RTF gets the data as RTF.
SF_TEXT gets the text with the formatting removed.
SF_RTFNOOBJS gets RTF with COM objects stripped; each object's position is marked with \objattph followed by a space.
SF_TEXTIZED gets text with a text representation of COM objects.

For more info, read about EM_STREAMOUT on MSDN.
Example in C++

CString MyRichedit::GetText()
{
// Return the RTF string of the text in the control.

If you explore the formats available for SaveAs in the Microsoft.Office.Interop.Word.WdSaveFormat enumeration, you will find there is a doc format. I am wondering why you even want to translate it to a doc format, though. If you have an RTF file, Word will open it just like a doc file. Anyway, just open the RTF file, then make sure you pass in the correct format when you call SaveAs.

When I enter a Cyrillic character in RTF and convert it to HTML, I get a different character value.
For example, the value for the character 'й' in RTF is \'e9, but in HTML it is 1081 (hex value 439).
Why do the two differ? Could anyone please explain?
Thanks.
Karthik

It most likely depends on whether your HTML page supports Unicode. You set which character set the page will use. The default is usually ANSI, which doesn't support Unicode. So I think what you are seeing is that the default character set is not Unicode, and the character is being translated differently.
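One way to see where the two numbers come from, assuming the RTF file declares the Windows-1251 (Cyrillic) code page: \'e9 is a single byte in that code page, while 1081 is the Unicode code point of the same letter. In Windows-1251 the bytes 0xC0 to 0xFF map in order onto U+0410 to U+044F, so the relationship can be checked with a few lines of C#. The class and method names are made up for this sketch.

```csharp
using System;

public static class Cp1251Demo
{
    // Hypothetical helper: maps a Windows-1251 byte in the range 0xC0..0xFF
    // (the basic Russian alphabet) to its Unicode code point.
    public static int ToUnicode(byte b)
    {
        if (b < 0xC0)
            throw new ArgumentOutOfRangeException("b", "Only 0xC0..0xFF is handled in this sketch.");

        // 0xC0 is U+0410; the rest of the alphabet follows in the same order.
        return 0x0410 + (b - 0xC0);
    }
}
```

Cp1251Demo.ToUnicode(0xE9) returns 0x439, which is 1081 in decimal, so both values name the same letter in two different encodings. For general decoding, Encoding.GetEncoding(1251) is the usual route; on .NET Core and later that requires registering the code pages encoding provider first.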

Unfortunately, this solution can only convert files into formats that Word supports, and PDF is not one of those formats. When I have created PDF files, I have done one of two things. One is PDFWriter, which installs as a printer: you print to it and it prompts you for a file name. The other is the freeware CutePDF Writer, which also needs something like Ghostscript to work. It works the same way: it is installed as a printer, you print to it, and you are prompted for a file name.

If you want to convert to a PDF file without any Word formatting or images, there is a way to do it with .NET programming.

First convert the .doc file into a .txt file as above. Then read that text file line by line, put the text into a DataSet, and assign that DataSet to a Crystal Reports control. The report has built-in functionality to export to PDF. This is how I convert txt files into PDF files.

If you want that application, open http://www.bhaveshdave.8m.com/material.html and click on the link named "Multi-Convert 1.0.0.9".

I don't know for sure, but it looks like Word doesn't want you to close it down. I am guessing that you may be trying to close a Word instance that you did not start programmatically. Anyway, that is my guess.

That seems to work for me. Passing false into the print method makes the printing finish in the foreground rather than in the background. I believe your problem is that the printing is happening in the background and you are closing the Word document while it is still printing.

Usually you get this because your version of Office was installed before the .NET Framework. If you run the Office installer again, you should see a new option that talks about installing the .NET add-ins or developer code or something like that.

You won't have this issue if the .net framework was already installed before office was installed.

You should be able to use the code above with some modification. You need to make sure that when you save, you have the correct Word document format selected. Part of that depends on which DLL you bring into your .NET project.

If you were to do it manually, you would open the rtf in Word and then do a save as and select your word 2.0 doc format. So you can do the same thing with word automation. Open the rtf, pick the word 2.0 doc format and save.

I am not sure I understand your question. If you are asking how to convert RTF to HTML: in the sample program, load the RTF file name into the text box, select web doc filtered (*.html) from the dropdown, and click convert. Let me know if you are asking something else.

I had the same problem, so I bought source code plus compiled COM components that work great with C#. Unfortunately it's not free (the source code is $49, which is quite cheap). http://www.ireksoftware.com/RTFtoHTML/sources.html

Just to convert RTF to plain TXT you really don't need Word!
You write about converters not being free, but I guess most of them are a lot cheaper than Microsoft Word!
The idea to use word for this purpose is really strange when you can simply use the Text property of a RichTextBox to access the plain text!
This way is truly free, a lot faster, and easier to maintain because you don't have the Word reference hanging around.
Try this instead:
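A minimal sketch of the RichTextBox approach described above, not necessarily the commenter's exact code. It assumes a reference to System.Windows.Forms; the class and method names are made up for illustration.

```csharp
using System.IO;
using System.Windows.Forms;

public static class RtfToText
{
    // Hypothetical helper: loads an RTF file into an off-screen RichTextBox
    // and returns the plain text the control extracts from it.
    public static string Convert(string rtfPath)
    {
        using (RichTextBox box = new RichTextBox())
        {
            // Assigning to Rtf parses the markup; Text then holds only the text.
            box.Rtf = File.ReadAllText(rtfPath);
            return box.Text;
        }
    }
}
```

The control is never added to a form, so nothing is shown on screen. Setting the Rtf property throws an ArgumentException if the file does not contain valid RTF.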

Thanks for the tip.
I did not realize that a richtextbox would do that for me. I will have to try that out. In my existing project I was already doing word automation to direct printing and to have features like re-print page ranges on a print job. So for this project it is assumed that office is installed. I guess I figured a lot of businesses out there have Microsoft office installed as a common tool on their PC's.

Mav is right, the RichTextBox will work as well. However, we really need a better solution; ideally, we'd have a regular expression that stripped out the control words in an RTF string, giving us only the resulting text. I've yet to find one on the web though, and I'm not experienced enough with regular expressions to create one myself, unfortunately.

I tend to agree as well. Still I live in the world of tight deadlines, so even though I would like to do something better, this works and fits in the timeframe I have to finish the project. I thought seriously about writing something similar to what you are suggesting. The problem is I don't have the time.

Still, that requires the creation of a RichTextBox, with all the overhead involved. This also limits you to using the UI thread; what if I want to convert large amounts of RTF data to plain text using a background thread? Either way is insufficient. What we really need is a regular expression that will strip out all the control words from an RTF string...
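To illustrate what such a regular expression might look like, here is a deliberately naive sketch. It only handles flat RTF with simple control words; it does not decode \'hh escapes or escaped backslashes and braces, and it does not skip destinations such as the font table, whose contents would leak into the output. The class and method names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RtfText
{
    // Hypothetical helper: removes simple RTF control words and group braces.
    // A sketch under the limitations stated above, not a full RTF parser.
    public static string StripControlWords(string rtf)
    {
        // Turn paragraph marks into newlines before the control words are dropped.
        // The lookahead keeps longer words such as \pard from matching here.
        string s = Regex.Replace(rtf, @"\\par(?![a-zA-Z]) ?", "\n");

        // A control word is a backslash, letters, an optional signed number,
        // and at most one delimiting space.
        s = Regex.Replace(s, @"\\[a-zA-Z]+-?[0-9]* ?", "");

        // Drop the group braces that remain.
        return s.Replace("{", "").Replace("}", "");
    }
}
```

Because this is plain string processing, it has no dependency on a UI control and can run on any thread. Real documents contain nested groups, which is the main reason a single regular expression is hard to get right for RTF.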

As in my previous message, because of a tight deadline, this is the quickest solution that fits the timeframe I have. I am actually going to stay with Word automation: the class I am doing the work in has no visual interface, so I don't want to add UI namespaces to my using clauses just to go with the slicker RichTextBox solution. Again, my solution is already using Word automation for printing purposes. So this works for me, and I just thought I would share what I learned.

I cannot agree.
It is no problem to create a window like a rich text box in a non-UI thread. No message loop is needed to do the conversion. I wrote a small command line utility that does this work for me.

Well, we may just need to agree to disagree. I don't think it would be a step up to be calling a command line utility to do this conversion. At least I think that is what you are suggesting. Again, my point was and is that I am already using word automation to do printing of rtf's so it is not a big deal to use word automation to convert the file to txt.
Thanks,
Ben

I am not using the RichTextBox control from the .NET Framework!
I am using the Win32 Rich Edit controls.
You can use EM_STREAMIN/EM_STREAMOUT directly.
The control also supports a full COM interface, ITextDocument, with Open and Save functions to do the conversion.