Since Word 97, some versions of Microsoft Word have included a feature to save a document as HTML. This seems like a great capability. However, when I tried it on a client's Word 2000 document, the output contained so many Microsoft-specific HTML tags and unsupported fonts that it was basically worthless.

I tried to strip out the extra HTML tags, but finally gave up and created the HTML document from scratch. The hand-written document was a fraction of the size of the generated one that I discarded.

As a second test, I created a Word 2000 document with one line, "This is a Test". The HTML produced by the Save as Web Page option was 85 lines long! (Seems like half a dozen lines or so should do it.) I have had a similar experience with Excel files.

My question is, does anyone know of a utility that will take Microsoft Word or Excel files and create standard, simple, easy to read HTML files?

I had the same problem with a client-created file. Exported as HTML from Excel, it contained TONS of nasty stylesheet tags and classes, as well as lots and lots of unnecessary table cells.

For a spreadsheet of about 50 rows and 10 or so columns, the resulting HTML page was over 5000 lines long. Nice, eh?

One thing that was an amazing GODSEND to me was a text editor that could use regular expressions.

For example, each line contained the same font specification except that the size was different. The result was like this:

<style "font-size: 1.2em">

and

<style "font-size: 1.5em">

All I did was use regular expressions to look for this:

<style "font-size: 1. [followed by any one-digit number] em">

I then had the regex replace every matching line with nothing, which deleted them all in one pass.
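
For anyone without a regex-capable editor, here is a rough sketch of the same search and replace in Python. This is not what I actually ran (I did it in a text editor); the pattern is reconstructed from the description above, and the function name and sample markup are made up for illustration.

```python
import re

# Pattern assumed from the description above: the literal text
# '<style "font-size: 1.' followed by exactly one digit and 'em">'.
FONT_TAG = re.compile(r'<style "font-size: 1\.\dem">')


def strip_font_tags(html: str) -> str:
    """Delete every matching font-size tag, leaving the rest untouched."""
    return FONT_TAG.sub("", html)


# Hypothetical sample lines in the style of the exported file.
sample = (
    '<style "font-size: 1.2em">Row one\n'
    '<style "font-size: 1.5em">Row two\n'
)

cleaned = strip_font_tags(sample)
print(cleaned)
```

The `\d` matches a single digit, so sizes such as 1.2em and 1.5em are both caught, while a tag with a different leading number (for example 2.5em) is left alone.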

The first time I cleaned up the page by hand, it took me nearly an hour using normal find and replace. Once I realized that many of the lines were similar and started using regex, the time dropped to about 5 minutes.

Little over my budget & Access Reports

Regex looks interesting, but I gather that I must have Frontier for it to work. $899 is quite a bit over my budget. However, thanks for the reply; I feel better knowing that others have been in the same situation. Maybe I'll start looking for tools with better pattern matching, but I was hoping for a general-purpose freeware tool that would just clean this code up, or a competitor's tool that could read the Microsoft files and generate good HTML.

On another project, I attempted to generate HTML from Microsoft Access 2000. The code was all right, but it did not carry across the cell background color. Any ideas on a cheap HTML report writer so I don't have to hand code everything?

The editor that I use to work with RegEx costs about $30. It's called EditPlus (www.editplus.com) and you can even download a trial version to see if you like it. In my opinion, it is the best editor on the market for the PC.

Then you'll need to review RegEx to see what you need to look for. Here are a few pages that explain regex in more detail.

Dreamweaver has an option to clean up HTML documents created by Word. It significantly reduced the amount of code for the "This is a Test" test doc. It also has several options to customize exactly which tags you want removed. It seems to work pretty well.

I have to export racing championship points into one of my sites from Excel documents - to be honest the spreadsheets aren't that complicated so I just use the "extended replace" function in Homesite 4.5 to get rid of most of the unwanted tag attributes.
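
If you don't have Homesite, something similar can be approximated with a short script. The sketch below is an assumption on my part, not how Homesite's extended replace works internally: it uses a Python regex to drop every attribute from opening tags, and the function name and sample row are invented for the example.

```python
import re

# Matches an opening tag that carries at least one attribute, capturing
# only the tag name. Closing tags (</td>) and bare tags (<td>) do not match.
OPEN_TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)\s[^>]*>')


def strip_attributes(html: str) -> str:
    """Replace each opening tag with the bare tag name, dropping attributes."""
    return OPEN_TAG.sub(r'<\1>', html)


# Hypothetical row in the style of an Excel HTML export.
row = (
    '<tr height=17 style="height:12.75pt">'
    '<td class=xl24 width=64>Driver</td>'
    '<td class=xl25 align=right>42</td>'
    '</tr>'
)

print(strip_attributes(row))
```

Note that this removes all attributes, including ones you may want to keep (such as href on links), and it will misfire if an attribute value contains a ">" character, so it only suits simple tables like a points spreadsheet.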