If this is your first visit, be sure to
check out the FAQ by clicking the
link above. You may have to register
before you can post: click the register link above to proceed. To start viewing messages,
select the forum that you want to visit from the selection below.

Parsing Word Document & Export to Excel

Hi...

I'm a third year software design student and I'm currently on my work placement. I have never used VB before, but my research has led me to believe that this is the best method...I have been given the task of programatically parsing hundreds of word (2003) documents for information stored in certain tables. The documents comprise of paragraphs and tables. I need to extract the info in preferred tables and export it to excel...Can anyone help please. I know it can be done but I've been researching this for a week now and I really am pulling my hair out at this stage. If anyone can help or point me in the right direction or even link a solid tutorial I would be so, so grateful.

Re: Parsing Word Document & Export to Excel

how do you define which tables in each document contain information you want?

vb Code:

for each t in doc.tables' where doc is a document object

if t.cell(1, 1).range.text = mystring then' or some other criteria

'do stuff here

endif

next

are you writing your code in vb6 or inside some office application (VBA)?

i do my best to test code works before i post it, but sometimes am unable to do so for some reason, and usually say so if this is the case.Note code snippets posted are just that and do not include error handling that is required in real world applications, but avoid On Error Resume Next

dim all variablesas required as often i have done so elsewhere in my code but only posted the relevant part

come back and mark your original post as resolved if your problem is fixed
pete

Re: Parsing Word Document & Export to Excel

I think I would drive this from within an Excel VBA module. Hopefully, all your Word docs are in one directory! If that directory is a constant, you can hard-code your file path; otherwise you're going to have to set up a way to search all in one directory and then ask if you need to do more, if so navigate to the next directory and start again.

You can set a document object for each opened document, and search for the table you need. I hope there's something consistent and unique that identifies only the one table you need in each doc?? Likewise, are the table cells you need consistent for each applicable table?

As you copy the data into Excel, is it the same data type each time? For instance: first column = date; second = name; third = amount?? As you change docs, do you want to skip a line in the Excel sheet? Do you want each doc's info on a separate sheet?

Re: Parsing Word Document & Export to Excel

Hey Guys thank you so much for all of your input.

As per Ed_from_AZ's suggestion I decided to drive from within an application. However, my requirements have since changed slightly.

I now need to parse selected chapters within the word document. So at the moment I'm using a checklist populated from a config file (all chapter names) (requirement). Based on the selection(s) I have to locate chapter(s) within the word doc (chapter headings) once identified I need to copy that full chapter export into an excel doc. An issue is also the chapter names appear numerous times throughout the doc, so the only way to differentiate the chapter heading from any other mention is by heading style. All the chapter headings are in 'h2' style. Chapters include both paragraphs and tables. I am running out of excel, I wish to export to a different spreadsheet.

Here is what I've got so far, this does locate the chapter heading, but that is it. It is also quite slow, these documents are 20,000+ words, so if I select the final chapter, it takes 30-40 seconds to hit. I would like to keep the code pretty much along the same lines (same logic), but if anyone can please help I would greatly appreciate it!!

I have made pretty good progress (I think) but unfortunately progress is slowing and I'm coming up against a brick wall as I don't really have much experience in vba

Re: Parsing Word Document & Export to Excel

Re: Parsing Word Document & Export to Excel

When you say: "I am running out of excel, I wish to export to a different spreadsheet.", do you mean you're running out of room in the sheet to place your data? And you want to use a different program? Or you want to overflow onto a new worksheet? Why aren't you using the 2007 or 2010 versions (xlsx or xlsm)? These give you over a million rows and columns to XFD, vice 65536 rows and columns to IV.

Your config file: it contains the doc name and the specific chapters within that doc that you need to find? If not, how are you knowing which chpter in which doc?

When you say: "I need to copy that full chapter export into an excel doc" and "Chapters include both paragraphs and tables.", what do you need to copy into Excel and what is it supposed to look like when you're done? Whole chapters in one cell? Paragraphs in individual cells, and tables in an Excel table? Are there any special requirements that must be met to allow this data to be manipulated within Excel? Or is this just a report and the inrformation is imported for display only?

One option to get the whole chapter might be:
-- Open doc
-- Set bookmarks to every H2 chapter heading
-- Iterate through bookmarks until you hit one that matches the text in your config file
(This assumes you have an array or at least a string of all the chapter titles for that doc that you can match against.)
-- When you get a hit, grab the index number of that bookmark, and set a range from doc.bookmark(index) to doc.bookmark(index + 1).

That should give you a range to that chapter; with that, you can copy, parse, find tables, whatever.

Re: Parsing Word Document & Export to Excel

Oh - and if you're running this from within Excel VBA, you might consider pulling your config file list (if it's CSV or similar format) into a sheet in your Excel workbook, vice holding populated arrays in memory. Then you can just iterate down the listings in the sheet by row and column and know what you're looking for.

Re: Parsing Word Document & Export to Excel

Ed_from_AZ...There are already bookmarks within the doc, so I'm just wondering, the word documents I'm working on have an index, each chapter listed in the index has a hyperlink that when clicked brings you straight to the chapter. Can I use these in someway to extract the whole chapter between hyperlinks?

I would really appreciate a dig out with this as I've a progress meeting on Tues as we've a bank hol here on mon.

Re: Parsing Word Document & Export to Excel

Sounds like an auto-genereated Table of Contents. And with that, each of your chapter headings would indeed have a bookmark as a place for the hyperlink in the TOC to jump to. Sure, those bookmarks can be used - if you know what they are and they are consistent between docs.

You're looking at a loop, which contains repeating instructions to be run for each desired chapter in a doc - and then needs to be repeated for each desired doc. If the chapter heading bookmarks are all the same in each doc for each chapter you need to parse, then by all means use what's there. Otherwise, I'd set up my own bookmarks just so I could write code that knows exactly what to look for each and every time.

In "air code", what I envision would look something like:

vb Code:

For Each doc in your collection of docs

' Set bookmarks

numBkmark = 1

For Each H2 header

Set range to header range

Set bookmark to range _

bkmk.name = "Chapter_" & numBkmark

numBkmark = numBkmark + 1

Next H2 header

' Iterate bkmks to find the right chapter

numBkmark = 1

For Each bkmark in doc.bookmarks

If bkmark.name = "Chapter_" & numBkmark Then

If bkmark.name is in your list of chapters

' We've got a match!

Set a range _

Start:= bkmark.range.start, _

End:= next bookmark.range.start

'This gives you a range for everything

'between the two chapter headings

'Now you can extract the text, tables, etc.

EndIf

EndIf

numBkmark = numBlmark + 1

Next bkmark

Next doc

Anyway, that's how I would start it.
Hope I said it so it makes sense to you, too!! :8>)

Re: Parsing Word Document & Export to Excel

What I see as the building blocks for this are:

-- A way to create a collection of documents to work on, or
A method to use the doc path and name to get to the doc
-- A list that identifies which chapters you need to parse:
Is there a consistent chapter title? or
Will you need to identify a certain table or text within a chapter?
-- A loop that will iterate all the chapter headings or chapters, identify the right one(s), and grab the table, text, photo, whatever
-- A procedure that will pull the desired element(s) from the doc into the workbook in a useable manner
Does each doc and/or chapter need its own worksheet?
Are you going to try to aggregate some or all of this info? (Don't do it on the fly - get it all into the workbook first, and then you can work with it.)
Can you get someone to give you an idea of what a "useable manner" looks like??
-- Error handling:
Doc doesn't have the chapter - do you need to send someone a red flag?
What's in the doc range refuses to copy over into the worksheet - how should you handle it?
The doc range exceeds the limits of the worksheet - how should you handle it?
Someone else is using the doc when your code wants it - can you detect that and come back to it?

Flesh out the project as much as possible: flowchart the operation; detect loops within loops and If/Then branches; look at what you need to indentify everything you must work with - doc, workbook, worksheet, ranges in both Word and Excel, bookmarks, paragraphs, tables, etc. What are the blind spots? What are the possible hiccups and errors?

Craft some questions and put *them* on the spot!! What do you need from them to write good code? You can write any code they want, but you can't read their minds to know what they want. Intelligent questions can make you look better than lots of info.

Re: Parsing Word Document & Export to Excel

So these are the bookmarks/hyperlinks for each chapter, you can see there are 52 chapters. They are in numerical order, so I'm guessing I could use these? You see I'm populating a multi list box from a config file (which is a list of the chapters), so based on the selection, thats the chapter I need to parse.

At the moment selecting a folder to process.

At the moment the requirement is to parse the whole chapter and "dump" it into an excel spread sheet. So tbh, if I was that far for Tues, I would be extremely happy!!

Re: Parsing Word Document & Export to Excel

If you can translate "Toc292701022" into the chapter you need, then you're good to go. Try using the GoTo feature and type in one of these bookmark names and see if you go directly to the desired chapter.

Are you going to have to do all these one at a time? Open a doc, populate a list box, manually choose a chapter, and then run your code? Or is this just to create a demo of yhow you can get the info into Excel?

What does a chapter look like? Is it all paragraphs of plain text? Will you have graphics, graphs, charts, tables, hyperlinks?

Re: Parsing Word Document & Export to Excel

I tried putting in the link into goto but to no avail....

How it works at the moment, as I'm running this out of excel vba, I open the workbook (multi list already populated), I tick relevant chapters, then hit a parse button, this then opens a folder picker, which allows me to select the folder that contains the documents for parsing and hit ok, the program (at the moment), then opens the documents one by one and goes through them, finds the heading brings up a text box (for testing) and closes the doc, and moves onto the next...

I hope to find the heading, extract the chapter, then copy it to a worksheet, then close the doc and move to the next.

That is how it is now, eventually they want to link it to an internal tool within the company to parse docs stored in that, but I don't need to worry about that at the moment.

Chapters contain both paragraphs and tables..

I'm using office 2003

Thank you very much for your help thus far, I've gotten this far and now I'm pulling my hair out!

Re: Parsing Word Document & Export to Excel

could you set ranges by heading style rather than bookmarks?? so from one h2 style to the next?? I don't know and again, I'm really appreciative of your help as I'm quite inexperienced in vba, I am proficient in Java & c++, but when it comes to vba, it's absolutely learn as you go...

Just to add, I have posted a similar style question to this on another forum, I found out today that cross posting is a no no, I honestly wasn't aware of this, I thought that you post to forums to ask the experts and hope for some good advice....I've posted the link to my other post below. Apologies!

Re: Parsing Word Document & Export to Excel

I have managed to locate the bookmarks, they were hidden bookmarks, now unfortunately they aren't in order (numerically).

e.g.

_Toc129482794
_Toc292701023
_Toc63137014
_Toc63137015

these do work!

Now my thinking is if there is some way (a way of which I don't know) to incorporate these with the chapter listings in the config file, then once selected in the multi list box, i can go to bookmark rather than heading.

Re: Parsing Word Document & Export to Excel

Re: Parsing Word Document & Export to Excel

I've come up with something - hope it works for you. I tried to attach the files, but it wouldn't go. So here's the code that I put in a code module in an Excel file.

Sheet1 of this file is set up with your chapter headers in column A, and an X by some of them in column B. The code runs down column B - if there's an X (or x - it will catch either case), it grabs the text of that chapter.

Hope it gives you a start.

Ed

Code:

Sub GetMyStuff()
'*******************
'Object declarations
'
'To declare Word objects, set a reference
'to the Word object library.
'Tools >> References >> Microsoft Word
'
'(Since both Word and Excel have range objects,
'but they have different Methods and Properties,
'we must specifically declare as Word or Excel
'range to avoid confusion.)
'*******************
'Set Excel objects
Dim XLwkb1 As Workbook 'This workbook
Dim XLwkb2 As Workbook 'Workbook for data
Dim XLwks1 As Worksheet 'Worksheet with config
Dim XLwks2 As Worksheet 'Worksheet to hold data
Dim XLrngC1 As Excel.Range 'Individual cell in config wksht
Dim XLrngC2 As Excel.Range 'Individual cell in data wksht
Dim XLrngW1 As Excel.Range 'Range of cells in wks1
Dim XLrngW2 As Excel.Range 'Range of cells in wks2
'Set Word objects
Dim appWD As Word.Application
Dim WDdoc As Document 'Report document
Dim WDrngCh As Word.Range 'Range for one complete chapter
Dim WDrngHdr1 As Word.Range 'Range for one chapter header
Dim WDrngHdr2 As Word.Range 'Range for next chapter header
'Other objects
Dim strRptP As String 'Complete path for report
Dim strRptN As String 'Doc name for report
Dim strChp1 As String 'Title of chapter to grab
Dim strChp2 As String 'Title of next chapter
Dim numRow As Long 'Used for Excel row number
Dim arrName 'Used to extract doc name from path
'*******************
'Main routine
'*******************
'Set initial Excel objects
Set XLwkb1 = ActiveWorkbook
Set XLwks1 = XLwkb1.Worksheets("Sheet1")
'Find and open your Word doc
Set appWD = New Word.Application
strRptP = Application.GetOpenFilename
Set WDdoc = appWD.Documents.Open(strRptP)
appWD.Visible = True
'Get name of report
strRptN = WDdoc.Name
'Remove doc type from name
arrName = Split(strRptN, ".")
strRptN = arrName(0)
'Initialize the search range
Set WDrngHdr1 = WDdoc.Content.Duplicate
WDrngHdr1.Collapse wdCollapseStart 'This point is now the beginning of the document
Set WDrngCh = WDrngHdr1.Duplicate
'Create new workbook for search results
Set XLwkb2 = Workbooks.Add
Set XLwks2 = XLwkb2.Worksheets("Sheet1")
'Format first column
With XLwks2.Range("A:A")
.WrapText = True
.ColumnWidth = 65
End With
'Stop screen flashing
'Comment out or remove if you want to observe it working while testing
'Application.ScreenUpdating = False
'appWD.ScreenUpdating = False
'Find your X
For Each XLrngC1 In XLwks1.Range("B2:B12") 'Iterate through cells B2:B12 to find the X
If XLrngC1 = "x" Or XLrngC1 = "X" Then 'Big X or small x in cell
strChp1 = XLrngC1.Offset(0, -1) 'Get text in cell on same row, one column back
strChp2 = XLrngC1.Offset(1, -1) 'Get text from one row down, one column back
'Set the search range to the unsearched portion of the document
WDrngCh.SetRange _
Start:=WDrngHdr1.End, _
End:=WDdoc.Content.End
'Find chapter name and grab if it's style Heading2
pointGetChp1: 'Set a return point
WDrngCh.Find.Execute Findtext:=strChp1, Forward:=wdFindStop
'Check for Style; if wrong, it's not the chapter heading
If WDrngCh.Style <> "Heading 2" Then _
GoTo pointGetChp1 'Go back and try again
'GoTo CleanUp
'If we're here, we have found the heading for the chapter we need.
'WDrngCh encompases just the text we have found, which is only the heading.
'We need to find the heading of the chapter after this, the get the text in between.
'Set the Hdr1 range to a duplicate of the current range
Set WDrngHdr1 = WDrngCh.Duplicate
'Set the Chapter range to the rest of the document
WDrngCh.SetRange _
Start:=WDrngHdr1.End, _
End:=WDdoc.Content.End
'If the last chapter heading on the config worksheet is X'd, then strChp2 is blank.
'We'll skip the next Find and just grab the remaining text
If strChp2 <> "" Then
'Repeat Find to get next chapter name if it's style Heading2
pointGetChp2: 'Set a return point
WDrngCh.Find.Execute Findtext:=strChp2, Forward:=wdFindStop
'Check for Style; if wrong, it's not the chapter heading
If WDrngCh.Style <> "Heading 2" Then _
GoTo pointGetChp2 'Go back and try again
'Set the Hdr2 range to a duplicate of the current range
Set WDrngHdr2 = WDrngCh.Duplicate
'Get the text in between
WDrngCh.SetRange _
Start:=WDrngHdr1.End, _
End:=WDrngHdr2.Start
End If
'Set our text into new workbook, Sheet1
'Find the first empty cell in Column A and add 2
numRow = XLwks2.Range("A10000").End(xlUp).Row + 2
'Put chapter heading in first cell
XLwks2.Range("A" & numRow).Value = WDrngHdr1.Text
'Skip a cell and copy chapter text
WDrngCh.Copy
XLwks2.Range("A" & numRow + 2).PasteSpecial xlPasteValues
'If you stop here and look at the pasted-in results,
'Excel has put each paragraph in one cell.
'Not sure what it will do with tables, graphics, etc.
'This completes the grab for one chapter
'Lather, rinse, repeat until finished
End If
Next XLrngC1
CleanUp:
'Screen updating on
Application.ScreenUpdating = True
appWD.ScreenUpdating = True
'Save and close the Excel file
'If you have a designated folder, add it into the Filename
XLwkb2.SaveAs Filename:=WDdoc.Path & "\" & strRptN
XLwkb2.Close
'Close the report doc
WDdoc.Close wdDoNotSaveChanges
appWD.Quit 'Closes and quits Word
Set appWD = Nothing 'Releases the Word object
End Sub

Re: Parsing Word Document & Export to Excel

Hi Ed,

Thank you for putting so much work into the answer for me.

However, unfortunately, I'm trying to, once selected, go directly to that bookmark, then set the range from that bookmark to the next, then I can copy that range....thank you so much for helping me so far, it really has been appreciated.

Any thoughts on how I might proceed using the table of contents bookmarks? If you do feel inclined to have a look at this problem with me, I've attached a sample doc of which I would be working on...

Re: Parsing Word Document & Export to Excel

I did some playing with your sample doc. I don't think we're going to be able to use the Table of Contents bookmarks. (Probably someone else with a lot more experience in Word VBA could, but I don't think I'm going to get there!) Word has a lot of hidden bookmarks. The TOC bookmarks are hidden, and you can't access them by code unless you know the secret password:

vb Code:

ActiveDocument.Bookmarks.ShowHidden = True

Once that vat is open, though, there's a ton of stuff! Now you have to sift and sort through all the hidden bookmarks by name and range and anything else you can think of. Because your search text is goin to be found in several of them - I don't know why. And going to the place in the document for that specific bookmark is also difficult.

So I changed the code to search for your chapter/subchapter text and verify the style. The code searches from the end of the document for the text in the Config file, so it bypasses the TOC altogether. The only hiccup during test was that your Heading 2 style overflowed into your body text at one point, so it caught the caption text in the body paragraph instead of the header. And it doesn't treat tables very well - but that can all be smoothed out as they decide what they want to do with the data.

So here's the updated code. I do have several Stop points in there to allow you to see what's going on. Very useful for me since I have two monitors and I can put the code on one and the test doc on the other!

Re: Parsing Word Document & Export to Excel

Ed I'm wondering, based on my selection in the multi list box, it is possible to put an 'x' in the corresponding cell in the next column? That way, I could incorporate your answer into my program. The parsing has to be based on selection, so consider the following scenario:

If you select a chapter heading in the multi list...an x in placed in the adjacent relevant cell. That way when your code searches for the x in column b it will be there based on selection, then once parse we can clear column b?

Re: Parsing Word Document & Export to Excel

Well, the col B 'x' was just my way of setting up the list of what chapters to grab. It just made it easy to know which two headings we were working with, because for my code you needed the chapter heading you wanted to grab and the next one in line so you knew where to stop selecting.

So you're populating a ListBox ... from "Config!A1:A45"? I would:
-- grab the returned value from the ListBox
-- find the row number of that value in your A1:A45 list
-- drop down one row to get the next value
-- use those two values as WDrngHdr1 and WDrngHdr2

That eliminates the need for an extra column of "X"s next to your Config list.

The other thing you could do is simply view the Config list page and set it up so when you put an "X" by a heading, it grabs it, runs the code, and then clears the "X". Can be fired either by a Selection_Change or Worksheet_Change macro in the worksheet or by a button in the worksheet. That would eliminate the ListBox altogether. That would seem simpler to me - but I don't know why you have the requirements for the ListBox.

Re: Parsing Word Document & Export to Excel

I am required to develop a tool to be used within the organisation to allow users to parse relevant chapters of the IS/IP reports. So they wish to be able to click on the relevant chapter to parse. The idea is to make it user friendly, and one of the requirements is a multi list box in order to click on the chapter they wish to parse.

If you have a quick squiz at my code below, where I populate the checklist, I then add each "checked item" into a collection.

Could you suggest as to how I

-- grab the returned value from the ListBox
-- find the row number of that value in your A1:A45 list
-- drop down one row to get the next value
-- use those two values as WDrngHdr1 and WDrngHdr2

I know at this stage you probably feel like you're holding my hand, .....but I have worked my way up to this point with no or very little experience in VBA, it's just at this stage, i'm kinda stuck

Re: Parsing Word Document & Export to Excel

No problem at all with the hand-holding! I'd love to say I learned all this all by myself - but truth be told my hand was held many a time! Two things I have learned though - F1 (Help) and Google are your good friends.

With a ListBox, Value returns the value of the selected item as a text string, and ListIndex returns the position in the list of the selected item as Long I think. A gotcha here - the ListIndex starts with 0, not 1. So if you populate your listbox in the same order as the items are listed in A1:A45, Value will show you the selected header, and ListIndex + 1 will give you the row it's in on the Config sheet. Something like this: