PW 101 V8i - Chapter9 - Document Indexing

ProjectWise 101 – Chapter 9
Document Indexing
Gary Cochrane – Technical Director
Geospatial Sales – North America
Introduction
• ProjectWise Document Indexing
– Really means three things
• Full Text Indexing, in support of full text searching
• Thumbnail Extraction
• Document Property Extraction
– We won’t cover this one in PW101
– See Bentley Institute PW Admin course guide for this
Full Text Indexing
• We did not write the engine for this
– But elected to use the one Microsoft provides
• Included with every copy of Windows
– That engine is called the MS Indexing Service
• And it was installed in the VM as an optional Windows component
– Microsoft indexes the following file formats
• MSWord, Excel, PPT, HTML, XML, TXT
Pre-installed in VM
• ProjectWise Integration Server
• ProjectWise Orchestration Framework
• MicroStation V8i-SS1
• Supported Database Engine
• Microsoft Message Queuing Service
• Microsoft Indexing Service
• Microsoft .NET Framework 2.0
• Windows Server 2003 with SP2
Extending the MS Index Service
• Microsoft provides an SDK for third parties to
extend the Indexing service
– So the Indexing service will know how to “filter” files
from that vendor
• For instance, Adobe provides an “iFilter” that teaches the MS
Index Service how to extract text from a PDF file
• The Adobe PDF iFilter is installed with Acrobat Reader V9x
Indexing Overview
• Within PW, Indexing consists of:
– Scheduling
• A process that wakes up, checks for new (or modified) files,
adds them to the Copy-out queue, and goes back to sleep
– Copy-out
• Copy the file from the Storage Area, to the machine running the
Indexing Service. Then add file to the extraction queue.
• Remember, files may be stored on multiple servers
• Also, in large installations, a machine may be dedicated to
indexing
Indexing Overview – Part II
• Overview – continued
– Extraction
• This process gets the text from the file and adds it to the MS
Index catalog. Then adds the file to the Update queue
– Update
• This process sets the flag on the file (in the PW database) that
says it is “done”
• New files are added with the flag set to “undone”
• Check-out/in causes the flag to be set to “undone”
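The four stages above can be sketched as a chain of queues. This is a minimal illustration only, not ProjectWise code: the `Document` class, `run_pass`, `check_in`, and the dictionary used as a catalog are all hypothetical stand-ins, and the "copy-out" is simulated by reading a string.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Document:
    name: str
    text: str = ""
    indexed: bool = False  # the "done" flag; new files start as "undone"


def run_pass(documents, catalog):
    """Run one scheduler wake-up; return how many documents were processed."""
    # Scheduling: find new or modified files and add them to the copy-out queue.
    copy_out_queue = deque(d for d in documents if not d.indexed)

    # Copy-out: stand-in for copying the file to the indexing machine.
    extraction_queue = deque()
    while copy_out_queue:
        doc = copy_out_queue.popleft()
        extraction_queue.append((doc, doc.text))

    # Extraction: add each word to the catalog, then queue the file for update.
    update_queue = deque()
    while extraction_queue:
        doc, local_copy = extraction_queue.popleft()
        for word in local_copy.lower().split():
            catalog.setdefault(word, set()).add(doc.name)
        update_queue.append(doc)

    # Update: set the flag that says the file is "done".
    processed = 0
    while update_queue:
        update_queue.popleft().indexed = True
        processed += 1
    return processed


def check_in(doc, new_text):
    """Check-out/in with changes sets the flag back to "undone"."""
    doc.text = new_text
    doc.indexed = False
```

Calling `run_pass` twice in a row processes nothing the second time; only a `check_in` makes a document eligible again.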
A note on “done”
• Done does not necessarily mean it was
successful
– It means the file has been processed
• In other words, what happens if an unknown file (e.g., an AutoCAD
file) is sent to the Indexing Service?
– The file is attempted…
• And the indexing service says, “I don’t know how to extract text
from this file”
– There would be no point in trying the file again
• So it is marked as “done”, even when unsuccessful
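A small sketch of the "done is not the same as successful" idea. The `EXTRACTORS` table and the `process` function are hypothetical and only illustrate the bookkeeping; real iFilters are components registered with Windows, not Python functions.

```python
import os

# Hypothetical extractors keyed by file extension (assumption for illustration).
EXTRACTORS = {
    ".txt": lambda raw: raw,
}


def process(filename, raw, status):
    """Attempt extraction once and mark the file done whatever the outcome."""
    extension = os.path.splitext(filename)[1].lower()
    extractor = EXTRACTORS.get(extension)
    text = extractor(raw) if extractor else None
    # "done" records that an attempt was made; "success" is tracked separately.
    status[filename] = {"done": True, "success": text is not None}
    return text
```

A file with an unknown extension ends up with `done` set and `success` unset, so it is not retried on the next pass.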
MicroStation and AutoCAD
• ProjectWise provides a mechanism to index the
text from these file types
– Instead of writing an iFilter, Bentley elected to:
• Copy-out the file
• Run MicroStation in the background, extract all the text, and
write it to an XML file
• Send the XML file to the Indexing Engine
– Since MicroStation can parse DWG as well…
• This method saved us from having to write two iFilters
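To make the "write it to an XML file" step concrete, here is a rough sketch of wrapping extracted text strings in XML. The element names (`document`, `text`) and the `source` attribute are invented for illustration; the actual layout of the file MicroStation produces is not described here.

```python
import xml.etree.ElementTree as ET


def text_to_xml(source_name, strings):
    """Wrap text strings pulled from a design file in a simple XML document."""
    # Hypothetical schema: one <text> element per extracted string.
    root = ET.Element("document", {"source": source_name})
    for value in strings:
        ET.SubElement(root, "text").text = value
    return ET.tostring(root, encoding="unicode")
```

Because XML is one of the formats the Indexing Service already understands, the resulting file can be indexed without a DGN or DWG iFilter.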
Summary
• So within ProjectWise, we index:
– Word, PPT, Excel, XML, HTML, TXT
– Adobe PDF
– DGN and DWG
• More good news
– iFilters can be found for many file formats
• Some free, and some for purchase
PW Orchestration Framework
• Remember when we installed this?
– PWOF is responsible for managing batch processes for
ProjectWise
• This includes all those processes discussed on the previous slides
– For Full Text Indexing, that means
• Scheduler process, Copy-out process, Extraction process,
Updater process, and the MicroStation instance running in the
background
Lab 1a
• PW Orchestration Framework
– Start the Windows Task Manager
• Hint: Right-click on empty part of Taskbar
– Examine memory usage
• On the Performance tab
– Switch to Processes tab
• Sort by Mem Usage column (descending)
• Look for ustation.exe
• Look for DmsAfpEngine(s)
– Lots of memory consumed here…
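The same check can be scripted instead of read off Task Manager. The sketch below parses text in the layout produced by the Windows command `tasklist /FO CSV /NH` (image name first, memory usage fifth, for example "85,120 K"). The function name and the name prefixes it looks for are assumptions chosen to match this lab, and the column layout is assumed to be that of an English-language Windows install.

```python
import csv
import io


def top_memory(tasklist_csv, names=("ustation.exe", "dmsafpengine")):
    """Return (image name, KB) pairs for matching processes, largest first."""
    found = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 5:
            continue  # skip blank or malformed lines
        image, mem = row[0], row[4]
        if image.lower().startswith(names):
            digits = "".join(ch for ch in mem if ch.isdigit())
            if digits:
                found.append((image, int(digits)))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Feed it the captured command output as one string; the result is sorted by memory usage in descending order, like the Mem Usage column in the lab.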
Lab 1b
• Now open Services dialog
– Remember “gears” icon on Quick-Launch
• Locate PW Orchestration Framework service
– Select the PW OF service, and choose> Stop
• Watch memory usage in Task Manager
– For remainder of exercise, we need PWOF running
• So start it back up now
• Note PWOF is configured for automatic startup
– It will run each time machine is booted
– Close Services and Task Manager
Lab 2a
• Open PW Administrator
– Log in as> adminpw
– Drill down to:
• Document Processors> Full Text Indexing
– Right-click, choose> Properties
Lab 2b - Full Text Indexing
Accept default, unless
Indexing is to be run on
another machine
Turn on
adminpw
adminpw
Set to 60
Lab 2c - Full Text Indexing
Enable all times in
the schedule
Set to 2
Lab 2d
• Switch to File Type Associations tab
– Press> Add
• In the Extension field, enter> DWG
• In the bottom field, enter> DGN
– So that DWG files are processed as if they were DGN
– Press> OK
Lab 2e
Lab 2f
• Still on the File Type Associations tab
– Again, press> Add
• In the Extension field, enter> itiff
• In the bottom, enable> Do not process these documents
– You can’t extract text from a raster so this prevents wasted
file transfers
– Press> OK
• Press OK again
– To close the Full Text Indexing Properties
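The two associations entered in this lab behave like a small lookup table. The sketch below is only a model of that behavior; the `associations` dictionary, the `SKIP` marker, and `resolve` are hypothetical names, not part of ProjectWise.

```python
SKIP = None  # stands for "Do not process these documents"

# Extension -> extension to process it as, or SKIP.
associations = {
    "dwg": "dgn",    # DWG files are processed as if they were DGN
    "itiff": SKIP,   # rasters contain no extractable text
}


def resolve(filename, table):
    """Return the extension a file should be processed as, or None to skip it."""
    extension = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if extension in table:
        return table[extension]
    return extension  # no association: process under its own extension
```

With this table, `resolve("plan.DWG", associations)` gives `"dgn"`, while an `.itiff` file resolves to `None` and is never transferred.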
Lab 2g
• Open Task Manager again
– Switch to Performance tab
• Within 2 minutes, you should see heavy CPU usage
• Memory usage will also go up
– Up to 60 documents will be indexed in the first pass
• If there are more than 60 documents to be done, then they will
be queued in the next pass
– 2 minutes from now
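A quick back-of-the-envelope helper for the settings used in this lab (60 documents per pass, one pass every 2 minutes). The function names are made up for illustration, and the estimate ignores how long each pass actually takes to run.

```python
import math


def passes_needed(pending, batch_size=60):
    """Number of passes required to work through the pending documents."""
    return math.ceil(pending / batch_size)


def minutes_until_last_pass(pending, batch_size=60, interval_minutes=2):
    """Rough upper bound on when the final pass starts, in minutes from now."""
    return passes_needed(pending, batch_size) * interval_minutes
```

For example, 150 pending documents need 3 passes, so the last pass starts within roughly 6 minutes.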
Analysis
• All documents will eventually be processed
– When done, the index will be ready for fast full text
searches
• Once the indexer has caught up, future load will be lighter due to
only processing incremental documents
Lab 3a
• When done, close Task Manager, open PW
Explorer
– Log in as user1
• From the main tool box, select> Find Documents
– Binocular icon
• Change to Full Text tab
– Enter Look For> detail
• Press OK to start search
– Then Close the Search dialog
• Your results should include DGNs, DWGs, and PDFs
Lab 3b
• Browse to:
– User1/Document Indexing/MS-SHT
• These files were not indexed successfully because they have an
unknown extension
• But they were attempted, and flagged as done
• Return to PW Administrator
– Select datasource name (pwdemo)
• Right-click, choose> Properties
• Change to Statistics tab
• Choose Refresh
• Review Full Text Statistics
– Close dialog
Lab 3c
• While still in PW Administrator
– Open Full Text Indexing Properties again
• Switch to the File Type Associations tab
– Press Add
• In the Extension field, enter> SHT
• In the bottom Extension field, enter> DGN
– So that SHT files will be processed as if they were DGN files
• Press OK to complete the Extension mapping
– Press OK again to close the Properties dialog
Lab 3d
• Once new file type has been added…
– Now a small problem
• These files were flagged as done, and the Indexer won’t try them
again unless they are checked out/in
• And even that won’t work unless you actually make changes…
• PW compares files to version on server, and doesn’t transfer
back if there are no changes
Lab 3e
• Rather than check them all out, and back in
– From PW Administrator
• Right-click Full Text Indexing
– Choose>
• Mark folder Documents for Reprocessing
– Browse “…” to
• User1/Document Indexing/MS-SHT
– Press OK
• Press OK again
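Conceptually, marking a folder for reprocessing presumably amounts to setting the "done" flag back to "undone" for every document in that folder. The sketch below models that with a plain dictionary of path to flag; the function name and data layout are assumptions, not the ProjectWise implementation.

```python
def mark_folder_for_reprocessing(done_flags, folder):
    """Reset the done flag for every document under folder; return the count."""
    prefix = folder.rstrip("/") + "/"
    count = 0
    for path in done_flags:
        if path.startswith(prefix) and done_flags[path]:
            done_flags[path] = False  # the indexer will pick it up on the next pass
            count += 1
    return count
```

Documents outside the chosen folder keep their flag, so only the MS-SHT files are sent through the indexer again.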
Analysis
• Within 2 minutes, these documents will be reprocessed
– If you run the search again (in a few minutes), you
should also get SHT files in your results
– Re-visit Datasource statistics to see if the Full Text
categories have changed
Summary
• Once the index is created,
– You can stop the PW Orchestration Framework service
• It is used to create the index, but not to search the index
– This will save memory, and CPU cycles
• So in a demo, your machine will run faster
• BUT new (or modified) files will not be re-indexed
– Up until now, the PWOF was not being used at all
• Full Text Indexing is the first time we’ve needed PWOF, even
though it has been running since installation
PW Thumbnails
• PW Thumbnails is not “indexing” in the proper
sense, but it is similar in nature to Full Text
– PW Thumbnails extracts a thumbnail from the
document, and stores a copy in the PW database
• This allows one to browse PW Explorer, and see thumbnails in
the Preview Pane
– Not all file types support thumbnails
• Among those that do, some don’t do it per the industry standard
Thumbnails – Part II
• Important to remember
– ProjectWise does not create thumbnails
• It only extracts what might be in the file
– A good test is to check to see if Windows Explorer
displays a thumbnail for the file
• If it does, then PW should as well
Lab 4a
• Open Windows Explorer
– Browse to:
• C:\PW-101 Class Files\Document Indexing\MS-V8
– Change to Thumbnail display
• MicroStation V8 files have thumbnails
Lab 4b
• Browse through remaining Document Indexing
folders
– Note which include thumbnails
– Additional notes
• PDF files take a long time because you are really looking at a
small view of the whole file, not a thumbnail
• AutoCAD doesn’t adhere to the industry standard
– These files only display correctly because MicroStation is
installed, and is responsible for displaying a thumbnail
– Autodesk may have fixed this in later versions?
Lab 5a
• Open PW Administrator
– Log in as> adminpw
– Drill down to:
• Document Processors> Thumbnail Extraction
– Right-click, choose> Properties
• Similar to Full Text Indexing
– But actually less involved
Lab 5b
Turn on
adminpw
adminpw
Set to 60
Lab 5c
Enable all times in
the schedule
Set to 2
Lab 5d
• No changes required on the File Type
Associations tab
– Press OK to complete the configuration and close the
dialog
• Within a few minutes, thumbnails should show up in the preview
pane
Analysis
• Thumbnails are extracted and stored in the PW
database
– Because document storage may not be local
• Thus “touching” the document to see thumbnail in real-time is
not practical
– Thumbnail notes
• Requires less processing than full text
– MicroStation not running in this process
– Requires PWOF to extract, but not to display
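The "extract once, display from the database" idea can be pictured as a simple cache. Everything in this sketch is hypothetical (the `ThumbnailStore` class, its methods, and the `read_embedded_thumbnail` callback); it only shows why browsing never has to touch the storage area.

```python
class ThumbnailStore:
    """Toy model: thumbnails are extracted in batch and served from a database."""

    def __init__(self):
        self._db = {}           # stands in for the PW database
        self.storage_reads = 0  # how often a document file was touched

    def extract(self, doc_id, read_embedded_thumbnail):
        """Batch step: read the file once and store its thumbnail, if any."""
        self.storage_reads += 1
        thumbnail = read_embedded_thumbnail(doc_id)
        if thumbnail is not None:
            self._db[doc_id] = thumbnail
        return thumbnail is not None

    def preview(self, doc_id):
        """Display step: served from the database, with no storage access."""
        return self._db.get(doc_id)
```

However many times `preview` is called, `storage_reads` does not change, which mirrors the point that PWOF is needed to extract thumbnails but not to display them.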
Review
• Topics covered in this Chapter
– Full Text Indexing – Configuration
– Full Text Searches
– ProjectWise Orchestration Framework
– Thumbnail Extraction
– Microsoft Indexing Service
• And iFilters to extend default supported file types
• (I have a free Visio, and MSG iFilter from Microsoft)