Verified Answer

I believe you've got it right on pointing to the second path mentioned above.

My test lucene.net index is at S:\LuceneDotNetIndex.

There are two locations in arachnode.net that need to be set for lucene.net to function properly.

1.) In CrawlActions.config. This location tells ManageLuceneDotNetIndexes.cs where the index is located, or should be created. The current code in SVN lists a relative path, which should work as downloaded. Mine is changed because I keep the indexes on a separate volume.

2.) In Web.config. This configuration setting enables Search.aspx, et al. to return search results. I need to add Server.MapPath, a friendly exception, or something similar to make it clearer what you need to do to enable searching (and remove the incorrect hardcoded path).
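As a rough illustration, the two settings might look something like the fragment below. This is a hypothetical sketch only: the element names, attribute names, and appSettings key are assumptions, not the exact arachnode.net schema, so check the actual files downloaded from SVN.

```xml
<!-- CrawlActions.config (sketch): point the lucene.net CrawlAction at the
     index directory. Element/attribute names here are illustrative. -->
<CrawlAction TypeName="ManageLuceneDotNetIndexes"
             Settings="S:\LuceneDotNetIndex" />

<!-- Web.config (sketch): an appSettings entry so Search.aspx can find the
     index. The key name is an assumption. -->
<appSettings>
  <add key="LuceneDotNetIndexDirectory" value="S:\LuceneDotNetIndex" />
</appSettings>
```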

All Replies


Super! Let me make these tweaks and see if I can get the search working.

So, can you explain a bit what role arachnode.net is playing and what role lucene.net is playing? Are the indexes and other data created in the LuceneDotNetIndex folder dependent on the arachnode.net data being stored? Or are two separate sets of data being created during the crawl: the database data for reporting and such, and the lucene.net data for searching?

Thanks much for laying this out. Look forward to making these tweaks and seeing what's what.

1.) Are the indexes and other data created in the LuceneDotNetIndex folder dependent on the arachnode.net data being stored?

Somewhat. Currently, every content type except WebPages can optionally be excluded from submission to the database. I have a feature slated that lets you choose whether content discovered by arachnode.net is stored on disk, in the database, both, or not at all. So, you could use arachnode.net to crawl and keep state, but not store any content. Or, you could store the content in triplicate if you wanted to.

The settings *insert are how you elect whether or not to submit content to the database. Content types can be parsed (extracted) for use by plugins, but your particular application may not require them to be stored. Certain settings are required when crawling to enable the lucene.net functionality. For example, 'extractWebPageMetaData' must be enabled, as this creates the on-disk content locations and data, and also extracts the text of the WebPage for lucene.net indexing. You don't, however, have to insert the WebPage metadata. I'd like to add a check when crawling to verify the current crawling configuration, as certain combinations of settings are nonsensical.

2.) Or are two separate sets of data being created during the crawl: the database data for reporting and such, and the lucene.net data for searching?

This is also correct. The data in the database is primarily for reporting, data mining, etc. You can, however, also search the data in the database with SQL Full Text Indexing. Full Text Indexing is enabled in four locations, and arachnode.net keeps track of the proper full-text index types so you can use IFilters to search .pdf files, .mp3 files, etc. Look at the table 'DataTypes' to learn more about the mappings between Content-Type (the HTTP response header) and FullTextIndexType.
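To make the two search paths concrete, the T-SQL below sketches both inspecting the DataTypes table and running a full-text query. The DataTypes table name comes from the post; the column names and the WebPages Source column are assumptions, so verify them against the actual schema before running anything.

```sql
-- Inspect the Content-Type -> FullTextIndexType mappings.
-- (Table name is from the post; column names are assumptions.)
SELECT ContentType, FullTextIndexType
FROM dbo.DataTypes;

-- A SQL Full Text Indexing query over crawled content. Using 'Source' as
-- the full-text-indexed column of dbo.WebPages is an assumption.
SELECT TOP 10 *
FROM dbo.WebPages
WHERE CONTAINS(Source, 'lucene');
```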

It's too bad that SQL FTI isn't better at what it does. While it is quite good at transparently managing its indexes, as you say, there isn't much control over the indexes and the results. The synonym functionality is helpful but limited in that it isn't possible to have more than one synonym roll up to more than one parent word - but, that's a completely different forum post altogether.

You are correct: the 'extractWebPageMetaData' setting is enabled, so I'm good there. I changed the two config file locations for the lucene.net index directory to a hard-coded path to see if that gets the search working. It looks like it's updating the files in that folder (doing a crawl now), so I'll wait for it to finish and then see if things work.

Question: I assume I need to let the crawl complete and not kill it via Visual Studio; otherwise the files may not get updated properly and the search might bomb. Correct?

Oh yes, another quick question: there isn't already any code in the build that allows for console or web entry of a new site to crawl, correct? I assume there's just sample code in the console app, and I could tweak that to build a console or web submitter.

Which leads me to another (final) question for now: ideally, it might be nice for submissions to go to a holding queue where they can be reviewed before actually being crawled. For example, a site may be building a search engine for a specific topic or set of sites, and they'd want to limit crawls. I know this would mean any discovered domains would need to be queued for review as well, which could become a large list.

That got it! Hard-coding the path to the Lucene.NET index directory seems to have gotten it working. I'm thinking, however, that these lucene.net data files are completely re-created every time I do a crawl. Is this correct?

Clearing out the database via the stored procedure and doing another run to check things out more...

I haven't had any problems interrupting a crawl. Lucene.net seems to do a good job maintaining state. You can start and stop crawls at will and you should still be able to search over the indexes.
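The reason interrupted crawls still leave a searchable index can be sketched with the Lucene.Net 2.x IndexWriter API: opening an existing index with create = false appends to it rather than rebuilding it. This is a minimal sketch, not arachnode.net's actual code; the index path matches the example above, and the exact constructor varies by Lucene.Net version.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

// Open the existing index for appending (create = false) rather than
// rebuilding it from scratch. Lucene.Net 2.x-style signature; newer
// versions use Directory/IndexWriterConfig instead.
IndexWriter writer = new IndexWriter(@"S:\LuceneDotNetIndex",
                                     new StandardAnalyzer(),
                                     false);

// ... add or update one Document per crawled WebPage here ...

writer.Close();  // commits pending changes so the index stays consistent
```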

If you want to insert CrawlRequests into the database from an API, create an instance of ArachnodeDAO and use the method InsertCrawlRequest. This inserts CrawlRequests directly into the CrawlRequests database table.
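A minimal sketch of that call is below. The class and method names (ArachnodeDAO, InsertCrawlRequest) come from the post, but the namespace and the parameter list shown are illustrative assumptions only; check the actual signature in ArachnodeDAO.cs before using this.

```csharp
using Arachnode.DataAccess;  // namespace is an assumption

// Submit a new site to crawl via the API rather than the console app.
// Every argument below is illustrative; verify against ArachnodeDAO.cs.
ArachnodeDAO arachnodeDAO = new ArachnodeDAO();

arachnodeDAO.InsertCrawlRequest(DateTime.Now,            // created
                                "http://example.com/",   // absolute URI
                                1,                       // depth
                                1);                      // priority
```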

You are right about priority in the database. Look at the stored procedure dbo.arachnode_omsp_CrawlRequests_SELECT; this would be the best place to perform initial filtering. Part of what you've touched on I plan to elaborate on in a post about ordered crawling.
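The holding-queue idea above could be approximated inside that stored procedure with a join against a review table. This is a hypothetical sketch: the ApprovedDomains table, the column names, and the @desiredMaximum parameter are all invented for illustration and do not reflect the procedure's actual contents.

```sql
-- Hypothetical filter inside dbo.arachnode_omsp_CrawlRequests_SELECT:
-- only dispatch CrawlRequests whose host has passed manual review.
-- dbo.ApprovedDomains and @desiredMaximum are assumptions.
SELECT TOP (@desiredMaximum) cr.*
FROM dbo.CrawlRequests cr
INNER JOIN dbo.ApprovedDomains ad
    ON cr.AbsoluteUri LIKE '%' + ad.Domain + '%'
ORDER BY cr.Priority DESC;
```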