Searching Publications How-To

Introduction

Display search with a publication's layout rather than the global layout.

Filter out a Members' Only area from the results.

Fix poor design decisions and bugs.

Changes include:

Index live XML files. The standard method crawls the site, puts the results into the "htdocs_dump" directory, and builds the index from there. That index omits documents not reachable from the start page, such as Members' Only sections, and it also includes navigation menus. Indexing the XML files covers all content, whether reachable or not, and leaves out the site architecture. If a visitor searches on "search", they should receive documents containing the word "search", not every page with the search function (which should be every page in a well-designed website). The index still includes everything in a document, including header information such as author. It is easy to limit the index to the content body, but that could cause complications with non-standard Lenya documents (Custom DocTypes/Resource Types). If Custom Doctypes are used, modify searchfixer.xsl (see below). Custom Doctypes were not tested with this configuration.

Remove "Members' Only" documents if the visitor is not authorized. Visitors must be logged in and belong to one of the specified Groups. This is the reverse of the current Lenya security, since deep URLs must pass the test for all parents. Example: /employees/programmers must pass the tests for both the "/employees" and "/employees/programmers" sections.

Add language to the index.

Limit initial search to current language.

Search page: Remove choice of publications. (This is a design decision. One publication = one website. With protected areas, there should be no need for multiple publications.)

Search page: Filter by chosen languages.

Default to search "Content", not "Title".

Increase the default results per page from 3 to 10.
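The parent-section rule for "Members' Only" documents in the list above can be sketched in a few lines. This is a minimal illustration in Python, not the code used by the search page. The `PROTECTED_URLS` mapping, the Group names, and the function names are hypothetical.

```python
# Hypothetical mapping: protected section -> Group a visitor must belong to.
PROTECTED_URLS = {
    "/employees": "employees",
    "/employees/programmers": "programmers",
}


def parent_sections(url):
    """Return every section on the path to url, shortest first, url included."""
    parts = [p for p in url.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def is_visible(url, user_groups):
    """True when the visitor's Groups satisfy every protected parent section."""
    groups = set(user_groups)
    for section in parent_sections(url):
        required = PROTECTED_URLS.get(section)
        if required is not None and required not in groups:
            return False  # one failed parent hides the document
    return True
```

With this mapping, a visitor who is only in the "employees" Group sees "/employees" but not "/employees/programmers", and a visitor who is not logged in (no Groups) sees neither.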

NOTE: Replace {pub} with your publication name in all instructions.

Indexing on Windows

This assumes Lenya 1.2.2 was installed to C:\apache-lenya-1.2.2. If your installation is different, adjust the paths.

The indexer adds namespaces to the data of Fields in the index. The namespaces are not used (and are annoying), so remove them. An alternative is to fix the XML later, but why bother?

File: C:\apache-lenya-1.2.2\build\lenya\webapp\WEB-INF\classes\org\apache\lenya\lucene\index\configuration2xslt.xsl
Add the following line:

<xsl:template match="namespace"/>

Set the configuration by changing: C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\pubs\{pub}\config\search\lucene-live.xconf
To:

Create a new file in the same directory to tell lucene what fields to index (filename must match the configuration src in lucene-live.xconf): C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\pubs\{pub}\config\search\lenyadocs.xconf
Add this:

Quit Lenya (to avoid file-locking issues). Run the batch file. Check the log. The index created works, but the results are not formatted properly:

"sitemap.xml" may appear in the results.

All links are wrong. They have an extra slash and "/index_xx.xml" must be changed to ".html".

The excerpt is not available and displays a Java error.

These are fixed in the next section.
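The link problem described above amounts to a small string rewrite. The following Python sketch only illustrates that rewrite. It assumes the links are site-relative paths, that the extra slash shows up as a doubled slash, and that "xx" is a two-letter language code. The function name `fix_link` is hypothetical.

```python
import re

# "/index_xx.xml" at the end of a path, preceded by at least one path segment.
# A top-level "/index_xx.xml" is left untouched in this sketch.
_INDEX_FILE = re.compile(r"(?<=[^/])/index_[a-z]{2}\.xml$")


def fix_link(raw):
    """Collapse repeated slashes and turn ".../index_xx.xml" into "....html"."""
    collapsed = re.sub(r"/{2,}", "/", raw)  # assumes no "http://" prefix
    return _INDEX_FILE.sub(".html", collapsed)
```

For example, `fix_link("//features/index_en.xml")` returns `"/features.html"`.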

Fix the XML results to be usable.

Copy C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\xslt\search\sort.xsl
To: C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\pubs\{pub}\lenya\xslt\search\sort.xsl

Copy C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\xslt\navigation\search.xsl
To: C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\pubs\{pub}\lenya\xslt\navigation\search.xsl
After the other params, add this line:

Download new file: C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\pubs\{pub}\lenya\content\search\search-and-results.xsp
Based on: C:\apache-lenya-1.2.2\build\lenya\webapp\lenya\content\search\search-and-results.xsp

Removed useless information. (I like dynamic lists better than anyone, but Search is a standard function with standard outputs, so why bother? I only left the <fields> tag to separate our output from lucene's.)

Added language filter.

Added protected section filter.

Hardcoded ProtectedUrls. The default is to require visitors be in an "employee" Group to access "/live/employee". Configure this for your website.

Uses Groups rather than Roles. (Roles are useless as long as "world" inherits "visit" for everything.)

Fixed counters and total. (Total-hits changed from property to element of results.)
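The language filter and the corrected counters interact: if hits are filtered after the search, the total and the per-page counters have to be computed from the filtered list, not from the raw hits. The Python sketch below shows one plausible reading of that logic. It is not taken from search-and-results.xsp, and the hit structure (a dict with "url" and "language" keys) and function names are assumptions.

```python
RESULTS_PER_PAGE = 10  # the new default; the old default was 3


def filter_by_language(hits, languages):
    """Keep only hits whose language is one of the chosen languages."""
    wanted = set(languages)
    return [hit for hit in hits if hit["language"] in wanted]


def page_of_results(hits, languages, page=1, per_page=RESULTS_PER_PAGE):
    """Filter first, then count and slice, so the counters match what is shown."""
    visible = filter_by_language(hits, languages)
    start = (page - 1) * per_page
    shown = visible[start:start + per_page]
    return {
        "total_hits": len(visible),           # total after filtering
        "first": start + 1 if shown else 0,   # 1-based counter of first hit shown
        "last": start + len(shown),           # counter of last hit shown
        "hits": shown,
    }
```

Counting before filtering would report hits the visitor never sees, which is one way the counters and total can drift from the displayed results.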

Blocking default search.

Block the default search when implementing ProtectedAreas. Otherwise, visitors can type the URL of the default search (or, if this publication was already in production, use a bookmark to the old search) and see links to protected documents.