# epifonyCrawlerPlugin #
## Introduction ##
The aim of epifonyCrawler is to imitate a web search engine spider: it crawls your site and adds the pages to a [Lucene](http://lucene.apache.org/) index using the excellent [Zend Lucene](http://framework.zend.com/manual/en/zend.search.lucene.html) library.
With this plugin you have control over how your site is crawled, and can exclude certain areas, such as navigation, from being indexed. The crawler is also i18n aware and will index your translated copy.
## Installation ##
To install the plugin for a symfony project, the usual process is to use the symfony command line:
```
symfony plugin:install epifonyCrawlerPlugin --stability=alpha
```
The plugin includes the required Zend Lucene code.
## Crawling your domain ##
You can create your index by running:
```
symfony epifony:generate-index http://www.example.com --depth=2
```

This will crawl every link to a depth of 2 pages. If you omit the depth then every link will be crawled, but be warned that this could take a long time and consume a lot of memory. I would recommend a depth of about 5 for most sites; if your site has very deep navigation then 8 or 9 should suffice.
You can add more than one domain to the index:
```
symfony epifony:generate-index http://www.example.com,http://blog.example.com --depth=5
```
Or more than one language:

```
symfony epifony:generate-index http://en.example.com/,http://fr.example.com/ --depth=5
```
Or add individual pages to your index like this:
```
symfony epifony:add-url http://www.example.com/new-page
```
Now you can test your search by running a Lucene query against the index:

```
symfony epifony:search-query "dependency injection"
```
Results:

```
id:91 -----------------------------
score: 1
url: http://www.example.com/dependency-injection
id:234 ----------------------------
score: 0.518689
url: http://www.example.com/programming-design-patterns
```
## Configuration ##
You can override most of the plugin's classes in your `app.yml`:

```
all:
  epifonyCrawler:
    index_class: Zend_Search_Lucene
    # handles the DI for all the classes
    manager_class: epifonyCrawlerManager
    # defaults to all links - not recommended because PHP's maximum nesting level can be reached quite quickly
    depth: -1
    # takes the url, reads the headers and gets the content
    browser_class: sfWebBrowser
    # a null value attempts to guess
    browser_adapter_class: sfCurlAdapter
    browser_adapter_options:
      followlocation: true
    browser_default_headers:
      User-Agent: epifonyCrawler v0.1
    # the amount of debugging shown to the user
    debug_level: <?php if (sfConfig::get('sf_error_reporting') > E_NOTICE): ?>3<?php else: ?>1<?php endif ?>
    # save the debugging to a file
    log_file_path: %sf_log_dir%/search.log
    logger: sfFileLogger
    # where to save the index
    data_dir: %sf_data_dir%/search/
    crawler_class: epifonyCrawler
    # assigns a class to read the content based on the mime type from the browser class
    extractor_factory: epifonyCrawlerExtractorFactory
    # the classes to assign to parsing a mime type
    mime_document_classes:
      text/html: epifonyCrawlerHtmlExtractor
      application/pdf: epifonyCrawlerPdfExtractor
    # you can pass constructor options to the mime_document_classes here
    document_constructor_options:
      epifonyCrawlerHtmlExtractor:
        # give a boost to certain html elements
        boost:
          h1: 1.5
          description: 1.5
          h2: 1.2
          title: 1.6
```
*Example of overriding the mime document class*
Change the class in your `app.yml` to your own:

```
mime_document_classes:
  text/html: myEpifonyCrawlerHtmlExtractor
```
Then override one of the parsing methods - for example, to remove some text from all the page titles:

```
class myEpifonyCrawlerHtmlExtractor extends epifonyCrawlerHtmlExtractor
{
  public function processTitle($xpath)
  {
    $docTitle = '';
    $titleNodes = $xpath->query('/html/head/title');
    foreach ($titleNodes as $titleNode) {
      // remove the domain from the title
      $docTitle .= str_replace('www.example.com', '', $titleNode->nodeValue) . ' ';
    }

    $field = $this->createField($docTitle, 'title');
    $this->addField($field);
  }
}
```
## Selective Indexing ##
Some areas of your site might not need to be indexed, for example the navigation, which could skew the search results. There are 3 ways to exclude content from being indexed and increase the quality of your index:
1. __Robots.txt__ You can add the epifonyCrawler user-agent to your robots.txt and block it by subdomain or subdirectory
2. __rel="nofollow"__ Add this to links that you don't want the crawler to follow
3. __class="crawl_ignore"__ Add this class to navigation or other html tags which you don't want to be indexed
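As an illustration of how rules 2 and 3 can be honoured when extracting links, here is a minimal, self-contained sketch using PHP's DOM extension. This is not the plugin's actual implementation - the function name is hypothetical - but it shows the kind of XPath filtering involved:

```php
<?php
// Hypothetical sketch: collect followable links from an HTML page,
// skipping rel="nofollow" anchors and anything inside class="crawl_ignore".
function extractFollowableLinks($html)
{
  $doc = new DOMDocument();
  @$doc->loadHTML($html); // suppress warnings from real-world markup
  $xpath = new DOMXPath($doc);

  // anchors with an href, not rel="nofollow", and with no crawl_ignore ancestor
  $query = '//a[@href]'
         . '[not(@rel="nofollow")]'
         . '[not(ancestor-or-self::*[contains(concat(" ", normalize-space(@class), " "), " crawl_ignore ")])]';

  $links = array();
  foreach ($xpath->query($query) as $node) {
    $links[] = $node->getAttribute('href');
  }

  return $links;
}

$html = '<html><body>
  <ul class="crawl_ignore"><li><a href="/nav">Nav</a></li></ul>
  <a href="/page" rel="nofollow">Skip me</a>
  <a href="/article">Index me</a>
</body></html>';

print_r(extractFollowableLinks($html)); // only /article survives
```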
## I18n ##
By default the HTML meta `content-language` value is stored in the index and can be added to any search query. See the next section for an example of this.
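For reference, reading the language from the meta tag can be done along these lines. This is an illustrative sketch only - the plugin's own extraction code may differ, and the function name and default are assumptions:

```php
<?php
// Illustrative: read the content-language value from an HTML document's
// meta tags, falling back to a default when none is declared.
function getContentLanguage($html, $default = 'en')
{
  $doc = new DOMDocument();
  @$doc->loadHTML($html); // suppress warnings from real-world markup
  $xpath = new DOMXPath($doc);

  // translate() lower-cases the http-equiv attribute (XPath 1.0 has no lower-case())
  $nodes = $xpath->query('//meta[translate(@http-equiv, "CONTENT-LANGUAGE", "content-language") = "content-language"]');
  foreach ($nodes as $node) {
    return strtolower($node->getAttribute('content'));
  }

  return $default;
}

$html = '<html><head><meta http-equiv="Content-Language" content="fr" /></head><body></body></html>';
echo getContentLanguage($html); // fr
```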
## Creating a search results page ##
I'm currently working on epifonySearchPlugin, which will handle the search results for an index. In the meantime you can create a search in your frontend app like this.
In your `actions.class.php`:

```
$crawlerManager = epifonyCrawlerManager::getInstance();
$index = $crawlerManager->openIndex();
// only search english results
$this->hits = $index->find($request->getParameter('search').' AND content-language:en');
```
In your template:

```
foreach ($hits as $hit) {
  echo $hit->score;
  echo $hit->title;
  echo $hit->url;
}
```
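Zend Lucene's `find()` returns every matching hit sorted by score, so on a large index you may want to page through the results yourself. A minimal sketch in plain PHP (the function name and page size are assumptions, not part of the plugin):

```php
<?php
// Minimal paging sketch over an ordered array of hits.
function pageHits(array $hits, $page, $perPage = 10)
{
  // pages are 1-based; clamp so page 0 or below behaves like page 1
  $offset = max(0, $page - 1) * $perPage;

  return array_slice($hits, $offset, $perPage);
}

// Stand-in for Zend_Search_Lucene hits - any ordered array works.
$hits = range(1, 25);

print_r(pageHits($hits, 1)); // items 1-10
print_r(pageHits($hits, 3)); // items 21-25
```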
## Helping with your index ##
Your index needs attention to make sure the results it returns are right for your site. [Luke](http://www.getopt.org/luke/) is an excellent tool for analysing your index and can be installed on Windows or Unix.
## TODO ##
* Add more mime types to the extractors, e.g. Word, RSS
* Extract links from a PDF (currently returns an empty array)
* Test with different charsets (currently only UTF-8 is supported)
* Create a script to backup the index and roll-forward functionality so a new index can be generated without affecting the current index
* Some sort of log rotation