Space: Lucene Connector Framework (https://cwiki.apache.org/confluence/display/CONNECTORS)
Page: How to Build and Deploy Apache Connectors Framework (https://cwiki.apache.org/confluence/display/CONNECTORS/How+to+Build+and+Deploy+Apache+Connectors+Framework)
Edited by Karl Wright:
---------------------------------------------------------------------
h1. Building ACF
Apache Connectors Framework consists of the framework itself, a set of connectors, and an
Apache2 plug-in module. These can be built as follows.
h3. Building the framework and the connectors
To build the ACF framework code, and the particular connectors you are interested in, you
currently need to do the following:
# Check out [https://svn.apache.org/repos/asf/incubator/lcf/trunk].
# cd to "modules".
# Install desired dependent LGPL and proprietary libraries, wsdls, and xsds. See below for
details.
# Run ant.
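As a rough sketch, the steps above could be scripted as follows. The _run_ helper and the DRY_RUN variable are illustrative conventions, not part of ACF, and the checkout directory is whatever you pass in. Step 3 is left as a comment because it depends on which connectors you want.

```shell
# Print the command instead of running it when DRY_RUN is set (illustrative helper).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# build_acf <checkout-dir>: check out trunk, enter "modules", run ant.
build_acf() {
  checkout_dir=$1
  run svn checkout https://svn.apache.org/repos/asf/incubator/lcf/trunk "$checkout_dir" || return 1
  # In dry-run mode the directory does not exist yet, so only announce the cd.
  if [ -n "${DRY_RUN:-}" ]; then
    printf 'cd %s/modules\n' "$checkout_dir"
  else
    cd "$checkout_dir/modules" || return 1
  fi
  # Step 3 (copying LGPL/proprietary jars, wsdls, and xsds) would go here.
  run ant
}
```

Setting DRY_RUN=1 before calling _build_acf_ prints the three commands without executing them, which is a cheap way to check the paths first.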
If you supply *no* LGPL or proprietary libraries, the framework itself and only the following
repository connectors will be built:
* Active Directory authority
* Filesystem connector
* JDBC connector, with just the postgresql jdbc driver
* RSS connector
* Webcrawler connector
In addition, the following output connectors will be built:
* MetaCarta GTS output connector
* Apache Solr output connector
* Null output connector
The LGPL and proprietary connector dependencies are described in separate sections below.
The output of the ant build is produced in the _modules/dist_ directory, which is further
broken down by process. The number of produced process directories may vary, because optional
individual connectors do sometimes supply processes that must be run to support the connector.
See the table below for a description of the _modules/dist_ folder.
|| _modules/dist_ directory || Meaning ||
| _web_ | Web applications that should be deployed on tomcat or the equivalent, plus recommended
application server -D switch names and values |
| _processes_ | classpath jars that should be included in the class path for all non-connector-specific
processes, along with -D switches, using the same convention as described for tomcat, above
|
| _lib_ | jars for all the connector plugins, which should be referenced by the appropriate
clause in the ACF configuration file |
| _wsdd_ | wsdd files that are needed by the included connectors in order to function |
| _xxx-process_ | classpath jars and -D switches needed for a required connector-specific
process |
| _example_ | a jetty-based example that runs in a single process (except for any connector-specific
processes) |
For all of the _dist_ subdirectories above (except for _wsdd_, which does not correspond to
a process), any scripts resulting from the build that pertain to that process will be placed
in a _script_ subdirectory. Thus, the script for executing a command under Windows for the
_processes_ subdirectory will be found in _dist/processes/script/executecommand.bat_. (This
script requires two variables to be set before execution: JAVA_HOME, and LCF_HOME, which should
point to ACF's home execution directory, described below.) Indeed, everything you need to
run an ACF process can be found under _dist/processes_ when the ant build completes: a _define_
subdirectory containing -D switch description files, a _jar_ subdirectory where jars are placed,
and a _war_ subdirectory where war files are output.
The supplied scripts in the _script_ directory for a process generally take care of building
an appropriate classpath and set of -D switches. If you need to construct a classpath by
hand, it is important to remember that "more" is not necessarily "better". The process deployment
strategy implied by the build structure has been carefully thought out to avoid jar conflicts.
Indeed, several connectors are structured using multiple processes precisely for that reason.
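If you do need to build a classpath by hand, a minimal sketch along these lines keeps it restricted to a single process directory. The function name is hypothetical, and the ":" separator assumes a Unix-style shell (Windows uses ";").

```shell
# acf_classpath <process-dir>: join every jar under <process-dir>/jar with ':'.
# Only jars from this one process directory are used, to avoid jar conflicts.
acf_classpath() {
  jar_dir=$1/jar
  cp=""
  for jar in "$jar_dir"/*.jar; do
    [ -e "$jar" ] || continue   # the glob stays literal when no jars exist
    cp="${cp:+$cp:}$jar"
  done
  printf '%s\n' "$cp"
}
```

For example, _acf_classpath dist/processes_ would print the jars found in _dist/processes/jar_, and nothing from any other process directory.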
h5. Building the Documentum connector
The Documentum connector requires EMC's DFC product in order to be built. Install DFC on
the build system, and locate the jars it installs. You will need to copy at least dfc.jar,
dfcbase.jar, and dctm.jar into the directory "modules/connectors/documentum/dfc".
h5. Building the FileNet connector
The FileNet connector requires IBM's FileNet P8 API jar in order to be built. Install the
FileNet P8 API on the build system, and copy at least "Jace.jar" from that installation into
"modules/connectors/filenet/filenet-api".
h5. Building the JDBC connector, including Oracle, SQLServer, or Sybase JDBC drivers
The JDBC connector also knows how to work with Oracle, SQLServer, and Sybase JDBC drivers.
For Oracle, download the appropriate Oracle JDBC jar from the Oracle site, and copy it into
the directory "modules/connectors/jdbc/jdbc-drivers". For SQLServer and Sybase, download
jtds.jar, and copy it into the same directory.
h5. Building the jCIFS connector
To build this connector, you need to download jcifs.jar from http://samba.jcifs.org, and copy
it into the "modules/connectors/jcifs/jcifs" directory.
h5. Building the LiveLink connector
This connector needs LAPI, which is a proprietary java library that allows access to OpenText's
LiveLink server. Copy the lapi.jar into the "modules/connectors/livelink/lapi" directory.
h5. Building the Memex connector
This connector needs the Memex API jar, usually called JavaMXIELIB.jar. Copy this jar into
the "modules/connectors/memex/mxie-java" directory.
h5. Building the Meridio connector
The Meridio connector needs wsdls and xsds downloaded from an installed Meridio instance using
*disco.exe*, which is installed as part of Microsoft Visual Studio, typically under "c:\Program
Files\Microsoft SDKs\Windows\V6.x\bin". Obtain the preliminary wsdls and xsds by interrogating
the following Meridio web services:
* http\[s\]://<meridio_server>/DMWS/MeridioDMWS.asmx
* http\[s\]://<meridio_server>/RMWS/MeridioRMWS.asmx
You should have obtained the following files in this step:
* MeridioDMWS.wsdl
* MeridioRMWS.wsdl
* DMDataSet.xsd
* RMDataSet.xsd
* RMClassificationDataSet.xsd
Next, patch these using Microsoft's *xmldiffpatch* utility suite, downloadable for Windows
from [http://msdn.microsoft.com/en-us/library/aa302294.aspx]. The appropriate diff files
to apply as patches can be found in "modules/connectors/meridio/upstream-diffs". After the
patching, rename so that you have the files:
* MeridioDMWS_axis.wsdl
* MeridioRMWS_axis.wsdl
* DMDataSet_castor.xsd
* RMDataSet_castor.xsd
* RMClassificationDataSet_castor.xsd
Finally, copy all of these to: "modules/connectors/meridio/wsdls".
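The rename-and-copy step can be sketched as below. This assumes the patched files still carry their original names and sit together in one directory; the function name and both directory arguments are illustrative.

```shell
# stage_meridio_files <patched-dir> <wsdls-dir>
# Copies each patched file into <wsdls-dir> under the name the build expects.
stage_meridio_files() {
  src=$1
  dest=$2
  mkdir -p "$dest" || return 1
  for name in MeridioDMWS MeridioRMWS; do
    cp "$src/$name.wsdl" "$dest/${name}_axis.wsdl" || return 1
  done
  for name in DMDataSet RMDataSet RMClassificationDataSet; do
    cp "$src/$name.xsd" "$dest/${name}_castor.xsd" || return 1
  done
}
```

A typical call would pass "modules/connectors/meridio/wsdls" as the second argument.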
h5. Building the SharePoint connector
In order to build this connector, you need to download wsdls from an installed SharePoint
instance. The wsdls in question are:
* Permissions.wsdl
* Lists.wsdl
* Dspsts.wsdl
* usergroup.wsdl
* versions.wsdl
* webs.wsdl
To download a wsdl, use Microsoft's *disco.exe* tool, which is part of Visual Studio, typically
under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". You'd want to interrogate the following
urls:
* http\[s\]://<server_name>/_vti_bin/Permissions.asmx
* http\[s\]://<server_name>/_vti_bin/Lists.asmx
* http\[s\]://<server_name>/_vti_bin/Dspsts.asmx
* http\[s\]://<server_name>/_vti_bin/usergroup.asmx
* http\[s\]://<server_name>/_vti_bin/versions.asmx
* http\[s\]://<server_name>/_vti_bin/webs.asmx
When the wsdl files have been downloaded, copy them to: "modules/connectors/sharepoint/wsdls".
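The six URLs follow a single pattern, so a small helper can print them for a given server. This is only a convenience sketch: the function name is made up, and it does not invoke *disco.exe* itself.

```shell
# sharepoint_wsdl_urls <scheme> <server-name>
# Prints one web service URL for each wsdl the SharePoint connector needs.
sharepoint_wsdl_urls() {
  scheme=$1
  server=$2
  for service in Permissions Lists Dspsts usergroup versions webs; do
    printf '%s://%s/_vti_bin/%s.asmx\n' "$scheme" "$server" "$service"
  done
}
```

Each printed line is a URL to hand to *disco.exe* on the Windows machine.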
Note well: For SharePoint instances version 3.0 or higher, in order to support file and folder
level security, you also must deploy a custom SharePoint web service on the SharePoint instance
you intend to connect to. This is because Microsoft apparently overlooked support for web-service-based
access to such security information when SharePoint 3.0 was released.
In order to build the service, you need to have access to a Windows machine that has a reasonably
current version of Microsoft Visual Studio available, with .NET installed and (at least) SharePoint
2.0 installed as well. The fastest way to build the service is to do the following *after*
building everything else:
cd connectors/sharepoint
ant build-webservice
cd webservice/Package
Then, follow the directions in the file "Installation Readme.txt", found in that directory.
h3. Building ACF's Apache2 plugin
To build the mod-authz-annotate plugin, you need to start with a Unix system that has the
apache2 development tools installed on it, plus the curl development package (from [http://curl.haxx.se]
or elsewhere). Then, cd to modules/mod-authz-annotate, and type "make". The build will produce
a file called mod-authz-annotate.so, which should be copied to the appropriate Apache2 directory
so it can be used as a plugin.
h1. Running Apache Connectors Framework
h3. Quick start
You can run most of Apache Connectors Framework in a single process, for evaluation and convenience.
This single-process version uses Jetty to handle its web applications, and Derby as an embedded
database. All you need to do to run this version of ACF is to follow the build instructions
above, and then:
{code}
cd dist/example
<java> -jar start.jar
{code}
In this jetty setup, all database initialization and connector registration takes place automatically
(at the cost of some startup delay). The crawler UI can be found at http://<host>:8345/lcf-crawler-ui.
The authority service can be found at http://<host>:8345/lcf-authority-service. The
programmatic API is at http://<host>:8345/lcf-api.
You can stop Apache Connectors Framework at any time using ^C.
Bear in mind that Derby is not as full-featured a database as is Postgresql. This means that
any performance testing you may do against the quick start example may not be applicable to
a full installation. Furthermore, Derby only permits one process at a time to be connected
to its databases, so you *cannot* use any of the ACF commands (as described below) while the
quick-start ACF is running.
Another caveat that you will need to be aware of with the quick-start version of ACF is that
it in no way removes the need for you to run any separate processes that individual connectors
require. Specifically, the Documentum and FileNet connectors require processes to be independently
started in order to function. You will need to read about these connector-specific processes
below in order to use the corresponding connectors.
h5. The quick-start connectors.xml configuration file
The quick-start version of ACF reads its own configuration file, called _connectors.xml_,
in order to register the available connectors in the database. The file has this basic format:
{code:xml}
<?xml version="1.0" encoding="UTF-8" ?>
<connectors>
(clauses)
</connectors>
{code}
The following tags are available to specify your connectors:
<repositoryconnector name="_pretty_name_" class="_connector_class_"/>
<authorityconnector name="_pretty_name_" class="_connector_class_"/>
<outputconnector name="_pretty_name_" class="_connector_class_"/>
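For illustration, the helper below writes a small _connectors.xml_ that registers one repository connector and one output connector. The class names are the ones used in the registration table elsewhere on this page; that the quick-start file uses exactly the same names is an assumption, and the function name is hypothetical.

```shell
# write_connectors_xml <output-file>
# Writes a minimal connectors.xml with one repository and one output connector.
write_connectors_xml() {
  cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8" ?>
<connectors>
  <repositoryconnector name="Filesystem Connector" class="org.apache.lcf.crawler.connectors.filesystem.FileConnector"/>
  <outputconnector name="Null Connector" class="org.apache.lcf.agents.output.nullconnector.NullConnector"/>
</connectors>
EOF
}
```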
h3. Framework and connectors
The core part of Apache Connectors Framework consists of several pieces. These basic pieces
are enumerated below:
* A database, which is where ACF keeps all of its configuration and state information, usually
Postgresql
* A synchronization directory, which is how ACF coordinates activity among its various processes
* An *agents* process, which is the process that actually crawls documents and ingests them
* A *crawler-ui* web application, which presents the UI users interact with to configure
and control the crawler
* An *authority-service* web application, which responds to requests for authorization tokens,
given a user name
In addition, there are a number of java classes in Apache Connectors Framework that are intended
to be called directly, to perform specific actions in the environment or in the database.
These classes are usually invoked from the command line, with appropriate arguments supplied,
and are thus considered to be ACF *commands*. The basic functionality supplied by these command
classes is as follows:
* Create/Destroy the ACF database instance
* Start/Stop the *agents* process
* Register/Unregister an agent class (there's currently only one included)
* Register/Unregister an output connector
* Register/Unregister a repository connector
* Register/Unregister an authority connector
* Clean up synchronization directory garbage resulting from an ungraceful interruption of
an ACF process
* Query for certain kinds of job-related information
Individual connectors may contribute additional command classes and processes to this picture.
A properly built connector typically consists of:
* One or more jar files, to be placed in the library area reserved for connector jars and
their dependencies.
* Possibly some java commands, which are meant to support or configure the connector in some
way.
* Possibly a connector-specific process or two, each requiring a distinct classpath, which
usually serves to isolate the *crawler-ui* web application, *authority service* web application,
*agents* process, and any commands from problematic aspects of the client environment
* A recommended set of java "define" variables, which should be used consistently with all
involved processes, e.g. the *agents* process, the application server running the *authority-service*
and *crawler-ui*, and any commands. (This is historical; as of this writing, no connectors
have any of these any longer.)
An individual connector package will typically supply an output connector, or a repository
connector, or both a repository connector and an authority connector. The ant build script
under _modules_ automatically forms each individual connector's contribution to the overall
system into the overall package.
The basic steps required to set up and run ACF are as follows:
1. Check out and build, using "ant". The default target builds everything.
2. Install postgresql. The postgresql JDBC driver included with ACF is known to work with
version 8.3.x, so that version is the currently recommended one. Configure postgresql for
your environment; the default configuration is acceptable for testing and experimentation.
3. Install a Java application server, such as Tomcat.
4. Create a home directory for ACF. To do this, make a copy of the contents of _modules/dist_
from the build. In this directory, create properties.xml and logging.ini, as described below.
Note that you will also need to create a synchronization directory, also described below,
and refer to this directory within your properties.xml.
5. Deploy the war files in _<LCF_HOME>/web/war_ to your application server.
6. Set the starting environment variables for your app server to include the -D commands found
in _<LCF_HOME>/web/define_. The -D commands should be of the form, "-D<file name>=<file
contents>".
7. Use the _<LCF_HOME>/processes/script/executecommand.bat_ command to execute the
appropriate commands from the next section below, being sure to first set the JAVA_HOME and
LCF_HOME environment variables properly.
8. Start any supporting processes that result from your build. (Some connectors such as Documentum
and FileNet have auxiliary processes you need to run to make these connectors functional.)
9. Start your application server.
10. Start the ACF agents process.
11. At this point, you should be able to interact with the ACF UI, which can be accessed via
the lcf-crawler-ui web application.
For each of the described steps, details are furnished in the sections below.
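Step 6 describes switches of the form "-D<file name>=<file contents>". A minimal sketch of turning a _define_ directory into such switches might look like this; the function name is hypothetical, and it assumes each file holds a single value.

```shell
# acf_defines <define-dir>: print one -D switch per file in the directory.
acf_defines() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue   # skip subdirectories and the literal glob
    # $(cat ...) drops any trailing newline from the file contents.
    printf '%s%s=%s\n' -D "$(basename "$f")" "$(cat "$f")"
  done
}
```

The output of _acf_defines "$LCF_HOME/web/define"_ is what would be added to the application server's startup options.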
h5. Configuring the Postgresql database
Despite having an internal architecture that cleanly abstracts from specific database details,
Apache Connectors Framework is currently fairly specific to Postgresql. There
are a number of reasons for this.
# Apache Connectors Framework uses the database for its document queue, which places a significant
load on it. The back-end database is thus a significant factor in ACF's performance. But,
in exchange, ACF benefits enormously from the underlying ACID properties of the database.
# The strategy for getting optimal query plans from the database is not abstracted. For
example, Postgresql 8.3+ is very sensitive to certain statistics about a database table, and
will not generate a performant plan if the statistics are inaccurate by even a little, in
some cases. So, for Postgresql, the database table must be analyzed very frequently, to avoid
catastrophically bad plans. But luckily, Postgresql is pretty good at doing analysis quickly.
Oracle, on the other hand, takes a very long time to perform analysis, but its plans are
much less sensitive.
# Postgresql always does a sequential scan in order to count the number of rows in a table,
while other databases return this efficiently. This has affected the design of the ACF UI.
# The choice of query form influences the query plan. Ideally, this is not true, but for
both Postgresql and for (say) Oracle, it is.
# Postgresql has a high degree of parallelism and lack of internal single-threadedness.
Apache Connectors Framework has been tested against Postgresql 8.3.7. We recommend the following
configuration parameter settings to work optimally with ACF:
* A default database encoding of UTF-8
* _postgresql.conf_ settings as described in the table below
* _pg_hba.conf_ settings to allow password access for TCP/IP connections from Apache Connectors
Framework
* A maintenance strategy involving cronjob-style vacuuming, rather than Postgresql autovacuum
|| _postgresql.conf_ parameter || Tested value ||
| shared_buffers | 1024MB |
| checkpoint_segments | 300 |
| maintenance_work_mem | 2MB |
| tcpip_socket | true |
| max_connections | 400 |
| checkpoint_timeout | 900 |
| datestyle | ISO,European |
| autovacuum | off |
h5. A note about maintenance
Postgresql's architecture causes it to accumulate dead tuples in its data files, which do
not interfere with its performance but do bloat the database over time. The usage pattern
of ACF is such that it can cause significant bloat to occur to the underlying Postgresql database
in only a few days, under sufficient load. Postgresql has a feature to address this bloat,
called *vacuuming*. This comes in three varieties: autovacuum, manual vacuum, and manual
full vacuum.
We have found that Postgresql's autovacuum feature is inadequate under such conditions, because
it not only fights for database resources pretty much all the time, but it falls further and
further behind as well. Postgresql's in-place manual vacuum functionality is a bit better,
but is still much, much slower than actually making a new copy of the database files, which
is what happens when a manual full vacuum is performed.
Dead-tuple bloat also occurs in indexes in Postgresql, so tables that have had a lot of activity
may benefit from being reindexed at the time of maintenance.
We therefore recommend periodic, scheduled maintenance operations instead, consisting of the
following:
* VACUUM FULL VERBOSE;
* REINDEX DATABASE <the_db_name>;
During maintenance, Postgresql locks tables one at a time. Nevertheless, the crawler UI may
become unresponsive for some operations, such as when counting outstanding documents on the
job status page. ACF thus has the ability to check for the existence of a file prior to such
sensitive operations, and will display a useful "maintenance in progress" message if that
file is found. This allows a user to set up a maintenance system that gives ACF users adequate
feedback about the overall status of the system.
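A scheduled maintenance job built around these two statements might be sketched as follows. The _run_ helper, the DRY_RUN variable, and the function name are illustrative. The flag file path is whatever you choose, and how ACF is told which file to look for is not covered here.

```shell
# Print the command instead of running it when DRY_RUN is set (illustrative helper).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# acf_maintenance <db-name> <db-user> <flag-file>
acf_maintenance() {
  db=$1
  user=$2
  flag=$3
  # Create the file that signals "maintenance in progress" for the duration.
  run touch "$flag" || return 1
  run psql -U "$user" -d "$db" -c 'VACUUM FULL VERBOSE;'
  status=$?
  if [ "$status" -eq 0 ]; then
    run psql -U "$user" -d "$db" -c "REINDEX DATABASE \"$db\";"
    status=$?
  fi
  # Always remove the flag file, even if a statement failed.
  run rm -f "$flag"
  return "$status"
}
```

Such a function could be called from a cron job during a quiet period. In dry-run mode the SQL is printed without its shell quoting, so treat that output as a preview rather than something to paste back into a shell.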
h5. The ACF configuration file
Currently, ACF requires two configuration files: the main configuration property file, and
the logging configuration file.
The property file path can be specified by the system property "org.apache.lcf.configfile".
If not specified through a -D operation, its name is presumed to be _<user_home>/lcf/properties.xml_.
The form of the property file is XML, of the following basic form:
{code:xml}
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
(clauses)
</configuration>
{code}
h5. Properties
The configuration file allows properties to be specified. A property clause has the form:
<property name="_property_name_" value="_property_value_"/>
One of the optional properties is the name of the logging configuration file. This property's
name is "org.apache.lcf.logconfigfile". If not present, the logging configuration file will
be assumed to be _<user_home>/lcf/logging.ini_. The logging configuration file is a
standard commons-logging property file, and should be formatted accordingly.
Note that all properties described below can also be specified on the command line, via a
-D switch. If both methods of setting the property are used, the -D switch value will override
the property file value.
The following table describes the configuration property file properties, and what they do:
|| Property || Required? || Function ||
| org.apache.lcf.lockmanagerclass | No | Specifies the class to use to implement synchronization.
Default is a built-in file-based synchronization class. |
| org.apache.lcf.databaseimplementationclass | No | Specifies the class to use to implement
database access. Default is a built-in Postgresql implementation. |
| org.apache.lcf.synchdirectory | Yes, if file-based synchronization class is used | Specifies
the path of a synchronization directory. All ACF process owners *must* have read/write privileges
to this directory. |
| org.apache.lcf.database.maxhandles | No | Specifies the maximum number of database connection
handles that will be pooled. Recommended value is 200. |
| org.apache.lcf.database.handletimeout | No | Specifies the maximum time a handle is to live
before it is presumed dead. Recommend a value of 604800, which is the maximum allowable.
|
| org.apache.lcf.logconfigfile | No | Specifies location of logging configuration file. |
| org.apache.lcf.database.name | No | Describes database name for ACF; defaults to "dbname"
if not specified. |
| org.apache.lcf.database.username | No | Describes database user name for ACF; defaults to
"lcf" if not specified. |
| org.apache.lcf.database.password | No | Describes database user's password for ACF; defaults
to "local_pg_password" if not specified. |
| org.apache.lcf.crawler.threads | No | Number of crawler worker threads created. Suggest
a value of 30. |
| org.apache.lcf.crawler.deletethreads | No | Number of crawler delete threads created. Suggest
a value of 10. |
| org.apache.lcf.misc | No | Miscellaneous debugging output. Legal values INFO, WARN, or
DEBUG. |
| org.apache.lcf.db | No | Database debugging output. Legal values INFO, WARN, or DEBUG.
|
| org.apache.lcf.lock | No | Lock management debugging output. Legal values INFO, WARN, or
DEBUG. |
| org.apache.lcf.cache | No | Cache management debugging output. Legal values INFO, WARN,
or DEBUG. |
| org.apache.lcf.agents | No | Agent management debugging output. Legal values INFO, WARN,
or DEBUG. |
| org.apache.lcf.perf | No | Performance logging debugging output. Legal values INFO, WARN,
or DEBUG. |
| org.apache.lcf.crawlerthreads | No | Log crawler thread activity. Legal values INFO, WARN,
or DEBUG. |
| org.apache.lcf.hopcount | No | Log hopcount tracking activity. Legal values INFO, WARN,
or DEBUG. |
| org.apache.lcf.jobs | No | Log job activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.lcf.connectors | No | Log connector activity. Legal values INFO, WARN, or DEBUG.
|
| org.apache.lcf.scheduling | No | Log document scheduling activity. Legal values INFO, WARN,
or DEBUG. |
| org.apache.lcf.authorityconnectors | No | Log authority connector activity. Legal values
INFO, WARN, or DEBUG. |
| org.apache.lcf.authorityservice | No | Log authority service activity. Legal values are
INFO, WARN, or DEBUG. |
| org.apache.lcf.sharepoint.wsddpath | Yes, for SharePoint Connector | Path to the SharePoint
Connector wsdd file. |
| org.apache.lcf.meridio.wsddpath | Yes, for Meridio Connector | Path to the Meridio Connector
wsdd file. |
h5. Class path libraries
The configuration file can also specify a set of directories which will be searched for connector
jars. The directive that adds to the class path is:
<libdir path="_path_"/>
Note that the path can be relative. For the purposes of path resolution, "." means the directory
in which the properties.xml file is located.
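To make the resolution rule concrete, here is a small sketch of it. This is not ACF's own code: the function name is hypothetical, and it handles only Unix-style paths, so a Windows path such as "c:/lib" would be treated as relative.

```shell
# resolve_libdir <config-file> <libdir-path>
# Resolves a libdir path relative to the directory holding the config file.
resolve_libdir() {
  config_dir=$(dirname "$1")
  case "$2" in
    /*) printf '%s\n' "$2" ;;                        # already absolute
    .)  printf '%s\n' "$config_dir" ;;               # the config directory itself
    *)  printf '%s/%s\n' "$config_dir" "${2#./}" ;;  # relative to the config directory
  esac
}
```

With a configuration file at /opt/lcf/properties.xml, the path "./lib" resolves to /opt/lcf/lib.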
h5. Examples
An example properties file might be:
{code}
<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
<property name="org.apache.lcf.synchdirectory" value="c:/mysynchdir"/>
<property name="org.apache.lcf.logconfigfile" value="c:/conf/logging.ini"/>
<libdir path="./lib"/>
</configuration>
{code}
An example simple logging configuration file might be:
{code}
# Set the default log level and parameters
# This gets inherited by all child loggers
log4j.rootLogger=WARN, MAIN
log4j.additivity.org.apache=false
log4j.appender.MAIN=org.apache.log4j.RollingFileAppender
log4j.appender.MAIN.File=c:/dataarea/lcf.log
log4j.appender.MAIN.MaxFileSize=50MB
log4j.appender.MAIN.MaxBackupIndex=10
log4j.appender.MAIN.layout=org.apache.log4j.PatternLayout
log4j.appender.MAIN.layout.ConversionPattern=[%d]%-5p %m%n
{code}
h5. Commands
After you have created the necessary configuration files, you will need to initialize the
database, register the "pull-agent" agent, and then register your individual connectors.
ACF provides a set of commands for performing these actions, and others as well. The classes
implementing these commands are specified below.
|| Core Command Class || Arguments || Function ||
| org.apache.lcf.core.DBCreate | _dbuser_ \[_dbpassword_\] | Create ACF database instance |
| org.apache.lcf.core.DBDrop | _dbuser_ \[_dbpassword_\] | Drop ACF database instance |
| org.apache.lcf.core.LockClean | None | Clean out synchronization directory |
|| Agents Command Class || Arguments || Function ||
| org.apache.lcf.agents.Install | None | Create ACF agents tables |
| org.apache.lcf.agents.Uninstall | None | Remove ACF agents tables |
| org.apache.lcf.agents.Register | _classname_ | Register an agent class |
| org.apache.lcf.agents.UnRegister | _classname_ | Un-register an agent class |
| org.apache.lcf.agents.UnRegisterAll | None | Un-register all current agent classes |
| org.apache.lcf.agents.SynchronizeAll | None | Un-register all registered agent classes that
can't be found |
| org.apache.lcf.agents.RegisterOutput | _classname_ _description_ | Register an output connector
class |
| org.apache.lcf.agents.UnRegisterOutput | _classname_ | Un-register an output connector class
|
| org.apache.lcf.agents.UnRegisterAllOutputs | None | Un-register all current output connector
classes |
| org.apache.lcf.agents.SynchronizeOutputs | None | Un-register all registered output connector
classes that can't be found |
| org.apache.lcf.agents.AgentRun | None | Main *agents* process class |
| org.apache.lcf.agents.AgentStop | None | Stops the running *agents* process |
|| Crawler Command Class || Arguments || Function ||
| org.apache.lcf.crawler.Register | _classname_ _description_ | Register a repository connector
class |
| org.apache.lcf.crawler.UnRegister | _classname_ | Un-register a repository connector class
|
| org.apache.lcf.crawler.UnRegisterAll | None | Un-register all repository connector classes
|
| org.apache.lcf.crawler.SynchronizeConnectors | None | Un-register all registered repository
connector classes that can't be found |
| org.apache.lcf.crawler.ExportConfiguration | _filename_ | Export crawler configuration to
a file |
| org.apache.lcf.crawler.ImportConfiguration | _filename_ | Import crawler configuration from
a file |
|| Authority Command Class || Arguments || Function ||
| org.apache.lcf.authorities.RegisterAuthority | _classname_ _description_ | Register an authority
connector class |
| org.apache.lcf.authorities.UnRegisterAuthority | _classname_ | Un-register an authority
connector class |
| org.apache.lcf.authorities.UnRegisterAllAuthorities | None | Un-register all authority connector
classes |
| org.apache.lcf.authorities.SynchronizeAuthorities | None | Un-register all registered authority
connector classes that can't be found |
Remember that you need to include all the jars under _modules/dist/processes_ in the classpath
whenever you run one of these commands! You also must include the corresponding -D switches,
as described earlier.
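On a Unix system, where the supplied .bat script does not apply, invoking a command class by hand might look like the sketch below. JAVA_HOME and LCF_HOME are the variables described earlier; ACF_JAVA_OPTS is a hypothetical variable holding the required -D switches, and the "jar/*" classpath wildcard assumes Java 6 or later. The agent and connector class names are taken from the initialization table on this page.

```shell
# acf_command <class> [args...]: run one ACF command class.
# With DRY_RUN set, the class and arguments are printed instead.
acf_command() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
    return 0
  fi
  # ACF_JAVA_OPTS is left unquoted on purpose so it can carry several switches.
  "$JAVA_HOME/bin/java" ${ACF_JAVA_OPTS:-} -cp "$LCF_HOME/processes/jar/*" "$@"
}

# init_acf_database: create the database, then register an agent and two connectors.
init_acf_database() {
  acf_command org.apache.lcf.core.DBCreate postgres postgres || return 1
  acf_command org.apache.lcf.agents.Install || return 1
  acf_command org.apache.lcf.agents.Register org.apache.lcf.crawler.system.CrawlerAgent || return 1
  acf_command org.apache.lcf.agents.RegisterOutput \
    org.apache.lcf.agents.output.nullconnector.NullConnector "Null Connector" || return 1
  acf_command org.apache.lcf.crawler.Register \
    org.apache.lcf.crawler.connectors.filesystem.FileConnector "Filesystem Connector"
}
```

Stopping at the first failure matters here, since each command depends on the ones before it having succeeded.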
h5. Initializing the database
These are some of the commands you will need to use to create the database instance, initialize
the schema, and register all of the appropriate components:
|| Command || Arguments ||
| org.apache.lcf.core.DBCreate | postgres postgres |
| org.apache.lcf.agents.Install | |
| org.apache.lcf.agents.Register | org.apache.lcf.crawler.system.CrawlerAgent |
| org.apache.lcf.agents.RegisterOutput | org.apache.lcf.agents.output.gts.GTSConnector "GTS
Connector" |
| org.apache.lcf.agents.RegisterOutput | org.apache.lcf.agents.output.solr.SolrConnector "SOLR
Connector" |
| org.apache.lcf.agents.RegisterOutput | org.apache.lcf.agents.output.nullconnector.NullConnector
"Null Connector" |
| org.apache.lcf.authorities.RegisterAuthority | org.apache.lcf.authorities.authorities.activedirectory.ActiveDirectoryAuthority
"Active Directory Authority" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.DCTM.DCTM "Documentum
Connector" |
| org.apache.lcf.authorities.RegisterAuthority | org.apache.lcf.crawler.authorities.DCTM.AuthorityConnector
"Documentum Authority" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.filenet.FilenetConnector
"FileNet Connector" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.filesystem.FileConnector
"Filesystem Connector" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.jdbc.JDBCConnector "Database
Connector" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.sharedrive.SharedDriveConnector
"Windows Share Connector" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.livelink.LivelinkConnector
"LiveLink Connector" |
| org.apache.lcf.authorities.RegisterAuthority | org.apache.lcf.crawler.connectors.livelink.LivelinkAuthority
"LiveLink Authority" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.memex.MemexConnector
"Memex Connector" |
| org.apache.lcf.authorities.RegisterAuthority | org.apache.lcf.crawler.connectors.memex.MemexAuthority
"Memex Authority" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.meridio.MeridioConnector
"Meridio Connector" |
| org.apache.lcf.authorities.RegisterAuthority | org.apache.lcf.crawler.connectors.meridio.MeridioAuthority
"Meridio Authority" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.rss.RSSConnector "RSS
Connector" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.sharepoint.SharePointRepository
"SharePoint Connector" |
| org.apache.lcf.crawler.Register | org.apache.lcf.crawler.connectors.webcrawler.WebcrawlerConnector
"Web Connector" |
h5. Deploying the *lcf-crawler-ui*, *lcf-authority-service*, and *lcf-api* web applications
If you built ACF using ant under the _modules_ directory, then the ant build will have constructed
three war files for you under _modules/dist/web_. Take these war files and deploy them as
web applications under one or more instances of your application server. There is no requirement
that the *lcf-crawler-ui*, *lcf-authority-service*, and *lcf-api* web applications be deployed
on the same instance of the application server. With the current architecture of ACF, they
must be deployed on the same physical server, however.
Under _modules/dist/web_, you may also see files that are not war files. These files are
meant to be used as command-line -D switches for the application server process. The switches
may or may not be identical for the different web applications, but they will never conflict. You
may need to alter environment variables or your application server startup scripts in order
to provide these switches. (More about this in the future...)
h5. Running the *agents* process
The *agents* process is the process that actually performs the crawling for ACF. Start this
process by running the command "org.apache.lcf.agents.AgentRun". This class will run until
stopped by invoking the command "org.apache.lcf.agents.AgentStop". It is highly recommended
that you stop the process in this way. You may also stop the process using a SIGTERM signal,
but "kill -9" or the equivalent is NOT recommended, because that may result in dangling locks
in the ACF synchronization directory. (If you have to, clean up these locks by shutting down
all ACF processes, including the application server instances that are running the web applications,
and invoking the command "org.apache.lcf.core.LockClean".)
h5. Running connector-specific processes
Connector-specific processes require the classpath for their invocation to include all the
jars that are in the corresponding _modules/dist/<process_name>-process_ directory.
The Documentum and FileNet connectors are the only two connectors that currently require
additional processes. Start these processes using the commands listed below, and stop them
with SIGTERM.
|| Connector || Process || Start class ||
| Documentum | documentum-server-process | org.apache.lcf.crawler.server.DCTM.DCTM |
| Documentum | documentum-registry-process | org.apache.lcf.crawler.registry.DCTM.DCTM |
| FileNet | filenet-server-process | org.apache.lcf.crawler.server.filenet.Filenet |
| FileNet | filenet-registry-process | org.apache.lcf.crawler.registry.filenet.Filenet |
h3. Running the ACF Apache2 plugin
The ACF Apache2 plugin, mod-authz-annotate, is designed to take an authenticated principal
(e.g. from mod-auth-kerb), and query a set of authority services for access tokens using an
HTTP request. These access tokens are then passed to a (not included) search engine UI, which
can use them to help compose a search that properly excludes content that the user is not
supposed to see.
The list of authority services so queried is configured in Apache's httpd.conf file. This
project includes only one such service: the java authority service, which uses authority connections
defined in the crawler UI to obtain appropriate access tokens.
In order for mod-authz-annotate to be used, it must be placed into Apache2's extensions directory,
and configured appropriately in the httpd.conf file.
Note: The ACF project now contains support for converting a Kerberos principal to a list of
Active Directory SIDs. This functionality is contained in the Active Directory Authority.
The following connectors are expected to make use of this authority:
* FileNet
* Meridio
* SharePoint
h5. Configuring the ACF Apache2 plugin
mod-authz-annotate understands the following httpd.conf commands:
|| Command || Meaning || Values ||
| AuthzAnnotateEnable | Turn on/off the plugin | "On", "Off" |
| AuthzAnnotateAuthority | Point to an authority service that supports ACL queries, but not
ID queries | The authority URL |
| AuthzAnnotateACLAuthority | Point to an authority service that supports ACL queries, but
not ID queries | The authority URL |
| AuthzAnnotateIDAuthority | Point to an authority service that supports ID queries, but not
ACL queries | The authority URL |
| AuthzAnnotateIDACLAuthority | Point to an authority service that supports both ACL queries
and ID queries | The authority URL |
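As a hedged illustration, the helper below prints two of these directives for a single authority service. The function name is hypothetical, and the sketch assumes the usual "Directive value" httpd.conf syntax. Loading the module and deciding where in httpd.conf the directives belong are not covered here.

```shell
# authz_annotate_conf <authority-url>: print directives for httpd.conf.
authz_annotate_conf() {
  printf 'AuthzAnnotateEnable On\n'
  printf 'AuthzAnnotateIDACLAuthority %s\n' "$1"
}
```

The argument would be the URL of your deployed *lcf-authority-service* web application.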