Overview

The main goal of this project is to design and implement a new standard framework for indexing, searching, and semantic analysis of vast amounts of structured and unstructured data.

Functional requirements

Use of standards

The guiding maxim of SMILA should be the use of standards. Since one of the major goals is to make SMILA as open as possible to other contributors and vendors, established and relevant emerging standards should be used whenever and wherever possible.

Componentization

SMILA has to be highly flexible in the way it handles its components. Contributors and software vendors must be able to easily extend or substitute connectors (agents and crawlers), services (e.g. search, text mining, data conversion and annotation) and, if desired, even core components.

Central management

One of the major requirements placed on SMILA is the ability to handle an enormous number of documents, i.e. data. Since SMILA will run on normal PC-nodes, this means that the cluster will contain a large number (hundreds!) of nodes. Without a central management feature the system would be completely unmanageable. Several management aspects have to be considered:

Software management

First of all, before the framework can even be configured and run, all SMILA-nodes have to be supplied with SMILA-components in some concrete version. This is best done by storing all components in a central repository and distributing them from there. To make that possible, there has to be some initial software deployment on each node, so that the node is able to register itself with the software repository and fetch the rest of its components from it. This can easily be achieved by copy deployment, i.e. each node initially has the same software components installed.

Important: SMILA will not provide any implementation of such a central software repository but only the necessary APIs.

Configuration management

After the node has been powered on, has registered itself (e.g. via broadcast) with the central management console and has received its components, the next step is to acquire its configuration from the configuration repository.
The configuration itself has two major parts. The first part contains information about the system topology. In other words, it answers the question: "Which other SMILA-nodes must I communicate with?" The second part contains information about the function of each component, i.e. the business logic configuration.

Important: SMILA will not provide any implementation of a central configuration repository but only the necessary APIs.
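
The two-part configuration and the "API only" boundary could be sketched as follows. This is a minimal illustration, not SMILA's actual API: the names NodeConfiguration, ConfigurationRepository, InMemoryRepository and fetch, as well as the node and component names, are assumptions made up for this example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeConfiguration:
    # Part 1: system topology, i.e. the other nodes this node must talk to.
    peers: List[str] = field(default_factory=list)
    # Part 2: business logic settings, keyed by (hypothetical) component name.
    components: Dict[str, Dict[str, str]] = field(default_factory=dict)


class ConfigurationRepository(ABC):
    """Hypothetical API only; a real repository implementation is out of scope."""

    @abstractmethod
    def fetch(self, node_id: str) -> NodeConfiguration:
        ...


class InMemoryRepository(ConfigurationRepository):
    """Stand-in implementation used purely for illustration."""

    def __init__(self, configs: Dict[str, NodeConfiguration]) -> None:
        self._configs = configs

    def fetch(self, node_id: str) -> NodeConfiguration:
        return self._configs[node_id]


repo = InMemoryRepository({
    "node-1": NodeConfiguration(
        peers=["node-2", "node-3"],
        components={"crawler": {"root": "/data"}},
    )
})
config = repo.fetch("node-1")
```

A vendor-supplied repository would then only need to implement the abstract interface, while the node code stays unchanged.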

Operation control

Once the components have been installed and configured on a node, the node is in the "ready for operation" state. After the management console has been informed about this, it can start the node via JMX/SNMP.

Important: SMILA will not provide any implementation of a central management console but only the necessary APIs.
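
The node lifecycle described in the management subsections can be read as a small state machine. The sketch below is one possible reading under stated assumptions: the state names, the Node class and its methods are hypothetical, and the start call stands in for whatever a management console would trigger via JMX/SNMP.

```python
from enum import Enum


class NodeState(Enum):
    POWERED_ON = 1
    REGISTERED = 2
    COMPONENTS_INSTALLED = 3
    READY_FOR_OPERATION = 4
    RUNNING = 5


# Setup steps a node passes through before it may be started.
_NEXT_SETUP_STATE = {
    NodeState.POWERED_ON: NodeState.REGISTERED,
    NodeState.REGISTERED: NodeState.COMPONENTS_INSTALLED,
    NodeState.COMPONENTS_INSTALLED: NodeState.READY_FOR_OPERATION,
}


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.state = NodeState.POWERED_ON

    def advance(self) -> NodeState:
        """Complete the next setup step (register, install, configure)."""
        if self.state not in _NEXT_SETUP_STATE:
            raise RuntimeError(f"no setup step left in state {self.state.name}")
        self.state = _NEXT_SETUP_STATE[self.state]
        return self.state

    def start(self) -> None:
        """Called by the (hypothetical) management console."""
        if self.state is not NodeState.READY_FOR_OPERATION:
            raise RuntimeError("node is not ready for operation")
        self.state = NodeState.RUNNING
```

Separating advance from start mirrors the text: the node prepares itself, but only the console engages it.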

Monitoring

During the whole operating time of the SMILA-system it must be possible to monitor its components and, depending on their status, take some action. Hence monitoring functionality should be provided.
There are basically two monitoring aspects. First, we need to monitor the operating status (health) of a component. Second, we also need to monitor the performance parameters of a component.
The monitoring should be realized by utilizing SNMP & JMX.
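
As a rough illustration of the two monitoring aspects, the sketch below combines a health status with performance metrics and derives an action from them. It does not use SNMP or JMX; the names Health, ComponentStatus and choose_action, the "queue_length" metric and the threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class ComponentStatus:
    name: str
    # Aspect 1: operating status (health) of the component.
    health: Health
    # Aspect 2: performance parameters, e.g. a hypothetical queue length.
    metrics: Dict[str, float] = field(default_factory=dict)


def choose_action(status: ComponentStatus, max_queue_length: float = 1000.0) -> str:
    """Map a component status to an action; the policy is an assumption."""
    if status.health is Health.FAILED:
        return "restart"
    overloaded = status.metrics.get("queue_length", 0.0) > max_queue_length
    if status.health is Health.DEGRADED or overloaded:
        return "alert"
    return "none"
```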

Security

There are several requirements related to security.
First, users generally have to authenticate themselves (unless they use a guest account) before they can access the data stored in SMILA.
Second, the access rights have to be transferred from the data sources into SMILA's indices and used there for authorization each time a user sends a query.
Third, SMILA also has to make sure that even an administrator cannot access confidential data. This should be achieved by encrypting all data stored in SMILA.
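
The second requirement, carrying access rights into the index and checking them on every query, might look like the following sketch. The IndexedDocument type, the allowed_principals field and the naive substring match are assumptions for illustration only; a real index and access model would be considerably richer.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class IndexedDocument:
    doc_id: str
    text: str
    # Access rights copied from the data source at indexing time.
    allowed_principals: FrozenSet[str]


def search(index: Iterable[IndexedDocument], term: str,
           principals: FrozenSet[str]) -> List[str]:
    """Return ids of matching documents the caller is authorized to see."""
    needle = term.lower()
    return [
        doc.doc_id
        for doc in index
        if needle in doc.text.lower()
        and not doc.allowed_principals.isdisjoint(principals)
    ]


index = [
    IndexedDocument("d1", "Quarterly report", frozenset({"finance"})),
    IndexedDocument("d2", "Public report", frozenset({"finance", "guest"})),
]
```

Here a guest searching for "report" would only see "d2", because the rights stored with "d1" do not include the guest principal.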

Preservation of processing information

No information may be lost accidentally in SMILA. For example, if some information enters SMILA and its processing fails for some reason (e.g. the whole node or just a process has crashed), then this information must not leave the framework without manual action by the administrator.
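
One way to picture this requirement is a queue that removes a record only after it has been handled successfully and otherwise parks it until an administrator acts. The sketch below is in-memory only, so it does not survive a node crash as a real implementation would have to; ProcessingQueue and its methods are hypothetical names.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


class ProcessingQueue:
    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self.pending: Deque[str] = deque()
        # Failed records are kept here together with the error description.
        self.failed: List[Tuple[str, str]] = []

    def submit(self, record: str) -> None:
        self.pending.append(record)

    def process_all(self) -> None:
        while self.pending:
            record = self.pending[0]  # peek; do not remove yet
            try:
                self._handler(record)
            except Exception as exc:
                # Park the record instead of dropping it.
                self.failed.append((record, repr(exc)))
            self.pending.popleft()

    def discard_failed(self, record: str, confirmed_by_admin: bool = False) -> None:
        """Failed records leave the system only after explicit confirmation."""
        if not confirmed_by_admin:
            raise PermissionError("manual administrator action required")
        self.failed = [item for item in self.failed if item[0] != record]
```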

Deployment flexibility

SMILA has to be designed for large enterprises. Therefore the main deployment scenario is some kind of cluster environment. Nevertheless, the framework must also run on a single node, be it just for development or demonstration purposes.

Implementation language neutrality

One aspect of SMILA's openness is the implementation language neutrality of connector and service components. SMILA has to enable contributors and software vendors to implement their components in programming languages other than Java.

Incremental index update

Search must not change the state

The search process must not change the state of the data stored in SMILA. This means that the indexing and search processes have to be completely separated from each other.

Limited bidirectional component communication

The possible spread of SMILA's components, running in separate processes, across several networks, and hence the existence of firewalls between them, has to be taken into account.

Important: The core components must be deployed in the same network and be able to communicate with each other without restriction.

Buffering of external information flow

One of the advanced features of SMILA should be the buffering of information transferred from the data sources. This feature will optimize the performance of SMILA by reducing unnecessary load caused by, e.g., several consecutive changes to the same document.
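
The load reduction from buffering can be illustrated by a buffer that keeps only the most recent change per document until it is flushed. This is a minimal sketch; ChangeBuffer, record and flush are hypothetical names, and "latest change wins" is an assumed policy that the requirement itself does not prescribe.

```python
from collections import OrderedDict
from typing import List, Tuple


class ChangeBuffer:
    def __init__(self) -> None:
        self._latest: "OrderedDict[str, str]" = OrderedDict()

    def record(self, doc_id: str, change: str) -> None:
        # A later change to the same document replaces the buffered one.
        self._latest[doc_id] = change
        self._latest.move_to_end(doc_id)

    def flush(self) -> List[Tuple[str, str]]:
        """Hand over one change per document and empty the buffer."""
        items = list(self._latest.items())
        self._latest.clear()
        return items


buffer = ChangeBuffer()
buffer.record("doc-a", "version 1")
buffer.record("doc-b", "version 1")
buffer.record("doc-a", "version 2")
changes = buffer.flush()
```

Three incoming changes result in only two documents being processed downstream.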

Mash up of data

Another advanced feature of SMILA should be the possibility to mash up existing data and thereby provide some new interesting information.

Reporting

A further advanced feature of SMILA should be reporting.

Backup

The ability to back up the system is highly important. The system must be designed to allow this.

Nonfunctional requirements

Deployment on inexpensive hardware

Hardware nodes used for the deployment of SMILA should not exceed the capabilities of a contemporary normal PC. More precisely: a 1 Gbit/s network adapter should be completely sufficient.

Scalability

The framework must be capable of handling huge amounts of data. The goal is to be able to deal with one billion documents and more.

Reliability

Careful deployment, planning and configuration of SMILA, e.g. by avoiding single points of failure, must ensure that the operation of SMILA is not interrupted if some of its core components suddenly become unavailable.

Robustness

A misbehaving component that takes 100% of CPU time or uses large amounts of memory must not affect the overall stability of the framework.

Data consistency

Persisted application data must be consistent at all times. No matter what happens (power outage, loss of all network connectivity, total hardware failure, or a crash of all instances of a service), the data stored in the framework must not be corrupted.
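
For the narrow case of a single file on a local disk, a common technique for avoiding half-written data is to write to a temporary file and then atomically rename it over the target. The sketch below shows that technique only; it is not a statement about how SMILA persists data, and atomic_write is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at path with data, or leave the old content intact."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same file system)
    # so that the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader of the file therefore sees either the old or the new content, never a mixture of both.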

Live component upgrade

During normal system operation it must be possible to upgrade components gradually. In other words, the system must not have to be shut down, even for a complete upgrade. Instead, it should be possible to update the system over an extended time window and in an asynchronous manner. This is of course only possible if third-party components, e.g. the Queue-Server or the BPEL-Engine, offer the same functionality. Otherwise this break in the upgrade chain must be clearly documented.

Hint: Databases used by SMILA's persistence layer are not considered SMILA components and are therefore not subject to this requirement.

Important: The only constraint for this use case is that the system is being upgraded to a new maintenance release. In other words: bug fixes only, no new features, upward and backward API compatibility.
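
The maintenance-release constraint could be checked mechanically before a live upgrade is allowed. The sketch below assumes a three-part "major.minor.patch" version scheme in which a maintenance release changes only the last part; the source does not define SMILA's version numbering, so this scheme and both function names are assumptions.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse an assumed 'major.minor.patch' version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_maintenance_upgrade(current: str, target: str) -> bool:
    """True if target differs from current only by a higher patch level."""
    cur = parse_version(current)
    new = parse_version(target)
    return cur[:2] == new[:2] and new[2] > cur[2]
```

Under this assumption, going from 1.2.3 to 1.2.4 would qualify for a live upgrade, while going to 1.3.0 would not.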

Copy deployment

Adding new hardware nodes to a SMILA-cluster must be as simple as possible. The best way to achieve this simplicity is to design SMILA so that at least its basic components can be installed on a target node by "copy deployment", i.e. by simply copying (parts of) SMILA to the new machine's hard drive.

High indexing throughput

The performance of indexing a data source should be limited only by the available hardware capacity. The framework itself must guarantee high data throughput by accessing external data sources in parallel and by running multiple instances of its processing components.
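
Parallel access to external data sources can be sketched with a thread pool, which suits I/O-bound fetching. The function crawl_all, the fetch callback and the worker count are assumptions for illustration; they say nothing about how SMILA schedules its connectors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def crawl_all(sources: Iterable[str], fetch: Callable[[str], List[str]],
              max_workers: int = 4) -> Dict[str, List[str]]:
    """Fetch every source concurrently and return the results per source."""
    source_list = list(sources)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with sources.
        results = list(pool.map(fetch, source_list))
    return dict(zip(source_list, results))


def fake_fetch(source: str) -> List[str]:
    # Stand-in for a real connector; returns two made-up document ids.
    return [f"{source}/doc1", f"{source}/doc2"]


documents = crawl_all(["fileshare", "wiki"], fake_fetch)
```

Raising max_workers, or running several such crawlers on different nodes, corresponds to "multiplying" processing components in the sense of the requirement.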

Community and Partner readiness

To reduce the effort required to use SMILA, steps must be taken to improve community and partner readiness. Documentation of best practices and use case recommendations should be part of SMILA's distribution.

Ease of use

Ease of use is an important aspect of the system. The number of technologies that a person must know in order to take part in the development process must be kept to a minimum. We do not want to overwhelm potential contributors with a plethora of new technologies and discourage them from getting involved in our project. Furthermore, simple deployment and operation of a single-node installation must be supported.