T3.1 - Prototype and integration

Implementation of a categorization tool prototype (UniZD) – The implementation of the prototype includes the following subtasks:

Several vector space generation algorithms will be implemented (e.g. token-based context vectors, lemma-based ones, or algorithms based on other features provided either by linguistic annotation and analysis components or by the metadata of documents). We cannot foresee which features will give the best results, so we may need to provide means of extracting specific document features that rely on linguistic pre-processing at various levels (e.g. lemmatization, morphological annotation).
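As an illustration only, the difference between a token-based and a lemma-based vector space can be sketched as follows. The sketch assumes raw count vectors and a toy lemma table (`TOY_LEMMAS`, a hypothetical stand-in for a real lemmatizer); the function names are ours and do not describe the prototype's actual interfaces.

```python
from collections import Counter
import re

# Hypothetical toy lemma table; a real pipeline would call a lemmatizer instead.
TOY_LEMMAS = {
    "documents": "document",
    "classified": "classify",
    "classifies": "classify",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(tokens):
    """Map each token to its lemma, falling back to the token itself."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def build_vector_space(docs, feature_fn):
    """Return (vocabulary, vectors); each vector holds raw feature counts."""
    counts = [Counter(feature_fn(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[feature] for feature in vocab] for c in counts]
    return vocab, vectors


docs = ["The tool classifies documents", "A classified document"]

# Token-based space: inflected forms are separate dimensions.
token_vocab, token_vecs = build_vector_space(docs, tokenize)

# Lemma-based space: inflected forms collapse onto one dimension.
lemma_vocab, lemma_vecs = build_vector_space(
    docs, lambda doc: lemmatize(tokenize(doc))
)
```

In this toy case the lemma-based space is smaller and the two documents share two dimensions, whereas in the token-based space they share none. Whether that helps categorization in practice is exactly what the experiments are meant to establish.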

Vector space compression/reduction algorithms will be implemented next, e.g. using covariation analysis over different features.
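One simple way to read "covariation analysis" is to drop features that vary almost in lockstep with a feature already kept. The sketch below assumes this interpretation, uses Pearson correlation, and assumes a threshold of 0.95; the reduction method actually chosen for the prototype may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def prune_correlated(vectors, threshold=0.95):
    """Return indices of columns kept after dropping near-duplicate features.

    A column is dropped when its absolute correlation with any column
    already kept reaches the (assumed) threshold.
    """
    columns = list(zip(*vectors))
    kept = []
    for j, column in enumerate(columns):
        if all(abs(pearson(column, columns[k])) < threshold for k in kept):
            kept.append(j)
    return kept


def project(vectors, kept):
    """Restrict every document vector to the kept feature indices."""
    return [[row[j] for j in kept] for row in vectors]


# Feature 1 is exactly twice feature 0, so it carries no extra information.
vectors = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]
kept = prune_correlated(vectors)
reduced = project(vectors, kept)
```

Here the second feature is removed and the space shrinks from three dimensions to two.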

Then, means for class specification will be provided (e.g. sample documents from one class, a selection of meta-information properties, or other feature sets), together with the calculation of similarity metrics for class-document distances.
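A minimal sketch of class specification by sample documents follows. It assumes that a class is represented by the centroid of its sample vectors and that cosine similarity serves as the class-document metric; both choices are illustrative assumptions, and the class names and vectors are invented for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (0.0 if either is a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_classes(doc_vector, class_samples):
    """Rank classes by similarity between the document and each class centroid."""
    scores = {
        name: cosine_similarity(doc_vector, centroid(samples))
        for name, samples in class_samples.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample documents per class, already mapped into a 3-feature space.
class_samples = {
    "sports": [[3, 0, 1], [2, 0, 0]],
    "finance": [[0, 4, 1], [0, 2, 1]],
}
ranking = rank_classes([1, 0, 0], class_samples)
```

The first entry of `ranking` is the closest class; a distance can be derived as one minus the similarity if a distance rather than a similarity is required.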

We will adopt existing implementations of kNN that make use of the vector space models to provide class suggestions (classifierless clustering of documents and similarity search).
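The plan is to reuse an existing kNN implementation; the stand-in below only illustrates how class suggestions could be derived from nearest neighbours in the vector space. It assumes Euclidean distance and a simple majority vote, and the labels and vectors are invented.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def suggest_classes(query, labelled_vectors, k=3):
    """Return (label, votes) pairs from the k nearest labelled documents."""
    neighbours = sorted(
        labelled_vectors, key=lambda item: euclidean(query, item[0])
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common()


# Invented labelled documents in a 2-feature space.
labelled = [
    ([1.0, 0.0], "A"),
    ([0.9, 0.1], "A"),
    ([0.0, 1.0], "B"),
    ([0.1, 0.9], "B"),
    ([0.8, 0.2], "A"),
]
suggestions = suggest_classes([1.0, 0.1], labelled, k=3)
```

The same neighbour list, without the vote, serves as a similarity search result for the query document.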

Next, we will experiment with existing SVM tools to improve on the above, and with Latent Semantic Analysis to obtain language-independent classifiers (abstract document property extraction and a similarity metric over the document matrix).

Technical review of the categorization tool (Tetracom) – Once a stable and fully functional version of the categorization tool has been produced, a team of qualified personnel from Tetracom will examine the suitability of the component for its intended use and will identify any discrepancies from specifications and standards. Tetracom will produce a technical review report, which may include change requests. The report will then be sent to the beneficiary responsible for the categorization tool, who will modify the tool if necessary and send it back to Tetracom for integration.

Integration of the categorization tool into ATLAS (Tetracom) – This task encompasses the integration of the categorization tool into the ATLAS platform and will be performed by Tetracom after the tool has been thoroughly tested by Atlantis as part of T7.2.