Ease of use

TML aims to help developers write applications that use TM techniques
without having to be experts in the area and without licensing problems (TML is released under the Apache License 2.0).
TML also aims to help researchers speed up their experiments by providing a platform
they can trust (validated using academic papers), so they can focus on their new ideas.

Scalability

One of the biggest problems in TM is that many algorithms are computationally expensive.
TML doesn't solve this problem; however, it tackles scalability by decoupling the most
complicated processes.

TML is integrated with Apache Lucene, a high-performance search engine, for fast
document indexing and corpus definition (the corpus is the set of documents you'll work on).
Lucene scales to very large document collections, and TML defines a corpus as a set of
search results, so document selection is fast.

TML has a parallel process that adds annotations on demand. For example, if you want to
use part-of-speech (POS) tags, you can run the annotator offline, and only when you know
the server has capacity to spare. This way TML always responds, and uses new data
as it becomes available.
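
The idea of annotating in the background while the library keeps responding can be sketched in plain Java. This is only an illustration under stated assumptions: the class AnnotationStore, its methods and the fakePosTag stand-in are hypothetical names invented for this example and are not part of TML's API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: none of these names come from TML.
class AnnotationStore {
    // Sentence id -> annotation (here, a POS-tagged string).
    private final Map<String, String> annotations = new ConcurrentHashMap<>();
    // A single background thread does the slow annotation work.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Queues a sentence for annotation and returns immediately.
    void annotateLater(String id, String sentence) {
        worker.submit(() -> annotations.put(id, fakePosTag(sentence)));
    }

    // Readers never block: the annotation is returned only if it already exists.
    Optional<String> lookup(String id) {
        return Optional.ofNullable(annotations.get(id));
    }

    // Stops accepting work and waits for queued annotations to finish.
    boolean shutdownAndWait(long seconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Stand-in for a real tagger (assumption): labels every token as "X".
    static String fakePosTag(String sentence) {
        StringBuilder sb = new StringBuilder();
        for (String token : sentence.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(token).append("/X");
        }
        return sb.toString();
    }
}
```

A caller that finds no annotation yet can fall back to the unannotated text and pick up the tags on a later request, which is the behaviour described above.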

Extensibility

New analyzers can be added to Lucene for different tokenizing, stemming, etc.

New term weighting schemes can be added to TML when building a VSM (term-doc matrix).

New annotators can be easily added to extract information from documents.

New factorisations can be added for LSA-style research.

New operations can be easily added to put them all together.
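
As an illustration of what a pluggable term weighting scheme for a term-doc matrix might look like, here is a minimal sketch of TF-IDF weighting in plain Java. The TermWeighting interface and the TfIdfWeighting class are hypothetical names chosen for this example; TML's real extension points may differ, so check the API docs before relying on this shape.

```java
// Hypothetical sketch: the interface and class names are not TML's API.
interface TermWeighting {
    // counts[t][d] is the raw frequency of term t in document d.
    double[][] apply(int[][] counts);
}

class TfIdfWeighting implements TermWeighting {
    @Override
    public double[][] apply(int[][] counts) {
        int terms = counts.length;
        int docs = terms == 0 ? 0 : counts[0].length;
        double[][] weighted = new double[terms][docs];
        for (int t = 0; t < terms; t++) {
            // Document frequency: number of documents that contain term t.
            int df = 0;
            for (int d = 0; d < docs; d++) {
                if (counts[t][d] > 0) df++;
            }
            if (df == 0) continue; // unused term, row stays at zero
            // Inverse document frequency with the natural logarithm.
            double idf = Math.log((double) docs / df);
            for (int d = 0; d < docs; d++) {
                weighted[t][d] = counts[t][d] * idf;
            }
        }
        return weighted;
    }
}
```

With this variant, a term that occurs in every document gets weight zero, while a term concentrated in few documents is boosted.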

Implemented operations

TML already implements several operations:

LSA-based distances between passages

Topic extraction and clustering

Automatic extraction of Concept Maps

It can create a semantic space from a corpus of documents, and use that space as background knowledge to calculate semantic distances within the same corpus or in a different one.
TML processes all documents at three levels: document, paragraph and sentence. This means that corpora can be created from whole documents, from their parts, or from a combination of both.

TML is built on top of Lucene, so it can use any Lucene search to create a corpus.
For example, you can build a corpus with all the sentences of all the documents that contain the word dog.
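
To make the "dog" example concrete, the following sketch builds such a sentence-level corpus in plain Java. It is a stand-in only: it uses neither Lucene nor TML, the class SentenceCorpus and its method are hypothetical, and the tokenising and sentence splitting are deliberately crude.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: a plain Java stand-in, not Lucene or TML code.
class SentenceCorpus {
    // Returns every sentence of every document that contains the given word.
    static List<String> sentencesOfDocsContaining(List<String> documents, String word) {
        String needle = word.toLowerCase(Locale.ROOT);
        List<String> corpus = new ArrayList<>();
        for (String doc : documents) {
            // Crude tokenisation: split on anything that is not a letter or digit.
            List<String> tokens =
                Arrays.asList(doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"));
            if (!tokens.contains(needle)) continue; // the "search" step
            // Crude sentence split: break after '.', '!' or '?' followed by whitespace.
            for (String sentence : doc.trim().split("(?<=[.!?])\\s+")) {
                if (!sentence.isEmpty()) corpus.add(sentence);
            }
        }
        return corpus;
    }
}
```

Note that the match is on whole tokens, so a document that only mentions "dogs" is not selected here; in a real setup, a Lucene analyzer with stemming would decide that.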

TML also uses grammatical information from the Stanford parser at the sentence level, so each sentence carries its own PennTree string.
This makes it possible to reconstruct the grammatical tree quickly in order to perform grammatical operations.
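
As a rough idea of how a tree can be rebuilt from a bracketed Penn Treebank string, here is a minimal recursive parser in plain Java. The PennTree class is a hypothetical helper written for this example; it is not the class TML or the Stanford parser uses, and it assumes well-formed input with whitespace-free labels and words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of TML or the Stanford parser.
class PennTree {
    final String label; // a category such as "NP", or the word itself for a leaf
    final List<PennTree> children = new ArrayList<>();

    PennTree(String label) { this.label = label; }

    boolean isLeaf() { return children.isEmpty(); }

    // Parses one bracketed tree such as "(S (NP (DT The) (NN dog)) (VP (VBZ barks)))".
    static PennTree parse(String s) {
        // Pad the parentheses so that splitting on whitespace yields clean tokens.
        String[] tokens = s.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
        int[] pos = {0};
        PennTree tree = parseNode(tokens, pos);
        if (pos[0] != tokens.length) {
            throw new IllegalArgumentException("Trailing tokens after tree");
        }
        return tree;
    }

    private static PennTree parseNode(String[] tokens, int[] pos) {
        if (pos[0] >= tokens.length) throw new IllegalArgumentException("Unexpected end");
        String tok = tokens[pos[0]++];
        if (tok.equals(")")) throw new IllegalArgumentException("Unexpected ')'");
        if (!tok.equals("(")) return new PennTree(tok); // a leaf word
        if (pos[0] >= tokens.length || tokens[pos[0]].equals("(") || tokens[pos[0]].equals(")")) {
            throw new IllegalArgumentException("Missing label after '('");
        }
        PennTree node = new PennTree(tokens[pos[0]++]); // the label follows "("
        while (pos[0] < tokens.length && !tokens[pos[0]].equals(")")) {
            node.children.add(parseNode(tokens, pos));
        }
        if (pos[0] >= tokens.length) throw new IllegalArgumentException("Unbalanced parentheses");
        pos[0]++; // consume ")"
        return node;
    }

    // Collects the leaf words from left to right.
    List<String> words() {
        List<String> out = new ArrayList<>();
        collect(this, out);
        return out;
    }

    private static void collect(PennTree node, List<String> out) {
        if (node.isLeaf()) {
            out.add(node.label);
        } else {
            for (PennTree child : node.children) collect(child, out);
        }
    }
}
```

Because the string is parsed in a single pass, no call to the parser is needed at query time; only the stored string is read.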

For a full list of the available operations, check the package tml.vectorspace.operations in the API docs.

Using TML from a Java program

To use TML from another Java program, you have to include TML in your classpath.
You can use the provided tml-xxx-core.jar, which does not include dependencies, to avoid conflicting jars and save disk space.