Purpose

This manual describes how to train customized word2vec models.
A general purpose model is already trained and ships with the Query Suggestion API (see section "Local Installation"). To replace it with a customized one, edit the property file "WEB-INF\classes\application.properties" in the web archive and set the parameter "corpus.location" to the file path of your training result.
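
After editing the property file, it can help to confirm which value is actually picked up. The following is a minimal sketch, assuming the file is a standard Java properties file readable with `java.util.Properties`; the class name `CorpusLocationCheck` and the default path are hypothetical and not part of the project.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper, not part of the LAILAPS code base.
public class CorpusLocationCheck {

    /** Returns the value of "corpus.location", or null if the key is absent. */
    static String readCorpusLocation(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("corpus.location");
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path; pass the real location of the unpacked file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "WEB-INF/classes/application.properties");
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            System.out.println("corpus.location = " + readCorpusLocation(reader));
        }
    }
}
```

Note that `java.util.Properties` treats a backslash as an escape character, so a Windows path in the value is usually written with forward slashes or doubled backslashes.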

The code is written in Java, and the project is built with Maven.

Build the executables

The Maven build compiles the source code and packages the compiled binaries into
two archives in the subfolder "target":

lailapssuggestion.war: the RESTful web service.
This Java web application archive can be deployed in any Java EE container.

LAILAPS-QSM.jar: a bundle of tools that generate the word2vec model used by the RESTful service.

Retrieve a life science text corpus

The default text corpus is built from abstracts downloaded from PubMed. For copyright reasons, only the titles and abstracts of the articles can be downloaded. We strongly recommend downloading in batches. For example, to download the articles published in 2016, enter "("2016"[Date - Create] : "2016"[Date - Create])" in the search box and click the "Search" button. On the left side of the search result page,
click "Abstract" to restrict the download to the title and abstract of each article. Finally, click "Send to->File->Abstract(text)" on the search result page.

We provide a program that reformats the downloaded text; see "Command line tool" below.

Command line tool

To train a word2vec model from a text corpus, run the tools bundled in the Java archive LAILAPS-QSM.jar.

(1) DataExtract

This tool tokenizes all text documents in the input folder into a single text corpus. It supports PubMed abstracts and plain lists of text documents in which each line comprises one document. Please remove the newlines inside each document before you compile documents into the container format below:

Bioremediation of vegetable and agrowastes by Pleurotus ostreatus: a novel strategy to produce edible mushroom with enhanced yield and nutrition.
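
Removing the newlines inside a document can be sketched as follows. This is only an illustration of the one-document-per-line format, not the tool's own implementation; the class name `OneDocumentPerLine` and the sample strings are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of LAILAPS-QSM.jar.
public class OneDocumentPerLine {

    /** Collapses every run of whitespace, including line breaks, into a single space. */
    static String toSingleLine(String document) {
        return document.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Two sample documents that still contain line breaks.
        List<String> documents = Arrays.asList(
            "Bioremediation of vegetable and agrowastes\nby Pleurotus ostreatus:\r\na novel strategy.",
            "A second,\nmade-up document.");

        // One document per line: flatten each document, then join with newlines.
        String corpus = documents.stream()
            .map(OneDocumentPerLine::toSingleLine)
            .collect(Collectors.joining("\n"));
        System.out.println(corpus);
    }
}
```

A file written this way should be usable with the "one document per line" input format described in the parameter list.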

The command line parameters are:

* `-i`: The path of the input folder. All files in this folder are read and must be of the same type.
* `-o`: The path of the output folder. The program writes the corpus file, named corpus.txt, into this folder.
* `-f`: The format of the input files: 0 for the PubMed text format, 1 for other formats (one document per line). For PubMed abstracts, use 0.
* For example: `java -cp LAILAPS-QSM.jar -Xms2048M -Xmx4096M de.ipk_gatersleben.data.DataExtract -i /data/text -o /data/output -f 0`

(2) Word2Phrase

This tool extends the text corpus with phrases. A phrase is a group of words that functions as a constituent in the syntax of a sentence, that is, a single unit within a grammatical hierarchy. Examples from the life sciences are "heading date" and "flowering time". The command line parameters are:

License

Copyright (c) 2017 Leibniz Institute of Plant Genetics and Crop Plant
Research (IPK), Gatersleben, Germany.
All rights reserved. This program and the accompanying materials
are made available under the terms of the GNU General Public License,
version 2 which accompanies this distribution, and is available at
https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html (C)

This library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
General Public License for more details.