Within patents, biological sequences must be presented in a structured format (INSDC XML, www.insdc.org). Biological sequences disclosed in patents are frequently poorly annotated and need to be re-submitted, which creates a burden for the applicant, the EPO and the scientific community at large.

The aim is to integrate an online sequence submission tool and expert system into the EPO's epoline environment. (EPO Online Services have been designed to allow applicants, attorneys and other users to conduct their business with the European Patent Office electronically in a state-of-the-art secure environment, protected by smart card or username/password access.) We will first provide detailed specifications and requirements (including integration into the EPO's production environment) and report those as D13.1. The web-based expert system will enable the submission of sequences to a dedicated secure server. Verification of the data will be interactive and immediate. EMBL's verification criteria will apply as far as possible. The final product will be delivered after 36 months (D13.04).
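As an illustration of what "interactive and immediate" verification could look like, the following minimal sketch checks a submitted nucleotide sequence and returns human-readable problems straight away. The function name `check_sequence`, the two checks and the `min_length` threshold are assumptions made for this example; they are not EMBL's actual verification criteria, which are not specified here. Only the IUPAC nucleotide alphabet is taken as given.

```python
from typing import List

# Standard IUPAC nucleotide codes (including ambiguity codes).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def check_sequence(seq: str, min_length: int = 10) -> List[str]:
    """Return a list of problems found in a nucleotide sequence.

    An empty list means the sequence passed these illustrative checks.
    The checks and the default threshold are hypothetical.
    """
    problems: List[str] = []
    # Ignore whitespace and case so pasted, line-wrapped input is accepted.
    cleaned = "".join(seq.split()).upper()
    if len(cleaned) < min_length:
        problems.append(
            f"sequence has {len(cleaned)} residues; at least {min_length} expected"
        )
    unexpected = sorted({c for c in cleaned if c not in IUPAC_NUCLEOTIDES})
    if unexpected:
        problems.append("unexpected symbols: " + ", ".join(unexpected))
    return problems
```

Returning a list of messages rather than raising on the first error lets a web form show every problem to the applicant in one round trip.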

Task 2: Text mining, data extraction and database population

The second main development assignment will continue the text-mining work begun during Felics. Within Felics, the EPO's ambitious aim was to extract the names of chemical compounds disclosed in patents. We will continue this work and extend it to the extraction of chemical compounds disclosed as TIFF images. Resolved compounds will populate ChEBI (www.ebi.ac.uk/Chebi). The extraction algorithms will be enhanced mainly by:

Improving OCR: enhance the OCR output using an open-source tool such as CAPTCHA, so that post-processing is improved. (For instance, IUPAC names are long and conform to a grammar, so OCR errors in them can be corrected.) We will access error probabilities/confidence values from within the OCR framework to make the data to be processed more accurate.

Developing OSRA (http://cactus.nci.nih.gov/osra/), a tool designed to convert graphical representations of chemical structures, as they appear in journal articles and patent documents, into machine-readable formats such as SMILES. It needs algorithmic enhancements and better software packaging.

For this task we will use parallel development of the three environments. The first specifications and initial testing will be reported in D13.02. The final delivery will occur at month 36 (D13.05).
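The OCR post-processing idea described for this task (using the grammar of IUPAC names together with per-character confidence values) can be sketched as follows. This is a toy illustration under stated assumptions: the fragment vocabulary `FRAGMENTS`, the confusion table `CONFUSIONS`, the confidence threshold and the function names are all hypothetical, the "grammar" is reduced to a small list of name fragments plus locants, and only a single character substitution is attempted.

```python
from typing import List, Optional, Tuple

# Hypothetical toy vocabulary standing in for a real IUPAC name grammar.
FRAGMENTS = ("methyl", "ethyl", "chloro", "benzene", "propan", "ol", "di", "tri")
LOCANT_CHARS = set("0123456789,-")
# Hypothetical table of character pairs that OCR commonly confuses.
CONFUSIONS = {"1": "l", "l": "1", "0": "o", "o": "0", "5": "s", "s": "5"}


def parses(name: str) -> bool:
    """True if the name splits completely into known fragments and locants."""
    reachable = [True] + [False] * len(name)
    for i in range(len(name)):
        if not reachable[i]:
            continue
        if name[i] in LOCANT_CHARS:
            reachable[i + 1] = True
        for fragment in FRAGMENTS:
            if name.startswith(fragment, i):
                reachable[i + len(fragment)] = True
    return reachable[-1]


def correct(chars: List[Tuple[str, float]], threshold: float = 0.8) -> Optional[str]:
    """Try to repair an OCR'd name given (character, confidence) pairs.

    Returns the original text if it already parses, a repaired text if one
    substitution at a low-confidence position makes it parse, else None.
    """
    text = "".join(c for c, _ in chars)
    if parses(text):
        return text
    # Visit low-confidence characters first, least confident first.
    suspects = sorted(
        (conf, i)
        for i, (c, conf) in enumerate(chars)
        if conf < threshold and c in CONFUSIONS
    )
    for _, i in suspects:
        candidate = text[:i] + CONFUSIONS[text[i]] + text[i + 1:]
        if parses(candidate):
            return candidate
    return None
```

Restricting substitutions to low-confidence positions keeps the search small and avoids "correcting" characters the OCR engine was sure about, such as genuine locant digits.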

Task 3: Cross referencing

In their descriptions, patents contain cross-references to scientific articles. Again using text mining, we aim to extract relevant cross-references to prior-art literature and establish hyperlinks to those papers, to be delivered as D13.06 (shared with WP14). Extracted information will also populate a database of cross-references. Finally, we will continue to apply text-mining techniques to enrich sequence annotations.

We will assess the best text-mining tools, such as pattern matching aided by machine-learning methods (e.g. hidden Markov models or support vector machines) and NLP (natural language processing), including entity identification (using a dictionary-based approach). We anticipate using a combination of these techniques to optimize the outcome. The strategy will be reported in D13.03.
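To make the pattern-matching component concrete, here is a minimal sketch that extracts literature citations of the form "Author et al., Journal volume:pages (year)" from patent text. The regular expression, the field names and the function `find_citations` are assumptions chosen for illustration; real patent citations vary far more widely, which is why the text above anticipates combining pattern matching with machine learning and dictionary-based methods.

```python
import re
from typing import Dict, List, Optional

# Hypothetical pattern for one common citation style only.
CITATION = re.compile(
    r"(?P<author>[A-Z][A-Za-z'-]+) et al\.,?\s+"
    r"(?P<journal>[A-Z][A-Za-z. ]+?)\s+"
    r"(?P<volume>\d+)\s*[:,]\s*"
    r"(?P<first_page>\d+)(?:\s*-\s*(?P<last_page>\d+))?\s*"
    r"\((?P<year>(?:19|20)\d{2})\)"
)


def find_citations(text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict of captured fields per citation found in the text."""
    return [match.groupdict() for match in CITATION.finditer(text)]


sample = (
    "Humanised antibodies were prepared as described by "
    "Jones et al., Nature 321:522-525 (1986) and "
    "Riechmann et al., Nature 332: 323-327 (1988)."
)
citations = find_citations(sample)
```

Each extracted record (author, journal, volume, pages, year) could then be looked up in a bibliographic database to resolve the hyperlink; that lookup step is outside this sketch.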

Detecting relevant information will require semi-automatic, iterative analysis of a well-annotated set of publications (the test set) and comparison with the outcomes of the automated methods specifically developed for patent literature.

The result will consist of detailed annotations of both sequences and patent texts, revealing new relationships between sets of patents, biological sequences and literature.

This information will then be incorporated into publicly available databases. Quality will be measured by the average increase in the number of annotations per entry in the sequence database and by the average number of newly created hyperlinks.
The cross-links will enhance the service functionality in WP4.
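The two quality measures named above can be computed as in the following minimal sketch. It assumes that annotation counts are available per database entry before and after enrichment, and that "average number of newly created hyperlinks" means the average per entry; both the data layout and that interpretation are assumptions, as are the function names and the example accession identifiers.

```python
from typing import Dict


def average_annotation_gain(before: Dict[str, int], after: Dict[str, int]) -> float:
    """Mean increase in annotations per entry, over entries present in both."""
    entries = before.keys() & after.keys()
    if not entries:
        return 0.0
    return sum(after[e] - before[e] for e in entries) / len(entries)


def average_new_hyperlinks(new_links: Dict[str, int]) -> float:
    """Mean number of newly created hyperlinks per entry (assumed unit)."""
    if not new_links:
        return 0.0
    return sum(new_links.values()) / len(new_links)


# Hypothetical counts for two sequence entries.
annotations_before = {"entry_A": 2, "entry_B": 0}
annotations_after = {"entry_A": 5, "entry_B": 1}
gain = average_annotation_gain(annotations_before, annotations_after)
links = average_new_hyperlinks({"entry_A": 4, "entry_B": 2})
```

Here the gains are 3 and 1 annotations, giving an average gain of 2.0, and the average number of new hyperlinks is 3.0.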

Note:
1. This work will require regular visits to the EBI from the EPO – we anticipate six per year.

2. EPO policy is to lease computers for externally funded tasks of this nature.

3. Software licences for software on the leased computers will need to be acquired for the project.

The SLING project is funded by the European Commission within Research Infrastructures of the FP7 Capacities Specific Programme, grant agreement number 226073 (Integrating Activity)