Introduction

To access and utilize the rich information contained in biomedical literature, it is crucial to recognize and normalize the gene mentions that appear in it. In this paper, we focus on improving the accuracy of gene normalization in cases where species information is not provided. Gene names are often ambiguous because the same name can refer to genes of many species, which makes gene normalization a difficult challenge.

We define "gene normalization" (GN) as a series of tasks covering three issues: gene name recognition, species assignation and species-specific gene normalization. Each issue can affect overall performance, but the most important is species assignation, because correctly identifying the species decreases the ambiguity among orthologous genes. We propose an integrated method that handles these three issues with three modules: the gene name recognition (GNR) module, the species assignation (SA) module and the species-specific gene normalization (SGN) module.
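To make the division of labour among the three modules concrete, the following is a minimal sketch of how a GNR, SA and SGN stage could be chained. It is not the implementation described in this paper: the lexicon entries, the identifiers (`TOY:...`), the species cue words and all function names are invented for illustration, and dictionary lookup stands in for whatever recognition and disambiguation techniques each module actually uses.

```python
from collections import Counter

# Hypothetical toy thesaurus: (gene name, species) -> identifier.
# Every entry and identifier here is made up for illustration.
GENE_LEXICON = {
    ("p53", "human"): "TOY:H1",
    ("p53", "mouse"): "TOY:M1",
    ("brca1", "human"): "TOY:H2",
}

# Hypothetical cue words that hint at a species in the surrounding text.
SPECIES_CUES = {
    "human": "human", "patient": "human", "patients": "human",
    "mouse": "mouse", "mice": "mouse", "murine": "mouse",
}


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip(".,;:()").lower() for t in text.split()]


def recognize_gene_names(text):
    """GNR stage (sketch): keep tokens that match a gene name in the lexicon."""
    known_names = {name for name, _species in GENE_LEXICON}
    return [t for t in tokenize(text) if t in known_names]


def assign_species(text):
    """SA stage (sketch): pick the species with the most cue words, or None."""
    counts = Counter(SPECIES_CUES[t] for t in tokenize(text) if t in SPECIES_CUES)
    return counts.most_common(1)[0][0] if counts else None


def normalize(text):
    """SGN stage (sketch): map each recognized name to a species-specific ID."""
    species = assign_species(text)
    gene_ids = []
    for name in recognize_gene_names(text):
        gene_id = GENE_LEXICON.get((name, species))
        if gene_id is not None and gene_id not in gene_ids:
            gene_ids.append(gene_id)
    return gene_ids


print(normalize("Expression of p53 was reduced in mice."))
```

In this toy setting the same name, p53, resolves to a different identifier depending on the species inferred from context, and to no identifier at all when the text contains no species cue, which is the ambiguity that species assignation is meant to remove.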

Figure 1. Architecture of the gene normalization method.

We participated in the GN task of the BioCreaTive III (BC3) competition by adopting an integrated method based on our previous work to handle intra-species gene ambiguity. The results demonstrated that our method worked well, ranking at the top level of performance among all teams. Our proposed method makes full use of the gene and species information available in context and of a gene/species thesaurus.
In experiments, the proposed model attained the top threshold average precision (TAP-k) scores of 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20) when tested against 50 articles that had been selected for their difficulty and for having the most divergent results among pooled team submissions. The second highest scores were obtained with the full test set of articles (TAP-k score of 0.4591 for k=5, 10).

Table 1. Performance statistics on the BC3 test and training data

Corpus               Data set               TAP-5   TAP-10  TAP-20  Precision  Recall  F-measure
test data (1st run)  50 (gold standard)     0.3254  0.3538  0.3535  53.85%     39.44%  45.53%
test data (2nd run)  50 (gold standard)     0.3216  0.3435  0.3435  55.54%     39.07%  45.87%
test data (3rd run)  50 (gold standard)     0.3297  0.3514  0.3514  56.23%     39.72%  46.56%
test data (1st run)  50 (silver standard)   0.3567  0.3600  0.3600  58.94%     38.95%  46.90%
test data (2nd run)  50 (silver standard)   0.3291  0.3291  0.3291  58.60%     37.64%  45.84%
test data (3rd run)  50 (silver standard)   0.3382  0.3382  0.3382  59.46%     38.35%  46.62%
test data (1st run)  507 (silver standard)  0.4591  0.4591  0.4591  71.79%     44.69%  55.09%
test data (2nd run)  507 (silver standard)  0.4323  0.4323  0.4323  72.08%     42.70%  53.64%
test data (3rd run)  507 (silver standard)  0.4327  0.4327  0.4327  72.41%     42.82%  53.82%
Training data        32 (gold standard)     0.4703  0.4969  0.4969  63.82%     67.71%  65.70%
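As a quick consistency check on Table 1, the F-measure column can be recomputed from the precision and recall columns. The sketch below assumes the standard balanced F-measure (the harmonic mean of precision and recall), which the text does not state explicitly; the function name `f_measure` is ours. Because the published precision and recall are themselves rounded, a recomputed value may differ from the table in the last digit for some rows.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Both arguments must use the same scale (here, percentages).
    Assumption: Table 1 reports this balanced variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First test run against the 50-article gold standard (Table 1).
print(round(f_measure(53.85, 39.44), 2))  # 45.53
```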

Table 2. Performance on the gene normalization task of the top 4 performing teams in the BioCreaTive III competition