Evaluation service

@inproceedings{seedev2016overview,
title={Overview of the Regulatory Network of Plant Seed Development (SeeDev) Task at the BioNLP Shared Task 2016},
author={Chaix, Estelle and Dubreucq, Bertrand and Fatihi, Abdelhak and Valsamou, Dialekti and Bossy, Robert and Ba, Mouhamadou and Del{\'e}ger, Louise and Zweigenbaum, Pierre and Bessi{\`e}res, Philippe and Lepiniec, Lo{\"i}c and others},
booktitle={Proceedings of the 4th BioNLP Shared Task Workshop},
pages={1--11},
year={2016}
}

General evaluation algorithm

The evaluation is performed in three steps:

1. Pairing of reference and predicted annotations.

2. Filtering of reference-prediction pairs.

3. Computation of measures.

There may be additional filtering or rearrangement steps in order to accommodate a specific sub-task or to compute alternate scores. The description of each task details the specifics of the evaluation.

Pairing

The pairing step associates each reference annotation with the best matching predicted annotation. The criterion for "best matching" is a similarity function Sp that, given a reference annotation and a predicted annotation, yields a real value between 0 and 1. The algorithm selects the pairing that maximizes the sum of Sp over all pairs, under the constraint that no pair has an Sp equal to zero. Sp is specific to the task; refer to the description of the evaluation of each sub-task for the specification of Sp.
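The pairing step can be sketched as an exhaustive search in Python. The annotation objects and the Sp function are placeholders, and a production scorer would solve this as an assignment problem (for example with the Hungarian method) rather than by brute force:

```python
def best_pairing(refs, preds, sp):
    """Return the list of (reference_index, prediction_index) pairs that
    maximizes the sum of Sp over all pairs, never forming a pair whose
    Sp is zero.  Exhaustive search, for illustration only."""
    def solve(i, used):
        if i == len(refs):
            return 0.0, []
        # Option 1: leave refs[i] unpaired (it becomes a False Negative).
        best_score, best_pairs = solve(i + 1, used)
        # Option 2: pair refs[i] with any unused prediction with Sp > 0.
        for j in range(len(preds)):
            if j in used:
                continue
            s = sp(refs[i], preds[j])
            if s > 0.0:
                sub_score, sub_pairs = solve(i + 1, used | {j})
                if s + sub_score > best_score:
                    best_score = s + sub_score
                    best_pairs = [(i, j)] + sub_pairs
        return best_score, best_pairs

    return solve(0, frozenset())[1]
```

With an exact-match Sp, for instance, the references ["A", "B"] and the predictions ["B", "A"] are paired crosswise, and an annotation with no counterpart of nonzero similarity stays unpaired.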

A pair where Sp equals 1 is called a True Positive, or a Match.

A pair where Sp is below 1 is called a Partial Match, or a Substitution.

A reference annotation that has not been paired is called a False Negative, or a Deletion.

A predicted annotation that has not been paired is called a False Positive, or an Insertion.
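Given a pairing, the four categories above can be derived directly. The following sketch assumes the same placeholder annotation objects and Sp function as the pairing step:

```python
def categorize(refs, preds, pairs, sp):
    """Split annotations into Matches (True Positives), Substitutions
    (Partial Matches), Deletions (False Negatives) and Insertions
    (False Positives) according to a given pairing."""
    matches = [(i, j) for i, j in pairs if sp(refs[i], preds[j]) == 1.0]
    substitutions = [(i, j) for i, j in pairs
                     if 0.0 < sp(refs[i], preds[j]) < 1.0]
    paired_refs = {i for i, _ in pairs}
    paired_preds = {j for _, j in pairs}
    # Unpaired reference annotations are Deletions,
    # unpaired predicted annotations are Insertions.
    deletions = [i for i in range(len(refs)) if i not in paired_refs]
    insertions = [j for j in range(len(preds)) if j not in paired_preds]
    return matches, substitutions, deletions, insertions
```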

Filtering

The filtering step selects a subset of reference-prediction pairs, from which the scores will be computed. In all sub-tasks the main score is computed on all pairs, without any filter applied. Filtering is used to compute alternate scores in order to assess the strengths and weaknesses of a prediction. One typical use of filtering is to distinguish the performance on different annotation types.
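A per-type filter might look like the following sketch, where reference annotations are assumed to be dictionaries carrying a "type" field (an assumed representation, not the service's actual data model):

```python
def filter_by_type(pairs, refs, annotation_type):
    """Keep only the pairs whose reference annotation has the given
    type, so that alternate scores can be computed per type."""
    return [(i, j) for i, j in pairs
            if refs[i]["type"] == annotation_type]
```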

Measures

Measures are scores computed from the reference-prediction annotation pairs after filtering. They may count False Positives, False Negatives, Matches, and Partial Matches, or aggregate these counts into scores such as Recall, Precision, F1, or Slot Error Rate.

Each sub-task has a different set of measures. Participants are ranked by the first measure.
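As a sketch, the aggregate measures can be computed from the four counts. This assumes the common convention that only full Matches count as true positives and that Substitutions, Deletions and Insertions each count as one error in the Slot Error Rate; the service's exact definitions may differ:

```python
def measures(n_match, n_subst, n_del, n_ins):
    """Compute Recall, Precision, F1 and Slot Error Rate from the
    counts of Matches, Substitutions, Deletions and Insertions."""
    n_ref = n_match + n_subst + n_del    # total reference annotations
    n_pred = n_match + n_subst + n_ins   # total predicted annotations
    recall = n_match / n_ref if n_ref else 0.0
    precision = n_match / n_pred if n_pred else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Slot Error Rate: errors normalized by the number of references
    # (some variants weight substitutions by 0.5 instead of 1).
    ser = (n_subst + n_del + n_ins) / n_ref if n_ref else 0.0
    return recall, precision, f1, ser
```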

Sub-task evaluations

SeeDev-binary

The pairing similarity function of SeeDev-binary is defined as:

If the reference and predicted events have the same type and if the two arguments are the same, then Sp = 1; otherwise Sp = 0.

The submissions are evaluated with Recall, Precision and F1. Note that events of the types Is_Linked_To, Has_Sequence_Identical_To and Is_Functionally_Equivalent_To are considered commutative: their two arguments can be swapped. Events of all other types are not commutative.

Alternate scores are provided for each event type.
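The binary pairing function can be sketched as follows, modeling an event as a (type, argument1, argument2) tuple; this representation is an assumption for illustration, not the service's actual data model:

```python
# Event types whose two arguments may be swapped without penalty.
COMMUTATIVE_TYPES = {
    "Is_Linked_To",
    "Has_Sequence_Identical_To",
    "Is_Functionally_Equivalent_To",
}

def sp_binary(ref, pred):
    """SeeDev-binary pairing similarity: 1 when the events have the
    same type and the same two arguments (in either order for
    commutative types), 0 otherwise."""
    r_type, r_a1, r_a2 = ref
    p_type, p_a1, p_a2 = pred
    if r_type != p_type:
        return 0.0
    if (r_a1, r_a2) == (p_a1, p_a2):
        return 1.0
    if r_type in COMMUTATIVE_TYPES and (r_a1, r_a2) == (p_a2, p_a1):
        return 1.0
    return 0.0
```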

SeeDev-full

The pairing similarity function of SeeDev-full is derived from SeeDev-binary; additionally, it allows for mistakes in the optional arguments:

Sp = Sp-binary × SNeg × SOpt

Where Sp-binary is the pairing function of SeeDev-binary described above. Therefore, in order to be paired, a reference and a predicted event must have the same type and the same mandatory arguments.

SNeg is the negation similarity:

If both the reference and predicted events are negated, or if neither is negated, then SNeg = 1; otherwise SNeg = 0.5.

If the predicted event is negated where the reference event is not (or vice versa), then SNeg applies a penalty that halves the score.
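The negation similarity reduces to a single comparison of two flags (a sketch; the flag representation is assumed):

```python
def s_neg(ref_negated, pred_negated):
    """Negation similarity: 1 when the reference and the prediction
    agree on negation, 0.5 when one is negated and the other is not."""
    return 1.0 if ref_negated == pred_negated else 0.5
```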

SOpt is the similarity for optional arguments:

SOpt = 1 - (EOpt / N)

EOpt is the number of errors in the optional arguments, and N is the cardinality of the union of all optional arguments in the reference and predicted events. The errors are counted as follows:

A missing optional argument counts as 1 error.

An extra optional argument counts as 1 error.

A wrong optional argument counts as 2 errors.
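One way to realize this count is to represent the optional arguments of an event as a set of (role, argument) pairs; this representation is an assumption for illustration. A pair present only in the reference is then a missing argument (1 error), a pair present only in the prediction is an extra argument (1 error), and a role carried by both sides with different arguments contributes one of each, i.e. 2 errors, matching the rules above:

```python
def s_opt(ref_opt, pred_opt):
    """Optional-argument similarity.  ref_opt and pred_opt are sets of
    (role, argument) pairs; EOpt is the size of their symmetric
    difference and N the size of their union."""
    union = ref_opt | pred_opt
    if not union:
        return 1.0  # no optional argument on either side
    e_opt = len(ref_opt - pred_opt) + len(pred_opt - ref_opt)
    return 1.0 - e_opt / len(union)
```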

The submissions are evaluated with Recall, Precision and F1. The service also computes the Slot Error Rate.