ABOUT THE INGREDIENTS

Would it be possible to know more about the annotation
process? It seems that there are some cases where the annotator
added an ingredient from the ingredients list without it being
present in the body: for instance, in recette_95792 "sel" is
mentioned only in the ingredients section.

It
appears that the terminology list has been generated
automatically: for instance, in recipe 10516 the "gold" list
mentions "couteau" rather than "mouscade", and it does not mention
"champignon", which is the main ingredient of the recipe. The
question is: will the evaluation be carried out on a manually
validated list or just on the output of the algorithm which
generated the terminology list?

The annotation in
the gold standard is directly based on meta-information found on
the Marmiton Web site, for which we have no description. We
found out that it is not correct in all instances: some
ingredients present in the recipe body or in the <ingredients>
element may be absent from the gold standard, and some
ingredients listed in the gold standard may be
errors. Fortunately, these represent a small proportion of the
ingredients. This means that there is a ceiling on the top score
that a system may obtain. It reflects the fact that this
task is exploratory: it does not aim at obtaining perfect
results but at reaching full automation.

Can you confirm that no ingredient absent from the list in
the terminology file will be present in the evaluation
set?

We have added a new terminology file (see the
download page) which contains a full list of gold standard
normalized ingredient names. All ingredient names which must
be predicted for the test set belong to this list. Therefore,
no ingredient absent from this new list will be present in the
test set.
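Since predictions outside this closed list can never be correct, a system can safely restrict its output to it. A minimal Python sketch (the one-name-per-line file format and the file name are assumptions for illustration, not the actual distribution format):

```python
def load_terminology(path):
    """Read one normalized ingredient name per line (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def filter_predictions(predicted, terminology):
    """Keep only predictions belonging to the closed gold list,
    preserving the system's ranking order."""
    return [name for name in predicted if name in terminology]

# Toy example (made-up terminology subset):
terminology = {"sel", "poivre", "huile-d-olive", "oignon"}
print(filter_predictions(["sel", "huile", "oignon"], terminology))
# -> ['sel', 'oignon']
```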

Are implicit ingredients to be detected? In 96143 we find
"Fariner, saler et poivrer l'espadon" ("Flour, salt and pepper
the swordfish"), which implies "sel", "farine" and "poivre";
these are, however, not present in the gold standard. Could you
confirm that this behaviour is consistent?

Unfortunately we cannot guarantee that
(see the answer above for why).

Can you confirm that the <ingredients> elements will be
absent from the evaluation set? The question stems from an example
such as 96176:

96176|appr|2013/recette_96176.xml|Plat principal-Très facile|6|ail lardons-fumes oignon poivre pomme-de-terre sel

The text of the recipe says "et les faire revenir à sec avec les
lardons" ("and brown them dry with the lardons"): "lardons-fumes"
is present only in the ingredient part.

It is true that some of the ingredients
are only present in the <ingredients> element of some
recipes, and will therefore be impossible to find in the body
of the recipe. This task reveals the actual issues that a system
for automatically creating a list of ingredients from the
description of a recipe would have to cope with. This kind
of issue is therefore part of the specification of the task, for
which a perfect answer is not reachable in some cases. We expect
the proportion of such cases in the test set not to be much
different from that in the training set.

ABOUT THE EVALUATION

Is a metric foreseen for evaluating partial matches of
single ingredients? For instance, in 96143 the gold annotation
mentions huile-d-olive, but in the recipe we find "Le faire
dorer des 2 côtés dans l'huile" ("Brown it on both sides in the
oil"). Is a system which retrieves just "huile" equivalent to one
which does not retrieve anything, or is there a "reward" for
guessing the more generic term, given that the more specific one
is not mentioned?

The main metric, which will be used in
the official ranking of the systems in the challenge, will use
exact match: so in this example, including "huile" in the list
of ingredients returned by the system will be counted as
noise. A secondary metric is planned which will take kind-of
relations into account. With this secondary metric, "huile"
will obtain a partial match to the gold answer "huile-d-olive"
(it will be partly a true positive and partly noise, and will
therefore obtain only part of the full score). We do not plan
to use this extended metric as the main metric because we are not
certain we can give it a sound behaviour in cases where a
system selects very generic ingredients which partially match
multiple gold ingredients.

Would it be possible to know more about the evaluation metrics?
Ideally, as in other evaluations, the best thing would be to have
the evaluation software itself. If this is not possible,
what is the formula for evaluating partial matches,
e.g. ingredients A B C in the gold standard and A B or A C E in
the system's answer?

The metric will be the Mean Average Precision
(MAP): the non-interpolated average precision over all relevant
documents. It is the average of the precision values
obtained after each relevant document is retrieved. (When a
relevant document is not retrieved at all, its precision is
taken to be 0.) We shall use the standard
trec_eval program to compute it:

trec_eval -mmap gold_standard_qrels system_ranked_list
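The two files follow the standard TREC formats read by trec_eval. A minimal sketch of their contents, written from Python (the recipe id, ingredient names, scores and run id are made up for illustration; "Q0" is a conventional placeholder field):

```python
# qrels line:  <recipe-id> <iteration> <ingredient> <relevance>
# run line:    <recipe-id> Q0 <ingredient> <rank> <score> <run-id>
# (standard TREC formats; higher score = retrieved earlier)

qrels_lines = [
    "96176 0 sel 1",
    "96176 0 oignon 1",
]
run_lines = [
    "96176 Q0 sel 1 2.0 myrun",    # correct, ranked first
    "96176 Q0 huile 2 1.5 myrun",  # noise (absent from the qrels)
    "96176 Q0 oignon 3 1.0 myrun", # correct, ranked third
]

with open("gold_standard_qrels", "w") as f:
    f.write("\n".join(qrels_lines) + "\n")
with open("system_ranked_list", "w") as f:
    f.write("\n".join(run_lines) + "\n")

# Then, on the command line:
#   trec_eval -mmap gold_standard_qrels system_ranked_list
```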

The system must return
a ranked list of ingredients. The MAP rewards systems
which rank all correct ingredients close to the top of their
list. Conversely, adding incorrect ingredients at the bottom of
the list (after the correct ingredients) does not penalize a
system.

For instance, given a recipe with 6 ingredients
in the gold standard, a system which finds these 6 ingredients
but adds one incorrect ingredient will obtain a score ranging
from 0.7345 to 1, depending on whether the incorrect ingredient
is located in the first or the eighth position. Conversely,
with the same gold standard, a system which ranks 5 out of the 6
ingredients at the top of the list and does not provide the missing
sixth ingredient at all obtains a MAP of 0.8333. Finding the
sixth ingredient and ranking it after ten noisy ingredients
increases the score to 0.8958. As mentioned above, adding noisy
ingredients at the bottom of the list does not change the
score.

A system which ranks the 6 gold ingredients after
ten noisy ingredients obtains a score of 0.2471. Finally, a
system which ranks only 3 of the 6 gold ingredients at the top
and none of the missing 3 gold ingredients obtains a MAP of 0.5,
because missing gold ingredients are considered to have a
precision of zero (as though they were ranked at an infinitely
distant position).
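The worked examples above can be checked with a few lines of Python: a sketch of non-interpolated average precision (with a single recipe, MAP equals the average precision; the placeholder ingredient names are, of course, made up):

```python
def average_precision(gold, ranked):
    """Non-interpolated average precision for one recipe.

    gold: set of correct ingredient names.
    ranked: the system's ranked list of ingredient names.
    Gold ingredients missing from `ranked` contribute a precision of 0.
    """
    hits = 0
    precisions = []
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(gold)

gold = {"a", "b", "c", "d", "e", "f"}  # 6 gold ingredients
noisy = [f"n{i}" for i in range(10)]   # 10 incorrect ingredients

# One incorrect ingredient "x" in first vs. eighth position:
print(round(average_precision(gold, ["x", "a", "b", "c", "d", "e", "f"]), 4))  # 0.7345
print(round(average_precision(gold, ["a", "b", "c", "d", "e", "f", "x"]), 4))  # 1.0

# 5 of 6 gold ingredients at the top, sixth missing entirely:
print(round(average_precision(gold, ["a", "b", "c", "d", "e"]), 4))            # 0.8333

# Sixth gold ingredient ranked after ten noisy ones:
print(round(average_precision(gold, ["a", "b", "c", "d", "e"] + noisy + ["f"]), 4))  # 0.8958

# All 6 gold ingredients ranked after ten noisy ones:
print(round(average_precision(gold, noisy + ["a", "b", "c", "d", "e", "f"]), 4))     # 0.2471

# Only 3 of the 6 gold ingredients, at the top:
print(round(average_precision(gold, ["a", "b", "c"]), 4))                      # 0.5
```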

However, trec_eval ignores
queries (= recipes) for which no documents (ingredients) are
provided. This behaviour would not penalize a system which does
not propose any ingredient for a given test recipe. To avoid
this, evaluation will be performed after checking that the
system answer contains at least one ingredient for every test
recipe. If this is not the case, the evaluation program will add
a dummy ingredient for each missing recipe (which will cause
this recipe to receive an average precision of zero). The script
used to perform this normalization
is t4-normalize-system-results.pl. Run perl
t4-normalize-system-results.pl to print its usage
information.
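The actual normalization is done by the distributed Perl script; the following Python sketch only illustrates the idea, assuming a whitespace-separated run format with the recipe id in the first column (the dummy ingredient name and run id are illustrative):

```python
def normalize_results(run_lines, test_recipe_ids, dummy="DUMMY-INGREDIENT"):
    """Ensure every test recipe has at least one ranked ingredient.

    Recipes absent from the system output get a single dummy entry,
    which trec_eval will score as an average precision of zero.
    """
    answered = {line.split()[0] for line in run_lines if line.strip()}
    extra = [f"{qid} Q0 {dummy} 1 1.0 norm"
             for qid in test_recipe_ids if qid not in answered]
    return run_lines + extra

# Toy example: the system answered recipe 96176 but not 96143.
run = ["96176 Q0 sel 1 2.0 myrun"]
print(normalize_results(run, ["96176", "96143"]))
# the missing recipe 96143 receives one dummy ingredient
```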

Input is assumed to be sorted numerically by qid. Sim is assumed to
be higher for the documents to be retrieved first. Given the gold
standard, in the results file above, all ingredients are correct
except sel and pesto. Besides, this answer lacks several ingredients
(which will get a precision of zero): bouillon, parmesan, pignon,
poivre, tomate. Its MAP, computed with trec_eval, is
0.5227.