Datasets for the French lexical substitution task
Description
This webpage contains the datasets used for the French lexical
substitution task that took place in 2014. For more information about
the task, please refer to the SemDis 2014
webpage.
- The 300 test sentences, in each of which a single target word had to be substituted. There are 30 different target words (10 adjectives, 10 nouns, 10 verbs).
- The first version of the gold standard, established a priori by
asking judges to provide substitutes. The score for each substitute
is the number of judges who suggested it (1 to 7). It contains a total of 1,771 substitutes.
- The second version of the gold standard, established
post-hoc by asking judges to evaluate all the substitutes in the
submitted runs. The score for each substitute is the average score
given by the judges (0 to 3). Substitutes from the first gold standard
have also been re-annotated. This dataset contains a total of 6,034
substitutes (we discarded the substitutes that received a unanimous
null score).
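To illustrate how the scores of the second gold standard are described above, the following sketch averages per-judge ratings and drops the substitutes that every judge rated 0. The input layout (a dictionary mapping each substitute to a list of ratings from 0 to 3) and the function name `aggregate_scores` are assumptions made for this example; they do not describe the annotation tooling actually used.

```python
from typing import Dict, List


def aggregate_scores(ratings: Dict[str, List[int]]) -> Dict[str, float]:
    """Average the judges' ratings for each substitute.

    `ratings` is a hypothetical structure: substitute -> list of ratings (0 to 3).
    Substitutes with a unanimous null score are discarded, as in the dataset.
    """
    scores: Dict[str, float] = {}
    for substitute, judge_ratings in ratings.items():
        if not judge_ratings:
            # No judge rated this substitute: nothing to average.
            continue
        if all(rating == 0 for rating in judge_ratings):
            # Unanimous null score: the substitute is dropped.
            continue
        scores[substitute] = sum(judge_ratings) / len(judge_ratings)
    return scores
```

The number of ratings may differ from one substitute to another, which is why the mean is computed over each substitute's own list.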
Design
Person in charge
Ludovic Tanguy
Contact: ludovic.tanguy@univ-tlse2.fr
Licence
Some rights are reserved. The dataset is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Download
File format
The two versions of the gold standard have the same format, one sentence per line:
target word.POS sentence id :: substitute1 score; substitute2 score;
Example:
affection.n 145 :: maladie 3.0; pathologie 3.0; lésion 2.25; syndrome 2.0; mal 1.6666666666666667; complication 1.6666666666666667; inflammation 1.5; trouble 1.5; malformation 1.3333333333333333; atteinte 1.0; altération 1.0; anomalie 1.0; dysfonctionnement 1.0; pépin 0.8333333333333334; infection 0.75; algie 0.5;
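The line format above can be read with a few string operations. The sketch below is one possible parser, not an official loader: the names `GoldEntry` and `parse_gold_line` are introduced here for illustration, and the code assumes that the part of speech follows the last dot of the first field and that each score follows the last space of its item.

```python
from typing import List, NamedTuple, Tuple


class GoldEntry(NamedTuple):
    target: str                           # target word, e.g. "affection"
    pos: str                              # part of speech, e.g. "n"
    sentence_id: int                      # sentence identifier, e.g. 145
    substitutes: List[Tuple[str, float]]  # (substitute, score) pairs


def parse_gold_line(line: str) -> GoldEntry:
    """Parse one line of either gold standard file."""
    head, _, tail = line.strip().partition("::")
    # The head is "target.POS sentence_id"; split on the last whitespace run.
    target_pos, sentence_id = head.strip().rsplit(None, 1)
    target, _, pos = target_pos.rpartition(".")
    substitutes: List[Tuple[str, float]] = []
    for item in tail.split(";"):
        item = item.strip()
        if not item:
            # The trailing ";" leaves an empty last item.
            continue
        # The score is the last space-separated token of the item.
        word, _, score = item.rpartition(" ")
        substitutes.append((word.strip(), float(score)))
    return GoldEntry(target, pos, int(sentence_id), substitutes)
```

Scores are kept as floats so that the same function handles both versions: judge counts in the first gold standard and averages in the second.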
Note
In the second gold standard, there are no annotations for the (adjective) substitutes of the sentences numbered
52, 90, 106, 179, 185, 186, 187, 192, 204, 209, 211, 231, 232, 272 and 287. The corresponding items had problems that prevented a reliable or useful evaluation (context too short, wrong POS, duplicates, etc.).
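When evaluating against the second gold standard, it may be convenient to leave the sentences listed above out of the computation. The following sketch shows one way to do so; the names `UNANNOTATED_IDS` and `evaluable_ids` are introduced for this example only.

```python
from typing import Iterable, List

# Sentence ids without annotations in the second gold standard (from the note above).
UNANNOTATED_IDS = frozenset(
    {52, 90, 106, 179, 185, 186, 187, 192, 204, 209, 211, 231, 232, 272, 287}
)


def evaluable_ids(sentence_ids: Iterable[int]) -> List[int]:
    """Keep only the sentence ids that can be evaluated, preserving input order."""
    return [sid for sid in sentence_ids if sid not in UNANNOTATED_IDS]
```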
References
- For the corpus and first gold standard:
C. Fabre, N. Hathout, L.-M. Ho-Dac, F. Morlane-Hondère, P. Muller, F. Sajous, L. Tanguy and T. Van de Cruys (2014). Présentation de l'atelier SemDis 2014 : sémantique distributionnelle pour la substitution lexicale et l'exploration de corpus spécialisés. Actes de l'atelier SemDis 2014, 21e Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2014), pp. 196-205, Marseille.
- For the second gold standard:
L. Tanguy, C. Fabre and L. Rivière (2018). Extending the gold standard for a lexical substitution task: is it worth it? Proceedings of LREC.