Test sets for the SemDis French lexical substitution task


Datasets for the French lexical substitution task

Description

This webpage contains the data sets used for the French lexical substitution task that took place in 2014. For more information about the task, please refer to the SemDis 2014 webpage.

Three data sets are provided:
The 300 test sentences, in each of which a single target word had to be substituted. There are 30 different target words (10 adjectives, 10 nouns, 10 verbs).
The first version of the gold standard, established a priori by asking judges to provide substitutes. The score for each substitute is the number of judges who suggested it (1 to 7). It contains a total of 1,771 substitutes.
The second version of the gold standard, established post hoc by asking judges to evaluate all the substitutes in the submitted runs. The score for each substitute is the average score given by the judges (0 to 3). Substitutes from the first gold standard have also been re-annotated. This data set contains a total of 6,034 substitutes (substitutes that received a unanimous null score were discarded).

Design

Person in charge

Ludovic Tanguy
Contact: ludovic.tanguy@univ-tlse2.fr

Licence

Some rights are reserved. The dataset is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Download

Corpus: XML file and DTD
First gold standard: TXT file
Second gold standard: TXT file

File format

The two versions of the gold standard have the same format, one sentence per line:

target_word.POS sentence_id :: substitute1 score1; substitute2 score2; ...

Example:

affection.n 145 :: maladie 3.0; pathologie 3.0; lésion 2.25; syndrome 2.0; mal 1.6666666666666667; complication 1.6666666666666667; inflammation 1.5; trouble 1.5; malformation 1.3333333333333333; atteinte 1.0; altération 1.0; anomalie 1.0; dysfonctionnement 1.0; pépin 0.8333333333333334; infection 0.75; algie 0.5;
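The line format above can be read with a few string splits. The following is only a minimal sketch: the names `GoldEntry` and `parse_gold_line` are hypothetical and not part of the distributed data, and the code assumes that a substitute may contain spaces while its score is always the last whitespace-separated token before each semicolon.

```python
from typing import List, NamedTuple, Tuple


class GoldEntry(NamedTuple):
    """One gold standard line (hypothetical container, not part of the dataset)."""
    lemma: str
    pos: str
    sentence_id: int
    substitutes: List[Tuple[str, float]]


def parse_gold_line(line: str) -> GoldEntry:
    """Parse a line such as 'affection.n 145 :: maladie 3.0; pathologie 3.0;'."""
    # Left of '::' is the target and sentence id, right of it the substitutes.
    head, _, tail = line.strip().partition("::")
    target, sentence_id = head.strip().rsplit(None, 1)
    # The part of speech follows the last dot of the target.
    lemma, _, pos = target.rpartition(".")

    substitutes = []
    for item in tail.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing ';'
        # Assumption: the score is the last token, the rest is the substitute.
        word, score = item.rsplit(None, 1)
        substitutes.append((word, float(score)))
    return GoldEntry(lemma, pos, int(sentence_id), substitutes)


entry = parse_gold_line("affection.n 145 :: maladie 3.0; pathologie 3.0; lésion 2.25;")
print(entry.lemma, entry.pos, entry.sentence_id, entry.substitutes)
```

Scores are read as floats so that the same function can handle both gold standards: the first one stores judge counts (1 to 7), the second one stores averages (0 to 3).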

Note

In the second gold standard, there are no annotations for the (adjective) substitutes for sentences 52, 90, 106, 179, 185, 186, 187, 192, 204, 209, 211, 231, 232, 272 and 287. The corresponding items had problems that prevented a reliable or useful evaluation (context too short, wrong POS, duplicates, etc.).
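When scoring a system against the second gold standard, it may be convenient to skip these sentences explicitly. A minimal sketch, in which the constant and function names are hypothetical and only the sentence numbers come from the note above:

```python
# Sentence ids listed in the note above as lacking annotations
# in the second gold standard.
UNANNOTATED_SENTENCE_IDS = frozenset({
    52, 90, 106, 179, 185, 186, 187, 192,
    204, 209, 211, 231, 232, 272, 287,
})


def is_unannotated(sentence_id: int) -> bool:
    """Return True if the note lists this sentence as lacking annotations."""
    return sentence_id in UNANNOTATED_SENTENCE_IDS


# Example: keep only the sentence ids that are not listed in the note.
candidate_ids = [52, 145, 287]
kept = [sid for sid in candidate_ids if not is_unannotated(sid)]
print(kept)
```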

References

For the corpus and first gold standard:
C. Fabre, N. Hathout, L.-M. Ho-Dac, F. Morlane-Hondère, P. Muller, F. Sajous, L. Tanguy and T. Van de Cruys (2014). Présentation de l'atelier SemDis 2014 : sémantique distributionnelle pour la substitution lexicale et l'exploration de corpus spécialisés. Actes de l'atelier SemDis 2014, 21e Conférence sur le Traitement Automatique des Langues Naturelles (TALN 2014). pp. 196-205, Marseille.
For the second gold standard:
L. Tanguy, C. Fabre and L. Rivière (2018). Extending the gold standard for a lexical substitution task: is it worth it? Proceedings of LREC.