Datasets for the French lexical substitution task

This webpage contains the datasets used for the French lexical substitution task that took place in 2014. For more information about the task, please refer to the SemDis 2014 webpage.

  • The 300 test sentences, in each of which a single target word had to be substituted. There are 30 different target words (10 adjectives, 10 nouns, 10 verbs).
  • The first version of the gold standard, established a priori by asking judges to provide substitutes. The score for each substitute is the number of judges who suggested it (1 to 7). It contains a total of 1,771 substitutes.
  • The second version of the gold standard, established post hoc by asking judges to evaluate all the substitutes in the submitted runs. The score for each substitute is the average score given by the judges (0 to 3). Substitutes from the first gold standard have also been re-annotated. This dataset contains a total of 6,034 substitutes (we discarded the substitutes that received a unanimous null score).
Ludovic Tanguy
Some rights are reserved. The dataset is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.


File format
The two versions of the gold standard have the same format, one sentence per line:

target_word.POS sentence_id :: substitute1 score1; substitute2 score2; ...

Example:

affection.n 145 :: maladie 3.0; pathologie 3.0; lésion 2.25; syndrome 2.0; mal 1.6666666666666667; complication 1.6666666666666667; inflammation 1.5; trouble 1.5; malformation 1.3333333333333333; atteinte 1.0; altération 1.0; anomalie 1.0; dysfonctionnement 1.0; pépin 0.8333333333333334; infection 0.75; algie 0.5;
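
The format above can be read with a few string operations. The following Python sketch is only an illustration and not part of the distributed data: the function name parse_gold_line is hypothetical, and it assumes that the POS tag follows the last dot of the target, that the sentence id is an integer, and that each score is the last whitespace-separated token of its item (so a multi-word substitute would be kept whole).

```python
def parse_gold_line(line):
    """Parse one gold standard line into (lemma, pos, sentence_id, substitutes).

    Assumed layout, taken from the format description:
        target.POS sentence_id :: substitute1 score1; substitute2 score2;
    """
    head, _, tail = line.strip().partition("::")
    # The sentence id is the last token before "::"; the rest is the target.
    target, sentence_id = head.strip().rsplit(None, 1)
    # Split on the last dot so that the POS tag is separated from the lemma.
    lemma, _, pos = target.rpartition(".")
    substitutes = []
    for item in tail.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the trailing semicolon
        # The score is the last token; everything before it is the substitute.
        word, score = item.rsplit(None, 1)
        substitutes.append((word, float(score)))
    return lemma, pos, int(sentence_id), substitutes


# Shortened version of the example line shown above.
sample = "affection.n 145 :: maladie 3.0; pathologie 3.0; lésion 2.25;"
lemma, pos, sentence_id, substitutes = parse_gold_line(sample)
```

Scores are converted to floats so that the same function can be used for both versions of the gold standard (judge counts in the first, average ratings in the second).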


In the second gold standard there are no annotations for the (adjective) substitutes for sentences 52, 90, 106, 179, 185, 186, 187, 192, 204, 209, 211, 231, 232, 272 and 287. The corresponding items had problems that prevented a reliable or useful evaluation (context too short, wrong POS, duplicates, etc.).
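
When working with the second gold standard, it may be convenient to leave these sentences out explicitly. The Python sketch below shows one possible way to do so. The names UNANNOTATED_IDS and keep_annotated are hypothetical, and the code assumes that the sentence id is the last token before the "::" separator, as in the file format described above.

```python
# Sentence ids listed above as having no annotations in the second gold standard.
UNANNOTATED_IDS = {52, 90, 106, 179, 185, 186, 187, 192,
                   204, 209, 211, 231, 232, 272, 287}


def keep_annotated(lines):
    """Yield only the lines whose sentence id is not in UNANNOTATED_IDS."""
    for line in lines:
        head = line.partition("::")[0].split()
        if len(head) < 2:
            continue  # blank or malformed line: no target and sentence id
        if int(head[-1]) not in UNANNOTATED_IDS:
            yield line


# Made-up lines, used only to illustrate the filtering.
demo_lines = [
    "affection.n 145 :: maladie 3.0; pathologie 3.0;",
    "exemple.a 52 ::",
    "",
]
kept = list(keep_annotated(demo_lines))
```

The same set can be used to skip these sentence ids when iterating over the 300 test sentences, so that system output and gold standard cover the same items.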

