Gold standard for the TALN corpus

The file below is the result of the manual annotation carried out as part of a comparison of distributional models, as described in Tanguy et al. (2015).

The dataset is based on the version of the TALN corpus covering the years 2007-2013.

Four annotators judged the relevance of the neighbors computed by the distributional models for the following 30 words:

  • Adjectives: complexe, computationnel, correct, empirique, important, précis, sémantique, significatif, spécialisé, temporel
  • Nouns: élément, contrainte, dépendant, fréquence, graphe, méthode, performance, sémantique, signification, trait
  • Verbs: annoter, apparier, évaluer, calculer, caractériser, conduire, décrire, extraire, indexer, valider

The list of evaluated candidates was generated using the pooling method described in the article cited below.

The notion of neighborhood is not restricted to a specific semantic relation: the neighbors may be synonyms, hypernyms, hyponyms, antonyms or words that are otherwise semantically related.

The dataset is a tab-separated file (UTF-8 encoded) in which each line contains:

  • the target word
  • the POS of the target word (ADJ for adjective, NC for common noun, V for verb)
  • a neighbor
  • the number of annotators (ranging from 1 to 4) who have judged this neighbor relevant.
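
The four-column layout above can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the columns are taken to be tab-separated in the order listed, the function names `load_gold_standard` and `relevant_neighbors` are hypothetical helpers, the `min_votes` threshold is an illustrative choice rather than one prescribed by the article, and the sample rows are invented for the demonstration and are not taken from the gold standard.

```python
import csv
import io
from collections import defaultdict


def load_gold_standard(lines):
    """Parse rows of the form: target, POS, neighbor, number of annotators.

    Assumes tab-separated columns in the order given in the description.
    Returns a dict mapping (target, POS) to {neighbor: votes}.
    """
    gold = defaultdict(dict)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, malformed rows, or a possible header row.
        if len(row) != 4 or not row[3].isdigit():
            continue
        target, pos, neighbor, votes = row
        gold[(target, pos)][neighbor] = int(votes)
    return gold


def relevant_neighbors(gold, target, pos, min_votes=2):
    """Return the neighbors judged relevant by at least min_votes annotators."""
    entries = gold.get((target, pos), {})
    return sorted(n for n, v in entries.items() if v >= min_votes)


# Invented sample rows, for illustration only (not real annotation data).
sample = (
    "graphe\tNC\tarbre\t3\n"
    "graphe\tNC\tnoeud\t1\n"
    "évaluer\tV\tmesurer\t4\n"
)
gold = load_gold_standard(io.StringIO(sample))
print(relevant_neighbors(gold, "graphe", "NC"))
```

Entries are keyed by the pair (target word, POS) rather than by the word alone, because "sémantique" appears in the list both as an adjective and as a noun. To read the actual file, pass a file object opened with `encoding="utf-8"` and `newline=""` in place of the `io.StringIO` sample.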

Person in charge
Franck Sajous

Some rights are reserved. The gold standard file is available under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.


L. Tanguy, F. Sajous and N. Hathout (2015). Évaluation sur mesure de modèles distributionnels sur un corpus spécialisé : comparaison des approches par contextes syntaxiques et par fenêtres graphiques. TAL, 56(2), pp. 103-127.