Test set for semantic shift detection
A test set was developed to facilitate the use of the CanEn corpus for the detection of contact-induced semantic shifts in Quebec English. More specifically, it allows for the evaluation of semantic change detection systems, with semantic change detection formulated as a binary classification task (stable vs. changing words).
A total of 80 items are included in the test set: 40 correspond to semantic shifts in Quebec English, described in the sociolinguistic literature and attested in the CanEn corpus; the remaining 40 are control items which are unlikely to be affected by contact-related semantic influence and do not present regional variation in the corpus. The construction of the test set and its use in an evaluation of semantic change detection systems are presented in more detail by Miletic et al. (2021).
Each line in the file contains three tab-separated fields: a lexical item, its POS tag, and its semantic change label. The label is "1" if the lexical item corresponds to a semantic shift, and "0" if it is a control item.
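The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the released resource: the names ShiftItem, parse_test_set and accuracy are hypothetical, the sample rows are made-up placeholders rather than actual test items, and the POS tag values shown are assumptions about the tag set. It also assumes the file has no header row.

```python
import csv
import io
from typing import NamedTuple


class ShiftItem(NamedTuple):
    """One line of the test set (hypothetical container name)."""
    word: str
    pos: str
    label: int  # 1 = semantic shift, 0 = control item


def parse_test_set(handle):
    """Read tab-separated lines of the form: word <TAB> POS <TAB> label."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    items = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        word, pos, label = row
        items.append(ShiftItem(word, pos, int(label)))
    return items


def accuracy(items, predictions):
    """Share of items whose predicted label matches the gold label.

    `predictions` maps (word, pos) to 0 or 1; missing items count as errors.
    """
    correct = sum(
        1 for item in items
        if predictions.get((item.word, item.pos)) == item.label
    )
    return correct / len(items)


# Placeholder rows for illustration only; these are not real test items,
# and the tags "N" and "V" are assumed, not taken from the resource.
sample = "word_a\tN\t1\nword_b\tV\t0\n"
items = parse_test_set(io.StringIO(sample))
```

With the real file, the io.StringIO object would be replaced by an open file handle, for example open(path, encoding="utf-8", newline=""), where path points to the downloaded test set. Accuracy is shown only as one simple way to score a binary classifier against the labels; it is not necessarily the measure used in the cited evaluation.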
For the 40 words corresponding to semantic shifts, we have developed an additional resource enriched with different types of information characterizing semantic change. These specifically include (i) a range of computational estimates of semantic change; (ii) empirical linguistic properties of the target words (e.g. frequency, polysemy); (iii) the results of sociolinguistic interviews with 15 speakers from Montreal, in particular their acceptability ratings and qualitative remarks regarding the use of the 40 words attested in our corpus of tweets. For more details, see Miletic et al. (2023).
Contact person: Filip Miletic
The files below are released under the Creative Commons BY-NC-SA 4.0 licence.