REDAC
REsources Developed At CLLE (CLLE research unit)






CanEn
A corpus of tweets in Canadian English
Description

CanEn is a corpus of tweets aimed at studying regional variation in Canadian English, with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. It contains 78.8 million tweets, corresponding to 1.3 billion tokens, which were published by 196,000 distinct users.

The corpus was constructed by identifying Twitter users from the three cities (January-November 2019), crawling their entire timelines, filtering the collected data in terms of user location and tweet language, and automatically excluding near-duplicate content. This ensures that the retained tweets are written in English; were posted by users who state that they live in Toronto, Montreal, or Vancouver; present limited repetitive content; and are roughly equally distributed across the three geographic areas. See Miletic et al. (2020) for more details on the structure of the resulting corpus, the data collection pipeline, and case studies on regional variation.
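The filtering steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline (see Miletic et al. (2020) for the actual method). It assumes tweets are dictionaries with the `lang`, `text`, and `user.location` fields found in standard Twitter JSON, and it stands in for near-duplicate detection with a simple normalization of URLs, mentions, case, and whitespace. The function names and the city list are hypothetical.

```python
import re

# Target cities from the corpus description (lowercase, unaccented).
CITIES = ("toronto", "montreal", "vancouver")


def city_of(tweet):
    """Return the first target city named in the self-declared user location, or None."""
    location = ((tweet.get("user") or {}).get("location") or "").lower()
    location = location.replace("é", "e")  # treat "Montréal" like "Montreal"
    for city in CITIES:
        if city in location:
            return city
    return None


def normalize(text):
    """Mask URLs and mentions, lowercase, and collapse whitespace.

    Assumption: a crude stand-in for near-duplicate detection.
    """
    text = re.sub(r"https?://\S+", "<url>", text.lower())
    text = re.sub(r"@\w+", "<user>", text)
    return " ".join(text.split())


def filter_tweets(tweets):
    """Keep English tweets from the target cities, skipping normalized duplicates."""
    seen = set()
    kept = []
    for tweet in tweets:
        city = city_of(tweet)
        if tweet.get("lang") != "en" or city is None:
            continue
        key = normalize(tweet.get("text", ""))
        if key in seen:
            continue
        seen.add(key)
        kept.append((city, tweet))
    return kept
```

Substring matching on the location field is deliberately naive: it would, for instance, also accept "Vancouver, WA", so a real pipeline would need stricter location rules.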

In accordance with Twitter’s Developer Policy, the corpus is released in the form of tweet IDs, grouped into three lists, each of which corresponds to a city. The lists can be used to easily collect the complete Twitter data in JSON format using widely available software (e.g. Hydrator). Note that some of the initially identified tweets may be unavailable due to the removal of individual tweets or entire user accounts.
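As a rough illustration of preparing the ID lists for collection, the sketch below reads a list of tweet IDs and splits it into fixed-size batches that could then be passed to a hydration tool or lookup request. It assumes each list is a plain-text file with one numeric tweet ID per line, which should be checked against the downloaded files. The function names and the default batch size of 100 are illustrative assumptions to be adjusted to the tool in use.

```python
from itertools import islice


def read_tweet_ids(path):
    """Read one tweet ID per line, skipping blank or non-numeric lines and repeats."""
    seen = set()
    ids = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet_id = line.strip()
            if tweet_id.isdigit() and tweet_id not in seen:
                seen.add(tweet_id)
                ids.append(tweet_id)
    return ids


def batches(ids, size=100):
    """Yield consecutive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

Because some tweets and accounts have since been removed, the number of tweets actually retrieved per batch may be smaller than the number of IDs requested.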


Contact person
Filip Miletic
Contact:

Licence

The dataset is released under the Creative Commons BY-NC-SA 4.0 licence.

Note that Twitter’s Developer Policy stipulates that “[a]cademic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs if they are doing so on behalf of an academic institution and for the sole purpose of non-commercial research”. Any reuse of this dataset is therefore limited to non-commercial academic research. By downloading this corpus, you accept the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy.


Download

References
  • Miletic, F., Przewozny-Desriaux, A. and Tanguy, L. (2020). Collecting Tweets to Investigate Regional Variation in Canadian English. Proceedings of LREC 2020, 12th International Conference on Language Resources and Evaluation. Marseille, France. [ PDF ]
  • Miletic, F., Przewozny-Desriaux, A. and Tanguy, L. (2021). Detecting contact-induced semantic shifts: What can embedding-based methods do in practice? Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 10852-10865. [ PDF ] [ Test set ]