A corpus of tweets in Canadian English
CanEn is a corpus of tweets aimed at studying regional variation in Canadian English, with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. It contains 78.8 million tweets, corresponding to 1.3 billion tokens, which were published by 196,000 distinct users.
In accordance with Twitter’s Developer Policy, the corpus is released in the form of tweet IDs, grouped into three lists, each of which corresponds to a city. The lists can be used to easily collect the complete Twitter data in JSON format using widely available software (e.g. Hydrator). Note that some of the initially identified tweets may be unavailable due to the removal of individual tweets or entire user accounts.
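For users who prefer to script the collection rather than use a graphical tool, the sketch below shows one possible way to prepare an ID list for hydration. It rests on several assumptions that are not stated in the corpus description: that each list is a plain-text file with one tweet ID per line, and that the hydration client accepts IDs in batches of up to 100 per request (a limit that depends on the API version and should be checked). The function names `read_tweet_ids` and `batched` are illustrative only; the actual request to the Twitter API is left out.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def read_tweet_ids(path: Path) -> Iterator[str]:
    """Yield tweet IDs from a text file (assumed format: one ID per line)."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            tweet_id = line.strip()
            # Skip blank lines and anything that is not a numeric ID,
            # such as a header row.
            if tweet_id.isdigit():
                yield tweet_id


def batched(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group IDs into lists of at most `size` items.

    The default of 100 is an assumed per-request limit, not a value
    taken from the corpus documentation.
    """
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With a hypothetical file name such as `toronto_ids.txt`, the batches could be produced with `batched(read_tweet_ids(Path("toronto_ids.txt")))` and passed one by one to whichever hydration client is in use. Because some tweets may no longer be available, the number of tweets returned for a batch can be smaller than the number of IDs requested.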
Contact person: Filip Miletic