REDAC
REsources Developed At CLLE (CLLE: Cognition, Langues, Langage, Ergonomie)







CORPORA
Corpora available from the REDAC website
CanEn: a corpus of tweets for studying regional variation in Canadian English, with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. It contains 78.8 million tweets, corresponding to 1.3 billion tokens, published by 196,000 distinct users.
RésolCo (Resolution of Cohesion issues) is a corpus of handwritten texts produced by French pupils and students in response to a task that requires resolving cohesion issues. The corpus is enriched with manual annotations of graphical revisions, misspellings and discourse structures.
Est Républicain: syntactically parsed version of a newspaper corpus published in 1999, 2002 and 2003.
ParCoTrain: a training and test corpus for POS tagging and lemmatization of Serbian, developed as part of the ParCoLab project. The lemmatized part of the corpus contains 95,585 manually annotated tokens. The POS-tagged part contains 153,625 tokens, with 95,585 tokens annotated manually and the remaining 57,977 tokens annotated automatically and then validated manually. The source texts are contemporary Serbian novels from the second half of the 20th century.
TALN: corpus of 1,602 scientific articles from the proceedings of the TALN and RECITAL conferences between 1997 and 2019.
The GÉOPO corpus includes 32 articles about geopolitics. This 270,000-word French corpus has been syntactically parsed and annotated with discourse-level information.
ANNODIS: a discourse-level annotated corpus of written French. The corpus (687,000 words) is diversified with respect to genre, length and type of discourse organisation. The annotated objects, which reflect two distinct approaches to discourse, are rhetorical relations and two types of multi-level structures: topical chains and enumerative structures. The texts are made available in XML format following the TEI P5 standard (metadata and document structure) and in GLOZZ format (the format produced by manual annotation in the GLOZZ interface).
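For readers who want to work with the TEI P5 files, the following is a minimal Python sketch of how a TEI document can be read with the standard library. The sample document, the `read_tei` helper and the element layout are assumptions made for illustration; the actual ANNODIS files may be organised differently, so the paths would need to be adapted after inspecting a real file.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents declare this namespace on their root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature TEI document, not taken from ANNODIS.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>Premier paragraphe.</p>
      <p>Second paragraphe.</p>
    </body>
  </text>
</TEI>"""


def read_tei(xml_string):
    """Return the title and the paragraph texts of a TEI document (illustrative helper)."""
    root = ET.fromstring(xml_string)
    # Metadata lives in the teiHeader; here we only read the title.
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    # Document structure lives under text/body; collect every paragraph.
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:body//tei:p", TEI_NS)
    ]
    return title, paragraphs


title, paragraphs = read_tei(SAMPLE)
print(title)
print(paragraphs)
```

To read a file from disk instead of a string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_string)`.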
WikipédiaFR2008: raw text and POS-tagged corpora extracted from the French Wikipedia. This corpus includes 664,982 articles containing 262 million words.