REDAC : corpora

REsources Developed At CLLE

CORPORA
Corpora available from the REDAC website

CanEn	CanEn is a corpus of tweets aimed at studying regional variation in Canadian English, with a specific focus on the dialect regions of Toronto, Montreal, and Vancouver. It contains 78.8 million tweets, corresponding to 1.3 billion tokens, which were published by 196,000 distinct users.
RésolCo	RésolCo (Resolution of Cohesion issues) is a corpus of handwritten manuscripts produced by French pupils and students in response to a task aimed at resolving cohesion issues. The corpus is enriched with manual annotations of graphical revisions, misspellings and discourse structures.
Est Républicain	Syntactically parsed version of a newspaper corpus published in 1999, 2002 and 2003.
ParCoTrain	ParCoTrain is a training and test corpus for POS tagging and lemmatization of Serbian. The corpus was developed as part of the ParCoLab project. The lemmatized part of the corpus contains 95 585 manually annotated tokens. The POS-tagged part contains 153 625 tokens, with 95 585 tokens annotated manually and the remaining 57 977 tokens annotated automatically and then validated manually. The source texts are contemporary Serbian novels from the second half of the 20th century.
TALN	Corpus made up of 1602 scientific articles from the proceedings of the TALN and RECITAL conferences between 1997 and 2019.
GÉOPO	The GÉOPO corpus includes 32 articles about geopolitics. This French corpus of 270 000 words has been syntactically parsed and annotated with discourse-level information.
ANNODIS	The ANNODIS resource is a discourse-level annotated corpus of written French. The corpus (687,000 words) is diversified with respect to genre, length and type of discourse organisation. The annotated objects, which reflect two distinct approaches to discourse, are rhetorical relations and two types of multi-level structures: topical chains and enumerative structures. The texts are made available in XML format following the TEI-P5 guidelines (metadata and document structure) and in GLOZZ format (the format produced by manual annotation in the GLOZZ interface).
WikipédiaFR2008	Raw text and POS-tagged corpora extracted from the French Wikipedia. This corpus includes 664,982 articles containing 262 million words.