REDAC
REsources Developed At CLLE-ERSS CLLE-ERSS research unit







CORPORA
Corpora available from the REDAC website
ParCoTrain ParCoTrain is a training and test corpus for POS tagging and lemmatization of Serbian. The corpus was developed as part of the ParCoLab project. The lemmatized part of the corpus contains 95,585 manually annotated tokens. The POS-tagged part contains 153,625 tokens, with 95,585 tokens annotated manually and the remaining 57,977 tokens annotated automatically and then validated manually. The source texts are contemporary Serbian novels from the second half of the 20th century.
TALN Corpus made up of 586 scientific articles from the proceedings of the TALN and RECITAL conferences between 2007 and 2013.
GÉOPO The GÉOPO corpus includes 32 articles about geopolitics. This 270,000-word French corpus has been syntactically parsed and annotated with discourse-level information.
ANNODIS The ANNODIS resource is a discourse-level annotated corpus of written French. The corpus (687,000 words) is varied in genre, length and type of discourse organisation. The annotated objects, which reflect two distinct approaches to discourse, are rhetorical relations and two types of multi-level structures: topical chains and enumerative structures. The texts are made available in XML format following the TEI P5 standard (metadata and document structure) and in GLOZZ format (the format produced by manual annotation in the GLOZZ interface).
WikipédiaFR2008 Raw text and POS-tagged corpora extracted from the French Wikipedia. This corpus includes 664,982 articles containing 262 million words.