ParCoTrain corpus
POS tagging and lemmatisation of Serbian

ParCoTrain is a training and test corpus for POS tagging and lemmatisation of Serbian. The lemmatised section of the corpus contains 95585 tokens, whereas the POS-tagged section contains 153625 tokens (95585 of which were annotated manually, with the remaining 57977 annotated automatically and then validated manually). The source texts are contemporary Serbian novels from the second half of the 20th century.

Each POS tag gives the main part of speech and its subcategory. For adjectives and adverbs, it also indicates the degree of comparison. A detailed description of the tagset used in the corpus can be found in the PDF documentation downloadable from this page.

This resource was developed as part of the ParCoLab project by Aleksandra Miletic (CLLE-ERSS, Université Toulouse - Jean Jaurès), Antonio Balvet (STL, Université Lille 3) and Dejan Stosic (CLLE-ERSS, Université Toulouse - Jean Jaurès).

Person in charge
Aleksandra Miletic

Some rights are reserved. ParCoTrain is distributed under a Creative Commons BY-NC-SA 3.0 licence.

  • Balvet, A., Stosic, D., & Miletic, A. (2014). TALC-sef, Un corpus étiqueté de traductions littéraires en serbe, anglais et français. Actes du 4e Congrès Mondial de Linguistique Française (CMLF 2014), pp. 2551-2563, Berlin, Germany.
  • Balvet, A., Stosic, D., & Miletic, A. (2014). TALC-sef: a Manually-revised POS-Tagged Literary Corpus in Serbian, English and French. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pp. 4105-4110, Reykjavik, Iceland.