This archive contains files that constitute a training and test corpus for POS tagging and lemmatisation of Serbian. The corpus was developed as part of the ParCoLab project (http://parcolab.univ-tlse2.fr/). It contains literary texts from the second half of the 20th century.

Authors:	Aleksandra Miletic (CLLE-ERSS, University of Toulouse - Jean Jaurès)
			Dejan Stosic (CLLE-ERSS, University of Toulouse - Jean Jaurès)
			Antonio Balvet (STL, University Lille 3)
Contact:	aleksandra.miletic at univ-tlse2.fr

General description:
Format: csv
Field separator: tab (\t)
Character encoding: UTF-8
EOL character: CR-LF (\r\n)
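
The settings above can be illustrated with a minimal Python sketch for loading one of the files. It is only an illustration and not part of the corpus distribution: the function names read_rows and read_corpus are hypothetical. It assumes that every non-empty line holds one token followed by its tab-separated annotation fields, and it skips empty lines should any occur.

```python
def read_rows(lines):
    """Yield one list of fields per non-empty line.

    `lines` is any iterable of text lines, e.g. an open file.
    """
    for line in lines:
        # Strip the CR-LF line ending before splitting on the tab separator.
        stripped = line.rstrip("\r\n")
        if stripped:
            yield stripped.split("\t")


def read_corpus(path):
    """Read a whole corpus file into a list of field lists."""
    # newline="" keeps the original line endings; read_rows strips them.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(read_rows(handle))
```

Under these assumptions, read_corpus("enciklopedija.txt") would return rows of three fields (token, lemma, POS), and read_corpus("basta.txt") rows of two fields (token, POS).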

Source texts:
Kiš, Danilo. "Enciklopedija mrtvih", 2000. Beograd: BIGZ.
Stevanović, Vidosav. "Testament", 1986. Beograd: SKZ.
Kiš, Danilo. "Bašta, pepeo", 2010. Podgorica: Narodna knjiga.

File list:

1. enciklopedija-testament.txt
Size:						95585 tokens	
	"Enciklopedija mrtvih":	47792 tokens
	"Testament":			47793 tokens
Annotation: POS and lemmatisation. Manually annotated.
Format: [token][tab][lemma][tab][POS]
Content: "Enciklopedija mrtvih" and "Testament". Sentences appear in a randomized order.

enciklopedija-testament.txt is a concatenation of the following two files:

2. enciklopedija.txt
Size:	47792 tokens
Annotation: POS and lemmatisation. Manually annotated.
Format: [token][tab][lemma][tab][POS]
Content: "Enciklopedija mrtvih". Sentences appear in a randomized order.

3. testament.txt
Size:	47793 tokens
Annotation: POS and lemmatisation. Manually annotated.
Format: [token][tab][lemma][tab][POS]
Content: "Testament". Sentences appear in a randomized order.

The following 4 files are balanced samples derived from enciklopedija.txt and testament.txt. Each of the two larger files was split into 2 samples whose sizes are as close as possible while keeping every sentence intact.
These files therefore have the same format and the same type of annotation as the two preceding files.
The four files may be used for 4-fold cross-validation.

4. enciklopedija-sample1.txt
Size: 23908 tokens

5. enciklopedija-sample2.txt
Size: 23885 tokens

6. testament-sample1.txt
Size: 23908 tokens

7. testament-sample2.txt
Size: 23884 tokens
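
The 4-fold cross-validation mentioned above can be set up as in the following Python sketch, in which each sample is held out once for testing while the other three are used for training. This is one possible arrangement rather than a prescribed protocol, and the function name four_fold_splits is hypothetical.

```python
def four_fold_splits(sample_names):
    """Return one (training files, held-out file) pair per sample."""
    splits = []
    for held_out in sample_names:
        # Train on every sample except the one held out for this fold.
        training = [name for name in sample_names if name != held_out]
        splits.append((training, held_out))
    return splits


SAMPLES = [
    "enciklopedija-sample1.txt",
    "enciklopedija-sample2.txt",
    "testament-sample1.txt",
    "testament-sample2.txt",
]

for training, held_out in four_fold_splits(SAMPLES):
    print("train:", ", ".join(training), "| test:", held_out)
```

The training files of each fold can simply be concatenated, since all four samples share the same three-column format.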

Another file with only POS tagging is available. This file was automatically tagged with BTagger (Gesmundo & Samardzic, 2012) trained on the enciklopedija-testament.txt corpus. The output of the tagger was validated manually. This file was subsequently used to extend the training corpus for POS tagging, giving a total of 153625 POS-tagged tokens.

8. basta.txt
Size: 57977 tokens
Annotation: POS tagging. Automatic annotation validated manually.
Format: [token][tab][POS]
Content: "Bašta, pepeo". The sentences appear in the original order.

If you wish to create a training corpus for POS-tagging by merging basta.txt and enciklopedija-testament.txt, it can be done as follows:

1. create a file containing only the POS annotation from enciklopedija-testament.txt:
$ cut -f 1,3 enciklopedija-testament.txt > enciklopedija-testament-POS.txt

2. concatenate this file with basta.txt:
$ cat enciklopedija-testament-POS.txt basta.txt > enciklopedija-testament-basta-POS.txt
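
For readers who do not have cut and cat at hand, the two steps above can also be sketched in Python. This is an illustrative equivalent, not a tool shipped with the corpus, and the function names pos_only, merge_pos and merge_files are hypothetical. It differs from cut in one respect: lines that do not contain exactly three fields are passed through unchanged. Output lines are written with the CR-LF ending documented for the corpus.

```python
def pos_only(line):
    """Reduce a [token][tab][lemma][tab][POS] line to [token][tab][POS]."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        # Assumption: anything that is not a three-field line is kept as-is.
        return line
    return fields[0] + "\t" + fields[2] + "\r\n"


def merge_pos(three_column_lines, two_column_lines):
    """Drop the lemma column from the first corpus, then append the second."""
    merged = [pos_only(line) for line in three_column_lines]
    merged.extend(two_column_lines)
    return merged


def merge_files(lemmatised_path, pos_path, out_path):
    """File-based wrapper around merge_pos; newline="" preserves CR-LF."""
    with open(lemmatised_path, encoding="utf-8", newline="") as source:
        three_column = source.readlines()
    with open(pos_path, encoding="utf-8", newline="") as source:
        two_column = source.readlines()
    with open(out_path, "w", encoding="utf-8", newline="") as target:
        target.writelines(merge_pos(three_column, two_column))
```

A call such as merge_files("enciklopedija-testament.txt", "basta.txt", "enciklopedija-testament-basta-POS.txt") would then produce the same kind of merged file as the two shell commands.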

POS tagging implemented in the corpus:
The tagset used for the POS tagging contains 45 tags. As the initial plan was to use TreeTagger to annotate the three sub-corpora, the ParCoTrain tagset was initially based on the one proposed by TreeTagger, adapted to Serbian. The tags encode the main part of speech and the subcategory. For adjectives and adverbs, they also indicate the degree of comparison. The tags have the following format: the main POS is given in capitals, followed by a colon, followed by the subcategory, if applicable. Some tag examples are given below.

NOM:nam = proper noun
NOM:com = common noun
ADJ:sup = superlative adjective
ADJ:comp = comparative adjective
ADJ:rel = relative adjective

A more detailed presentation of the tagset can be found in the PDF documentation available at the download page.
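
Given the tag format described above (main POS, then a colon, then an optional subcategory), a tag can be split into its two parts as in the following Python sketch. The function name split_tag is hypothetical, and the sketch assumes that a tag without a subcategory contains no colon.

```python
def split_tag(tag):
    """Split a tag such as "NOM:com" into (main POS, subcategory).

    The subcategory is None when the tag has no colon (assumed to mean
    that no subcategory applies).
    """
    main_pos, _, subcategory = tag.partition(":")
    return main_pos, subcategory or None


for example in ["NOM:nam", "NOM:com", "ADJ:sup", "ADJ:comp", "ADJ:rel"]:
    print(example, "->", split_tag(example))
```

This makes it possible, for instance, to evaluate a tagger on the main part of speech alone by comparing only the first element of each pair.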



