The FR-Wikipedia corpus was extracted from the last static HTML dump (18/06/2008)
available at http://dumps.wikimedia.org/.
This dump has been minimally processed to keep only the text parts of the articles.
The sommaire (table of contents) boxes have been removed, as well as the Voir aussi (see also) sections.
The notes sections remain.
The corpus has been lemmatized and POS-tagged with
TreeTagger, from the University of Stuttgart.
This corpus, like the Wikipedia encyclopedia from which it was extracted, is available under the
Creative Commons BY-SA licence (Attribution - Share Alike).
Download
Raw text corpus [.txt.7z] (433 MB).
File format: each article starts with a line of the form <#id_num>,
where id_num is a unique identifier.
Text that originally had a tabular layout appears with the | (pipe) character in place of column breaks.
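The raw-text format above can be read with a short script. The sketch below splits a dump into one entry per article, keyed by its identifier; the exact header pattern and the sample lines are assumptions based on the description above, not part of the distributed corpus.

```python
import re

# Header line of the form "<#id_num>", as described above (assumed exact format).
HEADER = re.compile(r"^<#(\d+)>$")

def split_articles(lines):
    """Yield (id_num, list_of_text_lines) for each article in a raw-text dump."""
    current_id, buffer = None, []
    for line in lines:
        m = HEADER.match(line.rstrip("\n"))
        if m:
            if current_id is not None:
                yield current_id, buffer
            current_id, buffer = m.group(1), []
        elif current_id is not None:
            buffer.append(line.rstrip("\n"))
    if current_id is not None:
        yield current_id, buffer

# Illustrative sample, including a pipe-separated line from a former table.
sample = ["<#42>", "Premier article.", "col1 | col2", "<#43>", "Second article."]
articles = dict(split_articles(sample))
```

For a real file, pass an open file handle instead of the sample list.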
POS-tagged corpus [.tag.7z] (612 MB).
File format: each article starts with a line of the form <#id_num>,
where id_num is a unique identifier.
Other lines contain one token each, in the format: Wordform \t POS \t Lemma.
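One tagged line can be split on tabs into its three fields. A minimal sketch, assuming the tab-separated layout described above (the sample token is illustrative, not taken from the corpus):

```python
def parse_tagged_line(line):
    """Split one tagged line into (wordform, pos, lemma); fields are tab-separated."""
    wordform, pos, lemma = line.rstrip("\n").split("\t")
    return wordform, pos, lemma

# Illustrative example of a tagged token.
token = parse_tagged_line("chats\tNOM\tchat")
```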
Meta-data [.txt.7z] (12 MB):
for each article, referred to by its identifier, the title and the number of extracted words are given.
The article's categories are also provided. The information is organised as follows:
File                              Format
wikipediaArticles.txt             article id, article title, number of tokens
wikipediaCategories.txt           category unique id, category title
wikipediaArticlesCategorie.txt    article id, category id
A <articleId, categoryId> line means that the article with id articleId
belongs to the category with id categoryId.
An article often belongs to several categories.
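Since an article can belong to several categories, the mapping file is naturally read into a one-to-many structure. A minimal sketch, assuming comma-separated fields as listed in the table above (the separator and sample pairs are assumptions):

```python
from collections import defaultdict

def load_article_categories(lines, sep=","):
    """Build an article-id -> set-of-category-ids mapping from <articleId, categoryId> lines."""
    mapping = defaultdict(set)
    for line in lines:
        article_id, category_id = line.strip().split(sep)
        mapping[article_id].add(category_id)
    return mapping

# Illustrative pairs: article 12 belongs to two categories.
pairs = ["12,3", "12,7", "99,3"]
cats = load_article_categories(pairs)
```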
Frequency Table [.txt.7z] (16 MB):
gives, for each inflected form found in the corpus, its number of occurrences.
This file was built using a script taken from the book by Ludovic Tanguy and Nabil Hathout:
Perl pour les linguistes.
It is available from the perl.linguistes.free.fr website.
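The frequency table can be loaded into a lookup dictionary for quick queries. A minimal sketch, assuming one form per line with a tab-separated count (the separator and the sample entries are assumptions):

```python
def load_frequencies(lines, sep="\t"):
    """Load a frequency table into a dict {inflected_form: occurrence_count}."""
    freq = {}
    for line in lines:
        form, count = line.rstrip("\n").split(sep)
        freq[form] = int(count)
    return freq

# Illustrative entries only, not actual corpus counts.
freqs = load_frequencies(["le\t120000", "chat\t532"])
```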