The FR-Wikipedia corpus was extracted from the last static HTML dump (18/06/2008)
available at http://dumps.wikimedia.org/.
This dump has been minimally processed to keep only the text parts of the articles.
The sommaire (table of contents) boxes have been removed, as well as the Voir aussi (see also) sections.
The notes sections remain.
The corpus has been lemmatized and POS-tagged with
TreeTagger, from the University of Stuttgart.
This corpus, like the Wikipedia encyclopedia from which it was extracted, is available under the
Creative Commons BY-SA licence (Attribution - Share Alike).
Download
Raw text corpus [.txt.7z] (433 MB).
File format: each article starts with a line of the form <#id_num>,
where id_num is a unique identifier.
Text that originally had a tabular layout appears with the | (pipe) character in place of column breaks.
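The raw-text format above can be read with a short script. The sketch below splits a dump into one entry per article, keyed by its identifier; the exact header pattern and the sample lines are assumptions based on the description above, not part of the distributed corpus.

```python
import re

# Header line of the form "<#id_num>", as described above (assumed exact format).
HEADER = re.compile(r"^<#(\d+)>$")

def split_articles(lines):
    """Yield (id_num, list_of_text_lines) for each article in a raw-text dump."""
    current_id, buffer = None, []
    for line in lines:
        m = HEADER.match(line.rstrip("\n"))
        if m:
            if current_id is not None:
                yield current_id, buffer
            current_id, buffer = m.group(1), []
        elif current_id is not None:
            buffer.append(line.rstrip("\n"))
    if current_id is not None:
        yield current_id, buffer

# Illustrative sample, including a pipe-separated line from a former table.
sample = ["<#42>", "Premier article.", "col1 | col2", "<#43>", "Second article."]
articles = dict(split_articles(sample))
```

For a real file, pass an open file handle instead of the sample list.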
POS-tagged corpus [.tag.7z] (612 MB).
File format: each article starts with a line of the form <#id_num>,
where id_num is a unique identifier.
Other lines contain one token each, in the format: Wordform \t POS \t Lemma.
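One tagged line can be split on tabs into its three fields. A minimal sketch, assuming the tab-separated layout described above (the sample token is illustrative, not taken from the corpus):

```python
def parse_tagged_line(line):
    """Split one tagged line into (wordform, pos, lemma); fields are tab-separated."""
    wordform, pos, lemma = line.rstrip("\n").split("\t")
    return wordform, pos, lemma

# Illustrative example of a tagged token.
token = parse_tagged_line("chats\tNOM\tchat")
```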
Meta-data [.txt.7z] (12 MB):
for each article, referred to by its identifier, the title and the number of extracted words are given.
The article's categories are also provided. The information is organised as follows:
File                              Format
wikipediaArticles.txt             article id, article title, number of tokens
wikipediaCategories.txt           category unique id, category title
wikipediaArticlesCategorie.txt    article id, category id
A <articleId, categoryId> line means that the article with id articleId
belongs to the category with id categoryId.
An article often belongs to several categories.
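Since an article can belong to several categories, the mapping file is naturally read into a one-to-many structure. A minimal sketch, assuming comma-separated fields as listed in the table above (the separator and sample pairs are assumptions):

```python
from collections import defaultdict

def load_article_categories(lines, sep=","):
    """Build an article-id -> set-of-category-ids mapping from <articleId, categoryId> lines."""
    mapping = defaultdict(set)
    for line in lines:
        article_id, category_id = line.strip().split(sep)
        mapping[article_id].add(category_id)
    return mapping

# Illustrative pairs: article 12 belongs to two categories.
pairs = ["12,3", "12,7", "99,3"]
cats = load_article_categories(pairs)
```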
Frequency Table [.txt.7z] (16 MB):
gives, for each inflected form found in the corpus, its number of occurrences.
This file was built using a script taken from the book by Ludovic Tanguy and Nabil Hathout:
Perl pour les linguistes.
It is available from the perl.linguistes.free.fr website.
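The frequency table can be loaded into a lookup dictionary for quick queries. A minimal sketch, assuming one form per line with a tab-separated count (the separator and the sample entries are assumptions):

```python
def load_frequencies(lines, sep="\t"):
    """Load a frequency table into a dict {inflected_form: occurrence_count}."""
    freq = {}
    for line in lines:
        form, count = line.rstrip("\n").split(sep)
        freq[form] = int(count)
    return freq

# Illustrative entries only, not actual corpus counts.
freqs = load_frequencies(["le\t120000", "chat\t532"])
```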