REDAC
REsources Developed At CLLE-ERSS (CLLE-ERSS research unit)







WikipediaFR2008 Corpus
A corpus of the 664,982 articles taken from the French version of the Wikipedia encyclopedia.
Description

The FR-Wikipedia corpus was extracted from the last static HTML dump (18 June 2008) available at http://dumps.wikimedia.org/.

This dump has been minimally processed to extract only the text of the articles. The sommaire (table of contents) boxes have been removed, as have the Voir aussi (see also) sections. The notes sections remain.

The corpus has been lemmatized and POS-tagged with TreeTagger, developed at the University of Stuttgart.

Person in charge
Franck Sajous
Contact:

Licence
This corpus, like the Wikipedia encyclopedia from which it was extracted, is available under the Creative Commons BY-SA 3.0 licence (Attribution - Share Alike).

Download
  • Raw text corpus [.txt.7z] (433 MB).
    File format: each article starts with a line such as:
    <#id_num>
    where id_num is a unique identifier.
    Text that was originally laid out as a table uses the | (pipe) character in place of column breaks.
  • POS-tagged corpus [.tag.7z] (612 MB).
    File format: each article starts with a line such as:
    <#id_num>
    where id_num is a unique identifier.
    Other lines have the following format (1 token per line):
    Wordform    \t    POS    \t    Lemma.
  • Meta-data [.txt.7z] (12 MB): for each article, referred to by its identifier, gives its title and the number of words extracted. The article's categories are also provided. The information is organised as follows:
    File                              Format
    wikipediaArticles.txt             article id, article title, number of tokens
    wikipediaCategories.txt           category unique id, category title
    wikipediaArticlesCategorie.txt    article id, category id
    A <articleId, categoryId> line means that the article with id articleId belongs to the category with id categoryId. An article often belongs to several categories.
  • Frequency table [.txt.7z] (16 MB): gives, for each inflected form found in the corpus, its number of occurrences. This file was built with a script taken from the book by Ludovic Tanguy and Nabil Hathout, Perl pour les linguistes. It is available from the perl.linguistes.free.fr website.
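
The POS-tagged file format described above can be read with a few lines of code. The following Python sketch is illustrative only and is not the Perl script used to build the frequency table. It assumes that the header line is exactly <#id_num> with nothing else on the line, that token lines contain exactly three tab-separated fields, and that any other line can be skipped. The function names (iter_tagged_articles, wordform_frequencies) and the sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Article header as described above: "<#id_num>" alone on a line.
# Assumption: the identifier contains no tab and no ">" character.
HEADER = re.compile(r"^<#([^\t>]+)>$")

Token = Tuple[str, str, str]  # (wordform, POS, lemma)


def iter_tagged_articles(lines: Iterable[str]) -> Iterator[Tuple[str, List[Token]]]:
    """Yield (article_id, tokens) pairs from lines of the POS-tagged corpus."""
    article_id = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        match = HEADER.match(line)
        if match:
            # A new header closes the previous article, if there was one.
            if article_id is not None:
                yield article_id, tokens
            article_id, tokens = match.group(1), []
            continue
        fields = line.split("\t")
        # Assumption: lines before the first header, blank lines and lines
        # without exactly three fields are ignored.
        if article_id is None or len(fields) != 3:
            continue
        tokens.append((fields[0], fields[1], fields[2]))
    if article_id is not None:
        yield article_id, tokens


def wordform_frequencies(lines: Iterable[str]) -> Counter:
    """Count occurrences of each inflected form over all articles."""
    counts: Counter = Counter()
    for _, tokens in iter_tagged_articles(lines):
        counts.update(form for form, _, _ in tokens)
    return counts


if __name__ == "__main__":
    # Made-up sample lines in the documented format, not taken from the corpus.
    sample = [
        "<#12>\n",
        "Le\tDET:ART\tle\n",
        "chat\tNOM\tchat\n",
        "<#13>\n",
        "Le\tDET:ART\tle\n",
    ]
    for article_id, tokens in iter_tagged_articles(sample):
        print(article_id, len(tokens))
    print(wordform_frequencies(sample).most_common(1))
```

Both functions accept any iterable of lines, so an open file object for the decompressed .tag file can be passed directly. The character encoding of the file is not stated here and has to be checked before opening it.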