
GLÀFFOLI help page

Frequencies

Corpora

Frequencies of forms and lemmas have been computed over three different corpora:
  • Frantext 20e is a 30 million word corpus made of contemporary (20th century) novels taken from the Frantext database;
  • LM10 is a 200 million word corpus made of the archives of the newspaper Le Monde from 1991 to 2000;
  • FrWaC is a 1.6 billion word corpus made of web pages collected from the .fr domain.

Frequency computation

All corpora have been POS-tagged with the Talismane parser. Unknown inflected forms were lemmatized automatically by applying a set of morphological rules written by Nabil Hathout.
For each GLÀFF entry, the following are given:
  • the wordform's absolute frequency in each corpus;
  • the wordform's relative frequency (per million words) in each corpus;
  • the lemma's absolute frequency in each corpus;
  • the lemma's relative frequency (per million words) in each corpus.
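As an illustration of how an absolute frequency relates to a relative one, here is a minimal Python sketch. It assumes the rounded corpus sizes quoted in the corpus descriptions; the exact token counts used for GLÀFF are not stated on this page, so the results are only approximate. The function name and the sample count are hypothetical.

```python
# Rounded corpus sizes in words, taken from the corpus descriptions.
# Assumption: the real computation may rely on exact token counts.
CORPUS_SIZES = {
    "Frantext 20e": 30_000_000,
    "LM10": 200_000_000,
    "FrWaC": 1_600_000_000,
}


def relative_frequency(absolute_count: int, corpus: str) -> float:
    """Return the number of occurrences per million words in the corpus."""
    return absolute_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical count, for illustration only: 600 occurrences in LM10.
print(relative_frequency(600, "LM10"))
```

Relative frequencies make counts comparable across corpora of very different sizes, which matters here because FrWaC is roughly fifty times larger than Frantext 20e.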
Because the tagger does not provide all inflectional features for a given form, the frequency of a wordform is computed for the form's main syntactic category. Thus, two entries that share the same form, lemma and main category, such as conjugue|Vmip1s-|conjuguer and conjugue|Vmsp1s-|conjuguer, have the same frequency.
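The grouping described here can be sketched in Python as follows. This is only an illustration: it assumes that entries are written as form|tag|lemma, as in the example, and that the main category is the first character of the tag (V for verbs). The helper name frequency_key is hypothetical and is not part of GLÀFF.

```python
from collections import defaultdict
from typing import Tuple


def frequency_key(entry: str) -> Tuple[str, str, str]:
    """Reduce an entry to (form, main category, lemma).

    Assumption: the main category is the first character of the tag.
    """
    form, tag, lemma = entry.split("|")
    return (form, tag[0], lemma)


entries = [
    "conjugue|Vmip1s-|conjuguer",  # indicative present, 1st person singular
    "conjugue|Vmsp1s-|conjuguer",  # subjunctive present, 1st person singular
]

# Entries that reduce to the same key share a single frequency value.
groups = defaultdict(list)
for entry in entries:
    groups[frequency_key(entry)].append(entry)

for key, members in groups.items():
    print(key, members)
```

Both entries reduce to the same key, so under these assumptions they are counted together and display the same frequency.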

How to display frequencies?

To display frequencies, click the Preferences link in the top menu and check the Frequences box. Then confirm by clicking the Set prefs button.

FrWaC contexts

If you choose to display frequencies, the FrWaC frequencies appear as links to contexts that contain the form or lemma in the FrWaC corpus. These links redirect to the NoSketch Engine concordancer, which lets you browse FrWaC.