GLÀFFOLI help page
Frequencies
Corpora
Frequencies of forms and lemmas have been computed over three different corpora:
- Frantext 20e is a 30-million-word corpus of contemporary (20th-century) novels taken from
the Frantext database;
- LM10 is a 200-million-word corpus made up of the archives of the newspaper Le Monde from 1991 to 2000;
- FrWaC is a 1.6-billion-word corpus of web pages collected from the .fr domain.
Frequency computation
All corpora have been POS-tagged with the Talismane parser.
Unknown inflected forms have been lemmatized automatically by
applying a set of morphological rules written by Nabil Hathout.
The following figures are given for each GLÀFF entry:
- the wordform's absolute frequency in each corpus;
- the wordform's relative frequency (per million words) in each corpus;
- the lemma's absolute frequency in each corpus;
- the lemma's relative frequency (per million words) in each corpus.
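As an illustration of how a per-million relative frequency relates to an absolute count, here is a minimal Python sketch. It assumes the round corpus sizes quoted in the Corpora section; the exact token counts used for GLÀFF may differ. The function name and the sample count are hypothetical.

```python
# Approximate corpus sizes in words, taken from the round figures
# quoted in the Corpora section (assumption: the real token counts
# used to compute GLÀFF frequencies may differ).
CORPUS_SIZES = {
    "Frantext 20e": 30_000_000,
    "LM10": 200_000_000,
    "FrWaC": 1_600_000_000,
}


def relative_frequency(absolute_count: int, corpus: str) -> float:
    """Return the number of occurrences per million words of `corpus`."""
    return absolute_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical count, for illustration only: a form seen 3,000 times
# in Frantext 20e occurs 100 times per million words.
print(relative_frequency(3_000, "Frantext 20e"))  # 100.0
```

Relative frequencies make counts comparable across corpora of very different sizes: the same absolute count of 3,000 would correspond to 15 occurrences per million words in LM10.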
Because the tagger does not provide every inflection feature for a given form,
the frequency of a wordform is counted for the main syntactic category of the form only.
Thus, two entries that share the same form, lemma and main category, such as conjugue|Vmip1s-|conjuguer and
conjugue|Vmsp1s-|conjuguer, will have the same frequency.
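The grouping described above can be sketched in Python as follows. This is only an illustration under two assumptions: that an entry is written as form|tag|lemma, as in the examples above, and that the main syntactic category is the first character of the tag (V in Vmip1s-). The helper name is hypothetical.

```python
def frequency_key(entry: str) -> tuple[str, str, str]:
    """Return the (form, main category, lemma) triple for an entry.

    Assumption: entries look like "form|tag|lemma" and the main
    category is the first character of the tag.
    """
    form, tag, lemma = entry.split("|")
    return (form, tag[0], lemma)


indicative = frequency_key("conjugue|Vmip1s-|conjuguer")
subjunctive = frequency_key("conjugue|Vmsp1s-|conjuguer")

# Both entries reduce to the same key, so they share one frequency.
print(indicative == subjunctive)  # True
```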
How to display frequencies?
To display frequencies, click the Preferences link in the top menu and
check the Frequences box. Then confirm by clicking the Set prefs button.
FrWaC contexts
If you choose to display frequencies, the FrWaC frequencies are shown as links that let you
see contexts containing the forms or lemmas in the FrWaC corpus. These links redirect
to the NoSketch Engine concordancer,
which lets you browse FrWaC.