REsources Developed At CLLE-ERSS (CLLE-ERSS research unit)


This documentation is based on (Sajous and Hathout, 2015) and (Hathout and Sajous, 2016).

GLAWI is a free French Machine-Readable Dictionary encoded in XML format. It is a structured and normalized version of Wiktionnaire (the French language edition of Wiktionary).

This dictionary includes: simple words, compounds and multiword expressions; inflected forms and lemmas; etymologies; pronunciations in IPA; definitions (glosses and examples); translations; semantic relations; morphological relations; spelling variations.

This page describes how the information encoded in GLAWI is structured.

  1. Root element: glawi <!ELEMENT glawi (article)*> <!ATTLIST glawi lang CDATA #REQUIRED dateDump CDATA #IMPLIED endParsingDate CDATA #IMPLIED> glawi is the root element. It has three attributes:
    • lang identifies the Wiktionary language edition on which the resource is based. Here, fr denotes Wiktionary's French language edition (a.k.a. Wiktionnaire).
    • dateDump is the date of the Wiktionary dump used to build GLAWI. Here, 2015-12-26 refers to the Wiktionnaire dump released on 26 December 2015.
    • endParsingDate indicates when this version of GLAWI was produced (this attribute may be used as a version identifier).
    Example: <glawi lang="fr" dateDump="2015-12-26" endParsingDate="2016-02-02_16:23:01"> The root element contains the articles of the dictionary.
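
As a rough illustration of the three attributes above, the sketch below reads them from the root start tag using Python's standard xml.etree.ElementTree module. The function name read_glawi_header and the in-memory sample are assumptions made for this example, not part of GLAWI. The timestamp layout %Y-%m-%d_%H:%M:%S is inferred from the single example value shown above and may not hold for every release.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime


def read_glawi_header(stream):
    """Read the attributes of the <glawi> root from a binary file object.

    Parsing stops at the root start tag, so the articles are not processed.
    """
    for _event, elem in ET.iterparse(stream, events=("start",)):
        if elem.tag != "glawi":
            raise ValueError(f"unexpected root element: {elem.tag!r}")
        attrs = dict(elem.attrib)
        break

    header = {"lang": attrs["lang"]}  # #REQUIRED in the DTD
    # The two dates are #IMPLIED, so either may be absent.
    dump = attrs.get("dateDump")
    parsed = attrs.get("endParsingDate")
    header["dateDump"] = (
        datetime.strptime(dump, "%Y-%m-%d").date() if dump else None
    )
    # Assumed layout, inferred from the one example value in this page.
    header["endParsingDate"] = (
        datetime.strptime(parsed, "%Y-%m-%d_%H:%M:%S") if parsed else None
    )
    return header


# In-memory stand-in for a real GLAWI file (no articles).
sample = io.BytesIO(
    b'<glawi lang="fr" dateDump="2015-12-26" '
    b'endParsingDate="2016-02-02_16:23:01"></glawi>'
)
header = read_glawi_header(sample)
print(header["lang"], header["dateDump"], header["endParsingDate"])
```

Converting the two dates to date and datetime objects makes it straightforward to compare releases, which fits the use of endParsingDate as a version identifier.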
  2. article <!ELEMENT article (title, pageId, meta, text)> This element corresponds to a page (URL) of Wiktionnaire. The basic unit of Wiktionnaire's articles is the written form (or grapheme). A given article may contain several entries with distinct or identical parts of speech (POSs). A POS section may correspond to a canonical form (i.e. a lemma) or to an inflected form.
  3. pageId <!ELEMENT pageId (#PCDATA)> A page identifier (an integer, as found in the dump).
  4. title <!ELEMENT title (#PCDATA)> The article's entry/written form, which corresponds to Wiktionnaire's associated web page.
  5. meta <!ELEMENT meta (import|reference|category|spellingVariation)*> Metadata, which may be a mix of various optional elements:
    • import <!ELEMENT import (#PCDATA)> Wiktionnaire was initially bootstrapped by automatic imports from editions of dictionaries that have fallen into the public domain: mostly the 8th edition (1932-1935) of the Dictionnaire de l'Académie française (DAF8) and the 2nd edition (1872-1877) of the Littré. The import element records such an import.
    • reference <!ELEMENT reference (#PCDATA)> Reference to another resource: this field is used by a contributor to indicate that she/he consulted a given resource when editing an article. Such resources may be online or printed dictionaries, specialized websites, etc.
    • category <!ELEMENT category (#PCDATA)> Just as in Wikipedia, categories are manually assigned to pages in Wiktionary. GLAWI's category elements indicate the categories an article belongs to (if any).
    • spellingVariation <!ELEMENT spellingVariation (#PCDATA)> <!ATTLIST spellingVariation norm CDATA #REQUIRED> This element indicates that a written form is a spelling variant of another one (e.g. nénuphar/nénufar 'water lily').
      See the example of spelling variations.

    Example of a meta section for the article nénuphar:

  6. text <!ELEMENT text (pronunciations?,etymology?,pos*,section*)>

    The article's content. It may include a pronunciations element, an etymology, any number of pos (part of speech) elements and any number of section elements.

    Summary of the child elements of text:

    Elements without a link are described within their parent element's description.
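
The article structure described above (title, pageId, meta, text) can be walked with a streaming parser, which is one way to avoid loading a large file into memory at once. The following is a minimal sketch using Python's standard xml.etree.ElementTree module, not an official GLAWI tool. The helper names summarize_article and iter_articles are hypothetical, and the sample article uses placeholder values and empty pos/section elements, since their internal structure is not covered on this page. The sketch also assumes that no element named article occurs nested inside another article.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_article(article):
    """Collect the title, pageId, meta children and text children."""
    meta = defaultdict(list)
    meta_elem = article.find("meta")
    if meta_elem is not None:
        for child in meta_elem:
            value = (child.text or "").strip()
            if child.tag == "spellingVariation":
                # Keep the required norm attribute next to the text.
                value = (value, child.get("norm"))
            meta[child.tag].append(value)
    text_elem = article.find("text")
    children = [c.tag for c in text_elem] if text_elem is not None else []
    return {
        "title": article.findtext("title"),
        "pageId": int(article.findtext("pageId")),
        "meta": dict(meta),
        "text_children": children,
    }


def iter_articles(stream):
    """Yield one summary per <article> while streaming through the file."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # start tag of <glawi>
    for event, elem in context:
        if event == "end" and elem.tag == "article":
            yield summarize_article(elem)
            root.clear()  # drop processed articles to limit memory use


# Hypothetical sample: every value below is a placeholder, not GLAWI data.
SAMPLE = b"""<glawi lang="fr">
  <article>
    <title>exemple</title>
    <pageId>1</pageId>
    <meta>
      <category>placeholder category</category>
      <spellingVariation norm="placeholder">example</spellingVariation>
    </meta>
    <text><pos/><pos/><section/></text>
  </article>
</glawi>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
for summary in articles:
    print(summary["pageId"], summary["title"], summary["text_children"])
```

Grouping the meta children by element name reflects the DTD, in which import, reference, category and spellingVariation may appear in any order and any number of times.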
