REDAC
REsources Developed At CLLE CLLE: Cognition, Langues, Langage, Ergonomie






ENGLAWI

Documentation

ENGLAWI is a free English Machine-Readable Dictionary encoded in XML format. It is a structured and normalized version of Wiktionary.

This dictionary includes: simple words, compounds and multiword expressions; inflected forms and lemmas; etymologies; pronunciations in IPA; definitions (glosses and examples); translations; semantic relations; morphological relations; spelling variations.

This page describes how the information encoded in ENGLAWI is structured.

  1. Root element: glawi
    <!ELEMENT glawi (article)*>
    <!ATTLIST glawi lang CDATA #REQUIRED dateDump CDATA #IMPLIED endParsingDate CDATA #IMPLIED>
    glawi is the root element. It has three attributes:
    • lang is the Wiktionary language edition on which the resource is based. Here, en means the English-language edition of Wiktionary.
    • dateDump is the version of the Wiktionary dump used to build ENGLAWI. Here, 2017-06-01 refers to the Wiktionary dump released on 1 June 2017.
    • endParsingDate indicates when this version of ENGLAWI was produced (this attribute may be used as a version identifier).
    Example: <glawi lang="en" dateDump="2017-06-01" endParsingDate="2017-10-24_16:23:01">
    The root element contains the dictionary's articles.
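As a minimal sketch of how these attributes might be read, the following uses Python's standard xml.etree.ElementTree module. The function name read_root_attributes and the truncated SAMPLE string are assumptions made for illustration; only the root tag and its attribute values come from the example above.

```python
import xml.etree.ElementTree as ET

# Illustrative, truncated sample: only the root tag is taken from the
# documentation example; a real ENGLAWI file contains many articles.
SAMPLE = (
    '<glawi lang="en" dateDump="2017-06-01" '
    'endParsingDate="2017-10-24_16:23:01"></glawi>'
)

def read_root_attributes(xml_text):
    """Return the three glawi attributes as a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "lang": root.attrib["lang"],                   # #REQUIRED in the DTD
        "dateDump": root.get("dateDump"),              # #IMPLIED, may be None
        "endParsingDate": root.get("endParsingDate"),  # #IMPLIED, may be None
    }

info = read_root_attributes(SAMPLE)
print(info)
```

Because dateDump and endParsingDate are declared #IMPLIED, the sketch reads them with get(), which returns None when the attribute is absent.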
  2. article
    <!ELEMENT article (title, pageId, meta?, text)>
    This element corresponds to a page (URL) of Wiktionary. The basic unit of Wiktionary's articles is the written form (or grapheme). A given article may contain several entries with distinct or identical parts of speech (POSs). A POS section may correspond to a canonical form (i.e. a lemma) or an inflected form.
  3. pageId
    <!ELEMENT pageId (#PCDATA)>
    A page identifier (an integer, as found in the dump).
  4. title
    <!ELEMENT title (#PCDATA)>
    The article's entry/written form, which corresponds to Wiktionary's associated web page.
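The article, pageId and title elements described above could be traversed as in the sketch below. It assumes that streaming with ElementTree's iterparse is appropriate for a large file; the helper name iter_articles and the page identifiers in SAMPLE are invented for illustration and do not come from ENGLAWI.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative sample following the article DTD (title, pageId, meta?, text).
# The pageId values are made up for the sake of the example.
SAMPLE = b"""<glawi lang="en">
  <article><title>abbey</title><pageId>1</pageId><text/></article>
  <article><title>LSD</title><pageId>2</pageId><text/></article>
</glawi>"""

def iter_articles(source):
    """Yield (pageId, title) pairs, one article at a time (hypothetical helper)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "article":
            page_id = int(elem.findtext("pageId"))
            title = elem.findtext("title")
            yield page_id, title
            elem.clear()  # drop the article's children once it has been read

articles = list(iter_articles(io.BytesIO(SAMPLE)))
print(articles)
```

With a real file, an open file object or a path could be passed in place of the BytesIO wrapper.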
  5. meta
    <!ELEMENT meta (category|reference)*>
    Metadata, which may be a mix of two optional elements:
    • reference
      <!ELEMENT reference (#PCDATA)>
      Reference to another resource: contributors use this field to indicate that they consulted a given resource when editing an article. Such resources may be online or printed dictionaries (e.g. the 5th edition of the OED, the 1976 edition of the Merriam-Webster) or specialized websites.
    • category
      <!ELEMENT category (#PCDATA)>
      <!ATTLIST category wikisaurus (0|1) #IMPLIED>
      Just as in Wikipedia, categories are manually assigned to pages in Wiktionary. ENGLAWI's category elements indicate the categories an article belongs to (if any). Categories may correspond to domains (Golf, Anatomy), specific classes of words (English words suffixed with -ism, English basic words), etc.
      Categories may occur in regular articles or in Wiktionary's thesaurus (Wikisaurus). A category found in Wikisaurus is signaled by the attribute wikisaurus. For example, the article LSD belongs to Wiktionary's category "Recreational drugs" and is assigned to the "Intoxication and intoxicants" category in Wikisaurus. The resulting XML is the following:
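A sketch of how regular and Wikisaurus categories might be told apart is given below. The META fragment is reconstructed from the LSD description above rather than copied from ENGLAWI, and it assumes that the wikisaurus attribute is set to 1 for Wikisaurus categories (the DTD allows 0 or 1). The helper name split_categories is hypothetical.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the prose description of the LSD article; the exact
# serialization in ENGLAWI may differ (assumption: wikisaurus="1").
META = """<meta>
  <category>Recreational drugs</category>
  <category wikisaurus="1">Intoxication and intoxicants</category>
</meta>"""

def split_categories(meta_xml):
    """Separate regular categories from Wikisaurus ones (hypothetical helper)."""
    meta = ET.fromstring(meta_xml)
    regular, thesaurus = [], []
    for cat in meta.findall("category"):
        # A missing attribute or wikisaurus="0" is treated as a regular category.
        target = thesaurus if cat.get("wikisaurus") == "1" else regular
        target.append(cat.text)
    return regular, thesaurus

regular, thesaurus = split_categories(META)
print(regular, thesaurus)
```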

    Example of a meta section for the article abbey:

  6. text
    <!ELEMENT text (pronunciations?, etymologies?, alternativeForms?, pos*)>

    The article's content. It may include pronunciations, etymologies, a section that enumerates alternative forms, and one or more pos (part-of-speech) sections.
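The content model of text can be inspected with a few lines of Python, as in this sketch. The TEXT fragment is hypothetical: its children are left empty because their internal structure is not described in this section, and summarize_text is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical text element with empty children, for illustration only.
TEXT = "<text><pronunciations/><etymologies/><pos/><pos/></text>"

# Children declared optional ("?") in the text content model.
OPTIONAL_CHILDREN = ("pronunciations", "etymologies", "alternativeForms")

def summarize_text(text_xml):
    """Report which optional children are present and count pos sections."""
    text = ET.fromstring(text_xml)
    summary = {name: text.find(name) is not None for name in OPTIONAL_CHILDREN}
    summary["pos_count"] = len(text.findall("pos"))  # pos* allows zero or more
    return summary

summary = summarize_text(TEXT)
print(summary)
```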

    Summary of the child elements of text:

    Elements without a link are described in their parent element's description.

