This documentation is based on Sajous and Hathout (2015)
and Hathout and Sajous (2016).
"Textual sections" (i.e. free text) found in etymology sections and definitions (glosses and examples)
are provided in four different formats, described below:
- wiki: the initial wikicode that has been parsed (wiki elements are only available in the dev version).
- xml: XML formatted version where markups encode wiki typesetting (boldface, italic, etc.),
dates, foreign words, mathematical/chemical formulae, external/inner links, etc.
- txt: raw text version, produced from the
xml element. Other versions can be generated from the XML one by formatting specific elements differently:
the markups embedded in xml elements can be used, for example, to remove non-textual content (formulae) or unwanted words (e.g. foreign words),
or, conversely, to look specifically for these elements (their relevance being task-dependent).
In the dev version, txt elements have a unique integer identifier attribute named id.
- parsed: the output of the Talismane syntactic parser,
in CoNLL format (Nivre et al., 2007).
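
As an illustration of how another text version could be derived from the XML one, here is a minimal Python sketch that flattens an element to plain text while dropping formulae and foreign words. The sample fragment is invented for this example, and the wrapper element name `xml` is an assumption drawn from the description above; real GLAWI entries may be nested differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment written for this example (not taken from GLAWI).
SAMPLE = (
    '<xml>Du latin <foreignWord lang="la">aqua</foreignWord> '
    '(<i>eau</i>), formule <fchim>H2O</fchim>.</xml>'
)

def flatten(elem, skip=frozenset({"math", "fchim", "foreignWord"})):
    """Return the text of elem, dropping the content of elements listed in skip."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(flatten(child, skip))
        # The text that follows a child (its tail) belongs to the parent,
        # so it is kept even when the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(SAMPLE)
# Collapse the whitespace left behind by removed elements.
text = " ".join(flatten(root).split())
print(text)
```

Passing a different `skip` set (for instance an empty one) keeps the corresponding elements, which is one way to make the filtering task-dependent.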
XML structure of the main elements
These elements may include the following children (which may in turn be mixed with other XML elements):
- b and i elements mark bold and italic typefaces.
When bold and/or italic are unbalanced in the wikicode, they are removed in order to produce well-formed XML.
- cf: this element is used in etymology sections to explain word formation (compounds, derived words). Attributes are:
- lang: origin language (optional);
- value: word(s) from which the entry has been coined.
When several words are involved, they are separated by a
- foreignWord: used in etymology sections to mark non-French words. Attributes are:
- lang: the word's origin language;
- sense: optional indication of the word's meaning;
- translit: optional transliteration of the word, when it is written in a non-Latin alphabet.
- date, century: diachronic information.
There is no formal syntax to encode dates inside these elements.
The date element may mark a precise year or day:
<date>1826</date>, <date>18 avril 1982</date>,
or fuzzier periods:
<date>vers 1820</date>, <date>avant 1915</date>, <date>années 1960</date>.
Uncertain dates are indicated by question marks:
Although a dedicated markup exists (century), the
date element may be used to mark centuries as well:
- math, fchim: mathematical and chemical formulae, respectively.
- inlinePron: pronunciation occurring in a textual part of the article.
- innerLink: link to a Wiktionnaire entry. The target wordform is given by the ref attribute.
- linkWikiProject: link to another Wikimedia project. The target is given by the target attribute.
- link: external link. The target URL is given by the url attribute.
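
The child elements listed above can be queried with any XML library. The following Python sketch shows one possible way to collect foreign words, link targets and years from a textual section. The fragment, its attribute values (such as the language code) and the `xml` wrapper element are invented for illustration. Since dates have no formal syntax, the year extraction is only a heuristic that looks for a four-digit number.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment written for this example (not taken from GLAWI).
SAMPLE = (
    '<xml>(<date>vers 1820</date>) Du grec '
    '<foreignWord lang="grc" translit="lógos" sense="parole">λόγος</foreignWord>. '
    'Voir <innerLink ref="logique">logique</innerLink> et '
    '<link url="https://example.org">ce site</link>.</xml>'
)

root = ET.fromstring(SAMPLE)

# foreignWord: lang gives the origin language; sense and translit are optional,
# so .get() returns None when they are absent.
foreign = [
    (fw.text, fw.get("lang"), fw.get("translit"), fw.get("sense"))
    for fw in root.iter("foreignWord")
]

# innerLink targets are given by the ref attribute, external links by url.
inner_targets = [a.get("ref") for a in root.iter("innerLink")]
external_urls = [a.get("url") for a in root.iter("link")]

# Heuristic: keep the first four-digit number found in each date element,
# or None when the date contains no such number.
YEAR = re.compile(r"\b(\d{4})\b")
years = []
for d in root.iter("date"):
    m = YEAR.search("".join(d.itertext()))
    years.append(int(m.group(1)) if m else None)

print(foreign, inner_targets, external_urls, years)
```

A date such as "avant 1915" would yield 1915 with this heuristic, losing the "before" qualifier, so a task that needs precise periods would have to handle the free-text part separately.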
XML structure of subcomponents
The above-mentioned elements may be nested and mixed together, as described by the following section of GLAWI's DTD: