REsources Developed At CLLE-ERSS CLLE-ERSS research unit


This documentation is based on (Sajous and Hathout, 2015) and (Hathout and Sajous, 2016).

Textual sections


"Textual sections" (i.e. free text) found in etymology sections and definitions (glosses and examples) are provided in four formats, described below: wiki, xml, txt and parsed.

  • wiki: the initial wikicode that has been parsed (wiki elements are only present in the dev version).
  • xml: XML-formatted version whose markup encodes wiki typesetting (boldface, italic, etc.), dates, foreign words, mathematical/chemical formulae, external/inner links, etc.
  • txt: raw text version. This text version has been produced from the xml element. Other versions can be generated from the XML one by formatting specific elements differently: markups embedded in xml elements can be used for example to remove non-textual content (formulae) or unwanted words (e.g. foreign words), or, conversely, to look specifically for these elements (their relevance being task-dependent).
    In the dev version, txt elements have a unique integer identifier attribute named id.
  • parsed: a CoNLL (Nivre et al., 2007) output of the Talismane syntactic parser.
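
As an illustration of how another text version could be derived from an xml element, here is a minimal Python sketch. It assumes the xml element has already been isolated as a string; the function name to_text, the DROPPED set and the sample string are hypothetical and not part of GLAWI. Which elements to drop is task-dependent, as noted above.

```python
import xml.etree.ElementTree as ET

# Elements to skip: an arbitrary, task-dependent choice made for this example.
DROPPED = {"math", "fchim", "foreignWord"}

def to_text(element, dropped=DROPPED):
    """Concatenate the text of `element`, skipping children listed in `dropped`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in dropped:
            parts.append(to_text(child, dropped))
        # The text following a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)

# Hypothetical sample, not taken from GLAWI.
sample = '<xml>Du latin <foreignWord lang="la">aqua</foreignWord> (<i>eau</i>).</xml>'
root = ET.fromstring(sample)

without_foreign = to_text(root)            # foreign word removed
everything = to_text(root, dropped=set())  # nothing removed
```

Removing an element leaves the surrounding whitespace untouched, so a real pipeline would probably normalize spaces afterwards.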

XML structure of the main elements

    <!ELEMENT wiki (#PCDATA)>
    <!ELEMENT xml (#PCDATA|b|cf|date|fchim|foreignWord|i|innerLink|link|linkWikiProject|math|inlinePron|century)*>
    <!ELEMENT txt (#PCDATA)>
    <!ATTLIST txt id CDATA #IMPLIED>
    <!ELEMENT parsed (#PCDATA)>

The xml elements may include the following children (which may in turn contain other children of xml elements):
  • b and i elements mark bold and italic typefaces. When bold and/or italic are unbalanced in the wikicode, they are removed in order to produce well-formed XML.
  • cf: this element is used in etymology sections to explain word formation (compounds, derived words). Attributes are:
    • lang: origin language (optional);
    • value: word(s) from which the entry has been coined. When several words are involved, they are separated by a | character.
  • foreignWord: used in etymology sections to mark non-French words. Attributes are:
    • lang: word's origin language;
    • sense: optional indication of the word's meaning;
    • translit: optional transliteration of the word when it is written in a non-Latin alphabet.
  • date, century: diachronic information. There is no formal syntax to encode dates inside these elements. The date element may mark a precise year or day: <date>1826</date>, <date>18 avril 1982</date>, or fuzzier periods: <date>vers 1820</date>, <date>avant 1915</date>, <date>années 1960</date>. Uncertain dates are indicated by question marks: <date>1845 ?</date>. Although a dedicated element exists (century), the date element may be used to mark centuries as well: <date>xxie siècle</date>
  • math, fchim: mathematical and chemical formulae, respectively.
  • inlinePron: pronunciation occurring in a textual part of the article.
  • innerLink: link to a Wiktionnaire entry. The target wordform is given by the ref attribute.
  • linkWikiProject: link to another Wikimedia project. The target is given by the target attribute.
  • link: external link. The target url is given by the url attribute.
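
The attributes described in the list above can be read with any XML library. The following Python sketch uses the standard library and a hypothetical etymology fragment (the sample string is invented for illustration and does not come from GLAWI). It splits the value attribute of cf on the | character, flags dates containing a question mark as uncertain, and collects link targets and foreign words.

```python
import xml.etree.ElementTree as ET

# Hypothetical etymology fragment, not taken from GLAWI.
sample = (
    '<xml>(<date>1845 ?</date>) '
    '<cf lang="fr" value="porte|manteau"/> '
    'Voir <innerLink ref="porter">porter</innerLink> et '
    '<foreignWord lang="grc" translit="lógos" sense="parole">λόγος</foreignWord>.</xml>'
)
root = ET.fromstring(sample)

# cf: the value attribute holds one or more source words separated by "|".
bases = [cf.get("value", "").split("|") for cf in root.iter("cf")]

# date: free text with no formal syntax; a question mark signals uncertainty.
dates = [(d.text, "?" in (d.text or "")) for d in root.iter("date")]

# innerLink: the target wordform is given by the ref attribute.
links = [link.get("ref") for link in root.iter("innerLink")]

# foreignWord: lang is always present; translit and sense may be None.
foreign = [
    (fw.get("lang"), fw.text, fw.get("translit"), fw.get("sense"))
    for fw in root.iter("foreignWord")
]
```

Because date content is free text, anything beyond the question-mark test (for instance, extracting a year) would need task-specific heuristics.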

XML structure of subcomponents

The above-mentioned elements may be nested and mixed together, as described by the following section of GLAWI's DTD:

    <!ELEMENT b (#PCDATA|b|i|date|fchim|foreignWord|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ELEMENT i (#PCDATA|b|i|date|fchim|foreignWord|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ELEMENT date (#PCDATA|b|date|fchim|foreignWord|i|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ELEMENT fchim (#PCDATA|b|date|fchim|foreignWord|i|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ELEMENT inlinePron (#PCDATA|b|i|date|fchim|foreignWord|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ELEMENT cf EMPTY>
    <!ATTLIST cf lang CDATA #IMPLIED
                 value CDATA #IMPLIED>
    <!ELEMENT foreignWord (#PCDATA|b|i|date|fchim|foreignWord|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ATTLIST foreignWord lang CDATA #REQUIRED
                          sense CDATA #IMPLIED
                          translit CDATA #IMPLIED>
    <!ELEMENT innerLink (#PCDATA|b|i|date|fchim|foreignWord|innerLink|link|linkWikiProject|math|inlinePron|century|cf)*>
    <!ATTLIST innerLink ref CDATA #IMPLIED>
    <!ELEMENT link (#PCDATA|b|i)*>
    <!ATTLIST link url CDATA #IMPLIED>
    <!ELEMENT linkWikiProject (#PCDATA|b|i)*>
    <!ATTLIST linkWikiProject target CDATA #IMPLIED>
    <!ELEMENT century (#PCDATA)>
    <!ELEMENT math (#PCDATA)>
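
The nesting rules stated by these declarations can be checked approximately without a validating parser. The sketch below transcribes the content models into a Python dictionary and reports every (parent, child) pair that the DTD does not allow. The names INLINE, ALLOWED and nesting_errors are hypothetical, and the check is deliberately partial: it ignores attributes and does not verify that cf is empty.

```python
import xml.etree.ElementTree as ET

# Children allowed in the "mixed" elements, transcribed from the DTD.
INLINE = {"b", "i", "date", "fchim", "foreignWord", "innerLink", "link",
          "linkWikiProject", "math", "inlinePron", "century", "cf"}

ALLOWED = {tag: INLINE for tag in
           ("xml", "b", "i", "date", "fchim", "inlinePron",
            "foreignWord", "innerLink")}
ALLOWED.update({
    "link": {"b", "i"},             # links only allow bold and italic
    "linkWikiProject": {"b", "i"},
    "century": set(),               # text only
    "math": set(),                  # text only
    "cf": set(),                    # declared EMPTY
})

def nesting_errors(element):
    """Return (parent, child) tag pairs that the content models do not allow."""
    errors = []
    for child in element:
        if child.tag not in ALLOWED.get(element.tag, set()):
            errors.append((element.tag, child.tag))
        errors.extend(nesting_errors(child))
    return errors

# Hypothetical samples, not taken from GLAWI.
good = ET.fromstring('<xml><b><i>x</i></b></xml>')
bad = ET.fromstring('<xml><link url="http://example.org"><date>1826</date></link></xml>')

good_errors = nesting_errors(good)
bad_errors = nesting_errors(bad)
```

For full validation against the DTD, a validating parser would be the appropriate tool; this sketch only mirrors the element nesting.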
