REDAC
REsources Developed At CLLE (CLLE: Cognition, Langues, Langage, Ergonomie)






ENGLAWI

ENGLAWI's documentation
Textual elements

Description

Textual elements found in etymology sections, definitions (glosses and examples) and usage notes are available under four different formats that are described below: wiki, xml, txt and parsed.

  • wiki: the wiki element (only in the dev version) contains the initial wikicode found in Wiktionary's dump.
  • xml: XML formatted version where markups encode wiki typesetting (boldface, italic, etc.), dates, foreign words, mathematical/chemical formulae, external/inner links, etc.
  • txt: raw text version. This text version has been produced from the xml element. Other versions can be generated from the XML one by formatting specific elements differently: markups embedded in xml elements can be used for example to remove non-textual content (formulae) or unwanted words (e.g. foreign words), or, conversely, to look specifically for these elements (their relevance being task-dependent).
  • parsed: a CoNLL (Nivre et al., 2007) output of the Talismane syntactic parser.

XML structure of the main elements

<!ELEMENT wiki (#PCDATA)>
<!ELEMENT xml (#PCDATA|affixUX|b|br|defdate|doublet|etym|etymon|formOf|formattedText
                      |givenName|i|inflectionOf|innerLink|IPA|lang|namedAfter|nonGloss|note
                      |quotation|sense|sub|sup|surname|taxon|transOnly|unknown|wordFormation)*>
<!ELEMENT txt (#PCDATA)>
<!ATTLIST txt
                      id CDATA #IMPLIED>
<!ELEMENT parsed (#PCDATA)>
The xml elements may include the following children (in turn optionally mixed with other xml elements' children):
  • b and i elements mark bold and italic typefaces. When bold and/or italic are unbalanced in the wikicode, they are removed in order to produce well-formed XML.
  • sup and sub elements mark superscript and subscript elements.
  • br: marks a line break.
  • formattedText delimits elements displayed in a specific font, depending on the content. The type of content is identified by the type attribute, whose value may be one of the following:
    • code: usually used to delimit computer code
    • hiero: used for hieroglyphs encoding
    • math: mathematical formulae, sometimes using LaTeX commands
    • nuclide: in chemistry, delimits nuclide formulae
    • tt: teletype text (rendered in monospace font), usually used for newsgroup addresses, Unix command line, etc.
    <!ELEMENT formattedText (#PCDATA)>
    <!ATTLIST formattedText type (code|hiero|math|nuclide|tt) #IMPLIED>
    An example is given below for a gloss of quadratic:

    Please note that the text version reproduces the content of the formattedText element. Other options, like ignoring this content (e.g. considered linguistically irrelevant) may be chosen, depending on the task at hand.
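The choice between keeping and ignoring such content can be sketched in Python. This is a minimal sketch, assuming the xml element is available as a well-formed string; the helper name to_text and the sample string are hypothetical and are not taken from ENGLAWI.

```python
import xml.etree.ElementTree as ET

def to_text(element, skip=frozenset()):
    """Concatenate the text of `element`, omitting any child whose tag is in `skip`."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in skip:
            parts.append(to_text(child, skip))
        # The tail is the text that follows the child, so it belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)

# Invented sample for illustration only (not an ENGLAWI gloss).
sample = (
    '<xml>Of the form <formattedText type="math">ax^2+bx+c</formattedText>, '
    'a <i>polynomial</i>.</xml>'
)
root = ET.fromstring(sample)

full_text = to_text(root)                                # keeps the formula
without_formulae = to_text(root, skip={"formattedText"})  # drops the formula
```

Passing other tag names in skip would drop other kinds of content in the same way; which elements to drop remains task-dependent.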
  • defdate: mostly in definitions (glosses) and sometimes in etymologies, provides information about when a sense was first used. The content follows no formal encoding: from 17th c., 17th-19th c., Late 16th century, First attested prior to 1150, etc.
  • IPA: used outside pronunciation sections (in etymologies, definitions and usage notes) to delimit phonemic transcriptions in the International Phonetic Alphabet. Transcriptions may be written between square brackets, between slashes, or without delimiters. The example below is found in the Usage notes section of the article lima beans:
  • surname: empty element used to signal glosses defining a surname. An example is given below for the gloss of Stewartson:
  • innerLink: delimits inner links, i.e. words that link to Wiktionary's entries. Inner links may have a ref attribute when the text of the link is slightly different from Wiktionary's headword (e.g. a different case, an inflected form linking to its lemma, a unit's symbol, etc.). A ref may also be used to link to a specific section (e.g. POS section) of the target article.
    <!ELEMENT innerLink (#PCDATA|b|i|sub|sup)*> <!ATTLIST innerLink ref CDATA #IMPLIED>
    Examples of inner links found in a definition of letter are given below:
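Separately from the ENGLAWI examples, the role of the ref attribute can be sketched in Python. This is a minimal sketch, assuming that a link without ref points to the headword equal to its displayed text; the helper name link_targets and the sample string are hypothetical and are not taken from ENGLAWI.

```python
import xml.etree.ElementTree as ET

def link_targets(xml_string):
    """Return (displayed text, target) pairs for each innerLink in an xml element."""
    root = ET.fromstring(xml_string)
    pairs = []
    for link in root.iter("innerLink"):
        # innerLink may contain b, i, sub and sup children, so gather all nested text.
        shown = "".join(link.itertext())
        # Assumption: without ref, the displayed text is itself the headword.
        pairs.append((shown, link.get("ref", shown)))
    return pairs

# Invented sample for illustration only.
sample = (
    '<xml>A <innerLink>symbol</innerLink> used in '
    '<innerLink ref="writing">written</innerLink> language.</xml>'
)
pairs = link_targets(sample)
```

The ref value is returned verbatim; when it points to a specific section of the target article, splitting it further is left to the consumer.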
  • note: additional information given in a gloss. For instance, notes distinguish two senses of the noun toaster described by the same gloss (one who toasts):
  • sense: used to specify a sense qualifier in definitions (glosses), etymologies or usage notes.
    <!ELEMENT sense (#PCDATA|innerLink)*>
    For example, libero has two definitions, one related to soccer and the other to volleyball. A usage note explains that in the volleyball context, Libero is always capitalised:
  • givenName: used to define proper nouns denoting a given name.
    <!ELEMENT givenName EMPTY>
    <!ATTLIST givenName diminutive CDATA #IMPLIED type (male|female|both) #IMPLIED>
    An example corresponding to one of the definitions of the entry April is given below:
  • namedAfter: used in etymology sections when a word is named after someone or something.
    <!ELEMENT namedAfter (#PCDATA)>
    <!ATTLIST namedAfter occ CDATA #IMPLIED nat CDATA #IMPLIED born CDATA #IMPLIED died CDATA #IMPLIED>
    The optional attributes occ, nat, born and died may indicate the occupation, nationality, year of birth and year of death. An example corresponding to the etymology of the entry Dirac delta is given below:
  • nonGloss: encloses a definition, or a part of it, which is not formulated like a gloss.
    <!ELEMENT nonGloss (#PCDATA|etymon|formOf|givenName|i|innerLink|note|sub)*>
    An example corresponding to the definition of man, used as an interjection, is given below:

    Another example, corresponding to the definition of Vte, illustrates that the nonGloss tag may also mark only a text segment that is not to be considered as a proper part of the gloss (French for in the example below):
  • taxon: categorizes a taxonomic name. The type attribute indicates the level (rank) of the taxon (family, tribe, genus, subgenus, etc.). The date attribute indicates a date (YYMMDD) on which the spelling of the taxon in the entry was verified. See Wiktionary's documentation for more detail.
    <!ELEMENT taxon (#PCDATA)>
    <!ATTLIST taxon type CDATA #IMPLIED date CDATA #IMPLIED>
    The examples below are taken from the definition of Nelson's elk and from one of the definitions of cowslip:



  • unknown: in etymologies, used to indicate an unknown or uncertain origin.
    <!ELEMENT unknown (#PCDATA)>
    <!ATTLIST unknown txt CDATA #IMPLIED>
    The rendering of this markup may be unknown (most of the time), as illustrated in the yomp article below, but also any other text. In the latter case, this text is given by the txt attribute, as in the brickhouse article.


  • inflectionOf: Wiktionary's entries may correspond to inflected forms of words. In such cases, Wiktionary's gloss is a plain text description of the inflection type and lemma. The inflectionOf tag encodes this information into two child elements: inflectionType and lemma. When an indication (e.g. diachronic information) on the inflected form is given in Wiktionary, it is reported in the qual (qualifier) attribute (see e.g. the XML for closeth).
    <!ELEMENT inflectionOf (inflectionType, lemma)>
    <!ELEMENT inflectionType (#PCDATA)>
    <!ATTLIST inflectionType qual (alternative|archaic|irregular|nonstandard|obsolete) #IMPLIED>
    <!ELEMENT lemma (#PCDATA)>
    For example, the definition of accuratest is:

    And built is defined as follows:

    The inflection information found in a gloss is reported in the inflectionInfos element within the same pos parent element. It is also reported in the paradigm element of the lemma's article.
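Reading the inflectionType and lemma children can be sketched in Python as follows. This is a minimal sketch based only on the DTD above; the helper name read_inflection and the sample string (including the wording of the inflection type) are hypothetical and are not the actual ENGLAWI content.

```python
import xml.etree.ElementTree as ET

def read_inflection(xml_string):
    """Return the inflection type, its qualifier and the lemma, or None if absent."""
    root = ET.fromstring(xml_string)
    node = root if root.tag == "inflectionOf" else root.find(".//inflectionOf")
    if node is None:
        return None
    kind = node.find("inflectionType")
    lemma = node.find("lemma")
    if kind is None or lemma is None:
        return None  # does not follow the (inflectionType, lemma) content model
    return {
        "type": (kind.text or "").strip(),
        "qual": kind.get("qual"),  # None when the optional attribute is missing
        "lemma": (lemma.text or "").strip(),
    }

# Invented sample for illustration only.
sample = (
    '<xml><inflectionOf>'
    '<inflectionType qual="archaic">third-person singular simple present</inflectionType>'
    '<lemma>close</lemma>'
    '</inflectionOf></xml>'
)
info = read_inflection(sample)
no_inflection = read_inflection('<xml>A plain gloss.</xml>')
```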
  • wordFormation: mostly in etymologies and sometimes in definitions, provides information about the entry's word formation.
    <!ELEMENT wordFormation (#PCDATA|innerLink|lang)*>
    <!ATTLIST wordFormation type (affix|back-formation|blend|circumfix|compound|confix|infix|prefix|suffix) #IMPLIED indication CDATA #IMPLIED>
    The wordFormation tag encloses the words (separated by the | character) that take part in the formation. The type attribute is the kind of formation, as indicated in Wiktionary.
    The two examples below show the etymologies of multiculturalism and newsletter:

    The indication attribute may provide information on the meaning of words, their origin, etc. For instance, the etymology of parasol indicates that in this word, the Italian para means to shield and the Italian sole means sun.
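Splitting the content of a wordFormation element on the | separator can be sketched in Python. This is a minimal sketch; the helper name formation_parts and the sample string are hypothetical and do not reproduce the actual ENGLAWI markup for newsletter.

```python
import xml.etree.ElementTree as ET

def formation_parts(xml_string):
    """Return (type, [words]) for each wordFormation element in an xml element."""
    root = ET.fromstring(xml_string)
    results = []
    for wf in root.iter("wordFormation"):
        # Words may be wrapped in innerLink children, so gather all nested text first.
        text = "".join(wf.itertext())
        words = [w.strip() for w in text.split("|") if w.strip()]
        results.append((wf.get("type"), words))
    return results

# Invented sample for illustration only.
sample = '<xml><wordFormation type="compound">news|letter</wordFormation></xml>'
parts = formation_parts(sample)
```

Because lang children are also allowed inside wordFormation, language names would end up in the word strings with this approach; a consumer that needs them separately would have to handle those children explicitly.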

  • affixUX (affix usage example): in entries describing affixes, illustrates how a given affix is used to create a derived word.
    <!ELEMENT affixUX (#PCDATA)>
    <!ATTLIST affixUX type (prefix|suffix) #REQUIRED>
    A shortened example from the entry for the suffix -ette is given below:
  • quotation: see the dedicated page
  • etym, lang and etymon
    • The etym element only occurs in etymologies, while the etymon element may also occur in definitions. The type attribute indicates the nature of the link between the etymon and the entry, or between two etymons. The language of origin may be given by the lang element or by the langCode and langName attributes of the etymon element (cf. below).
    • <!ELEMENT etym (#PCDATA|lang|etymon)*>
      <!ATTLIST etym type (borrowing|calque|cognate|derived|inherited|learnedBorrowing|loan) #IMPLIED term CDATA #IMPLIED>
      <!ELEMENT lang (#PCDATA)>
      <!ATTLIST lang langCode CDATA #IMPLIED>
      <!ELEMENT etymon (#PCDATA|innerLink)*>
      <!ATTLIST etymon langCode CDATA #IMPLIED langName CDATA #IMPLIED gloss CDATA #IMPLIED translit CDATA #IMPLIED>
    • etymon: marks an etymon and provides additional information:
      • The language of the etymon may be given by the langCode and langName attributes. These two attributes are generally not present when the etymon is preceded by a lang element.
      • a gloss, often an English translation of the etymon (attribute gloss)
      • a transliteration, for non-Latin-script etymons (attribute translit)
      An etymon element may be enclosed within an etym parent element or may occur alone.
    • lang: marks a language written in plain text. This element may precede an etymon element. In such cases, the etymon element should not have langCode and langName attributes.
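The two ways of stating an etymon's language can be reconciled with a small Python sketch. It assumes that a lang element applies to the etymon siblings that follow it under the same parent, which is an interpretation of the description above rather than a documented rule; the helper name etymon_languages and the sample string are hypothetical.

```python
import xml.etree.ElementTree as ET

def etymon_languages(xml_string):
    """Return (etymon, language) pairs, using attributes first, then a preceding lang."""
    root = ET.fromstring(xml_string)
    found = []

    def walk(parent):
        last_lang = None  # most recent lang sibling seen under this parent
        for child in parent:
            if child.tag == "lang":
                last_lang = "".join(child.itertext()).strip()
            elif child.tag == "etymon":
                word = "".join(child.itertext()).strip()
                # Prefer the etymon's own attributes; fall back to the preceding lang.
                language = child.get("langName") or child.get("langCode") or last_lang
                found.append((word, language))
            else:
                walk(child)  # e.g. descend into an etym parent

    walk(root)
    return found

# Invented sample for illustration only.
sample = (
    '<xml><etym type="borrowing"><lang langCode="fr">French</lang> '
    '<etymon>parasol</etymon></etym>, from '
    '<etymon langCode="it" langName="Italian">parasole</etymon>.</xml>'
)
languages = etymon_languages(sample)
```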

  • doublet: in etymologies. Wiktionary defines a doublet as "one of two (or more) words in a language that have the same etymological root, but have come to the modern language through different routes [...]". The doublet element may be empty (e.g. below: pyre) or it may mark a word (e.g. below: tract).
    For example, the etymology section of pyre states that this word is a doublet of fire:


    Tract is described as a doublet of trait:

  • unknown: empty element used to indicate unknown origins in etymologies. For example, the etymology of ofay is:

  • formOf: mostly in definitions' glosses and sometimes in etymologies, when a headword is defined as a variant of another form.
    <!ELEMENT formOf (#PCDATA)*>
    <!ATTLIST formOf type CDATA #REQUIRED gloss CDATA #IMPLIED langCode CDATA #IMPLIED>
    For example, chimaera is defined as an alternative spelling of chimera:

  • transOnly (translation only): empty element that indicates a "translation hub", i.e., in Wiktionary's jargon: "An English multi-word entry that may be sum of parts and is there to host translations and enable navigation from one non-English entry to another non-English entry." A list of such entries can be found here. For example, day after tomorrow is defined as follows in Wiktionary:

    Noun
    day after tomorrow
    1. (This entry is a translation hub.)
    Synonym
    • overmorrow (obsolete)

    The corresponding XML in ENGLAWI is given below:


    The English Wiktionary's article day after tomorrow is, for instance, a target link of the French Wiktionnaire's après-demain.
    Some "translation hubs" have a gloss defining the entry, such as in be good for, which is a translation target link of the German taugen:
    Verb
    be good for
    1. (This entry is a translation hub.) to be fit,to be useful



