REsources Developed At CLLE

G-PeTo: GLAWI Perl Tools

This page provides scripts that we hope will be useful for extracting information from GLAWI.

The first group of programs consists of quick-and-dirty scripts: they were quickly written, but they also run faster than scripts relying on SAX/DOM parsing. They do not require the installation of any specific module and should run under any OS with Perl installed. They are also easy to modify if needed, even with limited programming skills.
These programs read GLAWI line by line and extract one or several titles (headwords) or articles in XML format (intended, e.g., to be transformed with an XSL stylesheet, browsed manually or further queried by any program).

The programs using XML::DOM first work like SAX parsers to collect the textual content of a given article, then build a DOM document for that article only (the DOM of the previous article is discarded). Once the DOM is built, the search/extraction is performed on it.

Note: the initial shebang line (e.g. #!/usr/bin/perl) has been removed from each script for compatibility reasons. You may want to add it back in order to call a program directly instead of going through the perl command. In that case, make sure the execution rights are set correctly.

Person in charge

Franck Sajous

Licence

Some rights are reserved. G-PeTo programs are available under a Creative Commons BY-NC-SA 3.0 license.

References

Franck Sajous, Basilio Calderone and Nabil Hathout (2020). ENGLAWI: From Human- to Machine-Readable Wiktionary. Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 3016-3026, Marseille, France.

Programs (each entry gives the program, its description and the Perl modules it requires)

• splitGLAWI.pl splits the large GLAWI file into several smaller files.
Command line: perl splitGLAWI.pl GLAWI.xml size(Mo) dstDir filePrefix
Example: perl splitGLAWI.pl GLAWI.xml 100 /tmp/SPLITS/ glawiSplit
→ produces files of 100 MB each (Mo, for mégaoctet, is the French abbreviation of megabyte), located in directory /tmp/SPLITS/ and named filePrefix-1.xml, filePrefix-2.xml ... filePrefix-N.xml (here, glawiSplit-1.xml, glawiSplit-2.xml ... glawiSplit-N.xml).
Requires: no additional module

• extractArticle.pl extracts the single article whose title exactly matches a given title.
Command line: perl extractArticle.pl GLAWI.xml title [outFile]
Example: perl extractArticle.pl GLAWI.xml dictionnaire dict.xml
→ extracts the article "dictionnaire" into file dict.xml.
Requires: no additional module

• extractArticles.pl extracts the articles whose titles match the specified regexp.
Command line: perl extractArticles.pl GLAWI.xml regexp [outFile]
Example: perl extractArticles.pl GLAWI.xml "^anti" anti.xml
→ extracts all entries whose title starts with anti (e.g. the anti- prefix).
Requires: no additional module

• extractTitles.pl same as extractArticles.pl, but extracts only the titles instead of the whole articles.
Requires: no additional module

• extractArticlesWithLabelValue.pl extracts articles having a definition that includes a label whose value matches the specified one (whatever the label's type). The label value is matched against a case-insensitive regexp.
Command line: perl extractArticlesWithLabelValue.pl GLAWI.xml labelValueRegexp [outFile]
Examples:
perl extractArticlesWithLabelValue.pl GLAWI.xml "^vieilli\$" dated.xml
→ extracts articles with at least one gloss including a vieilli (dated) label value.
perl extractArticlesWithLabelValue.pl GLAWI.xml "^chimie\$" chemistry.xml
→ extracts articles with at least one gloss related to the chimie (chemistry) domain.
Requires: no additional module

• extractTitlesWithLabelValue.pl extracts the titles of articles having a definition that includes a label whose value matches the specified one (whatever the label's type). The label value is matched against a case-insensitive regexp.
Command line: perl extractTitlesWithLabelValue.pl GLAWI.xml labelValueRegexp [outFile]
Example: perl extractTitlesWithLabelValue.pl GLAWI.xml "^vieilli\$" dated.xml
→ extracts the titles of articles with at least one gloss including a vieilli (dated) label value.
Requires: no additional module

• extractTitlesWithLabelValueAllSenses.pl same as the previous script (extractTitlesWithLabelValue.pl), but the label has to be found in every gloss: either the entry is monosemic and its only gloss carries the label, or the entry is polysemic and all of its glosses carry the label (for a given POS). The label value is matched against a case-insensitive regexp.
Command line: perl extractTitlesWithLabelValueAllSenses.pl GLAWI.xml labelValueRegexp [outFile]
Example: perl extractTitlesWithLabelValueAllSenses.pl GLAWI.xml "^vieilli\$" dated.xml
→ extracts the titles of articles in which all glosses include a vieilli (dated) label value.
Requires: no additional module

• extractGlossWithLabelValue.pl extracts glosses including a label whose value matches the specified one (whatever the label's type). The label value is matched against a case-insensitive regexp.
Command line: perl extractGlossWithLabelValue.pl GLAWI.xml labelValueRegexp [outFile]
Example: perl extractGlossWithLabelValue.pl GLAWI.xml "^péjoratif\$" pej.xml
→ extracts glosses with a péjoratif (pejorative) label value.
Output format: article's title TAB gloss
Example output:

moscoutaire	Communiste qui ne jure que par l'Union soviétique.
cartelliste	Relatif un cartel, une entente.
encagoulé	Terroriste qui porte une cagoule.
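
Since the output is plain TAB-separated text, it can be post-processed with standard tools. The sketch below is only an illustration: it recreates two of the sample lines above in a temporary file, and the helper name glosses_for is hypothetical.

```shell
out=$(mktemp)

# Two sample lines in the "title TAB gloss" format shown above.
printf '%s\t%s\n' \
  'moscoutaire' "Communiste qui ne jure que par l'Union soviétique." \
  'encagoulé'   'Terroriste qui porte une cagoule.' > "$out"

# Print the glosses recorded for one title (exact match on the first field).
glosses_for() {  # usage: glosses_for FILE TITLE
  awk -F '\t' -v t="$2" '$1 == t { print $2 }' "$1"
}

glosses_for "$out" 'encagoulé'

# List the titles only, one per line.
cut -f1 "$out"
```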

• extractGlossMatchingCriteria.pl extracts glosses including a given word (matched against a case-insensitive regexp).
Command line: perl extractGlossMatchingCriteria.pl [-H] [-l] [-w word] [-f wordsFile] [-p POS] GLAWI.xml [outFile]
Options:

-H : HTML-formatted output
-l : lemmas' glosses only
-p POS : selects only glosses within a given POS section type (regexp match)
-w word : selects glosses matching word (regexp match)
-f wordsFile : file listing the words to be found in glosses (UTF-8 text file, one word per line, regexp allowed)

Example: perl extractGlossMatchingCriteria.pl -H -l -p "nom" -w "anti.*" GLAWI.xml anti.html
→ extracts nouns' glosses (lemmas only) including a word starting with anti and writes the HTML-formatted result to file anti.html.
Requires: Getopt::Std

• extractArticlesMatchingCriteria.pl extracts articles (or articles' titles) matching a set of criteria.
Command line: perl extractArticlesMatchingCriteria.pl [OPTIONS] GLAWI.xml [outFile]
Options:

-e : outputs only entries' titles instead of the whole articles
-t regexp : selects only articles whose titles match against the specified regexp (case-insensitive)
-p POS : selects only articles having a given syntactic category (equal to POS)
-c labelCategory : selects only articles having a gloss definition including a label whose category equals labelCategory (whatever the label's value)
-v labelValue : selects only articles having a gloss definition including a label whose value equals labelValue (whatever the label's category)
-f labelValuesFile : selects only articles having a gloss definition with a label whose value is included in the specified file (UTF-8 text file, one value per line, regexp allowed).

When both the -c and -v options are used, they must be satisfied by the same label.
When -c and/or -v are used together with the -p option, the targeted label must be found in a POS section of the specified type.
The -v and -f options are mutually exclusive.

Examples:

perl extractArticlesMatchingCriteria.pl -v 'rare' -p verbe GLAWI.xml
→ extracts the articles with verbs that include a rare label (value) in one of their glosses.
perl extractArticlesMatchingCriteria.pl -e -p verbe -f computerScienceLabelValues.txt GLAWI.xml csVerbs.txt
→ extracts titles of verb entries having a gloss definition with a label whose value is included in the computerScienceLabelValues.txt file. Output is written in file csVerbs.txt.

Requires: Getopt::Std, XML::DOM
