This page provides scripts that we hope will be useful for extracting information from GLAWI.
The first set of programs consists of quick-and-dirty scripts: they were written quickly, but they also run faster than programs based on SAX/DOM parsing. They require no additional module installation and should run under any OS with Perl installed. They are also easy to modify if needed, even with little programming experience.
The programs read GLAWI line by line and extract one or several titles (headwords) or articles in XML format (intended, e.g., to be transformed with an XSL stylesheet, browsed manually, or queried further by another program).
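As an illustration of the line-by-line approach, here is a minimal sketch in Python (not the Perl code of the scripts themselves). It assumes that each `<article>`, `</article>` and `<title>...</title>` tag sits on its own line and that titles contain no escaped XML entities. The function name `extract_article` and the sample data are hypothetical.

```python
import io
import re

TITLE_RE = re.compile(r"<title>(.*?)</title>")

def extract_article(lines, title):
    """Return the XML text of the first article whose title equals `title`, or None."""
    buffer = []      # lines of the article currently being read
    inside = False   # True while between <article> and </article>
    wanted = False   # True once the current article's title has matched
    for line in lines:
        if "<article>" in line:          # assumed opening tag, on its own line
            inside, wanted, buffer = True, False, []
        if inside:
            buffer.append(line)
            match = TITLE_RE.search(line)
            if match and match.group(1) == title:   # exact match on the title
                wanted = True
            if "</article>" in line:
                if wanted:
                    return "".join(buffer)
                inside = False
    return None

# Tiny in-memory stand-in for GLAWI.xml (made-up content).
sample = io.StringIO(
    "<glawi>\n"
    "<article>\n<title>chat</title>\n</article>\n"
    "<article>\n<title>dictionnaire</title>\n</article>\n"
    "</glawi>\n"
)
print(extract_article(sample, "dictionnaire"))
```

Because no XML tree is built, memory use stays at one article at most, which is the main reason such scripts run quickly on a large file.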
The programs using XML::DOM first work like SAX parsers to build the textual content of a given article, then build a DOM document for that article (discarding the DOM of the previous article). Once the DOM is built, the search/extraction is performed.
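The one-DOM-per-article strategy can be sketched as follows, in Python with the standard `xml.dom.minidom` module standing in for Perl's XML::DOM. This is only an illustration under assumptions: the element names `<article>`, `<title>` and `<label value="...">`, the one-tag-per-line layout, and the function names are not taken from the scripts.

```python
import io
from xml.dom.minidom import parseString

def iter_article_doms(lines):
    """Yield one DOM document per article; only one article is buffered at a time."""
    buffer = None
    for line in lines:
        if "<article>" in line:          # assumed to start a line of its own
            buffer = []
        if buffer is not None:
            buffer.append(line)
            if "</article>" in line:
                yield parseString("".join(buffer))
                buffer = None            # forget the text of the finished article

def titles_with_label(lines, label_value):
    """Return titles of articles containing a <label> whose value equals label_value."""
    found = []
    for dom in iter_article_doms(lines):
        labels = dom.getElementsByTagName("label")
        if any(label.getAttribute("value") == label_value for label in labels):
            title = dom.getElementsByTagName("title")[0]
            found.append(title.firstChild.data)
        dom.unlink()                     # discard this DOM before building the next one
    return found

# Made-up sample data; the label type "x" is a placeholder.
sample = io.StringIO(
    "<glawi>\n"
    "<article>\n<title>jadis</title>\n"
    '<text><label type="x" value="vieilli"/></text>\n</article>\n'
    "<article>\n<title>chat</title>\n<text/>\n</article>\n"
    "</glawi>\n"
)
print(titles_with_label(sample, "vieilli"))
```

Building a DOM per article costs more time than plain line matching, but it makes structural queries (attributes, nesting) much easier to write.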
Note: the first shebang line (e.g. #!/usr/bin/perl) has been removed from each script for compatibility reasons.
You may want to add it back so that you can call the program directly instead of going through the perl command.
In that case, make sure the execution permissions are set correctly.
splitGLAWI.pl: splits the large GLAWI file into several smaller files.
Command line: perl splitGLAWI.pl GLAWI.xml size(Mo) dstDir filePrefix
Example: perl splitGLAWI.pl GLAWI.xml 100 /tmp/SPLITS/ glawiSplit
→ produces files of 100 MB (Mo) each, located in the directory /tmp/SPLITS/ and named filePrefix-1.xml, filePrefix-2.xml ... filePrefix-N.xml (with filePrefix replaced by glawiSplit in this example)
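How the script chooses its cut points is not described on this page. The Python sketch below shows one plausible way to do a size-based split, assuming that a part is closed only after a complete `</article>` line so that no article spans two files, and that parts are not re-wrapped in a root element. The function name `split_by_size` and the byte-based limit are hypothetical.

```python
import os
import tempfile

def split_by_size(lines, max_bytes, dst_dir, file_prefix):
    """Write lines to dst_dir/file_prefix-N.xml, starting a new part once max_bytes is reached."""
    os.makedirs(dst_dir, exist_ok=True)
    paths = []
    out = None
    written = 0
    for line in lines:
        if out is None:
            path = os.path.join(dst_dir, f"{file_prefix}-{len(paths) + 1}.xml")
            out = open(path, "w", encoding="utf-8")
            paths.append(path)
            written = 0
        out.write(line)
        written += len(line.encode("utf-8"))
        # Assumption: cut only after a complete article.
        if written >= max_bytes and "</article>" in line:
            out.close()
            out = None
    if out is not None:
        out.close()
    return paths

# Made-up demo: three tiny articles, split with a 40-byte limit.
demo_lines = []
for title in ("a", "b", "c"):
    demo_lines += ["<article>\n", f"<title>{title}</title>\n", "</article>\n"]
with tempfile.TemporaryDirectory() as tmp:
    for path in split_by_size(demo_lines, 40, tmp, "glawiSplit"):
        print(os.path.basename(path), os.path.getsize(path))
```

With this strategy a part can exceed the limit by up to one article, which is usually an acceptable trade-off for keeping every article intact.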
extractArticle.pl: extracts the single article whose title exactly matches a given title.
Command line: perl extractArticle.pl GLAWI.xml title [outFile]
Example: perl extractArticle.pl GLAWI.xml dictionnaire dict.xml
→ extracts the article "dictionnaire"
extractArticlesWithLabelValue.pl: extracts articles having a definition that includes a label whose value matches the specified one (whatever the label's type).
The label value is matched against the given regexp, case-insensitively.
Command line: perl extractArticlesWithLabelValue.pl GLAWI.xml labelValueRegexp [outFile]
Examples:
perl extractArticlesWithLabelValue.pl GLAWI.xml "^vieilli\$" dated.xml
→ extracts articles with at least one gloss including a vieilli (dated) label value.
perl extractArticlesWithLabelValue.pl GLAWI.xml "^chimie\$" chemistry.xml
→ extracts articles with at least one gloss related to the chimie (chemistry) domain.
extractTitlesWithLabelValue.pl: extracts the titles of articles having a definition that includes a label whose value matches the specified one (whatever the label's type).
The label value is matched against the given regexp, case-insensitively.
Command line: perl extractTitlesWithLabelValue.pl GLAWI.xml labelValueRegexp [outFile]
Example: perl extractTitlesWithLabelValue.pl GLAWI.xml "^vieilli\$" dated.xml
→ extracts the titles of articles with at least one gloss including a vieilli (dated) label value.
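In the examples above, the `\$` inside double quotes makes the shell pass a literal `$` to the script, so the regexp received is `^vieilli$`. The effect of the anchors and of case-insensitive matching can be checked with a short Python sketch (Python's `re` syntax coincides with Perl's for patterns this simple). The helper name `label_matches` and the value "un peu vieilli" are made up for illustration.

```python
import re

def label_matches(label_value, pattern):
    """True if the regexp `pattern` matches `label_value`, ignoring case."""
    return re.search(pattern, label_value, re.IGNORECASE) is not None

# Compare an anchored pattern with an unanchored one on a few values.
for value in ("vieilli", "Vieilli", "un peu vieilli"):
    print(value, label_matches(value, "^vieilli$"), label_matches(value, "vieilli"))
```

With the `^...$` anchors, only labels whose whole value is vieilli (in any letter case) are selected; without them, any label value containing that string would also match.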
extractArticlesMatchingCriteria.pl: extracts articles (or article titles) matching a set of criteria.
Command line: perl extractArticlesMatchingCriteria.pl [OPTIONS] GLAWI.xml [outFile]
-e : outputs only entries' titles instead of the whole articles
-t regexp : selects only articles whose titles match the specified regexp (case-insensitive)
-p POS : selects only articles having a given syntactic category (equal to POS)
-c labelCategory : selects only articles having a gloss definition including a label whose category equals labelCategory (whatever the label's value)
-v labelValue : selects only articles having a gloss definition including a label whose value equals labelValue (whatever the label's category)
-f labelValuesFile : selects only articles having a gloss definition with a label whose value is included in the specified file (UTF-8 text file, one value per line, regexp allowed). Example file.
When both the -c and -v options are used, they apply to the same label.
When -c and/or -v are used together with the -p option, the targeted label must be found in a POS section of the specified type.
The -v and -f options are mutually exclusive.
Examples:
perl extractArticlesMatchingCriteria.pl -v 'rare' -p verbe GLAWI.xml
→ extracts the articles for verbs that have a label with the value rare in one of their glosses.
perl extractArticlesMatchingCriteria.pl -e -p verbe -f computerScienceLabelValues.txt GLAWI.xml csVerbs.txt
→ extracts titles of verb entries having a gloss definition with a label whose value is included in the computerScienceLabelValues.txt file.
The output is written to the file csVerbs.txt.
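The way the -p, -c and -v criteria combine (same label for -c and -v, label located inside a POS section of the requested type) can be sketched in Python as below. This is an illustration, not the script's code: it assumes POS sections are `<pos type="...">` elements and labels are `<label type="..." value="...">` elements, with the label category stored in `type`. The function name, the sample article and its label types are hypothetical, and the -e, -t and -f options are left out.

```python
from xml.dom.minidom import parseString

def article_matches(article_xml, pos=None, category=None, value=None):
    """Check one article against optional POS, label category and label value criteria."""
    dom = parseString(article_xml)
    for section in dom.getElementsByTagName("pos"):
        if pos is not None and section.getAttribute("type") != pos:
            continue                     # wrong POS section, look at the next one
        if category is None and value is None:
            return True                  # only the POS criterion was requested
        for label in section.getElementsByTagName("label"):
            # Both tests are applied to the SAME label element.
            if category is not None and label.getAttribute("type") != category:
                continue
            if value is not None and label.getAttribute("value") != value:
                continue
            return True
    return False

# Made-up article with one verb section and one noun section.
SAMPLE = """<article>
<title>exemple</title>
<text>
<pos type="verbe"><definition><gloss><label type="freq" value="rare"/></gloss></definition></pos>
<pos type="nom"><definition><gloss><label type="domain" value="chimie"/></gloss></definition></pos>
</text>
</article>"""

print(article_matches(SAMPLE, pos="verbe", value="rare"))
print(article_matches(SAMPLE, category="domain", value="rare"))
```

In the second call the category and the value each exist in the article, but on different labels, so the article is not selected under the same-label rule.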