REDAC
REssources Développées À CLLE-ERSS Laboratoire CLLE-ERSS







TALISMANE
Tool for the Analysis of Language,
Inferring Statistical Models from the Annotation of Numerous Examples
Description

Talismane is a syntactic analyser developed by Assaf Urieli as part of his thesis in the CLLE-ERSS laboratory, under the direction of Ludovic Tanguy. It is written entirely in Java, runs on all operating systems, and can easily be integrated with other applications.

Talismane consists of four main modules which transform a raw unannotated text into a series of syntactic dependency trees: sentence boundary detection, tokenising, pos-tagging (assigning a part-of-speech to each word) and parsing (generating and labelling syntactic dependencies between words).

Each module's task is defined as a classification problem and solved statistically, by training a probabilistic model on an annotated corpus.

Each module may be configured both in terms of features and in terms of rules. Features describe the information available to the algorithm when it makes a decision in a given context, while rules are constraints which force (or prohibit) certain local decisions.

The default French model proposed by Talismane uses standard features for each of these operations. For pos-tagging, for example, we calculate for each word features related to its lexical form, to the parts-of-speech associated with it in a reference lexicon, to the categories of the words surrounding it, etc. The feature definition syntax is expressive enough to define more complex features, such as the fact that the preceding word is surrounded by parentheses.
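
The kind of information such features capture can be sketched in Python. This is only an illustration and does not use Talismane's own feature definition syntax: the toy lexicon, the function name `pos_features` and the feature names are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: word form -> parts-of-speech listed for it.
LEXICON: Dict[str, Set[str]] = {
    "les": {"DET", "CLO"},
    "poules": {"NC"},
    "couvent": {"NC", "V"},
}


def pos_features(words: List[str], index: int, previous_tags: List[str]) -> Dict[str, str]:
    """Describe the word at `index` with a few simple contextual features."""
    word = words[index].lower()
    lexicon_tags = LEXICON.get(word, set())
    return {
        # Features related to the lexical form.
        "word_form": word,
        "suffix3": word[-3:],
        # Parts-of-speech associated with the word in the reference lexicon.
        "lexicon_tags": "|".join(sorted(lexicon_tags)) or "UNKNOWN",
        # Category already assigned to the preceding word.
        "previous_tag": previous_tags[index - 1] if index > 0 else "START",
        # Form of the following word.
        "next_word": words[index + 1].lower() if index + 1 < len(words) else "END",
    }


words = ["Les", "poules", "du", "couvent", "couvent", "."]
# Features for the first "couvent", given the tags assigned to the words before it.
features = pos_features(words, 3, ["DET", "NC", "P+D"])
print(features)
```

A classifier trained on an annotated corpus would then weigh such feature values to choose between the candidate tags, here NC and V for "couvent".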

The rules, which are only applied when analysing (not when training), make it possible to replace or constrain the responses provided by the probabilistic classifier when a given criterion is met. These rules follow a flexible syntax similar to that of features. They make it possible to avoid aberrant results (such as assigning a closed-class part-of-speech to a word unknown in the lexicon, or assigning two subjects to a verb), and to respect the constraints of a specific corpus (by assigning a fixed part-of-speech to a given word, for example).
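
As a rough illustration of how rules can override or constrain a classifier's answer at analysis time, here is a minimal Python sketch. It is not Talismane's rule syntax: the function `apply_rules`, the closed-class list and the sample probabilities are assumptions made for the example.

```python
from typing import Dict, Set

# Hypothetical set of closed-class parts-of-speech.
CLOSED_CLASSES: Set[str] = {"DET", "P", "CC", "CS", "PRO"}


def apply_rules(
    word: str,
    in_lexicon: bool,
    probabilities: Dict[str, float],
    fixed_tags: Dict[str, str],
) -> Dict[str, float]:
    """Constrain the classifier's tag distribution for one word."""
    # Forcing rule: a corpus-specific fixed tag replaces the classifier's answer.
    if word.lower() in fixed_tags:
        return {fixed_tags[word.lower()]: 1.0}
    # Prohibiting rule: a word unknown in the lexicon cannot be closed-class.
    if not in_lexicon:
        allowed = {t: p for t, p in probabilities.items() if t not in CLOSED_CLASSES}
        if allowed:
            total = sum(allowed.values())
            # Renormalise so that the remaining probabilities sum to 1.
            return {t: p / total for t, p in allowed.items()}
    # No rule applies: keep the classifier's distribution unchanged.
    return dict(probabilities)


classifier_output = {"DET": 0.6, "NC": 0.3, "ADJ": 0.1}
constrained = apply_rules("blorg", False, classifier_output, {})
forced = apply_rules("Talismane", False, classifier_output, {"talismane": "NPP"})
print(constrained, forced)
```

In this sketch the unknown word can no longer be tagged DET, and the remaining candidates are renormalised, while the fixed tag bypasses the classifier entirely.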

For parsing, Talismane is based on the algorithm described by Urieli and Tanguy (2013), with certain modifications to facilitate the beam search functionality.
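
The general idea of a beam search can be sketched as follows. This is a generic Python illustration, not Talismane's parser: the decisions "A" and "B" stand in for parser transitions, and `toy_scorer` is a hypothetical stand-in for the probabilistic classifier.

```python
import math
from typing import Callable, Dict, List, Tuple

# A hypothesis is a (log-probability, decisions taken so far) pair.
Hypothesis = Tuple[float, List[str]]


def beam_search(
    steps: int,
    score: Callable[[List[str]], Dict[str, float]],
    beam_width: int,
) -> List[Hypothesis]:
    """Keep the `beam_width` most probable decision sequences at each step."""
    beam: List[Hypothesis] = [(0.0, [])]
    for _ in range(steps):
        candidates: List[Hypothesis] = []
        for log_prob, history in beam:
            # Extend each surviving hypothesis with every possible decision.
            for decision, prob in score(history).items():
                candidates.append((log_prob + math.log(prob), history + [decision]))
        # Rank by cumulative log-probability and prune to the beam width.
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:beam_width]
    return beam


def toy_scorer(history: List[str]) -> Dict[str, float]:
    """Hypothetical classifier: decision probabilities given the history."""
    if not history:
        return {"A": 0.6, "B": 0.4}
    if history[-1] == "A":
        return {"A": 0.5, "B": 0.5}
    return {"A": 0.9, "B": 0.1}


greedy = beam_search(2, toy_scorer, beam_width=1)
wider = beam_search(2, toy_scorer, beam_width=2)
print(greedy[0][1], wider[0][1])
```

With a beam width of 1 the search commits to the locally best first decision, whereas a width of 2 keeps the second-best option alive long enough for it to lead to a more probable overall sequence.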

Links

Person responsible for resource
Assaf Urieli
Contact :

License
Talismane is distributed under the Affero GPL v3 license.

Quick start

You must install a recent version of Java (> 1.8).
You also need three files:

  • talismane-distribution-X.X.X-bin.zip
  • talismane-fr-X.X.X.conf
  • frenchLanguagePackvX.X.X.zip
Unzip the file talismane-distribution-X.X.X-bin.zip, but not frenchLanguagePackvX.X.X.zip. Then copy the other two files into the folder where talismane-distribution-X.X.X-bin.zip was unzipped.

To analyse raw text in French using the default configuration, enter the following command in a terminal:

java -Xmx1G -Dconfig.file=talismane-fr-X.X.X.conf -jar talismane-core-X.X.X.jar encoding=UTF8 inFile=data/frTest.txt outFile=data/frTest.tal

The default encoding (latin1 or UTF-8) is the same as that of your work environment. To change it, consult the documentation.

Note that the system takes about 20 seconds to load the lexicons into memory before beginning analysis.

For example, if the input file example.txt contains the sentence "Les poules du couvent couvent.", the output file example.tal should look like:

1  Les      le       DET    det    p     2  det    _  _
2  poules   poule    NC     nc     fp    5  suj    _  _
3  du       de       P+D    P+D    ms    2  dep    _  _
4  couvent  couvent  NC     nc     ms    3  obj    _  _
5  couvent  couver   V      v      PS3p  0  root   _  _
6  .        .        PONCT  PONCT  null  5  punct  _  _

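
Output of this kind can be post-processed with a few lines of code. The Python sketch below assumes that each output line holds ten tab-separated columns in CoNLL-X order (index, form, lemma, coarse part-of-speech, part-of-speech, morphology, governor index, dependency label, and two unused columns). That layout is an assumption, as are the names `Token`, `read_sentence` and `arcs`; check the documentation for the format actually produced by your configuration.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    index: int
    form: str
    lemma: str
    pos: str
    head: int
    label: str


def read_sentence(lines: List[str]) -> List[Token]:
    """Parse the lines of one sentence, assuming tab-separated columns."""
    tokens: List[Token] = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # skip blank or malformed lines
        tokens.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], int(cols[6]), cols[7])
        )
    return tokens


def arcs(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (governor form, label, dependent form) for each token."""
    forms = {t.index: t.form for t in tokens}
    forms[0] = "ROOT"  # index 0 denotes the artificial root
    return [(forms[t.head], t.label, t.form) for t in tokens]


sample = [
    "1\tLes\tle\tDET\tdet\tp\t2\tdet\t_\t_",
    "2\tpoules\tpoule\tNC\tnc\tfp\t5\tsuj\t_\t_",
    "3\tdu\tde\tP+D\tP+D\tms\t2\tdep\t_\t_",
    "4\tcouvent\tcouvent\tNC\tnc\tms\t3\tobj\t_\t_",
    "5\tcouvent\tcouver\tV\tv\tPS3p\t0\troot\t_\t_",
    "6\t.\t.\tPONCT\tPONCT\tnull\t5\tpunct\t_\t_",
]
tokens = read_sentence(sample)
for governor, label, dependent in arcs(tokens):
    print(f"{governor} --{label}--> {dependent}")
```

Reading the arcs this way makes the analysis easy to inspect: the second "couvent" (lemma "couver") is the root verb, and "poules" is attached to it as its subject.
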
References
  • Urieli, Assaf (2013). Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit. PhD thesis. Université de Toulouse II-Le Mirail.
  • Urieli, Assaf and Tanguy, Ludovic (2013). L'apport du faisceau dans l'analyse syntaxique en dépendances par transitions : études de cas avec l'analyseur Talismane. Actes de la conférence Traitement Automatique des Langues Naturelles (TALN 2013). Les Sables d'Olonne, France.