Tool for the Analysis of Language,
Inferring Statistical Models from the Annotation of Numerous Examples
Talismane is a syntactic analyser developed by Assaf Urieli as part of his thesis at the CLLE-ERSS laboratory, under the direction of Ludovic Tanguy. It is written entirely in Java, runs on all operating systems, and can easily be integrated with other applications.
Talismane consists of four main modules which transform raw unannotated text into a series of syntactic dependency trees: sentence boundary detection, tokenising, pos-tagging (assigning a part-of-speech to each word) and parsing (generating and labelling syntactic dependencies between words).
Each module's task is defined as a classification problem, and resolved statistically, by training a probabilistic model on an annotated corpus.
Each module may be configured both in terms of features and in terms of rules. Features describe the information available to the algorithm when it makes a decision in a given context, while rules are constraints which force (or prohibit) certain local decisions.
The default French model proposed by Talismane uses standard features for each of these operations. For pos-tagging, for example, we calculate for each word features related to its lexical form, to the parts-of-speech associated with it in a reference lexicon, to the categories of the words surrounding it, etc. The feature definition syntax is expressive enough to define more complex features, such as the fact that the preceding word is surrounded by parentheses.
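To make the kinds of pos-tagging features listed above more concrete, here is a minimal Python sketch. It does not use Talismane's own feature definition syntax, which is documented separately. The function name `pos_tag_features`, the toy lexicon, the tag labels and the feature names are all hypothetical and chosen only for illustration. The sketch also assumes left-to-right tagging, so only the categories of preceding words are available.

```python
# Illustrative only: this is NOT Talismane's feature definition syntax.
# TOY_LEXICON, the tag labels and all feature names are hypothetical.
TOY_LEXICON = {
    "les": {"DET", "CLO"},
    "poules": {"NC"},
    "du": {"P+D"},
    "couvent": {"NC", "V"},
}


def pos_tag_features(tokens, index, previous_tags):
    """Return a feature dict for tokens[index].

    previous_tags holds the tags already assigned to tokens[:index].
    """
    word = tokens[index]
    lower = word.lower()
    features = {
        # Features related to the lexical form.
        "word": lower,
        "suffix3": lower[-3:],
        "capitalised": word[:1].isupper(),
        # Parts-of-speech associated with the word in a reference lexicon.
        "lexicon_tags": "|".join(sorted(TOY_LEXICON.get(lower, {"UNKNOWN"}))),
        # Category of the preceding word (assumes left-to-right tagging).
        "prev_tag": previous_tags[index - 1] if index > 0 else "<START>",
    }
    # A more complex feature: is the preceding word surrounded by parentheses?
    features["prev_in_parens"] = (
        index >= 3 and tokens[index - 3] == "(" and tokens[index - 1] == ")"
    )
    return features


tokens = ["Les", "poules", "du", "couvent", "couvent", "."]
feats = pos_tag_features(tokens, 4, ["DET", "NC", "P+D", "NC"])
```

Here the second "couvent" is ambiguous in the toy lexicon (noun or verb), so a classifier would have to rely on contextual features such as `prev_tag` to choose between the two readings.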
The rules, which are only applied when analysing (not when training), make it possible to replace or constrain the responses provided by the probabilistic classifier when a given criterion is met. These rules follow a flexible syntax similar to that of features. They make it possible to avoid aberrant results (such as assigning a closed-class part-of-speech to a word unknown in the lexicon, or assigning two subjects to a verb), and to respect the constraints of a specific corpus (by assigning a fixed part-of-speech to a given word, for example).
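The effect of such rules on the classifier's output can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Talismane's rule syntax or implementation: the function `apply_rules`, the `CLOSED_CLASSES` tag set and the data structures are hypothetical. It shows one rule that forces a decision (a fixed part-of-speech for a given word) and one that prohibits a decision (a closed-class tag for an unknown word).

```python
# Illustrative only: hypothetical names, not Talismane's rule syntax.
CLOSED_CLASSES = {"DET", "P", "CC", "CS", "PRO"}  # assumed closed-class tags


def apply_rules(word, distribution, lexicon, forced_tags):
    """Constrain a classifier's tag distribution for one word.

    distribution: dict mapping tag -> probability from the classifier.
    lexicon: set of known word forms.
    forced_tags: dict mapping word form -> tag imposed for this corpus.
    """
    lower = word.lower()
    # Forcing rule: a fixed part-of-speech replaces the classifier's answer.
    if lower in forced_tags:
        return {forced_tags[lower]: 1.0}
    # Prohibiting rule: an unknown word may not receive a closed-class tag.
    if lower not in lexicon:
        allowed = {t: p for t, p in distribution.items() if t not in CLOSED_CLASSES}
        total = sum(allowed.values())
        if total > 0:
            # Renormalise so the remaining probabilities sum to 1.
            return {t: p / total for t, p in allowed.items()}
    # No rule applies: keep the classifier's distribution unchanged.
    return dict(distribution)


classifier_output = {"DET": 0.5, "NC": 0.25, "V": 0.25}
constrained = apply_rules("blorp", classifier_output, {"les", "couvent"}, {})
```

In this toy run the unknown word "blorp" loses the closed-class option `DET`, and the remaining probability mass is shared between `NC` and `V`.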
For parsing, Talismane is based on the algorithm described by Urieli and Tanguy (2013), with certain modifications to support beam search.
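For readers unfamiliar with beam search, the general idea can be sketched in a few lines of Python. This is a generic beam search over a sequence of classification decisions, not the parsing algorithm referenced above and not Talismane's code. The function `beam_search`, the callback `propose` and the toy probabilities are hypothetical. Instead of keeping only the single most probable decision at each step, the search keeps the `beam_width` most probable partial analyses.

```python
import heapq
import math


def beam_search(steps, propose, beam_width):
    """Generic beam search over a fixed number of decisions.

    propose(history) returns a list of (decision, probability) pairs
    given the tuple of decisions taken so far.
    Returns the kept hypotheses as (log_probability, history), best first.
    """
    beam = [(0.0, ())]  # one empty hypothesis with log(1) = 0
    for _ in range(steps):
        candidates = []
        for log_prob, history in beam:
            for decision, prob in propose(history):
                if prob > 0:
                    candidates.append(
                        (log_prob + math.log(prob), history + (decision,))
                    )
        # Keep only the beam_width most probable partial analyses.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam


def toy_propose(history):
    """Hypothetical classifier: the locally best first choice is a trap."""
    if history == ():
        return [("A", 0.6), ("B", 0.4)]
    if history == ("A",):
        return [("x", 0.55), ("y", 0.45)]
    return [("x", 0.9), ("y", 0.1)]


greedy = beam_search(2, toy_propose, beam_width=1)
wider = beam_search(2, toy_propose, beam_width=2)
```

With a beam width of 1 the search is greedy and commits to "A", ending with the sequence A, x (probability 0.33). With a beam width of 2 the hypothesis starting with "B" survives the first step and yields the more probable sequence B, x (probability 0.36).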
Person responsible for resource: Assaf Urieli
License: Talismane is distributed under the Affero GPL v3 license.
You must install a recent version of Java (> 1.8) and unzip talismane-XXX.zip.
To analyse raw text in French using the default configuration, enter the following command in a terminal:
java -Xmx1G -Dconfig.file=talismane-fr-X.X.X.conf -jar talismane-core-X.X.X.jar encoding=UTF8 inFile=data/frTest.txt outFile=data/frTest.tal
The default encoding (latin1 or UTF-8) is the same as that of your work environment. To change it, consult the documentation.
Note that the system takes about 20 seconds to load the lexicons into memory before beginning analysis.
For example, if the file example.txt contained the sentence "Les poules du couvent couvent.", the output in example.tal should look like: