TALN Corpus

articles extracted from the proceedings of the TALN and RECITAL conferences between 1997 and 2019

Version note

The present page refers to the latest version of the TALN corpus. The previous version, compiled and used for the Semdis 2014 workshop, can still be accessed here.

Description

The TALN corpus is based on the proceedings of the TALN and RECITAL conferences from 1997 to 2019.

It contains 1602 research articles dealing with Natural Language Processing. All articles are written in French and add up to a total of 5.8 million words.

The articles are in TEI format, encoding the following elements:

metadata: title, author names, publication year, location of the conference, abstract (in French and English). Each article has a unique ID that indicates the conference (TALN/RECITAL) and the article type (long, short, poster, keynote, system demonstration, etc.).
body: section and subsection headers (number, text and category), figure and table captions, footnotes. Paragraphs are marked up, but they correspond only roughly to text blocks separated by a structural element or a page break. The bibliography is marked up as a whole, but individual reference items are not.
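
As an illustration of how such TEI files might be read, here is a minimal Python sketch using only the standard library. The sample document, the element paths (titleStmt, publicationStmt, div/head/p) and the function name summarize are assumptions based on common TEI P5 conventions, not on the actual corpus files; the documentation available for download gives the real layout.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample, NOT taken from the corpus: real files may nest
# elements differently and carry additional attributes.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Un titre d'exemple</title>
        <author>A. Auteur</author>
        <author>B. Auteure</author>
      </titleStmt>
      <publicationStmt><date>2005</date></publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>Premier paragraphe.</p><p>Second paragraphe.</p></div>
      <div><head>Conclusion</head><p>Dernier paragraphe.</p></div>
    </body>
  </text>
</TEI>"""


def summarize(tei_xml):
    """Return title, authors, year and paragraphs grouped by section header."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    authors = [a.text for a in root.iterfind(".//tei:titleStmt/tei:author", TEI_NS)]
    year = root.findtext(".//tei:publicationStmt/tei:date", namespaces=TEI_NS)

    sections = {}
    for div in root.iterfind(".//tei:body/tei:div", TEI_NS):
        head = div.findtext("tei:head", default="", namespaces=TEI_NS)
        # itertext() also collects text nested inside inline child elements.
        sections[head] = ["".join(p.itertext()) for p in div.iterfind("tei:p", TEI_NS)]

    return {"title": title, "authors": authors, "year": year, "sections": sections}


info = summarize(SAMPLE)
print(info["title"], info["authors"], info["year"])
for head, paragraphs in info["sections"].items():
    print(head, len(paragraphs))
```

Grouping paragraphs by section header in this way is one possible starting point for comparing text drawn from different parts of an article.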

The original PDF files were converted to TXT format and then to XML using open-source tools for text extraction (pdfminer) and structure mark-up (ParsCit).

The next stage consisted of cleaning up the output and marking up additional elements. Lastly, semi-automatic checking and manual proofreading were performed to detect wrongly tagged or missing elements, before encoding the result using the TEI-P5 scheme.

This corpus was compiled and built in the context of the ANR-funded ADDICTE project. Its primary objective was to experiment with distributional analysis techniques on a specialized-language corpus. More precisely, we studied the impact of text structure on word embeddings.

Contact

Ludovic Tanguy :

The proceedings of the TALN and RECITAL conferences are the property of the French Association for Natural Language Processing (Association pour le Traitement Automatique des Langues, ATALA).
Please refer to the licence (in French only).

Download

Corpus
Documentation (in French)

References

L. Tanguy, C. Fabre and Y. Bard. (2020). Impact de la structure logique des documents sur les modèles distributionnels : expérimentations sur le corpus TALN. Actes TALN 2020. Nancy, France.