articles extracted from the proceedings of the TALN and RECITAL conferences between 1997 and 2019
Version note
This page describes the latest version of the TALN corpus. The previous version, compiled and used for the Semdis 2014 workshop, can still be accessed here.
Description
The TALN corpus is based on the proceedings of the TALN and RECITAL conferences from 1997 to 2019. It contains 1602 research articles on Natural Language Processing. All articles are written in French and total 5.8 million words. The articles are in TEI format, encoding the following elements:
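As an illustration of working with TEI-encoded articles, here is a minimal sketch that parses a TEI-P5 document with Python's standard library and counts the words in its body. The sample document, the function name `summarize_article`, and the elements it reads (`title`, `head`, `p`) are assumptions made for the example and may not match the corpus's actual element inventory; only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard XML namespace used by TEI-P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature article; real corpus files may use other elements.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Un exemple d'article</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Introduction</head>
        <p>Le traitement automatique des langues est un domaine vaste.</p>
        <p>Ce corpus contient des articles en français.</p>
      </div>
    </body>
  </text>
</TEI>"""


def summarize_article(xml_string):
    """Return the title, section heads, and body word count of a TEI document."""
    root = ET.fromstring(xml_string)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    heads = [h.text or "" for h in root.iterfind(".//tei:body//tei:head", TEI_NS)]
    paragraphs = root.iterfind(".//tei:body//tei:p", TEI_NS)
    # Naive whitespace tokenization; itertext() also gathers text nested
    # inside inline child elements of each paragraph.
    word_count = sum(len("".join(p.itertext()).split()) for p in paragraphs)
    return {"title": title, "heads": heads, "words": word_count}


summary = summarize_article(SAMPLE_TEI)
print(summary)
```

To process a file from the downloaded corpus, the same function could be applied to the file's contents; whitespace splitting is only a rough word count and will not necessarily reproduce the figures reported for the corpus.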
Original PDF files were converted to TXT format and then to XML using open-source tools for text extraction (pdfminer) and structure mark-up (ParsCit). The next stage consisted of cleaning up the output and marking up additional elements. Lastly, semi-automatic checking and manual proofreading were performed to find wrongly tagged or missing elements, before encoding using the TEI-P5 scheme.
This corpus was compiled in the context of the ANR-funded ADDICTE project. Its primary objective was to experiment with distributional analysis techniques on a specialized-language corpus. More precisely, we studied the impact of text structure on word embeddings.
Contact
Ludovic Tanguy :
Copyright and Licence
The proceedings of the TALN and RECITAL conferences are the property of the French Association for Natural Language Processing (Association pour le Traitement Automatique des LAngues, ATALA). Please refer to the licence (in French only).
Download
References