Publications

(last update: Apr. 2025)

Information Extraction

Adam Remaki, Jacques Ung, Pierre Pages, Perceval Wajsbürt, Elise Liu, Guillaume Faure, Thomas Petit-Jean, Xavier Tannier, Christel Gérardin.
Improving Phenotyping of Patients With Immune-Mediated Inflammatory Diseases Through Automated Processing of Discharge Summaries: Multicenter Cohort Study.
JMIR Medical Informatics. 13, April 2025. doi: 10.2196/68704
[abstract] [BibTeX] [JMIR link]
Background: Valuable insights gathered by clinicians during their inquiries and documented in textual reports are often unavailable in the structured data recorded in electronic health records (EHRs).
Objective: This study aimed to highlight that mining unstructured textual data with natural language processing techniques complements the available structured data and enables more comprehensive patient phenotyping. A proof-of-concept for patients diagnosed with specific autoimmune diseases is presented, in which the extraction of information on laboratory tests and drug treatments is performed.
Methods: We collected EHRs available in the clinical data warehouse of the Greater Paris University Hospitals from 2012 to 2021 for patients hospitalized and diagnosed with 1 of 4 immune-mediated inflammatory diseases: systemic lupus erythematosus, systemic sclerosis, antiphospholipid syndrome, and Takayasu arteritis. Then, we built, trained, and validated natural language processing algorithms on 103 discharge summaries selected from the cohort and annotated by a clinician. Finally, all discharge summaries in the cohort were processed with the algorithms, and the extracted data on laboratory tests and drug treatments were compared with the structured data.
Results: Named entity recognition followed by normalization yielded F1-scores of 71.1 (95% CI 63.6-77.8) for the laboratory tests and 89.3 (95% CI 85.9-91.6) for the drugs. Application of the algorithms to 18,604 EHRs increased the detection of antibody results and drug treatments. For instance, among patients in the systemic lupus erythematosus cohort with positive antinuclear antibodies, the rate increased from 18.34% (752/4102) to 71.87% (2949/4102), making the results more consistent with the literature.
Conclusions: While challenges remain in standardizing laboratory tests, particularly with abbreviations, this work, based on secondary use of clinical data, demonstrates that automated processing of discharge summaries enriched the information available in structured data and facilitated more comprehensive patient profiling.
@Article{Remaki2025, 
  title = {{Improving Phenotyping of Patients With Immune-Mediated Inflammatory Diseases Through Automated Processing of Discharge Summaries: Multicenter Cohort Study}},
  author = {Remaki, Adam and Ung, Jacques and Pages, Pierre and Wajsbürt, Perceval and Liu, Elise and Faure, Guillaume and Petit-Jean, Thomas and Tannier, Xavier and Gérardin, Christel},
  year = {2025}, 
  month = apr, 
  journal = {JMIR Medical Informatics}, 
  volume = {13}, 
  doi = {10.2196/68704}
}
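
As a reading aid for the "named entity recognition followed by normalization" step mentioned in the abstract above, here is a minimal Python sketch of terminology-based normalization. The tiny dictionary, the fuzzy-matching rule and the concept identifiers are illustrative assumptions, not the authors' implementation.

import difflib

TERMINOLOGY = {                      # normalized mention -> concept identifier
    "antinuclear antibodies": "ANA",
    "anti-dna antibodies": "anti-DNA",
}

def normalize(mention, threshold=0.8):
    """Map an extracted mention to its closest terminology entry, if any."""
    matches = difflib.get_close_matches(mention.lower(), TERMINOLOGY,
                                        n=1, cutoff=threshold)
    return TERMINOLOGY[matches[0]] if matches else None

print(normalize("Antinuclear antibody"))  # -> 'ANA' (fuzzy match)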
Chi-en Amy Tai, Xavier Tannier.
Clinical trial cohort selection using Large Language Models on n2c2 Challenges.
January 2025.
[abstract] [BibTeX] [arXiv]
Clinical trials are a critical process in the medical field for introducing new treatments and innovations. However, cohort selection for clinical trials is a time-consuming process that often requires manual review of patient text records for specific keywords. Though there have been studies on standardizing the information across the various platforms, Natural Language Processing (NLP) tools remain crucial for spotting eligibility criteria in textual reports. Recently, pre-trained large language models (LLMs) have gained popularity for various NLP tasks due to their ability to acquire a nuanced understanding of text. In this paper, we study the performance of large language models on clinical trial cohort selection and leverage the n2c2 challenges to benchmark their performance. Our results are promising with regard to the incorporation of LLMs for simple cohort selection tasks, but also highlight the difficulties encountered by these models as soon as fine-grained knowledge and reasoning are required.
@Misc{Tai2025, 
  title = {{Clinical trial cohort selection using Large Language Models on n2c2 Challenges}},
  author = {Tai, Chi-en Amy and Tannier, Xavier},
  year = {2025}, 
  month = jan, 
  note = {arXiv}
}
Marco Naguib, Xavier Tannier, Aurélie Névéol.
Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting.
in Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024. Miami, Florida, USA, November 2024. © Association for Computational Linguistics.
[abstract] [BibTeX] [ACL Anthology]
Large language models (LLMs) have become the preferred solution for many natural language processing tasks. In low-resource environments such as specialized domains, their few-shot capabilities are expected to deliver high performance. Named Entity Recognition (NER) is a critical task in information extraction that is not covered in recent LLM benchmarks. There is a need to better understand the performance of LLMs for NER in a variety of settings, including languages other than English. This study aims to evaluate generative LLMs, employed through prompt engineering, for few-shot clinical NER. We compare 13 auto-regressive models using prompting and 16 masked models using fine-tuning on 14 NER datasets covering English, French and Spanish. While prompt-based auto-regressive models achieve competitive F1 scores for general NER, they are outperformed within the clinical domain by lighter biLSTM-CRF taggers based on masked models. Additionally, masked models exhibit a lower environmental impact than auto-regressive models. Findings are consistent across the three languages studied, which suggests that LLM prompting is not yet suited for NER production in the clinical domain.
@InProceedings{Naguib2024b, 
  title = {{Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting}},
  author = {Naguib, Marco and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024}, 
  address = {Miami, Florida, USA}, 
  year = {2024}, 
  month = nov, 
  publisher = {Association for Computational Linguistics}
}
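
To make the "prompt-based auto-regressive" setting above concrete, here is a toy Python sketch that assembles a few-shot NER prompt for a generative model. The prompt format, labels and examples are invented for illustration; the actual prompts, label sets and models evaluated in the paper are described there.

FEW_SHOT_EXAMPLES = [
    ("Patient started on metformin for type 2 diabetes.",
     {"DRUG": ["metformin"], "DISORDER": ["type 2 diabetes"]}),
    ("MRI showed no evidence of stroke.",
     {"DISORDER": ["stroke"]}),
]

def build_prompt(sentence, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt asking a generative LLM to list entities."""
    parts = ["Extract DRUG and DISORDER entities from the sentence."]
    for text, entities in examples:
        tagged = "; ".join(f"{label}: {', '.join(spans)}"
                           for label, spans in entities.items())
        parts.append(f"Sentence: {text}\nEntities: {tagged}")
    parts.append(f"Sentence: {sentence}\nEntities:")
    return "\n\n".join(parts)

print(build_prompt("He was prescribed amoxicillin for pneumonia."))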
Jamil Zaghir, Marco Naguib, Mina Bjelogrlic, Aurélie Névéol, Xavier Tannier, Christian Lovis.
Prompt Engineering Paradigms for Medical Applications: Scoping Review.
Journal of Medical Internet Research. September 2024. doi: 10.2196/60501
[abstract] [BibTeX] [JMIR Link]
Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and technical language. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored.
Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice.
Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering–based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD).
Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, the PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each of the key prompt engineering–specific items of information reported across papers and find that many studies neglect to mention them explicitly, posing a challenge for advancing prompt engineering research.
Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing medical prompt engineering papers available and hope that future contributions will leverage these existing works to better advance the field.
@Article{Zaghir2024, 
  title = {{Prompt Engineering Paradigms for Medical Applications: Scoping Review}},
  author = {Zaghir, Jamil and Naguib, Marco and Bjelogrlic, Mina and Névéol, Aurélie and Tannier, Xavier and Lovis, Christian},
  year = {2024}, 
  month = sep, 
  journal = {Journal of Medical Internet Research}, 
  doi = {10.2196/60501}
}
Ariel Cohen, Alexandrine Lanson, Emmanuelle Kempf, Xavier Tannier.
Leveraging Information Redundancy of Real-World Data through Distant Supervision.
in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia, pages 10352–10364, May 2024. © ELRA and ICCL.
[abstract] [BibTeX] [free copy]
We explore the task of event extraction and classification by harnessing the power of distant supervision. We present a novel text labeling method that leverages the redundancy of temporal information in a data lake. This method enables the creation of a large programmatically annotated corpus, allowing the training of transformer models using distant supervision. This aims to reduce expert annotation time, a scarce and expensive resource. Our approach utilizes temporal redundancy between structured sources and text, enabling the design of a replicable framework applicable to diverse real-world databases and use cases. We employ this method to create multiple silver datasets to reconstruct key events in cancer patients’ pathways, using clinical notes from a cohort of 380,000 oncological patients. By employing various noise label management techniques, we validate our end-to-end approach and compare it with a baseline classifier built on expert-annotated data. The implications of our work extend to accelerating downstream applications, such as patient recruitment for clinical trials, treatment effectiveness studies, survival analysis, and epidemiology research. While our study showcases the potential of the method, there remain avenues for further exploration, including advanced noise management techniques, semi-supervised approaches, and a deeper understanding of biases in the generated datasets and models.
@InProceedings{Cohen2024, 
  title = {{Leveraging Information Redundancy of Real-World Data through Distant Supervision}},
  author = {Cohen, Ariel and Lanson, Alexandrine and Kempf, Emmanuelle and Tannier, Xavier},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, 
  address = {Torino, Italia}, 
  year = {2024}, 
  month = may, 
  publisher = {ELRA and ICCL}, 
  pages = {10352–10364}
}
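
A minimal, self-contained Python sketch of the distant-supervision idea described in the abstract above: dates recorded in the structured database are matched against date mentions in clinical notes to produce "silver" labeled examples. The data, date format and matching rule are simplified assumptions, not the authors' implementation.

import re
from datetime import date

structured_events = {   # (patient_id, date) -> event type, from the database
    ("p1", date(2021, 3, 14)): "surgery",
}

notes = [
    ("p1", "Tumor resection was performed on 14/03/2021 without complication."),
]

DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")  # dd/mm/yyyy mentions

silver = []
for patient_id, text in notes:
    for m in DATE_RE.finditer(text):
        d = date(int(m.group(3)), int(m.group(2)), int(m.group(1)))
        label = structured_events.get((patient_id, d))
        if label:  # temporal redundancy: the note mentions a known event date
            silver.append({"text": text, "span": m.span(), "label": label})

print(silver)  # programmatically annotated examples for model training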
Nesrine Bannour, Christophe Servan, Aurélie Névéol, Xavier Tannier.
A Benchmark Evaluation of Clinical Named Entity Recognition in French.
in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia, pages 14-21, May 2024. © ELRA and ICCL.
[abstract] [BibTeX] [free copy]
Background: Transformer-based language models have shown strong performance on many Natural Language Processing (NLP) tasks. Masked Language Models (MLMs) attract sustained interest because they can be adapted to different languages and sub-domains through training or fine-tuning on specific corpora while remaining lighter than modern Large Language Models (LLMs). Recently, several MLMs have been released for the biomedical domain in French, and experiments suggest that they outperform standard French counterparts. However, no systematic evaluation comparing all models on the same corpora is available. Objective: This paper presents an evaluation of masked language models for biomedical French on the task of clinical named entity recognition. Material and methods: We evaluate biomedical models CamemBERT-bio and DrBERT and compare them to standard French models CamemBERT, FlauBERT and FrAlBERT as well as multilingual mBERT using three publicly available corpora for clinical named entity recognition in French. The evaluation set-up relies on gold-standard corpora as released by the corpus developers. Results: Results suggest that CamemBERT-bio outperforms DrBERT consistently while FlauBERT offers competitive performance and FrAlBERT achieves the lowest carbon footprint. Conclusion: This is the first benchmark evaluation of biomedical masked language models for French clinical entity recognition that compares model performance consistently on nested entity recognition using metrics covering performance and environmental impact.
@InProceedings{Bannour2024, 
  title = {{A Benchmark Evaluation of Clinical Named Entity Recognition in French}},
  author = {Bannour, Nesrine and Servan, Christophe and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)}, 
  address = {Torino, Italia}, 
  year = {2024}, 
  month = may, 
  publisher = {ELRA and ICCL}, 
  pages = {14–21}
}
Christel Gérardin, Yuhan Xiong, Perceval Wajsbürt, Fabrice Carrat, Xavier Tannier.
Impact of translation on biomedical information extraction: an experiment on real-life clinical notes.
JMIR Medical Informatics. January 2024. doi: 10.2196/49607
[abstract] [BibTeX] [JMIR Link]
Background: Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical datasets remains a challenge.
Objective: The aim of our study is to determine whether the use of English tools to extract and normalize French medical concepts on translations provides comparable performance to that of French models trained on a set of annotated French clinical notes.
Methods: We compare two methods: one involving French-language models and one involving English-language models. For the native French method, the Named Entity Recognition (NER) and normalization steps are performed separately. For the translated English method, after the first translation step, we compare a two-step method and a terminology-oriented method that performs extraction and normalization at the same time. We used French, English and bilingual annotated datasets to evaluate all stages (NER, normalization and translation) of our algorithms.
Results: The native French method outperformed the translated English method, with an overall F1-score of 0.51 [0.47; 0.55], compared with 0.39 [0.34; 0.44] and 0.38 [0.36; 0.40] for the two English methods tested.
Conclusions: Despite recent improvements in translation models, there is a significant difference in performance between the two approaches in favor of the native French method, which is more effective on French medical texts, even with few annotated documents.
@Article{Gerardin2024, 
  title = {{Impact of translation on biomedical information extraction: an experiment on real-life clinical notes}},
  author = {Gérardin, Christel and Xiong, Yuhan and Wajsbürt, Perceval and Carrat, Fabrice and Tannier, Xavier},
  year = {2024}, 
  month = jan, 
  journal = {JMIR Medical Informatics}, 
  doi = {10.2196/49607}
}
Thomas Petit-Jean, Christel Gérardin, Emmanuelle Berthelot, Gilles Chatellier, Marie Franck, Xavier Tannier, Emmanuelle Kempf, Romain Bey.
Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions.
Journal of the American Medical Informatics Association. Vol. 31, Issue 6, April 2024. doi: 10.1093/jamia/ocae069
[abstract] [BibTeX] [JAMIA Link]
Objective: To develop and validate a natural language processing (NLP) pipeline that detects 18 conditions in French clinical notes, including 16 comorbidities of the Charlson index, while exploring a collaborative and privacy-enhancing workflow.
Materials and Methods: The detection pipeline relied on rule-based and machine learning algorithms for named entity recognition and entity qualification, respectively. We used a large language model pre-trained on millions of clinical notes along with annotated clinical notes in the context of 3 cohort studies related to oncology, cardiology, and rheumatology. The overall workflow was conceived to foster collaboration between studies while respecting the privacy constraints of the data warehouse. We estimated the added values of the advanced technologies and of the collaborative setting.
Results: The pipeline reached macro-averaged F1-score, positive predictive value, sensitivity, and specificity of 95.7 (95%CI 94.5-96.3), 95.4 (95%CI 94.0-96.3), 96.0 (95%CI 94.0-96.7), and 99.2 (95%CI 99.0-99.4), respectively. F1-scores were superior to those observed using alternative technologies or non-collaborative settings. The models were shared through a secured registry.
Conclusions: We demonstrated that a community of investigators working on a common clinical data warehouse could efficiently and securely collaborate to develop, validate and use sensitive artificial intelligence models. In particular, we provided an efficient and robust NLP pipeline that detects conditions mentioned in clinical notes.
@Article{PetitJean2024, 
  title = {{Collaborative and privacy-enhancing workflows on a clinical data warehouse: an example developing natural language processing pipelines to detect medical conditions}},
  author = {Petit-Jean, Thomas and Gérardin, Christel and Berthelot, Emmanuelle and Chatellier, Gilles and Franck, Marie and Tannier, Xavier and Kempf, Emmanuelle and Bey, Romain},
  number = {6}, 
  year = {2024}, 
  month = apr, 
  journal = {Journal of the American Medical Informatics Association}, 
  volume = {31}, 
  doi = {10.1093/jamia/ocae069}
}
Xavier Tannier, Perceval Wajsbürt, Alice Calliger, Basile Dura, Alexandre Mouchet, Martin Hilka, Romain Bey.
Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse.
Methods of Information in Medicine. Vol. 63, Issue 01/02, March 2024. doi: 10.1055/s-0044-1778693
[abstract] [BibTeX] [Thieme Link]
Objective: The objective of this study is to address the critical issue of deidentification of clinical reports to allow access to data for research purposes, while ensuring patient privacy. The study highlights the difficulties faced in sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP for Assistance Publique-Hôpitaux de Paris) in implementing a systematic pseudonymization of text documents from its Clinical Data Warehouse.
Methods: We annotated a corpus of clinical documents according to 12 types of identifying entities and built a hybrid system merging the results of a deep learning model with manual rules.
Results and Discussion: Our results show an overall F1-score of 0.99. We discuss implementation choices and present experiments to better understand the effort involved in such a task, including dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
@Article{Tannier2024, 
  title = {{Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse}},
  author = {Tannier, Xavier and Wajsbürt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  number = {01/02}, 
  year = {2024}, 
  month = mar, 
  journal = {Methods of Information in Medicine}, 
  volume = {63}, 
  doi = {10.1055/s-0044-1778693}
}
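
A minimal Python sketch of the hybrid strategy described above, merging spans found by hand-written rules with spans predicted by a learned model before masking. The rule patterns, the placeholder model and the replacement format are illustrative assumptions; the authors' actual rules, models and code are released separately under a 3-Clause BSD license.

import re

RULES = {
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),  # French phone numbers
}

def rule_entities(text):
    return [(m.start(), m.end(), label)
            for label, rx in RULES.items() for m in rx.finditer(text)]

def model_entities(text):
    return []  # placeholder for the predictions of a trained NER model

def pseudonymize(text):
    spans = sorted(set(rule_entities(text)) | set(model_entities(text)),
                   reverse=True)
    for start, end, label in spans:  # replace right-to-left to keep offsets valid
        text = text[:start] + f"<{label}>" + text[end:]
    return text

print(pseudonymize("Joignable au 06 12 34 56 78."))  # -> 'Joignable au <PHONE>.'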
Romain Bey, Ariel Cohen, Vincent Trebossen, Basile Dura, Pierre-Alexis Geoffroy, Charline Jean, Benjamin Landman, Thomas Petit-Jean, Gilles Chatellier, Kankoe Sallah, Xavier Tannier, Aurelie Bourmaud, Richard Delorme.
Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality.
npj Mental Health Research. Vol. 3, Issue 6, February 2024. doi: 10.1038/s44184-023-00046-7
[abstract] [BibTeX] [Nature Link]
There is an urgent need to monitor the mental health of large populations, especially during crises such as the COVID-19 pandemic, to timely identify the most at-risk subgroups and to design targeted prevention campaigns. We therefore developed and validated surveillance indicators related to suicidality: the monthly number of hospitalisations caused by suicide attempts and the prevalence among them of five known risk factors. They were automatically computed analysing the electronic health records of fifteen university hospitals of the Paris area, France, using natural language processing algorithms based on artificial intelligence. We evaluated the relevance of these indicators conducting a retrospective cohort study. Considering 2,911,920 records contained in a common data warehouse, we tested for changes after the pandemic outbreak in the slope of the monthly number of suicide attempts by conducting an interrupted time-series analysis. We segmented the assessment time in two sub-periods: before (August 1, 2017, to February 29, 2020) and during (March 1, 2020, to June 30, 2022) the COVID-19 pandemic. We detected 14,023 hospitalisations caused by suicide attempts. Their monthly number accelerated after the COVID-19 outbreak with an estimated trend variation reaching 3.7 (95%CI 2.1–5.3), mainly driven by an increase among girls aged 8–17 (trend variation 1.8, 95%CI 1.2–2.5). After the pandemic outbreak, acts of domestic, physical and sexual violence were more often reported (prevalence ratios: 1.3, 95%CI 1.16–1.48; 1.3, 95%CI 1.10–1.64 and 1.7, 95%CI 1.48–1.98), fewer patients died (p = 0.007) and stays were shorter (p < 0.001). Our study demonstrates that textual clinical data collected in multiple hospitals can be jointly analysed to compute timely indicators describing mental health conditions of populations. Our findings also highlight the need to better take into account the violence imposed on women, especially at early ages and in the aftermath of the COVID-19 pandemic.
@Article{Bey2024, 
  title = {{Natural language processing of multi-hospital electronic health records for public health surveillance of suicidality}},
  author = {Bey, Romain and Cohen, Ariel and Trebossen, Vincent and Dura, Basile and Geoffroy, Pierre-Alexis and Jean, Charline and Landman, Benjamin and Petit-Jean, Thomas and Chatellier, Gilles and Sallah, Kankoe and Tannier, Xavier and Bourmaud, Aurelie and Delorme, Richard},
  number = {6}, 
  year = {2024}, 
  month = feb, 
  journal = {npj Mental Health Research}, 
  volume = {3}, 
  doi = {10.1038/s44184-023-00046-7}
}
Marco Naguib, Aurélie Névéol, Xavier Tannier.
Reconnaissance d’entités cliniques en few-shot en trois langues.
in Actes de la 31ème conférence Traitement Automatique des Langues Naturelles (TALN 2024). Toulouse, France, July 2024.
[abstract] [BibTeX] [pdf]
Large language models are becoming the solution of choice for many natural language processing tasks, including in specialized domains where their few-shot capabilities are expected to deliver high performance in low-resource settings. However, our evaluation of 10 auto-regressive models and 16 masked models shows that, although auto-regressive models using prompts can compete on named entity recognition (NER) outside the clinical domain, they are outperformed in the clinical domain by lighter biLSTM-CRF taggers based on masked models. Moreover, masked models have a much lower environmental impact than auto-regressive models. These results, consistent across the three languages studied, suggest that few-shot learning models are not yet suited to NER production in the clinical domain, but could be used to speed up the creation of high-quality annotated data.
@InProceedings{Naguib2024, 
  title = {{Reconnaissance d’entités cliniques en few-shot en trois langues}},
  author = {Naguib, Marco and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la 31ème conférence Traitement Automatique des Langues Naturelles (TALN 2024)}, 
  address = {Toulouse, France}, 
  year = {2024}, 
  month = jul
}
Solène Delourme, Adam Remaki, Christel Gérardin, Pascal Vaillant, Xavier Tannier, Brigitte Séroussi, Akram Redjdal.
LIMICS@DEFT'24 : Un mini-LLM peut-il tricher aux QCM de pharmacie en fouillant dans Wikipédia et NACHOS ?.
in Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2024. Toulouse, France, July 2024.
[abstract] [BibTeX] [HAL link]
This paper explores two approaches to answering the pharmacy multiple-choice questions (MCQs) of the DEFT 2024 challenge using language models (LLMs) trained on open data with fewer than 3 billion parameters. Both approaches rely on a Retrieval-Augmented Generation (RAG) architecture to combine context retrieval from external knowledge bases (NACHOS and Wikipedia) with answer generation by the Apollo-2B LLM. The first approach processes the MCQs directly and generates the answers in a single step, while the second reformulates the MCQs as binary (yes/no) questions and then generates an answer for each binary question. The latter approach achieves an Exact Match Ratio of 14.7 and a Hamming Score of 51.6 on the test set, demonstrating the potential of RAG for question answering tasks under such constraints.
@InProceedings{Delourme2024, 
  title = {{LIMICS@DEFT'24 : Un mini-LLM peut-il tricher aux QCM de pharmacie en fouillant dans Wikipédia et NACHOS ?}},
  author = {Delourme, Solène and Remaki, Adam and Gérardin, Christel and Vaillant, Pascal and Tannier, Xavier and Séroussi, Brigitte and Redjdal, Akram},
  booktitle = {Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2024}, 
  address = {Toulouse, France}, 
  year = {2024}, 
  month = jul
}
Emmanuelle Kempf, Sonia Priou, Akram Redjdal, Étienne Guével, Xavier Tannier.
The More, the Better? Modalities of Metastatic Status Extraction on Free Medical Reports Based on Natural Language Processing (Response to Ahumada et al on Methodological and Practical Aspects of a Distant Metastasis Detection Model).
JCO Clinical Cancer Informatics. 8, August 2024. doi: 10.1200/CCI.24.00026
[BibTeX] [Ask me!] [Direct link]
@Article{Kempf2024b, 
  title = {{The More, the Better? Modalities of Metastatic Status Extraction on Free Medical Reports Based on Natural Language Processing (Response to Ahumada et al on Methodological and Practical Aspects of a Distant Metastasis Detection Model)}},
  author = {Kempf, Emmanuelle and Priou, Sonia and Redjdal, Akram and Guével, Étienne and Tannier, Xavier},
  year = {2024}, 
  month = aug, 
  journal = {JCO Clinical Cancer Informatics}, 
  volume = {8}, 
  doi = {10.1200/CCI.24.00026}
}
Christel Gérardin, Adam Remaki, Jacques Ung, P Pagès, Perceval Wajsbürt, Guillaume Faure, Thomas Petit-Jean, Xavier Tannier.
Améliorer la caractérisation phénotypique des patients atteints de maladies inflammatoires à médiation immunitaire par l’analyse automatique des comptes-rendus hospitaliers.
in 89ème congrès français de médecine interne, Revue de Médecine Interne. March 2024.
[BibTeX] [ScienceDirect Link]
@InProceedings{Gerardin2024b, 
  title = {{Améliorer la caractérisation phénotypique des patients atteints de maladies inflammatoires à médiation immunitaire par l’analyse automatique des comptes-rendus hospitaliers}},
  author = {Gérardin, Christel and Remaki, Adam and Ung, Jacques and Pagès, P and Wajsbürt, Perceval and Faure, Guillaume and Petit-Jean, Thomas and Tannier, Xavier},
  booktitle = {89ème congrès français de médecine interne, Revue de Médecine Interne}, 
  year = {2024}, 
  month = mar
}
Emmanuelle Kempf, Sonia Priou, Basile Dura, Julien Calderaro, Clara Brones, Perceval Wajsbürt, Lina Bennani, Xavier Tannier.
Structuration des critères histopronostiques tumoraux par traitement automatique du langage naturel - Une comparaison entre apprentissage machine et règles.
in Congrès ÉMOIS, Special Issue of the Journal of Epidemiology and Population Health. March 2024.
[BibTeX] [ScienceDirect Link]
@InProceedings{Kempf2024, 
  title = {{Structuration des critères histopronostiques tumoraux par traitement automatique du langage naturel - Une comparaison entre apprentissage machine et règles}},
  author = {Kempf, Emmanuelle and Priou, Sonia and Dura, Basile and Calderaro, Julien and Brones, Clara and Wajsbürt, Perceval and Bennani, Lina and Tannier, Xavier},
  booktitle = {Congrès ÉMOIS, Special Issue of the Journal of Epidemiology and Population Health}, 
  year = {2024}, 
  month = mar
}
Perceval Wajsburt, Xavier Tannier.
An end-to-end neural model based on cliques and scopes for frame extraction in long breast radiology reports.
in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. Toronto, Canada, pages 156–170, July 2023. © Association for Computational Linguistics.
[abstract] [BibTeX] [ACL Anthology]
We consider the task of automatically extracting various overlapping frames, i.e., structured entities composed of multiple labels and mentions, from long clinical breast radiology documents. While many methods exist for related topics such as event extraction, slot filling, or discontinuous entity recognition, a challenge in our study resides in the fact that clinical reports typically contain overlapping frames that span multiple sentences or paragraphs. We propose a new method that addresses these difficulties and evaluate it on a new annotated corpus. Despite the small number of documents, we show that the hybridization between knowledge injection and a learning-based system allows us to quickly obtain proper results. We will also introduce the concept of scope relations and show that it both improves the performance of our system and provides a visual explanation of the predictions.
@InProceedings{Wajsburt2023, 
  title = {{An end-to-end neural model based on cliques and scopes for frame extraction in long breast radiology reports}},
  author = {Wajsburt, Perceval and Tannier, Xavier},
  booktitle = {The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks}, 
  address = {Toronto, Canada}, 
  year = {2023}, 
  month = jul, 
  publisher = {Association for Computational Linguistics}, 
  pages = {156–170}
}
Nesrine Bannour, Bastien Rance, Xavier Tannier, Aurelie Neveol.
Event-independent temporal positioning: application to French clinical text.
in The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks. Toronto, Canada, pages 191–205, July 2023. © Association for Computational Linguistics.
[abstract] [BibTeX] [ACL Anthology]
Extracting temporal relations usually entails identifying and classifying the relation between two mentions. However, the definition of temporal mentions strongly depends on the text type and the application domain. Clinical text in particular is complex. It may describe events that occurred at different times, contain redundant information and a variety of domain-specific temporal expressions. In this paper, we propose a novel event-independent representation of temporal relations that is task-independent and, therefore, domain-independent. We are interested in identifying homogeneous text portions from a temporal standpoint and classifying the relation between each text portion and the document creation time. Temporal relation extraction is cast as a sequence labeling task and evaluated on oncology notes. We further evaluate our temporal representation by the temporal positioning of toxicity events of chemotherapy administered to colon and lung cancer patients described in French clinical reports. An overall macro F-measure of 0.86 is obtained for temporal relation extraction by a neural token classification model trained on clinical texts written in French. Our results suggest that the toxicity event extraction task can be performed successfully by automatically identifying toxicity events and placing them within the patient timeline (F-measure 0.62). The proposed system has the potential to assist clinicians in the preparation of tumor board meetings.
@InProceedings{Bannour2023b, 
  title = {{Event-independent temporal positioning: application to French clinical text}},
  author = {Bannour, Nesrine and Rance, Bastien and Tannier, Xavier and Neveol, Aurelie},
  booktitle = {The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks}, 
  address = {Toronto, Canada}, 
  year = {2023}, 
  month = jul, 
  publisher = {Association for Computational Linguistics}, 
  pages = {191–205}
}
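
To illustrate the representation proposed above, here is a small Python sketch that casts temporal positioning as sequence labeling: every token of a text portion receives the relation of that portion to the document creation time (DCT). The BEFORE/OVERLAP label set and the examples are simplified assumptions for illustration.

note_portions = [  # (homogeneous text portion, relation to the DCT)
    ("Chemotherapy was started in January .", "BEFORE"),
    ("Today the patient reports nausea .", "OVERLAP"),
]

# Expand portion-level relations into token-level tags, the format a
# neural token-classification model would be trained on.
for portion, relation in note_portions:
    for token in portion.split():
        print(f"{token}\t{relation}")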
Emmanuelle Kempf, Sonia Priou, Guillaume Lamé, Alexis Laurent, Etienne Guével, Stylianos Tzedakis, Romain Bey, David Fuks, Gilles Chatellier, Xavier Tannier, Gilles Galula, Rémi Flicoteaux, Christel Daniel, Christophe Tournigand.
No changes in clinical presentation, treatment strategies and survival of pancreatic cancer cases during the SARS-COV-2 outbreak: A retrospective multicenter cohort study on real-world data.
International Journal of Cancer. August 2023. doi: 10.1002/ijc.34675
[abstract] [BibTeX] [Ask me!] [Direct link (Wiley)]
The SARS-COV-2 pandemic disrupted healthcare systems. We assessed its impact on the presentation, care trajectories and outcomes of new pancreatic cancers (PCs) in the Paris area. We performed a retrospective multicenter cohort study on the data warehouse of Greater Paris University Hospitals (AP-HP). We identified all patients newly referred with a PC between January 1, 2019, and June 30, 2021, and excluded endocrine tumors. Using claims data and health records, we analyzed the timeline of care trajectories, the initial tumor stage, the treatment categories: pancreatectomy, exclusive systemic therapy or exclusive best supportive care (BSC). We calculated patients' 1-year overall survival (OS) and compared indicators in 2019 and 2020 to 2021. We included 2335 patients. Referral fell by 29% during the first lockdown. The median times from biopsy and from the first multidisciplinary meeting (MDM) to treatment were 25 days (16-50) and 21 days (11-40), respectively. Between 2019 and 2020 to 2021, the rate of metastatic tumors (36% vs 33%, P = .39), the pTNM distribution of the 464 cases with upfront tumor resection (P = .80), and the proportion of treatment categories did not vary: tumor resection (32% vs 33%), exclusive systemic therapy (49% vs 49%), exclusive BSC (19% vs 19%). The 1-year OS rates in 2019 vs 2020 to 2021 were 92% vs 89% (aHR = 1.42; 95% CI, 0.82-2.48), 52% vs 56% (aHR = 0.88; 95% CI, 0.73-1.08), 13% vs 10% (aHR = 1.00; 95% CI, 0.78-1.25), in the treatment categories, respectively. Despite an initial decrease in the number of new PCs, we did not observe any stage shift. OS did not vary significantly.
@Article{Kempf2023b, 
  title = {{No changes in clinical presentation, treatment strategies and survival of pancreatic cancer cases during the SARS-COV-2 outbreak: A retrospective multicenter cohort study on real-world data}},
  author = {Kempf, Emmanuelle and Priou, Sonia and Lamé, Guillaume and Laurent, Alexis and Guével, Etienne and Tzedakis, Stylianos and Bey, Romain and Fuks, David and Chatellier, Gilles and Tannier, Xavier and Galula, Gilles and Flicoteaux, Rémi and Daniel, Christel and Tournigand, Christophe},
  year = {2023}, 
  month = aug, 
  journal = {International Journal of Cancer}, 
  doi = {10.1002/ijc.34675}
}
Emmanuelle Kempf, Morgan Vaterkowski, Damien Leprovost, Nicolas Griffon, David Ouagne, Stéphane Bréant, Patricia Serre, Alexandre Mouchet, Bastien Rance, Gilles Chatellier, Ali Bellamine, Marie Frank, Julien Guerin, Xavier Tannier, Alain Livartowski, Martin Hilka, Christel Daniel.
How to Improve Cancer Patients ENrollment in Clinical Trials From rEal-Life Databases Using the Observational Medical Outcomes Partnership Oncology Extension: Results of the PENELOPE Initiative in Urologic Cancers.
JCO Clinical Cancer Informatics. 7, May 2023. doi: 10.1200/CCI.22.00179
[abstract] [BibTeX] [Ask me!] [Direct link]
Purpose: To compare the computability of Observational Medical Outcomes Partnership (OMOP)-based queries related to prescreening of patients using two versions of the OMOP common data model (CDM; v5.3 and v5.4) and to assess the performance of the Greater Paris University Hospital (APHP) prescreening tool.
Materials and methods: We identified the prescreening information items being relevant for prescreening of patients with cancer. We randomly selected 15 academic and industry-sponsored urology phase I-IV clinical trials (CTs) launched at APHP between 2016 and 2021. The computability of the related prescreening criteria (PC) was defined by their translation rate in OMOP-compliant queries and by their execution rate on the APHP clinical data warehouse (CDW) containing data of 205,977 patients with cancer. The overall performance of the prescreening tool was assessed by the rate of true- and false-positive cases of three randomly selected CTs.
Results: We defined a list of 15 minimal information items being relevant for patients' prescreening. We identified 83 PC among the 534 eligibility criteria from the 15 CTs. We translated 33 and 62 PC into queries on the basis of OMOP CDM v5.3 and v5.4, respectively (translation rates of 40% and 75%, respectively). Of the 33 PC translated with v5.3 of the OMOP CDM, 19 could be executed on the APHP CDW (execution rate of 58%). Of 83 PC, the computability rate on the APHP CDW reached 23%. On the basis of three CTs, we identified 17, 32, and 63 patients as being potentially eligible for inclusion in those CTs, resulting in positive predictive values of 53%, 41%, and 21%, respectively.
Conclusion: We showed that PC could be formalized according to the OMOP CDM and that the oncology extension increased their translation rate through better representation of cancer natural history.
@Article{Kempf2023a, 
  title = {{How to Improve Cancer Patients ENrollment in Clinical Trials From rEal-Life Databases Using the Observational Medical Outcomes Partnership Oncology Extension: Results of the PENELOPE Initiative in Urologic Cancers}},
  author = {Kempf, Emmanuelle and Vaterkowski, Morgan and Leprovost, Damien and Griffon, Nicolas and Ouagne, David and Bréant, Stéphane and Serre, Patricia and Mouchet, Alexandre and Rance, Bastien and Chatellier, Gilles and Bellamine, Ali and Frank, Marie and Guerin, Julien and Tannier, Xavier and Livartowski, Alain and Hilka, Martin and Daniel, Christel},
  year = {2023}, 
  month = may, 
  journal = {JCO Clinical Cancer Informatics}, 
  volume = {7}, 
  doi = {10.1200/CCI.22.00179}
}
Marco Naguib, Aurélie Névéol, Xavier Tannier.
Stratégies d'apprentissage actif pour la reconnaissance d'entités nommées en français.
in Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023). Paris, France, June 2023.
[abstract] [BibTeX] [pdf]
Manual corpus annotation is a costly and slow process, especially for the named entity recognition task. Active learning aims to make this process more efficient by selecting the most relevant portions to annotate. Some strategies select the portions that are most representative of the corpus, others those that are most informative for the language model. Despite growing interest in active learning, few studies compare these strategies in the context of medical named entity recognition. We compare these strategies based on their performance on 3 corpora of clinical documents in French: MERLOT, QuaeroFrenchMed and E3C. We compare the selection strategies as well as the different ways of evaluating them. Finally, we identify the strategies that appear most effective and measure the improvement they bring at different phases of learning.
@InProceedings{Naguib2023, 
  title = {{Stratégies d'apprentissage actif pour la reconnaissance d'entités nommées en français}},
  author = {Naguib, Marco and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023)}, 
  address = {Paris, France}, 
  year = {2023}, 
  month = jun
}
Nesrine Bannour, Xavier Tannier, Bastien Rance, Aurélie Névéol.
Positionnement temporel indépendant des évènements : application à des textes cliniques en français.
in Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023). Paris, France, June 2023.
[abstract] [BibTeX] [pdf]
Temporal relation extraction consists in identifying and classifying the relation between two mentions. However, the definition of temporal mentions depends largely on the text type and the application domain. Clinical text in particular is complex because it describes events occurring at different times and contains redundant information and various domain-specific temporal expressions. In this paper, we propose a new representation of temporal relations that is independent of the domain and of the goal of the extraction task. We focus on extracting the relation between each portion of the text and the document creation date. We cast temporal relation extraction as a sequence labeling task. A macro F-measure of 0.8 is obtained by a neural model trained on clinical texts written in French. We evaluate our temporal representation through the temporal positioning of chemotherapy toxicity events.
@InProceedings{Bannour2023, 
  title = {{Positionnement temporel indépendant des évènements : application à des textes cliniques en français}},
  author = {Bannour, Nesrine and Tannier, Xavier and Rance, Bastien and Névéol, Aurélie},
  booktitle = {Actes de la 30ème conférence Traitement Automatique des Langues Naturelles (TALN 2023)}, 
  address = {Paris, France}, 
  year = {2023}, 
  month = jun
}
Christel Gérardin, Arthur Mageau, Arsène Mékinian, Xavier Tannier, Fabrice Carrat.
Construction of Cohorts of Similar Patients From Automatic Extraction of Medical Concepts: Phenotype Extraction Study.
JMIR Medical Informatics. Vol. 10, Issue 12, December 2022. doi: 10.2196/42379
[abstract] [BibTeX] [JMIR link]
Background: Reliable and interpretable automatic extraction of clinical phenotypes from large electronic medical record databases remains a challenge, especially in a language other than English.
Objective: We aimed to provide an automated end-to-end extraction of cohorts of similar patients from electronic health records for systemic diseases.
Methods: Our multistep algorithm includes a named-entity recognition step, a multilabel classification step using the Medical Subject Headings ontology, and the computation of patient similarity. A selection of cohorts of similar patients based on a priori annotated phenotypes was performed. Six phenotypes were selected for their clinical significance: P1, osteoporosis; P2, nephritis in systemic erythematosus lupus; P3, interstitial lung disease in systemic sclerosis; P4, lung infection; P5, obstetric antiphospholipid syndrome; and P6, Takayasu arteritis. We used a training set of 151 clinical notes and an independent validation set of 256 clinical notes, with annotated phenotypes, both extracted from the Assistance Publique-Hôpitaux de Paris data warehouse. We evaluated the precision of the 3 patients closest to the index patient for each phenotype with precision-at-3, recall, and average precision.
Results: For P1-P4, the precision-at-3 ranged from 0.85 (95% CI 0.75-0.95) to 0.99 (95% CI 0.98-1), the recall ranged from 0.53 (95% CI 0.50-0.55) to 0.83 (95% CI 0.81-0.84), and the average precision ranged from 0.58 (95% CI 0.54-0.62) to 0.88 (95% CI 0.85-0.90). P5-P6 phenotypes could not be analyzed due to the limited number of phenotypes.
Conclusions: Using a method close to clinical reasoning, we built a scalable and interpretable end-to-end algorithm for extracting cohorts of similar patients.
@Article{Gérardin2022b, 
  title = {{Construction of Cohorts of Similar Patients From Automatic Extraction of Medical Concepts: Phenotype Extraction Study}},
  author = {Gérardin, Christel and Mageau, Arthur and Mékinian, Arsène and Tannier, Xavier and Carrat, Fabrice},
  number = {12}, 
  year = {2022}, 
  month = dec, 
  journal = {JMIR Medical Informatics}, 
  volume = {10}, 
  doi = {10.2196/42379}
}
Christel Gérardin, Perceval Wajsbürt, Pascal Vaillant, Ali Bellamine, Fabrice Carrat, Xavier Tannier.
Multilabel classification of medical concepts for patient clinical profile identification.
Artificial Intelligence in Medicine. 128, June 2022. doi: 10.1016/j.artmed.2022.102311
[abstract] [BibTeX] [Ask me!] [ScienceDirect link]
Highlights
  • Extracting key information from clinical narratives is an NLP challenge.
  • There is a particular need to improve NLP tasks in languages other than English.
  • Our approach allows automatic pathological domains detection from clinical notes.
  • Using multilingual vocabularies and multilingual model leads to better results.
Abstract
Background: The development of electronic health records has provided a large volume of unstructured biomedical information. Extracting patient characteristics from these data has become a major challenge, especially in languages other than English.
Methods: Inspired by the French Text Mining Challenge (DEFT 2021) [1], in which we participated, our study proposes a multilabel classification of clinical narratives, allowing us to automatically extract the main features of a patient report. Our system is an end-to-end pipeline from raw text to labels with two main steps: named entity recognition and multilabel classification. Both steps rely on a transformer-based neural network architecture. To train our final classifier, we extended the dataset with all English and French Unified Medical Language System (UMLS) vocabularies related to human diseases. We focus our study on the multilingualism of training resources and models, with experiments combining French and English in different ways (multilingual embeddings or translation).
Results: We obtained an overall average micro-F1 score of 0.811 for the multilingual version, 0.807 for the French-only version and 0.797 for the translated version.
Conclusion: Our study proposes an original multilabel classification of French clinical notes for patient phenotyping. We show that a multilingual algorithm trained on annotated real clinical notes and UMLS vocabularies leads to the best results.
@Article{Gérardin2022, 
  title = {{Multilabel classification of medical concepts for patient clinical profile identification}},
  author = {Gérardin, Christel and Wajsbürt, Perceval and Vaillant, Pascal and Bellamine, Ali and Carrat, Fabrice and Tannier, Xavier},
  year = {2022}, 
  month = jun, 
  journal = {Artificial Intelligence in Medicine}, 
  volume = {128}, 
  doi = {10.1016/j.artmed.2022.102311}
}
Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, Aurélie Névéol.
Privacy-Preserving Mimic Models for clinical Named Entity Recognition in French.
Journal of Biomedical Informatics. 130, June 2022. doi: 10.1016/j.jbi.2022.104073
[abstract] [BibTeX] [Ask me!] [ScienceDirect link]
Highlights
  • We propose Privacy-Preserving Mimic Models for clinical named entity recognition.
  • Models are trained without processing any sensitive data or private model weights.
  • Mimic models achieve up to 0.706 macro exact F-measure on 15 clinical entity types.
  • Our approach offers a good compromise between performance and privacy preservation.
Abstract
A vast amount of crucial information about patients resides solely in unstructured clinical narrative notes. There has been a growing interest in the clinical Named Entity Recognition (NER) task using deep learning models. Such approaches require sufficient annotated data. However, there are few publicly available annotated corpora in the medical field due to the sensitive nature of clinical text. In this paper, we tackle this problem by building privacy-preserving shareable models for French clinical Named Entity Recognition using the mimic learning approach, which enables knowledge transfer from a teacher model trained on a private corpus to a student model. This student model can be publicly shared without any access to the original sensitive data. We evaluated three privacy-preserving models using three medical corpora and compared the performance of our models to those of baseline models such as dictionary-based models. An overall macro F-measure of 70.6% could be achieved by a student model trained using silver annotations produced by the teacher model, compared to 85.7% for the original private teacher model. Our results revealed that these privacy-preserving mimic learning models offer a good compromise between performance and data privacy preservation.
@Article{Bannour2022, 
  title = {{Privacy-Preserving Mimic Models for clinical Named Entity Recognition in French}},
  author = {Bannour, Nesrine and Wajsbürt, Perceval and Rance, Bastien and Tannier, Xavier and Névéol, Aurélie},
  year = {2022}, 
  month = jun, 
  journal = {Journal of Biomedical Informatics}, 
  volume = {130}, 
  doi = {10.1016/j.jbi.2022.104073}
}
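
A minimal Python sketch of the teacher-student mimic learning setup described above: a teacher trained on private notes produces silver annotations for a public corpus, and only the student trained on that silver data is shared. The train_ner helper and the data are placeholders, not the authors' code.

def train_ner(texts, labels):
    """Placeholder: fit any NER model on (texts, labels) and return it."""
    class Model:
        def predict(self, texts):
            return [[] for _ in texts]  # dummy entity predictions
    return Model()

# 1. Teacher: trained inside the secure environment on sensitive data.
private_texts, private_labels = ["..."], [[]]
teacher = train_ner(private_texts, private_labels)

# 2. Silver annotation of a public, non-sensitive corpus.
public_texts = ["..."]
silver_labels = teacher.predict(public_texts)

# 3. Student: trained only on public texts and silver labels, hence shareable.
student = train_ner(public_texts, silver_labels)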
Adrian Ahne, Guy Fagherazzi, Xavier Tannier, Thomas Czernichow, Francisco Orchard.
Improving Diabetes-Related Biomedical Literature Exploration in the Clinical Decision-making Process via Interactive Classification and Topic Discovery: Methodology Development Study.
Journal of Medical Internet Research. Vol. 24, Issue 1, January 2022. doi: 10.2196/27434
[abstract] [BibTeX] [JMIR link]
Background: The amount of available textual health data such as scientific and biomedical literature is constantly growing, making it more and more challenging for health professionals to properly summarize those data and practice evidence-based clinical decision making. Moreover, the exploration of unstructured health text data is challenging for professionals without computer science knowledge due to limited time, resources, and skills. Current tools for exploring text data lack ease of use, require high computational effort, and struggle to incorporate domain knowledge and focus on topics of interest.
Objective: We developed a methodology able to explore and target topics of interest via an interactive user interface for health professionals with limited computer science knowledge. We aim to reach near state-of-the-art performance while reducing memory consumption, increasing scalability, and minimizing user interaction effort to improve the clinical decision-making process. The performance was evaluated on diabetes-related abstracts from PubMed.
Methods: The methodology consists of 4 parts: (1) a novel interpretable hierarchical clustering of documents where each node is defined by headwords (words that best represent the documents in the node), (2) an efficient classification system to target topics, (3) minimized user interaction effort through active learning, and (4) a visual user interface. We evaluated our approach on 50,911 diabetes-related abstracts providing a hierarchical Medical Subject Headings (MeSH) structure, a unique identifier for a topic. Hierarchical clustering performance was compared against the implementation in the machine learning library scikit-learn. On a subset of 2000 randomly chosen diabetes abstracts, our active learning strategy was compared against 3 other strategies: random selection of training instances, uncertainty sampling that chooses instances about which the model is most uncertain, and an expected gradient length strategy based on convolutional neural networks (CNNs).
Results: For the hierarchical clustering performance, we achieved an F1 score of 0.73 compared to 0.76 achieved by scikit-learn. Concerning active learning performance, after 200 chosen training samples based on these strategies, the weighted F1 score of all MeSH codes resulted in a satisfying 0.62 F1 score using our approach, 0.61 using the uncertainty strategy, 0.63 using the CNN, and 0.45 using the random strategy. Moreover, our methodology showed constant low memory use as the number of documents increased.
Conclusions: We proposed an easy-to-use tool for health professionals with limited computer science knowledge, who can combine their domain knowledge with topic exploration and target specific topics of interest while improving transparency. Furthermore, our approach is memory efficient and highly parallelizable, making it interesting for large big data sets. This approach can be used by health professionals to gain deep insights into biomedical literature and ultimately improve the evidence-based clinical decision-making process.
@Article{Ahne2022, 
  title = {{Improving Diabetes-Related Biomedical Literature Exploration in the Clinical Decision-making Process via Interactive Classification and Topic Discovery: Methodology Development Study}},
  author = {Ahne, Adrian and Fagherazzi, Guy and Tannier, Xavier and Czernichow, Thomas and Orchard, Francisco},
  number = {1}, 
  year = {2022}, 
  month = jan, 
  journal = {Journal of Medical Internet Research}, 
  volume = {24}, 
  doi = {10.2196/27434}
}
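The uncertainty-sampling baseline mentioned in the abstract above can be sketched in a few lines; this is a generic illustration on invented toy data, not the study's code:

# Sketch of least-confident uncertainty sampling: iteratively query the
# unlabeled abstract the classifier is least sure about, add it to the
# training pool, and retrain. Documents and labels are toy examples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["insulin therapy in type 1 diabetes", "diet and type 2 diabetes",
        "retinopathy screening", "glucose monitoring devices",
        "exercise and glycemic control", "insulin pump safety"]
labels = np.array([0, 1, 0, 1, 1, 0])          # toy topic labels

X = TfidfVectorizer().fit_transform(docs)
labeled = [0, 1]                                # seed set with both classes
unlabeled = [i for i in range(len(docs)) if i not in labeled]

for _ in range(2):                              # two active-learning rounds
    clf = LogisticRegression().fit(X[labeled], labels[labeled])
    proba = clf.predict_proba(X[unlabeled])
    # Least-confident sampling: smallest top-class probability.
    pick = unlabeled[int(np.argmin(proba.max(axis=1)))]
    labeled.append(pick)                        # the oracle provides its label
    unlabeled.remove(pick)
    print("queried doc:", pick)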
Nesrine Bannour, Perceval Wajsbürt, Bastien Rance, Xavier Tannier, Aurélie Névéol.
Modèles préservant la confidentialité des données par mimétisme pour la reconnaissance d’entités nommées en français.
in Actes de la journée d’étude sur la robustesse des systèmes de TAL. Paris, France, December 2022.
[BibTeX] [Link to free copy]
@InProceedings{Bannour2022b, 
  title = {{Modèles préservant la confidentialité des données par mimétisme pour la reconnaissance d’entités nommées en français}},
  author = {Bannour, Nesrine and Wajsbürt, Perceval and Rance, Bastien and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Actes de la journée d’étude sur la robustesse des systèmes de TAL}, 
  address = {Paris, France}, 
  year = {2022}, 
  month = dec
}
Perceval Wajsbürt, Arnaud Sarfati, Xavier Tannier.
Medical concept normalization in French using multilingual terminologies and contextual embeddings.
Journal of Biomedical Informatics. 114, January 2021. doi: 10.1016/j.jbi.2021.103684
[abstract] [BibTeX] [Ask me!] [ScienceDirect link]
Highlights
  • We train a model to normalize medical entities in French with a very large list of concepts.
  • Our method is a neural network model that requires no prior translation.
  • Multilingual training data improves the performance of medical normalization in French.
  • Multilingual embeddings are of less importance than multilingual data.
Introduction: Concept normalization is the task of linking terms from textual medical documents to their concept in terminologies such as the UMLS®. Traditional approaches to this problem depend heavily on the coverage of available resources, which poses a problem for languages other than English.
Objective: We present a system for concept normalization in French. We consider textual mentions already extracted and labeled by a named entity recognition system, and we classify these mentions with a UMLS concept unique identifier. We take advantage of the multilingual nature of available terminologies and embedding models to improve concept normalization in French without translation or direct supervision.
Materials and methods: We consider the task as a highly-multiclass classification problem. The terms are encoded with contextualized embeddings and classified via cosine similarity and softmax. A first step uses a subset of the terminology to finetune the embeddings and train the model. A second step adds the entire target terminology, and the model is trained further with hard negative selection and softmax sampling.
Results: On two corpora from the Quaero FrenchMed benchmark, we show that our approach can lead to good results even with no labeled data at all, and that it outperforms existing supervised methods when labeled data is available.
Discussion: Training the system with both French and English terms improves by a large margin the performance of the system on a French benchmark, regardless of the way the embeddings were pretrained (French, English, multilingual). Our distantly supervised method can be applied to any kind of documents or medical domain, as it does not require any concept-labeled documents.
Conclusion: These experiments pave the way for simpler and more effective multilingual approaches to processing medical texts in languages other than English.
@Article{Wajsburt2021, 
  title = {{Medical concept normalization in French using multilingual terminologies and contextual embeddings}},
  author = {Wajsbürt, Perceval and Sarfati, Arnaud and Tannier, Xavier},
  year = {2021}, 
  month = jan, 
  journal = {Journal of Biomedical Informatics}, 
  volume = {114}, 
  doi = {10.1016/j.jbi.2021.103684}
}
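A toy sketch of the similarity-based normalization idea described above, with TF-IDF character n-grams standing in for the contextual embeddings; the CUIs and terms are shown for illustration only:

# Mentions and terminology entries are embedded in the same space and a
# mention is linked to its nearest concept by cosine similarity. Here the
# "embeddings" are TF-IDF character n-grams, a stand-in for the paper's
# contextualized encoder; there is no fine-tuning or negative sampling.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

terminology = {
    "C0020538": "hypertension artérielle",
    "C0011849": "diabète sucré",
    "C0004096": "asthme",
}
cuis = list(terminology)
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
concept_vecs = vec.fit_transform(terminology.values())

def normalize(mention: str) -> str:
    """Return the CUI of the most similar concept name."""
    sims = cosine_similarity(vec.transform([mention]), concept_vecs)
    return cuis[int(np.argmax(sims))]

print(normalize("une hypertension"))   # expected: C0020538
print(normalize("diabète de type 2"))  # expected: C0011849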
Pierre Chastang, Sergio Torres Aguilar, Xavier Tannier.
A Named Entity Recognition Model for Medieval Latin Charters.
Digital Humanities Quarterly. Vol. 15, Issue 4, November 2021.
[abstract] [BibTeX] [DHQ free link]
Named entity recognition is an advantageous technique with an increasing presence in digital humanities. In theory, automatic detection and recovery of named entities can provide new ways of looking up unedited information in edited sources and can allow the parsing of massive amounts of data in a short time to support historical hypotheses. In this paper, we detail the implementation of a model for automatic named entity recognition in medieval Latin sources and test its robustness on different datasets. Different models were trained on a vast dataset of Burgundian diplomatic charters from the 9th to 14th centuries and validated using general and century-specific ad hoc models tested on short sets of Parisian, English, Italian and Spanish charters. We present the results of cross-validation in each case and discuss the implications of these results for the history of medieval place-names and personal names.
@Article{Chastang2021, 
  title = {{A Named Entity Recognition Model for Medieval Latin Charters}},
  author = {Chastang, Pierre and Torres Aguilar, Sergio and Tannier, Xavier},
  number = {4}, 
  year = {2021}, 
  month = nov, 
  journal = {Digital Humanities Quarterly}, 
  volume = {15}
}
Perceval Wajsbürt, Yoann Taillé, Xavier Tannier.
Effect of depth order on iterative nested named entity recognition models.
in Conference on Artificial Intelligence in Medicine (AIME 2021). Porto, Portugal, June 2021.
[abstract] [BibTeX] [Long version on arXiv]
This paper studies the effect of mention depth order on nested named entity recognition (NER) models. NER is an essential task in the extraction of biomedical information, and nested entities are common since medical concepts can assemble to form larger entities. Conventional NER systems only predict disjoint entities. Thus, iterative models for nested NER use multiple predictions to enumerate all entities, imposing a predefined order, from largest to smallest or smallest to largest. We design an order-agnostic iterative model and a procedure to choose a custom order during training and prediction. We propose a modification of the Transformer architecture to take into account the entities predicted in the previous steps. We provide a set of experiments to study the model's capabilities and the effect of the order on performance. Finally, we show that the smallest-to-largest order gives the best results.
@InProceedings{Wajsburt2021b, 
  title = {{Effect of depth order on iterative nested named entity recognition models}},
  author = {Perceval Wajsbürt and Yoann Taillé and Xavier Tannier},
  booktitle = {Conference on Artificial Intelligence in Medicine (AIME 2021)}, 
  address = {Porto, Portugal}, 
  year = {2021}, 
  month = jun
}
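Schematically, the iterative decoding studied above can be pictured as repeated passes over the sentence, each conditioned on the entities found so far; the rule-based "model pass" below is only a placeholder for the paper's modified Transformer, and all spans and labels are invented:

# Iterative nested-NER decoding, smallest-to-largest (the order found best
# in the paper): each pass sees the entities found at the previous depth.
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

def predict_one_depth(tokens: List[str], found: List[Span]) -> List[Span]:
    """Stand-in for one model pass; a real model conditions on `found`."""
    new = []
    if not found:
        # Depth 1: the innermost mention.
        if "aorte" in tokens:
            i = tokens.index("aorte")
            new.append((i, i + 1, "ANATOMY"))
    else:
        # Depth 2: a larger entity wrapping a previously found one.
        if "dissection" in tokens:
            i = tokens.index("dissection")
            new.append((i, i + 3, "DISORDER"))  # covers "dissection de aorte"
    return new

tokens = "une dissection de aorte".split()
entities: List[Span] = []
for depth in range(2):
    step = predict_one_depth(tokens, entities)
    if not step:
        break
    entities.extend(step)
print(entities)  # [(3, 4, 'ANATOMY'), (1, 4, 'DISORDER')]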
Christel Gérardin, Pascal Vaillant, Perceval Wajsbürt, Clément Gilavert, Ali Bellamine, Emmanuelle Kempf, Xavier Tannier.
Classification multilabel de concepts médicaux pour l’identification du profil clinique du patient.
in Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2021. Lille, France, June 2021.
[abstract] [BibTeX] [HAL link]
The first task of the DEFT 2021 text-mining challenge consisted in automatically extracting, from clinical cases, the pathological phenotypes of patients, grouped by MeSH-disease chapter headings. The solution presented is a transformer-based multilabel classifier. Two transformers were used: the standard camembert-large (run 1) and a camembert-large fine-tuned on freely available French biomedical articles (run 2). We also proposed an "end-to-end" model, with a first named entity extraction phase, also based on a camembert-large transformer, and a gender classifier based on an Adaboost model. We obtain very good recall and decent precision, with an F1-measure around 0.77 for all three runs. The performance of the "end-to-end" model is similar to that of the other methods.
@InProceedings{Gerardin2021, 
  title = {{Classification multilabel de concepts médicaux pour l’identification du profil clinique du patient}},
  author = {Christel Gérardin and Pascal Vaillant and Perceval Wajsbürt and Clément Gilavert and Ali Bellamine and Emmanuelle Kempf and Xavier Tannier},
  booktitle = {Défi Fouille de Texte (DEFT), Traitement Automatique des Langues Naturelles, 2021}, 
  address = {Lille, France}, 
  year = {2021}, 
  month = jun
}
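A schematic multilabel setup in the spirit of the system above, with TF-IDF plus one-vs-rest logistic regression standing in for the camembert-large encoder; the texts and MeSH-style labels are invented:

# Multilabel classification of clinical cases: each case may carry several
# chapter-level labels at once, so a one-vs-rest scheme fits one binary
# classifier per label. A toy stand-in, not the challenge submission.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

cases = ["douleur thoracique et dyspnée",
         "éruption cutanée prurigineuse",
         "dyspnée avec toux chronique"]
labels = [{"cardiovascular", "respiratory"}, {"skin"}, {"respiratory"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # binary indicator matrix
vec = TfidfVectorizer()
clf = OneVsRestClassifier(LogisticRegression()).fit(
    vec.fit_transform(cases), Y)

pred = clf.predict(vec.transform(["toux et dyspnée aiguë"]))
print(mlb.inverse_transform(pred))            # predicted label sets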
Ali Bellamine, Christel Daniel, Perceval Wajsbürt, Christian Roux, Xavier Tannier, Karine Briot.
Identification automatique des patients avec fractures ostéoporotiques à partir de comptes rendus médicaux.
in 34e Congrès Français de Rhumatologie. Paris, France, December 2021.
[abstract] [BibTeX] [ScienceDirect]
Introduction: Osteoporotic fractures are associated with excess morbidity and mortality. Implementing fracture-liaison care pathways is effective in reducing the risk of new fractures and the excess mortality. The human resources required and the difficulty of identifying eligible patients are among the obstacles to setting up and running these pathways. The objective of this study is to develop and validate an automatic detection tool identifying osteoporotic fractures in patients over 50 years of age from medical reports.
Patients and methods: The automatic detection tool relies on a pipeline of algorithms combining natural language processing, machine learning and rule-based techniques. The tool was developed and validated on medical reports from the emergency and orthopedics departments of the clinical data warehouse (EDS) of the Assistance Publique–Hôpitaux de Paris (AP–HP). It was developed on a random sample of 4917 documents from one hospital. The documents used to design the algorithms are distinct from those used to train them. External validation was performed on all orthopedics and emergency medical reports collected over 3 months in the EDS, i.e., 154,031 documents. The performance of the tool (sensitivity Se, specificity Sp, positive predictive value PPV, negative predictive value NPV) was computed for both development and validation.
Results: The tool was developed on 3913 emergency documents and 1004 orthopedics documents. The performance of the various algorithms making up the tool is: Se between 80 and 93%, Sp between 62 and 99%, PPV between 90 and 96%, and NPV between 69 and 99%. The tool was validated on a base of 154,031 documents (148,423 from the emergency departments and 5608 from orthopedics) (46% women, mean age 67 years). It identified 4% of emergency documents with a fracture likely to be osteoporotic (n = 5806) and 27% of orthopedics documents (n = 1503), corresponding to a population with a mean age of 74 years and 68% women. Manual validation by an expert was performed on 1000 randomly selected documents with an identified fracture and 1000 without. Se, Sp, PPV and NPV are 68%, 100%, 78% and 99% for emergency reports, and 84%, 97%, 92% and 93% for orthopedics reports.
Conclusion: This study is the first work showing that an automatic identification tool based on natural language processing and machine learning can identify patients with fractures likely to be osteoporotic from emergency and orthopedics medical reports. The performance of the tool is good and meets the need for assistance in identifying patients within post-fracture care pathways.
@InProceedings{Bellamine2021, 
  title = {{Identification automatique des patients avec fractures ostéoporotiques à partir de comptes rendus médicaux}},
  author = {Ali Bellamine and Christel Daniel and Perceval Wajsbürt and Christian Roux and Xavier Tannier and Karine Briot},
  booktitle = {34e Congrès Français de Rhumatologie}, 
  address = {Paris, France}, 
  year = {2021}, 
  month = dec
}
Nesrine Bannour, Aurélie Névéol, Xavier Tannier, Bastien Rance.
Traitement Automatique de la Langue et Intégration de Données pour les Réunions de Concertations Pluridisciplinaires en Oncologie.
in Journée AFIA/ATALA "la santé et le langage". February 2021.
[abstract] [BibTeX] [Link]
Multidisciplinary team meetings (RCP) in oncology allow experts from different specialties to choose the best therapeutic options for patients. The data needed for these meetings are often collected manually, with a risk of error during extraction and a significant cost for healthcare professionals. Several scientific studies on English-language documents have addressed the automatic extraction of information (such as tumor location, histological classifications, TNM staging, etc.) from clinical reports in medical records. Within the ASIMOV project (ASsIster la recherche en oncologie par le Machine Learning, l'intégration de dOnnées et la Visualisation), we will use natural language processing and data integration to extract cancer-related information from data warehouses and French clinical texts.
@InProceedings{Bannour2021, 
  title = {{Traitement Automatique de la Langue et Intégration de Données pour les Réunions de Concertations Pluridisciplinaires en Oncologie}},
  author = {Bannour, Nesrine and Névéol, Aurélie and Tannier, Xavier and Rance, Bastien},
  booktitle = {Journée AFIA/ATALA "la santé et le langage"}, 
  year = {2021}, 
  month = feb
}
Julien Tourille, Olivier Ferret, Aurélie Névéol, Xavier Tannier.
Modèle neuronal pour la résolution de la coréférence dans les dossiers médicaux électroniques.
in Actes de la 27ème conférence Traitement Automatique des Langues Naturelles (TALN 2020). Nancy, France, June 2020.
[abstract] [BibTeX] [HAL link]
Coreference resolution is an essential component for the automatic construction of medical timelines from electronic health records. In this work, we present a neural approach to coreference resolution for general and clinical entities in medical texts written in English, and evaluate it on the reference benchmark for this task, Task 1C of the 2011 i2b2 campaign.
@InProceedings{Tourille2020, 
  title = {{Modèle neuronal pour la résolution de la coréférence dans les dossiers médicaux électroniques}},
  author = {Tourille, Julien and Ferret, Olivier and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la 27ème conférence Traitement Automatique des Langues Naturelles (TALN 2020)}, 
  address = {Nancy, France}, 
  year = {2020}, 
  month = jun
}
Perceval Wajsbürt, Yoann Taillé, Guillaume Lainé, Xavier Tannier.
Participation de l'équipe du LIMICS à DEFT 2020.
in Défi Fouille de Texte (DEFT) 2020. Nancy, France, June 2020.
[abstract] [BibTeX] [HAL link]
In this article, we present the methods designed and the results obtained during our participation in Task 3 of the DEFT 2020 evaluation campaign, consisting in named entity recognition in the medical domain. We propose two different models that take nested entities into account, one of the difficulties of the proposed dataset, and present the results obtained. Our best run achieves the best performance among the participants on one of the two subtasks of the challenge.
@InProceedings{Wajsburt2020, 
  title = {{Participation de l'équipe du LIMICS à DEFT 2020}},
  author = {Perceval Wajsbürt and Yoann Taillé and Guillaume Lainé and Xavier Tannier},
  booktitle = {Défi Fouille de Texte (DEFT) 2020}, 
  address = {Nancy, France}, 
  year = {2020}, 
  month = jun
}
Xavier Tannier, Nicolas Paris, Hugo Cisneros, Christel Daniel, Matthieu Doutreligne, Catherine Duclos, Nicolas Griffon, Claire Hassen-Khodja, Ivan Lerner, Adrien Parrot, Éric Sadou, Cyril Saussol, Pascal Vaillant.
Hybrid Approaches for our Participation to the n2c2 Challenge on Cohort Selection for Clinical Trials.
March 2019.
[abstract] [BibTeX] [arXiv]
Objective: Natural language processing can help minimize human intervention in identifying patients meeting eligibility criteria for clinical trials, but there is still a long way to go to obtain a general and systematic approach that is useful for researchers. We describe two methods taking a step in this direction and present their results obtained during the n2c2 challenge on cohort selection for clinical trials.
Materials and Methods: The first method is a weakly supervised method using an unlabeled corpus (MIMIC) to build a silver standard, by producing semi-automatically a small and very precise set of rules to detect some samples of positive and negative patients. This silver standard is then used to train a traditional supervised model. The second method is a terminology-based approach where a medical expert selects the appropriate concepts, and a procedure is defined to search the terms and check the structural or temporal constraints.
Results: On the n2c2 dataset containing annotated data about 13 selection criteria on 288 patients, we obtained an overall F1-measure of 0.8969, which is the third best result out of 45 participant teams, with no statistically significant difference with the best-ranked team.
Discussion: Both approaches obtained very encouraging results and apply to different types of criteria. The weakly supervised method requires explicit descriptions of positive and negative examples in some reports. The terminology-based method is very efficient when medical concepts carry most of the relevant information.
Conclusion: It is unlikely that much more annotated data will be soon available for the task of identifying a wide range of patient phenotypes. One must focus on weakly or non-supervised learning methods using both structured and unstructured data and relying on a comprehensive representation of the patients.
@Misc{Tannier2019, 
  title = {{Hybrid Approaches for our Participation to the n2c2 Challenge on Cohort Selection for Clinical Trials}},
  author = {Xavier Tannier and Nicolas Paris and Hugo Cisneros and Christel Daniel and Matthieu Doutreligne and Catherine Duclos and Nicolas Griffon and Claire Hassen-Khodja and Ivan Lerner and Adrien Parrot and Éric Sadou and Cyril Saussol and Pascal Vaillant},
  year = {2019}, 
  month = mar, 
  note = {arXiv}
}
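A condensed sketch of the weak-supervision recipe described in the Materials and Methods above: a handful of high-precision rules produce a silver standard that then trains an ordinary supervised classifier. The rules, texts and labels are invented for illustration:

# A few very precise rules label clearly positive/negative reports (the
# "silver standard"); the trained model then generalizes to reports the
# rules did not cover. Requires Python 3.8+ for the walrus operator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reports = [
    "patient drinks alcohol daily, heavy use reported",
    "patient denies any alcohol use",
    "no history of alcohol consumption",
    "chronic alcohol abuse noted in record",
    "social history unremarkable",
]

def silver_label(text: str):
    """High-precision rules; return None when no rule fires."""
    if "denies any alcohol" in text or "no history of alcohol" in text:
        return 0
    if "alcohol abuse" in text or "drinks alcohol daily" in text:
        return 1
    return None

silver = [(t, y) for t in reports if (y := silver_label(t)) is not None]
texts, ys = zip(*silver)

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), ys)
# Score a report none of the rules matched.
print(clf.predict(vec.transform(["social drinker, alcohol on weekends"])))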
Charlotte Rudnik, Thibault Ehrhart, Olivier Ferret, Denis Teyssou, Raphaël Troncy, Xavier Tannier.
Searching News Articles Using an Event Knowledge Graph Leveraged by Wikidata.
in Proceedings of the Wiki Workshop 2019 (The Web Conference). San Francisco, USA, May 2019.
[abstract] [BibTeX] [arXiv]
News agencies produce thousands of multimedia stories describing events happening in the world, either scheduled events such as sports competitions, political summits and elections, or breaking events such as military conflicts, terrorist attacks, natural disasters, etc. When writing up those stories, journalists refer to contextual background and compare with similar past events. However, searching for precise facts described in stories is hard. In this paper, we propose a general method that leverages the Wikidata knowledge base to produce semantic annotations of news articles. Next, we describe a semantic search engine that supports both keyword-based search in news articles and structured data search, providing filters for properties belonging to specific event schemas that are automatically inferred.
@InProceedings{Rudnik2019, 
  title = {{Searching News Articles Using an Event Knowledge Graph Leveraged by Wikidata}},
  author = {Rudnik, Charlotte and Ehrhart, Thibault and Ferret, Olivier and Teyssou, Denis and Troncy, Raphaël and Tannier, Xavier},
  booktitle = {Proceedings of the Wiki Workshop 2019 (The Web Conference)}, 
  address = {San Francisco, USA}, 
  year = {2019}, 
  month = may
}
Nicolas Paris, Matthieu Doutreligne, Adrien Parrot, Xavier Tannier.
Désidentification de comptes-rendus hospitaliers dans une base de données OMOP.
in Actes de TALMED 2019 : Symposium satellite francophone sur le traitement automatique des langues dans le domaine biomédical. Lyon, France, August 2019.
[abstract] [BibTeX] [paper]
In medicine, research on patient data aims to improve care. To preserve patient privacy, these data are usually de-identified. Textual documents contain a great deal of information found only in this material and are therefore of major interest for research; however, they also represent a technical challenge linked to the de-identification process. This work proposes a hybrid de-identification method evaluated on a sample of texts from the clinical data warehouse of the Assistance Publique des Hôpitaux de Paris. The two main contributions are de-identification performance above the state of the art for French, and the implementation of a freely available, standardized processing pipeline built on OMOP-CDM, a common representation model for medical data widely used worldwide.
@InProceedings{Paris2019, 
  title = {{Désidentification de comptes-rendus hospitaliers dans une base de données OMOP}},
  author = {Nicolas Paris and Matthieu Doutreligne and Adrien Parrot and Xavier Tannier},
  booktitle = {Actes de TALMED 2019 : Symposium satellite francophone sur le traitement automatique des langues dans le domaine biomédical}, 
  address = {Lyon, France}, 
  year = {2019}, 
  month = aug
}
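To give a flavour of the rule-based side of the hybrid de-identification pipeline above (the actual system also uses machine learning), here is a minimal, deliberately incomplete sketch with illustrative patterns:

# Replace a few categories of protected health information with tags.
# Patterns are toy examples, far from the coverage of a real system.
import re

PATTERNS = [
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "<DATE>"),
    (re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"), "<PHONE>"),
    (re.compile(r"\b(?:Dr|Pr|M\.|Mme)\s+[A-ZÉÈ][a-zéè]+\b"), "<NAME>"),
]

def deidentify(text: str) -> str:
    """Apply each pattern in turn, substituting its placeholder tag."""
    for pattern, tag in PATTERNS:
        text = pattern.sub(tag, text)
    return text

note = "Vu par Dr Martin le 12/03/2019, rappeler au 01 42 34 56 78."
print(deidentify(note))
# -> "Vu par <NAME> le <DATE>, rappeler au <PHONE>."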
Jacques Hilbey, Louise Deléger, Xavier Tannier.
Participation de l’équipe LAI à DEFT 2019.
in Défi Fouille de Texte (DEFT) 2019. Toulouse, France, July 2019.
[abstract] [BibTeX] [paper]
We present in this article the methods developed and the results obtained during our participation in Task 3 of the DEFT 2019 evaluation campaign. We used simple rule-based or machine-learning approaches; our results are very good on information that is simple to extract (age, gender), but remain mixed on the more difficult tasks.
@InProceedings{Hilbey2019, 
  title = {{Participation de l’équipe LAI à DEFT 2019}},
  author = {Jacques Hilbey and Louise Deléger and Xavier Tannier},
  booktitle = {Défi Fouille de Texte (DEFT) 2019}, 
  address = {Toulouse, France}, 
  year = {2019}, 
  month = jul
}
Julien Tourille, Matthieu Doutreligne, Olivier Ferret, Nicolas Paris, Aurélie Névéol, Xavier Tannier.
Evaluation of a Sequence Tagging Tool for Biomedical Texts.
in Proceedings of the EMNLP Workshop on Health Text Mining and Information Analysis (LOUHI 2018). Brussels, Belgium, October 2018.
[abstract] [BibTeX] [ACL Anthology]
Many applications in biomedical natural language processing rely on sequence tagging as an initial step to perform more complex analysis. To support text analysis in the biomedical domain, we introduce Yet Another SEquence Tagger (YASET), an open-source multi-purpose sequence tagger that implements state-of-the-art deep learning algorithms for sequence tagging. Herein, we evaluate YASET on part-of-speech tagging and named entity recognition in a variety of text genres including articles from the biomedical literature in English and clinical narratives in French. To further characterize performance, we report distributions over 30 runs and different sizes of training datasets. YASET provides state-of-the-art performance on the CoNLL 2003 NER dataset (F1=0.87), MEDPOST corpus (F1=0.97), MERLoT corpus (F1=0.99) and NCBI disease corpus (F1=0.81). We believe that YASET is a versatile and efficient tool that can be used for sequence tagging in biomedical and clinical texts.
@InProceedings{Tourille2018, 
  title = {{Evaluation of a Sequence Tagging Tool for Biomedical Texts}},
  author = {Julien Tourille and Matthieu Doutreligne and Olivier Ferret and Nicolas Paris and Aurélie Névéol and Xavier Tannier},
  booktitle = {Proceedings of the EMNLP Workshop on Health Text Mining and Information Analysis (LOUHI 2018)}, 
  address = {Brussels, Belgium}, 
  year = {2018}, 
  month = oct
}
Sylvie Cazalens, Philippe Lamarre, Julien Leblay, Ioana Manolescu, Xavier Tannier.
Computational fact-checking: a content management perspective.
Rio de Janeiro, Brazil, August 2018.
Tutorial presented at the conference VLDB.
[abstract] [BibTeX] [slides]
The tremendous value of Big Data has been noticed of late also by the media, and the term “data journalism” has been coined to refer to journalistic work inspired by digital data sources. A particularly popular and active area of data journalism is concerned with fact-checking. The term was born in the journalist community and referred to the process of verifying and ensuring the accuracy of published media content; since 2012, however, it has increasingly focused on the analysis of politics, economy, science, and news content shared in any form, but first and foremost on the Web (social and otherwise). These trends have been noticed by computer scientists working in industry and academia. Thus, a very lively area of digital content management research has taken up these problems and works to propose foundations (models) and algorithms, and to implement them through concrete tools. Our proposed tutorial:
  1. Outlines the current state of affairs in the area of digital (or computational) fact-checking in newsrooms, by journalists, NGO workers, scientists and IT companies;
  2. Shows which areas of digital content management research, in particular those relying on the Web, can be leveraged to help fact-checking, and gives a comprehensive survey of efforts in this area;
  3. Highlights ongoing trends, unsolved problems, and areas where we envision future scientific and practical advances.
@Misc{Cazalens2018b, 
  title = {{Computational fact-checking: a content management perspective}},
  author = {Sylvie Cazalens and Philippe Lamarre and Julien Leblay and Ioana Manolescu and Xavier Tannier},
  address = {Rio de Janeiro, Brazil}, 
  year = {2018}, 
  month = aug, 
  note = {Tutorial presented at the conference VLDB.}
}
Sylvie Cazalens, Philippe Lamarre, Julien Leblay, Ioana Manolescu, Xavier Tannier.
A Content Management Perspective on Fact-Checking.
in Proceedings of the Web Conference 2018. Lyon, France, April 2018.
[abstract] [BibTeX] [pdf] [html]
Fact checking has captured the attention of the media and the public alike; it has also recently received strong attention from the computer science community, in particular from data and knowledge management, natural language processing and information retrieval; we denote these together under the term "content management". In this paper, we identify the fact checking tasks which can be performed with the help of content management technologies, and survey the recent research works in this area, before laying out some perspectives for the future. We hope our work will provide interested researchers, journalists and fact checkers with an entry point in the existing literature as well as help develop a roadmap for future research and development work.
@InProceedings{Cazalens2018, 
  title = {{A Content Management Perspective on Fact-Checking}},
  author = {Sylvie Cazalens and Philippe Lamarre and Julien Leblay and Ioana Manolescu and Xavier Tannier},
  booktitle = {Proceedings of the Web Conference 2018}, 
  address = {Lyon, France}, 
  year = {2018}, 
  month = apr
}
Julien Leblay, Ioana Manolescu, Xavier Tannier.
Computational fact-checking: problems, state of the art, and perspectives.
Lyon, France, April 2018.
Tutorial presented at the Web Conference 2018.
[abstract] [BibTeX] [See our more complete VLDB slides]
The tremendous value of Big Data has been noticed of late also by the media, and the term "data journalism'' has been coined to refer to journalistic work inspired by digital data sources. A particularly popular and active area of data journalism is concerned with fact-checking. The term was born in the journalist community and referred to the process of verifying and ensuring the accuracy of published media content; since 2012, however, it has increasingly focused on the analysis of politics, economy, science, and news content shared in any form, but first and foremost on the Web (social and otherwise). These trends have been noticed by computer scientists working in industry and academia. Thus, a very lively area of digital content management research has taken up these problems and works to propose foundations (models) and algorithms, and to implement them through concrete tools. To cite just one example, Google has recognized the usefulness and importance of fact-checking efforts by indexing fact-checks and showing them next to links returned to the users. Our tutorial:
  1. Outlines the current state of affairs in the area of digital (or computational) fact-checking in newsrooms, by journalists, NGO workers, scientists and IT companies;
  2. Shows which areas of digital content management research, in particular those relying on the Web, can be leveraged to help fact-checking, and gives a comprehensive survey of efforts in this area;
  3. Highlights ongoing trends, unsolved problems, and areas where we envision future scientific and practical advances.
@Misc{Leblay2018, 
  title = {{Computational fact-checking: problems, state of the art, and perspectives}},
  author = {Julien Leblay and Ioana Manolescu and Xavier Tannier},
  address = {Lyon, France}, 
  year = {2018}, 
  month = apr, 
  note = {Tutorial presented at the Web Conference 2018.}
}
Tien Duc Cao, Ioana Manolescu, Xavier Tannier.
Searching for Truth in a Database of Statistics.
in Proceedings of the 21st International Workshop on the Web and Databases (WebDB 2018). Houston, USA, June 2018.
[abstract] [BibTeX]
The proliferation of falsehood and misinformation, in particular through the Web, has led to increasing energy being invested in journalistic fact-checking. Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistic databases such as those compiled by state agencies or by reputed international organizations are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the most relevant dataset for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. We present a novel algorithm enabling the exploitation of such statistic tables, by 1) identifying the statistic datasets most relevant for a given fact-checking query, and 2) extracting from each dataset the best specific (precise) query answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE, the French national statistic institute. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
@InProceedings{Cao2018, 
  title = {{Searching for Truth in a Database of Statistics}},
  author = {Cao, Tien Duc and Manolescu, Ioana and Tannier, Xavier},
  booktitle = {Proceedings of the 21st International Workshop on the Web and Databases (WebDB 2018)}, 
  address = {Houston, USA}, 
  year = {2018}, 
  month = jun
}
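Step (1) above, finding the datasets most relevant to a fact-checking query, can be approximated by plain TF-IDF retrieval; this sketch uses invented dataset titles and is only a stand-in for the paper's algorithm:

# Rank statistic datasets by cosine similarity between their descriptions
# and the query, as a crude proxy for dataset relevance ranking.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

datasets = [
    "unemployment rate by region and quarter",
    "population by age group and department",
    "consumer price index monthly evolution",
]
query = "unemployment rate in 2017"

vec = TfidfVectorizer(stop_words="english")
D = vec.fit_transform(datasets)
scores = cosine_similarity(vec.transform([query]), D).ravel()
ranking = np.argsort(-scores)
print([datasets[i] for i in ranking])  # most relevant dataset first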
Judith Jeyafreeda Andrew, Xavier Tannier.
Automatic Extraction of Entities and Relation from Legal Documents.
in Proceedings of the ACL Named Entities Workshop (NEWS 2018). Melbourne, Australia, pages 1-8, July 2018.
[abstract] [BibTeX] [ACL Anthology]
In recent years, journalists and computer scientists have been talking to each other to identify technologies that can help extract useful information; this is called "computational journalism". In this paper, we present a method that enables journalists to automatically identify and annotate entities such as names of people, organizations, and roles and functions of people in legal documents; the relationships between these entities are also explored. The system uses a combination of statistical and rule-based techniques: the statistical method used is Conditional Random Fields, and for the rule-based technique, document- and language-specific regular expressions are used.
@InProceedings{Andrew2018, 
  title = {{Automatic Extraction of Entities and Relation from Legal Documents}},
  author = {Andrew, Judith Jeyafreeda and Tannier, Xavier},
  booktitle = {Proceedings of the ACL Named Entities Workshop (NEWS 2018)}, 
  address = {Melbourne, Australia}, 
  year = {2018}, 
  month = jul, 
  pages = {1-8}
}
Tien Duc Cao, Ioana Manolescu, Xavier Tannier.
Extracting Linked Data from statistic spreadsheets.
in 34ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA 2018). Bucarest, Romania, October 2018.
[abstract] [BibTeX]
Fact-checking journalists typically check the accuracy of a claim against some trusted data source. Statistic databases such as those compiled by state agencies are often used as trusted data sources, as they contain valuable, high-quality information. However, their usability is limited when they are shared in a format such as HTML or spreadsheets: this makes it hard to find the most relevant dataset for checking a specific claim, or to quickly extract from a dataset the best answer to a given query. In this work, we provide a conceptual model for the open data comprised in statistics published by INSEE, the French national economic and societal statistics institute. Then, we describe a novel method for extracting RDF Linked Open Data to populate an instance of this model. We used our method to produce RDF data out of 20k+ Excel spreadsheets, and our validation indicates a 91% rate of successful extraction. Further, we also present a novel algorithm enabling the exploitation of such statistic tables, by (i) identifying the statistic datasets most relevant for a given fact-checking query, and (ii) extracting from each dataset the best specific (precise) query answer it may contain. We have implemented our approach and experimented on the complete corpus of statistics obtained from INSEE. Our experiments and comparisons demonstrate the effectiveness of our proposed method.
@InProceedings{Cao2018b, 
  title = {{Extracting Linked Data from statistic spreadsheets}},
  author = {Cao, Tien Duc and Manolescu, Ioana and Tannier, Xavier},
  booktitle = {34ème Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA 2018)}, 
  address = {Bucarest, Romania}, 
  year = {2018}, 
  month = oct
}
Julien Tourille, Olivier Ferret, Xavier Tannier, Aurélie Névéol.
Neural Architecture for Temporal Relation Extraction: A Bi-LSTM Approach for Detecting Narrative Containers.
in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017, short paper). Vancouver, Canada, August 2017.
[abstract] [BibTeX] [ACL Anthology]
We present a neural architecture for containment relation identification between medical events and/or temporal expressions. We experiment on a corpus of de-identified clinical notes in English from the Mayo Clinic, namely the THYME corpus. Our model achieves an F-measure of 0.591 and outperforms the best results reported on this corpus to date.
@InProceedings{Tourille2017b, 
  title = {{Neural Architecture for Temporal Relation Extraction: A Bi-LSTM Approach for Detecting Narrative Containers}},
  author = {Tourille, Julien and Ferret, Olivier and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017, short paper)}, 
  address = {Vancouver, Canada}, 
  year = {2017}, 
  month = aug
}
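A minimal PyTorch sketch in the spirit of the Bi-LSTM container-detection model above (hyperparameters, label set and input encoding are assumptions, not the paper's exact architecture):

# Encode the sentence with a Bi-LSTM, concatenate the hidden states at the
# two candidate positions (event and temporal expression), and classify the
# pair as CONTAINS / NO-RELATION.
import torch
import torch.nn as nn

class BiLSTMRelation(nn.Module):
    def __init__(self, vocab_size, emb=32, hidden=64, n_labels=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * hidden, n_labels)  # two positions, 2*hidden each

    def forward(self, tokens, pos_event, pos_time):
        h, _ = self.lstm(self.emb(tokens))           # (batch, seq, 2*hidden)
        batch = torch.arange(tokens.size(0))
        pair = torch.cat([h[batch, pos_event], h[batch, pos_time]], dim=-1)
        return self.out(pair)                        # relation logits

model = BiLSTMRelation(vocab_size=100)
tokens = torch.randint(0, 100, (1, 12))              # one toy sentence
logits = model(tokens, torch.tensor([3]), torch.tensor([7]))
print(logits.shape)                                  # torch.Size([1, 2])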
Jose Moreno, Romaric Besançon, Romain Beaumont, Eva D'Hondt, Anne-Laure Ligozat, Sophie Rosset, Xavier Tannier, Brigitte Grau.
Combining Word and Entity Embeddings for Entity Linking.
in Proceedings of the 14th Extended Semantic Web Conference (ESWC 2017). Portorož, Slovenia, May 2017.
[abstract] [BibTeX] [SpringerLink]
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. The general approach for this task is to generate, for a given mention, a set of candidate entities from the base and, in a second step, determine which is the best one. This paper proposes a novel method for the second step which is based on the joint learning of embeddings for the words in the text and the entities in the knowledge base. By learning these embeddings in the same space we arrive at a more conceptually grounded model that can be used for candidate selection based on the surrounding context. The relative improvement of this approach is experimentally validated on a benchmark corpus from the TAC-EDL 2015 evaluation campaign.
@InProceedings{Moreno2017, 
  title = {{Combining Word and Entity Embeddings for Entity Linking}},
  author = {Jose Moreno and Romaric Besançon and Romain Beaumont and Eva D'Hondt and Anne-Laure Ligozat and Sophie Rosset and Xavier Tannier and Brigitte Grau},
  booktitle = {Proceedings of the 14th Extended Semantic Web Conference (ESWC 2017)}, 
  address = {Portorož, Slovenia}, 
  year = {2017}, 
  month = may
}
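The candidate-ranking step described above can be pictured as cosine scoring between a context vector and entity vectors living in the same space; the random vectors below merely stand in for the jointly learned word and entity embeddings:

# Score each candidate entity by cosine similarity between its embedding
# and the averaged embedding of the words around the mention.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
word_emb = {w: rng.normal(size=dim) for w in
            "the paris team won the league".split()}
entity_emb = {"Paris_(city)": rng.normal(size=dim),
              "Paris_Saint-Germain": rng.normal(size=dim)}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

context = ["team", "won", "league"]               # words around the mention
ctx_vec = np.mean([word_emb[w] for w in context], axis=0)

scores = {e: cosine(v, ctx_vec) for e, v in entity_emb.items()}
print(max(scores, key=scores.get))                # best-scoring candidate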
Julien Tourille, Olivier Ferret, Xavier Tannier, Aurélie Névéol.
Temporal information extraction from clinical text.
in Proceedings of the European Chapter of the ACL (EACL 2017, short paper). Valencia, Spain, April 2017.
[abstract] [BibTeX] [ACL Anthology] [poster]
In this paper, we present a method for temporal relation extraction from clinical narratives in French and in English. We experiment on two comparable corpora, the MERLOT corpus for French and the THYME corpus for English, and show that a common approach can be used for both languages.
@InProceedings{Tourille2017, 
  title = {{Temporal information extraction from clinical text}},
  author = {Tourille, Julien and Ferret, Olivier and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the European Chapter of the ACL (EACL 2017, short paper)}, 
  address = {Valencia, Spain}, 
  year = {2017}, 
  month = apr
}
Swen Ribeiro, Olivier Ferret, Xavier Tannier.
Unsupervised Event Clustering and Aggregation from Newswire and Web Articles.
in Proceedings of the 2nd workshop "Natural Language meets Journalism" (EMNLP 2017). Copenhagen, Denmark, September 2017.
[abstract] [BibTeX] [ACL Anthology]
In this paper we present an unsupervised pipeline approach for clustering news articles based on identified event instances in their content. We leverage press agency newswire and monolingual word alignment techniques to build meaningful and linguistically varied clusters of articles from the Web in the perspective of a broader event type detection task. We validate our approach on a manually annotated corpus of Web articles.
@InProceedings{Ribeiro2017, 
  title = {{Unsupervised Event Clustering and Aggregation from Newswire and Web Articles}},
  author = {Ribeiro, Swen and Ferret, Olivier and Tannier, Xavier},
  booktitle = {Proceedings of the 2nd workshop "Natural Language meets Journalism" (EMNLP 2017)}, 
  address = {Copenhagen, Denmark}, 
  year = {2017}, 
  month = sep
}
Julien Tourille, Olivier Ferret, Xavier Tannier, Aurélie Névéol.
LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives.
in Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017). Vancouver, Canada, August 2017.
Selected for the "Best of SemEval 2017"
[abstract] [BibTeX] [ACL anthology]
In this paper we present our participation in SemEval 2017 Task 12. We used a neural network based approach for entity and temporal relation extraction, and experimented with two domain adaptation strategies. We achieved competitive performance for both tasks.
@InProceedings{Tourille2017c, 
  title = {{LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives}},
  author = {Tourille, Julien and Ferret, Olivier and Tannier, Xavier and Névéol, Aurélie},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017)}, 
  address = {Vancouver, Canada}, 
  year = {2017}, 
  month = aug
}
Tien Duc Cao, Ioana Manolescu, Xavier Tannier.
Extracting Linked Data from statistic spreadsheets.
in Proceedings of the SIGMOD workshop on Semantic Big Data (SBD 2017). Chicago, USA, May 2017.
[abstract] [BibTeX] [ACMLink] [paper]
Statistic data is an important sub-category of open data; it is interesting for many applications, including but not limited to data journalism, as such data is typically of high quality, and reflects (under an aggregated form) important aspects of a society’s life such as births, immigration, economic output, etc. However, such open data is often not published as Linked Open Data (LOD), limiting its usability. We provide a conceptual model for the open data comprised in statistic files published by INSEE, the leading French economic and societal statistics institute. Then, we describe a novel method for extracting RDF LOD populating an instance of this conceptual model. Our method was used to produce RDF data out of 20k+ Excel spreadsheets, and our validation indicates a 91% rate of successful extraction.
@InProceedings{Cao2017, 
  title = {{Extracting Linked Data from statistic spreadsheets}},
  author = {Cao, Tien Duc and Manolescu, Ioana and Tannier, Xavier},
  booktitle = {Proceedings of the SIGMOD workshop on Semantic Big Data (SBD 2017)}, 
  address = {Chicago, USA}, 
  year = {2017}, 
  month = may
}
José Moreno, Romaric Besançon, Romain Beaumont, Eva D'Hondt, Anne-Laure Ligozat, Sophie Rosset, Xavier Tannier, Brigitte Grau.
Apprendre des représentations jointes de mots et d'entités pour la désambiguïsation d'entités.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2017). Orléans, France, June 2017.
[abstract] [BibTeX]
The correct identification of the link between an entity mention in a text and a known entity in a large knowledge base is important in information retrieval or information extraction. However, systems have to deal with ambiguity, as numerous entities could be linked to a mention. This paper proposes a novel method for entity disambiguation which is based on the joint learning of embeddings for the words in the text and the entities in the knowledge base. By learning these embeddings in the same space, we arrive at a more conceptually grounded model that can be used for candidate selection based on the surrounding context.
@InProceedings{Moreno2017b, 
  title = {{Apprendre des représentations jointes de mots et d'entités pour la désambiguïsation d'entités}},
  author = {José Moreno and Romaric Besançon and Romain Beaumont and Eva D'Hondt and Anne-Laure Ligozat and Sophie Rosset and Xavier Tannier and Brigitte Grau},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2017)}, 
  address = {Orléans, France}, 
  year = {2017}, 
  month = jun
}
Xavier Tannier.
NLP-driven Data Journalism: Time-Aware Mining and Visualization of International Alliances.
in Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016). New York, USA, July 2016.
[abstract] [BibTeX] [poster] [paper]
We take inspiration from computational and data journalism, and propose to combine techniques from information extraction, information aggregation and visualization to build a tool identifying the evolution of alliance and opposition relations between countries on specific topics. These relations are aggregated into numerical data that are visualized by time-series plots or dynamic graphs.
@InProceedings{Tannier2016a, 
  title = {{NLP-driven Data Journalism: Time-Aware Mining and Visualization of International Alliances}},
  author = {Xavier Tannier},
  booktitle = {Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016)}, 
  address = {New York, USA}, 
  year = {2016}, 
  month = jul
}
Xavier Tannier, Frédéric Vernier.
Creation, Visualization and Edition of Timelines for Journalistic Use.
in Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016). New York, USA, July 2016.
[abstract] [BibTeX] [paper] [slides]
We describe in this article a system for building and visualizing thematic timelines automatically. The input of the system is a set of keywords, together with temporal user-specified boundaries. The output is a timeline graph showing at the same time the chronology and the importance of the events concerning the query. This requires natural language processing and information retrieval techniques, allied to a very specific temporal smoothing and visualization approach. The result can be edited so that the journalist always has the final say on what is finally displayed to the reader.
@InProceedings{Tannier2016b, 
  title = {{Creation, Visualization and Edition of Timelines for Journalistic Use}},
  author = {Xavier Tannier and Frédéric Vernier},
  booktitle = {Proceedings of "Natural Language meets Journalism", workshop of the International Joint Conference on Artificial Intelligence (IJCAI 2016)}, 
  address = {New York, USA}, 
  year = {2016}, 
  month = jul
}
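The temporal smoothing mentioned above can be illustrated with a simple moving average over daily article counts; the paper's smoothing is more specific, so treat this as a schematic stand-in with invented counts:

# Smooth a timeline of daily article counts with a 3-day moving average
# before plotting, so that isolated spikes become readable event bumps.
import numpy as np

daily_counts = np.array([0, 1, 0, 8, 12, 3, 1, 0, 0, 5, 2, 0])
window = 3
kernel = np.ones(window) / window
smoothed = np.convolve(daily_counts, kernel, mode="same")
for day, (raw, s) in enumerate(zip(daily_counts, smoothed)):
    print(f"day {day:2d}: raw={raw:2d} smoothed={s:.1f}")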
Sergio Torres Aguilar, Xavier Tannier, Pierre Chastang.
Named entity recognition applied on a data base of Medieval Latin charters. The case of chartae burgundiae.
in Proceedings of the 3rd International Workshop on Computational History (HistoInformatics 2016). Krakow, Poland, July 2016.
[BibTeX] [paper]
@InProceedings{Torres2016, 
  title = {{Named entity recognition applied on a data base of Medieval Latin charters. The case of chartae burgundiae}},
  author = {Torres Aguilar, Sergio and Tannier, Xavier and Chastang, Pierre},
  booktitle = {Proceedings of the 3rd International Workshop on Computational History (HistoInformatics 2016)}, 
  address = {Krakow, Poland}, 
  year = {2016}, 
  month = jul
}
Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad Al-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clecq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, Gülşen Eryiğit.
SemEval-2016 Task 5: Aspect Based Sentiment Analysis.
in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016). San Diego, USA, June 2016.
[abstract] [BibTeX] [Annotation Guidelines] [Paper]
This paper describes the SemEval 2016 shared task on Aspect Based Sentiment Analysis (ABSA), a continuation of the respective tasks of 2014 and 2015. In its third year, the task provided 19 training and 20 testing datasets for 8 languages and 7 domains, as well as a common evaluation procedure. From these datasets, 25 were for sentence-level and 14 for text-level ABSA; the latter was introduced for the first time as a subtask in SemEval. The task attracted 245 submissions from 29 teams.
@InProceedings{Pontiki2016, 
  title = {{SemEval-2016 Task 5: Aspect Based Sentiment Analysis}},
  author = {Pontiki, Maria and Galanis, Dimitrios and Papageorgiou, Haris and Androutsopoulos, Ion and Manandhar, Suresh and Al-Smadi, Mohammad and Al-Ayyoub, Mahmoud and Zhao, Yanyan and Qin, Bing and De Clecq, Orphée and Hoste, Véronique and Apidianaki, Marianna and Tannier, Xavier and Loukachevitch, Natalia and Kotelnikov, Evgeny and Bel, Nuria and Jiménez-Zafra, Salud María and Eryiğit, Gülşen},
  booktitle = {Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016)}, 
  address = {San Diego, USA}, 
  year = {2016}, 
  month = jun
}
Julien Tourille, Olivier Ferret, Aurélie Névéol, Xavier Tannier.
LIMSI-COT at SemEval-2016 Task 12: Temporal relation identification using a pipeline of classifiers.
in Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016). San Diego, USA, June 2016.
Selected for the "Best of SemEval 2016"
[abstract] [BibTeX] ["Best of SemEval" slides] [paper] [poster]
SemEval 2016 Task 12 addresses temporal reasoning in the clinical domain. In this paper, we present our participation, focused on relation extraction based on gold standard entities (subtasks DR and CR). We used a supervised approach comparing plain lexical features to word embeddings for temporal relation identification, and obtained above-median scores.
@InProceedings{Tourille2016b, 
  title = {{LIMSI-COT at SemEval-2016 Task 12: Temporal relation identification using a pipeline of classifiers}},
  author = {Tourille, Julien and Ferret, Olivier and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016)}, 
  address = {San Diego, USA}, 
  year = {2016}, 
  month = jun
}
Julien Tourille, Olivier Ferret, Aurélie Névéol, Xavier Tannier.
Extraction de relations temporelles dans des dossiers électroniques patient.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2016, article court). Paris, France, July 2016.
[abstract] [BibTeX] [poster] [free copy]
Temporal analysis of clinical documents yields complex representations of the information contained in Electronic Health Records. This type of analysis relies on the extraction of medical events, temporal expressions and the relations between them. In this work, we assume that relevant events and temporal expressions are available, and we focus on the extraction of relations between two events or between an event and a temporal expression. We present supervised classification models and apply them to clinical documents written in French and in English. The performance we achieve is high and similar in both languages. We believe these results suggest that temporal analysis may be approached generically across clinical domains and languages.
@InProceedings{Tourille2016a, 
  title = {{Extraction de relations temporelles dans des dossiers électroniques patient}},
  author = {Tourille, Julien and Ferret, Olivier and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2016, article court)}, 
  address = {Paris, France}, 
  year = {2016}, 
  month = jul
}
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon.
A Dataset for Open Event Extraction in English.
in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia, May 2016.
[abstract] [BibTeX] [free copy] [poster]
This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the task of learning templates with no supervision from unlabeled texts, and of grouping together entities corresponding to the same role in a template. Most of the previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially-annotated corpus in English which remedies some of these shortcomings. We make use of Wikinews to select the data inside the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate some existing systems on this new data.
@InProceedings{Nguyen2016, 
  title = {{A Dataset for Open Event Extraction in English}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Olivier Ferret and Romaric Besançon},
  booktitle = {Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)}, 
  address = {Portorož, Slovenia}, 
  year = {2016}, 
  month = may
}
Marianna Apidianaki, Xavier Tannier, Cécile Richart.
Datasets for Aspect-Based Sentiment Analysis in French.
in Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia, May 2016.
[abstract] [BibTeX] [poster] [free copy]
Aspect Based Sentiment Analysis (ABSA) is the task of mining and summarizing opinions from text about specific entities and their aspects. This article describes two datasets for the development and testing of ABSA systems for French which comprise user reviews annotated with relevant entities, aspects and polarity values. The first dataset contains 457 restaurant reviews (2365 sentences) for training and testing ABSA systems, while the second contains 162 museum reviews (655 sentences) dedicated to out-of-domain evaluation. Both datasets were built as part of SemEval-2016 Task 5 "Aspect-Based Sentiment Analysis" where seven different languages were represented, and are publicly available for research purposes. This article provides examples and statistics by annotation type, summarizes the annotation guidelines and discusses their cross-lingual applicability. It also explains how the data was used for evaluation in the SemEval ABSA task and briefly presents the results obtained for French.
@InProceedings{Apidianaki2016, 
  title = {{Datasets for Aspect-Based Sentiment Analysis in French}},
  author = {Marianna Apidianaki and Xavier Tannier and Cécile Richart},
  booktitle = {Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)}, 
  address = {Portorož, Slovenia}, 
  year = {2016}, 
  month = may
}
Aurélie Névéol, K. Bretonnel Cohen, Cyril Grouin, Thierry Hamon, Thomas Lavergne, Liadh Kelly, Lorraine Goeuriot, Grégoire Rey, Aude Robert, Xavier Tannier, Pierre Zweigenbaum.
Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016.
in CLEF 2016 (online working notes). Evora, Portugal, September 2016.
[abstract] [BibTeX] [free copy]
This paper reports on Task 2 of the 2016 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of the ShARe/CLEF eHealth evaluation labs. The task continued with named entity recognition and normalization in French narratives, as offered in CLEF eHealth 2015. Named entity recognition involved ten types of entities including disorders that were defined according to Semantic Groups in the Unified Medical Language System (UMLS), which was also used for normalizing the entities. In addition, we introduced a large-scale classification task in French death certificates, which consisted of extracting causes of death as coded in the International Classification of Diseases, tenth revision (ICD10). Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE, 4 drug monographs published by the European Medicines Agency (EMEA) and 27,850 death certificates, using Precision, Recall and F-measure. In total, seven teams participated, including five in the entity recognition and normalization task, and five in the death certificate coding task. Three teams submitted their systems to our newly offered reproducibility track. For entity recognition, the highest performance was achieved on the EMEA corpus, with an overall F-measure of 0.702 for plain entity recognition and 0.529 for normalized entity recognition. For entity normalization, the highest performance was achieved on the MEDLINE corpus, with an overall F-measure of 0.552. For death certificate coding, the highest performance was 0.848 F-measure.
@InProceedings{Neveol2016, 
  title = {{Clinical Information Extraction at the CLEF eHealth Evaluation lab 2016}},
  author = {Névéol, Aurélie and Cohen, K. Bretonnel and Grouin, Cyril and Hamon, Thierry and Lavergne, Thomas and Kelly, Liadh and Goeuriot, Lorraine and Rey, Grégoire and Robert, Aude and Tannier, Xavier and Zweigenbaum, Pierre},
  booktitle = {CLEF 2016 (online working notes)}, 
  address = {Evora, Portugal}, 
  year = {2016}, 
  month = sep
}
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon.
Generative Event Schema Induction with Entity Disambiguation.
in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015). Beijing, China, July 2015.
[abstract] [BibTeX] [slides] [ACL anthology] [video]
This paper presents a generative model for event schema induction. Previous methods in the literature only use head words to represent entities. However, elements other than head words contain useful information. For instance, an armed man is more discriminative than man. Our model takes this information into account and precisely represents it using probabilistic topic distributions. We illustrate that such information plays an important role in parameter estimation. Most notably, it makes topic distributions more coherent and more discriminative. Experimental results on a benchmark dataset empirically confirm this enhancement.
@InProceedings{Nguyen2015, 
  title = {{Generative Event Schema Induction with Entity Disambiguation}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Olivier Ferret and Romaric Besançon},
  booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015)}, 
  address = {Beijing, China}, 
  year = {2015}, 
  month = jul
}
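The model above is a custom generative model; as a loose stand-in, the following sketch runs off-the-shelf LDA (gensim) over full entity mentions to show how modifiers such as "armed" can sharpen the induced distributions compared to bare head words. Data and parameters are toy assumptions, not the paper's.

# Stand-in illustration only: off-the-shelf LDA over entity mentions,
# where each pseudo-document is one mention (head word + modifiers).
from gensim import corpora, models

mentions = [["armed", "man"], ["masked", "gunman"], ["armed", "attacker"],
            ["police", "officer"], ["police", "patrol"], ["officer"]]

dictionary = corpora.Dictionary(mentions)
corpus = [dictionary.doc2bow(m) for m in mentions]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=20)
for topic_id in range(2):
    # Slot-like topics should separate perpetrator terms from police terms.
    print(lda.print_topic(topic_id))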
Mike Donald Tapi Nzali, Aurélie Névéol, Xavier Tannier.
Automatic Extraction of Time Expressions Across Domains in French Narratives.
in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015, short paper). Lisbon, Portugal, September 2015.
[abstract] [BibTeX] [ACL Anthology]
The prevalence of temporal references across all types of natural language utterances makes temporal analysis a key issue in Natural Language Processing. This work addresses three research questions: 1/ is temporal expression recognition specific to a particular domain? 2/ if so, can we characterize domain specificity? and 3/ how can subdomain specificity be integrated in a single tool for unified temporal expression extraction? Herein, we assess temporal expression recognition from documents written in French covering three domains. We present a new corpus of clinical narratives annotated for temporal expressions, and also use existing corpora in the newswire and historical domains. We show that temporal expressions can be extracted with high performance across domains (best F-measure 0.96, obtained with a CRF model on clinical narratives). We argue that domain adaptation for the extraction of temporal expressions can be done with limited effort and should cover pre-processing as well as temporal-specific tasks.
@InProceedings{TapiNzali2015b, 
  title = {{Automatic Extraction of Time Expressions Across Domains in French Narratives}},
  author = {Tapi Nzali, Mike Donald and Névéol, Aurélie and Tannier, Xavier},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2015, short paper)}, 
  address = {Lisbon, Portugal}, 
  year = {2015}, 
  month = sep
}
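The best system above is a CRF tagger. As a hedged sketch of that general recipe (BIO tagging of temporal expressions, with illustrative features that are not the authors'), using the sklearn-crfsuite library:

# Hedged sketch of CRF-based temporal-expression tagging with a BIO scheme.
# Features and the toy sentence are illustrative assumptions.
import sklearn_crfsuite

def token_features(sent, i):
    w = sent[i]
    return {
        "lower": w.lower(),
        "is_digit": w.isdigit(),
        "is_title": w.istitle(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy French sentence: "Patient hospitalisé le 12 mars 2014 ."
sents = [["Patient", "hospitalisé", "le", "12", "mars", "2014", "."]]
labels = [["O", "O", "O", "B-TIMEX3", "I-TIMEX3", "I-TIMEX3", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))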
Béatrice Arnulphy, Vincent Claveau, Xavier Tannier, Anne Vilnat.
Supervised Machine Learning Techniques to Detect TimeML Events in French and English.
in Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems (NLDB 2015). Passau, Germany, June 2015.
[abstract] [BibTeX] [SpringerLink] [paper]
Identifying events from texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in recent years; yet no reference results are available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining for instance Conditional Random Fields, language modeling and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach.
@InProceedings{Arnulphy2015, 
  title = {{Supervised Machine Learning Techniques to Detect TimeML Events in French and English}},
  author = {Béatrice Arnulphy and Vincent Claveau and Xavier Tannier and Anne Vilnat},
  booktitle = {Proceedings of the 20th International Conference on Applications of Natural Language to Information Systems (NLDB 2015)}, 
  address = {Passau, Germany}, 
  year = {2015}, 
  month = jun
}
Aurélie Névéol, Cyril Grouin, Xavier Tannier, Thierry Hamon, Liadh Kelly, Lorraine Goeuriot, Pierre Zweigenbaum.
CLEF eHealth Evaluation Lab 2015 Task 1b: clinical named entity recognition.
in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2015, CEUR-WS 1391). Toulouse, France, September 2015.
[abstract] [BibTeX] [CEUR-WS copy]
This paper reports on Task 1b of the 2015 CLEF eHealth evaluation lab, which extended the previous information extraction tasks of the ShARe/CLEF eHealth evaluation labs by considering ten types of entities, including disorders, that were to be extracted from biomedical text in French. The task consisted of two phases: entity recognition (phase 1), in which participants could supply plain or normalized entities, and entity normalization (phase 2). The entities to be extracted were defined according to Semantic Groups in the Unified Medical Language System (UMLS), which was also used for normalizing the entities. Participant systems were evaluated against a blind reference standard of 832 titles of scientific articles indexed in MEDLINE and 3 full-text drug monographs published by the European Medicines Agency (EMEA), using Precision, Recall and F-measure. In total, seven teams participated in phase 1, and three teams in phase 2. The highest performance was obtained on the EMEA corpus, with an overall F-measure of 0.756 for plain entity recognition, 0.711 for normalized entity recognition and 0.872 for entity normalization.
@InProceedings{Neveol2015, 
  title = {{CLEF eHealth Evaluation Lab 2015 Task 1b: clinical named entity recognition}},
  author = {Aurélie Névéol and Cyril Grouin and Xavier Tannier and Thierry Hamon and Liadh Kelly and Lorraine Goeuriot and Pierre Zweigenbaum},
  booktitle = {Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2015, CEUR-WS 1391)}, 
  address = {Toulouse, France}, 
  year = {2015}, 
  month = sep
}
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret, Romaric Besançon.
Désambiguïsation d'entités pour l'induction non supervisée de schémas événementiels.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015). Caen, France, June 2015.
[abstract] [BibTeX] [free copy]
In this paper, we present an approach to event induction based on a generative model. This model makes it possible to consider more relational information than previous models, and has been applied to noun attributes. Through their influence on parameter estimation, these new features make the probabilistic topic distributions more discriminative and more robust. We evaluated different versions of our model on the MUC-4 dataset.

Cet article présente un modèle génératif pour l'induction non supervisée d'événements. Les précédentes méthodes de la littérature utilisent uniquement les têtes des syntagmes pour représenter les entités. Pourtant, le groupe complet (par exemple, "un homme armé") apporte une information plus discriminante (que "homme"). Notre modèle tient compte de cette information et la représente dans la distribution des schémas d'événements. Nous montrons que ces relations jouent un rôle important dans l'estimation des paramètres, et qu'elles conduisent à des distributions plus cohérentes et plus discriminantes. Les résultats expérimentaux sur le corpus de MUC-4 confirment ces progrès.
@InProceedings{Nguyen2015a, 
  title = {{Désambiguïsation d'entités pour l'induction non supervisée de schémas événementiels}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Olivier Ferret and Romaric Besançon},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015)}, 
  address = {Caen, France}, 
  year = {2015}, 
  month = jun
}
Mike Donald Tapi Nzali, Aurélie Névéol, Xavier Tannier.
Analyse d'expressions temporelles dans les dossiers électroniques patients.
in Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015). Caen, France, June 2015.
[abstract] [BibTeX] [Slides in pdf] [free copy]
References to phenomena occurring in the world and their temporal characterization can be found in a variety of natural language utterances. For this reason, temporal analysis is a key issue in natural language processing. This article presents a temporal analysis of specialized documents. We use a corpus of documents contained in several de-identified Electronic Health Records to develop an annotated resource of temporal expressions relying on the TimeML standard. We then use this corpus to evaluate several methods for the automatic extraction of temporal expressions. Our best statistical model yields 0.91 F-measure, a significant improvement in extraction over the state-of-the-art system HeidelTime. We also compare our medical corpus to FR-Timebank in order to characterize the uses of temporal expressions in two different subdomains.

Les références à des phénomènes du monde réel et à leur caractérisation temporelle se retrouvent dans beaucoup de types de discours en langue naturelle. Ainsi, l’analyse temporelle apparaît comme un élément important en traitement automatique de la langue. Cet article présente une analyse de textes en domaine de spécialité du point de vue temporel. En s'appuyant sur un corpus de documents issus de plusieurs dossiers électroniques patient désidentifiés, nous décrivons la construction d'une ressource annotée en expressions temporelles selon la norme TimeML. Par suite, nous utilisons cette ressource pour évaluer plusieurs méthodes d'extraction automatique d'expressions temporelles adaptées au domaine médical. Notre meilleur système statistique offre une performance de 0,91 de F-mesure, surpassant pour l'identification le système état de l'art HeidelTime. La comparaison de notre corpus de travail avec le corpus journalistique FR-Timebank permet également de caractériser les différences d'utilisation des expressions temporelles dans deux domaines de spécialité.
@InProceedings{TapiNzali2015, 
  title = {{Analyse d'expressions temporelles dans les dossiers électroniques patients}},
  author = {Mike Donald Tapi Nzali and Aurélie Névéol and Xavier Tannier},
  booktitle = {Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2015)}, 
  address = {Caen, France}, 
  year = {2015}, 
  month = jun
}
Kiem-Hieu Nguyen, Xavier Tannier, Véronique Moriceau.
Ranking Multidocument Event Descriptions for Building Thematic Timelines.
in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics. Dublin, Ireland, August 2014.
[abstract] [BibTeX] [ACL Anthology]
This paper tackles the problem of timeline generation from traditional news sources. Our system builds thematic timelines for a general-domain topic defined by a user query. The system selects and ranks events relevant to the input query. Each event is represented by a one-sentence description in the output timeline. We present an inter-cluster ranking algorithm that takes events from multiple clusters as input and selects the most salient and relevant events. A cluster, in our work, contains all the events happening on a specific date. Our algorithm uses the temporal information derived from a large collection of extensively temporally analyzed texts. Such temporal information is combined with textual contents in an event scoring model in order to rank events based on their salience and query relevance.
@InProceedings{Nguyen2014a, 
  title = {{Ranking Multidocument Event Descriptions for Building Thematic Timelines}},
  author = {Kiem-Hieu Nguyen and Xavier Tannier and Véronique Moriceau},
  booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics}, 
  address = {Dublin, Ireland}, 
  year = {2014}, 
  month = aug
}
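The abstract above describes an event scoring model combining salience and query relevance. A minimal sketch of that general idea, with an assumed linear combination and toy counts standing in for the paper's temporal statistics:

# Illustrative event scoring: combine query relevance with a date-based
# salience signal. The weighting scheme and features are assumptions,
# not the authors' exact model.
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def score(event, query_vec, date_mentions, alpha=0.5):
    """alpha trades off query relevance against temporal salience."""
    relevance = cosine(Counter(event["tokens"]), query_vec)
    salience = math.log1p(date_mentions[event["date"]])
    return alpha * relevance + (1 - alpha) * salience

# Toy data: how often each date is mentioned across the collection.
date_mentions = Counter({"2011-03-11": 120, "2011-03-15": 40})
query = Counter("earthquake japan tsunami".split())
e = {"tokens": "earthquake hits japan coast".split(), "date": "2011-03-11"}
print(round(score(e, query, date_mentions), 3))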
Xavier Tannier.
Traitement des événements et ciblage d'information.
June 2014. Habilitation à Diriger des Recherches (HDR)
[abstract] [BibTeX] [Thesis] [Slides in pdf] [Slides in pptx]
Dans ce mémoire, nous organisons nos travaux principaux autour de quatre axes de traitement des informations textuelles : le ciblage, l'agrégation, la hiérarchisation et la contextualisation d'information. La majeure partie du document est dédiée à l'analyse des événements. Nous introduisons d'abord la notion d'événement à travers les diverses spécialités du traitement automatique des langues qui s'en sont préoccupées. Nous proposons ainsi un survol des différents modes de représentation des événements, tout en instaurant un fil rouge pour l'ensemble de la première partie. Nous distinguons ensuite deux grandes classes de travaux autour des événements, deux grandes visions que nous avons nommées, pour la première, l'"événement dans le texte", et pour la seconde, l'"événement dans le monde". Dans la première, nous considérons l'événement comme la désignation linguistique de quelque chose qui se passe, et nous tentons d'une part d'identifier ces désignations dans les textes, et d'autre part d'induire les relations temporelles existant entre ces événements, que ce soit dans des textes journalistiques ou médicaux. Nous réfléchissons enfin à une métrique d'évaluation adaptée à ce type d'informations. Pour ce qui est de l'"événement dans le monde", nous envisageons plus l'événement tel qu'il est perçu par le citoyen, et nous proposons plusieurs approches originales pour aider celui-ci à mieux appréhender la quantité écrasante d'événements dont il prend connaissance chaque jour : les chronologies thématiques, les fils temporels, et une approche automatisée du journalisme de données. La deuxième partie revient sur des travaux en lien avec le ciblage d'information. Nous décrivons tout d'abord nos travaux sur les systèmes de questions-réponses, dans lesquels nous avons eu recours à l'analyse syntaxique pour aider à justifier les réponses trouvées à une question en langage naturel. Enfin, nous abordons le sujet de la collecte thématique de documents sur le Web, dans le but de créer automatiquement des corpus et des lexiques spécialisés. Nous concluons en revenant sur les perspectives associées aux travaux présentés sur les événements, avec pour but d'abolir partiellement la frontière qui sépare les différents axes présentés.
@Misc{Tannier2014b, 
  title = {{Traitement des événements et ciblage d'information}},
  author = {Xavier Tannier},
  year = {2014}, 
  month = jun, 
  school = {Université Paris-Sud, École Doctorale d'Informatique}, 
  howpublished = {Habilitation à Diriger des Recherches (HDR)}
}
Clément De Groc, Xavier Tannier, Claude De Loupy.
Thematic Cohesion: Measuring Terms Discriminatory Power Toward Themes.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Slides]
We present a new measure of thematic cohesion. This measure associates each term with a weight representing its discriminatory power toward a theme, this theme being itself expressed by a list of terms (a thematic lexicon). This thematic cohesion criterion can be used in many applications, such as query expansion, computer-assisted translation, or iterative construction of domain-specific lexicons and corpora. The measure is computed in two steps. First, a set of documents related to the terms is gathered from the Web by querying a Web search engine. Then, we produce an oriented co-occurrence graph, where vertices are the terms and edges represent the fact that two terms co-occur in a document. This graph can be interpreted as a recommendation graph, where two terms occurring in the same document are considered to recommend each other. This leads to using a random walk algorithm that assigns a global importance value to each vertex of the graph. After observing the impact of various parameters on those importance values, we evaluate their correlation with retrieval effectiveness.
@InProceedings{DeGroc2014a, 
  title = {{Thematic Cohesion: Measuring Terms Discriminatory Power Toward Themes}},
  author = {Clément De Groc and Xavier Tannier and Claude De Loupy},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
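The second step of the measure above is a random walk over a term co-occurrence graph. A minimal sketch, assuming PageRank as the random-walk algorithm and toy documents standing in for the Web-retrieved ones:

# Random walk over a term co-occurrence graph, here via PageRank in
# networkx. Step one (Web retrieval) is stubbed out with toy documents.
import networkx as nx

docs = [["ski", "neige", "montagne"],
        ["ski", "piste", "neige"],
        ["montagne", "randonnée"]]
terms = {"ski", "neige", "montagne", "piste"}  # the thematic lexicon

G = nx.DiGraph()
for doc in docs:
    present = [t for t in doc if t in terms]
    for t1 in present:
        for t2 in present:
            if t1 != t2:
                # co-occurrence in a document as mutual recommendation
                w = G.get_edge_data(t1, t2, {"weight": 0})["weight"]
                G.add_edge(t1, t2, weight=w + 1)

# PageRank scores ~ discriminatory power of each term toward the theme
print(nx.pagerank(G, weight="weight"))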
Clément De Groc, Xavier Tannier.
Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Slides]
This article introduces a novel protocol and resource to evaluate Web-as-corpus topical document retrieval. Contrary to previous work, our goal is to provide an automatic, reproducible and robust evaluation for this task. We rely on the OpenDirectory (DMOZ) as a source of topically annotated webpages and index them in a search engine. With this OpenDirectory search engine, we can then easily evaluate the impact of various parameters, such as the number of seed terms, queries or documents, or the usefulness of various term selection algorithms. A first fully automatic evaluation is described and provides baseline performances for this task. The article concludes with practical information regarding the availability of the index and resource files.
@InProceedings{DeGroc2014b, 
  title = {{Evaluating Web-as-corpus Topical Document Retrieval with an Index of the OpenDirectory}},
  author = {Clément De Groc and Xavier Tannier},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
Véronique Moriceau, Xavier Tannier.
French Resources for Extraction and Normalization of Temporal Expressions with HeidelTime.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Poster]
In this paper, we describe the development of French resources for the extraction and normalization of temporal expressions with HeidelTime, an open-source, multilingual, cross-domain temporal tagger. HeidelTime extracts temporal expressions from documents and normalizes them according to the TIMEX3 annotation standard. Several types of temporal expressions are extracted: dates, times, durations and temporal sets. The French resources have been evaluated in two different ways: on the French TimeBank corpus, a corpus of newspaper articles in French annotated according to the ISO-TimeML standard, and on a user application for the automatic building of event timelines. Results on the French TimeBank are quite satisfying, as they are comparable to those obtained by HeidelTime in English and Spanish on newswire articles. Concerning the user application, we used two temporal taggers for the preprocessing of the corpus in order to compare their performance; results show that the performance of our application on French documents is better with HeidelTime. The French resources and evaluation scripts are publicly available with HeidelTime.
@InProceedings{Moriceau2014a, 
  title = {{French Resources for Extraction and Normalization of Temporal Expressions with HeidelTime}},
  author = {Véronique Moriceau and Xavier Tannier},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
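HeidelTime itself is a Java tool driven by language-specific rule resources such as the ones the paper contributes. Purely to illustrate the TIMEX3-style normalization concept (not HeidelTime's actual rule format), here is a toy sketch for one French date pattern:

# Toy illustration of TIMEX3-style normalization for one French pattern
# ("12 mars 2014" -> 2014-03-12). Patterns are illustrative assumptions.
import re

MONTHS_FR = {"janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5,
             "juin": 6, "juillet": 7, "août": 8, "septembre": 9,
             "octobre": 10, "novembre": 11, "décembre": 12}
DATE_RE = re.compile(r"(\d{1,2})\s+(" + "|".join(MONTHS_FR) + r")\s+(\d{4})")

def normalize(text):
    """Wrap French full dates in TIMEX3-annotated spans."""
    def repl(m):
        day, month, year = int(m.group(1)), MONTHS_FR[m.group(2)], int(m.group(3))
        value = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<TIMEX3 type="DATE" value="{value}">{m.group(0)}</TIMEX3>'
    return DATE_RE.sub(repl, text)

print(normalize("Le patient a été opéré le 12 mars 2014."))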
Xavier Tannier.
Extracting News Web Page Creation Time with DCTFinder.
in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavík, Iceland, May 2014.
[abstract] [BibTeX] [free copy] [Poster]
Web pages do not offer reliable metadata concerning their creation date and time. However, getting the document creation time is a necessary step for applying temporal normalization systems to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts from its content the title and the creation date of this web page. DCTFinder combines heuristic title detection, supervised learning with Conditional Random Fields (CRFs) for document date extraction, and rule-based creation time recognition. Using such a system enables further deep and efficient temporal analysis of web pages. Evaluation on three corpora of English and French web pages indicates that the tool can extract document creation times with reasonably high accuracy (between 87 and 92%).
DCTFinder is made freely available on http://sourceforge.net/projects/dctfinder/, as well as all resources (vocabulary and annotated documents) built for training and evaluating the system in English and French, and the English trained model itself.
@InProceedings{Tannier2014a, 
  title = {{Extracting News Web Page Creation Time with DCTFinder}},
  author = {Xavier Tannier},
  booktitle = {Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014)}, 
  address = {Reykjavík, Iceland}, 
  year = {2014}, 
  month = may
}
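DCTFinder combines heuristics, CRFs and rules, and is available at the link above; the toy sketch below only illustrates the final rule-based step, recognizing one explicit English date pattern in page text. The pattern is an illustrative assumption, not the tool's rule set.

# Toy rule-based recognition of an explicit creation date in page text.
import re
from datetime import datetime

DATE_RE = re.compile(
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{1,2}),\s+(\d{4})")

def find_dct(page_text):
    """Return the first explicit date as ISO yyyy-mm-dd, or None."""
    m = DATE_RE.search(page_text)
    if not m:
        return None
    dt = datetime.strptime(" ".join(m.groups()), "%B %d %Y")
    return dt.date().isoformat()

print(find_dct("Published on May 12, 2014 by the newsroom."))  # 2014-05-12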
Béatrice Arnulphy, Vincent Claveau, Xavier Tannier, Anne Vilnat.
Techniques d’apprentissage supervisé pour l’extraction d’événements TimeML en anglais et français.
in Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014). Nancy, France, March 2014.
[abstract] [BibTeX] [free copy]
Identifying events from texts is an information extraction task necessary for many NLP applications. Through the TimeML specifications and TempEval challenges, it has received some attention in recent years; yet no reference results are available for French. In this paper, we try to fill this gap by proposing several event extraction systems, combining for instance Conditional Random Fields, language modeling and k-nearest neighbors. These systems are evaluated on French corpora and compared with state-of-the-art methods on English. The very good results obtained on both languages validate our whole approach.

L’identification des événements au sein de textes est une tâche d’extraction d’informations importante et préalable à de nombreuses applications. Au travers des spécifications TimeML et des campagnes TempEval, cette tâche a reçu une attention particulière ces dernières années, mais aucun résultat de référence n’est disponible pour le français. Dans cet article nous tentons de répondre à ce problème en proposant plusieurs systèmes d’extraction, en faisant notamment collaborer champs aléatoires conditionnels, modèles de langues et k-plus-proches-voisins. Ces systèmes sont évalués sur le français et confrontés à l’état-de-l’art sur l’anglais. Les très bons résultats obtenus sur les deux langues valident notre approche.
@InProceedings{Arnulphy2014, 
  title = {{Techniques d’apprentissage supervisé pour l’extraction d’événements TimeML en anglais et français}},
  author = {Béatrice Arnulphy and Vincent Claveau and Xavier Tannier and Anne Vilnat},
  booktitle = {Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014)}, 
  address = {Nancy, France}, 
  year = {2014}, 
  month = mar
}
Clément de Groc, Xavier Tannier.
Apprendre à ordonner la frontière de crawl pour le crawling orienté.
in Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014). Nancy, France, March 2014.
[abstract] [BibTeX] [free copy]
Focused crawling consists in searching and retrieving a set of documents relevant to a specific domain of interest from the Web. Such crawlers prioritize their fetches by relying on a crawl frontier ordering strategy. In this article, we propose to learn this ordering strategy from annotated data using learning-to-rank algorithms. Such an approach allows us to cope with tunneling and to integrate a large number of heterogeneous features to guide the crawler. We describe a novel method to learn a domain-independent ranking function for topical Web crawling. We validate the relevance of our approach on "large" crawls of 40,000 documents on a set of 15 topics from the OpenDirectory, and show that our approach provides an increase in precision (harvest rate) of up to 10% compared to a baseline Shark Search algorithm. Finally, we discuss future directions regarding the application of learning-to-rank to focused Web crawling.

Le crawling orienté consiste à parcourir le Web au travers des hyperliens en orientant son parcours en direction des pages pertinentes. Pour cela, ces crawlers ordonnent leurs téléchargements suivant une stratégie d'ordonnancement. Dans cet article, nous proposons d'apprendre cette fonction d'ordonnancement à partir de données annotées. Une telle approche nous permet notamment d'intégrer un grand nombre de traits hétérogènes et de les combiner. Nous décrivons une méthode permettant d'apprendre une fonction d'ordonnancement indépendante du domaine pour la collecte thématique de documents. Nous évaluons notre approche sur de "longs" crawls de 40 000 documents sur 15 thèmes différents issus de l'OpenDirectory, et montrons que notre méthode permet d'améliorer la précision de près de 10 % par rapport à l'algorithme Shark Search. Enfin, nous discutons les avantages et inconvénients de notre approche, ainsi que les pistes de recherche ouvertes.
@InProceedings{deGroc2014, 
  title = {{Apprendre à ordonner la frontière de crawl pour le crawling orienté}},
  author = {Clément de Groc and Xavier Tannier},
  booktitle = {Actes de la COnférence en Recherche d'Information et ses Applications (CORIA 2014)}, 
  address = {Nancy, France}, 
  year = {2014}, 
  month = mar
}
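As a hedged illustration of learning a frontier-ordering function, the sketch below uses a pairwise (RankSVM-style) reduction with scikit-learn; the URL features, data and linear model are assumptions, not the paper's exact setup.

# Pairwise learning-to-rank for crawl-frontier ordering: difference vectors
# between URL feature vectors are labeled by preference and fed to a linear
# classifier. Features and relevance grades are toy assumptions.
import numpy as np
from sklearn.svm import LinearSVC

# Each frontier URL: [anchor-text relevance, parent-page relevance, depth]
X = np.array([[0.9, 0.8, 1], [0.1, 0.2, 3], [0.7, 0.4, 2], [0.2, 0.1, 4]])
relevance = np.array([2, 0, 1, 0])  # graded annotation

pairs, prefs = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if relevance[i] != relevance[j]:
            pairs.append(X[i] - X[j])
            prefs.append(1 if relevance[i] > relevance[j] else -1)

ranker = LinearSVC().fit(np.array(pairs), np.array(prefs))
# Score frontier URLs: higher score = fetch earlier
print(X @ ranker.coef_.ravel())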
Xavier Tannier, Véronique Moriceau.
Building Event Threads out of Multiple News Articles.
in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013). Seattle, USA, October 2013.
[abstract] [BibTeX] [Poster] [ACL Anthology]
We present an approach for building multidocument event threads from a large corpus of newswire articles. An event thread is basically a succession of events belonging to the same story. It helps the reader to contextualize the information contained in a single article, by navigating backward or forward in the thread from this article. A specific effort is also made on the detection of reactions to a particular event.
In order to build these event threads, we use a cascade of classifiers and other modules, taking advantage of the redundancy of information in the newswire corpus.
We also share interesting comments concerning our manual annotation procedure for building a training and testing set.
@InProceedings{Tannier2013b, 
  title = {{Building Event Threads out of Multiple News Articles}},
  author = {Xavier Tannier and Véronique Moriceau},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)}, 
  address = {Seattle, USA}, 
  year = {2013}, 
  month = oct
}
Cyril Grouin, Natalia Grabar, Thierry Hamon, Sophie Rosset, Xavier Tannier, Pierre Zweigenbaum.
Eventual situations for timeline extraction from clinical reports.